
# Reader

Your LLMs deserve better input.

Reader converts any URL to an **LLM-friendly** input with a simple prefix `https://r.jina.ai/`. Get improved output for your agent and RAG systems at no cost.

- Live demo: https://jina.ai/reader
- Or just visit these URLs and see for yourself: https://r.jina.ai/https://github.com/jina-ai/reader, https://r.jina.ai/https://x.com/elonmusk

> Feel free to use https://r.jina.ai/* in production. It is free, stable and scalable. We are maintaining it actively as one of the core products of Jina AI.

[![banner-reader-api.png](https://jina.ai/banner-reader-api.png)](https://jina.ai/reader)


## Updates

- **2024-04-15**: Reader now supports image reading! It captions all images at the specified URL and adds `Image [idx]: [caption]` as an alt tag (if they initially lack one). This enables downstream LLMs to interact with the images in reasoning, summarizing etc. [See example here](https://x.com/JinaAI_/status/1780094402071023926).

## Usage

### Standard mode

Simply prepend `https://r.jina.ai/` to any URL. For example, to convert the URL `https://en.wikipedia.org/wiki/Artificial_intelligence` to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
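If you call Reader from code rather than a browser, the prefixing step can be sketched in Python as follows. This is a minimal sketch using only the standard library; the helper names `to_reader_url` and `fetch_llm_friendly` are hypothetical and not part of Reader itself.

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(target_url: str) -> str:
    """Build the Reader URL by prepending the prefix to the target URL."""
    return READER_PREFIX + target_url


def fetch_llm_friendly(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for target_url in standard mode.

    Hypothetical helper: it assumes the response body is UTF-8 text.
    """
    with urlopen(to_reader_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(to_reader_url("https://en.wikipedia.org/wiki/Artificial_intelligence"))
```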

### Streaming mode

Streaming mode is useful when you find that the standard mode provides an incomplete result. This is because streaming mode will wait a bit longer until the page is fully rendered.

Use the `Accept` header to enable streaming:

```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

The data comes in a stream; each subsequent chunk contains more complete information. The last chunk should provide the most complete and final result.

For example, compare these two curl commands below:
```bash
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```

> Note: `-H 'x-no-cache: true'` is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing times. This allows for quicker access and more efficient data handling:

```text
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ... 
                          |                    |                     |
                          v                    |                     |
Your LLM:                 LLM(streamContent1)  |                     |
                                               v                     |
                                               LLM(streamContent2)   |
                                                                     v
                                                                     LLM(streamContent3)
```

Note that in terms of completeness: `... > streamContent3 > streamContent2 > streamContent1`.
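A client that only wants the final, most complete chunk can read the event stream and keep the last payload. The sketch below parses the standard server-sent events framing (`data:` lines, with a blank line ending each event). It assumes each event carries one chunk in its `data` field. The exact payload format is not documented here, so the payload is treated as an opaque string. The function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional
from urllib.request import Request, urlopen


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event found in lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.
    if buffer:
        # Lenient choice: emit a trailing event even without a final blank line.
        yield "\n".join(buffer)


def last_chunk(lines: Iterable[str]) -> Optional[str]:
    """Return the payload of the last event, or None if there were no events."""
    final = None
    for payload in iter_sse_data(lines):
        final = payload
    return final


def stream_reader(reader_url: str, timeout: float = 60.0) -> Optional[str]:
    """Request reader_url in streaming mode and return the final chunk."""
    request = Request(reader_url, headers={"Accept": "text/event-stream"})
    with urlopen(request, timeout=timeout) as response:
        return last_chunk(line.decode("utf-8") for line in response)
```

To interleave I/O and LLM processing as in the diagram, iterate over `iter_sse_data` directly and hand each payload to your LLM as it arrives instead of waiting for `last_chunk`.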

### JSON mode

JSON mode is still at an early stage, and the result is not yet a particularly "useful" JSON: it contains only three fields, `url`, `title` and `content`. Nonetheless, you can use the `Accept` header to request JSON output:
```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
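On the client side, the JSON response can be mapped onto a small typed record. This is a minimal sketch that assumes the three fields `url`, `title` and `content` sit at the top level of the response object. `ReaderResult` and `parse_reader_json` are hypothetical names, and the layout may change while JSON mode is still early.

```python
import json
from dataclasses import dataclass


@dataclass
class ReaderResult:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderResult:
    """Parse a JSON-mode response body into a ReaderResult.

    Raises KeyError if one of the three expected fields is missing.
    """
    payload = json.loads(body)
    return ReaderResult(
        url=payload["url"],
        title=payload["title"],
        content=payload["content"],
    )


if __name__ == "__main__":
    # Illustrative body only, not a real Reader response.
    example = '{"url": "https://example.com", "title": "Example", "content": "Hello"}'
    print(parse_reader_json(example))
```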

## Install

You will need the following tools to run the project:
- Node v18 (The build fails for Node version >18)
- Firebase CLI (`npm install -g firebase-tools`)

For the backend, clone the repository, go to the `backend/functions` directory, and install the npm dependencies.

```bash
git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install
```

## What is the `thinapps-shared` submodule?

You might notice a reference to the `thinapps-shared` submodule, an internal package we use to share code across our products. It is not open-sourced and is not integral to Reader's functionality; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is *the single codebase* behind `https://r.jina.ai`, so every time we commit here, we deploy the new version to `https://r.jina.ai`.

## Having trouble on some websites?
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.

## License
Reader is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).