# Reader
Your LLMs deserve better input.

Reader converts any URL to an **LLM-friendly** input with a simple prefix `https://r.jina.ai/`. Get improved output for your agent and RAG systems at no cost.
- Live demo: https://jina.ai/reader
- Or just visit https://r.jina.ai/https://github.com/jina-ai/reader or https://r.jina.ai/https://x.com/elonmusk and see for yourself.

> Feel free to use https://r.jina.ai/* in production. It is free, stable, and scalable. We actively maintain it as one of the core products of Jina AI.

[![banner-reader-api.png](https://jina.ai/banner-reader-api.png)](https://jina.ai/reader)
## Updates
- **2024-04-15**: Reader now supports image reading! It captions every image at the specified URL and adds `Image [idx]: [caption]` as the alt text of any image that lacks one. This lets downstream LLMs use the images when reasoning, summarizing, etc. [See example here](https://x.com/JinaAI_/status/1780094402071023926).
## Usage
### Standard mode
Simply prepend `https://r.jina.ai/` to any URL. For example, to convert the URL `https://en.wikipedia.org/wiki/Artificial_intelligence` to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
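
If you call Reader from code rather than from a browser, the prefixing step can be sketched in Python as below. This is a minimal illustration, not an official client: the names `reader_url` and `fetch_llm_input`, the scheme check, and the UTF-8 decoding are assumptions made for the example.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Return the Reader URL for an absolute http(s) URL (hypothetical helper)."""
    parts = urlsplit(target)
    # The check below is an extra safeguard for this sketch, not a Reader rule.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


def fetch_llm_input(target: str, timeout: float = 30.0) -> str:
    """Fetch the converted page text. Performs a network request when called."""
    with urlopen(reader_url(target), timeout=timeout) as response:
        # Assumes the response body is UTF-8 text.
        return response.read().decode("utf-8")
```

Calling `fetch_llm_input("https://en.wikipedia.org/wiki/Artificial_intelligence")` requests the same URL as shown above.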

### Streaming mode

Streaming mode is useful when standard mode returns an incomplete result, because streaming mode waits a bit longer, until the page is fully rendered. Use the `Accept` header to enable streaming:
```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

The data comes in a stream; each subsequent chunk contains more complete information. The last chunk should provide the most complete and final result.

For example, compare the following two curl commands:
```bash
curl -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
curl -H "Accept: text/event-stream" -H 'x-no-cache: true' https://r.jina.ai/https://access.redhat.com/security/cve/CVE-2023-45853
```
> Note: `-H 'x-no-cache: true'` is used only for demonstration purposes to bypass the cache.

Streaming mode is also useful if your downstream LLM/agent system requires immediate content delivery or needs to process data in chunks to interleave I/O and LLM processing time. This allows quicker access and more efficient data handling:
```text
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ...
                    |                    |                    |
                    v                    |                    |
Your LLM:           LLM(streamContent1)  |                    |
                                         v                    |
                                         LLM(streamContent2)  |
                                                              v
                                                              LLM(streamContent3)
```
Note that in terms of completeness: `... > streamContent3 > streamContent2 > streamContent1`.
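
The chunk-by-chunk behavior described above can be consumed with a small parser. The sketch below assumes the stream follows the usual server-sent-events framing (`data:` lines, with a blank line ending each event) and treats each payload as an opaque string, since the payload format is not specified here. The names `iter_sse_data` and `final_chunk` are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event, in arrival order."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.
    if buffer:
        # Lenient choice for this sketch: flush an event left unterminated.
        yield "\n".join(buffer)


def final_chunk(lines: Iterable[str]) -> Optional[str]:
    """Return the last payload, which should be the most complete one."""
    last: Optional[str] = None
    for payload in iter_sse_data(lines):
        last = payload
    return last
```

To interleave I/O and LLM processing, iterate over `iter_sse_data(...)` and hand each payload to your model as it arrives. If you only need the final result, call `final_chunk(...)`. With `urllib.request.urlopen`, the response object yields lines as bytes, so decode each line before passing it in.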
### JSON mode
JSON mode is still at a very early stage, and the result is not yet a particularly useful JSON: it contains only three fields, `url`, `title`, and `content`. Nonetheless, you can use the `Accept` header to request it:
```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
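
Once you have the JSON response body, it can be turned into a typed record. The sketch below assumes that `url`, `title`, and `content` are top-level keys of a single JSON object; the README does not spell out the exact shape, so adjust as needed. `ReaderResult` and `parse_reader_json` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ReaderResult:
    url: str
    title: str
    content: str


def parse_reader_json(body: str) -> ReaderResult:
    """Parse a JSON-mode response body into a ReaderResult."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    # Assumption: the three documented fields sit at the top level.
    missing = [key for key in ("url", "title", "content") if key not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return ReaderResult(
        url=payload["url"],
        title=payload["title"],
        content=payload["content"],
    )
```

Failing loudly on a missing field is a deliberate choice here: because the format is still early, a changed response shape is easier to notice as an error than as silently empty text.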
## Install
You will need the following tools to run the project:
- Node v18 (The build fails for Node version >18)
- Firebase CLI (`npm install -g firebase-tools`)

For the backend, go to the `backend/functions` directory and install the npm dependencies.
```bash
git clone git@github.com:jina-ai/reader.git
cd reader/backend/functions
npm install
```

## What is the `thinapps-shared` submodule?

You might notice a reference to the `thinapps-shared` submodule, an internal package we use to share code across our products. It is not open-sourced and is not integral to Reader's functionality; it mainly provides decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is *the single codebase* behind `https://r.jina.ai`, so every time we commit here, we deploy the new version to `https://r.jina.ai`.
## Having trouble on some websites?
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.
## License
Reader is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).