# Reader
Your LLMs deserve better input.

Reader converts any URL into **LLM-friendly** input: simply prepend `https://r.jina.ai/` to it. Get improved output for your agent and RAG systems at no cost.

- Live demo: https://jina.ai/reader
- Or just visit these URLs and see for yourself: https://r.jina.ai/https://github.com/jina-ai/reader and https://r.jina.ai/https://x.com/elonmusk

> Feel free to use https://r.jina.ai/* in production. It is free, stable and scalable. We actively maintain it as one of the core products of Jina AI.

[![banner-reader-api.png](https://jina.ai/banner-reader-api.png)](https://jina.ai/reader)
## Updates
- **2024-04-15**: Reader now supports image reading! It captions all images at the specified URL and adds `Image [idx]: [caption]` as alt text to any image that initially lacks it. This lets downstream LLMs use the images when reasoning, summarizing, etc. [See example here](https://x.com/JinaAI_/status/1780094402071023926).
## Usage
### Standard mode
Simply prepend `https://r.jina.ai/` to any URL. For example, to convert the URL `https://en.wikipedia.org/wiki/Artificial_intelligence` to an LLM-friendly input, use the following URL:

https://r.jina.ai/https://en.wikipedia.org/wiki/Artificial_intelligence
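The same prefixing can be done from code. The following is a minimal Python sketch, not an official client: the helper names `reader_url` and `fetch_llm_friendly`, the `User-Agent` value, and the assumption that the response body is UTF-8 text are all illustrative choices rather than anything defined by Reader.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"  # the prefix described in this section


def reader_url(target_url: str) -> str:
    """Return the Reader URL for target_url by prepending the prefix."""
    return READER_PREFIX + target_url


def fetch_llm_friendly(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for target_url (performs a network request)."""
    # The User-Agent value is a hypothetical placeholder; any descriptive string works.
    request = Request(
        reader_url(target_url),
        headers={"User-Agent": "reader-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        # Assumption: the body is UTF-8 encoded text.
        return response.read().decode("utf-8")
```

Calling `fetch_llm_friendly("https://en.wikipedia.org/wiki/Artificial_intelligence")` should then return the converted page as a string that can be placed in a prompt.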
### Streaming mode
Use the `Accept` header to control the streaming behavior:

> Note: if you run the example below and see a single response instead of streaming output, it means someone else requested the same URL within the last 5 minutes and the result is already cached, so the server simply returns it instantly. Try a different URL and you will see the streaming output.

```bash
curl -H "Accept: text/event-stream" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```

If your downstream LLM/agent system requires immediate content delivery, or needs to process data in chunks to interleave I/O with LLM processing time, use streaming mode. This allows quicker access to the content and more efficient handling of the data:

```text
Reader API:  streamContent1 ----> streamContent2 ----> streamContent3 ---> ...
                   |                    |                    |
                   v                    |                    |
Your LLM:    LLM(streamContent1)        |                    |
                                        v                    |
                                  LLM(streamContent2)        |
                                                             v
                                                       LLM(streamContent3)
```

Streaming mode is also useful when the target page takes a long time to render. If you find that standard mode gives you incomplete content, try streaming mode.
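
To consume streaming mode from code, a client has to read the `text/event-stream` response incrementally. The Python sketch below is one way to do that under stated assumptions: `iter_sse_data` and `stream_reader` are hypothetical helper names, the parser handles only the basic `data:` framing of server-sent events, and what each payload contains (for example, a partial or a progressively more complete version of the page) is not specified in this README, so the payload is treated as an opaque string.

```python
from typing import Iterable, Iterator
from urllib.request import Request, urlopen


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event found in `lines`.

    Basic text/event-stream framing only: `data:` lines accumulate and a
    blank line ends the event. Other fields and comment lines are ignored,
    and an unterminated final event is dropped.
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)


def stream_reader(target_url: str, timeout: float = 60.0) -> Iterator[str]:
    """Yield event payloads from Reader in streaming mode (network request)."""
    request = Request(
        "https://r.jina.ai/" + target_url,
        headers={
            "Accept": "text/event-stream",
            "User-Agent": "reader-example/0.1",  # hypothetical placeholder
        },
    )
    with urlopen(request, timeout=timeout) as response:
        # The response object is iterable line by line and yields bytes.
        yield from iter_sse_data(line.decode("utf-8") for line in response)
```

Each value produced by `stream_reader(url)` can be handed to the LLM as soon as it arrives, which is the interleaving shown in the diagram above. Inspect a real response before relying on the payload format.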
### JSON mode
This is still very early, and the result is not yet a really "useful" JSON. It contains only three fields: `url`, `title` and `content`. Nonetheless, you can use the `Accept` header to control the output format:

```bash
curl -H "Accept: application/json" https://r.jina.ai/https://en.m.wikipedia.org/wiki/Main_Page
```
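
The JSON response can be parsed with any standard JSON library. Below is a small Python sketch, assuming that `url`, `title` and `content` are top-level keys of the returned object (the text above names the fields but not their nesting). The helper names `parse_reader_json` and `fetch_reader_json` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

FIELDS = ("url", "title", "content")  # the three fields named above


def parse_reader_json(body: str) -> dict:
    """Extract the three documented fields from a JSON-mode response body.

    Assumption: the fields are top-level keys. Missing keys become "".
    """
    payload = json.loads(body)
    return {key: payload.get(key, "") for key in FIELDS}


def fetch_reader_json(target_url: str, timeout: float = 30.0) -> dict:
    """Request target_url through Reader in JSON mode (network request)."""
    request = Request(
        "https://r.jina.ai/" + target_url,
        headers={
            "Accept": "application/json",
            "User-Agent": "reader-example/0.1",  # hypothetical placeholder
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return parse_reader_json(response.read().decode("utf-8"))
```

Because the format is described as early, treat the field layout as subject to change and check a live response before depending on it.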
## Install
You will need the following tools to run the project:
- Node v18 (the build fails on Node versions above 18)
- Firebase CLI (`npm install -g firebase-tools`)

For the backend, clone the repository, go to the `backend/functions` directory, and install the npm dependencies.

```bash
git clone git@github.com:jina-ai/reader.git
# git clone creates a "reader" directory, so enter it first
cd reader/backend/functions
npm install
```

## What is the `thinapps-shared` submodule?
You might notice a reference to the `thinapps-shared` submodule, an internal package we use to share code across our products. It is not open source and is not integral to Reader's functionality; it mainly helps with decorators, logging, secrets management, etc. Feel free to ignore it for now.

That said, this is *the single codebase* behind `https://r.jina.ai`, so every time we commit here, we deploy the new version to `https://r.jina.ai`.
## Having trouble on some websites?
Please raise an issue with the URL you are having trouble with. We will look into it and try to fix it.
## License
Reader is backed by [Jina AI](https://jina.ai) and licensed under [Apache-2.0](./LICENSE).