Commit Graph

160 Commits

Author SHA1 Message Date
Yanlong Wang
908157b61e
fix: pdf cache 2024-05-31 19:05:17 +08:00
Yanlong Wang
9c60b4b93d
fix: setup expire for pdf caches 2024-05-31 18:36:23 +08:00
Yanlong Wang
1ba21da0c5
fix: pdf cache 2024-05-31 18:26:05 +08:00
Yanlong Wang
fd0b77285f
fix: firebase fail to save large docs 2024-05-31 18:16:37 +08:00
Yanlong Wang
964b66b6ab
fix: data crunching import 2024-05-31 17:32:16 +08:00
Yanlong Wang
9ac40606d5
fix: bulk fix multiple issues 2024-05-31 17:30:57 +08:00
Yanlong Wang
0c15946874
fix: trimstart url 2024-05-30 20:29:31 +08:00
Yanlong Wang
33e14e5404
feat: extract text from pdf (#70)
* feat: pdf

* fix

* fix
2024-05-30 20:21:33 +08:00
yanlong.wang
7c5712363c
feat: allow custom rate limit per uid 2024-05-23 15:36:09 +08:00
yanlong.wang
8eee95119d
feat: index brief in JSON format 2024-05-23 12:06:07 +08:00
yanlong.wang
4f37de24f6
fix: docs 2024-05-21 17:35:16 +08:00
Yanlong Wang
a8e0628460
feat: links and images summary (#63)
* wip: dedicated link and image summary

* fix

* fix

* fix

* fix: docs

* fix

* fix

* fix
2024-05-21 17:34:19 +08:00
Yanlong Wang
df71c9a534
fix: stop using pool 2024-05-20 01:12:22 +08:00
Yanlong Wang
4077fa7040
fix: geoip encoding 2024-05-17 09:31:22 +08:00
Yanlong Wang
2941be6096
fix: potential unencoded query 2024-05-17 09:15:37 +08:00
Yanlong Wang
ed9e9f43cf
fix: block rough requests 2024-05-16 20:22:26 +08:00
yanlong.wang
8ec8c1e718
fix: logging for search error 2024-05-16 19:01:30 +08:00
yanlong.wang
e0e37ad4d7
fix: potential chargeAmount mismatch 2024-05-16 18:43:41 +08:00
yanlong.wang
8b0916f858
fix: race condition while logging chargeAmount 2024-05-16 18:26:18 +08:00
yanlong.wang
6f4819bc49
chore: tweak deployment 2024-05-16 17:46:53 +08:00
yanlong.wang
322cb86f21
fix: on no results 2024-05-16 17:30:47 +08:00
yanlong.wang
e2698b48bd
fix: rate limit tag for search 2024-05-16 16:58:10 +08:00
yanlong.wang
72e1c46a6c
fix: improve search responsiveness 2024-05-16 15:47:49 +08:00
Yanlong Wang
0583645613
fix: noCache in search 2024-05-16 00:42:30 +08:00
Yanlong Wang
4556954d17
fix: image url 2024-05-16 00:39:24 +08:00
Yanlong Wang
6f65083f8d
feat: control cache tolerance and select target using headers 2024-05-16 00:10:20 +08:00
yanlong.wang
77fc500f41
fix: allow x-return-format header alias 2024-05-15 12:24:46 +08:00
Yanlong Wang
445624c405
fix: early return for search 2024-05-15 08:47:16 +08:00
Yanlong Wang
1cf8e83857
fix: add cache tolerance 2024-05-15 08:06:35 +08:00
Yanlong Wang
d100c3fc5f
fix: search result cache save 2024-05-14 19:57:49 +08:00
Yanlong Wang
ec4ce4fef3
chore: update rate limits 2024-05-14 19:44:35 +08:00
Yanlong Wang
2e3c217479
feat: web search (#57) 2024-05-14 19:39:43 +08:00
Yanlong Wang
f171e54ac9
fix: log charge amount 2024-05-14 17:25:59 +08:00
yanlong.wang
ffc6899acd
chore: reduce resource 2024-05-13 18:35:11 +08:00
yanlong.wang
e417cd8a53
fix: tidyMarkdown feature in turndown rules 2024-05-09 15:15:15 +08:00
Yanlong Wang
36bf5d96b5
fix: remove tidyMarkdown at all 2024-05-09 11:33:56 +08:00
Yanlong Wang
59f807cb7c
fix: tidyMarkdown 2024-05-09 11:32:26 +08:00
Yanlong Wang
6b6774f43b
fix: tidyMarkdown 2024-05-09 11:25:51 +08:00
Yanlong Wang
4bee36ed4a
fix: patch tidyMarkdown 2024-05-09 11:06:20 +08:00
Yanlong Wang
de22127d2f
fix: leak of crippled listeners 2024-05-08 19:51:55 +08:00
Yanlong Wang
62dc75f78e
fix: consider image data-src and make generated alt text optional (#50)
* fix: image src and alt

* fix

* docs: doc about x-with-generated-alt

* fix: deps
2024-05-08 18:29:11 +08:00
Yanlong Wang
8cfd0d67dc
feat: jina paywall (#49)
* feat: integrate with jina embeddings paywall
2024-05-08 18:25:26 +08:00
Yanlong Wang
2e025d10cf
fix: the complex regexp caused node.js process to hang
Co-authored-by: Claude 3 opus
2024-05-05 16:29:39 +08:00
Yanlong Wang
fef1d0faf1
bump: deps 2024-05-05 10:54:11 +08:00
Yanlong Wang
a0d1a7234b
chore: tweak health check 2024-05-02 08:39:54 +08:00
Yanlong Wang
9e02080103
fix: error on browser crashes 2024-05-02 03:23:57 +08:00
Yanlong Wang
55b954ffeb
fix: tweak health check 2024-04-30 18:56:46 +08:00
Yanlong Wang
528b3e5fed
fix: add health check to detect puppeteer stall 2024-04-30 18:30:31 +08:00
Yanlong Wang
ae29055142
chore: tweaks 2024-04-29 20:12:11 +08:00
yanlong.wang
867636d037
fix: apply rate limit to 100qpm per IP 2024-04-29 18:54:51 +08:00
yanlong.wang
15606f38d7
fix: on null element 2024-04-29 17:28:07 +08:00
yanlong.wang
53a4361c23
fix: block firebase runtime intrusion 2024-04-29 17:21:34 +08:00
yanlong.wang
059c8aa61e
fix: remove exposed function before cleanup 2024-04-29 15:51:23 +08:00
yanlong.wang
bfc6d678d8
fix: split report handler from other page preps 2024-04-29 15:19:05 +08:00
Yanlong Wang
036f6dc776
chore: tweak runtime config 2024-04-29 09:49:29 +08:00
Yanlong Wang
6ac2863e89
bump: deps 2024-04-28 22:28:24 +08:00
yanlong.wang
a6a5b7c530
fix: respond with markdown 2024-04-25 18:58:42 +08:00
yanlong.wang
69231ad59e
feat: full markdown mode 2024-04-25 18:21:04 +08:00
yanlong.wang
adc05fe20a
fix 2024-04-25 16:09:23 +08:00
yanlong.wang
39a446f5e7
fix: root content 2024-04-25 15:43:17 +08:00
yanlong.wang
f1016649ac
fix: firebase limit on document size causing cache failures 2024-04-25 12:24:05 +08:00
yanlong.wang
94a72052f4
fix: reduce frequency of screenshot if possible 2024-04-24 19:43:24 +08:00
Yanlong Wang
7ee2c327a3
refactor: reorganize features (#37)
* wip

* fix

* wip

* cleanup

* fix

* fix

* cache: may rescue using stale cache

* fix: target 384mb ram per page

* fix: log about pool size

* fix

* clean

* fix: cache and snapshot reporting
2024-04-24 19:21:12 +08:00
dependabot[bot]
e36d3b0f24
chore(deps): bump protobufjs and firebase-admin in /backend/functions (#35)
Bumps [protobufjs](https://github.com/protobufjs/protobuf.js) to 7.2.6 and updates ancestor dependency [firebase-admin](https://github.com/firebase/firebase-admin-node). These dependencies need to be updated together.


Updates `protobufjs` from 7.2.4 to 7.2.6
- [Release notes](https://github.com/protobufjs/protobuf.js/releases)
- [Changelog](https://github.com/protobufjs/protobuf.js/blob/master/CHANGELOG.md)
- [Commits](https://github.com/protobufjs/protobuf.js/compare/protobufjs-v7.2.4...protobufjs-v7.2.6)

Updates `firebase-admin` from 11.11.1 to 12.1.0
- [Release notes](https://github.com/firebase/firebase-admin-node/releases)
- [Commits](https://github.com/firebase/firebase-admin-node/compare/v11.11.1...v12.1.0)

---
updated-dependencies:
- dependency-name: protobufjs
  dependency-type: indirect
- dependency-name: firebase-admin
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-04-24 16:37:38 +08:00
Yanlong Wang
4b208f44b5
fix: process not quitting on errors 2024-04-21 10:17:05 +08:00
Yanlong Wang
5d255dda3b
chore: update deps 2024-04-19 09:30:19 +08:00
Charuka Samarakoon
d47310a6f7
fix: allocating incorrect max value due to missing parentheses (#26) 2024-04-19 09:01:23 +08:00
yanlong.wang
d4ca381c38
fix: explicitly reject non http protocols 2024-04-18 15:35:06 +08:00
yanlong.wang
abc817e960
feat: block media resources to improve speed 2024-04-18 15:06:28 +08:00
yanlong.wang
cbc13ecbbd
fix: catch turndown errors 2024-04-18 13:51:54 +08:00
yanlong.wang
0975b35ca2
chore: turn up concurrency a bit based on analysis 2024-04-18 11:53:55 +08:00
yanlong.wang
a211366501
fix: expose publishedTime if possible 2024-04-17 12:36:36 +08:00
Yanlong Wang
6e36f0a447
fix: url wrong normalization 2024-04-17 09:55:41 +08:00
Yanlong Wang
781b835466
fix: keep url details 2024-04-17 09:48:26 +08:00
Yanlong Wang
11a5a90611
fix: favor nominal url over real url 2024-04-17 09:30:49 +08:00
Yanlong Wang
bda7e76e50
chore: increase max instances to target 10k concurrent requests 2024-04-17 09:22:26 +08:00
Yanlong Wang
50ed9cc248
feat: fallback to google archive (#16)
* feat: fallback to google archive

* fix
2024-04-16 09:17:45 -07:00
yanlong.wang
8a2b095bd7
fix: give expireAt for image cache 2024-04-16 15:46:05 +08:00
Han Xiao
b3fb4c5c57
feat: add image captioning (#6)
* Fix contentText assignment in CrawlerHost class

* fix: recover vscode configurations

* feat: add image captioning

* feat: add image captioning

* clean: vscode config

* chore: fix some ts warnings

* feat: auto alt text

* fix

* chore: improve prompt

* clean: unused config

* fix: failure condition

* fix: remove redundant code

* fix: catch parse error

* fix: catch parse error

---------

Co-authored-by: Yanlong Wang <yanlong.wang@naiver.org>
2024-04-15 20:51:31 -07:00
Han Xiao
18373626b2
fix: catch parse error 2024-04-15 19:27:40 -07:00
Han Xiao
9b190127aa
fix: clean broken markdown 2024-04-13 21:40:51 -07:00
Han Xiao
ef23d810f8
feat: clean broken markdown 2024-04-13 19:21:35 -07:00
Han Xiao
8378cb06ee
chore: rename url2text to reader 2024-04-13 12:25:42 -07:00
Han Xiao
e050a5bffa
Merge remote-tracking branch 'origin/main' 2024-04-13 11:42:21 -07:00
Han Xiao
8e241c7f5a
chore: rename url2text to reader 2024-04-13 11:42:15 -07:00
Yanlong Wang
dbeb69582a
puppeteer stealth 2024-04-13 22:27:50 +08:00
Yanlong Wang
33d7cfc41c
fix 2024-04-13 08:25:52 +08:00
Yanlong Wang
95799988da
fix: use gpt bot UA 2024-04-13 08:13:50 +08:00
Yanlong Wang
950338261a
fix 2024-04-13 08:07:55 +08:00
Yanlong Wang
5199b00eeb
fix 2024-04-13 08:04:07 +08:00
Yanlong Wang
5ed3f90b9c
fix 2024-04-13 07:53:58 +08:00
Yanlong Wang
be7eeec11b
fix 2024-04-12 14:17:30 +08:00
Yanlong Wang
2da1b7f3a5
fix 2024-04-12 14:17:04 +08:00
Yanlong Wang
fdd8a8aa8d
fix 2024-04-12 12:27:42 +08:00
Yanlong Wang
78c8444096
fix 2024-04-12 10:59:37 +08:00
Yanlong Wang
629ab270be
fix 2024-04-12 10:24:56 +08:00
Yanlong Wang
664d4b1c9f
fix 2024-04-12 09:25:19 +08:00
Han Xiao
2dc0850c8c
chore: rename url2text to reader 2024-04-11 15:44:12 -07:00
Han Xiao
c1743db305
chore: clean code 2024-04-11 15:29:57 -07:00
yanlong.wang
b29a569d39
fix 2024-04-11 19:20:17 +08:00