Commit Graph

994 Commits

Author SHA1 Message Date
Rafael Miller
36e4b2cf49
Update .env.example 2024-08-12 10:37:00 -03:00
Quan Ming
a96ad4b0e2 Update redis url to use comment 2024-08-10 12:33:26 +08:00
Nicolas
e28c415cf4 Nick: 2024-08-09 14:07:46 -04:00
Gergo Moricz
5a778f2c22 fix(js-sdk): add type metadata to exports
2024-08-09 20:05:36 +02:00
Rafael Miller
6a78f6fe78
Merge pull request #497 from KentHsu/feat/add-go-sdk
[Feat] Add Go SDK implementation
2024-08-09 14:58:20 -03:00
rafaelsideguide
0591000b64 bugfix includes excludes 2024-08-09 14:30:41 -03:00
Kent (Chia-Hao), Hsu
1fda882983
Merge branch 'mendableai:main' into feat/add-go-sdk 2024-08-10 00:46:15 +08:00
Quan Ming
0221872a70 Update redis urls in example .env 2024-08-10 00:40:11 +08:00
rafaelsideguide
b802ea02a1 small improvements
- wait for getting results on crawl: sometimes crawl takes a second to save the data on the db and this causes response.data to be empty
- added timeout value to test script
- increased http client timeout (llm extract was failing on e2e tests)
- fixed env path on test script
2024-08-09 11:13:14 -03:00
Nicolas
f1f5605010 Update website_params.ts 2024-08-08 12:31:58 -04:00
Nicolas
b0abad07da
Merge pull request #496 from tak-s/improve-logging-level
Improve logs
2024-08-07 22:01:12 -04:00
Gergo Moricz
920b7f2f44 fix(runWebScraper): don't filter empty docs 2024-08-07 21:00:22 +02:00
Gergo Moricz
55ec96c23f fix(queue-worker): bad job lock extension time 2024-08-07 20:24:16 +02:00
Gergo Moricz
ab7a35c581 fix(queue-worker): log lock extensions 2024-08-07 19:49:48 +02:00
Gergo Moricz
a1c2ee5aa9 fix: always complete job, no try 2024-08-07 19:39:09 +02:00
Gergo Moricz
191dfbd9ca fix: move to completed in one place 2024-08-07 18:49:58 +02:00
Nicolas
457c082ba1 Nick: fixed tests 2024-08-07 11:08:53 -04:00
Nicolas
8a992b1596 Merge branch 'main' of https://github.com/mendableai/firecrawl
2024-08-07 10:40:06 -04:00
Nicolas
b12e1157cc Nick: v35 bump 2024-08-07 10:40:00 -04:00
Gergő Móricz
5fc7fcb77c
Merge branch 'main' into feat/queue-scrapes 2024-08-07 16:35:44 +02:00
Gergo Moricz
fe9fdb578b revert bad hotfixes 2024-08-07 16:34:25 +02:00
Gergo Moricz
b7c01dcb9b fix(webScraperQueue): reduce retries to 2 2024-08-07 16:31:50 +02:00
Gergo Moricz
cdf7bad5b4 fix(runWebScraper): don't move to completed 2024-08-07 15:20:56 +02:00
Gergo Moricz
9df8719efa fix(queue-worker): raise queue log level to info 2024-08-07 14:56:04 +02:00
Gergo Moricz
7bb922071c fix(queue-worker): manually renew lock (testing) 2024-08-07 14:35:20 +02:00
Gergo Moricz
8216266d16 fix(scrape_log): display error properly 2024-08-07 14:19:20 +02:00
Gergo Moricz
2e2e80d679 fix(scrape-events): updateScrapeResult fix 2024-08-07 14:17:50 +02:00
Gergo Moricz
b5ec47fd96 fix(runWebScraper): don't fetch next job 2024-08-07 13:53:04 +02:00
Gergo Moricz
020a5efdb7 Revert "Revert "Merge pull request #432 from mendableai/mog/js-sdk-cjs""
This reverts commit 5da4472842.
2024-08-07 01:27:26 +02:00
Gergő Móricz
7380d7799f
Merge branch 'main' into mog/js-sdk-cjs 2024-08-07 01:12:36 +02:00
Gergo Moricz
5f7724205f fix(js-sdk): re-add types 2024-08-07 01:06:21 +02:00
Nicolas
f294d3922c Nick: revert 2024-08-06 18:44:45 -04:00
Nicolas
5da4472842 Revert "Merge pull request #432 from mendableai/mog/js-sdk-cjs"
This reverts commit bb90e03dea, reversing
changes made to 3321ca9398.
2024-08-06 18:41:06 -04:00
Nicolas
a67a5c04c9 Revert "Merge pull request #432 from mendableai/mog/js-sdk-cjs"
This reverts commit bb90e03dea, reversing
changes made to 3321ca9398.
2024-08-06 18:02:56 -04:00
Nicolas
bb90e03dea
Merge pull request #432 from mendableai/mog/js-sdk-cjs
fix(js-sdk): build both CommonJS and ESM versions
2024-08-06 17:38:57 -04:00
rafaelsideguide
6cdf4c68ec wip: map, crawl, scrape mockups 2024-08-06 15:24:45 -03:00
Nicolas
3321ca9398
Merge pull request #504 from mendableai/feat/fullpage-screenshot
[Feat] Added fullpagescreenshot capabilities
2024-08-06 13:52:29 -04:00
Gergo Moricz
b60ee30dba fix(single_url): accept 500 2024-08-06 18:00:56 +02:00
Gergo Moricz
06751a8e21 fix(crawl-status): missing partial data after cancel 2024-08-06 17:31:20 +02:00
Gergo Moricz
810b98ec38 fix(scrape): fix timeout error code 2024-08-06 17:30:01 +02:00
Gergo Moricz
3ae95a2740 fix(scrape): consider timeout property 2024-08-06 17:25:58 +02:00
Gergo Moricz
8566ece700 fix(scrape): pass extractorOptions 2024-08-06 17:15:19 +02:00
Gergo Moricz
8e0aa69603 fix(crawl-status): partial_data 2024-08-06 17:06:21 +02:00
Gergo Moricz
1ab119c874 fix(scrape): don't double-bill for scrape 2024-08-06 16:57:23 +02:00
Gergo Moricz
7c5cda7b45 fix(queue-worker): concurrency 2024-08-06 16:57:00 +02:00
Gergo Moricz
d7d63790e5 fix(crawl-status): isCancelled should be status failed 2024-08-06 16:35:55 +02:00
Gergo Moricz
03c84a9372 cleanup and fix cancelling 2024-08-06 16:26:46 +02:00
rafaelsideguide
4d24a99d50 fix params 2024-08-06 09:34:43 -03:00
rafaelsideguide
3edc3a3d15 added fullpagescreenshot capabilities, wip on fire-engine side 2024-08-05 18:17:37 -03:00
rafaelsideguide
f32e8de156 fixes the empty excludes.filter undefined bug 2024-08-05 18:13:31 -03:00
KentHsu
1378ffc138 feat: add go-sdk 2024-08-04 17:33:33 +08:00
tak-s
af9bc5c8bb Suppressed repetitive logs 2024-08-04 15:09:36 +09:00
Nicolas
1742e4ceae Nick: 2024-08-02 19:25:15 -04:00
Nicolas
39aecd974b Update redis-health.ts 2024-08-02 17:43:45 -04:00
Nicolas
b448e3c3ad Update website_params.ts 2024-08-02 14:26:35 -04:00
rafaelsideguide
4051630632 Update sitemap.ts 2024-08-02 11:32:48 -03:00
rafaelsideguide
8568b61015 bugfix for sitemaps 2024-08-02 11:03:01 -03:00
Nicolas
af68b7a785
Merge pull request #475 from mendableai/bugfix/issue-466
[Bug] pdfs and logging pdf events, also added trycatchs for docx
2024-08-01 22:05:26 -04:00
rafaelsideguide
f48ff36b32 added .inc files and forced lower case comparison 2024-07-31 09:28:43 -03:00
Nicolas
ad6f6eff4b Update fireEngine.ts 2024-07-30 19:15:54 -04:00
Nicolas
f9827b2151 Update credit_billing.ts 2024-07-30 19:13:17 -04:00
Nicolas
6d99dedd3c Nick: fixed tests 2024-07-30 19:11:01 -04:00
Nicolas
a28ecc1f61 Nick: caching 2024-07-30 18:59:35 -04:00
Nicolas
52198f2991 Nick: 2024-07-30 16:15:08 -04:00
Nicolas
f43d5e7895 Nick: scrape queue 2024-07-30 14:44:13 -04:00
Nicolas
7e002a8b06 Nick: bull mq 2024-07-30 13:27:23 -04:00
Nicolas
46bcbd931f Merge branch 'main' into feat/queue-scrapes 2024-07-30 12:44:07 -04:00
Nicolas
fd2452ec9c Update scrape.ts 2024-07-30 12:42:12 -04:00
rafaelsideguide
8f5174ffc7 Update auth.ts 2024-07-30 10:37:33 -03:00
rafaelsideguide
d25d7e7244 special case: developer.apple.com
2024-07-30 10:13:09 -03:00
Nicolas
5e8ffcf505 Update website_params.ts 2024-07-29 20:43:47 -04:00
Nicolas
7b813883ef Nick: first layer 2024-07-29 20:31:51 -04:00
Nicolas
e99c2568f4 Update auth.ts 2024-07-29 18:44:18 -04:00
Nicolas
968a2dc753 Nick: 2024-07-29 18:37:09 -04:00
Nicolas
04942bb9de Nick: 2024-07-29 18:31:43 -04:00
Nicolas
267d4681bf Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-29 17:21:15 -04:00
Nicolas
b4833c1694 Nick: increasing default timeout to 45s 2024-07-29 17:21:11 -04:00
Nicolas
7fa08100bf
Merge pull request #414 from NiuBlibing/support_model_name
support custom models
2024-07-29 13:21:29 -04:00
rafaelsideguide
49e3e64787 bugfix for pdfs and logging pdf events, also added trycatchs for docx 2024-07-29 14:13:46 -03:00
Nicolas
4c9d62f6d3 Nick: fixing sitemap fallback 2024-07-26 18:25:44 -04:00
Nicolas
091924a636 Nick: moving machines from mia to virginia 2024-07-26 17:37:46 -04:00
Nicolas
cb97871ff9 Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-26 17:21:11 -04:00
Nicolas
ff4266f09e Update pdfProcessor.ts 2024-07-26 17:21:09 -04:00
Nicolas
0c2e3a72cc
Merge pull request #460 from mendableai/nsc/admin-router
Admin router + Improve redis notifications
2024-07-26 12:16:14 -04:00
rafaelsideguide
96cec2a673 fix checking scrape log success content length 2024-07-26 12:00:52 -03:00
Nicolas
542270f4c2
Merge pull request #461 from mendableai/nsc/small-handle-for-client-side-errors
Client side error handling
2024-07-25 20:53:10 -04:00
Nicolas
dc6f825270 Update email_notification.ts 2024-07-25 20:43:50 -04:00
Nicolas
f82ca3be17 Nick: 2024-07-25 19:53:29 -04:00
Nicolas
01fab6e036 Update single_url.ts 2024-07-25 17:51:41 -04:00
Nicolas
56042d090c Update single_url.ts 2024-07-25 17:48:44 -04:00
Nicolas
88f5efce8f Merge branch 'feat/scrape-monitoring' 2024-07-25 17:44:21 -04:00
Nicolas
3242872503 Update single_url.ts 2024-07-25 17:43:55 -04:00
Nicolas
ffd430f198
Merge pull request #457 from JakobStadlhuber/Readiness-Liveness-Probes
Readiness liveness probes
2024-07-25 17:20:31 -04:00
Nicolas
7129d7993e
Update v0.ts 2024-07-25 17:19:45 -04:00
rafaelsideguide
e0954d7f59 Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-25 17:48:43 -03:00
rafaelsideguide
81aa919262 fix 2024-07-25 17:47:43 -03:00
Nicolas
10e80f00cf Merge branch 'main' into nsc/admin-router 2024-07-25 16:46:38 -04:00
Nicolas
e5b797549e Merge branch 'main' into feat/scrape-monitoring 2024-07-25 16:21:02 -04:00
Nicolas
50d2426fc4 Update scrape-events.ts 2024-07-25 16:20:29 -04:00
Nicolas
28a8a98491 Update admin.ts 2024-07-25 14:58:14 -04:00
Nicolas
2014d9dd2e Nick: admin router 2024-07-25 14:54:20 -04:00
rafaelsideguide
1f1c068eea changing from error to debug 2024-07-25 10:00:50 -03:00
rafaelsideguide
e720e1bacf Merge remote-tracking branch 'origin/main' into feat/logger 2024-07-25 09:49:27 -03:00
rafaelsideguide
309728a482 updated logs 2024-07-25 09:48:06 -03:00
Nicolas
2c1221750b
Merge pull request #449 from mendableai/bugfix/malformed-url-sitemap
Added regex for links in sitemap
2024-07-24 20:37:35 -04:00
Gergő Móricz
d1a3df6d08 fix: aaaaahhh 2024-07-25 00:50:03 +02:00
Nicolas
6ad7e24403 Update ingestion.tsx 2024-07-24 18:15:51 -04:00
Gergő Móricz
6798695ee4 feat: move scraper to queue 2024-07-25 00:14:25 +02:00
Nicolas
92843a356d Merge branch 'main' of https://github.com/mendableai/firecrawl 2024-07-24 18:13:36 -04:00
Nicolas
1e13ddbe8e Nick: changes to the ui component 2024-07-24 18:13:34 -04:00
Gergő Móricz
623b547292 fix(fly.toml): scale up memory limit 2024-07-24 23:39:00 +02:00
Nicolas
15890772be Scale bump 2024-07-24 16:56:19 -04:00
Eric Ciarla
a4bccbe3bb
Firecrawl UI Template
Firecrawl UI template
2024-07-24 15:05:55 -04:00
Eric Ciarla
4596d0b2e6 Add ReadMe and LICENSE 2024-07-24 14:56:53 -04:00
Eric Ciarla
9654721bf2 Vite commit 2024-07-24 14:27:50 -04:00
rafaelsideguide
cc98f83fda added failed and completed log events 2024-07-24 15:25:36 -03:00
Jakob Stadlhuber
be9e7f9edf Update Kubernetes configs for playwright-service, api, and worker
Added new ConfigMap for playwright-service and adjusted existing references.
Applied imagePullPolicy: Always to ensure all images are updated promptly.
Updated README to include --no-cache for Docker build instructions.
2024-07-24 18:54:16 +02:00
Gergo Moricz
60c74357df feat(ScrapeEvents): log queue events 2024-07-24 18:44:14 +02:00
rafaelsideguide
4eca6bd301 fix/check-for-auth-on-scrape-log 2024-07-24 12:54:14 -03:00
Nicolas
3a1b8a9797 Update website_params.ts 2024-07-24 11:04:47 -04:00
Nicolas
8b48ec8d30 Update website_params.ts 2024-07-24 11:02:20 -04:00
Gergo Moricz
4d35ad073c feat(monitoring/scrape): include url, worker, response_size 2024-07-24 16:43:39 +02:00
Gergo Moricz
64bcedeefc fix(monitoring): bad success check on scrape 2024-07-24 16:21:59 +02:00
Gergo Moricz
d57dbbd0c6 fix: add jobId for scrape 2024-07-24 15:18:12 +02:00
Gergo Moricz
71072fef3b fix(scrape-events): bad logic 2024-07-24 14:46:41 +02:00
Gergo Moricz
7cd9bf92e3 feat: scrape event logging to DB 2024-07-24 14:31:25 +02:00
Rafael Miller
5e728c1a4d
Update apps/api/src/scraper/WebScraper/crawler.ts
no need for regex

Co-authored-by: Gergő Móricz <mo.geryy@gmail.com>
2024-07-24 08:33:00 -03:00
Eric Ciarla
1b7a00624d Delete old comp 2024-07-23 21:51:08 -04:00
Eric Ciarla
565bc09439 Basic react app 2024-07-23 21:48:11 -04:00
rafaelsideguide
6208ecdbc0 added logger 2024-07-23 17:30:46 -03:00
Eric Ciarla
a0d89169ed init 2024-07-23 15:48:12 -04:00
Nicolas
f0b07b509b Update index.ts 2024-07-23 15:15:56 -04:00
rafaelsideguide
a684bd3c5d added regex for links in sitemap 2024-07-23 09:07:23 -03:00
Nicolas
30e706b43f Update scrape.ts 2024-07-22 19:15:24 -04:00
Nicolas
8916fec66c Update index.ts 2024-07-22 19:14:53 -04:00
Nicolas
575ddc9e6e Update scrape.ts 2024-07-22 19:12:51 -04:00
Nicolas
e31a5007d5 Nick: speed improvements 2024-07-22 18:30:58 -04:00
Nicolas
b229fbebd8 Update scrape_log.ts 2024-07-19 12:53:26 -04:00
rafaelsideguide
5c02dbe20c fix(isFile): added .tiff extension 2024-07-18 17:07:21 -03:00
Gergo Moricz
f0e95ce399 fix(WebCrawler): filter out file URLs when taking URLs from sitemap 2024-07-18 21:49:37 +02:00
Gergo Moricz
95c6c63b85 fix(fly): raise heap limit to 4G per process 2024-07-18 20:56:54 +02:00
Nicolas
5f14f4f788 Update blocklist.ts 2024-07-18 14:20:19 -04:00
Nicolas
6161b83890 Update scrape_log.ts 2024-07-18 14:17:08 -04:00
Nicolas
2dd7398aad Update scrape_log.ts 2024-07-18 14:16:46 -04:00
Nicolas
f10f3f886b
Merge pull request #410 from mendableai/feat/fire-engine-chrome-cdp
Support chrome-cdp and restructure sitemap fire-engine support.
2024-07-18 13:52:08 -04:00
Nicolas
9a1a227797 Update crawl-cancel.ts 2024-07-18 13:49:51 -04:00
Nicolas
11768571ed Update crawl-cancel.ts 2024-07-18 13:43:03 -04:00
Nicolas
ce804d3c20 Update crawl-cancel.ts 2024-07-18 13:40:24 -04:00
Nicolas
d2de01d342 Nick: fixes 2024-07-18 13:19:44 -04:00
Gergo Moricz
0b8047c7a0 fix(WebScraper): infinite regex leading to fly.io instance hangs 2024-07-18 19:13:43 +02:00