Mirror of https://github.com/mendableai/firecrawl.git
Synced 2024-11-16 03:32:22 +08:00

Initial commit

This commit is contained in commit a6c2a87811.
.gitattributes (2 lines, vendored, new file)
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto
.github/workflows/fly.yml (20 lines, vendored, new file)
@@ -0,0 +1,20 @@
name: Fly Deploy
on:
  push:
    branches:
      - main
  # schedule:
  #   - cron: '0 */4 * * *'

jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Change directory
        run: cd apps/api
      - run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
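Note that each `run:` in a GitHub Actions job starts a fresh shell, so the `Change directory` step as committed has no effect on the later `flyctl deploy` step. A conventional fix (an assumption about the intent, not part of this commit) is to set `working-directory` on the deploy step itself:

```yaml
      - run: flyctl deploy --remote-only
        working-directory: apps/api
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
```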
.gitignore (6 lines, vendored, new file)
@@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data
CONTRIBUTING.md (4 lines, new file)
@@ -0,0 +1,4 @@
# Contributing

We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
LICENSE (201 lines, new file)
@@ -0,0 +1,201 @@
                              Apache License
                        Version 2.0, January 2004
                     http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

   "License" shall mean the terms and conditions for use, reproduction,
   and distribution as defined by Sections 1 through 9 of this document.

   "Licensor" shall mean the copyright owner or entity authorized by
   the copyright owner that is granting the License.

   "Legal Entity" shall mean the union of the acting entity and all
   other entities that control, are controlled by, or are under common
   control with that entity. For the purposes of this definition,
   "control" means (i) the power, direct or indirect, to cause the
   direction or management of such entity, whether by contract or
   otherwise, or (ii) ownership of fifty percent (50%) or more of the
   outstanding shares, or (iii) beneficial ownership of such entity.

   "You" (or "Your") shall mean an individual or Legal Entity
   exercising permissions granted by this License.

   "Source" form shall mean the preferred form for making modifications,
   including but not limited to software source code, documentation
   source, and configuration files.

   "Object" form shall mean any form resulting from mechanical
   transformation or translation of a Source form, including but
   not limited to compiled object code, generated documentation,
   and conversions to other media types.

   "Work" shall mean the work of authorship, whether in Source or
   Object form, made available under the License, as indicated by a
   copyright notice that is included in or attached to the work
   (an example is provided in the Appendix below).

   "Derivative Works" shall mean any work, whether in Source or Object
   form, that is based on (or derived from) the Work and for which the
   editorial revisions, annotations, elaborations, or other modifications
   represent, as a whole, an original work of authorship. For the purposes
   of this License, Derivative Works shall not include works that remain
   separable from, or merely link (or bind by name) to the interfaces of,
   the Work and Derivative Works thereof.

   "Contribution" shall mean any work of authorship, including
   the original version of the Work and any modifications or additions
   to that Work or Derivative Works thereof, that is intentionally
   submitted to Licensor for inclusion in the Work by the copyright owner
   or by an individual or Legal Entity authorized to submit on behalf of
   the copyright owner. For the purposes of this definition, "submitted"
   means any form of electronic, verbal, or written communication sent
   to the Licensor or its representatives, including but not limited to
   communication on electronic mailing lists, source code control systems,
   and issue tracking systems that are managed by, or on behalf of, the
   Licensor for the purpose of discussing and improving the Work, but
   excluding communication that is conspicuously marked or otherwise
   designated in writing by the copyright owner as "Not a Contribution."

   "Contributor" shall mean Licensor and any individual or Legal Entity
   on behalf of whom a Contribution has been received by Licensor and
   subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
   this License, each Contributor hereby grants to You a perpetual,
   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
   copyright license to reproduce, prepare Derivative Works of,
   publicly display, publicly perform, sublicense, and distribute the
   Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
   this License, each Contributor hereby grants to You a perpetual,
   worldwide, non-exclusive, no-charge, royalty-free, irrevocable
   (except as stated in this section) patent license to make, have made,
   use, offer to sell, sell, import, and otherwise transfer the Work,
   where such license applies only to those patent claims licensable
   by such Contributor that are necessarily infringed by their
   Contribution(s) alone or by combination of their Contribution(s)
   with the Work to which such Contribution(s) was submitted. If You
   institute patent litigation against any entity (including a
   cross-claim or counterclaim in a lawsuit) alleging that the Work
   or a Contribution incorporated within the Work constitutes direct
   or contributory patent infringement, then any patent licenses
   granted to You under this License for that Work shall terminate
   as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
   Work or Derivative Works thereof in any medium, with or without
   modifications, and in Source or Object form, provided that You
   meet the following conditions:

   (a) You must give any other recipients of the Work or
       Derivative Works a copy of this License; and

   (b) You must cause any modified files to carry prominent notices
       stating that You changed the files; and

   (c) You must retain, in the Source form of any Derivative Works
       that You distribute, all copyright, patent, trademark, and
       attribution notices from the Source form of the Work,
       excluding those notices that do not pertain to any part of
       the Derivative Works; and

   (d) If the Work includes a "NOTICE" text file as part of its
       distribution, then any Derivative Works that You distribute must
       include a readable copy of the attribution notices contained
       within such NOTICE file, excluding those notices that do not
       pertain to any part of the Derivative Works, in at least one
       of the following places: within a NOTICE text file distributed
       as part of the Derivative Works; within the Source form or
       documentation, if provided along with the Derivative Works; or,
       within a display generated by the Derivative Works, if and
       wherever such third-party notices normally appear. The contents
       of the NOTICE file are for informational purposes only and
       do not modify the License. You may add Your own attribution
       notices within Derivative Works that You distribute, alongside
       or as an addendum to the NOTICE text from the Work, provided
       that such additional attribution notices cannot be construed
       as modifying the License.

   You may add Your own copyright statement to Your modifications and
   may provide additional or different license terms and conditions
   for use, reproduction, or distribution of Your modifications, or
   for any such Derivative Works as a whole, provided Your use,
   reproduction, and distribution of the Work otherwise complies with
   the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
   any Contribution intentionally submitted for inclusion in the Work
   by You to the Licensor shall be under the terms and conditions of
   this License, without any additional terms or conditions.
   Notwithstanding the above, nothing herein shall supersede or modify
   the terms of any separate license agreement you may have executed
   with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
   names, trademarks, service marks, or product names of the Licensor,
   except as required for reasonable and customary use in describing the
   origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
   agreed to in writing, Licensor provides the Work (and each
   Contributor provides its Contributions) on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
   implied, including, without limitation, any warranties or conditions
   of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
   PARTICULAR PURPOSE. You are solely responsible for determining the
   appropriateness of using or redistributing the Work and assume any
   risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
   whether in tort (including negligence), contract, or otherwise,
   unless required by applicable law (such as deliberate and grossly
   negligent acts) or agreed to in writing, shall any Contributor be
   liable to You for damages, including any direct, indirect, special,
   incidental, or consequential damages of any character arising as a
   result of this License or out of the use or inability to use the
   Work (including but not limited to damages for loss of goodwill,
   work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses), even if such Contributor
   has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
   the Work or Derivative Works thereof, You may choose to offer,
   and charge a fee for, acceptance of support, warranty, indemnity,
   or other liability obligations and/or rights consistent with this
   License. However, in accepting such obligations, You may act only
   on Your own behalf and on Your sole responsibility, not on behalf
   of any other Contributor, and only if You agree to indemnify,
   defend, and hold each Contributor harmless for any liability
   incurred by, or claims asserted against, such Contributor by reason
   of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

   To apply the Apache License to your work, attach the following
   boilerplate notice, with the fields enclosed by brackets "[]"
   replaced with your own identifying information. (Don't include
   the brackets!) The text should be enclosed in the appropriate
   comment syntax for the file format. We also recommend that a
   file or class name and description of purpose be included on the
   same "printed page" as the copyright notice for easier
   identification within third-party archives.

Copyright 2024 Firecrawl | Mendable.ai

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
README.md (108 lines, new file)
@@ -0,0 +1,108 @@
# 🔥 Firecrawl

Crawl and convert any website into clean markdown.

*This repo is still in early development.*

## What is Firecrawl?

[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.

## How to use it?

We provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.com/playground). You can also self-host the backend if you'd like.

- [x] API
- [x] Python SDK
- [ ] JS SDK - Coming Soon

To self-host, refer to the guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).

### API Key

To use the API, you need to sign up on [Firecrawl](https://firecrawl.com) and get an API key.

### Crawling

Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.

```bash
curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'
```

Returns a jobId:

```json
{ "jobId": "1234-5678-9101" }
```

### Check Crawl Job

Used to check the status of a crawl job and get its result.

```bash
curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY'
```

```json
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
```

## Using the Python SDK

### Installing the Python SDK

```bash
pip install firecrawl-py
```

### Crawl a website

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])
```

### Scraping a URL

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```

## Contributing

We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
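The crawl-status payload documented in the README can be consumed with a few lines of client code. A minimal TypeScript sketch (the response shape is taken from the example JSON; `CrawlStatus` and `extractMarkdown` are illustrative names, not part of the API):

```typescript
// Shape of the /v0/crawl/status response, per the README example.
interface CrawlStatus {
  status: string;
  current: number;
  total: number;
  data: { content: string; markdown: string; provider: string }[];
}

// Collect the markdown of every crawled page once the job has completed;
// an unfinished job yields an empty list.
function extractMarkdown(res: CrawlStatus): string[] {
  if (res.status !== "completed") return [];
  return res.data.map((page) => page.markdown);
}
```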
SELF_HOST.md (6 lines, new file)
@@ -0,0 +1,6 @@
# Self-hosting Firecrawl

Guide coming soon.
apps/.DS_Store (binary, vendored, new file)
Binary file not shown.
apps/api/.dockerignore (4 lines, new file)
@@ -0,0 +1,4 @@
/node_modules/
/dist/
.env
*.csv
apps/api/.env.local (8 lines, new file)
@@ -0,0 +1,8 @@
PORT=
HOST=
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
REDIS_URL=
OPENAI_API_KEY=
SCRAPING_BEE_API_KEY=
apps/api/.gitattributes (2 lines, vendored, new file)
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto
apps/api/.gitignore (6 lines, vendored, new file)
@@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data
apps/api/Dockerfile (36 lines, new file)
@@ -0,0 +1,36 @@
FROM node:20-slim AS base
ENV PNPM_HOME="/pnpm"
ENV PATH="$PNPM_HOME:$PATH"
LABEL fly_launch_runtime="Node.js"
RUN corepack enable
COPY . /app
WORKDIR /app

FROM base AS prod-deps
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --prod --frozen-lockfile

FROM base AS build
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --frozen-lockfile

RUN pnpm install
RUN pnpm run build

# Install packages needed for deployment

FROM base
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y chromium chromium-sandbox && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives
COPY --from=prod-deps /app/node_modules /app/node_modules
COPY --from=build /app /app

# Start the server by default, this can be overwritten at runtime
EXPOSE 8080
ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium"
CMD [ "pnpm", "run", "start:production" ]
CMD [ "pnpm", "run", "worker:production" ]
apps/api/fly.toml (47 lines, new file)
@@ -0,0 +1,47 @@
# fly.toml app configuration file generated for firecrawl-scraper-js on 2024-04-07T21:09:59-03:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = 'firecrawl-scraper-js'
primary_region = 'mia'
kill_signal = 'SIGINT'
kill_timeout = '5s'

[build]

[processes]
  app = 'npm run start:production'
  worker = 'npm run worker:production'

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[services]]
  protocol = 'tcp'
  internal_port = 8080
  processes = ['app']

  [[services.ports]]
    port = 80
    handlers = ['http']
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ['tls', 'http']

  [services.concurrency]
    type = 'connections'
    hard_limit = 45
    soft_limit = 20

[[vm]]
  size = 'performance-1x'
apps/api/jest.config.js (5 lines, new file)
@@ -0,0 +1,5 @@
module.exports = {
  preset: "ts-jest",
  testEnvironment: "node",
  setupFiles: ["./jest.setup.js"],
};
apps/api/jest.setup.js (1 line, new file)
@@ -0,0 +1 @@
global.fetch = require('jest-fetch-mock');
apps/api/package.json (98 lines, new file)
@@ -0,0 +1,98 @@
{
  "name": "firecrawl-scraper-js",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "nodemon --exec ts-node src/index.ts",
    "start:production": "tsc && node dist/src/index.js",
    "format": "prettier --write \"src/**/*.(js|ts)\"",
    "flyio": "node dist/src/index.js",
    "start:dev": "nodemon --exec ts-node src/index.ts",
    "build": "tsc",
    "test": "jest --verbose",
    "workers": "nodemon --exec ts-node src/services/queue-worker.ts",
    "worker:production": "node dist/src/services/queue-worker.js",
    "mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
    "mongo-docker-console": "docker exec -it mongodb mongosh",
    "run-example": "npx ts-node src/example.ts"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "@flydotio/dockerfile": "^0.4.10",
    "@tsconfig/recommended": "^1.0.3",
    "@types/body-parser": "^1.19.2",
    "@types/bull": "^4.10.0",
    "@types/cors": "^2.8.13",
    "@types/express": "^4.17.17",
    "@types/jest": "^29.5.6",
    "body-parser": "^1.20.1",
    "express": "^4.18.2",
    "jest": "^29.6.3",
    "jest-fetch-mock": "^3.0.3",
    "nodemon": "^2.0.20",
    "supabase": "^1.77.9",
    "supertest": "^6.3.3",
    "ts-jest": "^29.1.1",
    "ts-node": "^10.9.1",
    "typescript": "^5.4.2"
  },
  "dependencies": {
    "@brillout/import": "^0.2.2",
    "@bull-board/api": "^5.14.2",
    "@bull-board/express": "^5.8.0",
    "@devil7softwares/pos": "^1.0.2",
    "@dqbd/tiktoken": "^1.0.7",
    "@logtail/node": "^0.4.12",
    "@nangohq/node": "^0.36.33",
    "@sentry/node": "^7.48.0",
    "@supabase/supabase-js": "^2.7.1",
    "async": "^3.2.5",
    "async-mutex": "^0.4.0",
    "axios": "^1.3.4",
    "bottleneck": "^2.19.5",
    "bull": "^4.11.4",
    "cheerio": "^1.0.0-rc.12",
    "cohere": "^1.1.1",
    "cors": "^2.8.5",
    "cron-parser": "^4.9.0",
    "date-fns": "^2.29.3",
    "dotenv": "^16.3.1",
    "express-rate-limit": "^6.7.0",
    "glob": "^10.3.12",
    "gpt3-tokenizer": "^1.1.5",
    "ioredis": "^5.3.2",
    "keyword-extractor": "^0.0.25",
    "langchain": "^0.1.25",
    "languagedetect": "^2.0.0",
    "logsnag": "^0.1.6",
    "luxon": "^3.4.3",
    "md5": "^2.3.0",
    "moment": "^2.29.4",
    "mongoose": "^8.0.3",
    "natural": "^6.3.0",
    "openai": "^4.28.4",
    "pos": "^0.4.2",
    "promptable": "^0.0.9",
    "puppeteer": "^22.6.3",
    "rate-limiter-flexible": "^2.4.2",
    "redis": "^4.6.7",
    "robots-parser": "^3.0.1",
    "scrapingbee": "^1.7.4",
    "stripe": "^12.2.0",
    "turndown": "^7.1.3",
    "typesense": "^1.5.4",
    "unstructured-client": "^0.9.4",
    "uuid": "^9.0.1",
    "wordpos": "^2.1.0",
    "xml2js": "^0.6.2"
  },
  "nodemonConfig": {
    "ignore": [
      "*.docx",
      "*.json",
      "temp"
    ]
  }
}
apps/api/pnpm-lock.yaml (6146 lines, new file)
File diff suppressed because it is too large.
apps/api/requests.http (53 lines, new file)
@@ -0,0 +1,53 @@
### Crawl Website
POST http://localhost:3002/v0/crawl HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d

{
  "url": "https://docs.mendable.ai"
}

### Check Job Status
GET http://localhost:3002/v0/jobs/active HTTP/1.1

### Scrape Website
POST https://api.firecrawl.dev/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json

{
  "url": "https://www.agentops.ai"
}

### Scrape Website
POST http://localhost:3002/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json

{
  "url": "https://www.agentops.ai"
}

### Check Job Status
GET http://localhost:3002/v0/crawl/status/333ab225-dc3e-418b-9d4b-8fb833cbaf89 HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d

### Get Job Result
POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
content-type: application/json

{
  "url": "https://markprompt.com"
}

### Check Job Status
GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
apps/api/src/.DS_Store (binary, vendored, new file)
Binary file not shown.
apps/api/src/control.ts (2 lines, new file)
@@ -0,0 +1,2 @@
// ! IN CASE OPENAI goes down, then activate the fallback -> true
export const is_fallback = false;
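A manual kill switch like `is_fallback` in control.ts is typically read at call time to decide which provider handles a request. A hypothetical sketch of that gating (the `selectProvider` helper and provider names are illustrative, not from this commit):

```typescript
// Hypothetical provider selection gated by a fallback flag,
// mirroring the is_fallback switch exported from control.ts.
type Provider = "openai" | "fallback";

function selectProvider(isFallback: boolean): Provider {
  // When the primary provider is down, the flag is flipped to true
  // and every caller routes to the fallback instead.
  return isFallback ? "fallback" : "openai";
}
```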
apps/api/src/example.ts (18 lines, new file)
@@ -0,0 +1,18 @@
import { WebScraperDataProvider } from "./scraper/WebScraper";

async function example() {
  const example = new WebScraperDataProvider();

  await example.setOptions({
    mode: "crawl",
    urls: ["https://mendable.ai"],
    crawlerOptions: {},
  });
  const docs = await example.getDocuments(false);
  docs.map((doc) => {
    console.log(doc.metadata.sourceURL);
  });
  console.log(docs.length);
}

// example();
apps/api/src/index.ts (352 lines, new file)
@@ -0,0 +1,352 @@
import express from "express";
import bodyParser from "body-parser";
import cors from "cors";
import "dotenv/config";
import { getWebScraperQueue } from "./services/queue-service";
import { addWebScraperJob } from "./services/queue-jobs";
import { supabase_service } from "./services/supabase";
import { WebScraperDataProvider } from "./scraper/WebScraper";
import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
import { getRateLimiter, redisClient } from "./services/rate-limiter";

const { createBullBoard } = require("@bull-board/api");
const { BullAdapter } = require("@bull-board/api/bullAdapter");
const { ExpressAdapter } = require("@bull-board/express");

export const app = express();

global.isProduction = process.env.IS_PRODUCTION === "true";

app.use(bodyParser.urlencoded({ extended: true }));
app.use(bodyParser.json({ limit: "10mb" }));

app.use(cors()); // Enable CORS

const serverAdapter = new ExpressAdapter();
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);

const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
  queues: [new BullAdapter(getWebScraperQueue())],
  serverAdapter: serverAdapter,
});

app.use(
  `/admin/${process.env.BULL_AUTH_KEY}/queues`,
  serverAdapter.getRouter()
);

app.get("/", (req, res) => {
  res.send("SCRAPERS-JS: Hello, world! Fly.io");
});

// A simple test endpoint
app.get("/test", async (req, res) => {
  res.send("Hello, world!");
});

async function authenticateUser(req, res, mode?: string): Promise<string> {
  const authHeader = req.headers.authorization;
  if (!authHeader) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
  if (!token) {
    return res.status(401).json({ error: "Unauthorized: Token missing" });
  }

  try {
    const incomingIP = (req.headers["x-forwarded-for"] ||
      req.socket.remoteAddress) as string;
    const iptoken = incomingIP + token;
    await getRateLimiter(
      token === "this_is_just_a_preview_token" ? true : false
    ).consume(iptoken);
  } catch (rateLimiterRes) {
    console.error(rateLimiterRes);
    return res.status(429).json({
      error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
    });
  }

  if (token === "this_is_just_a_preview_token" && mode === "scrape") {
    return "preview";
  }
  // Make sure the API key is valid, based on the api_keys table in Supabase
  const { data, error } = await supabase_service
    .from("api_keys")
    .select("*")
    .eq("key", token);
  if (error || !data || data.length === 0) {
    return res.status(401).json({ error: "Unauthorized: Invalid token" });
  }

  return data[0].team_id;
}
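The handler above rate-limits on the caller's IP concatenated with the bearer token, preferring `x-forwarded-for` over the socket address. That keying scheme can be sketched in isolation (a simplified extraction; `buildRateLimitKey` and `isPreviewToken` are illustrative names, not exports of this file):

```typescript
// Build the rate-limit key the handler uses: incoming IP + bearer token.
// The x-forwarded-for header wins over the raw socket address when present.
function buildRateLimitKey(
  forwardedFor: string | undefined,
  remoteAddress: string,
  token: string
): string {
  const incomingIP = forwardedFor ?? remoteAddress;
  return incomingIP + token;
}

// Preview tokens are routed to a separate limiter by getRateLimiter(true).
function isPreviewToken(token: string): boolean {
  return token === "this_is_just_a_preview_token";
}
```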
app.post("/v0/scrape", async (req, res) => {
|
||||
try {
|
||||
// make sure to authenticate user first, Bearer <token>
|
||||
const team_id = await authenticateUser(req, res, "scrape");
|
||||
|
||||
try {
|
||||
const { success: creditsCheckSuccess, message: creditsCheckMessage } =
|
||||
await checkTeamCredits(team_id, 1);
|
||||
if (!creditsCheckSuccess) {
|
||||
return res.status(402).json({ error: "Insufficient credits" });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: "Internal server error" });
|
||||
}
|
||||
|
||||
// authenticate on supabase
|
||||
const url = req.body.url;
|
||||
if (!url) {
|
||||
return res.status(400).json({ error: "Url is required" });
|
||||
}
|
||||
|
||||
try {
|
||||
const a = new WebScraperDataProvider();
|
||||
await a.setOptions({
|
||||
mode: "single_urls",
|
||||
urls: [url],
|
||||
});
|
||||
|
||||
const docs = await a.getDocuments(false);
|
||||
// make sure doc.content is not empty
|
||||
const filteredDocs = docs.filter(
|
||||
(doc: { content?: string }) =>
|
||||
doc.content && doc.content.trim().length > 0
|
||||
);
|
||||
if (filteredDocs.length === 0) {
|
||||
return res.status(200).json({ success: true, data: [] });
|
||||
}
|
||||
const { success, credit_usage } = await billTeam(
|
||||
team_id,
|
||||
filteredDocs.length
|
||||
);
|
||||
if (!success) {
|
||||
// throw new Error("Failed to bill team, no subscribtion was found");
|
||||
// return {
|
||||
// success: false,
|
||||
// message: "Failed to bill team, no subscribtion was found",
|
||||
// docs: [],
|
||||
// };
|
||||
return res
|
||||
.status(402)
|
||||
.json({ error: "Failed to bill, no subscribtion was found" });
|
||||
}
|
||||
return res.json({
|
||||
success: true,
|
||||
data: filteredDocs[0],
|
||||
});
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
} catch (error) {
|
||||
console.error(error);
|
||||
return res.status(500).json({ error: error.message });
|
||||
}
|
||||
});
|

app.post("/v0/crawl", async (req, res) => {
  try {
    const team_id = await authenticateUser(req, res);

    const { success: creditsCheckSuccess, message: creditsCheckMessage } =
      await checkTeamCredits(team_id, 1);
    if (!creditsCheckSuccess) {
      return res.status(402).json({ error: "Insufficient credits" });
    }

    // authenticate on supabase
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};

    if (mode === "single_urls" && !url.includes(",")) {
      try {
        const a = new WebScraperDataProvider();
        await a.setOptions({
          mode: "single_urls",
          urls: [url],
          crawlerOptions: {
            returnOnlyUrls: true,
          },
        });

        const docs = await a.getDocuments(false, (progress) => {
          // No queue job exists on this synchronous path, so just log progress.
          console.log(
            `SCRAPING ${progress.current}/${progress.total}: ${progress.currentDocumentUrl}`
          );
        });
        return res.json({
          success: true,
          documents: docs,
        });
      } catch (error) {
        console.error(error);
        return res.status(500).json({ error: error.message });
      }
    }
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions },
      team_id: team_id,
    });

    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.post("/v0/crawlWebsitePreview", async (req, res) => {
  try {
    // make sure to authenticate the user first, via "Bearer <token>"
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }

    // authenticate on supabase
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
      team_id: "preview",
    });

    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/v0/crawl/status/:jobId", async (req, res) => {
  try {
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }

    // make sure the api key is valid, based on the api_keys table in supabase
    const { data, error } = await supabase_service
      .from("api_keys")
      .select("*")
      .eq("key", token);
    if (error || !data || data.length === 0) {
      return res.status(401).json({ error: "Unauthorized: Invalid token" });
    }
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }

    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
  try {
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }

    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

const DEFAULT_PORT = process.env.PORT ?? 3002;
const HOST = process.env.HOST ?? "localhost";
redisClient.connect();

export function startServer(port = DEFAULT_PORT) {
  const server = app.listen(Number(port), HOST, () => {
    console.log(`Server listening on port ${port}`);
    console.log(`For the UI, open http://${HOST}:${port}/admin/queues`);
    console.log("");
    console.log("1. Make sure Redis is running (on port 6379 by default)");
    console.log(
      "2. If you want to run Nango, port-forward 3002 with: ngrok http 3002"
    );
  });
  return server;
}

if (require.main === module) {
  startServer();
}

// Health check: returns 200 only when no scrape jobs are active, so a deploy
// doesn't tear down a server with in-flight work.
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
  try {
    const webScraperQueue = getWebScraperQueue();
    const [webScraperActive] = await Promise.all([
      webScraperQueue.getActiveCount(),
    ]);

    const noActiveJobs = webScraperActive === 0;
    // 200 if no active jobs, 500 if there are active jobs
    return res.status(noActiveJobs ? 200 : 500).json({
      webScraperActive,
      noActiveJobs,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/is-production", (req, res) => {
  res.send({ isProduction: global.isProduction });
});
apps/api/src/lib/batch-process.ts (new file, 16 lines)
@@ -0,0 +1,16 @@
export async function batchProcess<T>(
  array: T[],
  batchSize: number,
  asyncFunction: (item: T, index: number) => Promise<void>
): Promise<void> {
  const batches = [];
  for (let i = 0; i < array.length; i += batchSize) {
    const batch = array.slice(i, i + batchSize);
    batches.push(batch);
  }

  for (const batch of batches) {
    await Promise.all(batch.map((item, i) => asyncFunction(item, i)));
  }
}
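A minimal sketch of how `batchProcess` behaves, with the function copied inline so the snippet runs standalone: each batch of `batchSize` items is awaited in full before the next batch starts, so at most `batchSize` calls run concurrently.

```typescript
// Inlined copy of batchProcess so this sketch is self-contained.
async function batchProcess<T>(
  array: T[],
  batchSize: number,
  asyncFunction: (item: T, index: number) => Promise<void>
): Promise<void> {
  for (let i = 0; i < array.length; i += batchSize) {
    const batch = array.slice(i, i + batchSize);
    // Await the whole batch before moving on.
    await Promise.all(batch.map((item, idx) => asyncFunction(item, idx)));
  }
}

async function demo(): Promise<number[]> {
  const seen: number[] = [];
  // Process five items, two at a time.
  await batchProcess([1, 2, 3, 4, 5], 2, async (item) => {
    seen.push(item);
  });
  return seen;
}
```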
apps/api/src/lib/custom-error.ts (new file, 21 lines)
@@ -0,0 +1,21 @@
export class CustomError extends Error {
  statusCode: number;
  status: string;
  message: string;
  dataIngestionJob: any;

  constructor(
    statusCode: number,
    status: string,
    message: string = "",
    dataIngestionJob?: any,
  ) {
    super(message);
    this.statusCode = statusCode;
    this.status = status;
    this.message = message;
    this.dataIngestionJob = dataIngestionJob;

    Object.setPrototypeOf(this, CustomError.prototype);
  }
}
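The `Object.setPrototypeOf` call in the constructor matters when the build target is ES5, where subclassing `Error` silently breaks the prototype chain and `instanceof` checks. A trimmed, self-contained sketch (without the `dataIngestionJob` field) showing the fields and the `instanceof` behavior this preserves:

```typescript
// Trimmed copy of CustomError for a standalone demonstration.
class CustomError extends Error {
  statusCode: number;
  status: string;

  constructor(statusCode: number, status: string, message: string = "") {
    super(message);
    this.statusCode = statusCode;
    this.status = status;
    // Restore the prototype chain so `instanceof CustomError` holds
    // even when compiled to an ES5 target.
    Object.setPrototypeOf(this, CustomError.prototype);
  }
}

const err = new CustomError(402, "payment_required", "no credits");
```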
apps/api/src/lib/entities.ts (new file, 37 lines)
@@ -0,0 +1,37 @@
export interface Progress {
  current: number;
  total: number;
  status: string;
  metadata?: {
    sourceURL?: string;
    [key: string]: any;
  };
  currentDocumentUrl?: string;
}

export class Document {
  id?: string;
  content: string;
  markdown?: string;
  createdAt?: Date;
  updatedAt?: Date;
  type?: string;
  metadata: {
    sourceURL?: string;
    [key: string]: any;
  };
  childrenLinks?: string[];

  constructor(data: Partial<Document>) {
    if (!data.content) {
      throw new Error("Missing required fields");
    }
    this.content = data.content;
    this.createdAt = data.createdAt || new Date();
    this.updatedAt = data.updatedAt || new Date();
    this.type = data.type || "unknown";
    this.metadata = data.metadata || { sourceURL: "" };
    this.markdown = data.markdown || "";
    this.childrenLinks = data.childrenLinks || undefined;
  }
}
apps/api/src/lib/html-to-markdown.ts (new file, 51 lines)
@@ -0,0 +1,51 @@
export function parseMarkdown(html: string) {
  const TurndownService = require("turndown");

  const turndownService = new TurndownService();
  turndownService.addRule("inlineLink", {
    filter: function (node, options) {
      return (
        options.linkStyle === "inlined" &&
        node.nodeName === "A" &&
        node.getAttribute("href")
      );
    },
    replacement: function (content, node) {
      const href = node.getAttribute("href").trim();
      const title = node.title ? ' "' + node.title + '"' : "";
      return "[" + content.trim() + "](" + href + title + ")\n";
    },
  });

  let markdownContent = turndownService.turndown(html);

  // Escape newlines inside multi-line link text so the links survive rendering.
  let insideLinkContent = false;
  let newMarkdownContent = "";
  let linkOpenCount = 0;
  for (let i = 0; i < markdownContent.length; i++) {
    const char = markdownContent[i];

    if (char == "[") {
      linkOpenCount++;
    } else if (char == "]") {
      linkOpenCount = Math.max(0, linkOpenCount - 1);
    }
    insideLinkContent = linkOpenCount > 0;

    if (insideLinkContent && char == "\n") {
      newMarkdownContent += "\\" + "\n";
    } else {
      newMarkdownContent += char;
    }
  }
  markdownContent = newMarkdownContent;

  // Remove [Skip to Content](#page) and [Skip to content](#skip)
  markdownContent = markdownContent.replace(
    /\[Skip to Content\]\(#[^\)]*\)/gi,
    ""
  );

  return markdownContent;
}
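`parseMarkdown` itself needs the `turndown` package, but its post-processing loop can be shown standalone: inside `[...]` link text, a raw newline is replaced with a backslash plus newline (a markdown hard break), so multi-line link text does not break the link. A self-contained sketch with a hypothetical helper name:

```typescript
// Hypothetical standalone extraction of the newline-escaping pass above.
function escapeNewlinesInLinks(markdown: string): string {
  let out = "";
  let linkOpenCount = 0;
  for (const char of markdown) {
    // Track bracket depth to know when we are inside link text.
    if (char === "[") {
      linkOpenCount++;
    } else if (char === "]") {
      linkOpenCount = Math.max(0, linkOpenCount - 1);
    }
    if (linkOpenCount > 0 && char === "\n") {
      out += "\\" + "\n"; // escape the newline inside link text
    } else {
      out += char;
    }
  }
  return out;
}
```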
apps/api/src/lib/parse-mode.ts (new file, 12 lines)
@@ -0,0 +1,12 @@
export function parseMode(mode: string) {
  switch (mode) {
    case "single_urls":
      return "single_urls";
    case "sitemap":
      return "sitemap";
    case "crawl":
      return "crawl";
    default:
      return "single_urls";
  }
}
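`parseMode` normalizes arbitrary request input to one of the three supported modes, falling back to `"single_urls"` for anything unrecognized. Inlined copy for a standalone check:

```typescript
// Inlined copy of parseMode so the behavior can be exercised directly.
function parseMode(mode: string) {
  switch (mode) {
    case "single_urls":
      return "single_urls";
    case "sitemap":
      return "sitemap";
    case "crawl":
      return "crawl";
    default:
      // Unknown values fall back to the safest mode.
      return "single_urls";
  }
}
```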
apps/api/src/main/runWebScraper.ts (new file, 96 lines)
@@ -0,0 +1,96 @@
import { Job } from "bull";
import { CrawlResult, WebScraperOptions } from "../types";
import { WebScraperDataProvider } from "../scraper/WebScraper";
import { Progress } from "../lib/entities";
import { billTeam } from "../services/billing/credit_billing";

export async function startWebScraperPipeline({
  job,
}: {
  job: Job<WebScraperOptions>;
}) {
  return (await runWebScraper({
    url: job.data.url,
    mode: job.data.mode,
    crawlerOptions: job.data.crawlerOptions,
    inProgress: (progress) => {
      job.progress(progress);
    },
    onSuccess: (result) => {
      job.moveToCompleted(result);
    },
    onError: (error) => {
      job.moveToFailed(error);
    },
    team_id: job.data.team_id,
  })) as { success: boolean; message: string; docs: CrawlResult[] };
}

export async function runWebScraper({
  url,
  mode,
  crawlerOptions,
  inProgress,
  onSuccess,
  onError,
  team_id,
}: {
  url: string;
  mode: "crawl" | "single_urls" | "sitemap";
  crawlerOptions: any;
  inProgress: (progress: any) => void;
  onSuccess: (result: any) => void;
  onError: (error: any) => void;
  team_id: string;
}): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
  try {
    const provider = new WebScraperDataProvider();

    if (mode === "crawl") {
      await provider.setOptions({
        mode: mode,
        urls: [url],
        crawlerOptions: crawlerOptions,
      });
    } else {
      await provider.setOptions({
        mode: mode,
        urls: url.split(","),
        crawlerOptions: crawlerOptions,
      });
    }
    const docs = (await provider.getDocuments(false, (progress: Progress) => {
      inProgress(progress);
    })) as CrawlResult[];

    if (docs.length === 0) {
      return {
        success: true,
        message: "No pages found",
        docs: [],
      };
    }

    // remove docs with empty content
    const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
    onSuccess(filteredDocs);

    const { success, credit_usage } = await billTeam(
      team_id,
      filteredDocs.length
    );
    if (!success) {
      // throw new Error("Failed to bill team, no subscription was found");
      return {
        success: false,
        message: "Failed to bill team, no subscription was found",
        docs: [],
      };
    }

    return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
  } catch (error) {
    console.error("Error running web scraper", error);
    onError(error);
    return { success: false, message: error.message, docs: [] };
  }
}
apps/api/src/scraper/WebScraper/crawler.ts (new file, 295 lines)
@@ -0,0 +1,295 @@
import axios from "axios";
import cheerio, { load } from "cheerio";
import { URL } from "url";
import { getLinksFromSitemap } from "./sitemap";
import async from "async";
import { Progress } from "../../lib/entities";
import { scrapWithScrapingBee } from "./single_url";
import robotsParser from "robots-parser";

export class WebCrawler {
  private initialUrl: string;
  private baseUrl: string;
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private visited: Set<string> = new Set();
  private crawledUrls: Set<string> = new Set();
  private limit: number;
  private robotsTxtUrl: string;
  private robots: any;

  constructor({
    initialUrl,
    includes,
    excludes,
    maxCrawledLinks,
    limit = 10000,
  }: {
    initialUrl: string;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  }) {
    this.initialUrl = initialUrl;
    this.baseUrl = new URL(initialUrl).origin;
    this.includes = includes ?? [];
    this.excludes = excludes ?? [];
    this.limit = limit;
    this.robotsTxtUrl = `${this.baseUrl}/robots.txt`;
    this.robots = robotsParser(this.robotsTxtUrl, "");
    // Deprecated, use limit instead
    this.maxCrawledLinks = maxCrawledLinks ?? limit;
  }

  private filterLinks(sitemapLinks: string[], limit: number): string[] {
    return sitemapLinks
      .filter((link) => {
        const url = new URL(link);
        const path = url.pathname;

        // Check if the link should be excluded
        if (this.excludes.length > 0 && this.excludes[0] !== "") {
          if (
            this.excludes.some((excludePattern) =>
              new RegExp(excludePattern).test(path)
            )
          ) {
            return false;
          }
        }

        // Check if the link matches the include patterns, if any are specified
        if (this.includes.length > 0 && this.includes[0] !== "") {
          return this.includes.some((includePattern) =>
            new RegExp(includePattern).test(path)
          );
        }

        const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
        // Check if the link is disallowed by robots.txt
        if (!isAllowed) {
          console.log(`Link disallowed by robots.txt: ${link}`);
          return false;
        }

        return true;
      })
      .slice(0, limit);
  }

  public async start(
    inProgress?: (progress: Progress) => void,
    concurrencyLimit: number = 5,
    limit: number = 10000
  ): Promise<string[]> {
    // Fetch and parse robots.txt
    try {
      const response = await axios.get(this.robotsTxtUrl);
      this.robots = robotsParser(this.robotsTxtUrl, response.data);
    } catch (error) {
      console.error(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
    }

    const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
    if (sitemapLinks.length > 0) {
      const filteredLinks = this.filterLinks(sitemapLinks, limit);
      return filteredLinks;
    }

    const urls = await this.crawlUrls(
      [this.initialUrl],
      concurrencyLimit,
      inProgress
    );
    if (
      urls.length === 0 &&
      this.filterLinks([this.initialUrl], limit).length > 0
    ) {
      return [this.initialUrl];
    }

    // make sure to run include/exclude filtering here again
    return this.filterLinks(urls, limit);
  }

  private async crawlUrls(
    urls: string[],
    concurrencyLimit: number,
    inProgress?: (progress: Progress) => void
  ): Promise<string[]> {
    const queue = async.queue(async (task: string, callback) => {
      if (this.crawledUrls.size >= this.maxCrawledLinks) {
        if (callback && typeof callback === "function") {
          callback();
        }
        return;
      }
      const newUrls = await this.crawl(task);
      newUrls.forEach((url) => this.crawledUrls.add(url));
      if (inProgress && newUrls.length > 0) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: newUrls[newUrls.length - 1],
        });
      } else if (inProgress) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: task,
        });
      }
      await this.crawlUrls(newUrls, concurrencyLimit, inProgress);
      if (callback && typeof callback === "function") {
        callback();
      }
    }, concurrencyLimit);

    queue.push(
      urls.filter(
        (url) =>
          !this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
      ),
      (err) => {
        if (err) console.error(err);
      }
    );
    await queue.drain();
    return Array.from(this.crawledUrls);
  }

  async crawl(url: string): Promise<string[]> {
    if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent"))
      return [];
    this.visited.add(url);
    if (!url.startsWith("http")) {
      url = "https://" + url;
    }
    if (url.endsWith("/")) {
      url = url.slice(0, -1);
    }
    if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
      return [];
    }

    try {
      let content;
      // If it is the first link, fetch with ScrapingBee
      if (this.visited.size === 1) {
        content = await scrapWithScrapingBee(url, "load");
      } else {
        const response = await axios.get(url);
        content = response.data;
      }
      const $ = load(content);
      let links: string[] = [];

      $("a").each((_, element) => {
        const href = $(element).attr("href");
        if (href) {
          let fullUrl = href;
          if (!href.startsWith("http")) {
            fullUrl = new URL(href, this.baseUrl).toString();
          }
          const url = new URL(fullUrl);
          const path = url.pathname;

          if (
            // fullUrl.startsWith(this.initialUrl) && // this condition makes it stop crawling back the url
            this.isInternalLink(fullUrl) &&
            this.matchesPattern(fullUrl) &&
            this.noSections(fullUrl) &&
            this.matchesIncludes(path) &&
            !this.matchesExcludes(path) &&
            this.robots.isAllowed(fullUrl, "FireCrawlAgent")
          ) {
            links.push(fullUrl);
          }
        }
      });

      return links.filter((link) => !this.visited.has(link));
    } catch (error) {
      return [];
    }
  }

  private matchesIncludes(url: string): boolean {
    if (this.includes.length === 0 || this.includes[0] == "") return true;
    return this.includes.some((pattern) => new RegExp(pattern).test(url));
  }

  private matchesExcludes(url: string): boolean {
    if (this.excludes.length === 0 || this.excludes[0] == "") return false;
    return this.excludes.some((pattern) => new RegExp(pattern).test(url));
  }

  private noSections(link: string): boolean {
    return !link.includes("#");
  }

  private isInternalLink(link: string): boolean {
    const urlObj = new URL(link, this.baseUrl);
    const domainWithoutProtocol = this.baseUrl.replace(/^https?:\/\//, "");
    return urlObj.hostname === domainWithoutProtocol;
  }

  private matchesPattern(link: string): boolean {
    return true; // Placeholder for future pattern matching implementation
  }

  private isFile(url: string): boolean {
    const fileExtensions = [
      ".png",
      ".jpg",
      ".jpeg",
      ".gif",
      ".css",
      ".js",
      ".ico",
      ".svg",
      ".pdf",
      ".zip",
      ".exe",
      ".dmg",
      ".mp4",
      ".mp3",
      ".pptx",
      ".docx",
      ".xlsx",
      ".xml",
    ];
    return fileExtensions.some((ext) => url.endsWith(ext));
  }

  private isSocialMediaOrEmail(url: string): boolean {
    const socialMediaOrEmail = [
      "facebook.com",
      "twitter.com",
      "linkedin.com",
      "instagram.com",
      "pinterest.com",
      "mailto:",
    ];
    return socialMediaOrEmail.some((ext) => url.includes(ext));
  }

  private async tryFetchSitemapLinks(url: string): Promise<string[]> {
    const sitemapUrl = url.endsWith("/sitemap.xml")
      ? url
      : `${url}/sitemap.xml`;
    try {
      const response = await axios.get(sitemapUrl);
      if (response.status === 200) {
        return await getLinksFromSitemap(sitemapUrl);
      }
    } catch (error) {
      // Sitemap fetch failed; fall through to regular crawling.
    }
    return [];
  }
}
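The include/exclude logic that `filterLinks`, `matchesIncludes`, and `matchesExcludes` share can be sketched standalone: exclude patterns are regexes tested against the URL path and win first; if include patterns are given, at least one must match; a leading empty-string pattern disables the list. A self-contained sketch with a hypothetical helper name:

```typescript
// Hypothetical standalone version of the crawler's path filtering.
function pathAllowed(
  path: string,
  includes: string[],
  excludes: string[]
): boolean {
  // Excludes are checked first and reject the path outright.
  if (excludes.length > 0 && excludes[0] !== "") {
    if (excludes.some((p) => new RegExp(p).test(path))) {
      return false;
    }
  }
  // If includes are specified, at least one must match.
  if (includes.length > 0 && includes[0] !== "") {
    return includes.some((p) => new RegExp(p).test(path));
  }
  return true;
}
```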
apps/api/src/scraper/WebScraper/index.ts (new file, 287 lines)
@@ -0,0 +1,287 @@
import { Document } from "../../lib/entities";
import { Progress } from "../../lib/entities";
import { scrapSingleUrl } from "./single_url";
import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
import { WebCrawler } from "./crawler";
import { getValue, setValue } from "../../services/redis";

export type WebScraperOptions = {
  urls: string[];
  mode: "single_urls" | "sitemap" | "crawl";
  crawlerOptions?: {
    returnOnlyUrls?: boolean;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  };
  concurrentRequests?: number;
};

export class WebScraperDataProvider {
  private urls: string[] = [""];
  private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private returnOnlyUrls: boolean;
  private limit: number = 10000;
  private concurrentRequests: number = 20;

  authorize(): void {
    throw new Error("Method not implemented.");
  }

  authorizeNango(): Promise<void> {
    throw new Error("Method not implemented.");
  }

  private async convertUrlsToDocuments(
    urls: string[],
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    const totalUrls = urls.length;
    let processedUrls = 0;
    console.log("Converting urls to documents");
    console.log("Total urls", urls);
    const results: (Document | null)[] = new Array(urls.length).fill(null);
    for (let i = 0; i < urls.length; i += this.concurrentRequests) {
      const batchUrls = urls.slice(i, i + this.concurrentRequests);
      await Promise.all(
        batchUrls.map(async (url, index) => {
          const result = await scrapSingleUrl(url, true);
          processedUrls++;
          if (inProgress) {
            inProgress({
              current: processedUrls,
              total: totalUrls,
              status: "SCRAPING",
              currentDocumentUrl: url,
            });
          }
          results[i + index] = result;
        })
      );
    }
    return results.filter((result) => result !== null) as Document[];
  }

  async getDocuments(
    useCaching: boolean = false,
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    if (this.urls[0].trim() === "") {
      throw new Error("Url is required");
    }

    if (!useCaching) {
      if (this.mode === "crawl") {
        const crawler = new WebCrawler({
          initialUrl: this.urls[0],
          includes: this.includes,
          excludes: this.excludes,
          maxCrawledLinks: this.maxCrawledLinks,
          limit: this.limit,
        });
        const links = await crawler.start(inProgress, 5, this.limit);
        if (this.returnOnlyUrls) {
          return links.map((url) => ({
            content: "",
            metadata: { sourceURL: url },
            provider: "web",
            type: "text",
          }));
        }
        let documents = await this.convertUrlsToDocuments(links, inProgress);
        documents = await this.getSitemapData(this.urls[0], documents);
        console.log("documents", documents);

        // CACHING DOCUMENTS
        // - parent document
        const cachedParentDocumentString = await getValue(
          "web-scraper-cache:" + this.normalizeUrl(this.urls[0])
        );
        if (cachedParentDocumentString != null) {
          let cachedParentDocument = JSON.parse(cachedParentDocumentString);
          if (
            !cachedParentDocument.childrenLinks ||
            cachedParentDocument.childrenLinks.length < links.length - 1
          ) {
            cachedParentDocument.childrenLinks = links.filter(
              (link) => link !== this.urls[0]
            );
            await setValue(
              "web-scraper-cache:" + this.normalizeUrl(this.urls[0]),
              JSON.stringify(cachedParentDocument),
              60 * 60 * 24 * 10
            ); // 10 days
          }
        } else {
          let parentDocument = documents.filter(
            (document) =>
              this.normalizeUrl(document.metadata.sourceURL) ===
              this.normalizeUrl(this.urls[0])
          );
          await this.setCachedDocuments(parentDocument, links);
        }

        await this.setCachedDocuments(
          documents.filter(
            (document) =>
              this.normalizeUrl(document.metadata.sourceURL) !==
              this.normalizeUrl(this.urls[0])
          ),
          []
        );
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }

      if (this.mode === "single_urls") {
        let documents = await this.convertUrlsToDocuments(this.urls, inProgress);

        const baseUrl = new URL(this.urls[0]).origin;
        documents = await this.getSitemapData(baseUrl, documents);

        await this.setCachedDocuments(documents);
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }
      if (this.mode === "sitemap") {
        const links = await getLinksFromSitemap(this.urls[0]);
        let documents = await this.convertUrlsToDocuments(
          links.slice(0, this.limit),
          inProgress
        );

        documents = await this.getSitemapData(this.urls[0], documents);

        await this.setCachedDocuments(documents);
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }

      return [];
    }

    let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
    if (documents.length < this.limit) {
      const newDocuments: Document[] = await this.getDocuments(false, inProgress);
      newDocuments.forEach((doc) => {
        if (
          !documents.some(
            (d) =>
              this.normalizeUrl(d.metadata.sourceURL) ===
              this.normalizeUrl(doc.metadata?.sourceURL)
          )
        ) {
          documents.push(doc);
        }
      });
    }
    documents = this.filterDocsExcludeInclude(documents);
    documents = this.removeChildLinks(documents);
    documents = documents.splice(0, this.limit);
    return documents;
  }

  private filterDocsExcludeInclude(documents: Document[]): Document[] {
    return documents.filter((document) => {
      const url = new URL(document.metadata.sourceURL);
      const path = url.pathname;

      if (this.excludes.length > 0 && this.excludes[0] !== "") {
        // Check if the link should be excluded
        if (
          this.excludes.some((excludePattern) =>
            new RegExp(excludePattern).test(path)
          )
        ) {
          return false;
        }
      }

      if (this.includes.length > 0 && this.includes[0] !== "") {
        // Check if the link matches the include patterns, if any are specified
        return this.includes.some((includePattern) =>
          new RegExp(includePattern).test(path)
        );
      }
      return true;
    });
  }

  private normalizeUrl(url: string): string {
    if (url.includes("//www.")) {
      return url.replace("//www.", "//");
    }
    return url;
  }

  private removeChildLinks(documents: Document[]): Document[] {
    for (let document of documents) {
      if (document?.childrenLinks) delete document.childrenLinks;
    }
    return documents;
  }

  async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
    for (const document of documents) {
      if (document.content.trim().length === 0) {
        continue;
      }
      const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
      await setValue(
        "web-scraper-cache:" + normalizedUrl,
        JSON.stringify({
          ...document,
          childrenLinks: childrenLinks || [],
        }),
        60 * 60 * 24 * 10
      ); // 10 days
    }
  }

  async getCachedDocuments(urls: string[]): Promise<Document[]> {
    let documents: Document[] = [];
    for (const url of urls) {
      const normalizedUrl = this.normalizeUrl(url);
      console.log("Getting cached document for web-scraper-cache:" + normalizedUrl);
      const cachedDocumentString = await getValue("web-scraper-cache:" + normalizedUrl);
      if (cachedDocumentString) {
        const cachedDocument = JSON.parse(cachedDocumentString);
        documents.push(cachedDocument);

        // get children documents
        for (const childUrl of cachedDocument.childrenLinks) {
          const normalizedChildUrl = this.normalizeUrl(childUrl);
          const childCachedDocumentString = await getValue(
            "web-scraper-cache:" + normalizedChildUrl
          );
          if (childCachedDocumentString) {
            const childCachedDocument = JSON.parse(childCachedDocumentString);
            if (
              !documents.find(
                (doc) =>
                  doc.metadata.sourceURL ===
                  childCachedDocument.metadata.sourceURL
              )
            ) {
              documents.push(childCachedDocument);
            }
          }
        }
      }
    }
    return documents;
  }

  setOptions(options: WebScraperOptions): void {
    if (!options.urls) {
      throw new Error("Urls are required");
    }

    console.log("options", options.crawlerOptions?.excludes);
    this.urls = options.urls;
    this.mode = options.mode;
    this.concurrentRequests = options.concurrentRequests ?? 20;
    this.includes = options.crawlerOptions?.includes ?? [];
    this.excludes = options.crawlerOptions?.excludes ?? [];
    this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
    this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
    this.limit = options.crawlerOptions?.limit ?? 10000;

    //! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find the source of the issue, so adding this check
    this.excludes = this.excludes.filter((item) => item !== "");

    // make sure all urls start with https://
    this.urls = this.urls.map((url) => {
      if (!url.trim().startsWith("http")) {
        return `https://${url}`;
      }
      return url;
    });
  }

  private async getSitemapData(baseUrl: string, documents: Document[]) {
    const sitemapData = await fetchSitemapData(baseUrl);
    if (sitemapData) {
      for (let i = 0; i < documents.length; i++) {
        const docInSitemapData = sitemapData.find(
          (data) =>
            this.normalizeUrl(data.loc) ===
            this.normalizeUrl(documents[i].metadata.sourceURL)
        );
        if (docInSitemapData) {
          let sitemapDocData: Partial<SitemapEntry> = {};
          if (docInSitemapData.changefreq) {
            sitemapDocData.changefreq = docInSitemapData.changefreq;
          }
          if (docInSitemapData.priority) {
            sitemapDocData.priority = Number(docInSitemapData.priority);
          }
          if (docInSitemapData.lastmod) {
            sitemapDocData.lastmod = docInSitemapData.lastmod;
          }
          if (Object.keys(sitemapDocData).length !== 0) {
|
||||
documents[i].metadata.sitemap = sitemapDocData;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return documents;
|
||||
}
|
||||
}
|
||||
|
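The caching scheme above keys Redis entries by a normalized URL, so `https://www.example.com` and `https://example.com` share a single cache entry. A minimal standalone sketch of that normalization and key construction (the `cacheKey` helper name is illustrative, not from the codebase):

```typescript
// Mirrors WebScraperDataProvider.normalizeUrl: strip the "www." subdomain.
function normalizeUrl(url: string): string {
  if (url.includes("//www.")) {
    return url.replace("//www.", "//");
  }
  return url;
}

// Hypothetical helper: builds the Redis key used by set/getCachedDocuments.
function cacheKey(url: string): string {
  return "web-scraper-cache:" + normalizeUrl(url);
}

console.log(cacheKey("https://www.example.com/docs"));
// web-scraper-cache:https://example.com/docs
```

Because the normalization is applied on both write (`setCachedDocuments`) and read (`getCachedDocuments`), the two URL forms always resolve to the same key.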
145  apps/api/src/scraper/WebScraper/single_url.ts  Normal file

@@ -0,0 +1,145 @@
import * as cheerio from "cheerio";
import { ScrapingBeeClient } from "scrapingbee";
import { attemptScrapWithRequests, sanitizeText } from "./utils/utils";
import { extractMetadata } from "./utils/metadata";
import dotenv from "dotenv";
import { Document } from "../../lib/entities";
import { parseMarkdown } from "../../lib/html-to-markdown";
// import puppeteer from "puppeteer";

dotenv.config();

export async function scrapWithScrapingBee(url: string, wait_browser: string = "domcontentloaded"): Promise<string> {
  try {
    const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
    const response = await client.get({
      url: url,
      params: { timeout: 15000, wait_browser: wait_browser },
      headers: { "ScrapingService-Request": "TRUE" },
    });

    if (response.status !== 200 && response.status !== 404) {
      console.error(
        `Scraping bee error in ${url} with status code ${response.status}`
      );
      return "";
    }
    const decoder = new TextDecoder();
    const text = decoder.decode(response.data);
    return text;
  } catch (error) {
    console.error(`Error scraping with Scraping Bee: ${error}`);
    return "";
  }
}

export async function scrapWithPlaywright(url: string): Promise<string> {
  try {
    const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: url }),
    });

    if (!response.ok) {
      console.error(`Error fetching w/ playwright server -> URL: ${url} with status: ${response.status}`);
      return "";
    }

    const data = await response.json();
    const html = data.content;
    return html ?? "";
  } catch (error) {
    console.error(`Error scraping with Playwright: ${error}`);
    return "";
  }
}

export async function scrapSingleUrl(
  urlToScrap: string,
  toMarkdown: boolean = true
): Promise<Document> {
  console.log(`Scraping URL: ${urlToScrap}`);
  urlToScrap = urlToScrap.trim();

  const removeUnwantedElements = (html: string) => {
    const soup = cheerio.load(html);
    soup("script, style, iframe, noscript, meta, head").remove();
    return soup.html();
  };

  const attemptScraping = async (
    url: string,
    method: "scrapingBee" | "playwright" | "scrapingBeeLoad" | "fetch"
  ) => {
    let text = "";
    switch (method) {
      case "scrapingBee":
        if (process.env.SCRAPING_BEE_API_KEY) {
          text = await scrapWithScrapingBee(url);
        }
        break;
      case "playwright":
        if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
          text = await scrapWithPlaywright(url);
        }
        break;
      case "scrapingBeeLoad":
        if (process.env.SCRAPING_BEE_API_KEY) {
          text = await scrapWithScrapingBee(url, "networkidle2");
        }
        break;
      case "fetch":
        try {
          const response = await fetch(url);
          if (!response.ok) {
            console.error(`Error fetching URL: ${url} with status: ${response.status}`);
            // Return a tuple so the destructuring at the call site stays valid.
            return ["", ""];
          }
          text = await response.text();
        } catch (error) {
          console.error(`Error scraping URL: ${error}`);
          return ["", ""];
        }
        break;
    }
    const cleanedHtml = removeUnwantedElements(text);
    return [await parseMarkdown(cleanedHtml), text];
  };

  try {
    let [text, html] = await attemptScraping(urlToScrap, "scrapingBee");
    if (!text || text.length < 100) {
      console.log("Falling back to playwright");
      [text, html] = await attemptScraping(urlToScrap, "playwright");
    }

    if (!text || text.length < 100) {
      console.log("Falling back to scraping bee load");
      [text, html] = await attemptScraping(urlToScrap, "scrapingBeeLoad");
    }
    if (!text || text.length < 100) {
      console.log("Falling back to fetch");
      [text, html] = await attemptScraping(urlToScrap, "fetch");
    }

    const soup = cheerio.load(html);
    const metadata = extractMetadata(soup, urlToScrap);

    return {
      content: text,
      markdown: text,
      metadata: { ...metadata, sourceURL: urlToScrap },
    } as Document;
  } catch (error) {
    console.error(`Error: ${error} - Failed to fetch URL: ${urlToScrap}`);
    return {
      content: "",
      markdown: "",
      metadata: { sourceURL: urlToScrap },
    } as Document;
  }
}
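`scrapSingleUrl` above tries scrapers in a fixed order (ScrapingBee, Playwright, ScrapingBee with full load, plain fetch) and falls through whenever the result is missing or shorter than 100 characters. The same control flow can be expressed generically; `firstAcceptable` and the stub scrapers below are illustrative names, not part of the codebase:

```typescript
type Scraper = (url: string) => Promise<string>;

// Try each scraper in order; accept the first result of at least minLength characters.
async function firstAcceptable(
  url: string,
  scrapers: Scraper[],
  minLength = 100
): Promise<string> {
  let text = "";
  for (const scrape of scrapers) {
    text = await scrape(url);
    if (text && text.length >= minLength) {
      return text;
    }
  }
  return text; // best effort: return the last attempt even if short
}

// Stub scrapers standing in for the scrapingBee / playwright / fetch strategies.
const empty: Scraper = async () => "";
const short: Scraper = async () => "<p>tiny</p>";
const full: Scraper = async () => "x".repeat(150);
```

The 100-character threshold is a heuristic for "this scraper returned an error page or a bot-block shell rather than real content".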
74  apps/api/src/scraper/WebScraper/sitemap.ts  Normal file

@@ -0,0 +1,74 @@
import axios from "axios";
import { parseStringPromise } from "xml2js";

export async function getLinksFromSitemap(
  sitemapUrl: string,
  allUrls: string[] = []
): Promise<string[]> {
  try {
    let content: string;
    try {
      const response = await axios.get(sitemapUrl);
      content = response.data;
    } catch (error) {
      console.error(`Request failed for ${sitemapUrl}: ${error}`);
      return allUrls;
    }

    const parsed = await parseStringPromise(content);
    const root = parsed.urlset || parsed.sitemapindex;

    if (root && root.sitemap) {
      for (const sitemap of root.sitemap) {
        if (sitemap.loc && sitemap.loc.length > 0) {
          await getLinksFromSitemap(sitemap.loc[0], allUrls);
        }
      }
    } else if (root && root.url) {
      for (const url of root.url) {
        if (url.loc && url.loc.length > 0) {
          allUrls.push(url.loc[0]);
        }
      }
    }
  } catch (error) {
    console.error(`Error processing ${sitemapUrl}: ${error}`);
  }

  return allUrls;
}

export const fetchSitemapData = async (url: string): Promise<SitemapEntry[] | null> => {
  const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
  try {
    const response = await axios.get(sitemapUrl);
    if (response.status === 200) {
      const xml = response.data;
      const parsedXml = await parseStringPromise(xml);

      const sitemapData: SitemapEntry[] = [];
      if (parsedXml.urlset && parsedXml.urlset.url) {
        for (const urlElement of parsedXml.urlset.url) {
          const sitemapEntry: SitemapEntry = { loc: urlElement.loc[0] };
          if (urlElement.lastmod) sitemapEntry.lastmod = urlElement.lastmod[0];
          if (urlElement.changefreq) sitemapEntry.changefreq = urlElement.changefreq[0];
          if (urlElement.priority) sitemapEntry.priority = Number(urlElement.priority[0]);
          sitemapData.push(sitemapEntry);
        }
      }

      return sitemapData;
    }
    return null;
  } catch (error) {
    // Sitemap fetch failed; fall through and return an empty list.
  }
  return [];
};

export interface SitemapEntry {
  loc: string;
  lastmod?: string;
  changefreq?: string;
  priority?: number;
}
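`getLinksFromSitemap` above branches on the parsed XML shape: a `sitemapindex` recurses into child sitemaps, while a `urlset` yields page URLs directly. The collection step can be sketched over already-parsed objects (the `collectLocs` name is illustrative; the fixture shapes follow xml2js output, where each child element is parsed into a one-element array):

```typescript
// xml2js parses <loc>…</loc> into a one-element string array.
interface ParsedNode { loc?: string[] }
interface ParsedSitemap {
  urlset?: { url?: ParsedNode[] };
  sitemapindex?: { sitemap?: ParsedNode[] };
}

// Collect page URLs from a parsed urlset; return child sitemap URLs separately.
function collectLocs(parsed: ParsedSitemap): { pages: string[]; children: string[] } {
  const pages: string[] = [];
  const children: string[] = [];
  if (parsed.sitemapindex && parsed.sitemapindex.sitemap) {
    for (const s of parsed.sitemapindex.sitemap) {
      if (s.loc && s.loc.length > 0) children.push(s.loc[0]);
    }
  } else if (parsed.urlset && parsed.urlset.url) {
    for (const u of parsed.urlset.url) {
      if (u.loc && u.loc.length > 0) pages.push(u.loc[0]);
    }
  }
  return { pages, children };
}
```

In the real function, each `children` entry would trigger a recursive `getLinksFromSitemap` call with the shared `allUrls` accumulator.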
109  apps/api/src/scraper/WebScraper/utils/metadata.ts  Normal file

@@ -0,0 +1,109 @@
// import * as cheerio from 'cheerio';
import { CheerioAPI } from "cheerio";

interface Metadata {
  title?: string;
  description?: string;
  language?: string;
  keywords?: string;
  robots?: string;
  ogTitle?: string;
  ogDescription?: string;
  dctermsCreated?: string;
  dcDateCreated?: string;
  dcDate?: string;
  dctermsType?: string;
  dcType?: string;
  dctermsAudience?: string;
  dctermsSubject?: string;
  dcSubject?: string;
  dcDescription?: string;
  ogImage?: string;
  dctermsKeywords?: string;
  modifiedTime?: string;
  publishedTime?: string;
  articleTag?: string;
  articleSection?: string;
}

export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
  let title: string | null = null;
  let description: string | null = null;
  let language: string | null = null;
  let keywords: string | null = null;
  let robots: string | null = null;
  let ogTitle: string | null = null;
  let ogDescription: string | null = null;
  let dctermsCreated: string | null = null;
  let dcDateCreated: string | null = null;
  let dcDate: string | null = null;
  let dctermsType: string | null = null;
  let dcType: string | null = null;
  let dctermsAudience: string | null = null;
  let dctermsSubject: string | null = null;
  let dcSubject: string | null = null;
  let dcDescription: string | null = null;
  let ogImage: string | null = null;
  let dctermsKeywords: string | null = null;
  let modifiedTime: string | null = null;
  let publishedTime: string | null = null;
  let articleTag: string | null = null;
  let articleSection: string | null = null;

  try {
    title = soup("title").text() || null;
    description = soup('meta[name="description"]').attr("content") || null;

    // Assuming the language is part of the URL as per the regex pattern
    const pattern = /([a-zA-Z]+-[A-Z]{2})/;
    const match = pattern.exec(url);
    language = match ? match[1] : null;

    keywords = soup('meta[name="keywords"]').attr("content") || null;
    robots = soup('meta[name="robots"]').attr("content") || null;
    ogTitle = soup('meta[property="og:title"]').attr("content") || null;
    ogDescription = soup('meta[property="og:description"]').attr("content") || null;
    articleSection = soup('meta[name="article:section"]').attr("content") || null;
    articleTag = soup('meta[name="article:tag"]').attr("content") || null;
    publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
    modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
    ogImage = soup('meta[property="og:image"]').attr("content") || null;
    dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
    dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
    dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
    dctermsSubject = soup('meta[name="dcterms.subject"]').attr("content") || null;
    dctermsAudience = soup('meta[name="dcterms.audience"]').attr("content") || null;
    dcType = soup('meta[name="dc.type"]').attr("content") || null;
    dctermsType = soup('meta[name="dcterms.type"]').attr("content") || null;
    dcDate = soup('meta[name="dc.date"]').attr("content") || null;
    dcDateCreated = soup('meta[name="dc.date.created"]').attr("content") || null;
    dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
  } catch (error) {
    console.error("Error extracting metadata:", error);
  }

  return {
    ...(title ? { title } : {}),
    ...(description ? { description } : {}),
    ...(language ? { language } : {}),
    ...(keywords ? { keywords } : {}),
    ...(robots ? { robots } : {}),
    ...(ogTitle ? { ogTitle } : {}),
    ...(ogDescription ? { ogDescription } : {}),
    ...(dctermsCreated ? { dctermsCreated } : {}),
    ...(dcDateCreated ? { dcDateCreated } : {}),
    ...(dcDate ? { dcDate } : {}),
    ...(dctermsType ? { dctermsType } : {}),
    ...(dcType ? { dcType } : {}),
    ...(dctermsAudience ? { dctermsAudience } : {}),
    ...(dctermsSubject ? { dctermsSubject } : {}),
    ...(dcSubject ? { dcSubject } : {}),
    ...(dcDescription ? { dcDescription } : {}),
    ...(ogImage ? { ogImage } : {}),
    ...(dctermsKeywords ? { dctermsKeywords } : {}),
    ...(modifiedTime ? { modifiedTime } : {}),
    ...(publishedTime ? { publishedTime } : {}),
    ...(articleTag ? { articleTag } : {}),
    ...(articleSection ? { articleSection } : {}),
  };
}
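`extractMetadata` above infers the page language from a locale-style segment in the URL rather than from the markup, using the pattern `/([a-zA-Z]+-[A-Z]{2})/`. A small sketch of that heuristic in isolation:

```typescript
// Matches locale-style segments such as "en-US" or "pt-BR" anywhere in the URL.
function languageFromUrl(url: string): string | null {
  const pattern = /([a-zA-Z]+-[A-Z]{2})/;
  const match = pattern.exec(url);
  return match ? match[1] : null;
}

console.log(languageFromUrl("https://example.com/en-US/docs")); // en-US
console.log(languageFromUrl("https://example.com/docs"));       // null
```

Note this is a heuristic: any letters-hyphen-two-capitals sequence in the URL matches, so a path like `/my-AB/` would also be reported as a "language".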
23  apps/api/src/scraper/WebScraper/utils/utils.ts  Normal file

@@ -0,0 +1,23 @@
import axios from "axios";

export async function attemptScrapWithRequests(
  urlToScrap: string
): Promise<string | null> {
  try {
    const response = await axios.get(urlToScrap);

    if (!response.data) {
      console.log("Failed normal requests as well");
      return null;
    }

    return response.data;
  } catch (error) {
    console.error(`Error in attemptScrapWithRequests: ${error}`);
    return null;
  }
}

export function sanitizeText(text: string): string {
  // Use a global regex so every null byte is stripped, not just the first.
  return text.replace(/\u0000/g, "");
}
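A note on `sanitizeText`: `String.prototype.replace` with a *string* pattern substitutes only the first occurrence, so stripping every null byte requires a global regex (or `replaceAll`). A quick demonstration:

```typescript
// With a string pattern, replace() substitutes only the first match.
const once = "a\u0000b\u0000c".replace("\u0000", "");
// With a global regex, every null byte is removed.
const all = "a\u0000b\u0000c".replace(/\u0000/g, "");

console.log(once.length); // 4 (one null byte remains)
console.log(all.length);  // 3
```

Null bytes matter here because PostgreSQL (and therefore Supabase text columns) rejects strings containing `\u0000`.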
219  apps/api/src/services/billing/credit_billing.ts  Normal file

@@ -0,0 +1,219 @@
import { supabase_service } from "../supabase";

const FREE_CREDITS = 100;

export async function billTeam(team_id: string, credits: number) {
  if (team_id === "preview") {
    return { success: true, message: "Preview team, no credits used" };
  }
  console.log(`Billing team ${team_id} for ${credits} credits`);
  // When the API is used, log the credit usage in the credit_usage table:
  // team_id: The ID of the team using the API.
  // subscription_id: The ID of the team's active subscription.
  // credits_used: The number of credits consumed by the API call.
  // created_at: The timestamp of the API usage.

  // 1. get the subscription
  const { data: subscription } = await supabase_service
    .from("subscriptions")
    .select("*")
    .eq("team_id", team_id)
    .eq("status", "active")
    .single();

  if (!subscription) {
    const { data: credit_usage } = await supabase_service
      .from("credit_usage")
      .insert([
        {
          team_id,
          credits_used: credits,
          created_at: new Date(),
        },
      ])
      .select();

    return { success: true, credit_usage };
  }

  // 2. add the credits to the credit_usage table
  const { data: credit_usage } = await supabase_service
    .from("credit_usage")
    .insert([
      {
        team_id,
        subscription_id: subscription.id,
        credits_used: credits,
        created_at: new Date(),
      },
    ])
    .select();

  return { success: true, credit_usage };
}

// if the team has enough credits for the operation, return success: true, else success: false
export async function checkTeamCredits(team_id: string, credits: number) {
  if (team_id === "preview") {
    return { success: true, message: "Preview team, no credits used" };
  }
  // 1. Retrieve the team's active subscription based on the team_id.
  const { data: subscription, error: subscriptionError } =
    await supabase_service
      .from("subscriptions")
      .select("id, price_id, current_period_start, current_period_end")
      .eq("team_id", team_id)
      .eq("status", "active")
      .single();

  if (subscriptionError || !subscription) {
    // No active subscription: check usage against the free-plan allowance.
    const { data: creditUsages, error: creditUsageError } =
      await supabase_service
        .from("credit_usage")
        .select("credits_used")
        .is("subscription_id", null)
        .eq("team_id", team_id);
    // .gte("created_at", subscription.current_period_start)
    // .lte("created_at", subscription.current_period_end);

    if (creditUsageError) {
      throw new Error(
        `Failed to retrieve credit usage for team_id: ${team_id}`
      );
    }

    const totalCreditsUsed = creditUsages.reduce(
      (acc, usage) => acc + usage.credits_used,
      0
    );

    console.log("totalCreditsUsed", totalCreditsUsed);
    // Compare the total credits used with the free-plan allowance.
    if (totalCreditsUsed + credits > FREE_CREDITS) {
      return {
        success: false,
        message: "Insufficient credits, please upgrade!",
      };
    }
    return { success: true, message: "Sufficient credits available" };
  }

  // 2. Get the price_id from the subscription.
  const { data: price, error: priceError } = await supabase_service
    .from("prices")
    .select("credits")
    .eq("id", subscription.price_id)
    .single();

  if (priceError) {
    throw new Error(
      `Failed to retrieve price for price_id: ${subscription.price_id}`
    );
  }

  // 3. Calculate the total credits used by the team within the current billing period.
  const { data: creditUsages, error: creditUsageError } = await supabase_service
    .from("credit_usage")
    .select("credits_used")
    .eq("subscription_id", subscription.id)
    .gte("created_at", subscription.current_period_start)
    .lte("created_at", subscription.current_period_end);

  if (creditUsageError) {
    throw new Error(
      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
    );
  }

  const totalCreditsUsed = creditUsages.reduce(
    (acc, usage) => acc + usage.credits_used,
    0
  );

  // 4. Compare the total credits used with the credits allowed by the plan.
  if (totalCreditsUsed + credits > price.credits) {
    return { success: false, message: "Insufficient credits, please upgrade!" };
  }

  return { success: true, message: "Sufficient credits available" };
}

// Count the total credits used by a team within the current billing period and return the remaining credits.
export async function countCreditsAndRemainingForCurrentBillingPeriod(
  team_id: string
) {
  // 1. Retrieve the team's active subscription based on the team_id.
  const { data: subscription, error: subscriptionError } =
    await supabase_service
      .from("subscriptions")
      .select("id, price_id, current_period_start, current_period_end")
      .eq("team_id", team_id)
      .single();

  if (subscriptionError || !subscription) {
    // throw new Error(`Failed to retrieve subscription for team_id: ${team_id}`);

    // Free plan
    const { data: creditUsages, error: creditUsageError } =
      await supabase_service
        .from("credit_usage")
        .select("credits_used")
        .is("subscription_id", null)
        .eq("team_id", team_id);
    // .gte("created_at", subscription.current_period_start)
    // .lte("created_at", subscription.current_period_end);

    if (creditUsageError || !creditUsages) {
      throw new Error(
        `Failed to retrieve credit usage for team_id: ${team_id}`
      );
    }

    const totalCreditsUsed = creditUsages.reduce(
      (acc, usage) => acc + usage.credits_used,
      0
    );

    // Calculate remaining credits on the free plan.
    const remainingCredits = FREE_CREDITS - totalCreditsUsed;

    return { totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS };
  }

  // 2. Get the price_id from the subscription to retrieve the total credits available.
  const { data: price, error: priceError } = await supabase_service
    .from("prices")
    .select("credits")
    .eq("id", subscription.price_id)
    .single();

  if (priceError || !price) {
    throw new Error(
      `Failed to retrieve price for price_id: ${subscription.price_id}`
    );
  }

  // 3. Calculate the total credits used by the team within the current billing period.
  const { data: creditUsages, error: creditUsageError } = await supabase_service
    .from("credit_usage")
    .select("credits_used")
    .eq("subscription_id", subscription.id)
    .gte("created_at", subscription.current_period_start)
    .lte("created_at", subscription.current_period_end);

  if (creditUsageError || !creditUsages) {
    throw new Error(
      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
    );
  }

  const totalCreditsUsed = creditUsages.reduce(
    (acc, usage) => acc + usage.credits_used,
    0
  );

  // 4. Calculate remaining credits.
  const remainingCredits = price.credits - totalCreditsUsed;

  return { totalCreditsUsed, remainingCredits, totalCredits: price.credits };
}
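The check in `checkTeamCredits` reduces the billing period's `credits_used` rows to a total and compares `total + requested` against the plan allowance (100 on the free plan, `price.credits` otherwise). That arithmetic, isolated from Supabase (the `hasSufficientCredits` name is illustrative):

```typescript
interface CreditUsage { credits_used: number }

// Sum usage rows and test whether the requested operation still fits the plan.
function hasSufficientCredits(
  usages: CreditUsage[],
  requested: number,
  planCredits: number
): boolean {
  const totalCreditsUsed = usages.reduce((acc, u) => acc + u.credits_used, 0);
  return totalCreditsUsed + requested <= planCredits;
}

// 40 + 50 used; a 10-credit request exactly fits a 100-credit plan.
console.log(hasSufficientCredits([{ credits_used: 40 }, { credits_used: 50 }], 10, 100)); // true
```

The `>` comparison in the source means a request that lands exactly on the allowance is still accepted; only going over it fails.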
4  apps/api/src/services/logtail.ts  Normal file

@@ -0,0 +1,4 @@
const { Logtail } = require("@logtail/node");
// dotenv
require("dotenv").config();
export const logtail = new Logtail(process.env.LOGTAIL_KEY);
17  apps/api/src/services/queue-jobs.ts  Normal file

@@ -0,0 +1,17 @@
import { Job, Queue } from "bull";
import {
  getWebScraperQueue,
} from "./queue-service";
import { v4 as uuidv4 } from "uuid";
import { WebScraperOptions } from "../types";

export async function addWebScraperJob(
  webScraperOptions: WebScraperOptions,
  options: any = {}
): Promise<Job> {
  return await getWebScraperQueue().add(webScraperOptions, {
    ...options,
    jobId: uuidv4(),
  });
}
16  apps/api/src/services/queue-service.ts  Normal file

@@ -0,0 +1,16 @@
import Queue from "bull";

let webScraperQueue;

export function getWebScraperQueue() {
  if (!webScraperQueue) {
    webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
      settings: {
        lockDuration: 4 * 60 * 60 * 1000, // 4 hours in milliseconds
        lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
      },
    });
    console.log("Web scraper queue created");
  }
  return webScraperQueue;
}
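`getWebScraperQueue` is a lazy singleton: the Bull queue is constructed on the first call and reused afterwards, so the API and worker code share one queue object per process. The pattern in general form (names illustrative):

```typescript
// Generic lazy singleton: build once on first access, then reuse.
function lazy<T>(factory: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) {
      instance = factory();
    }
    return instance;
  };
}

let builds = 0;
const getConfig = lazy(() => {
  builds++;
  return { redisUrl: "redis://localhost:6379" }; // placeholder value
});
```

Note this guards against repeated construction within one process, not across processes; each worker process still creates its own queue connection.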
62  apps/api/src/services/queue-worker.ts  Normal file

@@ -0,0 +1,62 @@
import { CustomError } from "../lib/custom-error";
import { getWebScraperQueue } from "./queue-service";
import "dotenv/config";
import { logtail } from "./logtail";
import { startWebScraperPipeline } from "../main/runWebScraper";
import { WebScraperDataProvider } from "../scraper/WebScraper";
import { callWebhook } from "./webhook";

getWebScraperQueue().process(
  Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
  async function (job, done) {
    try {
      job.progress({
        current: 1,
        total: 100,
        current_step: "SCRAPING",
        current_url: "",
      });
      const { success, message, docs } = await startWebScraperPipeline({ job });

      const data = {
        success: success,
        result: {
          links: docs.map((doc) => {
            return { content: doc, source: doc.metadata.sourceURL };
          }),
        },
        project_id: job.data.project_id,
        error: message /* etc... */,
      };

      await callWebhook(job.data.team_id, data);
      done(null, data);
    } catch (error) {
      if (error instanceof CustomError) {
        // Handle the error, then save the failed job
        console.error(error.message); // or any other error handling

        logtail.error("Custom error while ingesting", {
          job_id: job.id,
          error: error.message,
          dataIngestionJob: error.dataIngestionJob,
        });
      }
      console.log(error);

      logtail.error("Overall error ingesting", {
        job_id: job.id,
        error: error.message,
      });

      const data = {
        success: false,
        project_id: job.data.project_id,
        error:
          "Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
      };
      await callWebhook(job.data.team_id, data);
      done(null, data);
    }
  }
);
65  apps/api/src/services/rate-limiter.ts  Normal file

@@ -0,0 +1,65 @@
import { RateLimiterRedis } from "rate-limiter-flexible";
import * as redis from "redis";

const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
const MAX_CRAWLS_PER_MINUTE_STANDARD = 4;
const MAX_CRAWLS_PER_MINUTE_SCALE = 20;

const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;

export const redisClient = redis.createClient({
  url: process.env.REDIS_URL,
  legacyMode: true,
});

export const previewRateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: "middleware",
  points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
  duration: 60, // Duration in seconds
});

export const serverRateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: "middleware",
  points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
  duration: 60, // Duration in seconds
});

export function crawlRateLimit(plan: string) {
  if (plan === "standard") {
    return new RateLimiterRedis({
      storeClient: redisClient,
      keyPrefix: "middleware",
      points: MAX_CRAWLS_PER_MINUTE_STANDARD,
      duration: 60, // Duration in seconds
    });
  } else if (plan === "scale") {
    return new RateLimiterRedis({
      storeClient: redisClient,
      keyPrefix: "middleware",
      points: MAX_CRAWLS_PER_MINUTE_SCALE,
      duration: 60, // Duration in seconds
    });
  }
  return new RateLimiterRedis({
    storeClient: redisClient,
    keyPrefix: "middleware",
    points: MAX_CRAWLS_PER_MINUTE_STARTER,
    duration: 60, // Duration in seconds
  });
}

export function getRateLimiter(preview: boolean) {
  if (preview) {
    return previewRateLimiter;
  } else {
    return serverRateLimiter;
  }
}
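`crawlRateLimit` above maps a plan name to a points-per-60-second budget, defaulting to the starter tier for any unrecognized plan. The mapping on its own (the `crawlPointsForPlan` name is illustrative; the constants are copied from the file):

```typescript
// Crawls allowed per 60-second window, per plan (values from rate-limiter.ts).
function crawlPointsForPlan(plan: string): number {
  switch (plan) {
    case "standard":
      return 4;
    case "scale":
      return 20;
    default:
      return 2; // starter tier is the fallback
  }
}
```

Keeping the default branch at the *lowest* tier is the safe choice: a typo in a plan name throttles a customer rather than granting them the scale budget.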
38  apps/api/src/services/redis.ts  Normal file

@@ -0,0 +1,38 @@
import Redis from 'ioredis';

// Initialize Redis client
const redis = new Redis(process.env.REDIS_URL);

/**
 * Set a value in Redis with an optional expiration time.
 * @param {string} key The key under which to store the value.
 * @param {string} value The value to store.
 * @param {number} [expire] Optional expiration time in seconds.
 */
const setValue = async (key: string, value: string, expire?: number) => {
  if (expire) {
    await redis.set(key, value, 'EX', expire);
  } else {
    await redis.set(key, value);
  }
};

/**
 * Get a value from Redis.
 * @param {string} key The key of the value to retrieve.
 * @returns {Promise<string|null>} The value, if found, otherwise null.
 */
const getValue = async (key: string): Promise<string | null> => {
  const value = await redis.get(key);
  return value;
};

/**
 * Delete a key from Redis.
 * @param {string} key The key to delete.
 */
const deleteKey = async (key: string) => {
  await redis.del(key);
};

export { setValue, getValue, deleteKey };
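`setValue` forwards an optional expiry to Redis via `SET key value EX seconds`, which is how the 10-day document cache above ages out. The caller-visible semantics can be sketched with an in-memory stand-in (illustrative only, not a Redis replacement):

```typescript
// Minimal in-memory stand-in for setValue/getValue with second-granularity TTL.
const store = new Map<string, { value: string; expiresAt: number | null }>();

function setValueMem(key: string, value: string, expire?: number): void {
  const expiresAt = expire ? Date.now() + expire * 1000 : null;
  store.set(key, { value, expiresAt });
}

function getValueMem(key: string): string | null {
  const entry = store.get(key);
  if (!entry) return null;
  if (entry.expiresAt !== null && Date.now() >= entry.expiresAt) {
    store.delete(key); // expired entries read as missing, as with Redis EX
    return null;
  }
  return entry.value;
}
```

From the caller's point of view an expired key is indistinguishable from one that was never set, which is why `getCachedDocuments` needs no special "stale" handling.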
6  apps/api/src/services/supabase.ts  Normal file

@@ -0,0 +1,6 @@
import { createClient } from "@supabase/supabase-js";

export const supabase_service = createClient<any>(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_TOKEN,
);
41  apps/api/src/services/webhook.ts  Normal file

@@ -0,0 +1,41 @@
import { supabase_service } from "./supabase";
|
||||
|
||||
export const callWebhook = async (teamId: string, data: any) => {
|
||||
const { data: webhooksData, error } = await supabase_service
|
||||
.from('webhooks')
|
||||
.select('url')
|
||||
.eq('team_id', teamId)
|
||||
.limit(1);
|
||||
|
||||
if (error) {
|
||||
console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
|
||||
return null;
|
||||
}
|
||||
|
||||
if (!webhooksData || webhooksData.length === 0) {
|
||||
return null;
|
||||
}
|
||||
|
||||
let dataToSend = [];
|
||||
if (data.result.links && data.result.links.length !== 0) {
|
||||
for (let i = 0; i < data.result.links.length; i++) {
|
||||
dataToSend.push({
|
||||
content: data.result.links[i].content.content,
|
||||
markdown: data.result.links[i].content.markdown,
|
||||
metadata: data.result.links[i].content.metadata,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
await fetch(webhooksData[0].url, {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json',
|
||||
},
|
||||
body: JSON.stringify({
|
||||
success: data.success,
|
||||
data: dataToSend,
|
||||
error: data.error || undefined,
|
||||
}),
|
||||
});
|
||||
}
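The body `callWebhook` POSTs can be viewed as a pure transformation of the crawl result. The helper below is a hypothetical Python illustration (not part of the repo) that mirrors the TypeScript logic: flatten `result.links` into `{content, markdown, metadata}` entries and wrap them in a `{success, data, error}` envelope.

```python
# Hypothetical illustration of the payload callWebhook sends, mirroring
# the TypeScript above. Not part of the repository.
def build_webhook_payload(data: dict) -> dict:
    links = (data.get("result") or {}).get("links") or []
    data_to_send = [
        {
            "content": link["content"]["content"],
            "markdown": link["content"]["markdown"],
            "metadata": link["content"]["metadata"],
        }
        for link in links
    ]
    return {
        "success": data.get("success"),
        "data": data_to_send,
        "error": data.get("error"),  # the TS version omits this key when falsy
    }
```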
apps/api/src/strings.ts (Normal file, 2 lines)
@@ -0,0 +1,2 @@
export const errorNoResults =
  "No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";

apps/api/src/supabase_types.ts (Normal file, 1514 lines)
File diff suppressed because it is too large. Load Diff
apps/api/src/types.ts (Normal file, 26 lines)
@@ -0,0 +1,26 @@
export interface CrawlResult {
  source: string;
  content: string;
  options?: {
    summarize?: boolean;
    summarize_max_chars?: number;
  };
  metadata?: any;
  raw_context_id?: number | string;
  permissions?: any[];
}

export interface IngestResult {
  success: boolean;
  error: string;
  data: CrawlResult[];
}

export interface WebScraperOptions {
  url: string;
  mode: "crawl" | "single_urls" | "sitemap";
  crawlerOptions: any;
  team_id: string;
}
apps/api/tsconfig.json (Normal file, 17 lines)
@@ -0,0 +1,17 @@
{
  "compilerOptions": {
    "rootDir": "./src",
    "lib": ["es6", "DOM"],
    "target": "ES2020", // or higher
    "module": "commonjs",
    "esModuleInterop": true,
    "sourceMap": true,
    "outDir": "./dist/src",
    "moduleResolution": "node",
    "baseUrl": ".",
    "paths": {
      "*": ["node_modules/*", "src/types/*"]
    }
  },
  "include": ["src/", "src/**/*", "services/db/supabase.ts", "utils/utils.ts", "services/db/supabaseEmbeddings.ts", "utils/EventEmmitter.ts", "src/services/queue-service.ts"]
}

apps/playwright-service/.DS_Store (vendored, Normal file)
Binary file not shown.
apps/playwright-service/.gitignore (vendored, Normal file, 152 lines)
@@ -0,0 +1,152 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintainted in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
apps/playwright-service/Dockerfile (Normal file, 38 lines)
@@ -0,0 +1,38 @@
FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PIP_DISABLE_PIP_VERSION_CHECK=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libstdc++6

WORKDIR /app

# Install Python dependencies
COPY requirements.txt ./

# Remove py which is pulled in by retry, py is not needed and is a CVE
RUN pip install --no-cache-dir --upgrade -r requirements.txt && \
    pip uninstall -y py && \
    playwright install chromium && playwright install-deps chromium && \
    ln -s /usr/local/bin/supervisord /usr/bin/supervisord

# Cleanup for CVEs and size reduction
# https://github.com/tornadoweb/tornado/issues/3107
# xserver-common and xvfb included by playwright installation but not needed after
# perl-base is part of the base Python Debian image but not needed for Danswer functionality
# perl-base could only be removed with --allow-remove-essential

COPY . ./

EXPOSE $PORT
# run fast api hypercorn
CMD hypercorn main:app --bind [::]:$PORT
# CMD ["hypercorn", "main:app", "--bind", "[::]:$PORT"]
# CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port $PORT"]
apps/playwright-service/README.md (Normal file, 0 lines)

apps/playwright-service/main.py (Normal file, 28 lines)
@@ -0,0 +1,28 @@
from fastapi import FastAPI, Response
from playwright.async_api import async_playwright
import os
from fastapi.responses import JSONResponse
from pydantic import BaseModel

app = FastAPI()


class UrlModel(BaseModel):
    url: str


@app.post("/html")  # Kept as POST to accept body parameters
async def root(body: UrlModel):  # Using Pydantic model for request body
    async with async_playwright() as p:
        browser = await p.chromium.launch()

        context = await browser.new_context()
        page = await context.new_page()

        await page.goto(body.url)  # Adjusted to use the url from the request body model
        page_content = await page.content()  # Get the HTML content of the page

        await browser.close()

        json_compatible_item_data = {"content": page_content}
        return JSONResponse(content=json_compatible_item_data)
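A minimal client for the `/html` endpoint above might look like the sketch below. It assumes the service is reachable at `http://localhost:3000` (the actual port comes from `$PORT` in the Dockerfile) and uses only the standard library.

```python
import json
import urllib.request

def fetch_rendered_html(url: str, service: str = "http://localhost:3000") -> str:
    """POST {"url": ...} to the playwright service's /html endpoint and
    return the rendered HTML from its {"content": ...} JSON response."""
    req = urllib.request.Request(
        f"{service}/html",
        data=json.dumps({"url": url}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```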
apps/playwright-service/requests.http (Normal file, 0 lines)

apps/playwright-service/requirements.txt (Normal file, 4 lines)
@@ -0,0 +1,4 @@
hypercorn==0.16.0
fastapi==0.110.0
playwright==1.42.0
uvicorn

apps/playwright-service/runtime.txt (Normal file, 1 line)
@@ -0,0 +1 @@
3.11
apps/python-sdk/README.md (Normal file, 91 lines)
@@ -0,0 +1,91 @@
# Firecrawl Python SDK

The Firecrawl Python SDK is a library that allows you to easily scrape and crawl websites, and output the data in a format ready for use with language models (LLMs). It provides a simple and intuitive interface for interacting with the Firecrawl API.

## Installation

To install the Firecrawl Python SDK, you can use pip:

```bash
pip install firecrawl-py
```

## Usage

1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.

Here's an example of how to use the SDK:

```python
from firecrawl import FirecrawlApp

# Initialize the FirecrawlApp with your API key
app = FirecrawlApp(api_key='your_api_key')

# Scrape a single URL
url = 'https://mendable.ai'
scraped_data = app.scrape_url(url)

# Crawl a website
crawl_url = 'https://mendable.ai'
crawl_params = {
    'crawlerOptions': {
        'excludes': ['blog/*'],
        'includes': [], # leave empty for all pages
        'limit': 1000,
    }
}
crawl_result = app.crawl_url(crawl_url, params=crawl_params)
```

### Scraping a URL

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```

### Crawling a Website

To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format.

The `wait_until_done` parameter determines whether the method should wait for the crawl job to complete before returning the result. If set to `True`, the method will periodically check the status of the crawl job until it is completed or the specified `timeout` (in seconds) is reached. If set to `False`, the method will return immediately with the job ID, and you can manually check the status of the crawl job using the `check_crawl_status` method.

```python
crawl_url = 'https://example.com'
crawl_params = {
    'crawlerOptions': {
        'excludes': ['blog/*'],
        'includes': [], # leave empty for all pages
        'limit': 1000,
    }
}
crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
```

If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.

### Checking Crawl Status

To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job.

```python
job_id = crawl_result['jobId']
status = app.check_crawl_status(job_id)
```

## Error Handling

The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
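Since failures surface as plain `Exception`s, calls can be guarded with a standard `try`/`except`. The snippet below illustrates the pattern with a stand-in object so it runs without network access or an API key; with a real `FirecrawlApp` the `except` branch works the same way.

```python
class FakeApp:
    # Stand-in that fails the way FirecrawlApp does, for illustration only.
    def scrape_url(self, url):
        raise Exception('Failed to scrape URL. Status code: 500. Error: upstream error')

app = FakeApp()  # with the real SDK: app = FirecrawlApp(api_key='your_api_key')
try:
    data = app.scrape_url('https://example.com')
except Exception as e:
    message = str(e)
    print(f'Scrape failed: {message}')
```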
## Contributing

Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

## License

The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).
apps/python-sdk/build/lib/firecrawl/__init__.py (Normal file, 1 line)
@@ -0,0 +1 @@
from .firecrawl import FirecrawlApp

apps/python-sdk/build/lib/firecrawl/firecrawl.py (Normal file, 96 lines)
@@ -0,0 +1,96 @@
import os
import requests

class FirecrawlApp:
    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
        if self.api_key is None:
            raise ValueError('No API key provided')

    def scrape_url(self, url, params=None):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = requests.post(
            'https://api.firecrawl.dev/v0/scrape',
            headers=headers,
            json=json_data
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] == True:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        elif response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')

    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
        headers = self._prepare_headers()
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
        if response.status_code == 200:
            job_id = response.json().get('jobId')
            if wait_until_done:
                return self._monitor_job_status(job_id, headers, timeout)
            else:
                return {'jobId': job_id}
        else:
            self._handle_error(response, 'start crawl job')

    def check_crawl_status(self, job_id):
        headers = self._prepare_headers()
        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
        if response.status_code == 200:
            return response.json()
        else:
            self._handle_error(response, 'check crawl status')

    def _prepare_headers(self):
        return {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }

    def _post_request(self, url, data, headers):
        return requests.post(url, headers=headers, json=data)

    def _get_request(self, url, headers):
        return requests.get(url, headers=headers)

    def _monitor_job_status(self, job_id, headers, timeout):
        import time
        while True:
            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
            if status_response.status_code == 200:
                status_data = status_response.json()
                if status_data['status'] == 'completed':
                    if 'data' in status_data:
                        return status_data['data']
                    else:
                        raise Exception('Crawl job completed but no data was returned')
                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
                    if timeout < 2:
                        timeout = 2
                    time.sleep(timeout)  # Wait for the specified timeout before checking again
                else:
                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
            else:
                self._handle_error(status_response, 'check crawl status')

    def _handle_error(self, response, action):
        if response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz (vendored, Normal file)
Binary file not shown.

apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl (vendored, Normal file)
Binary file not shown.
apps/python-sdk/example.py (Normal file, 13 lines)
@@ -0,0 +1,13 @@
from firecrawl import FirecrawlApp


app = FirecrawlApp(api_key="a6a2d63a-ed2b-46a9-946d-2a7207efed4d")

# Waits for the crawl to finish and returns the list of scraped pages
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
print(crawl_result[0]['markdown'])

# Start a second crawl without waiting, which returns a job ID instead
job = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, wait_until_done=False)
job_id = job['jobId']
print(job_id)

status = app.check_crawl_status(job_id)
print(status)
apps/python-sdk/firecrawl/__init__.py (Normal file, 1 line)
@@ -0,0 +1 @@
from .firecrawl import FirecrawlApp

apps/python-sdk/firecrawl/__pycache__/__init__.cpython-311.pyc (Normal file)
Binary file not shown.

apps/python-sdk/firecrawl/__pycache__/firecrawl.cpython-311.pyc (Normal file)
Binary file not shown.
apps/python-sdk/firecrawl/firecrawl.py (Normal file, 96 lines)
@@ -0,0 +1,96 @@
import os
import requests

class FirecrawlApp:
    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
        if self.api_key is None:
            raise ValueError('No API key provided')

    def scrape_url(self, url, params=None):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = requests.post(
            'https://api.firecrawl.dev/v0/scrape',
            headers=headers,
            json=json_data
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] == True:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        elif response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')

    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
        headers = self._prepare_headers()
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
        if response.status_code == 200:
            job_id = response.json().get('jobId')
            if wait_until_done:
                return self._monitor_job_status(job_id, headers, timeout)
            else:
                return {'jobId': job_id}
        else:
            self._handle_error(response, 'start crawl job')

    def check_crawl_status(self, job_id):
        headers = self._prepare_headers()
        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
        if response.status_code == 200:
            return response.json()
        else:
            self._handle_error(response, 'check crawl status')

    def _prepare_headers(self):
        return {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }

    def _post_request(self, url, data, headers):
        return requests.post(url, headers=headers, json=data)

    def _get_request(self, url, headers):
        return requests.get(url, headers=headers)

    def _monitor_job_status(self, job_id, headers, timeout):
        import time
        while True:
            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
            if status_response.status_code == 200:
                status_data = status_response.json()
                if status_data['status'] == 'completed':
                    if 'data' in status_data:
                        return status_data['data']
                    else:
                        raise Exception('Crawl job completed but no data was returned')
                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
                    if timeout < 2:
                        timeout = 2
                    time.sleep(timeout)  # Wait for the specified timeout before checking again
                else:
                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
            else:
                self._handle_error(status_response, 'check crawl status')

    def _handle_error(self, response, action):
        if response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
apps/python-sdk/firecrawl_py.egg-info/PKG-INFO (Normal file, 7 lines)
@@ -0,0 +1,7 @@
Metadata-Version: 2.1
Name: firecrawl-py
Version: 0.0.5
Summary: Python SDK for Firecrawl API
Home-page: https://github.com/mendableai/firecrawl-py
Author: Mendable.ai
Author-email: nick@mendable.ai

apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt (Normal file, 9 lines)
@@ -0,0 +1,9 @@
README.md
setup.py
firecrawl/__init__.py
firecrawl/firecrawl.py
firecrawl_py.egg-info/PKG-INFO
firecrawl_py.egg-info/SOURCES.txt
firecrawl_py.egg-info/dependency_links.txt
firecrawl_py.egg-info/requires.txt
firecrawl_py.egg-info/top_level.txt
apps/python-sdk/firecrawl_py.egg-info/dependency_links.txt (Normal file, 1 line)
@@ -0,0 +1 @@

apps/python-sdk/firecrawl_py.egg-info/requires.txt (Normal file, 1 line)
@@ -0,0 +1 @@
requests

apps/python-sdk/firecrawl_py.egg-info/top_level.txt (Normal file, 1 line)
@@ -0,0 +1 @@
firecrawl
apps/python-sdk/setup.py (Normal file, 14 lines)
@@ -0,0 +1,14 @@
from setuptools import setup, find_packages

setup(
    name='firecrawl-py',
    version='0.0.5',
    url='https://github.com/mendableai/firecrawl-py',
    author='Mendable.ai',
    author_email='nick@mendable.ai',
    description='Python SDK for Firecrawl API',
    packages=find_packages(),
    install_requires=[
        'requests',
    ],
)

apps/www/README.md (Normal file, 1 line)
@@ -0,0 +1 @@
Coming soon!