mirror of https://github.com/mendableai/firecrawl.git
synced 2024-11-16 03:32:22 +08:00
Initial commit
commit a6c2a87811
2
.gitattributes
vendored
Normal file
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto
20
.github/workflows/fly.yml
vendored
Normal file
@@ -0,0 +1,20 @@
name: Fly Deploy
on:
  push:
    branches:
      - main
  # schedule:
  #   - cron: '0 */4 * * *'

jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: superfly/flyctl-actions/setup-flyctl@master
      # A `cd` in its own step does not carry over to later steps,
      # so the deploy step sets its working directory explicitly.
      - name: Deploy from apps/api
        working-directory: apps/api
        run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
6
.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data
4
CONTRIBUTING.md
Normal file
@@ -0,0 +1,4 @@
# Contributing

We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
201
LICENSE
Normal file
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2024 Firecrawl | Mendable.ai

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
108
README.md
Normal file
@@ -0,0 +1,108 @@
# 🔥 Firecrawl

Crawl and convert any website into clean markdown

*This repo is still in early development.*

## What is Firecrawl?

[Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.

## How to use it?

We provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.com/playground). You can also self-host the backend if you'd like.

- [x] API
- [x] Python SDK
- [ ] JS SDK - Coming Soon

To self-host, refer to the guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).

### API Key

To use the API, you need to sign up on [Firecrawl](https://firecrawl.com) and get an API key.

### Crawling

Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID that you can use to check the status of the crawl.

```bash
curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'
```

Returns a `jobId`:

```json
{ "jobId": "1234-5678-9101" }
```

### Check Crawl Job

Used to check the status of a crawl job and get its result.

```bash
curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'
```

```json
{
  "status": "completed",
  "current": 22,
  "total": 22,
  "data": [
    {
      "content": "Raw Content ",
      "markdown": "# Markdown Content",
      "provider": "web-scraper",
      "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "description": "AI for CX and Sales",
        "language": null,
        "sourceURL": "https://www.mendable.ai/"
      }
    }
  ]
}
```
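
Submitting a job and then checking its status lends itself to a small polling loop. The following is a minimal Python sketch, not an official client: the helper names `fetch_status` and `wait_for_crawl`, the `interval` and `max_polls` parameters, and the choice to treat every status other than `"completed"` as "still running" are assumptions made for illustration. Only the endpoint, the bearer header, and the `status` and `data` fields come from the examples above.

```python
import json
import time
import urllib.request

API_BASE = "https://api.firecrawl.dev/v0"


def fetch_status(job_id, api_key):
    """GET /crawl/status/<job_id> and decode the JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/crawl/status/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_crawl(job_id, api_key, fetch=fetch_status, interval=2.0, max_polls=150):
    """Poll until the job reports "completed", then return its "data" list.

    Assumption: any other status means the job is still running. The README
    does not document failure states, so this sketch simply gives up after
    max_polls attempts.
    """
    for _ in range(max_polls):
        status = fetch(job_id, api_key)
        if status.get("status") == "completed":
            return status.get("data", [])
        time.sleep(interval)
    raise TimeoutError(f"crawl job {job_id} did not complete after {max_polls} polls")
```

Passing `fetch` as a parameter keeps the loop testable without network access; a call such as `wait_for_crawl("1234-5678-9101", "YOUR_API_KEY")` would use the real endpoint.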

## Using Python SDK

### Installing Python SDK

```bash
pip install firecrawl-py
```

### Crawl a website

```python
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})

# Get the markdown
for result in crawl_result:
    print(result['markdown'])
```
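
Printing each page is enough for a quick check, but it is often handier to key the markdown by page URL. The sketch below is illustrative only: `markdown_by_url` is a hypothetical helper, and it assumes that each SDK result has the same shape as the `data` entries in the REST response shown earlier (a `markdown` string plus `metadata.sourceURL`).

```python
def markdown_by_url(crawl_result):
    """Map each page's metadata.sourceURL to its markdown, skipping empty pages.

    Assumption: every result is a dict shaped like the REST "data" entries.
    """
    pages = {}
    for page in crawl_result:
        markdown = (page.get("markdown") or "").strip()
        source_url = (page.get("metadata") or {}).get("sourceURL")
        if markdown and source_url:
            pages[source_url] = markdown
    return pages
```

With the crawl above, `markdown_by_url(crawl_result)` would give a dictionary that can be written to disk one file per URL.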

### Scraping a URL

To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.

```python
url = 'https://example.com'
scraped_data = app.scrape_url(url)
```
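
Without the SDK, a scrape can presumably be issued against the REST API directly. This commit's `apps/api/requests.http` posts a JSON body with a `url` field to `/v0/scrape` using the same bearer header as the crawl endpoint. The sketch below only builds that request with the standard library; `build_scrape_request` is a hypothetical helper, and the response format is left out because it is not documented here.

```python
import json
import urllib.request


def build_scrape_request(url, api_key, base="https://api.firecrawl.dev/v0"):
    """Build (but do not send) a POST /scrape request for a single URL."""
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/scrape",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_scrape_request('https://example.com', 'YOUR_API_KEY'))` and decoding the JSON body.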

## Contributing

We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
6
SELF_HOST.md
Normal file
@@ -0,0 +1,6 @@
# Self-hosting Firecrawl

Guide coming soon.
BIN
apps/.DS_Store
vendored
Normal file
Binary file not shown.
4
apps/api/.dockerignore
Normal file
@@ -0,0 +1,4 @@
/node_modules/
/dist/
.env
*.csv
8
apps/api/.env.local
Normal file
@@ -0,0 +1,8 @@
PORT=
HOST=
SUPABASE_ANON_TOKEN=
SUPABASE_URL=
SUPABASE_SERVICE_TOKEN=
REDIS_URL=
OPENAI_API_KEY=
SCRAPING_BEE_API_KEY=
2
apps/api/.gitattributes
vendored
Normal file
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto
6
apps/api/.gitignore
vendored
Normal file
@@ -0,0 +1,6 @@
/node_modules/
/dist/
.env
*.csv
dump.rdb
/mongo-data
36
apps/api/Dockerfile
Normal file
@@ -0,0 +1,36 @@
FROM node:20-slim AS base
ENV PNPM_HOME="/pnpm"
ENV PATH="$PNPM_HOME:$PATH"
LABEL fly_launch_runtime="Node.js"
RUN corepack enable
COPY . /app
WORKDIR /app

FROM base AS prod-deps
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --prod --frozen-lockfile

FROM base AS build
RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --frozen-lockfile

RUN pnpm install
RUN pnpm run build

# Install packages needed for deployment

FROM base
RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y chromium chromium-sandbox && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives
COPY --from=prod-deps /app/node_modules /app/node_modules
COPY --from=build /app /app

# Start the server by default, this can be overwritten at runtime
EXPOSE 8080
ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium"
CMD [ "pnpm", "run", "start:production" ]
CMD [ "pnpm", "run", "worker:production" ]
47
apps/api/fly.toml
Normal file
@@ -0,0 +1,47 @@
# fly.toml app configuration file generated for firecrawl-scraper-js on 2024-04-07T21:09:59-03:00
#
# See https://fly.io/docs/reference/configuration/ for information about how to use this file.
#

app = 'firecrawl-scraper-js'
primary_region = 'mia'
kill_signal = 'SIGINT'
kill_timeout = '5s'

[build]

[processes]
  app = 'npm run start:production'
  worker = 'npm run worker:production'

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']

[[services]]
  protocol = 'tcp'
  internal_port = 8080
  processes = ['app']

  [[services.ports]]
    port = 80
    handlers = ['http']
    force_https = true

  [[services.ports]]
    port = 443
    handlers = ['tls', 'http']

  [services.concurrency]
    type = 'connections'
    hard_limit = 45
    soft_limit = 20

[[vm]]
  size = 'performance-1x'
5
apps/api/jest.config.js
Normal file
@@ -0,0 +1,5 @@
module.exports = {
  preset: "ts-jest",
  testEnvironment: "node",
  setupFiles: ["./jest.setup.js"],
};
1
apps/api/jest.setup.js
Normal file
@@ -0,0 +1 @@
global.fetch = require('jest-fetch-mock');
98
apps/api/package.json
Normal file
@@ -0,0 +1,98 @@
{
  "name": "firecrawl-scraper-js",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "nodemon --exec ts-node src/index.ts",
    "start:production": "tsc && node dist/src/index.js",
    "format": "prettier --write \"src/**/*.(js|ts)\"",
    "flyio": "node dist/src/index.js",
    "start:dev": "nodemon --exec ts-node src/index.ts",
    "build": "tsc",
    "test": "jest --verbose",
    "workers": "nodemon --exec ts-node src/services/queue-worker.ts",
    "worker:production": "node dist/src/services/queue-worker.js",
    "mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
    "mongo-docker-console": "docker exec -it mongodb mongosh",
    "run-example": "npx ts-node src/example.ts"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "@flydotio/dockerfile": "^0.4.10",
    "@tsconfig/recommended": "^1.0.3",
    "@types/body-parser": "^1.19.2",
    "@types/bull": "^4.10.0",
    "@types/cors": "^2.8.13",
    "@types/express": "^4.17.17",
    "@types/jest": "^29.5.6",
    "body-parser": "^1.20.1",
    "express": "^4.18.2",
    "jest": "^29.6.3",
    "jest-fetch-mock": "^3.0.3",
    "nodemon": "^2.0.20",
    "supabase": "^1.77.9",
    "supertest": "^6.3.3",
    "ts-jest": "^29.1.1",
    "ts-node": "^10.9.1",
    "typescript": "^5.4.2"
  },
  "dependencies": {
    "@brillout/import": "^0.2.2",
    "@bull-board/api": "^5.14.2",
    "@bull-board/express": "^5.8.0",
    "@devil7softwares/pos": "^1.0.2",
    "@dqbd/tiktoken": "^1.0.7",
    "@logtail/node": "^0.4.12",
    "@nangohq/node": "^0.36.33",
    "@sentry/node": "^7.48.0",
    "@supabase/supabase-js": "^2.7.1",
    "async": "^3.2.5",
    "async-mutex": "^0.4.0",
    "axios": "^1.3.4",
    "bottleneck": "^2.19.5",
    "bull": "^4.11.4",
    "cheerio": "^1.0.0-rc.12",
    "cohere": "^1.1.1",
    "cors": "^2.8.5",
    "cron-parser": "^4.9.0",
    "date-fns": "^2.29.3",
    "dotenv": "^16.3.1",
    "express-rate-limit": "^6.7.0",
    "glob": "^10.3.12",
    "gpt3-tokenizer": "^1.1.5",
    "ioredis": "^5.3.2",
    "keyword-extractor": "^0.0.25",
    "langchain": "^0.1.25",
    "languagedetect": "^2.0.0",
    "logsnag": "^0.1.6",
    "luxon": "^3.4.3",
    "md5": "^2.3.0",
    "moment": "^2.29.4",
    "mongoose": "^8.0.3",
    "natural": "^6.3.0",
    "openai": "^4.28.4",
    "pos": "^0.4.2",
    "promptable": "^0.0.9",
    "puppeteer": "^22.6.3",
    "rate-limiter-flexible": "^2.4.2",
    "redis": "^4.6.7",
    "robots-parser": "^3.0.1",
    "scrapingbee": "^1.7.4",
    "stripe": "^12.2.0",
    "turndown": "^7.1.3",
    "typesense": "^1.5.4",
    "unstructured-client": "^0.9.4",
    "uuid": "^9.0.1",
    "wordpos": "^2.1.0",
    "xml2js": "^0.6.2"
  },
  "nodemonConfig": {
    "ignore": [
      "*.docx",
      "*.json",
      "temp"
    ]
  }
}
6146
apps/api/pnpm-lock.yaml
Normal file
File diff suppressed because it is too large
53
apps/api/requests.http
Normal file
@@ -0,0 +1,53 @@
### Crawl Website
POST http://localhost:3002/v0/crawl HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d

{
    "url":"https://docs.mendable.ai"
}

### Check Job Status
GET http://localhost:3002/v0/jobs/active HTTP/1.1

### Scrape Website
POST https://api.firecrawl.dev/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json

{
    "url":"https://www.agentops.ai"
}

### Scrape Website
POST http://localhost:3002/v0/scrape HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
content-type: application/json

{
    "url":"https://www.agentops.ai"
}

### Check Job Status
GET http://localhost:3002/v0/crawl/status/333ab225-dc3e-418b-9d4b-8fb833cbaf89 HTTP/1.1
Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d

### Get Job Result

POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
content-type: application/json

{
    "url":"https://markprompt.com"
}

### Check Job Status
GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
BIN
apps/api/src/.DS_Store
vendored
Normal file
Binary file not shown.
2
apps/api/src/control.ts
Normal file
@@ -0,0 +1,2 @@
// ! IN CASE OPENAI goes down, then activate the fallback -> true
export const is_fallback = false;
18
apps/api/src/example.ts
Normal file
@@ -0,0 +1,18 @@
import { WebScraperDataProvider } from "./scraper/WebScraper";

async function example() {
  const example = new WebScraperDataProvider();

  await example.setOptions({
    mode: "crawl",
    urls: ["https://mendable.ai"],
    crawlerOptions: {},
  });
  const docs = await example.getDocuments(false);
  docs.map((doc) => {
    console.log(doc.metadata.sourceURL);
  });
  console.log(docs.length);
}

// example();
352
apps/api/src/index.ts
Normal file
352
apps/api/src/index.ts
Normal file
|
@ -0,0 +1,352 @@
|
||||||
|
import express from "express";
|
||||||
|
import bodyParser from "body-parser";
|
||||||
|
import cors from "cors";
|
||||||
|
import "dotenv/config";
|
||||||
|
import { getWebScraperQueue } from "./services/queue-service";
|
||||||
|
import { addWebScraperJob } from "./services/queue-jobs";
|
||||||
|
import { supabase_service } from "./services/supabase";
|
||||||
|
import { WebScraperDataProvider } from "./scraper/WebScraper";
|
||||||
|
import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
|
||||||
|
import { getRateLimiter, redisClient } from "./services/rate-limiter";
|
||||||
|
|
||||||
|
const { createBullBoard } = require("@bull-board/api");
|
||||||
|
const { BullAdapter } = require("@bull-board/api/bullAdapter");
|
||||||
|
const { ExpressAdapter } = require("@bull-board/express");
|
||||||
|
|
||||||
|
export const app = express();
|
||||||
|
|
||||||
|
global.isProduction = process.env.IS_PRODUCTION === "true";
|
||||||
|
|
||||||
|
app.use(bodyParser.urlencoded({ extended: true }));
|
||||||
|
app.use(bodyParser.json({ limit: "10mb" }));
|
||||||
|
|
||||||
|
app.use(cors()); // Add this line to enable CORS
|
||||||
|
|
||||||
|
const serverAdapter = new ExpressAdapter();
|
||||||
|
serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
|
||||||
|
|
||||||
|
const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
|
||||||
|
queues: [new BullAdapter(getWebScraperQueue())],
|
||||||
|
serverAdapter: serverAdapter,
|
||||||
|
});
|
||||||
|
|
||||||
|
app.use(
|
||||||
|
`/admin/${process.env.BULL_AUTH_KEY}/queues`,
|
||||||
|
serverAdapter.getRouter()
|
||||||
|
);
|
||||||
|
|
||||||
|
app.get("/", (req, res) => {
|
||||||
|
res.send("SCRAPERS-JS: Hello, world! Fly.io");
|
||||||
|
});
|
||||||
|
|
||||||
|
//write a simple test function
|
||||||
|
app.get("/test", async (req, res) => {
|
||||||
|
res.send("Hello, world!");
|
||||||
|
});
|
||||||
|
|
||||||
|
// Note: on failure this helper sends the error response itself and resolves
// to the Express response object rather than a team id.
async function authenticateUser(req, res, mode?: string): Promise<string> {
  const authHeader = req.headers.authorization;
  if (!authHeader) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
  if (!token) {
    return res.status(401).json({ error: "Unauthorized: Token missing" });
  }

  try {
    const incomingIP = (req.headers["x-forwarded-for"] ||
      req.socket.remoteAddress) as string;
    const iptoken = incomingIP + token;
    await getRateLimiter(token === "this_is_just_a_preview_token").consume(
      iptoken
    );
  } catch (rateLimiterRes) {
    console.error(rateLimiterRes);
    return res.status(429).json({
      error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
    });
  }

  if (token === "this_is_just_a_preview_token" && mode === "scrape") {
    return "preview";
  }
  // Make sure the API key is valid, based on the api_keys table in Supabase
  const { data, error } = await supabase_service
    .from("api_keys")
    .select("*")
    .eq("key", token);
  if (error || !data || data.length === 0) {
    return res.status(401).json({ error: "Unauthorized: Invalid token" });
  }

  return data[0].team_id;
}

app.post("/v0/scrape", async (req, res) => {
  try {
    // make sure to authenticate user first, Bearer <token>
    const team_id = await authenticateUser(req, res, "scrape");
    if (res.headersSent) {
      // authenticateUser already replied with a 401/429 error
      return;
    }

    try {
      const { success: creditsCheckSuccess, message: creditsCheckMessage } =
        await checkTeamCredits(team_id, 1);
      if (!creditsCheckSuccess) {
        return res.status(402).json({ error: "Insufficient credits" });
      }
    } catch (error) {
      console.error(error);
      return res.status(500).json({ error: "Internal server error" });
    }

    // validate the request body
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }

    try {
      const a = new WebScraperDataProvider();
      await a.setOptions({
        mode: "single_urls",
        urls: [url],
      });

      const docs = await a.getDocuments(false);
      // make sure doc.content is not empty
      const filteredDocs = docs.filter(
        (doc: { content?: string }) =>
          doc.content && doc.content.trim().length > 0
      );
      if (filteredDocs.length === 0) {
        return res.status(200).json({ success: true, data: [] });
      }
      const { success, credit_usage } = await billTeam(
        team_id,
        filteredDocs.length
      );
      if (!success) {
        // throw new Error("Failed to bill team, no subscription was found");
        // return {
        //   success: false,
        //   message: "Failed to bill team, no subscription was found",
        //   docs: [],
        // };
        return res
          .status(402)
          .json({ error: "Failed to bill, no subscription was found" });
      }
      return res.json({
        success: true,
        data: filteredDocs[0],
      });
    } catch (error) {
      console.error(error);
      return res.status(500).json({ error: error.message });
    }
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.post("/v0/crawl", async (req, res) => {
  try {
    const team_id = await authenticateUser(req, res);
    if (res.headersSent) {
      // authenticateUser already replied with a 401/429 error
      return;
    }

    const { success: creditsCheckSuccess, message: creditsCheckMessage } =
      await checkTeamCredits(team_id, 1);
    if (!creditsCheckSuccess) {
      return res.status(402).json({ error: "Insufficient credits" });
    }

    // validate the request body
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};

    if (mode === "single_urls" && !url.includes(",")) {
      try {
        const a = new WebScraperDataProvider();
        await a.setOptions({
          mode: "single_urls",
          urls: [url],
          crawlerOptions: {
            returnOnlyUrls: true,
          },
        });

        // No queue job exists on this synchronous path (`job` is only
        // created further down), so no job progress is reported here.
        const docs = await a.getDocuments(false);
        return res.json({
          success: true,
          documents: docs,
        });
      } catch (error) {
        console.error(error);
        return res.status(500).json({ error: error.message });
      }
    }
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions },
      team_id: team_id,
    });

    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});
app.post("/v0/crawlWebsitePreview", async (req, res) => {
  try {
    // make sure to authenticate user first, Bearer <token>
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }

    // validate the request body
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
      team_id: "preview",
    });

    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/v0/crawl/status/:jobId", async (req, res) => {
  try {
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }

    // make sure api key is valid, based on the api_keys table in supabase
    const { data, error } = await supabase_service
      .from("api_keys")
      .select("*")
      .eq("key", token);
    if (error || !data || data.length === 0) {
      return res.status(401).json({ error: "Unauthorized: Invalid token" });
    }
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }

    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
  try {
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }

    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

const DEFAULT_PORT = process.env.PORT ?? 3002;
const HOST = process.env.HOST ?? "localhost";
redisClient.connect();

export function startServer(port = DEFAULT_PORT) {
  const server = app.listen(Number(port), HOST, () => {
    console.log(`Server listening on port ${port}`);
    // The queue UI is mounted under the BULL_AUTH_KEY path segment (see app.use above)
    console.log(
      `For the UI, open http://${HOST}:${port}/admin/${process.env.BULL_AUTH_KEY}/queues`
    );
    console.log("");
    console.log("1. Make sure Redis is running on port 6379 by default");
    console.log(
      "2. If you want to run nango, make sure you do port forwarding in 3002 using ngrok http 3002"
    );
  });
  return server;
}

if (require.main === module) {
  startServer();
}

// Use this as a health check so that we don't destroy the server
app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
  try {
    const webScraperQueue = getWebScraperQueue();
    const [webScraperActive] = await Promise.all([
      webScraperQueue.getActiveCount(),
    ]);

    const noActiveJobs = webScraperActive === 0;
    // 200 if no active jobs, 500 if there are active jobs
    return res.status(noActiveJobs ? 200 : 500).json({
      webScraperActive,
      noActiveJobs,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
});

app.get("/is-production", (req, res) => {
  res.send({ isProduction: global.isProduction });
});
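The route handlers in this file all read the API key from an "Authorization: Bearer <token>" header via authHeader.split(" ")[1]. The sketch below isolates that step. The helper name extractBearerToken is hypothetical (it is not part of this commit), and it is slightly stricter than the original in that it also checks the "Bearer" scheme; treat it as an illustration under those assumptions rather than the project's implementation.

```typescript
// Hypothetical helper, not part of the commit: parse "Bearer <token>".
// Returns the token, or null when the header is missing or malformed.
function extractBearerToken(authHeader: string | undefined): string | null {
  if (!authHeader) {
    return null;
  }
  // Split on any run of whitespace: ["Bearer", "<token>"]
  const parts = authHeader.trim().split(/\s+/);
  const scheme = parts[0];
  const token = parts[1];
  if (!scheme || scheme.toLowerCase() !== "bearer" || !token) {
    return null;
  }
  return token;
}
```

A handler could then respond with 401 whenever the helper returns null, instead of repeating the two separate header and token checks.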
16
apps/api/src/lib/batch-process.ts
Normal file
@ -0,0 +1,16 @@
export async function batchProcess<T>(
  array: T[],
  batchSize: number,
  asyncFunction: (item: T, index: number) => Promise<void>
): Promise<void> {
  const batches: T[][] = [];
  for (let i = 0; i < array.length; i += batchSize) {
    batches.push(array.slice(i, i + batchSize));
  }

  for (const batch of batches) {
    // Note: the index passed to asyncFunction is the position within the batch
    await Promise.all(batch.map((item, i) => asyncFunction(item, i)));
  }
}
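The batching step of batchProcess (slicing the input into fixed-size groups before awaiting each group) can be sketched on its own. The function name toBatches is hypothetical and not part of this commit; the guard against a non-positive batchSize is an added assumption, since the original loop would never terminate with a batchSize of 0.

```typescript
// Hypothetical helper, not part of the commit: split an array into
// consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  // Assumption: reject sizes that would make the loop below spin forever.
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Each returned batch can then be passed to Promise.all in turn, which is what bounds the number of concurrent calls.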
21
apps/api/src/lib/custom-error.ts
Normal file
@ -0,0 +1,21 @@
export class CustomError extends Error {
  statusCode: number;
  status: string;
  message: string;
  dataIngestionJob: any;

  constructor(
    statusCode: number,
    status: string,
    message: string = "",
    dataIngestionJob?: any,
  ) {
    super(message);
    this.statusCode = statusCode;
    this.status = status;
    this.message = message;
    this.dataIngestionJob = dataIngestionJob;

    Object.setPrototypeOf(this, CustomError.prototype);
  }
}
37
apps/api/src/lib/entities.ts
Normal file
@ -0,0 +1,37 @@
export interface Progress {
  current: number;
  total: number;
  status: string;
  metadata?: {
    sourceURL?: string;
    [key: string]: any;
  };
  currentDocumentUrl?: string;
}

export class Document {
  id?: string;
  content: string;
  markdown?: string;
  createdAt?: Date;
  updatedAt?: Date;
  type?: string;
  metadata: {
    sourceURL?: string;
    [key: string]: any;
  };
  childrenLinks?: string[];

  constructor(data: Partial<Document>) {
    if (!data.content) {
      throw new Error("Missing required fields");
    }
    this.content = data.content;
    this.createdAt = data.createdAt || new Date();
    this.updatedAt = data.updatedAt || new Date();
    this.type = data.type || "unknown";
    this.metadata = data.metadata || { sourceURL: "" };
    this.markdown = data.markdown || "";
    this.childrenLinks = data.childrenLinks || undefined;
  }
}
51
apps/api/src/lib/html-to-markdown.ts
Normal file
@ -0,0 +1,51 @@
export function parseMarkdown(html: string) {
  var TurndownService = require("turndown");

  const turndownService = new TurndownService();
  turndownService.addRule("inlineLink", {
    filter: function (node, options) {
      return (
        options.linkStyle === "inlined" &&
        node.nodeName === "A" &&
        node.getAttribute("href")
      );
    },
    replacement: function (content, node) {
      var href = node.getAttribute("href").trim();
      var title = node.title ? ' "' + node.title + '"' : "";
      return "[" + content.trim() + "](" + href + title + ")\n";
    },
  });

  let markdownContent = turndownService.turndown(html);

  // Links whose text spans multiple lines: escape newlines inside [...]
  let insideLinkContent = false;
  let newMarkdownContent = "";
  let linkOpenCount = 0;
  for (let i = 0; i < markdownContent.length; i++) {
    const char = markdownContent[i];

    if (char === "[") {
      linkOpenCount++;
    } else if (char === "]") {
      linkOpenCount = Math.max(0, linkOpenCount - 1);
    }
    insideLinkContent = linkOpenCount > 0;

    if (insideLinkContent && char === "\n") {
      newMarkdownContent += "\\" + "\n";
    } else {
      newMarkdownContent += char;
    }
  }
  markdownContent = newMarkdownContent;

  // Remove [Skip to Content](#page) and [Skip to content](#skip)
  markdownContent = markdownContent.replace(
    /\[Skip to Content\]\(#[^\)]*\)/gi,
    ""
  );

  return markdownContent;
}
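The bracket-counting loop in parseMarkdown can be read in isolation: it tracks how many "[" are currently open and puts a backslash in front of any newline seen while that count is above zero. The standalone function below is a sketch of that one step; the name escapeNewlinesInLinkText is hypothetical and not part of this commit.

```typescript
// Hypothetical helper, not part of the commit: escape newlines that fall
// inside square brackets so multi-line link text stays a single link.
function escapeNewlinesInLinkText(markdown: string): string {
  let openBrackets = 0;
  let result = "";
  for (let i = 0; i < markdown.length; i++) {
    const char = markdown[i];
    if (char === "[") {
      openBrackets++;
    } else if (char === "]") {
      // Never go below zero on a stray closing bracket
      openBrackets = Math.max(0, openBrackets - 1);
    }
    // Inside link text: prefix the newline with a backslash
    result += openBrackets > 0 && char === "\n" ? "\\\n" : char;
  }
  return result;
}
```

Newlines outside brackets are left untouched, so ordinary paragraph breaks are preserved.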
12
apps/api/src/lib/parse-mode.ts
Normal file
@ -0,0 +1,12 @@
export function parseMode(mode: string) {
  switch (mode) {
    case "single_urls":
      return "single_urls";
    case "sitemap":
      return "sitemap";
    case "crawl":
      return "crawl";
    default:
      return "single_urls";
  }
}
96
apps/api/src/main/runWebScraper.ts
Normal file
@ -0,0 +1,96 @@
import { Job } from "bull";
import { CrawlResult, WebScraperOptions } from "../types";
import { WebScraperDataProvider } from "../scraper/WebScraper";
import { Progress } from "../lib/entities";
import { billTeam } from "../services/billing/credit_billing";

export async function startWebScraperPipeline({
  job,
}: {
  job: Job<WebScraperOptions>;
}) {
  return (await runWebScraper({
    url: job.data.url,
    mode: job.data.mode,
    crawlerOptions: job.data.crawlerOptions,
    inProgress: (progress) => {
      job.progress(progress);
    },
    onSuccess: (result) => {
      job.moveToCompleted(result);
    },
    onError: (error) => {
      job.moveToFailed(error);
    },
    team_id: job.data.team_id,
  })) as { success: boolean; message: string; docs: CrawlResult[] };
}
export async function runWebScraper({
  url,
  mode,
  crawlerOptions,
  inProgress,
  onSuccess,
  onError,
  team_id,
}: {
  url: string;
  mode: "crawl" | "single_urls" | "sitemap";
  crawlerOptions: any;
  inProgress: (progress: any) => void;
  onSuccess: (result: any) => void;
  onError: (error: any) => void;
  team_id: string;
}): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
  try {
    const provider = new WebScraperDataProvider();

    if (mode === "crawl") {
      await provider.setOptions({
        mode: mode,
        urls: [url],
        crawlerOptions: crawlerOptions,
      });
    } else {
      await provider.setOptions({
        mode: mode,
        urls: url.split(","),
        crawlerOptions: crawlerOptions,
      });
    }
    const docs = (await provider.getDocuments(false, (progress: Progress) => {
      inProgress(progress);
    })) as CrawlResult[];

    if (docs.length === 0) {
      return {
        success: true,
        message: "No pages found",
        docs: [],
      };
    }

    // remove docs with empty content
    const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
    onSuccess(filteredDocs);

    const { success, credit_usage } = await billTeam(
      team_id,
      filteredDocs.length
    );
    if (!success) {
      // throw new Error("Failed to bill team, no subscription was found");
      return {
        success: false,
        message: "Failed to bill team, no subscription was found",
        docs: [],
      };
    }

    return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
  } catch (error) {
    console.error("Error running web scraper", error);
    onError(error);
    return { success: false, message: error.message, docs: [] };
  }
}
295
apps/api/src/scraper/WebScraper/crawler.ts
Normal file
@ -0,0 +1,295 @@
import axios from "axios";
import cheerio, { load } from "cheerio";
import { URL } from "url";
import { getLinksFromSitemap } from "./sitemap";
import async from "async";
import { Progress } from "../../lib/entities";
import { scrapWithScrapingBee } from "./single_url";
import robotsParser from "robots-parser";

export class WebCrawler {
  private initialUrl: string;
  private baseUrl: string;
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private visited: Set<string> = new Set();
  private crawledUrls: Set<string> = new Set();
  private limit: number;
  private robotsTxtUrl: string;
  private robots: any;

  constructor({
    initialUrl,
    includes,
    excludes,
    maxCrawledLinks,
    limit = 10000,
  }: {
    initialUrl: string;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  }) {
    this.initialUrl = initialUrl;
    this.baseUrl = new URL(initialUrl).origin;
    this.includes = includes ?? [];
    this.excludes = excludes ?? [];
    this.limit = limit;
    this.robotsTxtUrl = `${this.baseUrl}/robots.txt`;
    this.robots = robotsParser(this.robotsTxtUrl, "");
    // Deprecated, use limit instead
    this.maxCrawledLinks = maxCrawledLinks ?? limit;
  }

  private filterLinks(sitemapLinks: string[], limit: number): string[] {
    return sitemapLinks
      .filter((link) => {
        const url = new URL(link);
        const path = url.pathname;

        // Check if the link should be excluded
        if (this.excludes.length > 0 && this.excludes[0] !== "") {
          if (
            this.excludes.some((excludePattern) =>
              new RegExp(excludePattern).test(path)
            )
          ) {
            return false;
          }
        }

        // Check if the link matches the include patterns, if any are specified.
        // A match no longer returns early, so robots.txt is still checked below.
        if (this.includes.length > 0 && this.includes[0] !== "") {
          const included = this.includes.some((includePattern) =>
            new RegExp(includePattern).test(path)
          );
          if (!included) {
            return false;
          }
        }

        // Check if the link is disallowed by robots.txt
        const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
        if (!isAllowed) {
          console.log(`Link disallowed by robots.txt: ${link}`);
          return false;
        }

        return true;
      })
      .slice(0, limit);
  }

  public async start(
    inProgress?: (progress: Progress) => void,
    concurrencyLimit: number = 5,
    limit: number = 10000
  ): Promise<string[]> {
    // Fetch and parse robots.txt
    try {
      const response = await axios.get(this.robotsTxtUrl);
      this.robots = robotsParser(this.robotsTxtUrl, response.data);
    } catch (error) {
      console.error(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
    }

    const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
    if (sitemapLinks.length > 0) {
      const filteredLinks = this.filterLinks(sitemapLinks, limit);
      return filteredLinks;
    }

    const urls = await this.crawlUrls(
      [this.initialUrl],
      concurrencyLimit,
      inProgress
    );
    if (
      urls.length === 0 &&
      this.filterLinks([this.initialUrl], limit).length > 0
    ) {
      return [this.initialUrl];
    }

    // make sure to run include exclude here again
    return this.filterLinks(urls, limit);
  }

  private async crawlUrls(
    urls: string[],
    concurrencyLimit: number,
    inProgress?: (progress: Progress) => void
  ): Promise<string[]> {
    const queue = async.queue(async (task: string, callback) => {
      if (this.crawledUrls.size >= this.maxCrawledLinks) {
        if (callback && typeof callback === "function") {
          callback();
        }
        return;
      }
      const newUrls = await this.crawl(task);
      newUrls.forEach((url) => this.crawledUrls.add(url));
      if (inProgress && newUrls.length > 0) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: newUrls[newUrls.length - 1],
        });
      } else if (inProgress) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: task,
        });
      }
      await this.crawlUrls(newUrls, concurrencyLimit, inProgress);
      if (callback && typeof callback === "function") {
        callback();
      }
    }, concurrencyLimit);

    queue.push(
      urls.filter(
        (url) =>
          !this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
      ),
      (err) => {
        if (err) console.error(err);
      }
    );
    await queue.drain();
    return Array.from(this.crawledUrls);
  }

  async crawl(url: string): Promise<string[]> {
    if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent"))
      return [];
    this.visited.add(url);
    if (!url.startsWith("http")) {
      url = "https://" + url;
    }
    if (url.endsWith("/")) {
      url = url.slice(0, -1);
    }
    if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
      return [];
    }

    try {
      let content;
      // If it is the first link, fetch with scrapingbee
      if (this.visited.size === 1) {
        content = await scrapWithScrapingBee(url, "load");
      } else {
        const response = await axios.get(url);
        content = response.data;
      }
      const $ = load(content);
      let links: string[] = [];

      $("a").each((_, element) => {
        const href = $(element).attr("href");
        if (href) {
          let fullUrl = href;
          if (!href.startsWith("http")) {
            fullUrl = new URL(href, this.baseUrl).toString();
          }
          const url = new URL(fullUrl);
          const path = url.pathname;

          if (
            // fullUrl.startsWith(this.initialUrl) && // this condition makes it stop crawling back the url
            this.isInternalLink(fullUrl) &&
            this.matchesPattern(fullUrl) &&
            this.noSections(fullUrl) &&
            this.matchesIncludes(path) &&
            !this.matchesExcludes(path) &&
            this.robots.isAllowed(fullUrl, "FireCrawlAgent")
          ) {
            links.push(fullUrl);
          }
        }
      });

      return links.filter((link) => !this.visited.has(link));
    } catch (error) {
      return [];
    }
  }

  private matchesIncludes(url: string): boolean {
    if (this.includes.length === 0 || this.includes[0] === "") return true;
    return this.includes.some((pattern) => new RegExp(pattern).test(url));
  }

  private matchesExcludes(url: string): boolean {
    if (this.excludes.length === 0 || this.excludes[0] === "") return false;
    return this.excludes.some((pattern) => new RegExp(pattern).test(url));
  }

  private noSections(link: string): boolean {
    return !link.includes("#");
  }

  private isInternalLink(link: string): boolean {
    const urlObj = new URL(link, this.baseUrl);
    const domainWithoutProtocol = this.baseUrl.replace(/^https?:\/\//, "");
    // baseUrl is an origin and may carry a port, so compare against host
    // (hostname plus port) rather than hostname alone
    return urlObj.host === domainWithoutProtocol;
  }

  private matchesPattern(link: string): boolean {
    return true; // Placeholder for future pattern matching implementation
  }

  private isFile(url: string): boolean {
    const fileExtensions = [
      ".png",
      ".jpg",
      ".jpeg",
      ".gif",
      ".css",
      ".js",
      ".ico",
      ".svg",
      ".pdf",
      ".zip",
      ".exe",
      ".dmg",
      ".mp4",
      ".mp3",
      ".pptx",
      ".docx",
      ".xlsx",
      ".xml",
    ];
    return fileExtensions.some((ext) => url.endsWith(ext));
  }

  private isSocialMediaOrEmail(url: string): boolean {
    const socialMediaOrEmail = [
      "facebook.com",
      "twitter.com",
      "linkedin.com",
      "instagram.com",
      "pinterest.com",
      "mailto:",
    ];
    return socialMediaOrEmail.some((ext) => url.includes(ext));
  }

  private async tryFetchSitemapLinks(url: string): Promise<string[]> {
    const sitemapUrl = url.endsWith("/sitemap.xml")
      ? url
      : `${url}/sitemap.xml`;
    try {
      const response = await axios.get(sitemapUrl);
      if (response.status === 200) {
        return await getLinksFromSitemap(sitemapUrl);
      }
    } catch (error) {
      // Error handling for failed sitemap fetch
    }
    return [];
  }
}
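The include/exclude handling in WebCrawler (filterLinks, matchesIncludes, matchesExcludes) comes down to two regex checks on the URL path: reject if any exclude pattern matches, then accept only if there are no include patterns or at least one matches. The sketch below shows that rule as a free function. The name pathPassesFilters is hypothetical and not part of this commit, and unlike the original (which only inspects the first array element for an empty string) it ignores every empty pattern; that difference is an assumption made for the illustration.

```typescript
// Hypothetical helper, not part of the commit: decide whether a URL path
// passes the include/exclude regex patterns.
function pathPassesFilters(
  path: string,
  includes: string[] = [],
  excludes: string[] = []
): boolean {
  // Assumption: empty strings mean "no pattern" and are dropped entirely.
  const active = (patterns: string[]) => patterns.filter((p) => p !== "");

  // Excludes win: any matching exclude pattern rejects the path.
  if (active(excludes).some((p) => new RegExp(p).test(path))) {
    return false;
  }

  // With no include patterns everything else passes; otherwise one must match.
  const activeIncludes = active(includes);
  return (
    activeIncludes.length === 0 ||
    activeIncludes.some((p) => new RegExp(p).test(path))
  );
}
```

For example, with includes set to ["^/blog"], the path "/blog/post" passes and "/docs" does not.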
287
apps/api/src/scraper/WebScraper/index.ts
Normal file
@ -0,0 +1,287 @@
import { Document } from "../../lib/entities";
import { Progress } from "../../lib/entities";
import { scrapSingleUrl } from "./single_url";
import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
import { WebCrawler } from "./crawler";
import { getValue, setValue } from "../../services/redis";

export type WebScraperOptions = {
  urls: string[];
  mode: "single_urls" | "sitemap" | "crawl";
  crawlerOptions?: {
    returnOnlyUrls?: boolean;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  };
  concurrentRequests?: number;
};
export class WebScraperDataProvider {
  private urls: string[] = [""];
  private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private returnOnlyUrls: boolean;
  private limit: number = 10000;
  private concurrentRequests: number = 20;

  authorize(): void {
    throw new Error("Method not implemented.");
  }

  authorizeNango(): Promise<void> {
    throw new Error("Method not implemented.");
  }

  private async convertUrlsToDocuments(
    urls: string[],
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    const totalUrls = urls.length;
    let processedUrls = 0;
    console.log("Converting urls to documents");
    console.log("Total urls", urls);
    const results: (Document | null)[] = new Array(urls.length).fill(null);
    for (let i = 0; i < urls.length; i += this.concurrentRequests) {
      const batchUrls = urls.slice(i, i + this.concurrentRequests);
      await Promise.all(
        batchUrls.map(async (url, index) => {
          const result = await scrapSingleUrl(url, true);
          processedUrls++;
          if (inProgress) {
            inProgress({
              current: processedUrls,
              total: totalUrls,
              status: "SCRAPING",
              currentDocumentUrl: url,
            });
          }
          results[i + index] = result;
        })
      );
    }
    return results.filter((result) => result !== null) as Document[];
  }

  async getDocuments(
    useCaching: boolean = false,
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    if (this.urls[0].trim() === "") {
      throw new Error("Url is required");
    }

    if (!useCaching) {
      if (this.mode === "crawl") {
        const crawler = new WebCrawler({
          initialUrl: this.urls[0],
          includes: this.includes,
          excludes: this.excludes,
          maxCrawledLinks: this.maxCrawledLinks,
          limit: this.limit,
        });
        const links = await crawler.start(inProgress, 5, this.limit);
        if (this.returnOnlyUrls) {
          return links.map((url) => ({
            content: "",
            metadata: { sourceURL: url },
            provider: "web",
            type: "text",
          }));
        }
        let documents = await this.convertUrlsToDocuments(links, inProgress);
        documents = await this.getSitemapData(this.urls[0], documents);
        console.log("documents", documents);

        // CACHING DOCUMENTS
        // - parent document
        const cachedParentDocumentString = await getValue(
          "web-scraper-cache:" + this.normalizeUrl(this.urls[0])
        );
        if (cachedParentDocumentString != null) {
          let cachedParentDocument = JSON.parse(cachedParentDocumentString);
          if (
            !cachedParentDocument.childrenLinks ||
            cachedParentDocument.childrenLinks.length < links.length - 1
          ) {
            cachedParentDocument.childrenLinks = links.filter(
              (link) => link !== this.urls[0]
            );
            await setValue(
              "web-scraper-cache:" + this.normalizeUrl(this.urls[0]),
              JSON.stringify(cachedParentDocument),
              60 * 60 * 24 * 10
            ); // 10 days
          }
        } else {
          let parentDocument = documents.filter(
            (document) =>
              this.normalizeUrl(document.metadata.sourceURL) ===
              this.normalizeUrl(this.urls[0])
          );
          await this.setCachedDocuments(parentDocument, links);
        }

        await this.setCachedDocuments(
          documents.filter(
            (document) =>
              this.normalizeUrl(document.metadata.sourceURL) !==
              this.normalizeUrl(this.urls[0])
          ),
          []
        );
        documents = this.removeChildLinks(documents);
|
documents = documents.splice(0, this.limit);
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (this.mode === "single_urls") {
|
||||||
|
let documents = await this.convertUrlsToDocuments(this.urls, inProgress);
|
||||||
|
|
||||||
|
const baseUrl = new URL(this.urls[0]).origin;
|
||||||
|
documents = await this.getSitemapData(baseUrl, documents);
|
||||||
|
|
||||||
|
await this.setCachedDocuments(documents);
|
||||||
|
documents = this.removeChildLinks(documents);
|
||||||
|
documents = documents.splice(0, this.limit);
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
if (this.mode === "sitemap") {
|
||||||
|
const links = await getLinksFromSitemap(this.urls[0]);
|
||||||
|
let documents = await this.convertUrlsToDocuments(links.slice(0, this.limit), inProgress);
|
||||||
|
|
||||||
|
documents = await this.getSitemapData(this.urls[0], documents);
|
||||||
|
|
||||||
|
await this.setCachedDocuments(documents);
|
||||||
|
documents = this.removeChildLinks(documents);
|
||||||
|
documents = documents.splice(0, this.limit);
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
|
||||||
|
return [];
|
||||||
|
}
|
||||||
|
|
||||||
|
let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
|
||||||
|
if (documents.length < this.limit) {
|
||||||
|
const newDocuments: Document[] = await this.getDocuments(false, inProgress);
|
||||||
|
newDocuments.forEach(doc => {
|
||||||
|
if (!documents.some(d => this.normalizeUrl(d.metadata.sourceURL) === this.normalizeUrl(doc.metadata?.sourceURL))) {
|
||||||
|
documents.push(doc);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
}
|
||||||
|
documents = this.filterDocsExcludeInclude(documents);
|
||||||
|
documents = this.removeChildLinks(documents);
|
||||||
|
documents = documents.splice(0, this.limit);
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
|
||||||
|
private filterDocsExcludeInclude(documents: Document[]): Document[] {
|
||||||
|
return documents.filter((document) => {
|
||||||
|
const url = new URL(document.metadata.sourceURL);
|
||||||
|
const path = url.pathname;
|
||||||
|
|
||||||
|
if (this.excludes.length > 0 && this.excludes[0] !== '') {
|
||||||
|
// Check if the link should be excluded
|
||||||
|
if (this.excludes.some(excludePattern => new RegExp(excludePattern).test(path))) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (this.includes.length > 0 && this.includes[0] !== '') {
|
||||||
|
// Check if the link matches the include patterns, if any are specified
|
||||||
|
if (this.includes.length > 0) {
|
||||||
|
return this.includes.some(includePattern => new RegExp(includePattern).test(path));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return true;
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private normalizeUrl(url: string): string {
|
||||||
|
if (url.includes("//www.")) {
|
||||||
|
return url.replace("//www.", "//");
|
||||||
|
}
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
|
||||||
|
private removeChildLinks(documents: Document[]): Document[] {
|
||||||
|
for (let document of documents) {
|
||||||
|
if (document?.childrenLinks) delete document.childrenLinks;
|
||||||
|
};
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
|
||||||
|
async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
|
||||||
|
for (const document of documents) {
|
||||||
|
if (document.content.trim().length === 0) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
|
||||||
|
await setValue('web-scraper-cache:' + normalizedUrl, JSON.stringify({
|
||||||
|
...document,
|
||||||
|
childrenLinks: childrenLinks || []
|
||||||
|
}), 60 * 60 * 24 * 10); // 10 days
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async getCachedDocuments(urls: string[]): Promise<Document[]> {
|
||||||
|
let documents: Document[] = [];
|
||||||
|
for (const url of urls) {
|
||||||
|
const normalizedUrl = this.normalizeUrl(url);
|
||||||
|
console.log("Getting cached document for web-scraper-cache:" + normalizedUrl)
|
||||||
|
const cachedDocumentString = await getValue('web-scraper-cache:' + normalizedUrl);
|
||||||
|
if (cachedDocumentString) {
|
||||||
|
const cachedDocument = JSON.parse(cachedDocumentString);
|
||||||
|
documents.push(cachedDocument);
|
||||||
|
|
||||||
|
// get children documents
|
||||||
|
for (const childUrl of cachedDocument.childrenLinks) {
|
||||||
|
const normalizedChildUrl = this.normalizeUrl(childUrl);
|
||||||
|
const childCachedDocumentString = await getValue('web-scraper-cache:' + normalizedChildUrl);
|
||||||
|
if (childCachedDocumentString) {
|
||||||
|
const childCachedDocument = JSON.parse(childCachedDocumentString);
|
||||||
|
if (!documents.find((doc) => doc.metadata.sourceURL === childCachedDocument.metadata.sourceURL)) {
|
||||||
|
documents.push(childCachedDocument);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
|
||||||
|
setOptions(options: WebScraperOptions): void {
|
||||||
|
if (!options.urls) {
|
||||||
|
throw new Error("Urls are required");
|
||||||
|
}
|
||||||
|
|
||||||
|
console.log("options", options.crawlerOptions?.excludes)
|
||||||
|
this.urls = options.urls;
|
||||||
|
this.mode = options.mode;
|
||||||
|
this.concurrentRequests = options.concurrentRequests ?? 20;
|
||||||
|
this.includes = options.crawlerOptions?.includes ?? [];
|
||||||
|
this.excludes = options.crawlerOptions?.excludes ?? [];
|
||||||
|
this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
|
||||||
|
this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
|
||||||
|
this.limit = options.crawlerOptions?.limit ?? 10000;
|
||||||
|
|
||||||
|
|
||||||
|
//! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find the source of the issue, so adding this check.
|
||||||
|
this.excludes = this.excludes.filter(item => item !== '');
|
||||||
|
|
||||||
|
|
||||||
|
// make sure all urls start with https://
|
||||||
|
this.urls = this.urls.map((url) => {
|
||||||
|
if (!url.trim().startsWith("http")) {
|
||||||
|
return `https://${url.trim()}`;
|
||||||
|
}
|
||||||
|
return url;
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
private async getSitemapData(baseUrl: string, documents: Document[]) {
|
||||||
|
const sitemapData = await fetchSitemapData(baseUrl)
|
||||||
|
if (sitemapData) {
|
||||||
|
for (let i = 0; i < documents.length; i++) {
|
||||||
|
const docInSitemapData = sitemapData.find((data) => this.normalizeUrl(data.loc) === this.normalizeUrl(documents[i].metadata.sourceURL))
|
||||||
|
if (docInSitemapData) {
|
||||||
|
let sitemapDocData: Partial<SitemapEntry> = {};
|
||||||
|
if (docInSitemapData.changefreq) {
|
||||||
|
sitemapDocData.changefreq = docInSitemapData.changefreq;
|
||||||
|
}
|
||||||
|
if (docInSitemapData.priority) {
|
||||||
|
sitemapDocData.priority = Number(docInSitemapData.priority);
|
||||||
|
}
|
||||||
|
if (docInSitemapData.lastmod) {
|
||||||
|
sitemapDocData.lastmod = docInSitemapData.lastmod;
|
||||||
|
}
|
||||||
|
if (Object.keys(sitemapDocData).length !== 0) {
|
||||||
|
documents[i].metadata.sitemap = sitemapDocData;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return documents;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
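The batching loop in `convertUrlsToDocuments` above can be isolated into a small helper. What follows is a minimal sketch, not the repository's code: `toBatches` and `mapInBatches` are hypothetical names, and the worker stands in for `scrapSingleUrl`. It shows the same idea: at most `size` requests in flight, with each result written back to its original position.

```typescript
// Hypothetical helper: split a list into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical helper: run `worker` over `items` one batch at a time,
// storing each result at the index of the item that produced it.
async function mapInBatches<T, R>(
  items: T[],
  size: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let offset = 0;
  for (const batch of toBatches(items, size)) {
    const start = offset;
    await Promise.all(
      batch.map(async (item, index) => {
        results[start + index] = await worker(item);
      })
    );
    offset += batch.length;
  }
  return results;
}
```

Waiting for a whole batch before starting the next keeps the logic simple, at the cost of idling while the slowest request in each batch finishes.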
145
apps/api/src/scraper/WebScraper/single_url.ts
Normal file
|
@ -0,0 +1,145 @@
|
||||||
|
import * as cheerio from "cheerio";
|
||||||
|
import { ScrapingBeeClient } from "scrapingbee";
|
||||||
|
import { attemptScrapWithRequests, sanitizeText } from "./utils/utils";
|
||||||
|
import { extractMetadata } from "./utils/metadata";
|
||||||
|
import dotenv from "dotenv";
|
||||||
|
import { Document } from "../../lib/entities";
|
||||||
|
import { parseMarkdown } from "../../lib/html-to-markdown";
|
||||||
|
// import puppeteer from "puppeteer";
|
||||||
|
|
||||||
|
dotenv.config();
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
export async function scrapWithScrapingBee(url: string, wait_browser:string = "domcontentloaded"): Promise<string> {
|
||||||
|
try {
|
||||||
|
const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
|
||||||
|
const response = await client.get({
|
||||||
|
url: url,
|
||||||
|
params: { timeout: 15000, wait_browser: wait_browser },
|
||||||
|
headers: { "ScrapingService-Request": "TRUE" },
|
||||||
|
});
|
||||||
|
|
||||||
|
if (response.status !== 200 && response.status !== 404) {
|
||||||
|
console.error(
|
||||||
|
`Scraping bee error in ${url} with status code ${response.status}`
|
||||||
|
);
|
||||||
|
return "";
|
||||||
|
}
|
||||||
|
const decoder = new TextDecoder();
|
||||||
|
const text = decoder.decode(response.data);
|
||||||
|
return text;
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Error scraping with Scraping Bee: ${error}`);
|
||||||
|
return "";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
export async function scrapWithPlaywright(url: string): Promise<string> {
|
||||||
|
try {
|
||||||
|
const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: {
|
||||||
|
"Content-Type": "application/json",
|
||||||
|
},
|
||||||
|
body: JSON.stringify({ url: url }),
|
||||||
|
});
|
||||||
|
|
||||||
|
if (!response.ok) {
|
||||||
|
console.error(`Error fetching w/ playwright server -> URL: ${url} with status: ${response.status}`);
|
||||||
|
return "";
|
||||||
|
}
|
||||||
|
|
||||||
|
const data = await response.json();
|
||||||
|
const html = data.content;
|
||||||
|
return html ?? "";
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Error scraping with Playwright: ${error}`);
|
||||||
|
return "";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
export async function scrapSingleUrl(
|
||||||
|
urlToScrap: string,
|
||||||
|
toMarkdown: boolean = true
|
||||||
|
): Promise<Document> {
|
||||||
|
console.log(`Scraping URL: ${urlToScrap}`);
|
||||||
|
urlToScrap = urlToScrap.trim();
|
||||||
|
|
||||||
|
const removeUnwantedElements = (html: string) => {
|
||||||
|
const soup = cheerio.load(html);
|
||||||
|
soup("script, style, iframe, noscript, meta, head").remove();
|
||||||
|
return soup.html();
|
||||||
|
};
|
||||||
|
|
||||||
|
const attemptScraping = async (url: string, method: 'scrapingBee' | 'playwright' | 'scrapingBeeLoad' | 'fetch') => {
|
||||||
|
let text = "";
|
||||||
|
switch (method) {
|
||||||
|
case 'scrapingBee':
|
||||||
|
if (process.env.SCRAPING_BEE_API_KEY) {
|
||||||
|
text = await scrapWithScrapingBee(url);
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case 'playwright':
|
||||||
|
if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
|
||||||
|
text = await scrapWithPlaywright(url);
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case 'scrapingBeeLoad':
|
||||||
|
if (process.env.SCRAPING_BEE_API_KEY) {
|
||||||
|
text = await scrapWithScrapingBee(url, "networkidle2");
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case 'fetch':
|
||||||
|
try {
|
||||||
|
const response = await fetch(url);
|
||||||
|
if (!response.ok) {
|
||||||
|
console.error(`Error fetching URL: ${url} with status: ${response.status}`);
|
||||||
|
return ["", ""];
|
||||||
|
}
|
||||||
|
text = await response.text();
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Error scraping URL: ${error}`);
|
||||||
|
return ["", ""];
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
|
||||||
|
}
|
||||||
|
const cleanedHtml = removeUnwantedElements(text);
|
||||||
|
return [await parseMarkdown(cleanedHtml), text];
|
||||||
|
};
|
||||||
|
|
||||||
|
try {
|
||||||
|
let [text, html ] = await attemptScraping(urlToScrap, 'scrapingBee');
|
||||||
|
if (!text || text.length < 100) {
|
||||||
|
console.log("Falling back to playwright");
|
||||||
|
[text, html] = await attemptScraping(urlToScrap, 'playwright');
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!text || text.length < 100) {
|
||||||
|
console.log("Falling back to scraping bee load");
|
||||||
|
[text, html] = await attemptScraping(urlToScrap, 'scrapingBeeLoad');
|
||||||
|
}
|
||||||
|
if (!text || text.length < 100) {
|
||||||
|
console.log("Falling back to fetch");
|
||||||
|
[text, html] = await attemptScraping(urlToScrap, 'fetch');
|
||||||
|
}
|
||||||
|
|
||||||
|
const soup = cheerio.load(html);
|
||||||
|
const metadata = extractMetadata(soup, urlToScrap);
|
||||||
|
|
||||||
|
return {
|
||||||
|
content: text,
|
||||||
|
markdown: text,
|
||||||
|
metadata: { ...metadata, sourceURL: urlToScrap },
|
||||||
|
} as Document;
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Error: ${error} - Failed to fetch URL: ${urlToScrap}`);
|
||||||
|
return {
|
||||||
|
content: "",
|
||||||
|
markdown: "",
|
||||||
|
metadata: { sourceURL: urlToScrap },
|
||||||
|
} as Document;
|
||||||
|
}
|
||||||
|
}
|
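The fallback order in `scrapSingleUrl` above (ScrapingBee, then Playwright, then ScrapingBee with a longer wait, then plain `fetch`) repeats the same length check four times. A minimal sketch of the same chain as a loop, assuming hypothetical names `Scraper`, `isUsable` and `scrapeWithFallback` that do not exist in the repository:

```typescript
// Hypothetical type: any function that turns a URL into page text.
type Scraper = (url: string) => Promise<string>;

// Same threshold as the checks above: fewer than 100 characters counts as a miss.
const MIN_USABLE_LENGTH = 100;

function isUsable(text: string | null | undefined): boolean {
  return !!text && text.length >= MIN_USABLE_LENGTH;
}

// Try each scraper in order and stop at the first usable result.
// If none is usable, the output of the last attempt is returned.
async function scrapeWithFallback(
  url: string,
  scrapers: Scraper[]
): Promise<string> {
  let last = "";
  for (const scraper of scrapers) {
    try {
      last = await scraper(url);
    } catch {
      last = ""; // a throwing scraper is treated like an empty result
    }
    if (isUsable(last)) return last;
  }
  return last;
}
```

With this shape, adding or reordering a fallback is a change to the `scrapers` array rather than another copy of the `if` block.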
74
apps/api/src/scraper/WebScraper/sitemap.ts
Normal file
|
@ -0,0 +1,74 @@
|
||||||
|
import axios from "axios";
|
||||||
|
import { parseStringPromise } from "xml2js";
|
||||||
|
|
||||||
|
export async function getLinksFromSitemap(
|
||||||
|
sitemapUrl: string,
|
||||||
|
allUrls: string[] = []
|
||||||
|
): Promise<string[]> {
|
||||||
|
try {
|
||||||
|
let content: string;
|
||||||
|
try {
|
||||||
|
const response = await axios.get(sitemapUrl);
|
||||||
|
content = response.data;
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Request failed for ${sitemapUrl}: ${error}`);
|
||||||
|
return allUrls;
|
||||||
|
}
|
||||||
|
|
||||||
|
const parsed = await parseStringPromise(content);
|
||||||
|
const root = parsed.urlset || parsed.sitemapindex;
|
||||||
|
|
||||||
|
if (root && root.sitemap) {
|
||||||
|
for (const sitemap of root.sitemap) {
|
||||||
|
if (sitemap.loc && sitemap.loc.length > 0) {
|
||||||
|
await getLinksFromSitemap(sitemap.loc[0], allUrls);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if (root && root.url) {
|
||||||
|
for (const url of root.url) {
|
||||||
|
if (url.loc && url.loc.length > 0) {
|
||||||
|
allUrls.push(url.loc[0]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} catch (error) {
|
||||||
|
console.error(`Error processing ${sitemapUrl}: ${error}`);
|
||||||
|
}
|
||||||
|
|
||||||
|
return allUrls;
|
||||||
|
}
|
||||||
|
|
||||||
|
export const fetchSitemapData = async (url: string): Promise<SitemapEntry[] | null> => {
|
||||||
|
const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
|
||||||
|
try {
|
||||||
|
const response = await axios.get(sitemapUrl);
|
||||||
|
if (response.status === 200) {
|
||||||
|
const xml = response.data;
|
||||||
|
const parsedXml = await parseStringPromise(xml);
|
||||||
|
|
||||||
|
const sitemapData: SitemapEntry[] = [];
|
||||||
|
if (parsedXml.urlset && parsedXml.urlset.url) {
|
||||||
|
for (const urlElement of parsedXml.urlset.url) {
|
||||||
|
const sitemapEntry: SitemapEntry = { loc: urlElement.loc[0] };
|
||||||
|
if (urlElement.lastmod) sitemapEntry.lastmod = urlElement.lastmod[0];
|
||||||
|
if (urlElement.changefreq) sitemapEntry.changefreq = urlElement.changefreq[0];
|
||||||
|
if (urlElement.priority) sitemapEntry.priority = Number(urlElement.priority[0]);
|
||||||
|
sitemapData.push(sitemapEntry);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return sitemapData;
|
||||||
|
}
|
||||||
|
return null;
|
||||||
|
} catch (error) {
|
||||||
|
// Error handling for failed sitemap fetch
|
||||||
|
}
|
||||||
|
return [];
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface SitemapEntry {
|
||||||
|
loc: string;
|
||||||
|
lastmod?: string;
|
||||||
|
changefreq?: string;
|
||||||
|
priority?: number;
|
||||||
|
}
|
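`fetchSitemapData` above reads `urlElement.loc[0]` directly, which throws if a `<url>` element has no `<loc>` child. A minimal sketch of a more defensive mapping step follows. It assumes the parser's default output, where every child element arrives as an array of strings; `ParsedUrlElement`, `SitemapEntrySketch` and `toSitemapEntries` are hypothetical names used only for this illustration.

```typescript
// Mirrors the SitemapEntry interface above; renamed to mark it as illustrative.
interface SitemapEntrySketch {
  loc: string;
  lastmod?: string;
  changefreq?: string;
  priority?: number;
}

// Assumed shape of one parsed <url> element: each child is an array of strings.
type ParsedUrlElement = {
  loc?: string[];
  lastmod?: string[];
  changefreq?: string[];
  priority?: string[];
};

function toSitemapEntries(urlElements: ParsedUrlElement[]): SitemapEntrySketch[] {
  const entries: SitemapEntrySketch[] = [];
  for (const el of urlElements) {
    const loc = el.loc?.[0];
    if (!loc) continue; // skip malformed entries instead of throwing

    const entry: SitemapEntrySketch = { loc };
    const lastmod = el.lastmod?.[0];
    if (lastmod) entry.lastmod = lastmod;
    const changefreq = el.changefreq?.[0];
    if (changefreq) entry.changefreq = changefreq;
    const priority = el.priority?.[0];
    if (priority !== undefined) {
      const parsed = Number(priority);
      if (!Number.isNaN(parsed)) entry.priority = parsed; // drop non-numeric values
    }
    entries.push(entry);
  }
  return entries;
}
```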
109
apps/api/src/scraper/WebScraper/utils/metadata.ts
Normal file
|
@ -0,0 +1,109 @@
|
||||||
|
// import * as cheerio from 'cheerio';
|
||||||
|
import { CheerioAPI } from "cheerio";
|
||||||
|
interface Metadata {
|
||||||
|
title?: string;
|
||||||
|
description?: string;
|
||||||
|
language?: string;
|
||||||
|
keywords?: string;
|
||||||
|
robots?: string;
|
||||||
|
ogTitle?: string;
|
||||||
|
ogDescription?: string;
|
||||||
|
dctermsCreated?: string;
|
||||||
|
dcDateCreated?: string;
|
||||||
|
dcDate?: string;
|
||||||
|
dctermsType?: string;
|
||||||
|
dcType?: string;
|
||||||
|
dctermsAudience?: string;
|
||||||
|
dctermsSubject?: string;
|
||||||
|
dcSubject?: string;
|
||||||
|
dcDescription?: string;
|
||||||
|
ogImage?: string;
|
||||||
|
dctermsKeywords?: string;
|
||||||
|
modifiedTime?: string;
|
||||||
|
publishedTime?: string;
|
||||||
|
articleTag?: string;
|
||||||
|
articleSection?: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
|
||||||
|
let title: string | null = null;
|
||||||
|
let description: string | null = null;
|
||||||
|
let language: string | null = null;
|
||||||
|
let keywords: string | null = null;
|
||||||
|
let robots: string | null = null;
|
||||||
|
let ogTitle: string | null = null;
|
||||||
|
let ogDescription: string | null = null;
|
||||||
|
let dctermsCreated: string | null = null;
|
||||||
|
let dcDateCreated: string | null = null;
|
||||||
|
let dcDate: string | null = null;
|
||||||
|
let dctermsType: string | null = null;
|
||||||
|
let dcType: string | null = null;
|
||||||
|
let dctermsAudience: string | null = null;
|
||||||
|
let dctermsSubject: string | null = null;
|
||||||
|
let dcSubject: string | null = null;
|
||||||
|
let dcDescription: string | null = null;
|
||||||
|
let ogImage: string | null = null;
|
||||||
|
let dctermsKeywords: string | null = null;
|
||||||
|
let modifiedTime: string | null = null;
|
||||||
|
let publishedTime: string | null = null;
|
||||||
|
let articleTag: string | null = null;
|
||||||
|
let articleSection: string | null = null;
|
||||||
|
|
||||||
|
try {
|
||||||
|
title = soup("title").text() || null;
|
||||||
|
description = soup('meta[name="description"]').attr("content") || null;
|
||||||
|
|
||||||
|
// Assuming the language is part of the URL as per the regex pattern
|
||||||
|
const pattern = /([a-zA-Z]+-[A-Z]{2})/;
|
||||||
|
const match = pattern.exec(url);
|
||||||
|
language = match ? match[1] : null;
|
||||||
|
|
||||||
|
keywords = soup('meta[name="keywords"]').attr("content") || null;
|
||||||
|
robots = soup('meta[name="robots"]').attr("content") || null;
|
||||||
|
ogTitle = soup('meta[property="og:title"]').attr("content") || null;
|
||||||
|
ogDescription = soup('meta[property="og:description"]').attr("content") || null;
|
||||||
|
articleSection = soup('meta[name="article:section"]').attr("content") || null;
|
||||||
|
articleTag = soup('meta[name="article:tag"]').attr("content") || null;
|
||||||
|
publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
|
||||||
|
modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
|
||||||
|
ogImage = soup('meta[property="og:image"]').attr("content") || null;
|
||||||
|
dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
|
||||||
|
dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
|
||||||
|
dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
|
||||||
|
dctermsSubject = soup('meta[name="dcterms.subject"]').attr("content") || null;
|
||||||
|
dctermsAudience = soup('meta[name="dcterms.audience"]').attr("content") || null;
|
||||||
|
dcType = soup('meta[name="dc.type"]').attr("content") || null;
|
||||||
|
dctermsType = soup('meta[name="dcterms.type"]').attr("content") || null;
|
||||||
|
dcDate = soup('meta[name="dc.date"]').attr("content") || null;
|
||||||
|
dcDateCreated = soup('meta[name="dc.date.created"]').attr("content") || null;
|
||||||
|
dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
|
||||||
|
|
||||||
|
} catch (error) {
|
||||||
|
console.error("Error extracting metadata:", error);
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
...(title ? { title } : {}),
|
||||||
|
...(description ? { description } : {}),
|
||||||
|
...(language ? { language } : {}),
|
||||||
|
...(keywords ? { keywords } : {}),
|
||||||
|
...(robots ? { robots } : {}),
|
||||||
|
...(ogTitle ? { ogTitle } : {}),
|
||||||
|
...(ogDescription ? { ogDescription } : {}),
|
||||||
|
...(dctermsCreated ? { dctermsCreated } : {}),
|
||||||
|
...(dcDateCreated ? { dcDateCreated } : {}),
|
||||||
|
...(dcDate ? { dcDate } : {}),
|
||||||
|
...(dctermsType ? { dctermsType } : {}),
|
||||||
|
...(dcType ? { dcType } : {}),
|
||||||
|
...(dctermsAudience ? { dctermsAudience } : {}),
|
||||||
|
...(dctermsSubject ? { dctermsSubject } : {}),
|
||||||
|
...(dcSubject ? { dcSubject } : {}),
|
||||||
|
...(dcDescription ? { dcDescription } : {}),
|
||||||
|
...(ogImage ? { ogImage } : {}),
|
||||||
|
...(dctermsKeywords ? { dctermsKeywords } : {}),
|
||||||
|
...(modifiedTime ? { modifiedTime } : {}),
|
||||||
|
...(publishedTime ? { publishedTime } : {}),
|
||||||
|
...(articleTag ? { articleTag } : {}),
|
||||||
|
...(articleSection ? { articleSection } : {}),
|
||||||
|
};
|
||||||
|
}
|
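The return statement of `extractMetadata` above spreads each field conditionally so that empty values never reach the result. The same effect can be had with one small helper. This is a sketch under the assumption that all metadata values are strings; `compact` is a hypothetical name, not a function in the repository.

```typescript
// Hypothetical helper: keep only the fields whose value is a non-empty string.
function compact(
  fields: Record<string, string | null | undefined>
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    if (value) out[key] = value; // drops null, undefined and ""
  }
  return out;
}

// Usage with made-up values.
const example = compact({ title: "Home", description: null, keywords: "" });
console.log(example);
```

The trade-off is typing: the explicit spreads keep the precise `Metadata` shape, while this helper returns a loose string map.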
23
apps/api/src/scraper/WebScraper/utils/utils.ts
Normal file
@ -0,0 +1,23 @@
import axios from "axios";

export async function attemptScrapWithRequests(
  urlToScrap: string
): Promise<string | null> {
  try {
    const response = await axios.get(urlToScrap);

    if (!response.data) {
      console.log("Failed normal requests as well");
      return null;
    }

    return response.data;
  } catch (error) {
    console.error(`Error in attemptScrapWithRequests: ${error}`);
    return null;
  }
}

export function sanitizeText(text: string): string {
  return text.replace(/\u0000/g, "");
}
219
apps/api/src/services/billing/credit_billing.ts
Normal file
|
@ -0,0 +1,219 @@
|
||||||
|
import { supabase_service } from "../supabase";
|
||||||
|
|
||||||
|
const FREE_CREDITS = 100;
|
||||||
|
export async function billTeam(team_id: string, credits: number) {
|
||||||
|
if (team_id === "preview") {
|
||||||
|
return { success: true, message: "Preview team, no credits used" };
|
||||||
|
}
|
||||||
|
console.log(`Billing team ${team_id} for ${credits} credits`);
|
||||||
|
// When the API is used, you can log the credit usage in the credit_usage table:
|
||||||
|
// team_id: The ID of the team using the API.
|
||||||
|
// subscription_id: The ID of the team's active subscription.
|
||||||
|
// credits_used: The number of credits consumed by the API call.
|
||||||
|
// created_at: The timestamp of the API usage.
|
||||||
|
|
||||||
|
// 1. get the subscription
|
||||||
|
|
||||||
|
const { data: subscription } = await supabase_service
|
||||||
|
.from("subscriptions")
|
||||||
|
.select("*")
|
||||||
|
.eq("team_id", team_id)
|
||||||
|
.eq("status", "active")
|
||||||
|
.single();
|
||||||
|
|
||||||
|
if (!subscription) {
|
||||||
|
const { data: credit_usage } = await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.insert([
|
||||||
|
{
|
||||||
|
team_id,
|
||||||
|
credits_used: credits,
|
||||||
|
created_at: new Date(),
|
||||||
|
},
|
||||||
|
])
|
||||||
|
.select();
|
||||||
|
|
||||||
|
return { success: true, credit_usage };
|
||||||
|
}
|
||||||
|
|
||||||
|
// 2. add the credits to the credit_usage table
|
||||||
|
const { data: credit_usage } = await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.insert([
|
||||||
|
{
|
||||||
|
team_id,
|
||||||
|
subscription_id: subscription.id,
|
||||||
|
credits_used: credits,
|
||||||
|
created_at: new Date(),
|
||||||
|
},
|
||||||
|
])
|
||||||
|
.select();
|
||||||
|
|
||||||
|
return { success: true, credit_usage };
|
||||||
|
}
|
||||||
|
|
||||||
|
// if team has enough credits for the operation, return true, else return false
|
||||||
|
export async function checkTeamCredits(team_id: string, credits: number) {
|
||||||
|
if (team_id === "preview") {
|
||||||
|
return { success: true, message: "Preview team, no credits used" };
|
||||||
|
}
|
||||||
|
// 1. Retrieve the team's active subscription based on the team_id.
|
||||||
|
const { data: subscription, error: subscriptionError } =
|
||||||
|
await supabase_service
|
||||||
|
.from("subscriptions")
|
||||||
|
.select("id, price_id, current_period_start, current_period_end")
|
||||||
|
.eq("team_id", team_id)
|
||||||
|
.eq("status", "active")
|
||||||
|
.single();
|
||||||
|
|
||||||
|
if (subscriptionError || !subscription) {
|
||||||
|
const { data: creditUsages, error: creditUsageError } =
|
||||||
|
await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.select("credits_used")
|
||||||
|
.is("subscription_id", null)
|
||||||
|
.eq("team_id", team_id);
|
||||||
|
// .gte("created_at", subscription.current_period_start)
|
||||||
|
// .lte("created_at", subscription.current_period_end);
|
||||||
|
|
||||||
|
if (creditUsageError) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for team_id: ${team_id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
|
(acc, usage) => acc + usage.credits_used,
|
||||||
|
0
|
||||||
|
);
|
||||||
|
|
||||||
|
console.log("totalCreditsUsed", totalCreditsUsed);
|
||||||
|
// 5. Compare the total credits used with the credits allowed by the plan.
|
||||||
|
if (totalCreditsUsed + credits > FREE_CREDITS) {
|
||||||
|
return {
|
||||||
|
success: false,
|
||||||
|
message: "Insufficient credits, please upgrade!",
|
||||||
|
};
|
||||||
|
}
|
||||||
|
return { success: true, message: "Sufficient credits available" };
|
||||||
|
}
|
||||||
|
|
||||||
|
// 2. Get the price_id from the subscription.
|
||||||
|
const { data: price, error: priceError } = await supabase_service
|
||||||
|
.from("prices")
|
||||||
|
.select("credits")
|
||||||
|
.eq("id", subscription.price_id)
|
||||||
|
.single();
|
||||||
|
|
||||||
|
if (priceError) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve price for price_id: ${subscription.price_id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
// 4. Calculate the total credits used by the team within the current billing period.
|
||||||
|
const { data: creditUsages, error: creditUsageError } = await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.select("credits_used")
|
||||||
|
.eq("subscription_id", subscription.id)
|
||||||
|
.gte("created_at", subscription.current_period_start)
|
||||||
|
.lte("created_at", subscription.current_period_end);
|
||||||
|
|
||||||
|
if (creditUsageError) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
|
(acc, usage) => acc + usage.credits_used,
|
||||||
|
0
|
||||||
|
);
|
||||||
|
|
||||||
|
// 5. Compare the total credits used with the credits allowed by the plan.
|
||||||
|
if (totalCreditsUsed + credits > price.credits) {
|
||||||
|
return { success: false, message: "Insufficient credits, please upgrade!" };
|
||||||
|
}
|
||||||
|
|
||||||
|
return { success: true, message: "Sufficient credits available" };
|
||||||
|
}
|
||||||
|
|
||||||
|
// Count the total credits used by a team within the current billing period and return the remaining credits.
|
||||||
|
export async function countCreditsAndRemainingForCurrentBillingPeriod(
|
||||||
|
team_id: string
|
||||||
|
) {
|
||||||
|
// 1. Retrieve the team's active subscription based on the team_id.
|
||||||
|
const { data: subscription, error: subscriptionError } =
|
||||||
|
await supabase_service
|
||||||
|
.from("subscriptions")
|
||||||
|
.select("id, price_id, current_period_start, current_period_end")
|
||||||
|
.eq("team_id", team_id)
|
||||||
|
.single();
|
||||||
|
|
||||||
|
if (subscriptionError || !subscription) {
|
||||||
|
// throw new Error(`Failed to retrieve subscription for team_id: ${team_id}`);
|
||||||
|
|
||||||
|
// Free
|
||||||
|
const { data: creditUsages, error: creditUsageError } =
|
||||||
|
await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.select("credits_used")
|
||||||
|
.is("subscription_id", null)
|
||||||
|
.eq("team_id", team_id);
|
||||||
|
// .gte("created_at", subscription.current_period_start)
|
||||||
|
// .lte("created_at", subscription.current_period_end);
|
||||||
|
|
||||||
|
if (creditUsageError || !creditUsages) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for team_id: ${team_id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
|
(acc, usage) => acc + usage.credits_used,
|
||||||
|
0
|
||||||
|
);
|
||||||
|
|
||||||
|
// 4. Calculate remaining credits.
|
||||||
|
const remainingCredits = FREE_CREDITS - totalCreditsUsed;
|
||||||
|
|
||||||
|
return { totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS };
|
||||||
|
}
|
||||||
|
|
||||||
|
// 2. Get the price_id from the subscription to retrieve the total credits available.
|
||||||
|
const { data: price, error: priceError } = await supabase_service
|
||||||
|
.from("prices")
|
||||||
|
.select("credits")
|
||||||
|
.eq("id", subscription.price_id)
|
||||||
|
.single();
|
||||||
|
|
||||||
|
if (priceError || !price) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve price for price_id: ${subscription.price_id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
// 3. Calculate the total credits used by the team within the current billing period.
|
||||||
|
const { data: creditUsages, error: creditUsageError } = await supabase_service
|
||||||
|
.from("credit_usage")
|
||||||
|
.select("credits_used")
|
||||||
|
.eq("subscription_id", subscription.id)
|
||||||
|
.gte("created_at", subscription.current_period_start)
|
||||||
|
.lte("created_at", subscription.current_period_end);
|
||||||
|
|
||||||
|
if (creditUsageError || !creditUsages) {
|
||||||
|
throw new Error(
|
||||||
|
`Failed to retrieve credit usage for subscription_id: ${subscription.id}`
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
|
const totalCreditsUsed = creditUsages.reduce(
|
||||||
|
(acc, usage) => acc + usage.credits_used,
|
||||||
|
0
|
||||||
|
);
|
||||||
|
|
||||||
|
// 4. Calculate remaining credits.
|
||||||
|
const remainingCredits = price.credits - totalCreditsUsed;
|
||||||
|
|
||||||
|
return { totalCreditsUsed, remainingCredits, totalCredits: price.credits };
|
||||||
|
}
|
4
apps/api/src/services/logtail.ts
Normal file
4
apps/api/src/services/logtail.ts
Normal file
|
@ -0,0 +1,4 @@
|
||||||
|
const { Logtail } = require("@logtail/node");
|
||||||
|
// Load environment variables from .env
|
||||||
|
require("dotenv").config();
|
||||||
|
export const logtail = new Logtail(process.env.LOGTAIL_KEY);
|
17
apps/api/src/services/queue-jobs.ts
Normal file
17
apps/api/src/services/queue-jobs.ts
Normal file
|
@ -0,0 +1,17 @@
|
||||||
|
import { Job, Queue } from "bull";
|
||||||
|
import {
|
||||||
|
getWebScraperQueue,
|
||||||
|
} from "./queue-service";
|
||||||
|
import { v4 as uuidv4 } from "uuid";
|
||||||
|
import { WebScraperOptions } from "../types";
|
||||||
|
|
||||||
|
export async function addWebScraperJob(
|
||||||
|
webScraperOptions: WebScraperOptions,
|
||||||
|
options: any = {}
|
||||||
|
): Promise<Job> {
|
||||||
|
return await getWebScraperQueue().add(webScraperOptions, {
|
||||||
|
...options,
|
||||||
|
jobId: uuidv4(),
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
16
apps/api/src/services/queue-service.ts
Normal file
16
apps/api/src/services/queue-service.ts
Normal file
|
@ -0,0 +1,16 @@
|
||||||
|
import Queue from "bull";
|
||||||
|
|
||||||
|
let webScraperQueue;
|
||||||
|
|
||||||
|
export function getWebScraperQueue() {
|
||||||
|
if (!webScraperQueue) {
|
||||||
|
webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
|
||||||
|
settings: {
|
||||||
|
lockDuration: 4 * 60 * 60 * 1000, // 4 hours in milliseconds
|
||||||
|
lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
|
||||||
|
},
|
||||||
|
});
|
||||||
|
console.log("Web scraper queue created");
|
||||||
|
}
|
||||||
|
return webScraperQueue;
|
||||||
|
}
|
62
apps/api/src/services/queue-worker.ts
Normal file
62
apps/api/src/services/queue-worker.ts
Normal file
|
@ -0,0 +1,62 @@
|
||||||
|
import { CustomError } from "../lib/custom-error";
|
||||||
|
import { getWebScraperQueue } from "./queue-service";
|
||||||
|
import "dotenv/config";
|
||||||
|
import { logtail } from "./logtail";
|
||||||
|
import { startWebScraperPipeline } from "../main/runWebScraper";
|
||||||
|
import { WebScraperDataProvider } from "../scraper/WebScraper";
|
||||||
|
import { callWebhook } from "./webhook";
|
||||||
|
|
||||||
|
getWebScraperQueue().process(
|
||||||
|
Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
|
||||||
|
async function (job, done) {
|
||||||
|
try {
|
||||||
|
job.progress({
|
||||||
|
current: 1,
|
||||||
|
total: 100,
|
||||||
|
current_step: "SCRAPING",
|
||||||
|
current_url: "",
|
||||||
|
});
|
||||||
|
const { success, message, docs } = await startWebScraperPipeline({ job });
|
||||||
|
|
||||||
|
const data = {
|
||||||
|
success: success,
|
||||||
|
result: {
|
||||||
|
links: docs.map((doc) => {
|
||||||
|
return { content: doc, source: doc.metadata.sourceURL };
|
||||||
|
}),
|
||||||
|
},
|
||||||
|
project_id: job.data.project_id,
|
||||||
|
error: message /* etc... */,
|
||||||
|
};
|
||||||
|
|
||||||
|
await callWebhook(job.data.team_id, data);
|
||||||
|
done(null, data);
|
||||||
|
} catch (error) {
|
||||||
|
if (error instanceof CustomError) {
|
||||||
|
// Here we handle the error, then save the failed job
|
||||||
|
console.error(error.message); // or any other error handling
|
||||||
|
|
||||||
|
logtail.error("Custom error while ingesting", {
|
||||||
|
job_id: job.id,
|
||||||
|
error: error.message,
|
||||||
|
dataIngestionJob: error.dataIngestionJob,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
console.log(error);
|
||||||
|
|
||||||
|
logtail.error("Overall error ingesting", {
|
||||||
|
job_id: job.id,
|
||||||
|
error: error.message,
|
||||||
|
});
|
||||||
|
|
||||||
|
const data = {
|
||||||
|
success: false,
|
||||||
|
project_id: job.data.project_id,
|
||||||
|
error:
|
||||||
|
"Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
|
||||||
|
};
|
||||||
|
await callWebhook(job.data.team_id, data);
|
||||||
|
done(null, data);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
);
|
65
apps/api/src/services/rate-limiter.ts
Normal file
65
apps/api/src/services/rate-limiter.ts
Normal file
|
@ -0,0 +1,65 @@
|
||||||
|
import { RateLimiterRedis } from "rate-limiter-flexible";
|
||||||
|
import * as redis from "redis";
|
||||||
|
|
||||||
|
const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
|
||||||
|
const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
|
||||||
|
const MAX_CRAWLS_PER_MINUTE_STANDAR = 4;
|
||||||
|
const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
|
||||||
|
|
||||||
|
const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
export const redisClient = redis.createClient({
|
||||||
|
url: process.env.REDIS_URL,
|
||||||
|
legacyMode: true,
|
||||||
|
});
|
||||||
|
|
||||||
|
export const previewRateLimiter = new RateLimiterRedis({
|
||||||
|
storeClient: redisClient,
|
||||||
|
keyPrefix: "middleware",
|
||||||
|
points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
|
||||||
|
duration: 60, // Duration in seconds
|
||||||
|
});
|
||||||
|
|
||||||
|
export const serverRateLimiter = new RateLimiterRedis({
|
||||||
|
storeClient: redisClient,
|
||||||
|
keyPrefix: "middleware",
|
||||||
|
points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
|
||||||
|
duration: 60, // Duration in seconds
|
||||||
|
});
|
||||||
|
|
||||||
|
|
||||||
|
export function crawlRateLimit(plan: string){
|
||||||
|
if(plan === "standard"){
|
||||||
|
return new RateLimiterRedis({
|
||||||
|
storeClient: redisClient,
|
||||||
|
keyPrefix: "middleware",
|
||||||
|
points: MAX_CRAWLS_PER_MINUTE_STANDAR,
|
||||||
|
duration: 60, // Duration in seconds
|
||||||
|
});
|
||||||
|
}else if(plan === "scale"){
|
||||||
|
return new RateLimiterRedis({
|
||||||
|
storeClient: redisClient,
|
||||||
|
keyPrefix: "middleware",
|
||||||
|
points: MAX_CRAWLS_PER_MINUTE_SCALE,
|
||||||
|
duration: 60, // Duration in seconds
|
||||||
|
});
|
||||||
|
}
|
||||||
|
return new RateLimiterRedis({
|
||||||
|
storeClient: redisClient,
|
||||||
|
keyPrefix: "middleware",
|
||||||
|
points: MAX_CRAWLS_PER_MINUTE_STARTER,
|
||||||
|
duration: 60, // Duration in seconds
|
||||||
|
});
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
export function getRateLimiter(preview: boolean){
|
||||||
|
if(preview){
|
||||||
|
return previewRateLimiter;
|
||||||
|
}else{
|
||||||
|
return serverRateLimiter;
|
||||||
|
}
|
||||||
|
}
|
38
apps/api/src/services/redis.ts
Normal file
38
apps/api/src/services/redis.ts
Normal file
|
@ -0,0 +1,38 @@
|
||||||
|
import Redis from 'ioredis';
|
||||||
|
|
||||||
|
// Initialize Redis client
|
||||||
|
const redis = new Redis(process.env.REDIS_URL);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Set a value in Redis with an optional expiration time.
|
||||||
|
* @param {string} key The key under which to store the value.
|
||||||
|
* @param {string} value The value to store.
|
||||||
|
* @param {number} [expire] Optional expiration time in seconds.
|
||||||
|
*/
|
||||||
|
const setValue = async (key: string, value: string, expire?: number) => {
|
||||||
|
if (expire) {
|
||||||
|
await redis.set(key, value, 'EX', expire);
|
||||||
|
} else {
|
||||||
|
await redis.set(key, value);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get a value from Redis.
|
||||||
|
* @param {string} key The key of the value to retrieve.
|
||||||
|
* @returns {Promise<string|null>} The value, if found, otherwise null.
|
||||||
|
*/
|
||||||
|
const getValue = async (key: string): Promise<string | null> => {
|
||||||
|
const value = await redis.get(key);
|
||||||
|
return value;
|
||||||
|
};
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Delete a key from Redis.
|
||||||
|
* @param {string} key The key to delete.
|
||||||
|
*/
|
||||||
|
const deleteKey = async (key: string) => {
|
||||||
|
await redis.del(key);
|
||||||
|
};
|
||||||
|
|
||||||
|
export { setValue, getValue, deleteKey };
|
6
apps/api/src/services/supabase.ts
Normal file
6
apps/api/src/services/supabase.ts
Normal file
|
@ -0,0 +1,6 @@
|
||||||
|
import { createClient } from "@supabase/supabase-js";
|
||||||
|
|
||||||
|
export const supabase_service = createClient<any>(
|
||||||
|
process.env.SUPABASE_URL,
|
||||||
|
process.env.SUPABASE_SERVICE_TOKEN,
|
||||||
|
);
|
41
apps/api/src/services/webhook.ts
Normal file
41
apps/api/src/services/webhook.ts
Normal file
|
@ -0,0 +1,41 @@
|
||||||
|
import { supabase_service } from "./supabase";
|
||||||
|
|
||||||
|
export const callWebhook = async (teamId: string, data: any) => {
|
||||||
|
const { data: webhooksData, error } = await supabase_service
|
||||||
|
.from('webhooks')
|
||||||
|
.select('url')
|
||||||
|
.eq('team_id', teamId)
|
||||||
|
.limit(1);
|
||||||
|
|
||||||
|
if (error) {
|
||||||
|
console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!webhooksData || webhooksData.length === 0) {
|
||||||
|
return null;
|
||||||
|
}
|
||||||
|
|
||||||
|
let dataToSend = [];
|
||||||
|
if (data.result?.links && data.result.links.length !== 0) {
|
||||||
|
for (let i = 0; i < data.result.links.length; i++) {
|
||||||
|
dataToSend.push({
|
||||||
|
content: data.result.links[i].content.content,
|
||||||
|
markdown: data.result.links[i].content.markdown,
|
||||||
|
metadata: data.result.links[i].content.metadata,
|
||||||
|
});
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
await fetch(webhooksData[0].url, {
|
||||||
|
method: 'POST',
|
||||||
|
headers: {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
},
|
||||||
|
body: JSON.stringify({
|
||||||
|
success: data.success,
|
||||||
|
data: dataToSend,
|
||||||
|
error: data.error || undefined,
|
||||||
|
}),
|
||||||
|
});
|
||||||
|
}
|
2
apps/api/src/strings.ts
Normal file
2
apps/api/src/strings.ts
Normal file
|
@ -0,0 +1,2 @@
|
||||||
|
export const errorNoResults =
|
||||||
|
"No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";
|
1514
apps/api/src/supabase_types.ts
Normal file
1514
apps/api/src/supabase_types.ts
Normal file
File diff suppressed because it is too large
Load Diff
26
apps/api/src/types.ts
Normal file
26
apps/api/src/types.ts
Normal file
|
@ -0,0 +1,26 @@
|
||||||
|
export interface CrawlResult {
|
||||||
|
source: string;
|
||||||
|
content: string;
|
||||||
|
options?: {
|
||||||
|
summarize?: boolean;
|
||||||
|
summarize_max_chars?: number;
|
||||||
|
};
|
||||||
|
metadata?: any;
|
||||||
|
raw_context_id?: number | string;
|
||||||
|
permissions?: any[];
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface IngestResult {
|
||||||
|
success: boolean;
|
||||||
|
error: string;
|
||||||
|
data: CrawlResult[];
|
||||||
|
}
|
||||||
|
|
||||||
|
export interface WebScraperOptions {
|
||||||
|
url: string;
|
||||||
|
mode: "crawl" | "single_urls" | "sitemap";
|
||||||
|
crawlerOptions: any;
|
||||||
|
team_id: string;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
17
apps/api/tsconfig.json
Normal file
17
apps/api/tsconfig.json
Normal file
|
@ -0,0 +1,17 @@
|
||||||
|
{
|
||||||
|
"compilerOptions": {
|
||||||
|
"rootDir": "./src",
|
||||||
|
"lib": ["es6","DOM"],
|
||||||
|
"target": "ES2020", // or higher
|
||||||
|
"module": "commonjs",
|
||||||
|
"esModuleInterop": true,
|
||||||
|
"sourceMap": true,
|
||||||
|
"outDir": "./dist/src",
|
||||||
|
"moduleResolution": "node",
|
||||||
|
"baseUrl": ".",
|
||||||
|
"paths": {
|
||||||
|
"*": ["node_modules/*", "src/types/*"],
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"include": ["src/","src/**/*", "services/db/supabase.ts", "utils/utils.ts", "services/db/supabaseEmbeddings.ts", "utils/EventEmmitter.ts", "src/services/queue-service.ts"]
|
||||||
|
}
|
BIN
apps/playwright-service/.DS_Store
vendored
Normal file
BIN
apps/playwright-service/.DS_Store
vendored
Normal file
Binary file not shown.
152
apps/playwright-service/.gitignore
vendored
Normal file
152
apps/playwright-service/.gitignore
vendored
Normal file
|
@ -0,0 +1,152 @@
|
||||||
|
# Byte-compiled / optimized / DLL files
|
||||||
|
__pycache__/
|
||||||
|
*.py[cod]
|
||||||
|
*$py.class
|
||||||
|
|
||||||
|
# C extensions
|
||||||
|
*.so
|
||||||
|
|
||||||
|
# Distribution / packaging
|
||||||
|
.Python
|
||||||
|
build/
|
||||||
|
develop-eggs/
|
||||||
|
dist/
|
||||||
|
downloads/
|
||||||
|
eggs/
|
||||||
|
.eggs/
|
||||||
|
lib/
|
||||||
|
lib64/
|
||||||
|
parts/
|
||||||
|
sdist/
|
||||||
|
var/
|
||||||
|
wheels/
|
||||||
|
share/python-wheels/
|
||||||
|
*.egg-info/
|
||||||
|
.installed.cfg
|
||||||
|
*.egg
|
||||||
|
MANIFEST
|
||||||
|
|
||||||
|
# PyInstaller
|
||||||
|
# Usually these files are written by a python script from a template
|
||||||
|
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
||||||
|
*.manifest
|
||||||
|
*.spec
|
||||||
|
|
||||||
|
# Installer logs
|
||||||
|
pip-log.txt
|
||||||
|
pip-delete-this-directory.txt
|
||||||
|
|
||||||
|
# Unit test / coverage reports
|
||||||
|
htmlcov/
|
||||||
|
.tox/
|
||||||
|
.nox/
|
||||||
|
.coverage
|
||||||
|
.coverage.*
|
||||||
|
.cache
|
||||||
|
nosetests.xml
|
||||||
|
coverage.xml
|
||||||
|
*.cover
|
||||||
|
*.py,cover
|
||||||
|
.hypothesis/
|
||||||
|
.pytest_cache/
|
||||||
|
cover/
|
||||||
|
|
||||||
|
# Translations
|
||||||
|
*.mo
|
||||||
|
*.pot
|
||||||
|
|
||||||
|
# Django stuff:
|
||||||
|
*.log
|
||||||
|
local_settings.py
|
||||||
|
db.sqlite3
|
||||||
|
db.sqlite3-journal
|
||||||
|
|
||||||
|
# Flask stuff:
|
||||||
|
instance/
|
||||||
|
.webassets-cache
|
||||||
|
|
||||||
|
# Scrapy stuff:
|
||||||
|
.scrapy
|
||||||
|
|
||||||
|
# Sphinx documentation
|
||||||
|
docs/_build/
|
||||||
|
|
||||||
|
# PyBuilder
|
||||||
|
.pybuilder/
|
||||||
|
target/
|
||||||
|
|
||||||
|
# Jupyter Notebook
|
||||||
|
.ipynb_checkpoints
|
||||||
|
|
||||||
|
# IPython
|
||||||
|
profile_default/
|
||||||
|
ipython_config.py
|
||||||
|
|
||||||
|
# pyenv
|
||||||
|
# For a library or package, you might want to ignore these files since the code is
|
||||||
|
# intended to run in multiple environments; otherwise, check them in:
|
||||||
|
# .python-version
|
||||||
|
|
||||||
|
# pipenv
|
||||||
|
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
||||||
|
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
||||||
|
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
||||||
|
# install all needed dependencies.
|
||||||
|
#Pipfile.lock
|
||||||
|
|
||||||
|
# poetry
|
||||||
|
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
|
||||||
|
# This is especially recommended for binary packages to ensure reproducibility, and is more
|
||||||
|
# commonly ignored for libraries.
|
||||||
|
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
|
||||||
|
#poetry.lock
|
||||||
|
|
||||||
|
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
|
||||||
|
__pypackages__/
|
||||||
|
|
||||||
|
# Celery stuff
|
||||||
|
celerybeat-schedule
|
||||||
|
celerybeat.pid
|
||||||
|
|
||||||
|
# SageMath parsed files
|
||||||
|
*.sage.py
|
||||||
|
|
||||||
|
# Environments
|
||||||
|
.env
|
||||||
|
.venv
|
||||||
|
env/
|
||||||
|
venv/
|
||||||
|
ENV/
|
||||||
|
env.bak/
|
||||||
|
venv.bak/
|
||||||
|
|
||||||
|
# Spyder project settings
|
||||||
|
.spyderproject
|
||||||
|
.spyproject
|
||||||
|
|
||||||
|
# Rope project settings
|
||||||
|
.ropeproject
|
||||||
|
|
||||||
|
# mkdocs documentation
|
||||||
|
/site
|
||||||
|
|
||||||
|
# mypy
|
||||||
|
.mypy_cache/
|
||||||
|
.dmypy.json
|
||||||
|
dmypy.json
|
||||||
|
|
||||||
|
# Pyre type checker
|
||||||
|
.pyre/
|
||||||
|
|
||||||
|
# pytype static type analyzer
|
||||||
|
.pytype/
|
||||||
|
|
||||||
|
# Cython debug symbols
|
||||||
|
cython_debug/
|
||||||
|
|
||||||
|
# PyCharm
|
||||||
|
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
|
||||||
|
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
|
||||||
|
# and can be added to the global gitignore or merged into this file. For a more nuclear
|
||||||
|
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
|
||||||
|
#.idea/
|
38
apps/playwright-service/Dockerfile
Normal file
38
apps/playwright-service/Dockerfile
Normal file
|
@ -0,0 +1,38 @@
|
||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
ENV PYTHONUNBUFFERED=1
|
||||||
|
ENV PYTHONDONTWRITEBYTECODE=1
|
||||||
|
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
|
||||||
|
|
||||||
|
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||||
|
gcc \
|
||||||
|
libstdc++6
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Install Python dependencies
|
||||||
|
COPY requirements.txt ./
|
||||||
|
|
||||||
|
# Remove py which is pulled in by retry, py is not needed and is a CVE
|
||||||
|
RUN pip install --no-cache-dir --upgrade -r requirements.txt && \
|
||||||
|
pip uninstall -y py && \
|
||||||
|
playwright install chromium && playwright install-deps chromium && \
|
||||||
|
ln -s /usr/local/bin/supervisord /usr/bin/supervisord
|
||||||
|
|
||||||
|
# Cleanup for CVEs and size reduction
|
||||||
|
# https://github.com/tornadoweb/tornado/issues/3107
|
||||||
|
# xserver-common and xvfb included by playwright installation but not needed after
|
||||||
|
# perl-base is part of the base Python Debian image but not needed for Danswer functionality
|
||||||
|
# perl-base could only be removed with --allow-remove-essential
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
COPY . ./
|
||||||
|
|
||||||
|
EXPOSE $PORT
|
||||||
|
# run fast api hypercorn
|
||||||
|
CMD hypercorn main:app --bind [::]:$PORT
|
||||||
|
# CMD ["hypercorn", "main:app", "--bind", "[::]:$PORT"]
|
||||||
|
# CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port $PORT"]
|
0
apps/playwright-service/README.md
Normal file
0
apps/playwright-service/README.md
Normal file
28
apps/playwright-service/main.py
Normal file
28
apps/playwright-service/main.py
Normal file
|
@ -0,0 +1,28 @@
|
||||||
|
from fastapi import FastAPI, Response
|
||||||
|
from playwright.async_api import async_playwright
|
||||||
|
import os
|
||||||
|
from fastapi.responses import JSONResponse
|
||||||
|
from pydantic import BaseModel
|
||||||
|
app = FastAPI()
|
||||||
|
|
||||||
|
from pydantic import BaseModel
|
||||||
|
|
||||||
|
class UrlModel(BaseModel):
|
||||||
|
url: str
|
||||||
|
|
||||||
|
@app.post("/html") # Kept as POST to accept body parameters
|
||||||
|
async def root(body: UrlModel): # Using Pydantic model for request body
|
||||||
|
async with async_playwright() as p:
|
||||||
|
browser = await p.chromium.launch()
|
||||||
|
|
||||||
|
context = await browser.new_context()
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
await page.goto(body.url) # Adjusted to use the url from the request body model
|
||||||
|
page_content = await page.content() # Get the HTML content of the page
|
||||||
|
|
||||||
|
await browser.close()
|
||||||
|
|
||||||
|
json_compatible_item_data = {"content": page_content}
|
||||||
|
return JSONResponse(content=json_compatible_item_data)
|
||||||
|
|
0
apps/playwright-service/requests.http
Normal file
0
apps/playwright-service/requests.http
Normal file
4
apps/playwright-service/requirements.txt
Normal file
4
apps/playwright-service/requirements.txt
Normal file
|
@ -0,0 +1,4 @@
|
||||||
|
hypercorn==0.16.0
|
||||||
|
fastapi==0.110.0
|
||||||
|
playwright==1.42.0
|
||||||
|
uvicorn
|
1
apps/playwright-service/runtime.txt
Normal file
1
apps/playwright-service/runtime.txt
Normal file
|
@ -0,0 +1 @@
|
||||||
|
3.11
|
91
apps/python-sdk/README.md
Normal file
91
apps/python-sdk/README.md
Normal file
|
@ -0,0 +1,91 @@
|
||||||
|
# Firecrawl Python SDK
|
||||||
|
|
||||||
|
The Firecrawl Python SDK is a library for scraping and crawling websites and returning the data in a format ready for use with large language models (LLMs). It provides a simple interface for interacting with the Firecrawl API.
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
To install the Firecrawl Python SDK, you can use pip:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pip install firecrawl-py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
|
||||||
|
2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
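
The key lookup in step 2 can be sketched as follows, assuming the order used in the SDK's constructor: an explicit argument takes precedence, and the `FIRECRAWL_API_KEY` environment variable is the fallback. The `resolve_api_key` helper is hypothetical and only illustrates that order; it is not part of the SDK.

```python
import os


def resolve_api_key(api_key=None):
    """Return the explicit key if given, else the FIRECRAWL_API_KEY env var."""
    # Hypothetical helper mirroring the lookup order in FirecrawlApp.__init__.
    key = api_key or os.getenv('FIRECRAWL_API_KEY')
    if key is None:
        raise ValueError('No API key provided')
    return key
```

Keeping the key in the environment avoids committing it to source control alongside example scripts.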
|
||||||
|
|
||||||
|
|
||||||
|
Here's an example of how to use the SDK:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from firecrawl import FirecrawlApp
|
||||||
|
|
||||||
|
# Initialize the FirecrawlApp with your API key
|
||||||
|
app = FirecrawlApp(api_key='your_api_key')
|
||||||
|
|
||||||
|
# Scrape a single URL
|
||||||
|
url = 'https://mendable.ai'
|
||||||
|
scraped_data = app.scrape_url(url)
|
||||||
|
|
||||||
|
# Crawl a website
|
||||||
|
crawl_url = 'https://mendable.ai'
|
||||||
|
crawl_params = {
|
||||||
|
'crawlerOptions': {
|
||||||
|
'excludes': ['blog/*'],
|
||||||
|
'includes': [], # leave empty for all pages
|
||||||
|
'limit': 1000,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
crawl_result = app.crawl_url(crawl_url, params=crawl_params)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Scraping a URL
|
||||||
|
|
||||||
|
To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
|
||||||
|
|
||||||
|
```python
|
||||||
|
url = 'https://example.com'
|
||||||
|
scraped_data = app.scrape_url(url)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Crawling a Website
|
||||||
|
|
||||||
|
To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as URL patterns to exclude or include and the maximum number of pages to crawl.
|
||||||
|
|
||||||
|
The `wait_until_done` parameter determines whether the method waits for the crawl job to complete before returning. If set to `True` (the default), the method polls the status of the crawl job until it completes, pausing `timeout` seconds between checks (values below 2 are raised to 2). If set to `False`, the method returns immediately with the job ID, and you can check the status of the crawl job yourself using the `check_crawl_status` method.
|
||||||
|
|
||||||
|
```python
|
||||||
|
crawl_url = 'https://example.com'
|
||||||
|
crawl_params = {
|
||||||
|
'crawlerOptions': {
|
||||||
|
'excludes': ['blog/*'],
|
||||||
|
'includes': [], # leave empty for all pages
|
||||||
|
'limit': 1000,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
|
||||||
|
```
|
||||||
|
|
||||||
|
If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
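
The `crawl_params` dictionary used in the example above can also be assembled by a small helper. This is a sketch only: `build_crawl_params` is a hypothetical function, not part of the SDK, and it covers just the three `crawlerOptions` keys shown in this README.

```python
def build_crawl_params(excludes=None, includes=None, limit=1000):
    """Build a `params` dict in the shape used by the crawl examples."""
    # Hypothetical helper; the keys mirror the `crawlerOptions` example above.
    return {
        'crawlerOptions': {
            'excludes': list(excludes or []),
            'includes': list(includes or []),  # empty means all pages
            'limit': limit,
        }
    }


crawl_params = build_crawl_params(excludes=['blog/*'], limit=100)
```

The result can be passed as `params=crawl_params` to `crawl_url`.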
|
||||||
|
|
||||||
|
### Checking Crawl Status
|
||||||
|
|
||||||
|
To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job.
|
||||||
|
|
||||||
|
```python
|
||||||
|
job_id = crawl_result['jobId']  # crawl_result must come from crawl_url(..., wait_until_done=False)
|
||||||
|
status = app.check_crawl_status(job_id)
|
||||||
|
```
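
When a crawl is started with `wait_until_done=False`, the status check can be repeated until the job finishes. The following is a sketch under stated assumptions: `wait_for_crawl`, `poll_interval`, and `max_wait` are hypothetical names that do not exist in the SDK, and the status strings are the ones the SDK source treats as still running. Unlike the SDK's built-in polling, this version gives up after an overall deadline.

```python
import time


def wait_for_crawl(check_status, job_id, poll_interval=2, max_wait=300, sleep=time.sleep):
    """Poll `check_status(job_id)` until the job completes or `max_wait` seconds pass."""
    # Statuses that mean the job is still running (taken from the SDK source).
    in_progress = {'active', 'paused', 'pending', 'queued'}
    waited = 0
    while True:
        status = check_status(job_id)
        state = status.get('status')
        if state == 'completed':
            return status.get('data')
        if state not in in_progress:
            raise RuntimeError(f'Crawl job failed or was stopped. Status: {state}')
        if waited >= max_wait:
            raise TimeoutError(f'Crawl job {job_id} did not finish within {max_wait} seconds')
        sleep(poll_interval)
        waited += poll_interval
```

With a real client, this might be called as `wait_for_crawl(app.check_crawl_status, job_id)`. Passing the status function in as an argument keeps the helper independent of the SDK and easy to test with a stub.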
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
The SDK handles errors returned by the Firecrawl API and raises appropriate exceptions. If an error occurs during a request, an exception will be raised with a descriptive error message.
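
A minimal sketch of catching these errors, assuming (as the SDK source suggests) that failures are raised as plain `Exception` instances. The `scrape_or_none` wrapper is hypothetical and not part of the SDK; `app` is expected to be a `FirecrawlApp` or any object with a compatible `scrape_url` method.

```python
def scrape_or_none(app, url):
    """Return the scraped data, or None if the SDK raises for this URL."""
    try:
        return app.scrape_url(url)
    except Exception as exc:  # the SDK source raises plain Exception
        print(f'Scrape failed for {url}: {exc}')
        return None
```

Returning `None` is one possible policy; callers that need to distinguish failure causes may prefer to let the exception propagate and inspect its message.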
|
||||||
|
|
||||||
|
## Contributing
|
||||||
|
|
||||||
|
Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
|
||||||
|
|
||||||
|
## License
|
||||||
|
|
||||||
|
The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).
|
1
apps/python-sdk/build/lib/firecrawl/__init__.py
Normal file
1
apps/python-sdk/build/lib/firecrawl/__init__.py
Normal file
|
@ -0,0 +1 @@
|
||||||
|
from .firecrawl import FirecrawlApp
|
96
apps/python-sdk/build/lib/firecrawl/firecrawl.py
Normal file
96
apps/python-sdk/build/lib/firecrawl/firecrawl.py
Normal file
|
@ -0,0 +1,96 @@
|
||||||
|
import os
|
||||||
|
import requests
|
||||||
|
|
||||||
|
class FirecrawlApp:
|
||||||
|
def __init__(self, api_key=None):
|
||||||
|
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
||||||
|
if self.api_key is None:
|
||||||
|
raise ValueError('No API key provided')
|
||||||
|
|
||||||
|
def scrape_url(self, url, params=None):
|
||||||
|
headers = {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
|
}
|
||||||
|
json_data = {'url': url}
|
||||||
|
if params:
|
||||||
|
json_data.update(params)
|
||||||
|
response = requests.post(
|
||||||
|
'https://api.firecrawl.dev/v0/scrape',
|
||||||
|
headers=headers,
|
||||||
|
json=json_data
|
||||||
|
)
|
||||||
|
if response.status_code == 200:
|
||||||
|
response = response.json()
|
||||||
|
if response['success']:
|
||||||
|
return response['data']
|
||||||
|
else:
|
||||||
|
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
||||||
|
|
||||||
|
elif response.status_code in [402, 409, 500]:
|
||||||
|
error_message = response.json().get('error', 'Unknown error occurred')
|
||||||
|
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
|
||||||
|
else:
|
||||||
|
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
|
||||||
|
|
||||||
|
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
|
||||||
|
headers = self._prepare_headers()
|
||||||
|
json_data = {'url': url}
|
||||||
|
if params:
|
||||||
|
json_data.update(params)
|
||||||
|
response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
|
||||||
|
if response.status_code == 200:
|
||||||
|
job_id = response.json().get('jobId')
|
||||||
|
if wait_until_done:
|
||||||
|
return self._monitor_job_status(job_id, headers, timeout)
|
||||||
|
else:
|
||||||
|
return {'jobId': job_id}
|
||||||
|
else:
|
||||||
|
self._handle_error(response, 'start crawl job')
|
||||||
|
|
||||||
|
def check_crawl_status(self, job_id):
|
||||||
|
headers = self._prepare_headers()
|
||||||
|
response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
|
||||||
|
if response.status_code == 200:
|
||||||
|
return response.json()
|
||||||
|
else:
|
||||||
|
self._handle_error(response, 'check crawl status')
|
||||||
|
|
||||||
|
def _prepare_headers(self):
|
||||||
|
return {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
|
}
|
||||||
|
|
||||||
|
def _post_request(self, url, data, headers):
|
||||||
|
return requests.post(url, headers=headers, json=data)
|
||||||
|
|
||||||
|
def _get_request(self, url, headers):
|
||||||
|
return requests.get(url, headers=headers)
|
||||||
|
|
||||||
|
def _monitor_job_status(self, job_id, headers, timeout):
|
||||||
|
import time
|
||||||
|
while True:
|
||||||
|
status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
|
||||||
|
if status_response.status_code == 200:
|
||||||
|
status_data = status_response.json()
|
||||||
|
if status_data['status'] == 'completed':
|
||||||
|
if 'data' in status_data:
|
||||||
|
return status_data['data']
|
||||||
|
else:
|
||||||
|
raise Exception('Crawl job completed but no data was returned')
|
||||||
|
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
|
||||||
|
if timeout < 2:
|
||||||
|
timeout = 2
|
||||||
|
time.sleep(timeout) # Wait for the specified timeout before checking again
|
||||||
|
else:
|
||||||
|
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
||||||
|
else:
|
||||||
|
self._handle_error(status_response, 'check crawl status')
|
||||||
|
|
||||||
|
def _handle_error(self, response, action):
|
||||||
|
if response.status_code in [402, 409, 500]:
|
||||||
|
error_message = response.json().get('error', 'Unknown error occurred')
|
||||||
|
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
|
||||||
|
else:
|
||||||
|
raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
|
BIN
apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
vendored
Normal file
BIN
apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
vendored
Normal file
Binary file not shown.
BIN
apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
vendored
Normal file
BIN
apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
vendored
Normal file
Binary file not shown.
13
apps/python-sdk/example.py
Normal file
13
apps/python-sdk/example.py
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
from firecrawl import FirecrawlApp
|
||||||
|
|
||||||
|
|
||||||
|
app = FirecrawlApp(api_key="YOUR_API_KEY")
|
||||||
|
|
||||||
|
crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
|
||||||
|
print(crawl_result[0]['markdown'])
|
||||||
|
|
||||||
|
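# A blocking crawl returns the documents, not a job ID, so start a non-blocking crawl here.
job_id = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, wait_until_done=False)['jobId']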
|
||||||
|
print(job_id)
|
||||||
|
|
||||||
|
status = app.check_crawl_status(job_id)
|
||||||
|
print(status)
|
1
apps/python-sdk/firecrawl/__init__.py
Normal file
1
apps/python-sdk/firecrawl/__init__.py
Normal file
|
@ -0,0 +1 @@
|
||||||
|
from .firecrawl import FirecrawlApp
|
BIN
apps/python-sdk/firecrawl/__pycache__/__init__.cpython-311.pyc
Normal file
BIN
apps/python-sdk/firecrawl/__pycache__/__init__.cpython-311.pyc
Normal file
Binary file not shown.
BIN
apps/python-sdk/firecrawl/__pycache__/firecrawl.cpython-311.pyc
Normal file
BIN
apps/python-sdk/firecrawl/__pycache__/firecrawl.cpython-311.pyc
Normal file
Binary file not shown.
96
apps/python-sdk/firecrawl/firecrawl.py
Normal file
96
apps/python-sdk/firecrawl/firecrawl.py
Normal file
|
@ -0,0 +1,96 @@
|
||||||
|
import os
|
||||||
|
import requests
|
||||||
|
|
||||||
|
class FirecrawlApp:
|
||||||
|
def __init__(self, api_key=None):
|
||||||
|
self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
|
||||||
|
if self.api_key is None:
|
||||||
|
raise ValueError('No API key provided')
|
||||||
|
|
||||||
|
def scrape_url(self, url, params=None):
|
||||||
|
headers = {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
|
}
|
||||||
|
json_data = {'url': url}
|
||||||
|
if params:
|
||||||
|
json_data.update(params)
|
||||||
|
response = requests.post(
|
||||||
|
'https://api.firecrawl.dev/v0/scrape',
|
||||||
|
headers=headers,
|
||||||
|
json=json_data
|
||||||
|
)
|
||||||
|
if response.status_code == 200:
|
||||||
|
response = response.json()
|
||||||
|
if response['success']:
|
||||||
|
return response['data']
|
||||||
|
else:
|
||||||
|
raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
|
||||||
|
|
||||||
|
elif response.status_code in [402, 409, 500]:
|
||||||
|
error_message = response.json().get('error', 'Unknown error occurred')
|
||||||
|
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
|
||||||
|
else:
|
||||||
|
raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
|
||||||
|
|
||||||
|
def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
|
||||||
|
headers = self._prepare_headers()
|
||||||
|
json_data = {'url': url}
|
||||||
|
if params:
|
||||||
|
json_data.update(params)
|
||||||
|
response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
|
||||||
|
if response.status_code == 200:
|
||||||
|
job_id = response.json().get('jobId')
|
||||||
|
if wait_until_done:
|
||||||
|
return self._monitor_job_status(job_id, headers, timeout)
|
||||||
|
else:
|
||||||
|
return {'jobId': job_id}
|
||||||
|
else:
|
||||||
|
self._handle_error(response, 'start crawl job')
|
||||||
|
|
||||||
|
def check_crawl_status(self, job_id):
|
||||||
|
headers = self._prepare_headers()
|
||||||
|
response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
|
||||||
|
if response.status_code == 200:
|
||||||
|
return response.json()
|
||||||
|
else:
|
||||||
|
self._handle_error(response, 'check crawl status')
|
||||||
|
|
||||||
|
def _prepare_headers(self):
|
||||||
|
return {
|
||||||
|
'Content-Type': 'application/json',
|
||||||
|
'Authorization': f'Bearer {self.api_key}'
|
||||||
|
}
|
||||||
|
|
||||||
|
def _post_request(self, url, data, headers):
|
||||||
|
return requests.post(url, headers=headers, json=data)
|
||||||
|
|
||||||
|
def _get_request(self, url, headers):
|
||||||
|
return requests.get(url, headers=headers)
|
||||||
|
|
||||||
|
def _monitor_job_status(self, job_id, headers, timeout):
|
||||||
|
import time
|
||||||
|
while True:
|
||||||
|
status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
|
||||||
|
if status_response.status_code == 200:
|
||||||
|
status_data = status_response.json()
|
||||||
|
if status_data['status'] == 'completed':
|
||||||
|
if 'data' in status_data:
|
||||||
|
return status_data['data']
|
||||||
|
else:
|
||||||
|
raise Exception('Crawl job completed but no data was returned')
|
||||||
|
elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
|
||||||
|
if timeout < 2:
|
||||||
|
timeout = 2
|
||||||
|
time.sleep(timeout) # Wait for the specified timeout before checking again
|
||||||
|
else:
|
||||||
|
raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
|
||||||
|
else:
|
||||||
|
self._handle_error(status_response, 'check crawl status')
|
||||||
|
|
||||||
|
def _handle_error(self, response, action):
|
||||||
|
if response.status_code in [402, 409, 500]:
|
||||||
|
error_message = response.json().get('error', 'Unknown error occurred')
|
||||||
|
raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
|
||||||
|
else:
|
||||||
|
raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
|
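In `_monitor_job_status`, the `timeout` argument is the delay between polls rather than an upper bound on the total wait, so a job that never leaves the `active` or `queued` state keeps the caller blocked indefinitely. The following is a minimal sketch of the same loop with an overall deadline, assuming the status payload shape used in this file (`status` and `data` keys). `poll_until_done`, `fetch_status`, `poll_interval`, and `max_wait` are hypothetical names, not part of the SDK, and the status fetcher is injected so the sketch runs without network access.

```python
import time


def poll_until_done(fetch_status, poll_interval=2.0, max_wait=60.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job completes, fails, or max_wait elapses.

    fetch_status is any callable returning a dict such as
    {'status': 'active'} or {'status': 'completed', 'data': [...]}.
    """
    pending = {'active', 'paused', 'pending', 'queued'}
    deadline = clock() + max_wait
    while True:
        status_data = fetch_status()
        status = status_data['status']
        if status == 'completed':
            if 'data' in status_data:
                return status_data['data']
            raise Exception('Crawl job completed but no data was returned')
        if status not in pending:
            raise Exception(f'Crawl job failed or was stopped. Status: {status}')
        # Give up once the overall deadline has passed instead of looping forever.
        if clock() >= deadline:
            raise TimeoutError(f'Crawl job still {status} after {max_wait} seconds')
        sleep(poll_interval)


if __name__ == '__main__':
    # Fake status source: two in-progress responses, then a completed one.
    responses = iter([
        {'status': 'queued'},
        {'status': 'active'},
        {'status': 'completed', 'data': [{'markdown': '# Hello'}]},
    ])
    pages = poll_until_done(lambda: next(responses), poll_interval=0, max_wait=5)
    print(pages[0]['markdown'])
```

In the SDK itself, `fetch_status` would presumably wrap the existing `check_crawl_status(job_id)` call; that wiring is left out here because it needs a live API key.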
7
apps/python-sdk/firecrawl_py.egg-info/PKG-INFO
Normal file
@@ -0,0 +1,7 @@
Metadata-Version: 2.1
Name: firecrawl-py
Version: 0.0.5
Summary: Python SDK for Firecrawl API
Home-page: https://github.com/mendableai/firecrawl-py
Author: Mendable.ai
Author-email: nick@mendable.ai
9
apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt
Normal file
@@ -0,0 +1,9 @@
README.md
setup.py
firecrawl/__init__.py
firecrawl/firecrawl.py
firecrawl_py.egg-info/PKG-INFO
firecrawl_py.egg-info/SOURCES.txt
firecrawl_py.egg-info/dependency_links.txt
firecrawl_py.egg-info/requires.txt
firecrawl_py.egg-info/top_level.txt
@@ -0,0 +1 @@

1
apps/python-sdk/firecrawl_py.egg-info/requires.txt
Normal file
@@ -0,0 +1 @@
requests
1
apps/python-sdk/firecrawl_py.egg-info/top_level.txt
Normal file
@@ -0,0 +1 @@
firecrawl
14
apps/python-sdk/setup.py
Normal file
@@ -0,0 +1,14 @@
from setuptools import setup, find_packages

setup(
    name='firecrawl-py',
    version='0.0.5',
    url='https://github.com/mendableai/firecrawl-py',
    author='Mendable.ai',
    author_email='nick@mendable.ai',
    description='Python SDK for Firecrawl API',
    packages=find_packages(),
    install_requires=[
        'requests',
    ],
)
1
apps/www/README.md
Normal file
@@ -0,0 +1 @@
Coming soon!