Initial commit

2024-11-16 03:32:22 +08:00 · 2024-04-15 17:01:47 -04:00 · 2024-04-15 17:01:47 -04:00 · a6c2a87811
commit a6c2a87811
74 changed files with 10873 additions and 0 deletions
--- a/.DS_Store
+++ b/.DS_Store
--- a/.gitattributes
+++ b/.gitattributes
@ -0,0 +1,2 @@
 # Auto detect text files and perform LF normalization
 * text=auto
--- a/.github/workflows/fly.yml
+++ b/.github/workflows/fly.yml
@ -0,0 +1,20 @@
 name: Fly Deploy
 on:
  push:
    branches:
      - main
  # schedule:
  #   - cron: '0 */4 * * *'
 jobs:
  deploy:
    name: Deploy app
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: superfly/flyctl-actions/setup-flyctl@master
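      - name: Deploy from apps/api
        # Each step starts in the repository root, so set the directory on the deploy step itself.
        working-directory: apps/api
        run: flyctl deploy --remote-only
        env:
          FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}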
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,6 @@
 /node_modules/
 /dist/
 .env
 *.csv
 dump.rdb
 /mongo-data
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -0,0 +1,4 @@
 # Contributing
 We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
--- a/201
+++ b/201
@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/
   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
   1. Definitions.
      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.
      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.
      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.
      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.
      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.
      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.
      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).
      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.
      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."
      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.
   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.
   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.
   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:
      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and
      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and
      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and
      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.
      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.
   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.
   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.
   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.
   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.
   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.
   END OF TERMS AND CONDITIONS
   APPENDIX: How to apply the Apache License to your work.
      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.
   Copyright 2024 Firecrawl | Mendable.ai
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
       http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
--- a/README.md
+++ b/README.md
@ -0,0 +1,108 @@
 # 🔥 Firecrawl
 Crawl and convert any website into clean markdown
 *This repo is still in early development.*
 ## What is Firecrawl?
 [Firecrawl](https://firecrawl.dev?ref=github) is an API service that takes a URL, crawls it, and converts it into clean markdown. We crawl all accessible subpages and give you clean markdown for each. No sitemap required.
 ## How to use it?
 We provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://firecrawl.com/playground). You can also self-host the backend if you'd like.
 - [x] API
 - [x] Python SDK
 - [ ] JS SDK (coming soon)
 To self-host, refer to the guide [here](https://github.com/mendableai/firecrawl/blob/main/SELF_HOST.md).
 ### API Key
 To use the API, you need to sign up on [Firecrawl](https://firecrawl.com) and get an API key.
 ### Crawling
 Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
 ```bash
 curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://mendable.ai"
    }'
 ```
 Returns a `jobId`:
 ```json
 { "jobId": "1234-5678-9101" }
 ```
 ### Check Crawl Job
 Used to check the status of a crawl job and get its result.
 ```bash
 curl -X GET https://api.firecrawl.dev/v0/crawl/status/1234-5678-9101 \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY'
 ```
 ```json
 {
    "status": "completed",
    "current": 22,
    "total": 22,
    "data": [
        {
            "content": "Raw Content",
            "markdown": "# Markdown Content",
            "provider": "web-scraper",
            "metadata": {
                "title": "Mendable | AI for CX and Sales",
                "description": "AI for CX and Sales",
                "language": null,
                "sourceURL": "https://www.mendable.ai/"
            }
        }
    ]
 }
 ```
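The two endpoints above form a submit-then-poll flow: submit a crawl, keep the `jobId`, and check the status until the job finishes. Below is a minimal sketch of the polling half. `wait_for_crawl` and its parameters are hypothetical helpers, not part of the API or SDK. The `"completed"` status comes from the sample response above; the `"failed"` status is an assumption about what the server may report. The HTTP call is injected as a function so the sketch runs offline.

```python
import time
from typing import Any, Callable, Dict


def wait_for_crawl(
    fetch_status: Callable[[str], Dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a crawl job until it completes, fails, or runs out of attempts.

    fetch_status is any callable that returns the parsed JSON body of
    GET /v0/crawl/status/<job_id>. It is injected, not implemented here.
    """
    for attempt in range(max_attempts):
        status = fetch_status(job_id)
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":  # assumed failure state, not shown in the README
            raise RuntimeError(f"Crawl job {job_id} failed")
        # Wait before the next check, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(poll_interval)
    raise TimeoutError(
        f"Crawl job {job_id} did not finish after {max_attempts} checks"
    )


# Stand-in for the real HTTP request, replaying two canned responses.
_responses = iter([
    {"status": "active", "current": 1, "total": 2},
    {
        "status": "completed",
        "current": 2,
        "total": 2,
        "data": [{"markdown": "# Markdown Content"}],
    },
])


def fake_fetch_status(job_id: str) -> Dict[str, Any]:
    return next(_responses)


result = wait_for_crawl(fake_fetch_status, "1234-5678-9101", sleep=lambda s: None)
print(result["data"][0]["markdown"])
```

In real use, `fetch_status` would send the authenticated `GET` request shown above and return the decoded JSON. Capping the number of attempts keeps a stuck job from blocking the caller indefinitely.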
 ## Using Python SDK
 ### Installing Python SDK
 ```bash
 pip install firecrawl-py
 ```
 ### Crawl a website
 ```python
 from firecrawl import FirecrawlApp
 app = FirecrawlApp(api_key="YOUR_API_KEY")
 crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
 # Get the markdown
 for result in crawl_result:
    print(result['markdown'])
 ```
 ### Scraping a URL
 To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
 ```python
 url = 'https://example.com'
 scraped_data = app.scrape_url(url)
 ```
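Once you have a scraped document, you will usually want a few fields from it. The sketch below reads them defensively. `describe_document` is a hypothetical helper, and the field names (`markdown`, `metadata.title`, `metadata.sourceURL`) are assumed to mirror the sample status response shown earlier. The dictionary returned by `scrape_url` may be shaped differently.

```python
from typing import Any, Dict


def describe_document(doc: Dict[str, Any]) -> str:
    """Summarize one document, tolerating missing or null fields."""
    metadata = doc.get("metadata") or {}
    title = metadata.get("title") or "(untitled)"
    source = metadata.get("sourceURL") or "(unknown source)"
    markdown = doc.get("markdown") or ""
    return f"{title} <{source}>: {len(markdown)} characters of markdown"


# Sample document shaped like the entries in the status response above.
sample = {
    "markdown": "# Markdown Content",
    "metadata": {
        "title": "Mendable | AI for CX and Sales",
        "sourceURL": "https://www.mendable.ai/",
    },
}
print(describe_document(sample))
```

Using `.get` with fallbacks avoids a `KeyError` when a page has no title or a field comes back as `null`.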
 ## Contributing
 We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request.
--- a/SELF_HOST.md
+++ b/SELF_HOST.md
@ -0,0 +1,6 @@
 # Self-hosting Firecrawl
 Guide coming soon.
--- a/apps/.DS_Store
+++ b/apps/.DS_Store
--- a/apps/api/.dockerignore
+++ b/apps/api/.dockerignore
@ -0,0 +1,4 @@
 /node_modules/
 /dist/
 .env
 *.csv
--- a/apps/api/.env.local
+++ b/apps/api/.env.local
@ -0,0 +1,8 @@
 PORT=
 HOST=
 SUPABASE_ANON_TOKEN=
 SUPABASE_URL=
 SUPABASE_SERVICE_TOKEN=
 REDIS_URL=
 OPENAI_API_KEY=
 SCRAPING_BEE_API_KEY=
--- a/apps/api/.gitattributes
+++ b/apps/api/.gitattributes
@ -0,0 +1,2 @@
 # Auto detect text files and perform LF normalization
 * text=auto
--- a/apps/api/.gitignore
+++ b/apps/api/.gitignore
@ -0,0 +1,6 @@
 /node_modules/
 /dist/
 .env
 *.csv
 dump.rdb
 /mongo-data
--- a/apps/api/Dockerfile
+++ b/apps/api/Dockerfile
@ -0,0 +1,36 @@
 FROM node:20-slim AS base
 ENV PNPM_HOME="/pnpm"
 ENV PATH="$PNPM_HOME:$PATH"
 LABEL fly_launch_runtime="Node.js"
 RUN corepack enable
 COPY . /app
 WORKDIR /app
 FROM base AS prod-deps
 RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --prod --frozen-lockfile
 FROM base AS build
 RUN --mount=type=cache,id=pnpm,target=/pnpm/store pnpm install --frozen-lockfile
 RUN pnpm install
 RUN pnpm run build
 # Install packages needed for deployment
 FROM base
 RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y chromium chromium-sandbox && \
    rm -rf /var/lib/apt/lists /var/cache/apt/archives
 COPY --from=prod-deps /app/node_modules /app/node_modules
 COPY --from=build /app /app
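 # Start the server by default; this can be overridden at runtime
 # (for example, with "pnpm run worker:production" to run the queue worker).
 # Docker honors only the last CMD in a stage, so exactly one is declared.
 EXPOSE 8080
 ENV PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium"
 CMD [ "pnpm", "run", "start:production" ]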
--- a/apps/api/fly.toml
+++ b/apps/api/fly.toml
@ -0,0 +1,47 @@
 # fly.toml app configuration file generated for firecrawl-scraper-js on 2024-04-07T21:09:59-03:00
 #
 # See https://fly.io/docs/reference/configuration/ for information about how to use this file.
 #
 app = 'firecrawl-scraper-js'
 primary_region = 'mia'
 kill_signal = 'SIGINT'
 kill_timeout = '5s'
 [build]
 [processes]
  app = 'npm run start:production'
  worker = 'npm run worker:production'
 [http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = true
  auto_start_machines = true
  min_machines_running = 0
  processes = ['app']
 [[services]]
  protocol = 'tcp'
  internal_port = 8080
  processes = ['app']
 [[services.ports]]
    port = 80
    handlers = ['http']
    force_https = true
 [[services.ports]]
    port = 443
    handlers = ['tls', 'http']
  [services.concurrency]
    type = 'connections'
    hard_limit = 45
    soft_limit = 20
 [[vm]]
  size = 'performance-1x'
--- a/apps/api/jest.config.js
+++ b/apps/api/jest.config.js
@ -0,0 +1,5 @@
 module.exports = {
  preset: "ts-jest",
  testEnvironment: "node",
  setupFiles: ["./jest.setup.js"],
 };
--- a/apps/api/jest.setup.js
+++ b/apps/api/jest.setup.js
@ -0,0 +1 @@
 global.fetch = require('jest-fetch-mock');
--- a/apps/api/package.json
+++ b/apps/api/package.json
@ -0,0 +1,98 @@
 {
  "name": "firecrawl-scraper-js",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "start": "nodemon --exec ts-node src/index.ts",
    "start:production": "tsc && node dist/src/index.js",
    "format": "prettier --write \"src/**/*.(js|ts)\"",
    "flyio": "node dist/src/index.js",
    "start:dev": "nodemon --exec ts-node src/index.ts",
    "build": "tsc",
    "test": "jest --verbose",
    "workers": "nodemon --exec ts-node src/services/queue-worker.ts",
    "worker:production": "node dist/src/services/queue-worker.js",
    "mongo-docker": "docker run -d -p 2717:27017 -v ./mongo-data:/data/db --name mongodb mongo:latest",
    "mongo-docker-console": "docker exec -it mongodb mongosh",
    "run-example": "npx ts-node src/example.ts"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "@flydotio/dockerfile": "^0.4.10",
    "@tsconfig/recommended": "^1.0.3",
    "@types/body-parser": "^1.19.2",
    "@types/bull": "^4.10.0",
    "@types/cors": "^2.8.13",
    "@types/express": "^4.17.17",
    "@types/jest": "^29.5.6",
    "body-parser": "^1.20.1",
    "express": "^4.18.2",
    "jest": "^29.6.3",
    "jest-fetch-mock": "^3.0.3",
    "nodemon": "^2.0.20",
    "supabase": "^1.77.9",
    "supertest": "^6.3.3",
    "ts-jest": "^29.1.1",
    "ts-node": "^10.9.1",
    "typescript": "^5.4.2"
  },
  "dependencies": {
    "@brillout/import": "^0.2.2",
    "@bull-board/api": "^5.14.2",
    "@bull-board/express": "^5.8.0",
    "@devil7softwares/pos": "^1.0.2",
    "@dqbd/tiktoken": "^1.0.7",
    "@logtail/node": "^0.4.12",
    "@nangohq/node": "^0.36.33",
    "@sentry/node": "^7.48.0",
    "@supabase/supabase-js": "^2.7.1",
    "async": "^3.2.5",
    "async-mutex": "^0.4.0",
    "axios": "^1.3.4",
    "bottleneck": "^2.19.5",
    "bull": "^4.11.4",
    "cheerio": "^1.0.0-rc.12",
    "cohere": "^1.1.1",
    "cors": "^2.8.5",
    "cron-parser": "^4.9.0",
    "date-fns": "^2.29.3",
    "dotenv": "^16.3.1",
    "express-rate-limit": "^6.7.0",
    "glob": "^10.3.12",
    "gpt3-tokenizer": "^1.1.5",
    "ioredis": "^5.3.2",
    "keyword-extractor": "^0.0.25",
    "langchain": "^0.1.25",
    "languagedetect": "^2.0.0",
    "logsnag": "^0.1.6",
    "luxon": "^3.4.3",
    "md5": "^2.3.0",
    "moment": "^2.29.4",
    "mongoose": "^8.0.3",
    "natural": "^6.3.0",
    "openai": "^4.28.4",
    "pos": "^0.4.2",
    "promptable": "^0.0.9",
    "puppeteer": "^22.6.3",
    "rate-limiter-flexible": "^2.4.2",
    "redis": "^4.6.7",
    "robots-parser": "^3.0.1",
    "scrapingbee": "^1.7.4",
    "stripe": "^12.2.0",
    "turndown": "^7.1.3",
    "typesense": "^1.5.4",
    "unstructured-client": "^0.9.4",
    "uuid": "^9.0.1",
    "wordpos": "^2.1.0",
    "xml2js": "^0.6.2"
  },
  "nodemonConfig": {
    "ignore": [
      "*.docx",
      "*.json",
      "temp"
    ]
  }
 }
--- a/apps/api/pnpm-lock.yaml
+++ b/apps/api/pnpm-lock.yaml
--- a/apps/api/requests.http
+++ b/apps/api/requests.http
@ -0,0 +1,53 @@
 ### Crawl Website
 POST http://localhost:3002/v0/crawl HTTP/1.1
 Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
 content-type: application/json
 {
    "url":"https://docs.mendable.ai"
 }
 ### Check Job Status
 GET http://localhost:3002/v0/jobs/active HTTP/1.1
 ### Scrape Website
 POST https://api.firecrawl.dev/v0/scrape HTTP/1.1
 Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
 content-type: application/json
 {
    "url":"https://www.agentops.ai"
 }
 ### Scrape Website
 POST http://localhost:3002/v0/scrape HTTP/1.1
 Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
 content-type: application/json
 {
    "url":"https://www.agentops.ai"
 }
 ### Check Job Status
 GET http://localhost:3002/v0/crawl/status/333ab225-dc3e-418b-9d4b-8fb833cbaf89 HTTP/1.1
 Authorization: Bearer 1b5b20e4-3417-4298-b1b2-983920e1b74d
 ### Crawl Website (hosted API)
 POST https://api.firecrawl.dev/v0/crawl HTTP/1.1
 Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
 content-type: application/json
 {
    "url":"https://markprompt.com"
 }
 ### Check Job Status
 GET https://api.firecrawl.dev/v0/crawl/status/cfcb71ac-23a3-4da5-bd85-d4e58b871d66
 Authorization: Bearer 30c90634-8377-4446-9ef9-a280b9be1702
--- a/apps/api/src/.DS_Store
+++ b/apps/api/src/.DS_Store
--- a/apps/api/src/control.ts
+++ b/apps/api/src/control.ts
@ -0,0 +1,2 @@
 // ! IN CASE OPENAI goes down, then activate the fallback -> true
 export const is_fallback = false;
--- a/apps/api/src/example.ts
+++ b/apps/api/src/example.ts
@ -0,0 +1,18 @@
 import { WebScraperDataProvider } from "./scraper/WebScraper";
 async function example() {
  const example = new WebScraperDataProvider();
  await example.setOptions({
    mode: "crawl",
    urls: ["https://mendable.ai"],
    crawlerOptions: {},
  });
  const docs = await example.getDocuments(false);
  docs.forEach((doc) => {
    console.log(doc.metadata.sourceURL);
  });
  console.log(docs.length);
 }
 // example();
--- a/apps/api/src/index.ts
+++ b/apps/api/src/index.ts
@ -0,0 +1,352 @@
 import express from "express";
 import bodyParser from "body-parser";
 import cors from "cors";
 import "dotenv/config";
 import { getWebScraperQueue } from "./services/queue-service";
 import { addWebScraperJob } from "./services/queue-jobs";
 import { supabase_service } from "./services/supabase";
 import { WebScraperDataProvider } from "./scraper/WebScraper";
 import { billTeam, checkTeamCredits } from "./services/billing/credit_billing";
 import { getRateLimiter, redisClient } from "./services/rate-limiter";
 const { createBullBoard } = require("@bull-board/api");
 const { BullAdapter } = require("@bull-board/api/bullAdapter");
 const { ExpressAdapter } = require("@bull-board/express");
 export const app = express();
 global.isProduction = process.env.IS_PRODUCTION === "true";
 app.use(bodyParser.urlencoded({ extended: true }));
 app.use(bodyParser.json({ limit: "10mb" }));
 app.use(cors()); // Add this line to enable CORS
 const serverAdapter = new ExpressAdapter();
 serverAdapter.setBasePath(`/admin/${process.env.BULL_AUTH_KEY}/queues`);
 const { addQueue, removeQueue, setQueues, replaceQueues } = createBullBoard({
  queues: [new BullAdapter(getWebScraperQueue())],
  serverAdapter: serverAdapter,
 });
 app.use(
  `/admin/${process.env.BULL_AUTH_KEY}/queues`,
  serverAdapter.getRouter()
 );
 app.get("/", (req, res) => {
  res.send("SCRAPERS-JS: Hello, world! Fly.io");
 });
 //write a simple test function
 app.get("/test", async (req, res) => {
  res.send("Hello, world!");
 });
 async function authenticateUser(req, res, mode?: string): Promise<string> {
  const authHeader = req.headers.authorization;
  if (!authHeader) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
  if (!token) {
    return res.status(401).json({ error: "Unauthorized: Token missing" });
  }
  try {
    const incomingIP = (req.headers["x-forwarded-for"] ||
      req.socket.remoteAddress) as string;
    const iptoken = incomingIP + token;
    await getRateLimiter(
      token === "this_is_just_a_preview_token"
    ).consume(iptoken);
  } catch (rateLimiterRes) {
    console.error(rateLimiterRes);
    return res.status(429).json({
      error: "Rate limit exceeded. Too many requests, try again in 1 minute.",
    });
  }
  if (token === "this_is_just_a_preview_token" && mode === "scrape") {
    return "preview";
  }
  // make sure api key is valid, based on the api_keys table in supabase
  const { data, error } = await supabase_service
    .from("api_keys")
    .select("*")
    .eq("key", token);
  if (error || !data || data.length === 0) {
    return res.status(401).json({ error: "Unauthorized: Invalid token" });
  }
  return data[0].team_id;
 }
 app.post("/v0/scrape", async (req, res) => {
  try {
    // make sure to authenticate user first, Bearer <token>
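    const team_id = await authenticateUser(req, res, "scrape");
    // authenticateUser sends the error response itself on failure, so stop here if it did.
    if (res.headersSent) {
      return;
    }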
    try {
      const { success: creditsCheckSuccess, message: creditsCheckMessage } =
        await checkTeamCredits(team_id, 1);
      if (!creditsCheckSuccess) {
        return res.status(402).json({ error: "Insufficient credits" });
      }
    } catch (error) {
      console.error(error);
      return res.status(500).json({ error: "Internal server error" });
    }
    // authenticate on supabase
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    try {
      const a = new WebScraperDataProvider();
      await a.setOptions({
        mode: "single_urls",
        urls: [url],
      });
      const docs = await a.getDocuments(false);
      // make sure doc.content is not empty
      const filteredDocs = docs.filter(
        (doc: { content?: string }) =>
          doc.content && doc.content.trim().length > 0
      );
      if (filteredDocs.length === 0) {
        return res.status(200).json({ success: true, data: [] });
      }
      const { success, credit_usage } = await billTeam(
        team_id,
        filteredDocs.length
      );
      if (!success) {
        // throw new Error("Failed to bill team, no subscription was found");
        // return {
        //   success: false,
        //   message: "Failed to bill team, no subscription was found",
        //   docs: [],
        // };
        return res
          .status(402)
          .json({ error: "Failed to bill, no subscription was found" });
      }
      return res.json({
        success: true,
        data: filteredDocs[0],
      });
    } catch (error) {
      console.error(error);
      return res.status(500).json({ error: error.message });
    }
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 app.post("/v0/crawl", async (req, res) => {
  try {
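    const team_id = await authenticateUser(req, res);
    // authenticateUser sends the error response itself on failure, so stop here if it did.
    if (res.headersSent) {
      return;
    }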
    const { success: creditsCheckSuccess, message: creditsCheckMessage } =
      await checkTeamCredits(team_id, 1);
    if (!creditsCheckSuccess) {
      return res.status(402).json({ error: "Insufficient credits" });
    }
    // authenticate on supabase
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};
    if (mode === "single_urls" && !url.includes(",")) {
      try {
        const a = new WebScraperDataProvider();
        await a.setOptions({
          mode: "single_urls",
          urls: [url],
          crawlerOptions: {
            returnOnlyUrls: true,
          },
        });
        // No queue job exists on this synchronous path, so there is no progress to report.
        const docs = await a.getDocuments(false);
        return res.json({
          success: true,
          documents: docs,
        });
      } catch (error) {
        console.error(error);
        return res.status(500).json({ error: error.message });
      }
    }
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions },
      team_id: team_id,
    });
    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 app.post("/v0/crawlWebsitePreview", async (req, res) => {
  try {
    // make sure to authenticate user first, Bearer <token>
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }
    // authenticate on supabase
    const url = req.body.url;
    if (!url) {
      return res.status(400).json({ error: "Url is required" });
    }
    const mode = req.body.mode ?? "crawl";
    const crawlerOptions = req.body.crawlerOptions ?? {};
    const job = await addWebScraperJob({
      url: url,
      mode: mode ?? "crawl", // fix for single urls not working
      crawlerOptions: { ...crawlerOptions, limit: 5, maxCrawledLinks: 5 },
      team_id: "preview",
    });
    res.json({ jobId: job.id });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 app.get("/v0/crawl/status/:jobId", async (req, res) => {
  try {
    const authHeader = req.headers.authorization;
    if (!authHeader) {
      return res.status(401).json({ error: "Unauthorized" });
    }
    const token = authHeader.split(" ")[1]; // Extract the token from "Bearer <token>"
    if (!token) {
      return res.status(401).json({ error: "Unauthorized: Token missing" });
    }
    // make sure api key is valid, based on the api_keys table in supabase
    const { data, error } = await supabase_service
      .from("api_keys")
      .select("*")
      .eq("key", token);
    if (error || !data || data.length === 0) {
      return res.status(401).json({ error: "Unauthorized: Invalid token" });
    }
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }
    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 app.get("/v0/checkJobStatus/:jobId", async (req, res) => {
  try {
    const job = await getWebScraperQueue().getJob(req.params.jobId);
    if (!job) {
      return res.status(404).json({ error: "Job not found" });
    }
    const { current, current_url, total, current_step } = await job.progress();
    res.json({
      status: await job.getState(),
      // progress: job.progress(),
      current: current,
      current_url: current_url,
      current_step: current_step,
      total: total,
      data: job.returnvalue,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 const DEFAULT_PORT = process.env.PORT ?? 3002;
 const HOST = process.env.HOST ?? "localhost";
 redisClient.connect();
 export function startServer(port = DEFAULT_PORT) {
  const server = app.listen(Number(port), HOST, () => {
    console.log(`Server listening on port ${port}`);
    console.log(`For the UI, open http://${HOST}:${port}/admin/queues`);
    console.log("");
    console.log("1. Make sure Redis is running on port 6379 by default");
    console.log(
      "2. If you want to run Nango, forward port 3002 with: ngrok http 3002"
    );
  });
  return server;
 }
 if (require.main === module) {
  startServer();
 }
 // Use this as a health check, that way we don't destroy the server
 app.get(`/admin/${process.env.BULL_AUTH_KEY}/queues`, async (req, res) => {
  try {
    const webScraperQueue = getWebScraperQueue();
    const [webScraperActive] = await Promise.all([
      webScraperQueue.getActiveCount(),
    ]);
    const noActiveJobs = webScraperActive === 0;
    // 200 if no active jobs, 500 if there are active jobs
    return res.status(noActiveJobs ? 200 : 500).json({
      webScraperActive,
      noActiveJobs,
    });
  } catch (error) {
    console.error(error);
    return res.status(500).json({ error: error.message });
  }
 });
 app.get("/is-production", (req, res) => {
  res.send({ isProduction: global.isProduction });
 });
--- a/apps/api/src/lib/batch-process.ts
+++ b/apps/api/src/lib/batch-process.ts
@ -0,0 +1,16 @@
 export async function batchProcess<T>(
    array: T[],
    batchSize: number,
    asyncFunction: (item: T, index: number) => Promise<void>
  ): Promise<void> {
    // Process the array in consecutive batches; items within a batch run concurrently.
    for (let start = 0; start < array.length; start += batchSize) {
      const batch = array.slice(start, start + batchSize);
      // Pass the item's index in the original array, not its position in the batch.
      await Promise.all(
        batch.map((item, i) => asyncFunction(item, start + i))
      );
    }
  }
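A minimal usage sketch of the batching pattern above. The helper is restated so the block runs on its own, and `fakeFetch` is a hypothetical worker standing in for a real scraper call. The sketch assumes the index handed to the worker is the item's position in the full array; it tracks how many workers are in flight so the concurrency ceiling can be observed.

```typescript
// Restated from batch-process.ts so the sketch is self-contained.
async function batchProcess<T>(
  array: T[],
  batchSize: number,
  asyncFunction: (item: T, index: number) => Promise<void>
): Promise<void> {
  for (let start = 0; start < array.length; start += batchSize) {
    const batch = array.slice(start, start + batchSize);
    // Wait for the whole batch to settle before starting the next one.
    await Promise.all(batch.map((item, i) => asyncFunction(item, start + i)));
  }
}

let inFlight = 0;
let maxInFlight = 0;
const processedUrls: string[] = [];

// Hypothetical worker: simulates a short network call and records the URL.
async function fakeFetch(url: string): Promise<void> {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  processedUrls.push(url);
  inFlight--;
}

const sampleUrls = ["a", "b", "c", "d", "e"].map(
  (page) => `https://example.com/${page}`
);
const batchRun = batchProcess(sampleUrls, 2, fakeFetch);
```

Because each batch is awaited as a unit, one slow item holds back the whole batch; a queue with a concurrency limit would avoid that at the cost of more bookkeeping.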
--- a/apps/api/src/lib/custom-error.ts
+++ b/apps/api/src/lib/custom-error.ts
@ -0,0 +1,21 @@
 export class CustomError extends Error {
  statusCode: number;
  status: string;
  message: string;
  dataIngestionJob: any;
  constructor(
    statusCode: number,
    status: string,
    message: string = "",
    dataIngestionJob?: any,
  ) {
    super(message);
    this.statusCode = statusCode;
    this.status = status;
    this.message = message;
    this.dataIngestionJob = dataIngestionJob;
    Object.setPrototypeOf(this, CustomError.prototype);
  }
 }
--- a/apps/api/src/lib/entities.ts
+++ b/apps/api/src/lib/entities.ts
@ -0,0 +1,37 @@
 export interface Progress {
  current: number;
  total: number;
  status: string;
  metadata?: {
    sourceURL?: string;
    [key: string]: any;
  };
  currentDocumentUrl?: string;
 }
 export class Document {
  id?: string;
  content: string;
  markdown?: string;
  createdAt?: Date;
  updatedAt?: Date;
  type?: string;
  metadata: {
    sourceURL?: string;
    [key: string]: any;
  };
  childrenLinks?: string[];
  constructor(data: Partial<Document>) {
    if (!data.content) {
      throw new Error("Missing required fields");
    }
    this.content = data.content;
    this.createdAt = data.createdAt || new Date();
    this.updatedAt = data.updatedAt || new Date();
    this.type = data.type || "unknown";
    this.metadata = data.metadata || { sourceURL: "" };
    this.markdown = data.markdown || "";
    this.childrenLinks = data.childrenLinks || undefined;
  }
 }
--- a/apps/api/src/lib/html-to-markdown.ts
+++ b/apps/api/src/lib/html-to-markdown.ts
@ -0,0 +1,51 @@
 export function parseMarkdown(html: string) {
  var TurndownService = require("turndown");
  const turndownService = new TurndownService();
  turndownService.addRule("inlineLink", {
    filter: function (node, options) {
      return (
        options.linkStyle === "inlined" &&
        node.nodeName === "A" &&
        node.getAttribute("href")
      );
    },
    replacement: function (content, node) {
      var href = node.getAttribute("href").trim();
      var title = node.title ? ' "' + node.title + '"' : "";
      return "[" + content.trim() + "](" + href + title + ")\n";
    },
  });
  let markdownContent = turndownService.turndown(html);
  // multiple line links
  let insideLinkContent = false;
  let newMarkdownContent = "";
  let linkOpenCount = 0;
  for (let i = 0; i < markdownContent.length; i++) {
    const char = markdownContent[i];
    if (char == "[") {
      linkOpenCount++;
    } else if (char == "]") {
      linkOpenCount = Math.max(0, linkOpenCount - 1);
    }
    insideLinkContent = linkOpenCount > 0;
    if (insideLinkContent && char == "\n") {
      newMarkdownContent += "\\" + "\n";
    } else {
      newMarkdownContent += char;
    }
  }
  markdownContent = newMarkdownContent;
  // Remove [Skip to Content](#page) and [Skip to content](#skip)
  markdownContent = markdownContent.replace(
    /\[Skip to Content\]\(#[^\)]*\)/gi,
    ""
  );
  return markdownContent;
 }
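The bracket-counting pass in `parseMarkdown` can be looked at in isolation. The sketch below pulls that loop into a hypothetical helper, `escapeNewlinesInsideLinks` (the name is not part of the source), and runs it on a hand-written string rather than on Turndown output.

```typescript
// Hypothetical standalone version of the "multiple line links" loop above.
function escapeNewlinesInsideLinks(markdown: string): string {
  let openBrackets = 0; // number of "[" not yet closed by "]"
  let result = "";
  for (let i = 0; i < markdown.length; i++) {
    const char = markdown[i];
    if (char === "[") {
      openBrackets++;
    } else if (char === "]") {
      openBrackets = Math.max(0, openBrackets - 1);
    }
    // A newline inside a link label is escaped with a backslash; all other
    // characters are copied through unchanged.
    result += openBrackets > 0 && char === "\n" ? "\\\n" : char;
  }
  return result;
}

const rawMarkdown = "[multi\nline](https://example.com)\nnext paragraph";
const escapedMarkdown = escapeNewlinesInsideLinks(rawMarkdown);
```

Only the newline between `multi` and `line` gains a backslash; the newline after the closing parenthesis is outside any bracket pair and is left alone.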
--- a/apps/api/src/lib/parse-mode.ts
+++ b/apps/api/src/lib/parse-mode.ts
@ -0,0 +1,12 @@
 export function parseMode(mode: string) {
  switch (mode) {
    case "single_urls":
      return "single_urls";
    case "sitemap":
      return "sitemap";
    case "crawl":
      return "crawl";
    default:
      return "single_urls";
  }
 }
--- a/apps/api/src/main/runWebScraper.ts
+++ b/apps/api/src/main/runWebScraper.ts
@ -0,0 +1,96 @@
 import { Job } from "bull";
 import { CrawlResult, WebScraperOptions } from "../types";
 import { WebScraperDataProvider } from "../scraper/WebScraper";
 import { Progress } from "../lib/entities";
 import { billTeam } from "../services/billing/credit_billing";
 export async function startWebScraperPipeline({
  job,
 }: {
  job: Job<WebScraperOptions>;
 }) {
  return (await runWebScraper({
    url: job.data.url,
    mode: job.data.mode,
    crawlerOptions: job.data.crawlerOptions,
    inProgress: (progress) => {
      job.progress(progress);
    },
    onSuccess: (result) => {
      job.moveToCompleted(result);
    },
    onError: (error) => {
      job.moveToFailed(error);
    },
    team_id: job.data.team_id,
  })) as { success: boolean; message: string; docs: CrawlResult[] };
 }
 export async function runWebScraper({
  url,
  mode,
  crawlerOptions,
  inProgress,
  onSuccess,
  onError,
  team_id,
 }: {
  url: string;
  mode: "crawl" | "single_urls" | "sitemap";
  crawlerOptions: any;
  inProgress: (progress: any) => void;
  onSuccess: (result: any) => void;
  onError: (error: any) => void;
  team_id: string;
 }): Promise<{ success: boolean; message: string; docs: CrawlResult[] }> {
  try {
    const provider = new WebScraperDataProvider();
    if (mode === "crawl") {
      await provider.setOptions({
        mode: mode,
        urls: [url],
        crawlerOptions: crawlerOptions,
      });
    } else {
      await provider.setOptions({
        mode: mode,
        urls: url.split(","),
        crawlerOptions: crawlerOptions,
      });
    }
    const docs = (await provider.getDocuments(false, (progress: Progress) => {
      inProgress(progress);
    })) as CrawlResult[];
    if (docs.length === 0) {
      return {
        success: true,
        message: "No pages found",
        docs: [],
      };
    }
    // remove docs with empty content
    const filteredDocs = docs.filter((doc) => doc.content.trim().length > 0);
    onSuccess(filteredDocs);
    const { success, credit_usage } = await billTeam(
      team_id,
      filteredDocs.length
    );
    if (!success) {
      // throw new Error("Failed to bill team, no subscription was found");
      return {
        success: false,
        message: "Failed to bill team, no subscription was found",
        docs: [],
      };
    }
    return { success: true, message: "", docs: filteredDocs as CrawlResult[] };
  } catch (error) {
    console.error("Error running web scraper", error);
    onError(error);
    return { success: false, message: error.message, docs: [] };
  }
 }
--- a/apps/api/src/scraper/WebScraper/crawler.ts
+++ b/apps/api/src/scraper/WebScraper/crawler.ts
@ -0,0 +1,295 @@
 import axios from "axios";
 import cheerio, { load } from "cheerio";
 import { URL } from "url";
 import { getLinksFromSitemap } from "./sitemap";
 import async from "async";
 import { Progress } from "../../lib/entities";
 import { scrapWithScrapingBee } from "./single_url";
 import robotsParser from "robots-parser";
 export class WebCrawler {
  private initialUrl: string;
  private baseUrl: string;
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private visited: Set<string> = new Set();
  private crawledUrls: Set<string> = new Set();
  private limit: number;
  private robotsTxtUrl: string;
  private robots: any;
  constructor({
    initialUrl,
    includes,
    excludes,
    maxCrawledLinks,
    limit = 10000,
  }: {
    initialUrl: string;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  }) {
    this.initialUrl = initialUrl;
    this.baseUrl = new URL(initialUrl).origin;
    this.includes = includes ?? [];
    this.excludes = excludes ?? [];
    this.limit = limit;
    this.robotsTxtUrl = `${this.baseUrl}/robots.txt`;
    this.robots = robotsParser(this.robotsTxtUrl, "");
    // Deprecated, use limit instead
    this.maxCrawledLinks = maxCrawledLinks ?? limit;
  }
  private filterLinks(sitemapLinks: string[], limit: number): string[] {
    return sitemapLinks
      .filter((link) => {
        const url = new URL(link);
        const path = url.pathname;
        // Check if the link should be excluded
        if (this.excludes.length > 0 && this.excludes[0] !== "") {
          if (
            this.excludes.some((excludePattern) =>
              new RegExp(excludePattern).test(path)
            )
          ) {
            return false;
          }
        }
        // Check if the link matches the include patterns, if any are specified
        if (this.includes.length > 0 && this.includes[0] !== "") {
          if (
            !this.includes.some((includePattern) =>
              new RegExp(includePattern).test(path)
            )
          ) {
            return false;
          }
        }
        // Check if the link is disallowed by robots.txt; this must also run
        // for links that matched an include pattern
        const isAllowed = this.robots.isAllowed(link, "FireCrawlAgent") ?? true;
        if (!isAllowed) {
          console.log(`Link disallowed by robots.txt: ${link}`);
          return false;
        }
        return true;
      })
      .slice(0, limit);
  }
  public async start(
    inProgress?: (progress: Progress) => void,
    concurrencyLimit: number = 5,
    limit: number = 10000
  ): Promise<string[]> {
    // Fetch and parse robots.txt
    try {
      const response = await axios.get(this.robotsTxtUrl);
      this.robots = robotsParser(this.robotsTxtUrl, response.data);
    } catch (error) {
      console.error(`Failed to fetch robots.txt from ${this.robotsTxtUrl}`);
    }
    const sitemapLinks = await this.tryFetchSitemapLinks(this.initialUrl);
    if (sitemapLinks.length > 0) {
      const filteredLinks = this.filterLinks(sitemapLinks, limit);
      return filteredLinks;
    }
    const urls = await this.crawlUrls(
      [this.initialUrl],
      concurrencyLimit,
      inProgress
    );
    if (
      urls.length === 0 &&
      this.filterLinks([this.initialUrl], limit).length > 0
    ) {
      return [this.initialUrl];
    }
    // make sure to run include exclude here again
    return this.filterLinks(urls, limit);
  }
  private async crawlUrls(
    urls: string[],
    concurrencyLimit: number,
    inProgress?: (progress: Progress) => void
  ): Promise<string[]> {
    const queue = async.queue(async (task: string, callback) => {
      if (this.crawledUrls.size >= this.maxCrawledLinks) {
        if (callback && typeof callback === "function") {
          callback();
        }
        return;
      }
      const newUrls = await this.crawl(task);
      newUrls.forEach((url) => this.crawledUrls.add(url));
      if (inProgress && newUrls.length > 0) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: newUrls[newUrls.length - 1],
        });
      } else if (inProgress) {
        inProgress({
          current: this.crawledUrls.size,
          total: this.maxCrawledLinks,
          status: "SCRAPING",
          currentDocumentUrl: task,
        });
      }
      await this.crawlUrls(newUrls, concurrencyLimit, inProgress);
      if (callback && typeof callback === "function") {
        callback();
      }
    }, concurrencyLimit);
    queue.push(
      urls.filter(
        (url) =>
          !this.visited.has(url) && this.robots.isAllowed(url, "FireCrawlAgent")
      ),
      (err) => {
        if (err) console.error(err);
      }
    );
    await queue.drain();
    return Array.from(this.crawledUrls);
  }
  async crawl(url: string): Promise<string[]> {
    if (this.visited.has(url) || !this.robots.isAllowed(url, "FireCrawlAgent"))
      return [];
    this.visited.add(url);
    if (!url.startsWith("http")) {
      url = "https://" + url;
    }
    if (url.endsWith("/")) {
      url = url.slice(0, -1);
    }
    if (this.isFile(url) || this.isSocialMediaOrEmail(url)) {
      return [];
    }
    try {
      let content;
      // If it is the first link, fetch with scrapingbee
      if (this.visited.size === 1) {
        content = await scrapWithScrapingBee(url, "load");
      } else {
        const response = await axios.get(url);
        content = response.data;
      }
      const $ = load(content);
      let links: string[] = [];
      $("a").each((_, element) => {
        const href = $(element).attr("href");
        if (href) {
          let fullUrl = href;
          if (!href.startsWith("http")) {
            fullUrl = new URL(href, this.baseUrl).toString();
          }
          const url = new URL(fullUrl);
          const path = url.pathname;
          if (
            // fullUrl.startsWith(this.initialUrl) && // this condition makes it stop crawling back the url
            this.isInternalLink(fullUrl) &&
            this.matchesPattern(fullUrl) &&
            this.noSections(fullUrl) &&
            this.matchesIncludes(path) &&
            !this.matchesExcludes(path) &&
            this.robots.isAllowed(fullUrl, "FireCrawlAgent")
          ) {
            links.push(fullUrl);
          }
        }
      });
      return links.filter((link) => !this.visited.has(link));
    } catch (error) {
      return [];
    }
  }
  private matchesIncludes(url: string): boolean {
    if (this.includes.length === 0 || this.includes[0] == "") return true;
    return this.includes.some((pattern) => new RegExp(pattern).test(url));
  }
  private matchesExcludes(url: string): boolean {
    if (this.excludes.length === 0 || this.excludes[0] == "") return false;
    return this.excludes.some((pattern) => new RegExp(pattern).test(url));
  }
  private noSections(link: string): boolean {
    return !link.includes("#");
  }
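  private isInternalLink(link: string): boolean {
    const urlObj = new URL(link, this.baseUrl);
    // baseUrl is an origin and may carry a port, so compare against host
    // (hostname plus port) rather than hostname alone
    const domainWithoutProtocol = this.baseUrl.replace(/^https?:\/\//, "");
    return urlObj.host === domainWithoutProtocol;
  }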
  private matchesPattern(link: string): boolean {
    return true; // Placeholder for future pattern matching implementation
  }
  private isFile(url: string): boolean {
    const fileExtensions = [
      ".png",
      ".jpg",
      ".jpeg",
      ".gif",
      ".css",
      ".js",
      ".ico",
      ".svg",
      ".pdf",
      ".zip",
      ".exe",
      ".dmg",
      ".mp4",
      ".mp3",
      ".pptx",
      ".docx",
      ".xlsx",
      ".xml",
    ];
    return fileExtensions.some((ext) => url.endsWith(ext));
  }
  private isSocialMediaOrEmail(url: string): boolean {
    const socialMediaOrEmail = [
      "facebook.com",
      "twitter.com",
      "linkedin.com",
      "instagram.com",
      "pinterest.com",
      "mailto:",
    ];
    return socialMediaOrEmail.some((ext) => url.includes(ext));
  }
  private async tryFetchSitemapLinks(url: string): Promise<string[]> {
    const sitemapUrl = url.endsWith("/sitemap.xml")
      ? url
      : `${url}/sitemap.xml`;
    try {
      const response = await axios.get(sitemapUrl);
      if (response.status === 200) {
        return await getLinksFromSitemap(sitemapUrl);
      }
    } catch (error) {
      // Error handling for failed sitemap fetch
    }
    return [];
  }
 }
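The include/exclude checks in `WebCrawler` can be tried out on their own. The sketch below uses a hypothetical helper, `pathAllowed`, that is not part of the source; unlike the class, it drops every empty pattern rather than only testing the first entry, which is a simplifying assumption.

```typescript
// Hypothetical standalone helper mirroring the include/exclude logic above.
function pathAllowed(
  path: string,
  includes: string[],
  excludes: string[]
): boolean {
  // Treat [""] the same as "no patterns given".
  const includePatterns = includes.filter((pattern) => pattern !== "");
  const excludePatterns = excludes.filter((pattern) => pattern !== "");

  // Excludes win: any match rejects the path.
  if (excludePatterns.some((pattern) => new RegExp(pattern).test(path))) {
    return false;
  }
  // When includes are given, at least one must match.
  if (includePatterns.length > 0) {
    return includePatterns.some((pattern) => new RegExp(pattern).test(path));
  }
  return true;
}

const sampleLinks = [
  "https://example.com/blog/post-1",
  "https://example.com/blog/drafts/wip",
  "https://example.com/pricing",
];
const keptLinks = sampleLinks.filter((link) =>
  pathAllowed(new URL(link).pathname, ["^/blog/"], ["/drafts/"])
);
```

The patterns are matched against the URL path only, so `^/blog/` anchors at the start of the path, not at the start of the full URL.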
--- a/apps/api/src/scraper/WebScraper/index.ts
+++ b/apps/api/src/scraper/WebScraper/index.ts
@ -0,0 +1,287 @@
 import { Document } from "../../lib/entities";
 import { Progress } from "../../lib/entities";
 import { scrapSingleUrl } from "./single_url";
 import { SitemapEntry, fetchSitemapData, getLinksFromSitemap } from "./sitemap";
 import { WebCrawler } from "./crawler";
 import { getValue, setValue } from "../../services/redis";
 export type WebScraperOptions = {
  urls: string[];
  mode: "single_urls" | "sitemap" | "crawl";
  crawlerOptions?: {
    returnOnlyUrls?: boolean;
    includes?: string[];
    excludes?: string[];
    maxCrawledLinks?: number;
    limit?: number;
  };
  concurrentRequests?: number;
 };
 export class WebScraperDataProvider {
  private urls: string[] = [""];
  private mode: "single_urls" | "sitemap" | "crawl" = "single_urls";
  private includes: string[];
  private excludes: string[];
  private maxCrawledLinks: number;
  private returnOnlyUrls: boolean;
  private limit: number = 10000;
  private concurrentRequests: number = 20;
  authorize(): void {
    throw new Error("Method not implemented.");
  }
  authorizeNango(): Promise<void> {
    throw new Error("Method not implemented.");
  }
  private async convertUrlsToDocuments(
    urls: string[],
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    const totalUrls = urls.length;
    let processedUrls = 0;
    console.log("Converting urls to documents");
    console.log("Total urls", urls);
    const results: (Document | null)[] = new Array(urls.length).fill(null);
    for (let i = 0; i < urls.length; i += this.concurrentRequests) {
      const batchUrls = urls.slice(i, i + this.concurrentRequests);
      await Promise.all(batchUrls.map(async (url, index) => {
        const result = await scrapSingleUrl(url, true);
        processedUrls++;
        if (inProgress) {
          inProgress({
            current: processedUrls,
            total: totalUrls,
            status: "SCRAPING",
            currentDocumentUrl: url,
          });
        }
        results[i + index] = result;
      }));
    }
    return results.filter((result) => result !== null) as Document[];
  }
  async getDocuments(
    useCaching: boolean = false,
    inProgress?: (progress: Progress) => void
  ): Promise<Document[]> {
    if (this.urls[0].trim() === "") {
      throw new Error("Url is required");
    }
    if (!useCaching) {
      if (this.mode === "crawl") {
        const crawler = new WebCrawler({
          initialUrl: this.urls[0],
          includes: this.includes,
          excludes: this.excludes,
          maxCrawledLinks: this.maxCrawledLinks,
          limit: this.limit,
        });
        const links = await crawler.start(inProgress, 5, this.limit);
        if (this.returnOnlyUrls) {
          return links.map((url) => ({
            content: "",
            metadata: { sourceURL: url },
            provider: "web",
            type: "text",
          }));
        }
        let documents = await this.convertUrlsToDocuments(links, inProgress);
        documents = await this.getSitemapData(this.urls[0], documents);
        console.log("documents", documents)
        // CACHING DOCUMENTS
        // - parent document
        const cachedParentDocumentString = await getValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]));
        if (cachedParentDocumentString != null) {
          let cachedParentDocument = JSON.parse(cachedParentDocumentString);
          if (!cachedParentDocument.childrenLinks || cachedParentDocument.childrenLinks.length < links.length - 1) {
            cachedParentDocument.childrenLinks = links.filter((link) => link !== this.urls[0]);
            await setValue('web-scraper-cache:' + this.normalizeUrl(this.urls[0]), JSON.stringify(cachedParentDocument), 60 * 60 * 24 * 10); // 10 days
          }
        } else {
          let parentDocument = documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) === this.normalizeUrl(this.urls[0]))
          await this.setCachedDocuments(parentDocument, links);
        }
        await this.setCachedDocuments(documents.filter((document) => this.normalizeUrl(document.metadata.sourceURL) !== this.normalizeUrl(this.urls[0])), []);
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }
      if (this.mode === "single_urls") {
        let documents = await this.convertUrlsToDocuments(this.urls, inProgress);
        const baseUrl = new URL(this.urls[0]).origin;
        documents = await this.getSitemapData(baseUrl, documents);
        await this.setCachedDocuments(documents);
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }
      if (this.mode === "sitemap") {
        const links = await getLinksFromSitemap(this.urls[0]);
        let documents = await this.convertUrlsToDocuments(links.slice(0, this.limit), inProgress);
        documents = await this.getSitemapData(this.urls[0], documents);
        await this.setCachedDocuments(documents);
        documents = this.removeChildLinks(documents);
        documents = documents.splice(0, this.limit);
        return documents;
      }
      return [];
    }
    let documents = await this.getCachedDocuments(this.urls.slice(0, this.limit));
    if (documents.length < this.limit) {
      const newDocuments: Document[] = await this.getDocuments(false, inProgress);
      newDocuments.forEach(doc => {
        if (!documents.some(d => this.normalizeUrl(d.metadata.sourceURL) === this.normalizeUrl(doc.metadata?.sourceURL))) {
          documents.push(doc);
        }
      });
    }
    documents = this.filterDocsExcludeInclude(documents);
    documents = this.removeChildLinks(documents);
    documents = documents.splice(0, this.limit);
    return documents;
  }
  private filterDocsExcludeInclude(documents: Document[]): Document[] {
    return documents.filter((document) => {
      const url = new URL(document.metadata.sourceURL);
      const path = url.pathname;
      if (this.excludes.length > 0 && this.excludes[0] !== '') {
        // Check if the link should be excluded
        if (this.excludes.some(excludePattern => new RegExp(excludePattern).test(path))) {
          return false;
        }
      }
      if (this.includes.length > 0 && this.includes[0] !== '') {
        // Check if the link matches the include patterns, if any are specified
        return this.includes.some(includePattern => new RegExp(includePattern).test(path));
      }
      return true;
    });
  }
  private normalizeUrl(url: string): string {
    if (url.includes("//www.")) {
      return url.replace("//www.", "//");
    }
    return url;
  }
  private removeChildLinks(documents: Document[]): Document[] {
    for (let document of documents) {
      if (document?.childrenLinks) delete document.childrenLinks;
    }
    return documents;
  }
  async setCachedDocuments(documents: Document[], childrenLinks?: string[]) {
    for (const document of documents) {
      if (document.content.trim().length === 0) {
        continue;
      }
      const normalizedUrl = this.normalizeUrl(document.metadata.sourceURL);
      await setValue('web-scraper-cache:' + normalizedUrl, JSON.stringify({
        ...document,
        childrenLinks: childrenLinks || []
      }), 60 * 60 * 24 * 10); // 10 days
    }
  }
  async getCachedDocuments(urls: string[]): Promise<Document[]> {
    let documents: Document[] = [];
    for (const url of urls) {
      const normalizedUrl = this.normalizeUrl(url);
      console.log("Getting cached document for web-scraper-cache:" + normalizedUrl)
      const cachedDocumentString = await getValue('web-scraper-cache:' + normalizedUrl);
      if (cachedDocumentString) {
        const cachedDocument = JSON.parse(cachedDocumentString);
        documents.push(cachedDocument);
        // get children documents
          // Older cache entries may lack childrenLinks, so fall back to an empty list
          for (const childUrl of cachedDocument.childrenLinks ?? []) {
          const normalizedChildUrl = this.normalizeUrl(childUrl);
          const childCachedDocumentString = await getValue('web-scraper-cache:' + normalizedChildUrl);
          if (childCachedDocumentString) {
            const childCachedDocument = JSON.parse(childCachedDocumentString);
            if (!documents.find((doc) => doc.metadata.sourceURL === childCachedDocument.metadata.sourceURL)) {
              documents.push(childCachedDocument);
            }
          }
        }
      }
    }
    return documents;
  }
  setOptions(options: WebScraperOptions): void {
    if (!options.urls) {
      throw new Error("Urls are required");
    }
    console.log("options", options.crawlerOptions?.excludes)
    this.urls = options.urls;
    this.mode = options.mode;
    this.concurrentRequests = options.concurrentRequests ?? 20;
    this.includes = options.crawlerOptions?.includes ?? [];
    this.excludes = options.crawlerOptions?.excludes ?? [];
    this.maxCrawledLinks = options.crawlerOptions?.maxCrawledLinks ?? 1000;
    this.returnOnlyUrls = options.crawlerOptions?.returnOnlyUrls ?? false;
    this.limit = options.crawlerOptions?.limit ?? 10000;
    //! @nicolas, for some reason this was being injected and breaking everything. Don't have time to find source of the issue so adding this check
    this.excludes = this.excludes.filter(item => item !== '');
    // make sure all urls start with https://
    this.urls = this.urls.map((url) => {
      if (!url.trim().startsWith("http")) {
        return `https://${url}`;
      }
      return url;
    });
  }
  private async getSitemapData(baseUrl: string, documents: Document[]) {
    const sitemapData = await fetchSitemapData(baseUrl)
    if (sitemapData) {
      for (let i = 0; i < documents.length; i++) {
        const docInSitemapData = sitemapData.find((data) => this.normalizeUrl(data.loc) === this.normalizeUrl(documents[i].metadata.sourceURL))
        if (docInSitemapData) {
          let sitemapDocData: Partial<SitemapEntry> = {};
          if (docInSitemapData.changefreq) {
            sitemapDocData.changefreq = docInSitemapData.changefreq;
          }
          if (docInSitemapData.priority) {
            sitemapDocData.priority = Number(docInSitemapData.priority);
          }
          if (docInSitemapData.lastmod) {
            sitemapDocData.lastmod = docInSitemapData.lastmod;
          }
          if (Object.keys(sitemapDocData).length !== 0) {
            documents[i].metadata.sitemap = sitemapDocData;
          }
        }
      }
    }
    return documents;
  }
 }
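The cache keys used by `WebScraperDataProvider` are built from a prefix plus a lightly normalized URL. The sketch below restates that in a hypothetical helper, `cacheKey` (not part of the source), to show which URL variants share an entry and which do not.

```typescript
const CACHE_PREFIX = "web-scraper-cache:";

// Hypothetical helper combining normalizeUrl with the key prefix used above.
function cacheKey(url: string): string {
  // Only the leading "www." is stripped; nothing else is normalized.
  return CACHE_PREFIX + url.replace("//www.", "//");
}

const keyWithWww = cacheKey("https://www.example.com/docs");
const keyWithoutWww = cacheKey("https://example.com/docs");
const keyWithSlash = cacheKey("https://example.com/docs/");
```

The first two keys are identical, so both spellings hit the same cached document. The trailing-slash variant produces a different key, which means it would be cached separately under this scheme.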
--- a/apps/api/src/scraper/WebScraper/single_url.ts
+++ b/apps/api/src/scraper/WebScraper/single_url.ts
@ -0,0 +1,145 @@
 import * as cheerio from "cheerio";
 import { ScrapingBeeClient } from "scrapingbee";
 import { attemptScrapWithRequests, sanitizeText } from "./utils/utils";
 import { extractMetadata } from "./utils/metadata";
 import dotenv from "dotenv";
 import { Document } from "../../lib/entities";
 import { parseMarkdown } from "../../lib/html-to-markdown";
 // import puppeteer from "puppeteer";
 dotenv.config();
 export async function scrapWithScrapingBee(url: string, wait_browser:string = "domcontentloaded"): Promise<string> {
  try {
    const client = new ScrapingBeeClient(process.env.SCRAPING_BEE_API_KEY);
    const response = await client.get({
      url: url,
      params: { timeout: 15000, wait_browser: wait_browser },
      headers: { "ScrapingService-Request": "TRUE" },
    });
    if (response.status !== 200 && response.status !== 404) {
      console.error(
        `Scraping bee error in ${url} with status code ${response.status}`
      );
      return "";
    }
    const decoder = new TextDecoder();
    const text = decoder.decode(response.data);
    return text;
  } catch (error) {
    console.error(`Error scraping with Scraping Bee: ${error}`);
    return "";
  }
 }
 export async function scrapWithPlaywright(url: string): Promise<string> {
  try {
    const response = await fetch(process.env.PLAYWRIGHT_MICROSERVICE_URL, {
      method: 'POST',
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: url }),
    });
    if (!response.ok) {
      console.error(`Error fetching w/ playwright server -> URL: ${url} with status: ${response.status}`);
      return "";
    }
    const data = await response.json();
    const html = data.content;
    return html ?? "";
  } catch (error) {
    console.error(`Error scraping with Playwright: ${error}`);
    return "";
  }
 }
 export async function scrapSingleUrl(
  urlToScrap: string,
  toMarkdown: boolean = true
 ): Promise<Document> {
  console.log(`Scraping URL: ${urlToScrap}`);
  urlToScrap = urlToScrap.trim();
  const removeUnwantedElements = (html: string) => {
    const soup = cheerio.load(html);
    soup("script, style, iframe, noscript, meta, head").remove();
    return soup.html();
  };
  const attemptScraping = async (url: string, method: 'scrapingBee' | 'playwright' | 'scrapingBeeLoad' | 'fetch') => {
    let text = "";
    switch (method) {
      case 'scrapingBee':
        if (process.env.SCRAPING_BEE_API_KEY) {
          text = await scrapWithScrapingBee(url);
        }
        break;
      case 'playwright':
        if (process.env.PLAYWRIGHT_MICROSERVICE_URL) {
          text = await scrapWithPlaywright(url);
        }
        break;
      case 'scrapingBeeLoad':
        if (process.env.SCRAPING_BEE_API_KEY) {
          text = await scrapWithScrapingBee(url, "networkidle2");
        }
        break;
      case 'fetch':
        try {
          const response = await fetch(url);
          if (!response.ok) {
            console.error(`Error fetching URL: ${url} with status: ${response.status}`);
            // Return the same [markdown, html] pair shape as the success path
            return ["", ""];
          }
          text = await response.text();
        } catch (error) {
          console.error(`Error scraping URL: ${error}`);
          return ["", ""];
        }
        break;
    }
    const cleanedHtml = removeUnwantedElements(text);
    return [await parseMarkdown(cleanedHtml), text];
  };
  try {
    let [text, html ] = await attemptScraping(urlToScrap, 'scrapingBee');
    if (!text || text.length < 100) {
      console.log("Falling back to playwright");
      [text, html] = await attemptScraping(urlToScrap, 'playwright');
    }
    if (!text || text.length < 100) {
      console.log("Falling back to scraping bee load");
      [text, html] = await attemptScraping(urlToScrap, 'scrapingBeeLoad');
    }
    if (!text || text.length < 100) {
      console.log("Falling back to fetch");
      [text, html] = await attemptScraping(urlToScrap, 'fetch');
    }
    const soup = cheerio.load(html);
    const metadata = extractMetadata(soup, urlToScrap);
    return {
      content: text,
      markdown: text,
      metadata: { ...metadata, sourceURL: urlToScrap },
    } as Document;
  } catch (error) {
    console.error(`Error: ${error} - Failed to fetch URL: ${urlToScrap}`);
    return {
      content: "",
      markdown: "",
      metadata: { sourceURL: urlToScrap },
    } as Document;
  }
 }
--- a/apps/api/src/scraper/WebScraper/sitemap.ts
+++ b/apps/api/src/scraper/WebScraper/sitemap.ts
@ -0,0 +1,74 @@
 import axios from "axios";
 import { parseStringPromise } from "xml2js";
 export async function getLinksFromSitemap(
  sitemapUrl: string,
  allUrls: string[] = []
 ): Promise<string[]> {
  try {
    let content: string;
    try {
      const response = await axios.get(sitemapUrl);
      content = response.data;
    } catch (error) {
      console.error(`Request failed for ${sitemapUrl}: ${error}`);
      return allUrls;
    }
    const parsed = await parseStringPromise(content);
    const root = parsed.urlset || parsed.sitemapindex;
    if (root && root.sitemap) {
      for (const sitemap of root.sitemap) {
        if (sitemap.loc && sitemap.loc.length > 0) {
          await getLinksFromSitemap(sitemap.loc[0], allUrls);
        }
      }
    } else if (root && root.url) {
      for (const url of root.url) {
        if (url.loc && url.loc.length > 0) {
          allUrls.push(url.loc[0]);
        }
      }
    }
  } catch (error) {
    console.error(`Error processing ${sitemapUrl}: ${error}`);
  }
  return allUrls;
 }
 export const fetchSitemapData = async (url: string): Promise<SitemapEntry[] | null> => {
  const sitemapUrl = url.endsWith("/sitemap.xml") ? url : `${url}/sitemap.xml`;
  try {
    const response = await axios.get(sitemapUrl);
    if (response.status === 200) {
      const xml = response.data;
      const parsedXml = await parseStringPromise(xml);
      const sitemapData: SitemapEntry[] = [];
      if (parsedXml.urlset && parsedXml.urlset.url) {
        for (const urlElement of parsedXml.urlset.url) {
          const sitemapEntry: SitemapEntry = { loc: urlElement.loc[0] };
          if (urlElement.lastmod) sitemapEntry.lastmod = urlElement.lastmod[0];
          if (urlElement.changefreq) sitemapEntry.changefreq = urlElement.changefreq[0];
          if (urlElement.priority) sitemapEntry.priority = Number(urlElement.priority[0]);
          sitemapData.push(sitemapEntry);
        }
      }
      return sitemapData;
    }
    return null;
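   } catch (error) {
     // Log the failure instead of silently swallowing it; callers still receive an empty list.
     console.error(`Error fetching sitemap ${sitemapUrl}: ${error}`);
   }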
  return [];
 }
 export interface SitemapEntry {
  loc: string;
  lastmod?: string;
  changefreq?: string;
  priority?: number;
 }
--- a/apps/api/src/scraper/WebScraper/utils/metadata.ts
+++ b/apps/api/src/scraper/WebScraper/utils/metadata.ts
@ -0,0 +1,109 @@
 // import * as cheerio from 'cheerio';
 import { CheerioAPI } from "cheerio";
 interface Metadata {
  title?: string;
  description?: string;
  language?: string;
  keywords?: string;
  robots?: string;
  ogTitle?: string;
  ogDescription?: string;
  dctermsCreated?: string;
  dcDateCreated?: string;
  dcDate?: string;
  dctermsType?: string;
  dcType?: string;
  dctermsAudience?: string;
  dctermsSubject?: string;
  dcSubject?: string;
  dcDescription?: string;
  ogImage?: string;
  dctermsKeywords?: string;
  modifiedTime?: string;
  publishedTime?: string;
  articleTag?: string;
  articleSection?: string;
 }
 export function extractMetadata(soup: CheerioAPI, url: string): Metadata {
  let title: string | null = null;
  let description: string | null = null;
  let language: string | null = null;
  let keywords: string | null = null;
  let robots: string | null = null;
  let ogTitle: string | null = null;
  let ogDescription: string | null = null;
  let dctermsCreated: string | null = null;
  let dcDateCreated: string | null = null;
  let dcDate: string | null = null;
  let dctermsType: string | null = null;
  let dcType: string | null = null;
  let dctermsAudience: string | null = null;
  let dctermsSubject: string | null = null;
  let dcSubject: string | null = null;
  let dcDescription: string | null = null;
  let ogImage: string | null = null;
  let dctermsKeywords: string | null = null;
  let modifiedTime: string | null = null;
  let publishedTime: string | null = null;
  let articleTag: string | null = null;
  let articleSection: string | null = null;
  try {
    title = soup("title").text() || null;
    description = soup('meta[name="description"]').attr("content") || null;
    // Assuming the language is part of the URL as per the regex pattern
    const pattern = /([a-zA-Z]+-[A-Z]{2})/;
    const match = pattern.exec(url);
    language = match ? match[1] : null;
    keywords = soup('meta[name="keywords"]').attr("content") || null;
    robots = soup('meta[name="robots"]').attr("content") || null;
    ogTitle = soup('meta[property="og:title"]').attr("content") || null;
    ogDescription = soup('meta[property="og:description"]').attr("content") || null;
    articleSection = soup('meta[name="article:section"]').attr("content") || null;
    articleTag = soup('meta[name="article:tag"]').attr("content") || null;
    publishedTime = soup('meta[property="article:published_time"]').attr("content") || null;
    modifiedTime = soup('meta[property="article:modified_time"]').attr("content") || null;
    ogImage = soup('meta[property="og:image"]').attr("content") || null;
    dctermsKeywords = soup('meta[name="dcterms.keywords"]').attr("content") || null;
    dcDescription = soup('meta[name="dc.description"]').attr("content") || null;
    dcSubject = soup('meta[name="dc.subject"]').attr("content") || null;
    dctermsSubject = soup('meta[name="dcterms.subject"]').attr("content") || null;
    dctermsAudience = soup('meta[name="dcterms.audience"]').attr("content") || null;
    dcType = soup('meta[name="dc.type"]').attr("content") || null;
    dctermsType = soup('meta[name="dcterms.type"]').attr("content") || null;
    dcDate = soup('meta[name="dc.date"]').attr("content") || null;
    dcDateCreated = soup('meta[name="dc.date.created"]').attr("content") || null;
    dctermsCreated = soup('meta[name="dcterms.created"]').attr("content") || null;
  } catch (error) {
    console.error("Error extracting metadata:", error);
  }
  return {
    ...(title ? { title } : {}),
    ...(description ? { description } : {}),
    ...(language ? { language } : {}),
    ...(keywords ? { keywords } : {}),
    ...(robots ? { robots } : {}),
    ...(ogTitle ? { ogTitle } : {}),
    ...(ogDescription ? { ogDescription } : {}),
    ...(dctermsCreated ? { dctermsCreated } : {}),
    ...(dcDateCreated ? { dcDateCreated } : {}),
    ...(dcDate ? { dcDate } : {}),
    ...(dctermsType ? { dctermsType } : {}),
    ...(dcType ? { dcType } : {}),
    ...(dctermsAudience ? { dctermsAudience } : {}),
    ...(dctermsSubject ? { dctermsSubject } : {}),
    ...(dcSubject ? { dcSubject } : {}),
    ...(dcDescription ? { dcDescription } : {}),
    ...(ogImage ? { ogImage } : {}),
    ...(dctermsKeywords ? { dctermsKeywords } : {}),
    ...(modifiedTime ? { modifiedTime } : {}),
    ...(publishedTime ? { publishedTime } : {}),
    ...(articleTag ? { articleTag } : {}),
    ...(articleSection ? { articleSection } : {}),
  };
 }
--- a/apps/api/src/scraper/WebScraper/utils/utils.ts
+++ b/apps/api/src/scraper/WebScraper/utils/utils.ts
@ -0,0 +1,23 @@
 import axios from "axios";
 export async function attemptScrapWithRequests(
  urlToScrap: string
 ): Promise<string | null> {
  try {
    const response = await axios.get(urlToScrap);
    if (!response.data) {
      console.log("Failed normal requests as well");
      return null;
    }
    return response.data;
  } catch (error) {
    console.error(`Error in attemptScrapWithRequests: ${error}`);
    return null;
  }
 }
 export function sanitizeText(text: string): string {
   // Use a global regex: a string pattern would only remove the first null character.
   return text.replace(/\u0000/g, "");
 }
--- a/apps/api/src/services/billing/credit_billing.ts
+++ b/apps/api/src/services/billing/credit_billing.ts
@ -0,0 +1,219 @@
 import { supabase_service } from "../supabase";
 const FREE_CREDITS = 100;
 export async function billTeam(team_id: string, credits: number) {
  if (team_id === "preview") {
    return { success: true, message: "Preview team, no credits used" };
  }
  console.log(`Billing team ${team_id} for ${credits} credits`);
  //   When the API is used, you can log the credit usage in the credit_usage table:
  // team_id: The ID of the team using the API.
  // subscription_id: The ID of the team's active subscription.
  // credits_used: The number of credits consumed by the API call.
  // created_at: The timestamp of the API usage.
  // 1. get the subscription
  const { data: subscription } = await supabase_service
    .from("subscriptions")
    .select("*")
    .eq("team_id", team_id)
    .eq("status", "active")
    .single();
  if (!subscription) {
    const { data: credit_usage } = await supabase_service
      .from("credit_usage")
      .insert([
        {
          team_id,
          credits_used: credits,
          created_at: new Date(),
        },
      ])
      .select();
    return { success: true, credit_usage };
  }
   // 2. Record the credits used in the credit_usage table
  const { data: credit_usage } = await supabase_service
    .from("credit_usage")
    .insert([
      {
        team_id,
        subscription_id: subscription.id,
        credits_used: credits,
        created_at: new Date(),
      },
    ])
    .select();
  return { success: true, credit_usage };
 }
 // if team has enough credits for the operation, return true, else return false
 export async function checkTeamCredits(team_id: string, credits: number) {
  if (team_id === "preview") {
    return { success: true, message: "Preview team, no credits used" };
  }
  // 1. Retrieve the team's active subscription based on the team_id.
  const { data: subscription, error: subscriptionError } =
    await supabase_service
      .from("subscriptions")
      .select("id, price_id, current_period_start, current_period_end")
      .eq("team_id", team_id)
      .eq("status", "active")
      .single();
  if (subscriptionError || !subscription) {
    const { data: creditUsages, error: creditUsageError } =
      await supabase_service
        .from("credit_usage")
        .select("credits_used")
        .is("subscription_id", null)
        .eq("team_id", team_id);
    // .gte("created_at", subscription.current_period_start)
    // .lte("created_at", subscription.current_period_end);
    if (creditUsageError) {
       // No subscription exists in this branch, so report the team_id instead of subscription.id.
       throw new Error(
         `Failed to retrieve credit usage for team_id: ${team_id}`
       );
    }
    const totalCreditsUsed = creditUsages.reduce(
      (acc, usage) => acc + usage.credits_used,
      0
    );
    console.log("totalCreditsUsed", totalCreditsUsed);
    // 5. Compare the total credits used with the credits allowed by the plan.
    if (totalCreditsUsed + credits > FREE_CREDITS) {
      return {
        success: false,
        message: "Insufficient credits, please upgrade!",
      };
    }
    return { success: true, message: "Sufficient credits available" };
  }
  // 2. Get the price_id from the subscription.
  const { data: price, error: priceError } = await supabase_service
    .from("prices")
    .select("credits")
    .eq("id", subscription.price_id)
    .single();
  if (priceError) {
    throw new Error(
      `Failed to retrieve price for price_id: ${subscription.price_id}`
    );
  }
  // 4. Calculate the total credits used by the team within the current billing period.
  const { data: creditUsages, error: creditUsageError } = await supabase_service
    .from("credit_usage")
    .select("credits_used")
    .eq("subscription_id", subscription.id)
    .gte("created_at", subscription.current_period_start)
    .lte("created_at", subscription.current_period_end);
  if (creditUsageError) {
    throw new Error(
      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
    );
  }
  const totalCreditsUsed = creditUsages.reduce(
    (acc, usage) => acc + usage.credits_used,
    0
  );
  // 5. Compare the total credits used with the credits allowed by the plan.
  if (totalCreditsUsed + credits > price.credits) {
    return { success: false, message: "Insufficient credits, please upgrade!" };
  }
  return { success: true, message: "Sufficient credits available" };
 }
 // Count the total credits used by a team within the current billing period and return the remaining credits.
 export async function countCreditsAndRemainingForCurrentBillingPeriod(
  team_id: string
 ) {
  // 1. Retrieve the team's active subscription based on the team_id.
  const { data: subscription, error: subscriptionError } =
    await supabase_service
      .from("subscriptions")
      .select("id, price_id, current_period_start, current_period_end")
      .eq("team_id", team_id)
      .single();
  if (subscriptionError || !subscription) {
    // throw new Error(`Failed to retrieve subscription for team_id: ${team_id}`);
    // Free
    const { data: creditUsages, error: creditUsageError } =
      await supabase_service
        .from("credit_usage")
        .select("credits_used")
        .is("subscription_id", null)
        .eq("team_id", team_id);
    // .gte("created_at", subscription.current_period_start)
    // .lte("created_at", subscription.current_period_end);
    if (creditUsageError || !creditUsages) {
       // No subscription exists in this branch, so report the team_id instead of subscription.id.
       throw new Error(
         `Failed to retrieve credit usage for team_id: ${team_id}`
       );
    }
    const totalCreditsUsed = creditUsages.reduce(
      (acc, usage) => acc + usage.credits_used,
      0
    );
    // 4. Calculate remaining credits.
    const remainingCredits = FREE_CREDITS - totalCreditsUsed;
    return { totalCreditsUsed, remainingCredits, totalCredits: FREE_CREDITS };
  }
  // 2. Get the price_id from the subscription to retrieve the total credits available.
  const { data: price, error: priceError } = await supabase_service
    .from("prices")
    .select("credits")
    .eq("id", subscription.price_id)
    .single();
  if (priceError || !price) {
    throw new Error(
      `Failed to retrieve price for price_id: ${subscription.price_id}`
    );
  }
  // 3. Calculate the total credits used by the team within the current billing period.
  const { data: creditUsages, error: creditUsageError } = await supabase_service
    .from("credit_usage")
    .select("credits_used")
    .eq("subscription_id", subscription.id)
    .gte("created_at", subscription.current_period_start)
    .lte("created_at", subscription.current_period_end);
  if (creditUsageError || !creditUsages) {
    throw new Error(
      `Failed to retrieve credit usage for subscription_id: ${subscription.id}`
    );
  }
  const totalCreditsUsed = creditUsages.reduce(
    (acc, usage) => acc + usage.credits_used,
    0
  );
  // 4. Calculate remaining credits.
  const remainingCredits = price.credits - totalCreditsUsed;
  return { totalCreditsUsed, remainingCredits, totalCredits: price.credits };
 }
--- a/apps/api/src/services/logtail.ts
+++ b/apps/api/src/services/logtail.ts
@ -0,0 +1,4 @@
 const { Logtail } = require("@logtail/node");
 //dot env
 require("dotenv").config();
 export const logtail = new Logtail(process.env.LOGTAIL_KEY);
--- a/apps/api/src/services/queue-jobs.ts
+++ b/apps/api/src/services/queue-jobs.ts
@ -0,0 +1,17 @@
 import { Job, Queue } from "bull";
 import {
  getWebScraperQueue,
 } from "./queue-service";
 import { v4 as uuidv4 } from "uuid";
 import { WebScraperOptions } from "../types";
 export async function addWebScraperJob(
  webScraperOptions: WebScraperOptions,
  options: any = {}
 ): Promise<Job> {
  return await getWebScraperQueue().add(webScraperOptions, {
    ...options,
    jobId: uuidv4(),
  });
 }
--- a/apps/api/src/services/queue-service.ts
+++ b/apps/api/src/services/queue-service.ts
@ -0,0 +1,16 @@
 import Queue from "bull";
 let webScraperQueue;
 export function getWebScraperQueue() {
  if (!webScraperQueue) {
    webScraperQueue = new Queue("web-scraper", process.env.REDIS_URL, {
      settings: {
         lockDuration: 4 * 60 * 60 * 1000, // 4 hours in milliseconds
        lockRenewTime: 30 * 60 * 1000, // 30 minutes in milliseconds
      },
    });
    console.log("Web scraper queue created");
  }
  return webScraperQueue;
 }
--- a/apps/api/src/services/queue-worker.ts
+++ b/apps/api/src/services/queue-worker.ts
@ -0,0 +1,62 @@
 import { CustomError } from "../lib/custom-error";
 import { getWebScraperQueue } from "./queue-service";
 import "dotenv/config";
 import { logtail } from "./logtail";
 import { startWebScraperPipeline } from "../main/runWebScraper";
 import { WebScraperDataProvider } from "../scraper/WebScraper";
 import { callWebhook } from "./webhook";
 getWebScraperQueue().process(
  Math.floor(Number(process.env.NUM_WORKERS_PER_QUEUE ?? 8)),
  async function (job, done) {
    try {
      job.progress({
        current: 1,
        total: 100,
        current_step: "SCRAPING",
        current_url: "",
      });
      const { success, message, docs } = await startWebScraperPipeline({ job });
      const data = {
        success: success,
        result: {
          links: docs.map((doc) => {
            return { content: doc, source: doc.metadata.sourceURL };
          }),
        },
        project_id: job.data.project_id,
        error: message /* etc... */,
      };
      await callWebhook(job.data.team_id, data);
      done(null, data);
    } catch (error) {
      if (error instanceof CustomError) {
        // Here we handle the error, then save the failed job
        console.error(error.message); // or any other error handling
        logtail.error("Custom error while ingesting", {
          job_id: job.id,
          error: error.message,
          dataIngestionJob: error.dataIngestionJob,
        });
      }
      console.log(error);
      logtail.error("Overall error ingesting", {
        job_id: job.id,
        error: error.message,
      });
      const data = {
        success: false,
        project_id: job.data.project_id,
        error:
          "Something went wrong... Contact help@mendable.ai or try again." /* etc... */,
      };
      await callWebhook(job.data.team_id, data);
      done(null, data);
    }
  }
 );
--- a/apps/api/src/services/rate-limiter.ts
+++ b/apps/api/src/services/rate-limiter.ts
@ -0,0 +1,65 @@
 import { RateLimiterRedis } from "rate-limiter-flexible";
 import * as redis from "redis";
 const MAX_REQUESTS_PER_MINUTE_PREVIEW = 5;
 const MAX_CRAWLS_PER_MINUTE_STARTER = 2;
 const MAX_CRAWLS_PER_MINUTE_STANDARD = 4;
 const MAX_CRAWLS_PER_MINUTE_SCALE = 20;
 const MAX_REQUESTS_PER_MINUTE_ACCOUNT = 40;
 export const redisClient = redis.createClient({
  url: process.env.REDIS_URL,
  legacyMode: true,
 });
 export const previewRateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: "middleware",
  points: MAX_REQUESTS_PER_MINUTE_PREVIEW,
  duration: 60, // Duration in seconds
 });
 export const serverRateLimiter = new RateLimiterRedis({
  storeClient: redisClient,
  keyPrefix: "middleware",
  points: MAX_REQUESTS_PER_MINUTE_ACCOUNT,
  duration: 60, // Duration in seconds
 });
 export function crawlRateLimit(plan: string){
  if(plan === "standard"){
    return new RateLimiterRedis({
      storeClient: redisClient,
      keyPrefix: "middleware",
      points: MAX_CRAWLS_PER_MINUTE_STANDARD,
      duration: 60, // Duration in seconds
    });
  }else if(plan === "scale"){
    return new RateLimiterRedis({
      storeClient: redisClient,
      keyPrefix: "middleware",
      points: MAX_CRAWLS_PER_MINUTE_SCALE,
      duration: 60, // Duration in seconds
    });
  }
  return new RateLimiterRedis({
    storeClient: redisClient,
    keyPrefix: "middleware",
    points: MAX_CRAWLS_PER_MINUTE_STARTER,
    duration: 60, // Duration in seconds
  });
 }
 export function getRateLimiter(preview: boolean){
  if(preview){
    return previewRateLimiter;
  }else{
    return serverRateLimiter;
  }
 }
--- a/apps/api/src/services/redis.ts
+++ b/apps/api/src/services/redis.ts
@ -0,0 +1,38 @@
 import Redis from 'ioredis';
 // Initialize Redis client
 const redis = new Redis(process.env.REDIS_URL);
 /**
 * Set a value in Redis with an optional expiration time.
 * @param {string} key The key under which to store the value.
 * @param {string} value The value to store.
 * @param {number} [expire] Optional expiration time in seconds.
 */
 const setValue = async (key: string, value: string, expire?: number) => {
  if (expire) {
    await redis.set(key, value, 'EX', expire);
  } else {
    await redis.set(key, value);
  }
 };
 /**
 * Get a value from Redis.
 * @param {string} key The key of the value to retrieve.
 * @returns {Promise<string|null>} The value, if found, otherwise null.
 */
 const getValue = async (key: string): Promise<string | null> => {
  const value = await redis.get(key);
  return value;
 };
 /**
 * Delete a key from Redis.
 * @param {string} key The key to delete.
 */
 const deleteKey = async (key: string) => {
  await redis.del(key);
 };
 export { setValue, getValue, deleteKey };
--- a/apps/api/src/services/supabase.ts
+++ b/apps/api/src/services/supabase.ts
@ -0,0 +1,6 @@
 import { createClient } from "@supabase/supabase-js";
 export const supabase_service = createClient<any>(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_TOKEN,
 );
--- a/apps/api/src/services/webhook.ts
+++ b/apps/api/src/services/webhook.ts
@ -0,0 +1,41 @@
 import { supabase_service } from "./supabase";
 export const callWebhook = async (teamId: string, data: any) => {
  const { data: webhooksData, error } = await supabase_service
    .from('webhooks')
    .select('url')
    .eq('team_id', teamId)
    .limit(1);
  if (error) {
    console.error(`Error fetching webhook URL for team ID: ${teamId}`, error.message);
    return null;
  }
  if (!webhooksData || webhooksData.length === 0) {
    return null;
  }
  let dataToSend = [];
   // Failed jobs send a payload without `result`, so guard before reading `links`.
   if (data.result && data.result.links && data.result.links.length !== 0) {
    for (let i = 0; i < data.result.links.length; i++) {
      dataToSend.push({
        content: data.result.links[i].content.content,
        markdown: data.result.links[i].content.markdown,
        metadata: data.result.links[i].content.metadata,
      });
    }
  }
  await fetch(webhooksData[0].url, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      success: data.success,
      data: dataToSend,
      error: data.error || undefined,
    }),
  });
 }
--- a/apps/api/src/strings.ts
+++ b/apps/api/src/strings.ts
@ -0,0 +1,2 @@
 export const errorNoResults =
  "No results found, please check the URL or contact us at help@mendable.ai to file a ticket.";
--- a/apps/api/src/supabase_types.ts
+++ b/apps/api/src/supabase_types.ts
--- a/apps/api/src/types.ts
+++ b/apps/api/src/types.ts
@ -0,0 +1,26 @@
 export interface CrawlResult {
  source: string;
  content: string;
  options?: {
    summarize?: boolean;
    summarize_max_chars?: number;
  };
  metadata?: any;
  raw_context_id?: number | string;
  permissions?: any[];
 }
 export interface IngestResult {
  success: boolean;
  error: string;
  data: CrawlResult[];
 }
 export interface WebScraperOptions {
  url: string;
  mode: "crawl" | "single_urls" | "sitemap";
  crawlerOptions: any;
  team_id: string;
 }
--- a/apps/api/tsconfig.json
+++ b/apps/api/tsconfig.json
@ -0,0 +1,17 @@
 {
  "compilerOptions": {
    "rootDir": "./src",
    "lib": ["es6","DOM"],
    "target": "ES2020", // or higher
    "module": "commonjs",
    "esModuleInterop": true,
    "sourceMap": true,
    "outDir": "./dist/src",
    "moduleResolution": "node",
    "baseUrl": ".",
    "paths": {
      "*": ["node_modules/*", "src/types/*"],
    }
  },
  "include": ["src/","src/**/*", "services/db/supabase.ts", "utils/utils.ts", "services/db/supabaseEmbeddings.ts", "utils/EventEmmitter.ts", "src/services/queue-service.ts"]
 }
--- a/apps/playwright-service/.DS_Store
+++ b/apps/playwright-service/.DS_Store
--- a/apps/playwright-service/.gitignore
+++ b/apps/playwright-service/.gitignore
@ -0,0 +1,152 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
 *$py.class
 # C extensions
 *.so
 # Distribution / packaging
 .Python
 build/
 develop-eggs/
 dist/
 downloads/
 eggs/
 .eggs/
 lib/
 lib64/
 parts/
 sdist/
 var/
 wheels/
 share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
 MANIFEST
 # PyInstaller
 #  Usually these files are written by a python script from a template
 #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 *.manifest
 *.spec
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
 .nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
 *.py,cover
 .hypothesis/
 .pytest_cache/
 cover/
 # Translations
 *.mo
 *.pot
 # Django stuff:
 *.log
 local_settings.py
 db.sqlite3
 db.sqlite3-journal
 # Flask stuff:
 instance/
 .webassets-cache
 # Scrapy stuff:
 .scrapy
 # Sphinx documentation
 docs/_build/
 # PyBuilder
 .pybuilder/
 target/
 # Jupyter Notebook
 .ipynb_checkpoints
 # IPython
 profile_default/
 ipython_config.py
 # pyenv
 #   For a library or package, you might want to ignore these files since the code is
 #   intended to run in multiple environments; otherwise, check them in:
 # .python-version
 # pipenv
 #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 #   install all needed dependencies.
 #Pipfile.lock
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 __pypackages__/
 # Celery stuff
 celerybeat-schedule
 celerybeat.pid
 # SageMath parsed files
 *.sage.py
 # Environments
 .env
 .venv
 env/
 venv/
 ENV/
 env.bak/
 venv.bak/
 # Spyder project settings
 .spyderproject
 .spyproject
 # Rope project settings
 .ropeproject
 # mkdocs documentation
 /site
 # mypy
 .mypy_cache/
 .dmypy.json
 dmypy.json
 # Pyre type checker
 .pyre/
 # pytype static type analyzer
 .pytype/
 # Cython debug symbols
 cython_debug/
 # PyCharm
 #  JetBrains specific template is maintainted in a separate JetBrains.gitignore that can
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
--- a/apps/playwright-service/Dockerfile
+++ b/apps/playwright-service/Dockerfile
@ -0,0 +1,38 @@
 FROM python:3.11-slim
 ENV PYTHONUNBUFFERED=1
 ENV PYTHONDONTWRITEBYTECODE=1
 ENV PIP_DISABLE_PIP_VERSION_CHECK=1
 RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    libstdc++6
 WORKDIR /app
 # Install Python dependencies
 COPY requirements.txt ./
 # Remove py which is pulled in by retry, py is not needed and is a CVE
 RUN pip install --no-cache-dir --upgrade -r requirements.txt && \
    pip uninstall -y py && \
    playwright install chromium && playwright install-deps chromium && \
    ln -s /usr/local/bin/supervisord /usr/bin/supervisord
 # Cleanup for CVEs and size reduction
 # https://github.com/tornadoweb/tornado/issues/3107
 # xserver-common and xvfb included by playwright installation but not needed after
 # perl-base is part of the base Python Debian image but not needed for Danswer functionality
 # perl-base could only be removed with --allow-remove-essential
 COPY . ./
 EXPOSE $PORT
 # run fast api hypercorn
 CMD hypercorn main:app --bind [::]:$PORT
 # CMD ["hypercorn", "main:app", "--bind", "[::]:$PORT"]
 # CMD ["sh", "-c", "uvicorn main:app --host 0.0.0.0 --port $PORT"]
--- a/apps/playwright-service/README.md
+++ b/apps/playwright-service/README.md
--- a/apps/playwright-service/main.py
+++ b/apps/playwright-service/main.py
@ -0,0 +1,28 @@
 from fastapi import FastAPI
 from fastapi.responses import JSONResponse
 from playwright.async_api import async_playwright
 from pydantic import BaseModel
 app = FastAPI()
 class UrlModel(BaseModel):
    url: str
 @app.post("/html")  # POST so the target URL can be sent in the request body
 async def root(body: UrlModel):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            context = await browser.new_context()
            page = await context.new_page()
            await page.goto(body.url)
            page_content = await page.content()  # Full HTML of the rendered page
        finally:
            # Close the browser even if navigation fails, so Chromium processes do not leak
            await browser.close()
        return JSONResponse(content={"content": page_content})
--- a/apps/playwright-service/requests.http
+++ b/apps/playwright-service/requests.http
--- a/apps/playwright-service/requirements.txt
+++ b/apps/playwright-service/requirements.txt
@ -0,0 +1,4 @@
 hypercorn==0.16.0
 fastapi==0.110.0
 playwright==1.42.0
 uvicorn
--- a/apps/playwright-service/runtime.txt
+++ b/apps/playwright-service/runtime.txt
@ -0,0 +1 @@
 3.11
--- a/apps/python-sdk/README.md
+++ b/apps/python-sdk/README.md
@ -0,0 +1,91 @@
 # Firecrawl Python SDK
 The Firecrawl Python SDK is a library that allows you to easily scrape and crawl websites, and output the data in a format ready for use with language models (LLMs). It provides a simple and intuitive interface for interacting with the Firecrawl API.
 ## Installation
 To install the Firecrawl Python SDK, you can use pip:
 ```bash
 pip install firecrawl-py
 ```
 ## Usage
 1. Get an API key from [firecrawl.dev](https://firecrawl.dev)
 2. Set the API key as an environment variable named `FIRECRAWL_API_KEY` or pass it as a parameter to the `FirecrawlApp` class.
 Here's an example of how to use the SDK:
 ```python
 from firecrawl import FirecrawlApp
 # Initialize the FirecrawlApp with your API key
 app = FirecrawlApp(api_key='your_api_key')
 # Scrape a single URL
 url = 'https://mendable.ai'
 scraped_data = app.scrape_url(url)
 # Crawl a website
 crawl_url = 'https://mendable.ai'
 crawl_params = {
    'crawlerOptions': {
        'excludes': ['blog/*'],
        'includes': [], # leave empty for all pages
        'limit': 1000,
    }
 }
 crawl_result = app.crawl_url(crawl_url, params=crawl_params)
 ```
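The API key lookup from step 2 can be sketched as follows. This is only an illustration of the behaviour described above (an explicit argument wins, otherwise the `FIRECRAWL_API_KEY` environment variable is used). `resolve_api_key` and its `env` parameter are hypothetical names and are not part of the SDK.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(api_key: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit key if given, else fall back to FIRECRAWL_API_KEY.

    `resolve_api_key` and `env` are illustrative names, not SDK API.
    """
    # Allow a custom mapping so the lookup can be exercised without touching os.environ.
    env = os.environ if env is None else env
    key = api_key or env.get("FIRECRAWL_API_KEY")
    if not key:
        # In this sketch an empty string is treated the same as a missing key.
        raise ValueError("No API key provided")
    return key
```

Passing the key explicitly keeps scripts self-contained, while the environment variable avoids hard-coding secrets in source files.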
 ### Scraping a URL
 To scrape a single URL, use the `scrape_url` method. It takes the URL as a parameter and returns the scraped data as a dictionary.
 ```python
 url = 'https://example.com'
 scraped_data = app.scrape_url(url)
 ```
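Because the result is a plain dictionary, it can be inspected with ordinary dictionary access. The sketch below assumes the result carries `content`, `markdown`, and `metadata` keys, mirroring the document shape built by the API code. The exact keys returned by the SDK are not stated in this README, so `.get` is used defensively. `summarize_scrape` is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_scrape(scraped_data: Dict[str, Any]) -> Dict[str, Any]:
    """Build a small summary from a scrape result (key names are assumptions)."""
    # Missing or null fields fall back to empty values instead of raising KeyError.
    metadata = scraped_data.get("metadata") or {}
    markdown = scraped_data.get("markdown") or ""
    return {
        "source_url": metadata.get("sourceURL"),
        "title": metadata.get("title"),
        "markdown_chars": len(markdown),
    }
```

With a real client this could be called as `summarize_scrape(app.scrape_url(url))`, provided the assumed keys match what the API returns.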
 ### Crawling a Website
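 To crawl a website, use the `crawl_url` method. It takes the starting URL and optional parameters as arguments. The `params` argument allows you to specify additional options for the crawl job, such as the maximum number of pages to crawl, allowed domains, and the output format. In the example below, `crawlerOptions` excludes paths matching `blog/*`, leaves `includes` empty so that all other pages are allowed, and caps the crawl at 1000 pages with `limit`.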
 The `wait_until_done` parameter determines whether the method should wait for the crawl job to complete before returning the result. If set to `True`, the method will periodically check the status of the crawl job until it is completed or the specified `timeout` (in seconds) is reached. If set to `False`, the method will return immediately with the job ID, and you can manually check the status of the crawl job using the `check_crawl_status` method.
 ```python
 crawl_url = 'https://example.com'
 crawl_params = {
    'crawlerOptions': {
        'excludes': ['blog/*'],
        'includes': [], # leave empty for all pages
        'limit': 1000,
    }
 }
 crawl_result = app.crawl_url(crawl_url, params=crawl_params, wait_until_done=True, timeout=5)
 ```
 If `wait_until_done` is set to `True`, the `crawl_url` method will return the crawl result once the job is completed. If the job fails or is stopped, an exception will be raised.
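The decision made on each status check can be sketched as follows. This is a standalone approximation of the loop body in `firecrawl.py`, not SDK code; `poll_interval` and `handle_status` are hypothetical names. Note that in the SDK source the `timeout` value is used as the delay between checks, with a floor of 2 seconds.

```python
WAITING_STATUSES = {'active', 'paused', 'pending', 'queued'}


def poll_interval(timeout):
    """Delay between status checks: the given value, but never below 2."""
    return max(timeout, 2)


def handle_status(status_data):
    """Return the crawl data when finished, or None while the job is running.

    Raises Exception for any other status, as the SDK loop does.
    """
    status = status_data['status']
    if status == 'completed':
        if 'data' in status_data:
            return status_data['data']
        raise Exception('Crawl job completed but no data was returned')
    if status in WAITING_STATUSES:
        return None  # caller should sleep and check again
    raise Exception(f'Crawl job failed or was stopped. Status: {status}')
```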
 ### Checking Crawl Status
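 To check the status of a crawl job, use the `check_crawl_status` method. It takes the job ID as a parameter and returns the current status of the crawl job. The job ID is only returned when `crawl_url` is called with `wait_until_done=False`.
 ```python
 # Start the job without waiting so that crawl_url returns {'jobId': ...}
 crawl_result = app.crawl_url('https://example.com', wait_until_done=False)
 job_id = crawl_result['jobId']
 status = app.check_crawl_status(job_id)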
 ```
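The polling loop in `firecrawl.py` has no overall deadline, so callers who need one can poll manually. The sketch below is one possible approach, not part of the SDK: `wait_for_crawl` is a hypothetical helper, and it assumes the status response contains a `status` field and, once completed, a `data` field, as the SDK source suggests.

```python
import time


def wait_for_crawl(app, job_id, poll_seconds=2, max_wait_seconds=60,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll check_crawl_status until the job completes or the deadline passes.

    `app` is any object with a check_crawl_status(job_id) method, such as a
    FirecrawlApp. `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + max_wait_seconds
    while True:
        status = app.check_crawl_status(job_id)
        if status['status'] == 'completed':
            return status.get('data')
        if status['status'] not in ('active', 'paused', 'pending', 'queued'):
            raise RuntimeError(f"Crawl job ended with status: {status['status']}")
        if clock() >= deadline:
            raise TimeoutError(f'Crawl job {job_id} did not finish in time')
        sleep(poll_seconds)
```

A typical call would start the job with `wait_until_done=False` and then pass the returned `jobId` to `wait_for_crawl(app, job_id, max_wait_seconds=300)`.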
 ## Error Handling
 The SDK raises an exception when a request fails. A missing API key raises `ValueError`. API errors raise a generic `Exception` whose message includes the HTTP status code and, for status codes 402, 409, and 500, the error message returned by the API.
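Since the SDK source raises plain `Exception` instances rather than custom error classes, callers have to catch `Exception` broadly. A minimal sketch of one way to wrap a call; `scrape_or_none` is a hypothetical helper, not part of the SDK:

```python
def scrape_or_none(app, url):
    """Return the scraped data, or None if the SDK raised an error.

    `app` is any object with a scrape_url(url) method, such as a FirecrawlApp.
    """
    try:
        return app.scrape_url(url)
    except Exception as exc:  # the SDK raises generic Exception instances
        print(f'Scrape failed for {url}: {exc}')
        return None
```

Returning `None` is a design choice for batch jobs that should keep going after a failed URL; re-raising is usually the better default in application code.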
 ## Contributing
 Contributions to the Firecrawl Python SDK are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
 ## License
 The Firecrawl Python SDK is open-source and released under the [MIT License](https://opensource.org/licenses/MIT).
--- a/apps/python-sdk/build/lib/firecrawl/__init__.py
+++ b/apps/python-sdk/build/lib/firecrawl/__init__.py
@ -0,0 +1 @@
 from .firecrawl import FirecrawlApp
--- a/apps/python-sdk/build/lib/firecrawl/firecrawl.py
+++ b/apps/python-sdk/build/lib/firecrawl/firecrawl.py
@ -0,0 +1,96 @@
 import os
 import requests
 class FirecrawlApp:
    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
        if self.api_key is None:
            raise ValueError('No API key provided')
    def scrape_url(self, url, params=None):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = requests.post(
            'https://api.firecrawl.dev/v0/scrape',
            headers=headers,
            json=json_data
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] == True:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        elif response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
        headers = self._prepare_headers()
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
        if response.status_code == 200:
            job_id = response.json().get('jobId')
            if wait_until_done:
                return self._monitor_job_status(job_id, headers, timeout)
            else:
                return {'jobId': job_id}
        else:
            self._handle_error(response, 'start crawl job')
    def check_crawl_status(self, job_id):
        headers = self._prepare_headers()
        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
        if response.status_code == 200:
            return response.json()
        else:
            self._handle_error(response, 'check crawl status')
    def _prepare_headers(self):
        return {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
    def _post_request(self, url, data, headers):
        return requests.post(url, headers=headers, json=data)
    def _get_request(self, url, headers):
        return requests.get(url, headers=headers)
    def _monitor_job_status(self, job_id, headers, timeout):
        import time
        while True:
            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
            if status_response.status_code == 200:
                status_data = status_response.json()
                if status_data['status'] == 'completed':
                    if 'data' in status_data:
                        return status_data['data']
                    else:
                        raise Exception('Crawl job completed but no data was returned')
                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
                    if timeout < 2:
                        timeout = 2
                    time.sleep(timeout)  # Wait for the specified timeout before checking again
                else:
                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
            else:
                self._handle_error(status_response, 'check crawl status')
    def _handle_error(self, response, action):
        if response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
--- a/apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
+++ b/apps/python-sdk/dist/firecrawl-py-0.0.5.tar.gz
--- a/apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
+++ b/apps/python-sdk/dist/firecrawl_py-0.0.5-py3-none-any.whl
--- a/apps/python-sdk/example.py
+++ b/apps/python-sdk/example.py
@ -0,0 +1,13 @@
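 from firecrawl import FirecrawlApp
 # Reads the API key from the FIRECRAWL_API_KEY environment variable instead of hardcoding it
 app = FirecrawlApp()
 # With wait_until_done=True (the default), crawl_url returns the list of crawled pages
 crawl_result = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}})
 print(crawl_result[0]['markdown'])
 # With wait_until_done=False, crawl_url returns {'jobId': ...} immediately
 job = app.crawl_url('mendable.ai', {'crawlerOptions': {'excludes': ['blog/*']}}, wait_until_done=False)
 job_id = job['jobId']
 print(job_id)
 status = app.check_crawl_status(job_id)
 print(status)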
--- a/apps/python-sdk/firecrawl/__init__.py
+++ b/apps/python-sdk/firecrawl/__init__.py
@ -0,0 +1 @@
 from .firecrawl import FirecrawlApp
--- a/apps/python-sdk/firecrawl/__pycache__/__init__.cpython-311.pyc
+++ b/apps/python-sdk/firecrawl/__pycache__/__init__.cpython-311.pyc
--- a/apps/python-sdk/firecrawl/__pycache__/firecrawl.cpython-311.pyc
+++ b/apps/python-sdk/firecrawl/__pycache__/firecrawl.cpython-311.pyc
--- a/apps/python-sdk/firecrawl/firecrawl.py
+++ b/apps/python-sdk/firecrawl/firecrawl.py
@ -0,0 +1,96 @@
 import os
 import requests
 class FirecrawlApp:
    def __init__(self, api_key=None):
        self.api_key = api_key or os.getenv('FIRECRAWL_API_KEY')
        if self.api_key is None:
            raise ValueError('No API key provided')
    def scrape_url(self, url, params=None):
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = requests.post(
            'https://api.firecrawl.dev/v0/scrape',
            headers=headers,
            json=json_data
        )
        if response.status_code == 200:
            response = response.json()
            if response['success'] == True:
                return response['data']
            else:
                raise Exception(f'Failed to scrape URL. Error: {response["error"]}')
        elif response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Failed to scrape URL. Status code: {response.status_code}')
    def crawl_url(self, url, params=None, wait_until_done=True, timeout=2):
        headers = self._prepare_headers()
        json_data = {'url': url}
        if params:
            json_data.update(params)
        response = self._post_request('https://api.firecrawl.dev/v0/crawl', json_data, headers)
        if response.status_code == 200:
            job_id = response.json().get('jobId')
            if wait_until_done:
                return self._monitor_job_status(job_id, headers, timeout)
            else:
                return {'jobId': job_id}
        else:
            self._handle_error(response, 'start crawl job')
    def check_crawl_status(self, job_id):
        headers = self._prepare_headers()
        response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
        if response.status_code == 200:
            return response.json()
        else:
            self._handle_error(response, 'check crawl status')
    def _prepare_headers(self):
        return {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
    def _post_request(self, url, data, headers):
        return requests.post(url, headers=headers, json=data)
    def _get_request(self, url, headers):
        return requests.get(url, headers=headers)
    def _monitor_job_status(self, job_id, headers, timeout):
        import time
        while True:
            status_response = self._get_request(f'https://api.firecrawl.dev/v0/crawl/status/{job_id}', headers)
            if status_response.status_code == 200:
                status_data = status_response.json()
                if status_data['status'] == 'completed':
                    if 'data' in status_data:
                        return status_data['data']
                    else:
                        raise Exception('Crawl job completed but no data was returned')
                elif status_data['status'] in ['active', 'paused', 'pending', 'queued']:
                    if timeout < 2:
                        timeout = 2
                    time.sleep(timeout)  # Wait for the specified timeout before checking again
                else:
                    raise Exception(f'Crawl job failed or was stopped. Status: {status_data["status"]}')
            else:
                self._handle_error(status_response, 'check crawl status')
    def _handle_error(self, response, action):
        if response.status_code in [402, 409, 500]:
            error_message = response.json().get('error', 'Unknown error occurred')
            raise Exception(f'Failed to {action}. Status code: {response.status_code}. Error: {error_message}')
        else:
            raise Exception(f'Unexpected error occurred while trying to {action}. Status code: {response.status_code}')
--- a/apps/python-sdk/firecrawl_py.egg-info/PKG-INFO
+++ b/apps/python-sdk/firecrawl_py.egg-info/PKG-INFO
@ -0,0 +1,7 @@
 Metadata-Version: 2.1
 Name: firecrawl-py
 Version: 0.0.5
 Summary: Python SDK for Firecrawl API
 Home-page: https://github.com/mendableai/firecrawl-py
 Author: Mendable.ai
 Author-email: nick@mendable.ai
--- a/apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/SOURCES.txt
@ -0,0 +1,9 @@
 README.md
 setup.py
 firecrawl/__init__.py
 firecrawl/firecrawl.py
 firecrawl_py.egg-info/PKG-INFO
 firecrawl_py.egg-info/SOURCES.txt
 firecrawl_py.egg-info/dependency_links.txt
 firecrawl_py.egg-info/requires.txt
 firecrawl_py.egg-info/top_level.txt
--- a/apps/python-sdk/firecrawl_py.egg-info/dependency_links.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/dependency_links.txt
@ -0,0 +1 @@
--- a/apps/python-sdk/firecrawl_py.egg-info/requires.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/requires.txt
@ -0,0 +1 @@
 requests
--- a/apps/python-sdk/firecrawl_py.egg-info/top_level.txt
+++ b/apps/python-sdk/firecrawl_py.egg-info/top_level.txt
@ -0,0 +1 @@
 firecrawl
--- a/apps/python-sdk/setup.py
+++ b/apps/python-sdk/setup.py
@ -0,0 +1,14 @@
 from setuptools import setup, find_packages
 setup(
    name='firecrawl-py',
    version='0.0.5',
    url='https://github.com/mendableai/firecrawl-py',
    author='Mendable.ai',
    author_email='nick@mendable.ai',
    description='Python SDK for Firecrawl API',
    packages=find_packages(),    
    install_requires=[
        'requests',
    ],
 )
--- a/apps/www/README.md
+++ b/apps/www/README.md
@ -0,0 +1 @@
 Coming soon!