mirror of
https://github.com/mendableai/firecrawl.git
synced 2024-11-16 19:58:08 +08:00
8d467c8ca7
* feat: use strictNullChecking * feat: switch logger to Winston * feat(scrapeURL): first batch * fix(scrapeURL): error swallow * fix(scrapeURL): add timeout to EngineResultsTracker * fix(scrapeURL): report unexpected error to sentry * chore: remove unused modules * feat(transfomers/coerce): warn when a format's response is missing * feat(scrapeURL): feature flag priorities, engine quality sorting, PDF and DOCX support * (add note) * feat(scrapeURL): wip readme * feat(scrapeURL): LLM extract * feat(scrapeURL): better warnings * fix(scrapeURL/engines/fire-engine;playwright): fix screenshot * feat(scrapeURL): add forceEngine internal option * feat(scrapeURL/engines): scrapingbee * feat(scrapeURL/transformars): uploadScreenshot * feat(scrapeURL): more intense tests * bunch of stuff * get rid of WebScraper (mostly) * adapt batch scrape * add staging deploy workflow * fix yaml * fix logger issues * fix v1 test schema * feat(scrapeURL/fire-engine/chrome-cdp): remove wait inserts on actions * scrapeURL: v0 backwards compat * logger fixes * feat(scrapeurl): v0 returnOnlyUrls support * fix(scrapeURL/v0): URL leniency * fix(batch-scrape): ts non-nullable * fix(scrapeURL/fire-engine/chromecdp): fix wait action * fix(logger): remove error debug key * feat(requests.http): use dotenv expression * fix(scrapeURL/extractMetadata): extract custom metadata * fix crawl option conversion * feat(scrapeURL): Add retry logic to robustFetch * fix(scrapeURL): crawl stuff * fix(scrapeURL): LLM extract * fix(scrapeURL/v0): search fix * fix(tests/v0): grant larger response size to v0 crawl status * feat(scrapeURL): basic fetch engine * feat(scrapeURL): playwright engine * feat(scrapeURL): add url-specific parameters * Update readme and examples * added e2e tests for most parameters. Still a few actions, location and iframes to be done. * fixed type * Nick: * Update scrape.ts * Update index.ts * added actions and base64 check * Nick: skipTls feature flag? * 403 * todo * todo * fixes * yeet headers from url specific params * add warning when final engine has feature deficit * expose engine results tracker for ScrapeEvents implementation * ingest scrape events * fixed some tests * comment * Update index.test.ts * fixed rawHtml * Update index.test.ts * update comments * move geolocation to global f-e option, fix removeBase64Images * Nick: * trim url-specific params * Update index.ts --------- Co-authored-by: Eric Ciarla <ericciarla@yahoo.com> Co-authored-by: rafaelmmiller <8574157+rafaelmmiller@users.noreply.github.com> Co-authored-by: Nicolas <nicolascamara29@gmail.com>
119 lines
3.7 KiB
Markdown
119 lines
3.7 KiB
Markdown
# Contributors guide:
|
||
|
||
Welcome to [Firecrawl](https://firecrawl.dev) 🔥! Here are some instructions on how to get the project locally, so you can run it on your own (and contribute)
|
||
|
||
If you're contributing, note that the process is similar to other open source repos i.e. (fork firecrawl, make changes, run tests, PR). If you have any questions, and would like help gettin on board, reach out to hello@mendable.ai for more or submit an issue!
|
||
|
||
## Running the project locally
|
||
|
||
First, start by installing dependencies:
|
||
|
||
1. node.js [instructions](https://nodejs.org/en/learn/getting-started/how-to-install-nodejs)
|
||
2. pnpm [instructions](https://pnpm.io/installation)
|
||
3. redis [instructions](https://redis.io/docs/latest/operate/oss_and_stack/install/install-redis/)
|
||
|
||
Set environment variables in a .env in the /apps/api/ directory you can copy over the template in .env.example.
|
||
|
||
To start, we wont set up authentication, or any optional sub services (pdf parsing, JS blocking support, AI features )
|
||
|
||
.env:
|
||
|
||
```
|
||
# ===== Required ENVS ======
|
||
NUM_WORKERS_PER_QUEUE=8
|
||
PORT=3002
|
||
HOST=0.0.0.0
|
||
REDIS_URL=redis://localhost:6379
|
||
REDIS_RATE_LIMIT_URL=redis://localhost:6379
|
||
|
||
## To turn on DB authentication, you need to set up supabase.
|
||
USE_DB_AUTHENTICATION=false
|
||
|
||
# ===== Optional ENVS ======
|
||
|
||
# Supabase Setup (used to support DB authentication, advanced logging, etc.)
|
||
SUPABASE_ANON_TOKEN=
|
||
SUPABASE_URL=
|
||
SUPABASE_SERVICE_TOKEN=
|
||
|
||
# Other Optionals
|
||
TEST_API_KEY= # use if you've set up authentication and want to test with a real API key
|
||
SCRAPING_BEE_API_KEY= #Set if you'd like to use scraping Be to handle JS blocking
|
||
OPENAI_API_KEY= # add for LLM dependednt features (image alt generation, etc.)
|
||
BULL_AUTH_KEY= @
|
||
PLAYWRIGHT_MICROSERVICE_URL= # set if you'd like to run a playwright fallback
|
||
LLAMAPARSE_API_KEY= #Set if you have a llamaparse key you'd like to use to parse pdfs
|
||
SLACK_WEBHOOK_URL= # set if you'd like to send slack server health status messages
|
||
POSTHOG_API_KEY= # set if you'd like to send posthog events like job logs
|
||
POSTHOG_HOST= # set if you'd like to send posthog events like job logs
|
||
|
||
|
||
```
|
||
|
||
### Installing dependencies
|
||
|
||
First, install the dependencies using pnpm.
|
||
|
||
```bash
|
||
# cd apps/api # to make sure you're in the right folder
|
||
pnpm install # make sure you have pnpm version 9+!
|
||
```
|
||
|
||
### Running the project
|
||
|
||
You're going to need to open 3 terminals. Here is [a video guide accurate as of Oct 2024](https://youtu.be/LHqg5QNI4UY).
|
||
|
||
### Terminal 1 - setting up redis
|
||
|
||
Run the command anywhere within your project
|
||
|
||
```bash
|
||
redis-server
|
||
```
|
||
|
||
### Terminal 2 - setting up workers
|
||
|
||
Now, navigate to the apps/api/ directory and run:
|
||
|
||
```bash
|
||
pnpm run workers
|
||
# if you are going to use the [llm-extract feature](https://github.com/mendableai/firecrawl/pull/586/), you should also export OPENAI_API_KEY=sk-______
|
||
```
|
||
|
||
This will start the workers who are responsible for processing crawl jobs.
|
||
|
||
### Terminal 3 - setting up the main server
|
||
|
||
To do this, navigate to the apps/api/ directory and run if you don’t have this already, install pnpm here: https://pnpm.io/installation
|
||
Next, run your server with:
|
||
|
||
```bash
|
||
pnpm run start
|
||
```
|
||
|
||
### Terminal 3 - sending our first request.
|
||
|
||
Alright: now let’s send our first request.
|
||
|
||
```curl
|
||
curl -X GET http://localhost:3002/test
|
||
```
|
||
|
||
This should return the response Hello, world!
|
||
|
||
If you’d like to test the crawl endpoint, you can run this
|
||
|
||
```curl
|
||
curl -X POST http://localhost:3002/v1/crawl \
|
||
-H 'Content-Type: application/json' \
|
||
-d '{
|
||
"url": "https://mendable.ai"
|
||
}'
|
||
```
|
||
|
||
## Tests:
|
||
|
||
The best way to do this is run the test with `npm run test:local-no-auth` if you'd like to run the tests without authentication.
|
||
|
||
If you'd like to run the tests with authentication, run `npm run test:prod`
|