Learn how to use Firecrawl and Groq to extract structured data from a web page in a few lines of code. With Groq's fast inference speeds and Firecrawl's parallelization, you can extract data from web pages *super* fast.
Install the Python dependencies, including `groq` and `firecrawl-py`.
```bash
pip install groq firecrawl-py
```
## Getting your Groq and Firecrawl API Keys
To use Groq and Firecrawl, you will need to get your API keys. You can get your Groq API key from [here](https://groq.com) and your Firecrawl API key from [here](https://firecrawl.dev).
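Rather than pasting keys directly into source files, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `GROQ_API_KEY` and `FIRECRAWL_API_KEY` and the helper `load_api_keys` are assumptions made for this example, so rename them to fit your setup.

```python
import os

# Assumed environment variable names; rename to match your own setup.
REQUIRED_KEYS = ("GROQ_API_KEY", "FIRECRAWL_API_KEY")


def load_api_keys(env=None):
    """Return the required API keys from `env` (defaults to os.environ).

    Raises RuntimeError naming every key that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Calling `load_api_keys()` at startup fails early with a clear message if a key is missing, and the returned values can be passed as the `api_key` argument when the clients are created.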
## Load website with Firecrawl
To get all the data from a web page in the cleanest possible format, we will use [Firecrawl](https://firecrawl.dev). It handles bypassing JS-blocked websites, extracting the main content, and outputting it in an LLM-readable format for increased accuracy.
Here is how we will scrape a website URL using Firecrawl. We will also set `pageOptions` to extract only the main content (`onlyMainContent: True`) of the page, excluding navs, footers, etc.
```python
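from firecrawl import FirecrawlApp  # Import the Firecrawl client

# Placeholder values: substitute your own API key and target page
firecrawl = FirecrawlApp(api_key="fc-YOUR_FIRECRAWL_API_KEY")
url = "https://example.com"

page_content = firecrawl.scrape_url(
    url=url,  # Target URL to scrape
    params={
        "pageOptions": {
            "onlyMainContent": True  # Ignore navs, footers, etc.
        }
    },
)
print(page_content)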
```
Perfect, now we have clean data from the website - ready to be fed to the LLM for data extraction.
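Before passing the page to the model, it can help to pull out just the text and cap its length so the prompt stays within the model's context window. The helper below is a minimal sketch, not part of either SDK: `prepare_page_text` is a hypothetical name, the `markdown` and `content` keys are assumptions about the shape of the scrape result (check what your `print(page_content)` shows), and the 20,000-character default is an arbitrary stand-in for a real token budget.

```python
def prepare_page_text(page_content, max_chars=20000):
    """Pick the text field from a scrape result and cap its length.

    Assumption: the result is either a plain string or a dict that carries
    the page text under "markdown" or "content".
    """
    if isinstance(page_content, dict):
        text = page_content.get("markdown") or page_content.get("content") or ""
    else:
        text = str(page_content)
    text = text.strip()
    # Truncate by characters as a rough stand-in for a token budget.
    return text[:max_chars]
```

Character truncation is crude, but it is cheap and dependency-free; a tokenizer-based limit would be more precise if the page is close to the model's context size.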
## Extraction and Generation
Now that we have the website data, let's use Groq to pull out the information we need. We'll use Groq's Llama 3 model in JSON mode and pick out certain fields from the page content.
We are using the Llama 3 8B model for this example. Feel free to use bigger models for improved results.
```python
import json
from groq import Groq
client = Groq(
    api_key="gsk_YOUR_GROQ_API_KEY",  # Replace with your actual Groq API key
)
# Here we define the fields we want to extract from the page content