Scrape Web Content with LLM
Use Llama 3.1 to get structured data from web pages
Getting structured data from web pages can be a tedious task. You often need to:
- Understand the structure of the page
- Write code to extract the data (usually CSS or XPath selectors)
- Redo pretty much everything once the page inevitably changes
That's without even mentioning anti-scraping measures like CAPTCHAs and rate limiting, which we won't cover in this guide.
Luckily, you can get pretty far using libraries like Playwright and open models like Llama 3.1. Our plan of attack involves two steps (a rough sketch of the first one follows the list):
- Get the content of a web page as Markdown
- Use Llama 3.1 to extract structured data from the Markdown content
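To make the plan concrete, here is a minimal sketch of the first step. It assumes you have installed Playwright and html2text (`pip install playwright html2text`, then `playwright install chromium`); the URL and function name are illustrative, and the tutorial walks through each piece in detail:

```python
from playwright.sync_api import sync_playwright
import html2text


def fetch_page_as_markdown(url: str) -> str:
    """Render the page in a headless browser (JavaScript included) and return it as Markdown."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

    converter = html2text.HTML2Text()
    converter.ignore_links = False  # keep links in the Markdown output
    return converter.handle(html)


if __name__ == "__main__":
    # Illustrative URL, not part of the tutorial
    print(fetch_page_as_markdown("https://example.com")[:500])
```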
Tutorial Goals
In this tutorial, you will:
- Set up Playwright to get web page content (including JavaScript)
- Use html2text to convert HTML to Markdown
- Use Llama 3.1 with Pydantic models to get structured (custom) output
- Save extracted content as CSV files for later use (see the sketch after this list)
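Here is one possible sketch of the extraction and saving steps. It assumes Llama 3.1 is served locally through Ollama's Python client and uses its JSON mode together with a hypothetical `Article` schema; the model setup, schema, and prompt used later in the tutorial may differ:

```python
import ollama
import pandas as pd
from pydantic import BaseModel


class Article(BaseModel):
    """Hypothetical schema describing the fields we want to extract."""
    title: str
    author: str
    published_at: str


def extract_article(markdown: str) -> Article:
    """Ask Llama 3.1 (via Ollama) to return the fields as JSON, then validate with Pydantic."""
    prompt = (
        "Extract the title, author and publication date from the page below. "
        "Respond with JSON only.\n\n" + markdown
    )
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # request JSON output from the model
    )
    return Article.model_validate_json(response["message"]["content"])


if __name__ == "__main__":
    # Illustrative Markdown; in the tutorial this comes from the Playwright step
    sample_markdown = "# Example Post\n\nBy Jane Doe, 2024-08-01\n\nBody text..."
    article = extract_article(sample_markdown)
    pd.DataFrame([article.model_dump()]).to_csv("articles.csv", index=False)
```

Validating the model's response against a Pydantic model gives you an early, explicit error when the output doesn't match the expected fields, which is exactly what you want before writing rows to a CSV.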