Scrape Web Content with LLM
Use Llama 3.1 to get structured data from web pages
Getting structured data from web pages can be a tedious task. You often need to:
- Understand the structure of the page
- Write code to extract the data (usually CSS or XPath selectors)
- Redo pretty much everything once the page inevitably changes
That's without even mentioning anti-scraping measures like CAPTCHAs and rate limiting, which we won't cover in this guide.
Luckily, you can get pretty far using libraries like Playwright and open models like Llama 3.1. Our plan of attack involves two steps (a rough sketch of the first one follows the list):
- Get the content of a web page as Markdown
- Use Llama 3.1 to extract structured data from the Markdown content
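To make the plan concrete, here is a minimal sketch of the first step. It assumes you have installed Playwright and html2text (`pip install playwright html2text`, then `playwright install chromium`); the URL and function name are illustrative, and the tutorial walks through each piece in detail:

```python
from playwright.sync_api import sync_playwright
import html2text


def fetch_page_as_markdown(url: str) -> str:
    """Render the page in a headless browser (JavaScript included) and return it as Markdown."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

    converter = html2text.HTML2Text()
    converter.ignore_links = False  # keep links in the Markdown output
    return converter.handle(html)


if __name__ == "__main__":
    # Illustrative URL, not part of the tutorial
    print(fetch_page_as_markdown("https://example.com")[:500])
```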
Tutorial Goals
In this tutorial, you will:
- Set up Playwright to get web page content (including JavaScript)
- Use html2text to convert HTML to Markdown
- Use Llama 3.1 with Pydantic models to get structured (custom) output
- Save extracted content as CSV files for later use (see the sketch after this list)
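Here is one possible sketch of the extraction and saving steps. It assumes Llama 3.1 is served locally through Ollama's Python client and uses its JSON mode together with a hypothetical `Article` schema; the model setup, schema, and prompt used later in the tutorial may differ:

```python
import ollama
import pandas as pd
from pydantic import BaseModel


class Article(BaseModel):
    """Hypothetical schema describing the fields we want to extract."""
    title: str
    author: str
    published_at: str


def extract_article(markdown: str) -> Article:
    """Ask Llama 3.1 (via Ollama) to return the fields as JSON, then validate with Pydantic."""
    prompt = (
        "Extract the title, author and publication date from the page below. "
        "Respond with JSON only.\n\n" + markdown
    )
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # request JSON output from the model
    )
    return Article.model_validate_json(response["message"]["content"])


if __name__ == "__main__":
    # Illustrative Markdown; in the tutorial this comes from the Playwright step
    sample_markdown = "# Example Post\n\nBy Jane Doe, 2024-08-01\n\nBody text..."
    article = extract_article(sample_markdown)
    pd.DataFrame([article.model_dump()]).to_csv("articles.csv", index=False)
```

Validating the model's response against a Pydantic model gives you an early, explicit error when the output doesn't match the expected fields, which is exactly what you want before writing rows to a CSV.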