Scrape Web Content with LLM

Use Llama 3.1 to get structured data from web pages

Getting structured data from web pages can be a tedious task. You often need to:

  • Understand the structure of the page
  • Write code to extract the data (usually selectors)
  • Redo pretty much everything once the page inevitably changes

And that's before dealing with anti-scraping techniques like CAPTCHAs and rate limiting, which we won't cover in this guide.

Luckily, you can get pretty far using a library like Playwright [1] and an open model like Llama 3.1. Our plan of attack involves two steps:

  • Get the content of a web page as markdown
  • Use Llama 3.1 to extract structured data from the markdown content
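
The two steps can be sketched roughly as follows. This is a minimal outline rather than the tutorial's actual code: the function names (`fetch_html`, `html_to_markdown`, `build_prompt`, `scrape`) and the prompt wording are hypothetical, and the model call is left as an injected `llm` callable because how Llama 3.1 is served (local or hosted) depends on your setup.

```python
from typing import Callable


def fetch_html(url: str) -> str:
    """Render the page in headless Chromium and return the resulting HTML."""
    # Imported here so the pure helpers below work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait for network activity to settle so JavaScript-rendered
            # content is present in the HTML we read back.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


def html_to_markdown(html: str) -> str:
    """Convert HTML to Markdown with html2text."""
    import html2text

    converter = html2text.HTML2Text()
    converter.ignore_images = True  # images rarely help with data extraction
    converter.body_width = 0        # disable hard line wrapping
    return converter.handle(html)


def build_prompt(markdown: str, fields: list[str]) -> str:
    """Build a prompt asking for a JSON array with the requested keys."""
    # The wording is an assumption; tune it for your model and page.
    field_list = ", ".join(fields)
    return (
        "Extract every item from the page content below.\n"
        f"Return only a JSON array of objects with these keys: {field_list}.\n\n"
        f"Page content:\n{markdown}"
    )


def scrape(url: str, fields: list[str], llm: Callable[[str], str]) -> str:
    """Run both steps; `llm` maps a prompt string to the model's reply."""
    markdown = html_to_markdown(fetch_html(url))
    return llm(build_prompt(markdown, fields))
```

Passing the model in as a plain function keeps the scraping code independent of whichever client you end up using to call Llama 3.1.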

Tutorial Goals

In this tutorial, you will:

  • Set up Playwright to get web page content (including JavaScript)
  • Use html2text to convert HTML to Markdown
  • Use Llama 3.1 with Pydantic models to get structured (custom) output
  • Save extracted content as CSV files for later use
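
To make the last two goals concrete, here is a small sketch of validating a model reply against a Pydantic schema and writing the result as CSV. It assumes Pydantic v2; the `Product` and `ProductList` models, their fields, and the sample `reply` string are invented for illustration and stand in for whatever your page and model actually produce.

```python
import csv
import io
from typing import List

from pydantic import BaseModel


class Product(BaseModel):
    # Hypothetical schema; replace the fields with whatever the page lists.
    name: str
    price: float


class ProductList(BaseModel):
    products: List[Product]


def parse_reply(reply: str) -> ProductList:
    """Validate the model's JSON reply.

    Raises pydantic.ValidationError if the reply does not match the schema.
    """
    return ProductList.model_validate_json(reply)


def write_csv(items: List[Product], out) -> None:
    """Write a header plus one CSV row per product to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=list(Product.model_fields))
    writer.writeheader()
    for item in items:
        writer.writerow(item.model_dump())


# Made-up stand-in for a real model reply.
reply = '{"products": [{"name": "Pixel 8", "price": 499.0}]}'
parsed = parse_reply(reply)

buffer = io.StringIO()
write_csv(parsed.products, buffer)
```

To save to disk instead of an in-memory buffer, open the target with `open("products.csv", "w", newline="")` and pass the file object as `out`.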

Setup

References

Footnotes

  1. Playwright