
Web Scraping With BeautifulSoup: A Complete Tutorial

Marcus Reed
Founder & lead tester
the short version
  • BeautifulSoup is a Python library that parses HTML so you can pull out the data you want with CSS selectors or tag searches.
  • Pair it with Requests to fetch pages and the html.parser built into Python, and you can scrape a static site in about 15 lines.
  • I scraped all 1,000 books from books.toscrape.com with the code below to confirm every snippet runs.
  • BeautifulSoup handles the HTML. It does not run JavaScript or get past blocks, so for heavy sites you add a browser or a scraper API.

BeautifulSoup is the first scraping library most Python developers learn, and it’s still the one I reach for on a quick job. This tutorial covers the whole loop: install it, fetch a page, find elements, walk pagination, and write a CSV. Every snippet here ran against books.toscrape.com, a sandbox built for practice, in June 2026.

What is BeautifulSoup?

BeautifulSoup is a Python library that turns raw HTML into a navigable tree you can search. You hand it the HTML of a page, then ask for elements by tag, class, id, or CSS selector, and it returns the matching nodes with their text and attributes. It doesn’t fetch pages on its own, so you pair it with a request library.
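To make the "hand it HTML, then ask for elements" idea concrete, here is a minimal sketch that parses a hard-coded snippet instead of a live page. The markup, the promo id, and the variable names are invented for illustration; only the BeautifulSoup calls are real.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
html = """
<div id="promo">
  <h2 class="headline">Summer sale</h2>
  <a class="cta" href="/deals">See deals</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("h2")                   # search by tag name
by_class = soup.find(class_="cta")         # search by class
by_id = soup.find(id="promo")              # search by id
by_css = soup.select_one("#promo a.cta")   # search by CSS selector

headline_text = by_tag.get_text(strip=True)  # text inside the node
link_target = by_class["href"]               # attribute lookup, dict-style

print(headline_text, link_target)
```

Each search returns a node (or None when nothing matches), and the same node exposes both its text and its attributes.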

The standard stack is three pieces:

| Piece | Job |
| --- | --- |
| Requests | Fetches the page over HTTP |
| BeautifulSoup | Parses the HTML and finds elements |
| html.parser | The parser engine, built into Python |

How do you install BeautifulSoup?

Install BeautifulSoup and Requests with pip in one command:

pip install beautifulsoup4 requests

beautifulsoup4 is the current package name. The import in your code is bs4. You don’t need a separate parser install for this tutorial because html.parser ships with Python, though many people add lxml for speed on large pages.
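If you want lxml's speed where it is available without making it a hard requirement, one option is to fall back to html.parser when lxml is missing. This is a sketch under that assumption; make_soup is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html: str) -> BeautifulSoup:
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


# Quick check that the install works: the package is beautifulsoup4, the import is bs4.
soup = make_soup("<p class='msg'>installed and working</p>")
message = soup.select_one("p.msg").get_text()
print(message)
```

The two parsers can repair badly broken markup differently, so it is worth testing your selectors with whichever parser you deploy.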

How do you scrape a page with BeautifulSoup?

Fetch the page with Requests, pass the HTML to BeautifulSoup, then select the elements you want. This script pulls the title, price, rating, and stock status for every book on one catalogue page:

import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/catalogue/page-1.html"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

books = []
for card in soup.select("article.product_pod"):
    books.append({
        "title": card.h3.a["title"],
        "price": card.select_one(".price_color").get_text(strip=True),
        "rating": card.select_one(".star-rating")["class"][1],
        "in_stock": "In stock" in card.select_one(".availability").get_text(),
    })

print(len(books), "books")
print(books[0])

When I ran it, it found all 20 books on the page:

20 books
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}

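A few things worth knowing from that code:

  • card.h3.a["title"] reads the link’s title attribute rather than its visible text, because the catalogue shortens long titles in the visible link text.
  • A tag’s class attribute comes back as a list, so ["class"][1] picks the second class on the star-rating element, which is the rating word ('Three' in the output above).
  • get_text(strip=True) trims the surrounding whitespace from the extracted text.
  • resp.raise_for_status() raises an exception on a 4xx or 5xx response, so the script fails loudly rather than parsing an error page.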
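The script stores price and rating as strings. If you want numbers instead, a small conversion step can follow the extraction. The sketch below runs on an inline card that is a simplified stand-in for the site's markup, so it needs no network; RATING_WORDS and the variable names are my own.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one book card (assumption: trimmed to the fields used here).
card_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">
      In stock
  </p>
</article>
"""

# Hypothetical lookup table for turning the rating word into a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

card = BeautifulSoup(card_html, "html.parser").select_one("article.product_pod")

visible_text = card.h3.a.get_text(strip=True)  # shortened text shown on the page
full_title = card.h3.a["title"]                # full title lives in the attribute

classes = card.select_one(".star-rating")["class"]  # multi-valued attribute: a list
rating = RATING_WORDS[classes[1]]

price_text = card.select_one(".price_color").get_text(strip=True)
price = float(price_text.lstrip("£"))          # drop the currency sign, keep the number

print(full_title, rating, price)
```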

How do you find elements? select vs find

BeautifulSoup gives you two search styles, and for equivalent queries they return the same nodes. CSS selectors are shorter for nested matches, while find reads more clearly for simple tag-plus-attribute lookups.

| Goal | CSS selector style | find style |
| --- | --- | --- |
| First match | soup.select_one(".price_color") | soup.find(class_="price_color") |
| All matches | soup.select("article.product_pod") | soup.find_all("article", class_="product_pod") |
| By id | soup.select_one("#promo") | soup.find(id="promo") |
| Nested | soup.select(".pod h3 a") | chained .find().find() |

I default to select because one selector string replaces a chain of find calls, and it’s the same syntax you already know from CSS.
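As a quick side-by-side, this sketch runs both styles against the same inline HTML and compares the results. The two-book markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical two-card page, shaped like a tiny catalogue.
html = """
<section>
  <article class="product_pod">
    <h3><a href="/a">Book A</a></h3><p class="price_color">£10.00</p>
  </article>
  <article class="product_pod">
    <h3><a href="/b">Book B</a></h3><p class="price_color">£12.50</p>
  </article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# All matches: one selector string vs. tag name plus class_ keyword.
css_cards = soup.select("article.product_pod")
find_cards = soup.find_all("article", class_="product_pod")

# First match: both return a single node, or None if nothing matches.
css_price = soup.select_one(".price_color").get_text()
find_price = soup.find(class_="price_color").get_text()

# Nested lookup: one selector vs. a chain of find calls per card.
css_links = [a["href"] for a in soup.select("article.product_pod h3 a")]
find_links = [card.find("h3").find("a")["href"] for card in find_cards]

print(css_links == find_links, css_price == find_price)
```

The nested case is where the selector style saves the most typing, since the find version needs a loop plus a chain.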

How do you handle pagination?

Follow the “next” link until it disappears. Most paginated sites put a next-page anchor at the bottom of each list, so you scrape a page, look for that link, and repeat:

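import requests
from bs4 import BeautifulSoup

BASE = "https://books.toscrape.com/catalogue/"
url = BASE + "page-1.html"
all_books = []

while url:
    resp = requests.get(url)
    resp.raise_for_status()  # stop on an HTTP error instead of parsing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    all_books += soup.select("article.product_pod")  # collect this page's book cards
    # The next link is relative (e.g. "page-2.html"), so prefix it with BASE.
    nxt = soup.select_one("li.next a")
    url = BASE + nxt["href"] if nxt else None  # no next link means this was the last page

print("total books:", len(all_books))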

This walked the full catalogue and returned every book:

total books: 1000

On real sites, import time and add a short time.sleep(1) inside the loop so you don’t hammer the server. With that in place, the pattern scales to any “next button” pagination.
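One way to package the loop for reuse is a generator that pauses between pages and resolves relative links with urljoin. This is a sketch, not the code I ran for the numbers above: crawl, fetch_with_requests, the delay default, and the max_pages safety cap are all names and choices I am assuming here. Taking the fetcher as a parameter keeps the loop testable without touching the network.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl(start_url, fetch, delay=1.0, max_pages=100):
    """Yield (url, soup) for each page, following the "next" link.

    fetch is any callable that takes a URL and returns HTML text.
    """
    url = start_url
    pages = 0
    while url and pages < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        yield url, soup
        pages += 1
        nxt = soup.select_one("li.next a")
        # urljoin resolves a relative href against the page it was found on.
        url = urljoin(url, nxt["href"]) if nxt else None
        if url:
            time.sleep(delay)  # pause before the next request


def fetch_with_requests(url):
    """Network fetcher; Requests is imported here so crawl itself does not need it."""
    import requests

    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return resp.text
```

Usage would look like for url, soup in crawl("https://books.toscrape.com/catalogue/page-1.html", fetch_with_requests): followed by whatever extraction you need on each soup.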

How do you save scraped data to CSV?

Use Python’s built-in csv module with DictWriter, which maps each dictionary to a row. Here books is the list of dictionaries built by the single-page script above:

import csv

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating", "in_stock"])
    writer.writeheader()
    writer.writerows(books)

Set encoding="utf-8" so the £ sign in prices and other non-ASCII characters survive, and pass newline="" to stop Windows from inserting blank lines between rows.
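One thing to keep in mind when you read the file back: CSV has no types, so every value returns as a string. The sketch below round-trips one row through an in-memory buffer (io.StringIO standing in for the file) to show that; the row is the one printed earlier in this tutorial.

```python
import csv
import io

rows = [
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "in_stock": True},
]
fieldnames = ["title", "price", "rating", "in_stock"]

buffer = io.StringIO(newline="")  # in-memory stand-in for the opened file
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)  # rewind so the reader starts at the header row
loaded = list(csv.DictReader(buffer))

# The boolean was written as the text "True", so convert it back explicitly.
in_stock = loaded[0]["in_stock"] == "True"

print(loaded[0])
```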

When BeautifulSoup is not enough

BeautifulSoup parses whatever HTML you give it, which means two limits show up fast on real targets. If a page builds its content with JavaScript after load, the data isn’t in the HTML that Requests downloads, so you render the page first with Playwright or Selenium and pass the result to BeautifulSoup. If a site blocks repeated requests, parsing isn’t the problem at all; staying unblocked is, and that’s where a scraper API like ChocoData earns its place by returning the page so your BeautifulSoup code keeps working unchanged. I cover that whole tradeoff in the web scraping guide.
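For the JavaScript case, the render-then-parse handoff can be sketched with Playwright's sync API. This assumes you have run pip install playwright and playwright install chromium. The helper names, the networkidle wait, and the selector you pass in are my assumptions; a real target may need a more specific wait.

```python
from bs4 import BeautifulSoup


def render_html(url: str) -> str:
    """Load the page in headless Chromium and return the HTML after scripts have run."""
    # Imported lazily so extract_text below works even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # assumption: idle network means content is ready
        html = page.content()                      # serialized DOM, including JS-built nodes
        browser.close()
    return html


def extract_text(html: str, selector: str) -> list[str]:
    """Parsing step: identical whether the HTML came from Requests or a browser."""
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]
```

The design point is that only the fetching half changes. A call like extract_text(render_html(url), ".item"), with a URL and selector of your own, reuses the same parsing code you would write for a static page.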

For the parsing itself, though, BeautifulSoup will carry you a long way. Take the pagination script above, point it at a site you care about, and adjust the selectors until the fields come out clean.

FAQ

Is BeautifulSoup good for web scraping?

Yes, for static HTML it's the most beginner-friendly option in Python. It parses messy markup well and pairs with Requests in a few lines. For JavaScript-rendered pages you need a browser tool like Playwright, and for large or blocked sites a scraper API.

What's the difference between BeautifulSoup and Scrapy?

BeautifulSoup is a parsing library you wire into your own script. Scrapy is a full framework with crawling, concurrency, and pipelines built in. Use BeautifulSoup for small jobs and quick scripts, Scrapy when you're crawling thousands of pages and want structure.

Does BeautifulSoup execute JavaScript?

No. BeautifulSoup only parses the HTML you give it. If a page loads its data with JavaScript after the initial response, that data won't be in the HTML BeautifulSoup sees. Use Playwright or Selenium to render the page first, then pass the rendered HTML to BeautifulSoup.

Marcus Reed
I've built and run web scrapers for the better part of a decade. On this site I put scraper APIs and scraping tools through real jobs against real targets, then write up what actually holds up.