Web Scraping With BeautifulSoup: A Complete Tutorial
- BeautifulSoup is a Python library that parses HTML so you can pull out the data you want with CSS selectors or tag searches.
- Pair it with Requests to fetch pages and the html.parser built into Python, and you can scrape a static site in about 15 lines.
- I scraped all 1,000 books from books.toscrape.com with the code below to confirm every snippet runs.
- BeautifulSoup handles the HTML. It does not run JavaScript or get past blocks, so for heavy sites you add a browser or a scraper API.
BeautifulSoup is the first scraping library most Python developers learn, and it’s still the one I reach for on a quick job. This tutorial covers the whole loop: install it, fetch a page, find elements, walk pagination, and write a CSV. Every snippet here ran against books.toscrape.com, a sandbox built for practice, in June 2026.
What is BeautifulSoup?
BeautifulSoup is a Python library that turns raw HTML into a navigable tree you can search. You hand it the HTML of a page, then ask for elements by tag, class, id, or CSS selector, and it returns the matching nodes with their text and attributes. It doesn’t fetch pages on its own, so you pair it with a request library.
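To make the "navigable tree" idea concrete, here is a minimal sketch of the four lookup styles. The HTML string is made up for illustration and inlined so nothing is fetched; only the `bs4` calls are real.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inlined so the example runs without a network call.
html = """
<div id="promo">Summer sale</div>
<article class="product_pod">
  <h3><a href="book-1.html" title="Example Book">Example ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find("h3")                    # first <h3> element
by_class = soup.find(class_="price_color")  # first element with that class
by_id = soup.find(id="promo")               # the element with id="promo"
by_css = soup.select_one("article h3 a")    # first match for a CSS selector

print(by_tag.name)          # h3
print(by_class.get_text())  # £10.00
print(by_id.get_text())     # Summer sale
print(by_css["title"])      # Example Book
```

Each call returns a Tag object, so you can keep searching inside the result or read its text and attributes.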
The standard stack is three pieces:
| Piece | Job |
|---|---|
| Requests | Fetches the page over HTTP |
| BeautifulSoup | Parses the HTML and finds elements |
| html.parser | The parser engine, built into Python |
How do you install BeautifulSoup?
Install BeautifulSoup and Requests with pip in one command:
pip install beautifulsoup4 requests
beautifulsoup4 is the current package name. The import in your code is bs4. You don’t need a separate parser install for this tutorial because html.parser ships with Python, though many people add lxml for speed on large pages.
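If you do add lxml, one way to keep a script portable is to fall back to html.parser when lxml is missing. This is a sketch of that idea, not something the rest of the tutorial depends on; `pick_parser` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup


def pick_parser() -> str:
    """Return "lxml" when it is installed, otherwise the built-in parser."""
    try:
        import lxml  # noqa: F401  (optional dependency: pip install lxml)
    except ImportError:
        return "html.parser"
    return "lxml"


parser = pick_parser()
soup = BeautifulSoup("<p class='greeting'>hello</p>", parser)
print(parser, "->", soup.p.get_text())
```

The two parsers can repair broken markup slightly differently, so it is worth testing your selectors against whichever one the script ends up using.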
How do you scrape a page with BeautifulSoup?
Fetch the page with Requests, pass the HTML to BeautifulSoup, then select the elements you want. This script pulls the title, price, rating, and stock status for every book on one catalogue page:
import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/catalogue/page-1.html"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
books = []
for card in soup.select("article.product_pod"):
    books.append({
        "title": card.h3.a["title"],
        "price": card.select_one(".price_color").get_text(strip=True),
        "rating": card.select_one(".star-rating")["class"][1],
        "in_stock": "In stock" in card.select_one(".availability").get_text(),
    })
print(len(books), "books")
print(books[0])
When I ran it, it found all 20 books on the page:
20 books
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
A few things worth knowing from that code:
- soup.select(...) takes a CSS selector and returns every match. select_one(...) returns the first.
- card.h3.a["title"] reaches into a tag and reads an attribute. The book title lives in the link’s title attribute, not its text.
- The star rating is stored as a class name (star-rating Three), so I read the second class to get the word.
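The script keeps the price and rating as strings. If you want numbers for sorting or filtering, a small conversion step like this sketch would do it. The helper names are hypothetical, and it assumes the rating words run from One to Five as they do on books.toscrape.com.

```python
from decimal import Decimal

# Rating words used as class names on books.toscrape.com (assumed One..Five).
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word: str) -> int:
    """Map a rating class name such as "Three" to the number 3."""
    return RATING_WORDS[word]


def price_to_decimal(text: str) -> Decimal:
    """Drop the currency symbol and return the amount as a Decimal."""
    cleaned = "".join(ch for ch in text if ch in "0123456789.")
    return Decimal(cleaned)


# The first record printed by the script above.
book = {"title": "A Light in the Attic", "price": "£51.77",
        "rating": "Three", "in_stock": True}

print(rating_to_int(book["rating"]))    # 3
print(price_to_decimal(book["price"]))  # 51.77
```

Decimal avoids the rounding surprises of float for money, though float is fine if you only need rough comparisons.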
How do you find elements? select vs find
BeautifulSoup gives you two search styles, and they return the same nodes. CSS selectors are shorter for nested matches, while find reads more clearly for simple tag-plus-attribute lookups.
| Goal | CSS selector style | find style |
|---|---|---|
| First match | soup.select_one(".price_color") | soup.find(class_="price_color") |
| All matches | soup.select("article.product_pod") | soup.find_all("article", class_="product_pod") |
| By id | soup.select_one("#promo") | soup.find(id="promo") |
| Nested | soup.select(".product_pod h3 a") | chained .find().find() |
I default to select because one selector string replaces a chain of find calls, and it’s the same syntax you already know from CSS.
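To check that the two styles really land on the same elements, here is a small side-by-side sketch. The two-book HTML is invented for the example and mirrors the class names used on books.toscrape.com.

```python
from bs4 import BeautifulSoup

# Hypothetical two-book page, inlined so no request is needed.
html = """
<div id="promo">Promo</div>
<article class="product_pod"><h3><a title="Book A">A</a></h3><p class="price_color">£1.00</p></article>
<article class="product_pod"><h3><a title="Book B">B</a></h3><p class="price_color">£2.00</p></article>
"""
soup = BeautifulSoup(html, "html.parser")

# First match
first_css = soup.select_one(".price_color")
first_find = soup.find(class_="price_color")

# All matches
all_css = soup.select("article.product_pod")
all_find = soup.find_all("article", class_="product_pod")

# Nested: one selector string versus a chain of find calls per card
nested_css = [a["title"] for a in soup.select(".product_pod h3 a")]
nested_find = [card.find("h3").find("a")["title"] for card in all_find]

print(first_css.get_text(), first_find.get_text())  # £1.00 £1.00
print(len(all_css), len(all_find))                  # 2 2
print(nested_css == nested_find)                    # True
```

The nested case is where the selector string saves the most typing, since the find version needs a loop to do the same job.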
How do you handle pagination?
Follow the “next” link until it disappears. Most paginated sites put a next-page anchor at the bottom of each list, so you scrape a page, look for that link, and repeat:
import requests
from bs4 import BeautifulSoup
BASE = "https://books.toscrape.com/catalogue/"
url = BASE + "page-1.html"
all_books = []
while url:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    all_books += soup.select("article.product_pod")
    # The last page has no "next" link, which ends the loop.
    nxt = soup.select_one("li.next a")
    url = BASE + nxt["href"] if nxt else None
print("total books:", len(all_books))
This walked the full catalogue and returned every book:
total books: 1000
Add a short time.sleep(1) inside the loop on real sites so you don’t hammer the server, and the pattern scales to any “next button” pagination.
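Here is one way the loop might look with that delay added. This is a sketch rather than the version I ran for the numbers above: `next_page_url` and `crawl` are hypothetical helper names, and the session, timeout, and `urljoin` are my additions. `urljoin` resolves the relative href against the current page, so the same helper works when a site's next link isn't relative to a fixed base.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    link = soup.select_one("li.next a")
    return urljoin(current_url, link["href"]) if link else None


def crawl(start_url, delay=1.0):
    """Collect every product card, pausing `delay` seconds between requests."""
    cards = []
    url = start_url
    with requests.Session() as session:
        while url:
            resp = session.get(url, timeout=10)
            resp.raise_for_status()
            soup = BeautifulSoup(resp.text, "html.parser")
            cards += soup.select("article.product_pod")
            url = next_page_url(soup, url)
            if url:
                time.sleep(delay)  # be polite between page fetches
    return cards


# Offline check of the link-following logic on a made-up pager snippet.
page = BeautifulSoup('<ul><li class="next"><a href="page-2.html">next</a></li></ul>',
                     "html.parser")
print(next_page_url(page, "https://books.toscrape.com/catalogue/page-1.html"))
# https://books.toscrape.com/catalogue/page-2.html

# Usage (makes real HTTP requests):
# books = crawl("https://books.toscrape.com/catalogue/page-1.html")
```

Keeping the next-link logic in its own function means you can test it against saved HTML without touching the network.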
How do you save scraped data to CSV?
Use Python’s built-in csv module with DictWriter, which maps each dictionary to a row:
import csv
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating", "in_stock"])
    writer.writeheader()
    writer.writerows(books)
Set encoding="utf-8" so prices and special characters survive, and pass newline="" to stop Windows from inserting blank lines between rows.
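A quick way to confirm the file came out right is to read it back with DictReader. The sketch below does the round trip in memory with `io.StringIO` instead of a real file, and the second row is made up to show that commas inside a title are quoted correctly.

```python
import csv
import io

fieldnames = ["title", "price", "rating", "in_stock"]
books = [
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "in_stock": True},
    # Hypothetical row with a comma in the title.
    {"title": "Example, With Comma", "price": "£10.00", "rating": "One", "in_stock": False},
]

# Write to an in-memory buffer; newline="" plays the same role as in open().
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(books)

# Rewind and read the rows back as dictionaries.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(rows[0]["price"])     # £51.77
print(rows[1]["title"])     # Example, With Comma
print(rows[0]["in_stock"])  # True
```

Note that everything comes back as a string, including in_stock, so convert types again after loading if you need booleans or numbers.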
When BeautifulSoup is not enough
BeautifulSoup parses whatever HTML you give it, which means two limits show up fast on real targets. If a page builds its content with JavaScript after load, the data isn’t in the HTML that Requests downloads, so you render the page first with Playwright or Selenium and pass the result to BeautifulSoup. If a site blocks repeated requests, parsing isn’t the problem at all; staying unblocked is, and that’s where a scraper API like ChocoData earns its place by returning the page so your BeautifulSoup code keeps working unchanged. I cover that whole tradeoff in the web scraping guide.
For the parsing itself, though, BeautifulSoup will carry you a long way. Take the pagination script above, point it at a site you care about, and adjust the selectors until the fields come out clean.
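One rough way to tell whether a page needs a browser is to check whether your selector matches anything in the raw HTML at all. The sketch below uses a hypothetical helper, `selector_present`, on two made-up snippets: one with the data in the markup, and one that is only an empty shell a script would fill in later.

```python
from bs4 import BeautifulSoup


def selector_present(html: str, selector: str) -> bool:
    """Return True when the selector matches at least one element in html."""
    return BeautifulSoup(html, "html.parser").select_one(selector) is not None


# Hypothetical responses: static markup versus a JavaScript app shell.
static_html = '<article class="product_pod"><h3><a title="Book">Book</a></h3></article>'
js_shell_html = '<div id="root"></div><script src="app.js"></script>'

print(selector_present(static_html, "article.product_pod"))    # True
print(selector_present(js_shell_html, "article.product_pod"))  # False
```

A False result is a hint, not proof: the selector could simply be wrong, so compare against what the browser's "view source" shows before reaching for Playwright or Selenium.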
FAQ
Is BeautifulSoup good for web scraping?
Yes, for static HTML it's the most beginner-friendly option in Python. It parses messy markup well and pairs with Requests in a few lines. For JavaScript-rendered pages you need a browser tool like Playwright, and for large or blocked sites a scraper API.
What is the difference between BeautifulSoup and Scrapy?
BeautifulSoup is a parsing library you wire into your own script. Scrapy is a full framework with crawling, concurrency, and pipelines built in. Use BeautifulSoup for small jobs and quick scripts, and Scrapy when you're crawling thousands of pages and want structure.
Can BeautifulSoup scrape JavaScript-rendered pages?
No. BeautifulSoup only parses the HTML you give it. If a page loads its data with JavaScript after the initial response, that data won't be in the HTML BeautifulSoup sees. Use Playwright or Selenium to render the page first, then pass the rendered HTML to BeautifulSoup.