XPath vs CSS Selectors for Web Scraping: A Practical Guide
- CSS selectors are shorter and cover most scraping jobs. XPath is more powerful when you need to match on text, walk to a parent, or use complex conditions.
- In Python, lxml handles both: .cssselect() for CSS and .xpath() for XPath, on the same parsed document.
- I ran every selector below against quotes.toscrape.com so the matches and counts are real.
- Learn CSS first. Reach for XPath the moment a job needs text(), contains(), or a step back up the tree.
Every scraper eventually argues with itself about XPath versus CSS selectors. I use both, and the choice is usually obvious once you know what each one does well. This guide shows the same extractions written both ways, run against quotes.toscrape.com with Python’s lxml in June 2026, so you can see exactly where each wins.
XPath vs CSS selectors: the short answer
CSS selectors handle the common cases in fewer characters, and XPath handles the cases CSS can’t reach. Here’s the split I use day to day:
| Job | Best tool | Why |
|---|---|---|
| Match by class or id | CSS | Shortest, most readable |
| Match nested elements | CSS | .quote .author reads clearly |
| Match by text content | XPath | CSS can’t match on text |
| Walk to a parent or ancestor | XPath | CSS only goes down the tree |
| Match Nth element with a condition | XPath | Richer predicates |
| Grab an attribute value | Either | Both do it cleanly |
Reach for CSS by default. Switch to XPath when you need text matching or you need to move back up the document.
How do you use selectors in Python?
Parse the page once with lxml, then query the same document with either selector style. Install lxml and cssselect first, plus requests to fetch the page:
pip install lxml cssselect requests
cssselect is what lets lxml accept CSS selectors. With both installed, one parsed document answers both kinds of query:
import requests
from lxml import html
resp = requests.get("https://quotes.toscrape.com", headers={"User-Agent": "Mozilla/5.0"})
doc = html.fromstring(resp.content)
# XPath
authors_xpath = doc.xpath('//div[@class="quote"]/span/small[@class="author"]/text()')
# CSS selector (compiled to XPath under the hood)
quotes_css = doc.cssselect("div.quote span.text")
print(len(authors_xpath), "authors via XPath")
print(len(quotes_css), "quotes via CSS")
That returned ten of each, which is the count on page one:
10 authors via XPath
10 quotes via CSS
Note that html.fromstring(resp.content) is passed .content, the raw bytes, so lxml can read the encoding the page declares instead of relying on a guess.
Selecting by class and id
For class and id matches, CSS is shorter and reads the way you think. XPath needs the longer attribute-predicate form:
| Target | CSS | XPath |
|---|---|---|
| Class | div.quote | //div[@class="quote"] |
| Id | #promo | //*[@id="promo"] |
| Descendant | div.quote span.text | //div[@class="quote"]//span[@class="text"] |
| Direct child | ul > li | //ul/li |
| Attribute | a[href] | //a[@href] |
One catch with XPath: @class="quote" matches only when the class attribute is exactly quote. If an element has class="quote featured", that test fails. CSS .quote matches either way, which is another reason I default to CSS.
Selecting by text: where XPath wins
XPath can match an element by what it says, and CSS cannot. This is the capability that earns XPath a place in every scraper. To find the “Next” pagination link by its text:
# XPath: match the link by its visible text
next_link = doc.xpath('//a[contains(text(), "Next")]/@href')
print(next_link)
['/page/2/']
contains(text(), "Next") finds the anchor whose own text includes “Next”, and /@href reads its link. No standard CSS selector does this. When a page gives you nothing stable to grab except the words on a button, XPath is the way through.
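One detail worth knowing: text() only sees the anchor's own text nodes, while "." covers everything inside it, child elements included. The sketch below uses an invented pager snippet (the second link is hypothetical) to show how the two differ, plus an exact match on the normalized text.

```python
from lxml import html

# Hypothetical pager: the second link wraps the word "Next" in a child element
SAMPLE = """
<ul class="pager">
  <li class="next"><a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a></li>
  <li><a href="/more/"><b>Next</b> steps</a></li>
</ul>
"""
doc = html.fromstring(SAMPLE)

# text(): tests the anchor's own first text node only
own_text = doc.xpath('//a[contains(text(), "Next")]/@href')

# ".": tests the full string value, including text inside child elements
full_text = doc.xpath('//a[contains(., "Next")]/@href')

# Exact match after trimming whitespace, stricter than contains()
exact = doc.xpath('//a[normalize-space(text())="Next"]/@href')

print(own_text)   # ['/page/2/']
print(full_text)  # ['/page/2/', '/more/']
print(exact)      # ['/page/2/']
```

Which form you want depends on the markup. I would start with the strictest one that still matches, since a loose contains() tends to pick up unrelated links as a site grows.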
Walking up the tree
XPath can step from a matched element back to its parent or an ancestor, which CSS can’t do. If you find a quote's text and need the quote container around it:
# from a quote's text span, climb to the enclosing quote box
container = doc.xpath('//span[@class="text"]/parent::div[@class="quote"]')
print(len(container), "containers reached via parent axis")
10 containers reached via parent axis
The parent:: and ancestor:: axes let you anchor on the one reliable element on a page, then navigate to the messy parts around it. CSS only ever moves downward, so this pattern is XPath-only.
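When the anchor element sits more than one level deep, parent:: is not enough and ancestor:: does the climbing. The following sketch assumes a simplified quote layout with invented authors and tags. It anchors on an author name, climbs to the quote box, then queries back down inside it.

```python
from lxml import html

# Hypothetical markup: the author sits inside a span, two levels below the quote box
SAMPLE = """
<div class="col-md-8">
  <div class="quote">
    <span class="text">A quote.</span>
    <span>by <small class="author">Jane Doe</small></span>
    <div class="tags"><a class="tag" href="/tag/x/">x</a></div>
  </div>
  <div class="quote">
    <span class="text">Another quote.</span>
    <span>by <small class="author">John Roe</small></span>
    <div class="tags"><a class="tag" href="/tag/y/">y</a></div>
  </div>
</div>
"""
doc = html.fromstring(SAMPLE)

# Anchor on the author name, then climb to the enclosing quote box
box = doc.xpath(
    '//small[@class="author"][text()="John Roe"]/ancestor::div[@class="quote"]'
)[0]

# Relative queries (leading ".") stay inside that one container
text = box.xpath('./span[@class="text"]/text()')[0]
tags = box.xpath('.//a[@class="tag"]/text()')

print(text, tags)
```

Once you hold the container element, you can switch back to CSS for the downward queries, for example box.cssselect("a.tag").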
Which should you learn first?
Learn CSS selectors first, because they cover most of what you’ll scrape and you may already know them from styling pages. Add XPath the first time a job needs text matching, a parent hop, or a condition CSS can’t express. Most of my scrapers use CSS for the easy 90% and a few XPath expressions for the parts that fight back.
Once your selectors are solid, the next wall is usually a site that blocks you rather than one with tricky markup. That’s a different problem, covered in the main web scraping guide, and it’s where a scraper API like ChocoData takes over the fetching while your selectors keep doing the parsing.
FAQ
In practice the speed difference between XPath and CSS selectors is negligible for scraping. lxml translates CSS selectors to XPath and evaluates both in compiled C code. Choose on readability and capability, not speed: CSS for simple class and id matches, XPath when you need text matching or axis navigation.
Standard CSS selectors cannot match an element by its text content. XPath can, with contains(text(), '...') or matching the normalized string. This is the single most common reason scrapers reach for XPath.
For XPath in Python, lxml is the standard choice and supports the full XPath 1.0 syntax. BeautifulSoup on its own does not support XPath; it uses its own search methods and CSS selectors. Many scrapers parse with lxml when they want XPath.