XPath vs CSS Selectors for Web Scraping: A Practical Guide

Marcus Reed
Founder & lead tester
the short version
  • CSS selectors are shorter and cover most scraping jobs. XPath is more powerful when you need to match on text, walk to a parent, or use complex conditions.
  • In Python, lxml handles both: .cssselect() for CSS and .xpath() for XPath, on the same parsed document.
  • I ran every selector below against quotes.toscrape.com so the matches and counts are real.
  • Learn CSS first. Reach for XPath the moment a job needs text(), contains(), or a step back up the tree.

Every scraper eventually argues with itself about XPath versus CSS selectors. I use both, and the choice is usually obvious once you know what each one does well. This guide shows the same extractions written both ways, run against quotes.toscrape.com with Python’s lxml in June 2026, so you can see exactly where each wins.

XPath vs CSS selectors: the short answer

CSS selectors handle the common cases in fewer characters, and XPath handles the cases CSS can’t reach. Here’s the split I use day to day:

Job | Best tool | Why
Match by class or id | CSS | Shortest, most readable
Match nested elements | CSS | .quote .author reads clearly
Match by text content | XPath | CSS can’t match on text
Walk to a parent or ancestor | XPath | CSS only goes down the tree
Match Nth element with a condition | XPath | Richer predicates
Grab an attribute value | Either | Both do it cleanly

Reach for CSS by default. Switch to XPath when you need text matching or you need to move back up the document.

How do you use selectors in Python?

Parse the page once with lxml, then query the same document with either selector style. Install lxml and cssselect first, plus requests to fetch the page:

pip install lxml cssselect requests

cssselect is what lets lxml accept CSS selectors. With both installed, one parsed document answers both kinds of query:

import requests
from lxml import html

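# a timeout keeps a stalled connection from hanging the script
resp = requests.get("https://quotes.toscrape.com", headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
resp.raise_for_status()  # stop early on a 4xx/5xx instead of parsing an error page
doc = html.fromstring(resp.content)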

# XPath
authors_xpath = doc.xpath('//div[@class="quote"]/span/small[@class="author"]/text()')

# CSS selector (compiled to XPath under the hood)
quotes_css = doc.cssselect("div.quote span.text")

print(len(authors_xpath), "authors via XPath")
print(len(quotes_css), "quotes via CSS")

That returned ten of each, which is the count on page one:

10 authors via XPath
10 quotes via CSS

Note that html.fromstring() gets resp.content, the raw bytes, rather than resp.text. That lets lxml pick up the encoding the page declares instead of depending on how requests decoded the response.

Selecting by class and id

For class and id matches, CSS is shorter and reads the way you think. XPath needs the longer attribute-predicate form:

Target | CSS | XPath
Class | div.quote | //div[@class="quote"]
Id | #promo | //*[@id="promo"]
Descendant | div.quote span.text | //div[@class="quote"]//span[@class="text"]
Direct child | ul > li | //ul/li
Attribute | a[href] | //a[@href]

One catch with XPath: @class="quote" matches only when the class attribute is exactly quote. If an element has class="quote featured", that test fails. CSS .quote matches either way, which is another reason I default to CSS.

Selecting by text: where XPath wins

XPath can match an element by what it says, and CSS cannot. This is the capability that earns XPath a place in every scraper. To find the “Next” pagination link by its text:

# XPath: match the link by its visible text
next_link = doc.xpath('//a[contains(text(), "Next")]/@href')
print(next_link)

That prints the link to page two:

['/page/2/']

contains(text(), "Next") finds the anchor whose text includes “Next” and /@href reads its link. No standard CSS selector does this. When a page gives you nothing stable to grab except the words on a button, XPath is the way through.
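
One subtlety worth knowing: inside contains(), text() is reduced to the element's first direct text node, so words wrapped in a child tag are not seen. The sketch below, on made-up links, compares that form with contains(., ...), which tests the element's full text, and with normalize-space(.) for an exact match that ignores padding:

```python
from lxml import html

# Made-up pagination links for illustration
SAMPLE = """
<html><body><ul>
  <li><a href="/page/2/">Next <span>&rarr;</span></a></li>
  <li><a href="/chapter/2/"><b>Next</b> chapter</a></li>
  <li><a href="/page/0/">  Previous  </a></li>
</ul></body></html>
"""
doc = html.fromstring(SAMPLE)

# text(): only the first direct text node of each <a> is tested
direct = doc.xpath('//a[contains(text(), "Next")]/@href')

# . (dot): the full string value, including text inside child tags
anywhere = doc.xpath('//a[contains(., "Next")]/@href')

# normalize-space(.): exact match after trimming and collapsing whitespace
exact = doc.xpath('//a[normalize-space(.) = "Previous"]/@href')

print(direct)    # ['/page/2/']
print(anywhere)  # ['/page/2/', '/chapter/2/']
print(exact)     # ['/page/0/']
```

The second link is missed by the text() form because its "Next" sits inside a <b> tag.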

Walking up the tree

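XPath can step from a matched element back to its parent or an ancestor, which CSS can’t do. The classic case is finding a price and needing the product container around it. quotes.toscrape.com has no prices, so here the same move starts from each quote’s text and climbs to the quote box that holds it:

# from each quote's text span, climb to the enclosing quote box
container = doc.xpath('//span[@class="text"]/parent::div[@class="quote"]')
print(len(container), "containers reached via parent axis")

That reaches all ten quote boxes on page one:

10 containers reached via parent axis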

The parent:: and ancestor:: axes let you anchor on the one reliable element on a page, then navigate to the messy parts around it. CSS combinators only move down the tree or sideways to later siblings, so I treat this pattern as an XPath job.
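
parent:: only climbs one level. When the container sits several levels up, ancestor:: is the axis to use. A minimal sketch on made-up product markup (the class names and data-sku attribute are assumptions for illustration):

```python
from lxml import html

# Made-up product markup; class names and data-sku are assumptions
SAMPLE = """
<html><body>
  <div class="product" data-sku="A1">
    <div class="info"><p><span class="price">$9.99</span></p></div>
  </div>
  <div class="product" data-sku="B2">
    <div class="info"><p><span class="price">$4.50</span></p></div>
  </div>
</body></html>
"""
doc = html.fromstring(SAMPLE)

# parent:: moves up exactly one level: the <p> around each price
parents = doc.xpath('//span[@class="price"]/parent::*')
print([p.tag for p in parents])  # ['p', 'p']

# ancestor:: climbs any number of levels to the product container
containers = doc.xpath('//span[@class="price"]/ancestor::div[@class="product"]')
skus = [c.get("data-sku") for c in containers]
print(skus)  # ['A1', 'B2']

# The same climb from an element you already hold; [1] is the nearest match
first_price = doc.xpath('//span[@class="price"]')[0]
nearest_div = first_price.xpath("ancestor::div[1]")[0]
print(nearest_div.get("class"))  # info
```

Because ancestor:: is a reverse axis, position [1] means the closest ancestor, not the outermost one.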

Which should you learn first?

Learn CSS selectors first, because they cover most of what you’ll scrape and you may already know them from styling pages. Add XPath the first time a job needs text matching, a parent hop, or a condition CSS can’t express. Most of my scrapers use CSS for the easy 90% and a few XPath expressions for the parts that fight back.

Once your selectors are solid, the next wall is usually a site that blocks you rather than one with tricky markup. That’s a different problem, covered in the main web scraping guide, and it’s where a scraper API like ChocoData takes over the fetching while your selectors keep doing the parsing.

FAQ

Is XPath or CSS faster for web scraping?

In practice the speed difference is negligible for scraping. cssselect translates each CSS selector into XPath, and lxml evaluates that XPath in libxml2’s C code, so both styles end up on the same engine. Choose on readability and capability, not speed: CSS for simple class and id matches, XPath when you need text matching or axis navigation.

Can CSS selectors select by text?

Standard CSS selectors cannot match an element by its text content. XPath can, with contains(text(), '...') for a substring or normalize-space(.) = '...' for an exact match that ignores surrounding whitespace. This is the single most common reason scrapers reach for XPath.

Do I need lxml to use XPath in Python?

In practice, yes. lxml is the standard choice for XPath and supports the full XPath 1.0 syntax. BeautifulSoup on its own does not support XPath; it uses its own search methods and CSS selectors. If you already use BeautifulSoup, parse the same HTML with lxml for the queries that need XPath.

Marcus Reed
I've built and run web scrapers for the better part of a decade. On this site I put scraper APIs and scraping tools through real jobs against real targets, then write up what actually holds up.