Web Scraping With Scrapy: A Complete Beginner Tutorial
- Scrapy is a Python framework for scraping at scale. It handles requests, concurrency, retries, and data export so you write only the extraction logic.
- A spider defines where to start, how to parse a page, and how to follow links. The whole thing fits in about 15 lines.
- I ran the spider below against quotes.toscrape.com and it scraped all 100 quotes across 10 pages into JSON.
- Use Scrapy when you're crawling many pages. For a quick one-page pull, BeautifulSoup is less setup.
Scrapy is what you graduate to when a one-file script stops being enough. It runs requests concurrently, retries failures, respects delays, and exports your data, all from a small spider class you write. This tutorial builds a working spider from scratch and runs it against quotes.toscrape.com, confirmed in June 2026.
What is Scrapy?
Scrapy is a Python framework for crawling and scraping websites at scale. Where BeautifulSoup parses one page you fetched, Scrapy manages the whole operation: it queues requests, runs many in parallel, follows links, retries on errors, and writes the results to a file. You supply the extraction logic and the crawl rules; the framework runs the machine around them.
That structure is worth the setup cost when you’re crawling many pages and overkill when you’re grabbing one. Here’s how the two compare:
| | BeautifulSoup | Scrapy |
|---|---|---|
| Type | Parsing library | Full framework |
| Best for | One page, quick scripts | Crawling many pages |
| Concurrency | You add it | Built in |
| Retries and delays | You add them | Built in |
| Data export | You write it | Built in (-o) |
| Setup | Minimal | A spider class |
How do you install Scrapy?
Install Scrapy with pip, ideally inside a virtual environment because it pulls in several dependencies:
pip install scrapy
Confirm it installed by checking the version:
scrapy version
Scrapy works on Windows, macOS, and Linux. It depends on the Twisted networking library on every platform, and pip installs that automatically on current versions, Windows included.
How do you write a Scrapy spider?
A spider is a Python class that defines a start URL, a parse method to extract data, and rules for following links. Save this as quotes_spider.py:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {"USER_AGENT": "my-project/1.0", "DOWNLOAD_DELAY": 1}

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
What each piece does:
- name identifies the spider when you run it.
- start_urls is where the crawl begins.
- parse runs on every downloaded page.
- Yielding a dict emits a scraped item.
- response.css("span.text::text").get() reads text; ::attr(href) reads an attribute; .getall() returns a list (used for the multiple tags).
- response.follow(next_page, ...) queues the next page through the same parse method, which is how pagination works in Scrapy.
- DOWNLOAD_DELAY: 1 waits a second between requests so you’re polite by default.
How do you run the spider and export data?
Run a single-file spider with scrapy runspider and use -o to export the scraped items. The file format follows the extension:
scrapy runspider quotes_spider.py -o quotes.json
When I ran it, Scrapy crawled all ten pages and wrote 100 items. Reading the output back confirmed it:
scraped items: 100
first author: Albert Einstein
first tags: ['change', 'deep-thoughts', 'thinking', 'world']
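The read-back script itself isn't shown, so here is a minimal sketch of what such a check could look like. It assumes the export is the quotes.json file from the command above; summarize is a hypothetical helper name, not something Scrapy provides.

```python
import json
from pathlib import Path


def summarize(path):
    """Load a Scrapy JSON export and return a few sanity-check values.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    first = items[0] if items else {}
    return {
        "count": len(items),
        "first_author": first.get("author"),
        "first_tags": first.get("tags", []),
    }


if __name__ == "__main__":
    # Assumes quotes.json was written by the runspider command above.
    if Path("quotes.json").exists():
        summary = summarize("quotes.json")
        print("scraped items:", summary["count"])
        print("first author:", summary["first_author"])
        print("first tags:", summary["first_tags"])
```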
Swap the extension to export differently: -o quotes.csv for CSV, or -o quotes.jsonl for line-delimited JSON that’s better for large crawls. The -o flag appends to an existing file, and a second run appended to a .json export leaves it as invalid JSON, so delete the file or use -O (capital) to overwrite between runs.
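To see why appending matters, here is a small simulation using only the standard library. The two strings are stand-ins I wrote by hand for what a file looks like after two appended runs, not real Scrapy output: a .json export ends up with two arrays back to back, while a .jsonl export just gains more lines.

```python
import json

# Simulated .json export after two appended runs: each run writes its own
# JSON array, so the file holds two arrays one after the other.
appended_json = '[{"author": "A"}]\n[{"author": "B"}]'

# Simulated .jsonl export after the same two runs: one object per line,
# so appending simply adds lines.
appended_jsonl = '{"author": "A"}\n{"author": "B"}\n'


def is_valid_json(text):
    """Return True if the whole string parses as a single JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def read_jsonl(text):
    """Parse line-delimited JSON one line at a time, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


json_ok = is_valid_json(appended_json)
jsonl_items = read_jsonl(appended_jsonl)

print("appended .json still valid:", json_ok)
print("appended .jsonl items:", len(jsonl_items))
```

Reading one line at a time is also what makes JSONL comfortable for large crawls: you never need the whole file in memory.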
Scrapy selectors: css and xpath
Scrapy’s response object supports both CSS and XPath, the same split covered in the selectors guide. The only Scrapy-specific part is the ::text and ::attr() pseudo-elements:
| Goal | Scrapy CSS | Scrapy XPath |
|---|---|---|
| Element text | .css("h1::text").get() | .xpath("//h1/text()").get() |
| Attribute | .css("a::attr(href)").get() | .xpath("//a/@href").get() |
| All matches | .css("a.tag::text").getall() | .xpath("//a[@class='tag']/text()").getall() |
.get() returns the first match, or None when nothing matches, and .getall() returns a list that is empty when nothing matches, so a missing field doesn’t crash your parse code.
When to use Scrapy
Use Scrapy when the job is a crawl: many pages, many links to follow, and a need for speed, retries, and clean exports. For a single page or a quick experiment, BeautifulSoup is less ceremony. And for either tool, a site that blocks you is a separate problem from parsing, one I cover in the web scraping guide, where a scraper API like ChocoData handles the fetching so your spider keeps running.
FAQ
Scrapy and BeautifulSoup solve different problems. BeautifulSoup is a parsing library for small scripts. Scrapy is a full framework with crawling, concurrency, retries, and export pipelines built in. For a single page, BeautifulSoup is faster to write. For crawling thousands of pages, Scrapy's structure and speed win.
For a single-file spider, use scrapy runspider spider.py -o output.json. Inside a full Scrapy project, use scrapy crawl spidername. The -o flag exports scraped items to a file, with the format inferred from the extension (.json, .csv, or .jsonl).
Scrapy does not render JavaScript by itself. It fetches HTML over HTTP and won't run page scripts. For JS-rendered sites, add scrapy-playwright to render pages with a real browser, or use a scraper API that returns rendered HTML to your spider.