Web Scraping With Scrapy: A Complete Beginner Tutorial

Marcus Reed
Founder & lead tester
the short version
  • Scrapy is a Python framework for scraping at scale. It handles requests, concurrency, retries, and data export so you write only the extraction logic.
  • A spider defines where to start, how to parse a page, and how to follow links. The whole thing fits in about 15 lines.
  • I ran the spider below against quotes.toscrape.com and it scraped all 100 quotes across 10 pages into JSON.
  • Use Scrapy when you're crawling many pages. For a quick one-page pull, BeautifulSoup is less setup.

Scrapy is what you graduate to when a one-file script stops being enough. It runs requests concurrently, retries failures, respects delays, and exports your data, all from a small spider class you write. This tutorial builds a working spider from scratch and runs it against quotes.toscrape.com, confirmed in June 2026.

What is Scrapy?

Scrapy is a Python framework for crawling and scraping websites at scale. Where BeautifulSoup parses one page you fetched, Scrapy manages the whole operation: it queues requests, runs many in parallel, follows links, retries on errors, and writes the results to a file. You supply the extraction logic and the crawl rules; the framework runs the machine around them.

That structure is worth the setup cost when you’re crawling many pages and overkill when you’re grabbing one. Here’s how the two compare:

|  | BeautifulSoup | Scrapy |
| --- | --- | --- |
| Type | Parsing library | Full framework |
| Best for | One page, quick scripts | Crawling many pages |
| Concurrency | You add it | Built in |
| Retries and delays | You add them | Built in |
| Data export | You write it | Built in (-o) |
| Setup | Minimal | A spider class |
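
In practice, "built in" means these behaviours are controlled by Scrapy settings instead of code you write. The dictionary below is a minimal sketch of that idea. The keys are standard Scrapy setting names, but the values and the name CRAWL_SETTINGS are illustrative assumptions, not tuned recommendations.

```python
# Illustrative values only (assumptions, not recommendations).
# A dictionary like this can be assigned to a spider's custom_settings.
CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 8,      # upper bound on requests in flight at once
    "DOWNLOAD_DELAY": 1,           # base delay in seconds between requests to one site
    "RETRY_ENABLED": True,         # retry requests that fail
    "RETRY_TIMES": 2,              # retries per request, on top of the first attempt
    "AUTOTHROTTLE_ENABLED": True,  # adjust the delay to how quickly the server responds
}
```

With BeautifulSoup, each of those lines would be logic you write and maintain yourself.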

How do you install Scrapy?

Install Scrapy with pip, ideally inside a virtual environment because it pulls in several dependencies:

pip install scrapy

Confirm it installed by checking the version:

scrapy version

Scrapy works on Windows, macOS, and Linux. It depends on the Twisted networking library on all three, and pip installs that automatically on current versions, including on Windows.

How do you write a Scrapy spider?

A spider is a Python class that defines a start URL, a parse method to extract data, and rules for following links. Save this as quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {"USER_AGENT": "my-project/1.0", "DOWNLOAD_DELAY": 1}

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

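What each piece does:
  • name identifies the spider. It's the name you'd pass to scrapy crawl inside a full project.
  • start_urls lists the pages Scrapy requests first. Each response is passed to parse.
  • custom_settings overrides settings for this spider only: USER_AGENT identifies your crawler to the site, and DOWNLOAD_DELAY spaces requests about 1 second apart.
  • parse loops over every div.quote block and yields one dictionary per quote. Each yielded dictionary becomes an exported item.
  • response.follow turns the next-page link into a new request and sends its response back to parse, so the crawl continues until a page has no li.next link.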
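
One detail is worth seeing in isolation. The next-page href is typically relative (something like /page/2/), and response.follow resolves it against the current page's URL before scheduling the request. The sketch below reproduces roughly that resolution step with the standard library's urljoin. The helper name resolve_next is hypothetical and not part of Scrapy.

```python
from urllib.parse import urljoin

def resolve_next(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)

# A relative link found on the first page becomes an absolute URL.
print(resolve_next("https://quotes.toscrape.com/", "/page/2/"))
# The same logic applies on every later page.
print(resolve_next("https://quotes.toscrape.com/page/2/", "/page/3/"))
```

That is why the spider can pass next_page straight to response.follow without building the full URL by hand.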

How do you run the spider and export data?

Run a single-file spider with scrapy runspider and use -o to export the scraped items. The file format follows the extension:

scrapy runspider quotes_spider.py -o quotes.json

When I ran it, Scrapy crawled all ten pages and wrote 100 items. Reading the output back confirmed it:

scraped items: 100
first author: Albert Einstein
first tags: ['change', 'deep-thoughts', 'thinking', 'world']
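
A readback like that needs only the standard library. This is a sketch, not the exact script used for the run above. It assumes quotes.json is in the current directory and holds the single JSON array that -o quotes.json writes. The function name summarize is made up for illustration.

```python
import json
from pathlib import Path

def summarize(path):
    """Load a Scrapy JSON export and return (item count, first author, first tags)."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # the export is one JSON array of item objects
    if not items:
        return 0, None, []
    first = items[0]
    return len(items), first.get("author"), first.get("tags", [])

# Only report if the export from the run above is actually present.
export = Path("quotes.json")
if export.exists():
    count, author, tags = summarize(export)
    print("scraped items:", count)
    print("first author:", author)
    print("first tags:", tags)
```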

Swap the extension to export differently: -o quotes.csv for CSV, or -o quotes.jsonl for line-delimited JSON that’s better for large crawls. With lowercase -o, Scrapy appends to an existing file, which leaves a .json export invalid after a second run, so delete the file first or use -O (capital) to overwrite it between runs.
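
The reason .jsonl suits large crawls is that every line is a complete JSON object, so the file can be read one item at a time and appended to without breaking it. Here is a minimal sketch of streaming such a file, assuming it was produced with -o quotes.jsonl. The helper names iter_items and count_by_author are hypothetical.

```python
import json

def iter_items(path):
    """Yield items from a JSON Lines file one at a time, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

def count_by_author(path):
    """Count quotes per author while holding only one item in memory at a time."""
    counts = {}
    for item in iter_items(path):
        author = item.get("author")
        counts[author] = counts.get(author, 0) + 1
    return counts
```

Calling count_by_author("quotes.jsonl") after a crawl returns a dictionary that maps each author to their number of quotes.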

Scrapy selectors: css and xpath

Scrapy’s response object supports both CSS and XPath, the same split covered in the selectors guide. The only Scrapy-specific part is the ::text and ::attr() pseudo-elements:

| Goal | Scrapy CSS | Scrapy XPath |
| --- | --- | --- |
| Element text | .css("h1::text").get() | .xpath("//h1/text()").get() |
| Attribute | .css("a::attr(href)").get() | .xpath("//a/@href").get() |
| All matches | .css("a.tag::text").getall() | .xpath('//a[@class="tag"]/text()').getall() |

.get() returns the first match or None, and .getall() returns a list, which keeps your parse code from crashing on missing fields.

When to use Scrapy

Use Scrapy when the job is a crawl: many pages, many links to follow, and a need for speed, retries, and clean exports. For a single page or a quick experiment, BeautifulSoup is less ceremony. And for either tool, a site that blocks you is a separate problem from parsing, one I cover in the web scraping guide, where a scraper API like ChocoData handles the fetching so your spider keeps running.

FAQ

Is Scrapy better than BeautifulSoup?

They solve different problems. BeautifulSoup is a parsing library for small scripts. Scrapy is a full framework with crawling, concurrency, retries, and export pipelines built in. For a single page, BeautifulSoup is faster to write. For crawling thousands of pages, Scrapy's structure and speed win.

How do I run a Scrapy spider?

For a single-file spider, use scrapy runspider spider.py -o output.json. Inside a full Scrapy project, use scrapy crawl spidername. The -o flag exports scraped items to a file, with the format inferred from the extension (.json, .csv, or .jsonl).

Does Scrapy handle JavaScript?

Not by itself. Scrapy fetches HTML over HTTP and won't run page JavaScript. For JS-rendered sites, add scrapy-playwright to render pages with a real browser, or use a scraper API that returns rendered HTML to your spider.
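
For the scrapy-playwright route, the setup is mostly configuration. The sketch below shows the settings described in that plugin's README, written as a plain dictionary. It assumes you have installed the package (pip install scrapy-playwright) and its browsers (playwright install). The names PLAYWRIGHT_SETTINGS and PLAYWRIGHT_META are only labels for this example.

```python
# Settings that route Scrapy's downloads through scrapy-playwright.
# Intended for a spider's custom_settings or a project's settings.py.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    # scrapy-playwright needs the asyncio-based Twisted reactor.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}

# A request is rendered in the browser only when it opts in through its meta dict.
PLAYWRIGHT_META = {"playwright": True}
```

A request then opts in with scrapy.Request(url, meta=PLAYWRIGHT_META). Requests without that key still go through Scrapy's regular downloader.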

Marcus Reed
I've built and run web scrapers for the better part of a decade. On this site I put scraper APIs and scraping tools through real jobs against real targets, then write up what actually holds up.