What Is Web Scraping? A Practical Guide From Someone Who Does It Daily
- Web scraping is automated extraction of public data from websites: a script fetches a page and pulls the fields you want into a structured file.
- There are three ways to do it: write code (most control), use a scraper API (skips the blocking problem), or use a no-code tool (no programming).
- It’s generally legal in the US and EU when you scrape public data and respect personal-data and copyright law. Trouble starts once you log in or bypass an access control.
- The real difficulty is staying unblocked at scale, which is exactly what every paid scraper tool sells a solution to.
I’ve been scraping the web for years, mostly to feed price-tracking and market-research tools. The concept takes about a minute to understand and a lot longer to do well once a serious target starts pushing back. This guide is the explanation I wish I’d had on day one: what web scraping actually is, the three ways to do it, what each one costs you, and where the legal line sits.
What is web scraping?
Web scraping is the automated extraction of data from websites. Instead of opening a page and copying values by hand, you run a program that fetches the page’s HTML and pulls out the specific fields you want, then saves them in a structured format like CSV or JSON.
A scraper does three things in a loop:
- Fetch a page over HTTP, the same request your browser makes.
- Parse the returned HTML to locate the data inside it.
- Store the result as a row of clean, structured data.
That’s the whole idea. A price tracker scrapes product pages for prices. A lead tool scrapes directories for company names. A researcher scrapes listings to study a market. The data is already public and visible in your browser, and scraping collects it at a speed and scale no human can match.
How does web scraping work?
Web scraping works by imitating what your browser does, then reading the response with code. When you visit a page, your browser sends an HTTP request and gets back HTML. A scraper sends the same request, receives the same HTML, then uses a parser to walk the document and pull out the parts you care about.
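To isolate the parse step, here is a minimal offline sketch. It skips the HTTP request and feeds a hard-coded HTML string to Python’s standard-library parser. The markup, the class names (product, name, price), and the ProductParser class are all made up for illustration; they are assumptions, not taken from any real site.

```python
from html.parser import HTMLParser

# Hypothetical HTML, standing in for the body of an HTTP response.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">39.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, one dict per product
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self.rows.append({})             # a new product starts here
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls                # the next text node belongs to this field

    def handle_data(self, data):
        if self._field and self.rows:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None               # stop capturing text


parser = ProductParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
# [{'name': 'Kettle', 'price': '24.99'}, {'name': 'Toaster', 'price': '39.50'}]
```

Walking tags by hand like this gets tedious quickly, which is why most scrapers use a library with CSS selectors instead.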
Here’s a real scraper. I ran this against quotes.toscrape.com, a sandbox the Scrapy team publishes specifically for practice, in June 2026. It collects every quote across all ten pages, following the “next” link until there isn’t one.
```python
import csv

import requests
from bs4 import BeautifulSoup

BASE = "https://quotes.toscrape.com"


def scrape_quotes():
    rows, path = [], "/page/1/"
    while path:
        resp = requests.get(BASE + path, headers={"User-Agent": "Mozilla/5.0"})
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select(".quote"):
            rows.append({
                "text": card.select_one(".text").get_text(strip=True),
                "author": card.select_one(".author").get_text(strip=True),
                "tags": ", ".join(t.get_text() for t in card.select(".tag")),
            })
        nxt = soup.select_one("li.next a")   # find the next-page link
        path = nxt["href"] if nxt else None  # stop when it's gone
    return rows


quotes = scrape_quotes()
print(f"scraped {len(quotes)} quotes")
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(quotes)
```
Install the two dependencies first with pip install requests beautifulsoup4. When I ran it, it returned exactly 100 quotes and wrote a clean quotes.csv:
scraped 100 quotes
That’s the entire pattern: fetch, parse with a CSS selector, follow pagination, write rows. Every scraper you build is a more complicated version of those twenty lines, and the complications come from the target fighting back, which I’ll get to.
What is web scraping used for?
Web scraping is used anywhere a decision depends on data that lives on other people’s websites. These are the use cases I run into most:
| Use case | What gets scraped | Who needs it |
|---|---|---|
| Price monitoring | Competitor and retailer product prices | Ecommerce, retail, resellers |
| Lead generation | Company and contact details from directories | Sales, agencies, B2B SaaS |
| Market research | Listings, reviews, catalog data | Founders, analysts, investors |
| SEO and SERP tracking | Search rankings and result pages | Marketers, SEO teams |
| Training data | Text and images for models | ML and AI teams |
| Travel and real estate | Fares, rates, property listings | Aggregators, comparison sites |
The common thread is volume. The data is public, but it’s spread across thousands of pages and changes constantly, so collecting it by hand is impractical and you automate it instead.
How do you scrape a website? The three methods
There are three ways to scrape a website: write your own code, call a scraper API, or use a no-code tool. They trade control for convenience. Here’s how I choose between them.
| Method | Control | Setup effort | Handles blocking? | Best for |
|---|---|---|---|---|
| Your own code | Highest | High | You build it | Custom jobs, full control, learning |
| Scraper API | High | Low | Yes, built in | Scale, hard targets, production |
| No-code tool | Low | Lowest | Partly | Small jobs, non-developers |
Method 1: Write your own code
Writing your own scraper gives you total control and costs nothing but your time and infrastructure. The code sample above is method one. You pick the language, the parser, and the logic. For most people that means Python with Requests to fetch and BeautifulSoup to parse, or Scrapy when you need a full framework. For sites that build their content with JavaScript, you reach for a real browser through Playwright or Puppeteer.
This is the right call when the job is small, when you’re learning, or when you need behavior no tool offers. Once a target starts blocking you, the work changes shape: now you’re maintaining proxies, browser fingerprints, and retry logic, which is a much bigger job than the scraping that started it.
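As a taste of that retry logic, here is a minimal sketch of retrying with exponential backoff. The function name fetch_with_retries, its parameters, and the simulated flaky_fetch are my own illustration, not part of any library. It assumes your fetch function raises an OSError subclass on failure; Requests’ exceptions and urllib’s URLError both derive from OSError.

```python
import random
import time


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 0.5s of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


# Simulated target that fails twice and then succeeds (illustration only).
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated block")
    return f"<html>{url}</html>"


waits = []  # record the delays instead of actually sleeping
html = fetch_with_retries(flaky_fetch, "https://example.com/", sleep=waits.append)
print(html, len(waits))  # <html>https://example.com/</html> 2
```

In a real scraper you would pass something like a small wrapper around requests.get as fetch and leave sleep at its default. Backoff only helps with transient failures; it does nothing about a hard IP ban.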
Method 2: Use a scraper API
A scraper API takes a URL and hands back the page’s data, dealing with proxies, browsers, and blocking on its end. You send one HTTP request to the API, it talks to the target for you, and the hard infrastructure becomes someone else’s problem. This is what I reach for at scale, and it’s the category this whole site exists to test.
ChocoData is a good example of the category. It exposes a universal endpoint that turns any URL into JSON, HTML, or text, plus 453 dedicated endpoints for specific sites so you skip writing parsers for common targets. A request looks like this:
```python
import requests

resp = requests.get("https://api.chocodata.com/api/v1/universal/get", params={
    "api_key": "YOUR_KEY",
    "url": "https://example.com/product/123",
})
data = resp.json()
```
Every scraper API in this category follows the same model: one request, structured data back, no proxy rotation to maintain. I test the major ones against the same targets and publish the numbers, so you can choose on measured results rather than marketing copy.
Method 3: Use a no-code tool
No-code tools let you scrape by clicking the data you want instead of writing code. Browser-extension scrapers and desktop apps let you select fields visually and export to a spreadsheet. They’re the fastest way to pull a few hundred rows off a cooperative site, and they’re genuinely useful for non-developers.
Their ceiling is low. Once a site needs logins, handles pagination through background requests, or starts blocking, most no-code tools stall. Use them to get started or to handle one-off jobs, then move to code or an API when you need something that runs in production.
Is web scraping legal?
Scraping publicly available data is generally legal in the United States and the European Union, but how you do it and what you collect can cross legal lines. This is not legal advice, and the details depend on your jurisdiction and use case. The broad strokes are well established.
In the US, the most cited case is hiQ Labs v. LinkedIn. The Ninth Circuit held that scraping data which is publicly accessible, with no login required, likely does not violate the Computer Fraud and Abuse Act. The key phrase is “publicly accessible.” Once you log in, accept terms of service, or bypass an access control, you move into different territory.
What actually creates legal risk:
- Personal data. Scraping names, emails, or other personal data pulls you under privacy law. In the EU that’s the GDPR, and it applies regardless of where you operate if the people are in the EU.
- Copyrighted content. A page being public still leaves its content under copyright. Copying creative work wholesale can trigger a copyright claim, whatever method you used to collect it.
- Bypassing access controls. Logins, paywalls, and CAPTCHAs are access controls, and circumventing them is the line courts care about most.
- Terms of service. Breaking a site’s terms can get you sued for breach of contract or permanently banned, even when no law is broken.
My rule of thumb: scrape public data, respect robots.txt and rate limits, leave personal data alone unless you have a lawful basis, and don’t republish content you don’t own. That keeps the large majority of projects on safe ground.
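The robots.txt part of that rule can be checked in code. Below is a small sketch using Python’s standard-library urllib.robotparser. The robots.txt content, the user-agent name my-scraper, and the example.com URLs are placeholders I made up so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

AGENT = "my-scraper"  # hypothetical user-agent name

allowed = rules.can_fetch(AGENT, "https://example.com/products/1")
blocked = rules.can_fetch(AGENT, "https://example.com/private/report")
delay = rules.crawl_delay(AGENT)  # seconds requested between requests, or None

print(allowed, blocked, delay)  # True False 2
```

Against a live site you would call set_url() with the site’s robots.txt address and then read() instead of parse(). Passing this check says what the site asks of crawlers; it does not settle the legal questions above.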
What’s actually hard about web scraping?
Fetching and parsing are the easy part. The work that consumes your time is staying unblocked once you scale up. My twenty-line example runs cleanly because quotes.toscrape.com is built to be scraped, while a real commercial target spends engineering effort trying to stop you.
A popular site sees thousands of requests from one IP address, all following the same pattern, and concludes, correctly, that you’re a bot. Then it serves a CAPTCHA or a block page, or quietly feeds you fake data. Getting past that means rotating IP addresses, sending realistic browser headers, running a real browser for JavaScript-heavy pages, solving or avoiding CAPTCHAs, and pacing requests so you blend in. Each of those is a system to build and maintain.
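Two of those pieces, pacing and header rotation, are small enough to sketch. The helper paced_requests, its parameters, and the User-Agent strings below are illustrative assumptions, not a recipe that defeats any particular anti-bot system.

```python
import itertools
import random
import time

# Illustrative User-Agent strings (made up); real ones should match browsers you emulate.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) ExampleBrowser/1.0",
]


def paced_requests(urls, min_gap=1.0, max_gap=3.0, sleep=time.sleep):
    """Yield (url, headers) pairs, pausing a random interval between consecutive URLs."""
    agents = itertools.cycle(USER_AGENTS)
    for i, url in enumerate(urls):
        if i > 0:
            # A random gap avoids the perfectly regular timing of a tight loop.
            sleep(random.uniform(min_gap, max_gap))
        yield url, {"User-Agent": next(agents)}


pauses = []  # record the gaps instead of actually sleeping
plan = list(paced_requests(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    sleep=pauses.append,
))
for url, headers in plan:
    print(url, headers["User-Agent"])
```

In a real scraper you would pass each url and headers pair to your HTTP client and leave sleep at its default. IP rotation, browser fingerprints, and CAPTCHA handling are separate systems on top of this.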
That maintenance burden is the reason scraper APIs exist and people pay for them. The price buys you out of building and babysitting the unblocking layer. Whether it’s worth paying depends on your target and your scale, and that tradeoff is what I test on the rest of this site.
Where to go next
If you’re starting out, take the code sample above, point it at quotes.toscrape.com, and change the selectors until you can pull the data into a CSV. That single exercise teaches you more than any amount of reading. When you hit your first block page, and you will, that’s your cue to reach for a scraper API rather than fight the infrastructure alone.
FAQ
Is web scraping the same as web crawling?
No. Crawling means discovering and following links to map what pages exist, which is what search engines do. Scraping means extracting specific data from those pages. Most real projects crawl to find URLs, then scrape each one.
Do I need to know how to code to scrape a website?
Not for small jobs. No-code tools like browser-extension scrapers let you click the fields you want. For anything large or for sites that fight back, code or a scraper API gives you far more control, and sites start fighting back fast once you scale up.
Why did my scraper stop working?
You got blocked or the page layout changed. Sites detect repeated automated requests by IP, headers, and behavior, then serve a CAPTCHA or a block page. Rotating IPs, realistic headers, and slower request rates help. At scale, a scraper API handles this for you.
What’s the best programming language for web scraping?
Python, for most people. The ecosystem (Requests, BeautifulSoup, Scrapy, Playwright) is the deepest and the examples are everywhere. JavaScript with Node is a strong second, especially for sites that need a real browser.