Web Scraping With cURL: A Practical Guide
- cURL is a command-line tool for sending HTTP requests. It's the fastest way to inspect a page, test headers, and confirm what a server returns before you write a scraper.
- The core flags: -A sets a user agent, -H adds headers, -L follows redirects, -o saves output, -d posts data.
- Every command below was run with curl 8.19; the response codes and sizes are the real output.
- cURL fetches raw HTML. It doesn't parse it or run JavaScript, so it pairs with a parser like BeautifulSoup for the extraction step.
cURL is the tool I open before writing a single line of scraper code. It tells me what a server actually returns: the status, the headers, whether my user agent gets blocked, and what the HTML looks like. This guide covers the commands that matter for scraping, each one run with curl 8.19 in June 2026 so the output is real.
What is cURL and why use it for scraping?
cURL is a command-line tool that sends HTTP requests and prints the response. For scraping it does two jobs: it fetches raw HTML you can pipe into a parser, and it’s the fastest way to debug why a scraper is failing. When a Python request returns a block page, I reproduce it in one cURL line and change headers until it works, then port that back to code.
cURL handles the transport only. It downloads bytes; it doesn’t parse HTML or run JavaScript. So in a real workflow it fetches and a library like BeautifulSoup parses.
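To make the fetch/parse split concrete, here is a minimal sketch of the parsing half. It uses Python's standard-library HTMLParser as a stand-in for BeautifulSoup so it runs without extra installs. The sample HTML and the span class="text" markup are assumptions for illustration, not output captured from a real page.

```python
from html.parser import HTMLParser


class QuoteTextParser(HTMLParser):
    """Collect the text inside every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._in_quote = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self._in_quote = True
            self.quotes.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open quote.
        if self._in_quote:
            self.quotes[-1] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_quote = False


def extract_quotes(html):
    """Return the stripped text of each quote span found in the HTML."""
    parser = QuoteTextParser()
    parser.feed(html)
    parser.close()
    return [quote.strip() for quote in parser.quotes]


# Hypothetical sample standing in for HTML that cURL downloaded.
sample_html = """
<div class="quote">
  <span class="text">First quote.</span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""

print(extract_quotes(sample_html))
```

In practice the string would come from a file cURL saved, and a fuller parser such as BeautifulSoup would handle nested or messy markup that this sketch does not.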
How do you fetch a page with cURL?
Run curl followed by the URL to print the page. Add flags to control the request and capture useful information:
curl -s -A "Mozilla/5.0" "https://quotes.toscrape.com/page/1/" \
-o page1.html -w "code=%{http_code} size=%{size_download} time=%{time_total}s\n"
That fetched the page and reported the result without dumping the HTML to the terminal:
code=200 size=11064 time=0.623561s
The flags doing the work:
- -s silences the progress meter so the output stays clean.
- -A "Mozilla/5.0" sets the user agent, which often decides whether you get the page or a block.
- -o page1.html saves the body to a file.
- -w "..." prints chosen values after the request. %{http_code}, %{size_download}, and %{time_total} are the three I check first.
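If you run the command from a script, the -w line is easy to turn into typed values. The sketch below assumes the exact code=... size=... time=...s format used in the command above; the function name parse_write_out is hypothetical.

```python
import re


def parse_write_out(line):
    """Turn 'code=200 size=11064 time=0.623561s' into typed values."""
    fields = dict(re.findall(r"(\w+)=(\S+)", line))
    return {
        "code": int(fields["code"]),
        "size": int(fields["size"]),
        # Drop the trailing "s" that the format string appends to the time.
        "time": float(fields["time"].rstrip("s")),
    }


result = parse_write_out("code=200 size=11064 time=0.623561s")
print(result)

# A simple health check a script might apply before parsing the saved file.
looks_ok = result["code"] == 200 and result["size"] > 0
print(looks_ok)
```

A check like looks_ok is a cheap guard against parsing a block page or an empty body by mistake.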
How do you inspect response headers?
Use -I to send a HEAD request and fetch only the headers, which is the quickest way to read status, content type, and caching without downloading the body:
curl -s -I "https://quotes.toscrape.com/"
HTTP/1.1 200 OK
Date: Fri, 12 Jun 2026 11:48:09 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 11064
Connection: keep-alive
This tells me the page is HTML, returns 200, and is about 11 KB before I commit to parsing it. Use -i instead of -I to get headers and body together.
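The same checks can be scripted. This is a rough sketch that parses a single header block like the one above into a status code and a dictionary; the function name parse_head_response is hypothetical, and it assumes one response block (with -L, cURL prints one block per redirect hop, which this does not handle).

```python
def parse_head_response(raw):
    """Split the output of `curl -s -I` into (status_code, headers)."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    # Status line looks like "HTTP/1.1 200 OK"; the reason phrase may be absent.
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalize to lowercase.
        headers[name.strip().lower()] = value.strip()
    return status_code, headers


raw_headers = """HTTP/1.1 200 OK
Date: Fri, 12 Jun 2026 11:48:09 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 11064
Connection: keep-alive
"""

status, headers = parse_head_response(raw_headers)
print(status, headers["content-type"], int(headers["content-length"]))
```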
The cURL flags that matter for scraping
These are the flags I use constantly, with the request behavior each one controls:
| Flag | Does | Scraping use |
|---|---|---|
| -A | Sets user agent | Avoid default-cURL blocks |
| -H | Adds a header | Set Accept, Referer, cookies |
| -L | Follows redirects | Land on the final page |
| -o / -O | Saves to file | Keep the HTML for parsing |
| -s | Silent mode | Clean, scriptable output |
| -I | Headers only | Quick status and type check |
| -d | Sends POST body | Hit search and form endpoints |
| --compressed | Requests gzip | Match what browsers send |
Stacking a few of these reproduces a realistic browser request: curl -sL --compressed -A "Mozilla/5.0" -H "Accept-Language: en-US" URL.
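When the same stacked command has to run for many URLs, it can help to assemble the argument list in code. The following is a sketch only: build_curl_command and its parameters are hypothetical names, and it covers just the flags from the table above.

```python
import shlex


def build_curl_command(url, user_agent=None, headers=None,
                       follow_redirects=False, compressed=False, silent=True):
    """Assemble a curl argument list from the flags covered in the table."""
    cmd = ["curl"]
    if silent:
        cmd.append("-s")
    if follow_redirects:
        cmd.append("-L")
    if compressed:
        cmd.append("--compressed")
    if user_agent:
        cmd += ["-A", user_agent]
    for name, value in (headers or {}).items():
        cmd += ["-H", f"{name}: {value}"]
    cmd.append(url)
    return cmd


cmd = build_curl_command(
    "https://quotes.toscrape.com/",
    user_agent="Mozilla/5.0",
    headers={"Accept-Language": "en-US"},
    follow_redirects=True,
    compressed=True,
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a string means it can be passed straight to subprocess.run without shell quoting problems.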
How do you POST data with cURL?
Use -d to send a request body, which is how you hit search endpoints and APIs that expect form or JSON input. Posting JSON to a test endpoint:
curl -s -X POST "https://httpbin.org/post" \
-H "Content-Type: application/json" \
-d '{"q":"scraping"}'
The endpoint echoed the payload back, confirming the POST went through:
"data": "{\"q\":\"scraping\"}",
"q": "scraping"
-d implies a POST, so -X POST is optional here, but I keep it for clarity. For form-encoded data, drop the JSON header and pass -d "q=scraping&page=1".
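To see what the two body styles look like on the wire, this sketch builds both requests with Python's standard library without sending them. The form fields mirror the q=scraping&page=1 example above; the variable names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://httpbin.org/post"

# JSON body, equivalent to: -H "Content-Type: application/json" -d '{"q":"scraping"}'
json_body = json.dumps({"q": "scraping"}, separators=(",", ":")).encode("utf-8")
json_request = Request(URL, data=json_body,
                       headers={"Content-Type": "application/json"})

# Form-encoded body, equivalent to: -d "q=scraping&page=1"
form_body = urlencode({"q": "scraping", "page": 1}).encode("ascii")
form_request = Request(URL, data=form_body)

# Like curl's -d, supplying a body switches the method to POST.
# Nothing is sent here; the objects are only inspected.
for req in (json_request, form_request):
    print(req.get_method(), req.data)
```

Passing either object to urllib.request.urlopen would send it; that step is left out so the sketch runs offline.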
From cURL to a real scraper
cURL gets you a verified request. The next step is moving it into code so you can parse and loop. Each flag maps cleanly to Python Requests:
import requests
# the equivalent of: curl -A "Mozilla/5.0" -H "Accept-Language: en-US" URL
resp = requests.get(
    "https://quotes.toscrape.com/page/1/",
    headers={"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US"},
)
print(resp.status_code, len(resp.text))
Once that works, hand resp.text to a parser and you have a scraper. When a request that works in cURL starts failing at scale, the cause is usually blocking rather than your command, and that’s the point where a scraper API like ChocoData takes over the fetch. The full tradeoff is in the web scraping guide.
FAQ
Can you use cURL for web scraping?
Yes, cURL fetches the raw HTML of a page, which is the first half of scraping. You then pass that HTML to a parser to extract data. cURL is also the best tool for debugging a scraper: it shows you exactly what the server returns for a given set of headers.
How do you set a user agent in cURL?
Use the -A flag followed by the user-agent string, for example curl -A "Mozilla/5.0" https://example.com. Many sites return different responses or block requests that use cURL's default user agent, so setting a realistic one is often the first fix when a request fails.
How do you convert a cURL command to Python Requests?
Map each flag to a Requests argument: -A and -H become the headers dict, -d becomes data or json, and -L maps to allow_redirects (on by default). For example, curl -A "UA" -H "Accept: application/json" URL becomes requests.get(URL, headers={"User-Agent": "UA", "Accept": "application/json"}).
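The flag mapping can be sketched as a small translator. This is an illustration under assumptions, not a full converter: curl_to_requests_kwargs is a hypothetical helper, it understands only -A, -H, -d, -L, and -s, and it returns a plain dict rather than importing Requests.

```python
import argparse
import shlex


def curl_to_requests_kwargs(command):
    """Translate a simple curl command line into keyword arguments for Requests."""
    parser = argparse.ArgumentParser(prog="curl", add_help=False)
    parser.add_argument("-A", dest="user_agent")
    parser.add_argument("-H", dest="headers", action="append", default=[])
    parser.add_argument("-d", dest="data")
    parser.add_argument("-L", dest="follow", action="store_true")
    parser.add_argument("-s", dest="silent", action="store_true")
    parser.add_argument("url")
    # Drop the leading "curl" token before parsing the flags.
    args = parser.parse_args(shlex.split(command)[1:])

    headers = {}
    if args.user_agent:
        headers["User-Agent"] = args.user_agent
    for header in args.headers:
        name, _, value = header.partition(":")
        headers[name.strip()] = value.strip()

    kwargs = {
        # -d implies POST in curl, so mirror that here.
        "method": "POST" if args.data is not None else "GET",
        "url": args.url,
        "headers": headers,
        # curl follows redirects only with -L, so set this explicitly.
        "allow_redirects": args.follow,
    }
    if args.data is not None:
        kwargs["data"] = args.data
    return kwargs


example = 'curl -A "UA" -H "Accept: application/json" https://example.com'
print(curl_to_requests_kwargs(example))
```

With Requests installed, the resulting dict can be passed as requests.request(**kwargs).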