
How to Scrape Wikipedia: API and HTML Methods (Tested)

Marcus Reed
Founder & lead tester
the short version
  • Wikipedia is one of the friendliest sites to scrape: stable HTML, a free official API, and a permissive content license.
  • For text and summaries, use the REST API. For tables, parse the HTML with pandas.read_html, which turns a wikitable into a DataFrame in one line.
  • I pulled the full 503-row S&P 500 table and the Web scraping article summary with the code below.
  • Set a descriptive User-Agent. Wikipedia asks for it and can block generic bot agents.

Wikipedia is the best place to learn scraping, and a genuinely useful data source in its own right. The HTML is stable, the content license is permissive, and there’s an official API when you want clean data. This guide shows both routes, the API and HTML parsing, each tested against live Wikipedia pages in June 2026.

Can you scrape Wikipedia?

Yes, and Wikipedia makes it easy on purpose. The text is licensed under Creative Commons BY-SA, which means you can reuse it as long as you credit the source and share any adaptations under the same license. Wikipedia also publishes a full REST API, a MediaWiki API, and downloadable database dumps, so automated access is a supported path rather than a fight. The one rule that matters: set a descriptive User-Agent header. Wikipedia’s guidelines ask for it, and generic bot agents can get blocked.

That makes Wikipedia the rare large site where the legal and technical friction is low, which is why I use it as the teaching example for parsing tables.
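A descriptive User-Agent usually names the project, a version, and a way to reach you. Here is a minimal sketch of building one. The helper name build_user_agent and the example values are illustrative, not something Wikipedia prescribes.

```python
def build_user_agent(project: str, version: str, contact: str) -> dict:
    """Return a headers dict with a descriptive User-Agent.

    The "name/version (contact)" layout mirrors the header used in this
    guide's examples; the helper itself is a hypothetical convenience.
    """
    return {"User-Agent": f"{project}/{version} ({contact})"}


UA = build_user_agent("my-project", "1.0", "you@example.com")
print(UA["User-Agent"])
```

The resulting dict can be passed as headers=UA to requests.get, exactly as the examples in this guide do.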

Method 1: the Wikipedia REST API

For article text and summaries, the REST API returns clean JSON and saves you from parsing HTML at all. One request gets a page summary:

import requests

UA = {"User-Agent": "my-project/1.0 (you@example.com)"}
url = "https://en.wikipedia.org/api/rest_v1/page/summary/Web_scraping"

resp = requests.get(url, headers=UA, timeout=30)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error body
data = resp.json()
print(data["title"])
print(data["extract"][:90])

That returned the title and the opening extract:

Web scraping
Web scraping, web harvesting, or web data extraction is data scraping used for extracting

The summary endpoint gives you the title, a plain-text extract, a thumbnail (when the article has one), and the canonical URL. Swap Web_scraping for any article title, with underscores for spaces and any other special characters percent-encoded, to get the same shape back. Because this is the official API, it won’t break when Wikipedia changes its page layout.
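Turning a human-readable title into a summary URL can be sketched as below. The helper name summary_url is hypothetical, and percent-encoding the slash is my assumption about the safest way to pass titles such as AC/DC in a URL path.

```python
from urllib.parse import quote

SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title: str) -> str:
    """Build a summary URL from a human-readable article title."""
    slug = title.strip().replace(" ", "_")
    # safe="" also encodes "/", which would otherwise be read as a path separator
    return SUMMARY_API + quote(slug, safe="")


print(summary_url("Web scraping"))
print(summary_url("AC/DC"))
```

The returned string drops straight into requests.get(url, headers=UA, timeout=30) from the example above.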

Method 2: parse a table with pandas

When the data you want is a table, pandas.read_html is usually the quickest route. It scans a page for HTML tables and hands each one back as a DataFrame. It relies on an HTML parser being installed alongside pandas: lxml, or BeautifulSoup with html5lib. The S&P 500 list is a classic target because its main table has a stable id:

import io
import requests
import pandas as pd

UA = {"User-Agent": "my-project/1.0 (you@example.com)"}
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

html = requests.get(url, headers=UA, timeout=30).text
tables = pd.read_html(io.StringIO(html), attrs={"id": "constituents"})
df = tables[0]

print("shape:", df.shape)
print(df.iloc[0]["Symbol"], "|", df.iloc[0]["Security"])

That pulled the full constituents table:

shape: (503, 8)
MMM | 3M

Two details make this reliable. Passing attrs={"id": "constituents"} targets the one table I want instead of every table on the page. Wrapping the HTML in io.StringIO(...) hands read_html a file-like object, which is what pandas 2.1 and later expect; passing the raw HTML string directly is deprecated and triggers a warning. From there, df.to_csv("sp500.csv", index=False) writes a clean file.
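To see the attrs filter work without touching the network, here is a small offline sketch. The HTML is a hand-written stand-in with two tables, not a copy of the real page, and it assumes pandas plus one of its HTML parsers (such as lxml) is installed.

```python
import io

import pandas as pd

# Hand-written stand-in: one unrelated table and one with the id we want.
HTML = """
<table id="other">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M</td></tr>
  <tr><td>AOS</td><td>A. O. Smith</td></tr>
</table>
"""

# attrs narrows the match to the table whose id is "constituents"
tables = pd.read_html(io.StringIO(HTML), attrs={"id": "constituents"})
df = tables[0]

print(len(tables), "table matched")
print(df)

# to_csv with no path returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

Dropping the attrs argument would return both tables in page order, which is why the id filter is worth the extra few characters.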

Method 3: parse the HTML yourself

When you need something read_html won’t give you, like links inside a cell or a specific infobox value, parse with BeautifulSoup. The pattern is the same as any other site:

import requests
from bs4 import BeautifulSoup

UA = {"User-Agent": "my-project/1.0 (you@example.com)"}
url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"

soup = BeautifulSoup(requests.get(url, headers=UA, timeout=30).text, "html.parser")
print(soup.select_one("#firstHeading").get_text(strip=True))

# symbol + the company's Wikipedia link, pulled from each row's cells
for row in soup.select("#constituents tbody tr")[1:4]:
    cells = row.find_all("td")
    symbol = cells[0].get_text(strip=True)
    link = cells[1].find("a")          # the Security column links to the company article
    print(symbol, "->", link["href"])

That printed the page title and the first three company links:

List of S&P 500 companies
MMM -> /wiki/3M
AOS -> /wiki/A._O._Smith
ABT -> /wiki/Abbott_Laboratories

#firstHeading is the page title element Wikipedia uses on every article. Reading each row’s cells by hand lets you keep the links that read_html flattens away, like the /wiki/ path to each company’s own article.
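The hrefs come back as site-relative paths, and not every cell is guaranteed to contain a link. Below is a sketch of handling both, run against a hand-written HTML fragment rather than the live page. The XYZ row is made up to show the no-link case, and BASE is my assumption for the site root.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org/"

# Hand-written stand-in for a fragment of the constituents table.
HTML = """
<table id="constituents"><tbody>
<tr><th>Symbol</th><th>Security</th></tr>
<tr><td>MMM</td><td><a href="/wiki/3M">3M</a></td></tr>
<tr><td>XYZ</td><td>No link here</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

rows = []
for row in soup.select("#constituents tr"):
    cells = row.find_all("td")
    if len(cells) < 2:  # header rows use <th>, so they have no <td> cells
        continue
    link = cells[1].find("a")
    # urljoin turns "/wiki/3M" into a full URL; rows without a link get None
    href = urljoin(BASE, link["href"]) if link else None
    rows.append((cells[0].get_text(strip=True), href))

print(rows)
```

Skipping rows by their td count, instead of slicing off the first row, keeps the loop working if the table ever gains a second header row.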

Which method should you use?

Pick the API for text, the table parser for tables, and hand-parsing for the awkward rest:

You want                      | Use              | Why
------------------------------|------------------|---------------------------
Article summary or text       | REST API         | Clean JSON, layout-proof
A data table                  | pandas.read_html | One line to a DataFrame
Links, infobox, odd fields    | BeautifulSoup    | Full control over the HTML
Bulk, whole-encyclopedia data | Database dumps   | No scraping needed at all

For a one-off table, read_html wins every time. For an application that reads many articles, the API is steadier. And if you ever need Wikipedia at real bulk, download a database dump instead of crawling, which is kinder to their servers and faster for you.
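For the many-articles case, a polite loop might look like the sketch below. The function name fetch_summaries, the injected fetch_json callable, and the one-second default pause are all my assumptions, not values Wikipedia publishes. In real use, fetch_json would wrap requests.get with the User-Agent header and return the parsed JSON.

```python
import time
from urllib.parse import quote

SUMMARY_API = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def fetch_summaries(titles, fetch_json, delay_seconds=1.0):
    """Fetch one summary per title, pausing between requests.

    fetch_json is any callable that takes a URL and returns parsed JSON.
    """
    results = {}
    for i, title in enumerate(titles):
        if i:  # no need to wait before the first request
            time.sleep(delay_seconds)
        url = SUMMARY_API + quote(title.replace(" ", "_"), safe="")
        results[title] = fetch_json(url)
    return results


# Offline stand-in so the loop can be tried without network access.
def fake_fetch(url):
    return {"requested": url}


summaries = fetch_summaries(["Web scraping", "Data scraping"], fake_fetch, delay_seconds=0)
print(summaries["Web scraping"]["requested"])
```

Passing the fetcher in as an argument keeps the pacing logic separate from the HTTP client, so the same loop can be tested offline and reused with a real session.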

Wikipedia rarely blocks well-behaved scrapers, so you won’t need anti-blocking tricks here. On commercial targets that do fight back, the calculus changes, which is the subject of the main web scraping guide.

FAQ

Is it legal to scrape Wikipedia?

Yes. Wikipedia's text is licensed under Creative Commons (CC BY-SA), so you can reuse it with attribution and share-alike. Wikipedia also publishes an official API and even full database dumps, so scraping is explicitly supported. Follow their API etiquette: set a real User-Agent and don't hammer the servers.

Should I use the Wikipedia API or scrape the HTML?

Use the API for article text, summaries, and structured metadata; it's cleaner, faster, and stable across layout changes. Parse the HTML when you need something the API doesn't expose conveniently, like a specific wikitable, where pandas.read_html is the fastest route.

How do I extract a table from a Wikipedia page in Python?

Use pandas.read_html, which finds HTML tables on a page and returns each as a DataFrame. Target a specific table with the attrs argument, for example pd.read_html(io.StringIO(html), attrs={'id': 'constituents'}), then export the DataFrame to CSV with df.to_csv.

Marcus Reed
I've built and run web scrapers for the better part of a decade. On this site I put scraper APIs and scraping tools through real jobs against real targets, then write up what actually holds up.