Introduction to Web Scraping with Python

Learn web scraping with Python using BeautifulSoup and Requests. Master HTML parsing, CSS selectors, handling pagination, dynamic content, ethical scraping, and building data pipelines.

Web scraping is the automated extraction of data from websites: you send HTTP requests, parse the returned HTML, and pull out the specific content you need. In Python, the standard approach combines the requests library (to fetch web pages) with BeautifulSoup (to parse and navigate the HTML structure), which lets you systematically collect data from most static websites and load it into structured pandas DataFrames. Web scraping is appropriate when no API is available, and it should always be done responsibly: respect robots.txt files, rate-limit your requests, and comply with the site’s terms of service.

Introduction

There are two ways to get data from a website programmatically. The first is through an official API — a structured, documented interface designed for machine access. The second, when no API exists, is web scraping: treating the website’s HTML as the data source and extracting information directly from its structure.

Web scraping opens up an enormous universe of data. Public records and government databases, product pricing and availability, news articles and press releases, academic publications, real estate listings, job postings, sports statistics, restaurant reviews — all of this exists as HTML on web pages, accessible to anyone, and extractable with the right tools.

The technique has a long history in data science and journalism. Investigative reporters scrape public court records to identify patterns. Economists scrape online job postings to track labor market trends in real time. Data scientists scrape e-commerce sites to build price comparison models. Researchers scrape academic databases to analyze citation networks. The list is as long as the web itself.

This article teaches you web scraping from the ground up: understanding HTML structure, using requests and BeautifulSoup to extract data, handling common complications like pagination and dynamic content, building responsible and robust scrapers, and turning raw scraped data into clean DataFrames. By the end, you’ll be able to extract structured data from most websites you encounter.

The Ethics and Legality of Web Scraping

Before writing a single line of scraping code, every data scientist needs to understand the ethical and legal landscape.

Always Check robots.txt

The robots.txt file (served from the site root, for example https://www.example.com/robots.txt) is a standard convention through which websites declare which parts of the site automated bots may access. Ethical scrapers always check and respect it:

Plaintext
# Example robots.txt
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /user-data/
Allow: /public/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
  • Disallow: Do not scrape these paths
  • Allow: These paths are explicitly permitted
  • Crawl-delay: Wait at least this many seconds between requests
Python
import urllib.robotparser

def is_scraping_allowed(base_url: str, path: str, user_agent: str = "*") -> bool:
    """Check whether scraping a given path is permitted by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"{base_url}{path}")

# Example usage
allowed = is_scraping_allowed("https://example.com", "/products/", user_agent="MyBot")
print(f"Scraping allowed: {allowed}")
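To experiment with these rules without making a network request, the parser can also be given the file contents directly. The sketch below feeds it the example robots.txt shown earlier and queries the result. The bot name MyBot and the www.example.com URLs are placeholders, not real endpoints.

```python
import urllib.robotparser

# The example rules from above, held in a string instead of fetched over HTTP
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /user-data/
Allow: /public/
Crawl-delay: 10

User-agent: Googlebot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

# MyBot has no dedicated section, so the "*" rules apply to it
admin_ok = rp.can_fetch("MyBot", "https://www.example.com/admin/users")
public_ok = rp.can_fetch("MyBot", "https://www.example.com/public/index.html")
delay = rp.crawl_delay("MyBot")  # seconds, or None when no Crawl-delay applies

print(admin_ok, public_ok, delay)
```

The value returned by crawl_delay() is a natural lower bound for the pause between requests to that site.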

Read the Terms of Service

Many websites explicitly prohibit automated scraping in their Terms of Service (ToS). Even where scraping is technically possible, violating ToS can result in IP bans, legal cease-and-desist letters, or — in extreme cases involving circumventing access controls — legal action under computer fraud statutes. Always read the ToS before scraping at scale.

Be a Respectful Scraper

Even where scraping is legal and permitted, aggressive scraping harms the target website:

  • Add delays between requests — don’t hammer a server with hundreds of requests per second
  • Use a descriptive User-Agent — identify yourself and provide contact information
  • Scrape during off-peak hours when server load is lower
  • Cache results — don’t re-fetch pages you’ve already scraped
  • Prefer APIs when they exist — APIs are designed for machine access; scraping puts additional load on servers built for humans
Python
# Identify yourself responsibly
HEADERS = {
    "User-Agent": "DataResearchBot/1.0 (academic research; contact: researcher@university.edu)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9"
}
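Two of the habits above, delaying between requests and caching results, can be combined in a small wrapper. This is a minimal sketch, not a library API: PoliteFetcher is a hypothetical class, and the fetch function, clock, and sleep function are injected so the behaviour can be checked without touching the network.

```python
import time
from typing import Callable, Dict, Optional


class PoliteFetcher:
    """Wrap a fetch function with an in-memory cache and a minimum delay."""

    def __init__(self, fetch: Callable[[str], str], min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Serve repeated URLs from memory instead of hitting the server again
        if url in self._cache:
            return self._cache[url]

        # Wait out whatever remains of the minimum interval
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            if elapsed < self._min_interval:
                self._sleep(self._min_interval - elapsed)

        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

With requests installed, a real fetch function could be as simple as lambda url: requests.get(url, headers=HEADERS, timeout=10).text. The cache here lives only for the lifetime of the process; a longer job would more likely persist pages to disk.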

Understanding HTML: The Structure of Web Pages

To scrape effectively, you need to understand HTML — the markup language that defines the structure and content of web pages.

HTML Basics

HTML (HyperText Markup Language) consists of elements defined by tags:

HTML
<!DOCTYPE html>
<html>
<head>
    <title>Product Catalog</title>
</head>
<body>
    <h1 class="page-title">Our Products</h1>

    <div id="product-list">

        <div class="product-card" data-product-id="PROD_001">
            <h2 class="product-name">Wireless Headphones</h2>
            <span class="price">$149.99</span>
            <span class="category">Electronics</span>
            <p class="description">Premium noise-cancelling headphones.</p>
            <a href="/products/wireless-headphones" class="details-link">View Details</a>
            <div class="rating">
                <span class="stars">★★★★☆</span>
                <span class="review-count">(247 reviews)</span>
            </div>
        </div>

        <div class="product-card" data-product-id="PROD_002">
            <h2 class="product-name">USB-C Hub</h2>
            <span class="price">$49.99</span>
            <span class="category">Electronics</span>
            <p class="description">7-in-1 USB-C hub with HDMI output.</p>
            <a href="/products/usb-c-hub" class="details-link">View Details</a>
            <div class="rating">
                <span class="stars">★★★★★</span>
                <span class="review-count">(89 reviews)</span>
            </div>
        </div>

    </div>

    <div class="pagination">
        <a href="/products?page=2" class="next-page">Next →</a>
    </div>
</body>
</html>

Key concepts:

  • Tags: <h1>, <div>, <span>, <a>, <p> — define element types
  • Attributes: class, id, href, data-product-id — provide metadata
  • id: Should be unique on the page — targets one specific element
  • class: Can be shared by many elements — identifies groups of similar elements
  • Nesting: Elements contain other elements, forming a tree structure called the DOM (Document Object Model)
  • Text content: The human-readable text between opening and closing tags
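The nesting idea can be made visible with nothing but the standard library. The sketch below walks a trimmed-down version of the product markup and records each opening tag indented by its depth in the tree. It assumes well-formed HTML with no void elements such as <img> or <br>; OutlineBuilder is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Record each opening tag, indented by its depth in the element tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1  # everything until the closing tag is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


markup = (
    '<div id="product-list">'
    '<div class="product-card">'
    '<h2 class="product-name">Wireless Headphones</h2>'
    '<span class="price">$149.99</span>'
    '</div>'
    '</div>'
)

builder = OutlineBuilder()
builder.feed(markup)
print("\n".join(builder.lines))
# <div id="product-list">
#   <div class="product-card">
#     <h2 class="product-name">
#     <span class="price">
```

The heading and the price sit at the same depth because they are siblings inside the same card, which is exactly the relationship a scraper relies on when it searches within one card at a time.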

Using Browser Developer Tools

Before writing any scraper, open the target page in your browser and use Developer Tools to inspect the HTML structure:

  1. Right-click on the element you want to extract
  2. Select “Inspect” (or “Inspect Element”)
  3. The DevTools panel opens, highlighting the HTML for that element
  4. Examine parent elements to understand the containing structure
  5. Look for consistent patterns: class names, IDs, or structural positions that reliably identify the data you want

This inspection step — understanding the HTML structure before writing code — is the most important part of building any scraper.

Installing the Tools

Bash
pip install requests beautifulsoup4 lxml
  • requests: Fetches web pages (covered in Article 75)
  • beautifulsoup4: Parses HTML and provides a navigation/search API
  • lxml: A fast HTML/XML parser that BeautifulSoup can use as its parsing backend when you pass "lxml" as the parser name

BeautifulSoup: Parsing and Navigating HTML

BeautifulSoup turns raw HTML text into a navigable tree of Python objects. You can search it by tag name, CSS class, ID, attributes, or text content — and extract exactly the data you need.

Creating a Soup Object

Python
from bs4 import BeautifulSoup

# Parse an HTML string
html = """
<div id="product-list">
    <div class="product-card">
        <h2 class="product-name">Wireless Headphones</h2>
        <span class="price">$149.99</span>
    </div>
    <div class="product-card">
        <h2 class="product-name">USB-C Hub</h2>
        <span class="price">$49.99</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "lxml")  # "lxml" is the parser to use

# In practice, parse the response from a requests call:
# response = requests.get(url, headers=HEADERS)
# soup = BeautifulSoup(response.text, "lxml")

Finding Elements: find() and find_all()

The two most important BeautifulSoup methods:

Python
# find(): Returns the FIRST matching element (or None if not found)
# find_all(): Returns a LIST of ALL matching elements

# Find by tag name
first_h2 = soup.find("h2")                   # First <h2> element
all_h2s = soup.find_all("h2")                # All <h2> elements

# Find by CSS class
first_card = soup.find("div", class_="product-card")
all_cards = soup.find_all("div", class_="product-card")

# Find by ID
product_list = soup.find("div", id="product-list")

# Find by attribute
link = soup.find("a", href=True)             # Any <a> with an href
specific_link = soup.find("a", href="/about")

# Find by multiple attributes
elem = soup.find("div", {"class": "product-card", "data-featured": "true"})

# Get text content of an element
first_name = soup.find("h2", class_="product-name")
print(first_name.text)                       # "Wireless Headphones"
print(first_name.get_text(strip=True))       # Same but strips whitespace

Extracting All Products: Iterating find_all()

Python
products = []

for card in soup.find_all("div", class_="product-card"):
    # Extract fields from each card
    name_tag  = card.find("h2", class_="product-name")
    price_tag = card.find("span", class_="price")

    products.append({
        "name":  name_tag.get_text(strip=True) if name_tag else None,
        "price": price_tag.get_text(strip=True) if price_tag else None,
    })

print(products)
# [{'name': 'Wireless Headphones', 'price': '$149.99'},
#  {'name': 'USB-C Hub', 'price': '$49.99'}]

Navigating the Tree

BeautifulSoup lets you traverse parent–child–sibling relationships:

Python
card = soup.find("div", class_="product-card")

# Children (direct)
for child in card.children:
    if child.name:  # Skip NavigableString whitespace nodes
        print(child.name, child.get_text(strip=True)[:30])

# Parent
parent = card.parent
print(parent.get("id"))  # "product-list"

# Siblings
next_card = card.find_next_sibling("div", class_="product-card")
prev_card = card.find_previous_sibling("div", class_="product-card")

Extracting Attributes

Python
full_html = """
<div class="product-card" data-product-id="PROD_001">
    <h2 class="product-name">Wireless Headphones</h2>
    <a href="/products/wireless-headphones">View Details</a>
    <img src="/images/headphones.jpg" alt="Wireless Headphones">
</div>
"""
soup2 = BeautifulSoup(full_html, "lxml")

card = soup2.find("div", class_="product-card")

# Get a specific attribute
product_id = card.get("data-product-id")    # "PROD_001"
link_url   = card.find("a").get("href")     # "/products/wireless-headphones"
img_src    = card.find("img").get("src")    # "/images/headphones.jpg"
img_alt    = card.find("img").get("alt")    # "Wireless Headphones"

# Get all attributes as a dict
print(card.attrs)
# {'class': ['product-card'], 'data-product-id': 'PROD_001'}
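Two details of attribute access are worth trying out on a tiny fragment: class is returned as a list because it can hold several values, and .get() returns None (or a default you supply) for a missing attribute where square-bracket access raises KeyError. The snippet and the data-sku attribute below are made up for illustration, and the built-in "html.parser" backend is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

snippet = '<p class="star-rating Three" data-sku="A1">Rated</p>'
tag = BeautifulSoup(snippet, "html.parser").p

classes = tag.get("class")        # multi-valued attribute, returned as a list
sku = tag.get("data-sku")         # single-valued attribute, returned as a string
missing = tag.get("href")         # attribute absent: None, no exception
fallback = tag.get("href", "#")   # attribute absent: the supplied default

print(classes, sku, missing, fallback)
# ['star-rating', 'Three'] A1 None #

try:
    tag["href"]                   # square brackets raise for a missing attribute
except KeyError:
    print("no href on this tag")
```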

CSS Selectors: A Powerful Alternative to find()

BeautifulSoup also supports CSS selector syntax via .select() and .select_one(). CSS selectors are often more concise than find() for complex targeting:

Python
# CSS selector reference:
# "tag"              → Select by tag name
# ".classname"       → Select by class
# "#id"              → Select by ID
# "tag.class"        → Tag with specific class
# "ancestor descendant" → Descendant at any depth inside ancestor
# "parent > child"   → Direct child only
# "[attr]"           → Has attribute
# "[attr=value]"     → Attribute equals value
# "tag:nth-child(n)" → nth child of parent

# select_one(): equivalent to find() — returns first match or None
product_name = soup.select_one(".product-card .product-name")
print(product_name.get_text(strip=True))  # "Wireless Headphones"

# select(): equivalent to find_all() — returns list of all matches
all_prices = soup.select(".product-card .price")
prices = [p.get_text(strip=True) for p in all_prices]
print(prices)  # ['$149.99', '$49.99']

# More complex selectors
# Get the href from all "details" links inside product cards
links = soup.select(".product-card a.details-link")
hrefs = [link.get("href") for link in links]

# Get all product IDs from data attributes
cards = soup.select("div[data-product-id]")
ids = [card.get("data-product-id") for card in cards]

CSS selectors are especially useful when you’re working from the patterns you see in browser DevTools, since DevTools often lets you copy an element’s CSS selector directly.
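The difference between the descendant combinator (a space) and the child combinator (>) is easy to get wrong, so here is a small sketch on a nested list. The menu markup is invented for this example and parsed with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

menu_html = """
<ul id="menu">
  <li class="item">Home</li>
  <li class="item">Products
    <ul>
      <li class="item">Books</li>
    </ul>
  </li>
</ul>
"""
soup = BeautifulSoup(menu_html, "html.parser")

# Space: every <li> at any depth below #menu, including the nested one
descendants = soup.select("#menu li")

# ">": only the <li> elements that are direct children of #menu
direct = soup.select("#menu > li")

# :nth-child counts element children, so this is the "Products" item
second = soup.select_one("#menu > li:nth-child(2)")

# find(string=True) returns the first text node inside each element
names = [li.find(string=True).strip() for li in direct]
second_label = second.find(string=True).strip()

print(len(descendants), len(direct), names, second_label)
# 3 2 ['Home', 'Products'] Products
```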

A Complete Scraping Workflow

Let’s walk through a complete, realistic scraping project against a public website. We’ll use books.toscrape.com — a site specifically designed for practicing web scraping.

Step 1: Inspect the Target Page

Before writing code, use browser DevTools to understand the HTML structure of http://books.toscrape.com/. Inspection reveals:

  • Each book is a <li> with class="col-xs-6 col-sm-4 col-md-3 col-lg-3"
  • Inside each <li>, the book info is in an <article class="product_pod">
  • Title is in <h3><a title="Full Title">Short Title</a></h3> — the full title is in the title attribute
  • Price is in <p class="price_color">£12.99</p>
  • Rating is in <p class="star-rating Three"> — the class encodes the rating word
  • Stock is in <p class="instock availability">
  • Pagination: next page link is <li class="next"><a href="catalogue/page-2.html">
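Before sending any requests, the selectors implied by these observations can be checked against a hand-written fragment that mirrors the structure. The fixture below is a simplified approximation typed in for this purpose, not a copy of the live page, and it is parsed with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one book entry plus the pagination link
FIXTURE = """
<ol class="row">
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
      <p class="star-rating Three"></p>
      <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
             title="A Light in the Attic">A Light in the ...</a></h3>
      <p class="price_color">£51.77</p>
      <p class="instock availability">In stock</p>
    </article>
  </li>
</ol>
<ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>
"""

soup = BeautifulSoup(FIXTURE, "html.parser")
pod = soup.select_one("article.product_pod")

title = pod.select_one("h3 a")["title"]                      # full title lives in the attribute
price = pod.select_one("p.price_color").get_text(strip=True)
rating_classes = pod.select_one("p.star-rating")["class"]   # ['star-rating', 'Three']
rating_word = [c for c in rating_classes if c != "star-rating"][0]
in_stock = "In stock" in pod.select_one("p.availability").get_text()
next_href = soup.select_one("li.next a")["href"]

print(title, price, rating_word, in_stock, next_href)
```

If every selector returns what you expect on the fixture, the same selectors are a reasonable starting point for the live page.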

Step 2: Scrape a Single Page

Python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re

BASE_URL = "http://books.toscrape.com"

HEADERS = {
    "User-Agent": "DataScienceStudy/1.0 (learning web scraping)",
    "Accept": "text/html"
}

RATING_MAP = {
    "One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5
}

def scrape_books_page(url: str) -> list[dict]:
    """
    Scrape all books from a single page of books.toscrape.com.
    
    Parameters
    ----------
    url : str
        Full URL of the page to scrape.
    
    Returns
    -------
    list of dict
        One dict per book with keys: title, price, rating, in_stock, url.
    """
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    books = []

    for article in soup.select("article.product_pod"):
        # Title: in the 'title' attribute of the <a> inside <h3>
        title_tag = article.select_one("h3 a")
        title = title_tag.get("title", "").strip() if title_tag else None

        # Price: strip the £ symbol and convert to float
        price_tag = article.select_one("p.price_color")
        price_text = price_tag.get_text(strip=True) if price_tag else ""
        price = float(re.sub(r"[^\d.]", "", price_text)) if price_text else None

        # Rating: encoded in the class name of the <p> element
        # e.g. class="star-rating Three" → 3
        rating_tag = article.select_one("p.star-rating")
        rating_class = rating_tag.get("class", []) if rating_tag else []
        rating_word = next(
            (cls for cls in rating_class if cls != "star-rating"), None
        )
        rating = RATING_MAP.get(rating_word)

        # Availability
        avail_tag = article.select_one("p.availability")
        in_stock = "In stock" in avail_tag.get_text() if avail_tag else None

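        # Detail page URL
        detail_href = title_tag.get("href", "") if title_tag else ""
        # The href is relative to the current page: on the home page it looks like
        # "catalogue/a-light-in-the-attic_1000/index.html", on catalogue pages like
        # "a-light-in-the-attic_1000/index.html". Normalize both to one absolute URL.
        detail_path = detail_href.replace("../", "").removeprefix("catalogue/")
        detail_url = BASE_URL + "/catalogue/" + detail_path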

        books.append({
            "title":    title,
            "price_gbp": price,
            "rating":   rating,
            "in_stock": in_stock,
            "url":      detail_url
        })

    return books
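The price-cleaning step is a good candidate for a small, separately testable helper. The sketch below assumes prices look like a currency symbol followed by a decimal number; parse_price is a hypothetical function, not part of the scraper above. Unlike stripping non-digits and calling float() directly, it returns None for text that contains no number at all, and it tolerates a mis-decoded prefix such as "Â£".

```python
import re
from typing import Optional

# One or more digits, optionally followed by a decimal part
_PRICE_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text: str) -> Optional[float]:
    """Return the first decimal number found in text, or None if there is none."""
    match = _PRICE_RE.search(text.replace(",", ""))  # drop thousands separators
    return float(match.group()) if match else None


print(parse_price("£51.77"))      # 51.77
print(parse_price("Â£12.99"))     # 12.99
print(parse_price("$1,299.00"))   # 1299.0
print(parse_price("Free"))        # None
```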

Step 3: Handle Pagination

Python
def get_next_page_url(soup: BeautifulSoup, current_url: str) -> str | None:
    """
    Extract the URL of the next page from pagination, or None if last page.
    """
    next_li = soup.select_one("li.next a")
    if not next_li:
        return None

    next_href = next_li.get("href", "")

    # Build absolute URL from relative href
    if current_url.endswith("/"):
        return current_url + next_href
    else:
        # Replace last path segment with next page href
        base = current_url.rsplit("/", 1)[0]
        return base + "/" + next_href


def scrape_all_books(start_url: str, max_pages: int = None, delay: float = 1.0) -> pd.DataFrame:
    """
    Scrape all books from books.toscrape.com across all pages.

    Parameters
    ----------
    start_url : str
        URL of the first page to scrape.
    max_pages : int, optional
        Maximum pages to scrape. If None, scrapes all pages.
    delay : float, optional
        Seconds to wait between page requests. By default 1.0.

    Returns
    -------
    pd.DataFrame
        All books across all pages.
    """
    all_books = []
    current_url = start_url
    page_num = 0

    while current_url:
        page_num += 1
        print(f"Scraping page {page_num}: {current_url}")

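        # Fetch the page here so its pagination links can be read below.
        # Note: scrape_books_page() requests the same URL again, so each page is
        # downloaded twice. For a larger crawl, refactor the parser to accept the
        # already-parsed soup instead of a URL.
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")

        page_books = scrape_books_page(current_url)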
        all_books.extend(page_books)
        print(f"  Found {len(page_books)} books (total: {len(all_books)})")

        # Check for next page
        next_url = get_next_page_url(soup, current_url)

        if max_pages and page_num >= max_pages:
            print(f"Reached max_pages limit ({max_pages})")
            break

        current_url = next_url

        if current_url:
            time.sleep(delay)  # Be respectful — pause between requests

    df = pd.DataFrame(all_books)
    print(f"\nScraping complete. Total books: {len(df)}")
    return df


# Scrape the first 3 pages as a demonstration
df = scrape_all_books(
    start_url="http://books.toscrape.com/",
    max_pages=3,
    delay=1.0
)

print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"Average price: £{df['price_gbp'].mean():.2f}")
print(f"Rating distribution:\n{df['rating'].value_counts().sort_index()}")

Step 4: Clean the Scraped Data

Raw scraped data always needs cleaning:

Python
def clean_books_df(df: pd.DataFrame) -> pd.DataFrame:
    """Clean and type-cast the scraped books DataFrame."""
    df = df.copy()

    # Drop rows where title is None (scraping errors)
    df = df.dropna(subset=["title"])

    # Rating as integer (already numeric, but ensure type)
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce").astype("Int64")

    # Price as float (already done during scraping, but ensure)
    df["price_gbp"] = pd.to_numeric(df["price_gbp"], errors="coerce")

    # in_stock as boolean
    df["in_stock"] = df["in_stock"].astype(bool)

    # Add price tiers
    df["price_tier"] = pd.cut(
        df["price_gbp"],
        bins=[0, 10, 25, 50, float("inf")],
        labels=["Budget", "Mid", "Premium", "Luxury"]
    )

    # Remove duplicates
    df = df.drop_duplicates(subset=["title"])

    return df.reset_index(drop=True)

df_clean = clean_books_df(df)
print(df_clean.dtypes)
print(df_clean.describe())

Scraping Detail Pages: Going Deeper

Often the list page gives you summary data but you need detail pages for the full information. This is the “spider” pattern — collecting URLs from list pages, then visiting each one:

Python
def scrape_book_detail(url: str) -> dict:
    """
    Scrape detailed information from a single book's detail page.
    Returns additional fields: description, genre, UPC.
    """
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    # Description
    description_tag = soup.select_one("#product_description ~ p")
    description = description_tag.get_text(strip=True) if description_tag else None

    # Product info table (UPC, price, availability, etc.)
    product_info = {}
    for row in soup.select("table.table-striped tr"):
        # Each row pairs a <th> label with a <td> value
        header = row.find("th")
        cells = row.find_all("td")
        if header and cells:
            key = header.get_text(strip=True)
            value = cells[-1].get_text(strip=True)
            product_info[key] = value

    # Breadcrumb: genre (second-to-last crumb)
    crumbs = soup.select("ul.breadcrumb li")
    genre = crumbs[-2].get_text(strip=True) if len(crumbs) >= 2 else None

    return {
        "description": description,
        "upc":         product_info.get("UPC"),
        "genre":       genre,
        "num_reviews": int(product_info.get("Number of reviews", 0) or 0),
    }


def enrich_with_details(df: pd.DataFrame, delay: float = 0.5,
                        max_books: int = None) -> pd.DataFrame:
    """
    Add detail page fields to the books DataFrame.
    Visits each book's detail URL to fetch additional data.
    """
    df = df.copy()
    details_list = []
    books_to_process = df.head(max_books) if max_books else df

    for i, (_, row) in enumerate(books_to_process.iterrows()):
        if i % 10 == 0:
            print(f"  Fetching details: {i}/{len(books_to_process)}")

        try:
            details = scrape_book_detail(row["url"])
        except Exception as e:
            print(f"  Error fetching {row['url']}: {e}")
            details = {"description": None, "upc": None,
                       "genre": None, "num_reviews": None}

        details_list.append(details)
        time.sleep(delay)

    details_df = pd.DataFrame(details_list, index=books_to_process.index)
    return df.join(details_df)


# Enrich first 20 books with detail page data
df_enriched = enrich_with_details(df_clean, delay=0.5, max_books=20)
print(df_enriched[["title", "genre", "num_reviews", "description"]].head())

Handling Common Scraping Challenges

Challenge 1: Inconsistent HTML Structure

Real websites are inconsistent — some pages have fields others don’t. Always use .find() with a fallback:

Python
# FRAGILE: AttributeError if the tag is missing (find() returns None)
price = soup.find("span", class_="price").text  # Crashes if None

# ROBUST: Returns None if any step fails
price_tag = soup.find("span", class_="price")
price = price_tag.get_text(strip=True) if price_tag else None

# EVEN SAFER for deeply nested access:
def safe_text(soup_obj, selector: str, attr: str = None) -> str | None:
    """Safely extract text or attribute from a CSS selector."""
    elem = soup_obj.select_one(selector)
    if elem is None:
        return None
    if attr:
        return elem.get(attr)
    return elem.get_text(strip=True)

name  = safe_text(card, "h2.product-name")
price = safe_text(card, "span.price")
link  = safe_text(card, "a.details-link", attr="href")

Challenge 2: Anti-Scraping Measures

Some sites detect and block automated scrapers. Common countermeasures and how to handle them:

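User-Agent detection: Some sites reject requests that carry the default python-requests User-Agent string. Sending browser-like headers usually resolves this, though it trades away the self-identification recommended earlier, so reserve it for sites whose terms permit scraping: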

Python
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/119.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "keep-alive",
}

Rate-based blocking: Too many requests too fast triggers blocking. Add jitter to delays:

Python
import random
import time

def respectful_sleep(min_s: float = 1.0, max_s: float = 3.0):
    """Sleep for a random duration to appear more human-like."""
    time.sleep(random.uniform(min_s, max_s))

# Between requests
respectful_sleep(1.0, 2.5)
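When a server keeps signalling overload, for example with HTTP 429 responses, a common complement to jittered delays is to lengthen the wait after each failed attempt. The following is a sketch of one such scheme (exponential growth with a random delay below each ceiling); backoff_delays and its parameters are hypothetical names chosen for this illustration.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay per retry: exponential ceilings with random jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)   # 1, 2, 4, 8, ... up to cap
        delays.append(rng.uniform(0, ceiling))    # random point below the ceiling
    return delays


# A seeded generator makes the sequence repeatable for testing
delays = backoff_delays(5, rng=random.Random(42))
print([round(d, 2) for d in delays])
```

In a retry loop, each delay would be passed to time.sleep() before the next attempt, and the loop would give up once the list is exhausted.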

Session-based cookies: Some sites require a valid session. Use requests.Session() to persist cookies across requests:

Python
session = requests.Session()
session.headers.update(HEADERS)

# Visit the homepage first to establish a session
session.get("https://example.com/", timeout=10)

# Now scrape — session cookies are sent automatically
response = session.get("https://example.com/products/", timeout=10)

Challenge 3: Relative vs. Absolute URLs

Links on web pages are often relative. Convert them before storing or following:

Python
from urllib.parse import urljoin

base_url = "http://books.toscrape.com/catalogue/"
relative_href = "../a-light-in-the-attic_1000/index.html"

# urljoin handles relative paths correctly
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
# http://books.toscrape.com/a-light-in-the-attic_1000/index.html
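It helps to see how urljoin treats the kinds of href a scraper typically meets. The sketch below resolves several forms against page URLs; the paths are illustrative and are not guaranteed to exist on the site.

```python
from urllib.parse import urljoin

page = "http://books.toscrape.com/catalogue/page-2.html"

cases = [
    ("http://books.toscrape.com/", "catalogue/page-2.html"),  # relative to the site root page
    (page, "page-3.html"),             # sibling document in the same directory
    (page, "../index.html"),           # one directory up
    (page, "/static/img.png"),         # root-relative path
    (page, "https://example.com/x"),   # already absolute: returned unchanged
]

resolved = [urljoin(base, href) for base, href in cases]
for url in resolved:
    print(url)
# http://books.toscrape.com/catalogue/page-2.html
# http://books.toscrape.com/catalogue/page-3.html
# http://books.toscrape.com/index.html
# http://books.toscrape.com/static/img.png
# https://example.com/x

# A missing trailing slash changes the result: "catalogue" is treated as a
# file name and replaced rather than extended
no_slash = urljoin("http://books.toscrape.com/catalogue", "page-2.html")
print(no_slash)
# http://books.toscrape.com/page-2.html
```

Passing the URL of the page on which the link was found as the base is usually the safest choice, since that is how a browser resolves the link.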

Challenge 4: Encoding Issues

Web pages sometimes have encoding problems. Explicitly set the encoding:

Python
response = requests.get(url, headers=HEADERS, timeout=10)

# Check detected encoding
print(response.encoding)     # e.g., 'utf-8', 'ISO-8859-1'

# If wrong, set it explicitly
response.encoding = "utf-8"
soup = BeautifulSoup(response.text, "lxml")

# Or use response.content (bytes) and let BeautifulSoup detect encoding
soup = BeautifulSoup(response.content, "lxml")
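The typical symptom of a wrong encoding is easy to reproduce without a network call. The sketch below encodes a price as UTF-8 and then decodes the bytes as ISO-8859-1, the encoding requests falls back to for text responses whose headers declare no charset.

```python
# The bytes a UTF-8 page would send for this price
raw = "£51.77".encode("utf-8")

wrong = raw.decode("iso-8859-1")   # decoded with the wrong codec
right = raw.decode("utf-8")        # decoded with the correct codec

print(wrong)   # Â£51.77
print(right)   # £51.77

# Text that was mis-decoded this way can be repaired by reversing the step
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(repaired == right)   # True
```

A stray "Â" in front of currency symbols or accented letters in scraped text is therefore a strong hint that the response encoding needs to be set explicitly.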

Dynamic Content: When Requests + BeautifulSoup Isn’t Enough

A growing fraction of modern websites render their content with JavaScript — the HTML returned by the server is essentially empty, and the actual data is loaded dynamically by JavaScript running in the browser. BeautifulSoup can only parse what the server sends; it can’t execute JavaScript.

Signs that a page uses dynamic content:

  • requests.get() returns a page with no data but you can see data in your browser
  • The HTML contains <div id="app"></div> or similar empty container divs
  • Browser DevTools Network tab shows XHR/Fetch requests to separate API endpoints after page load

Solution 1: Find the Underlying API (Preferred)

Often, the JavaScript making the page dynamic is fetching data from a JSON API. Use browser DevTools Network tab to find it:

  1. Open DevTools → Network tab → filter to “XHR” or “Fetch”
  2. Reload the page
  3. Look for requests returning JSON data
  4. Copy the request URL and headers — call it directly with requests

This approach is faster, more reliable, and more respectful than browser automation.
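Once such an endpoint is found, the work reduces to reshaping JSON into rows. The sketch below uses an invented payload: the field names (items, price.amount, and so on) and the response shape are assumptions for illustration, and a real API will differ.

```python
import json

# Invented example of what a product-listing endpoint might return
SAMPLE_RESPONSE = """
{"page": 1, "total_pages": 3,
 "items": [
   {"id": "PROD_001", "name": "Wireless Headphones",
    "price": {"amount": 149.99, "currency": "USD"}},
   {"id": "PROD_002", "name": "USB-C Hub",
    "price": {"amount": 49.99, "currency": "USD"}}
 ]}
"""


def records_from_payload(payload: dict) -> list:
    """Flatten the nested items of one API response into flat dicts."""
    return [
        {
            "id": item.get("id"),
            "name": item.get("name"),
            "price": item.get("price", {}).get("amount"),
            "currency": item.get("price", {}).get("currency"),
        }
        for item in payload.get("items", [])
    ]


payload = json.loads(SAMPLE_RESPONSE)
records = records_from_payload(payload)
print(records)

# Against a live endpoint the payload would come from something like:
# payload = requests.get(api_url, headers=HEADERS, timeout=10).json()
```

The resulting list of dicts can be passed straight to pd.DataFrame(), and a field such as total_pages, where present, tells the loop when to stop paginating.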

Solution 2: Playwright (Browser Automation)

When the underlying API can’t be found or is too complex to reproduce, Playwright automates a real browser to load the page, execute JavaScript, and expose the fully rendered HTML:

Bash
pip install playwright
playwright install chromium
Python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import pandas as pd

def scrape_dynamic_page(url: str) -> pd.DataFrame:
    """
    Scrape a JavaScript-rendered page using Playwright.
    Use only when requests + BeautifulSoup doesn't work.
    Much slower than direct HTTP requests.
    """
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless=False to see the browser
        page = browser.new_page()

        # Navigate to the page
        page.goto(url, wait_until="networkidle")

        # Wait for a specific element to confirm content is loaded
        page.wait_for_selector(".product-card", timeout=10000)

        # Get the fully rendered HTML
        html = page.content()
        browser.close()

    # Now parse with BeautifulSoup as usual
    soup = BeautifulSoup(html, "lxml")
    products = []
    for card in soup.select(".product-card"):
        products.append({
            "name":  safe_text(card, ".product-name"),
            "price": safe_text(card, ".price")
        })

    return pd.DataFrame(products)

Playwright can also interact with pages — clicking buttons, filling forms, scrolling — enabling scraping of content that requires user interaction to reveal.

Saving and Resuming: Building a Robust Scraper

For large scraping jobs that could fail partway through, save progress as you go:

Python
import json
import os
from pathlib import Path

CHECKPOINT_FILE = ".scrape_checkpoint.json"
OUTPUT_FILE = "data/raw/books_scraped.csv"

def load_checkpoint() -> dict:
    """Load saved scraping progress."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"last_url": None, "pages_scraped": 0, "total_books": 0}

def save_checkpoint(next_url: str, pages: int, books: int):
    """Save current scraping progress."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({
            "last_url": next_url,
            "pages_scraped": pages,
            "total_books": books
        }, f)

def scrape_with_checkpointing(start_url: str, delay: float = 1.0) -> pd.DataFrame:
    """
    Scrape all pages with checkpoint-based resume capability.
    If interrupted, restart from where it left off.
    """
    checkpoint = load_checkpoint()
    current_url = checkpoint["last_url"] or start_url
    pages_done = checkpoint["pages_scraped"]

    # Load previously scraped data if any
    all_books = []
    if os.path.exists(OUTPUT_FILE) and pages_done > 0:
        all_books = pd.read_csv(OUTPUT_FILE).to_dict("records")
        print(f"Resuming from page {pages_done + 1} ({len(all_books)} books already scraped)")

    while current_url:
        response = requests.get(current_url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")

        page_books = scrape_books_page(current_url)
        all_books.extend(page_books)
        pages_done += 1

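        next_url = get_next_page_url(soup, current_url)

        # Write the data before the checkpoint: if the checkpoint ran ahead of
        # the CSV, a resumed run would skip pages whose rows were never saved
        Path(OUTPUT_FILE).parent.mkdir(parents=True, exist_ok=True)
        pd.DataFrame(all_books).to_csv(OUTPUT_FILE, index=False)
        save_checkpoint(next_url, pages_done, len(all_books))
        if pages_done % 5 == 0:
            print(f"Page {pages_done}: {len(all_books)} books scraped and saved")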

        current_url = next_url
        if current_url:
            time.sleep(delay)

    # Final save
    df = pd.DataFrame(all_books)
    df.to_csv(OUTPUT_FILE, index=False)

    # Clean up checkpoint on successful completion
    if os.path.exists(CHECKPOINT_FILE):
        os.remove(CHECKPOINT_FILE)

    print(f"Complete: {len(df)} books saved to {OUTPUT_FILE}")
    return df
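A checkpoint is only useful if it survives the crash it is meant to guard against. If the process dies halfway through writing the JSON file, the next run finds a truncated checkpoint. One common remedy, sketched below under the assumption that the temporary file and the target live on the same filesystem, is to write to a temporary file and then rename it over the target; save_checkpoint_atomic is a hypothetical variant of the function above.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(path: str, state: dict) -> None:
    """Write state as JSON so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # swaps the file in a single step
    except BaseException:
        # Do not leave a stray temp file behind if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling save_checkpoint_atomic(CHECKPOINT_FILE, {"last_url": next_url, "pages_scraped": pages, "total_books": books}) would then stand in for the plain write.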

Complete Example: Scraping a Data Table

Many government and academic websites present data in HTML tables — straightforward to scrape:

Python
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_html_table(url: str, table_index: int = 0,
                      headers: dict = None) -> pd.DataFrame:
    """
    Extract an HTML table from a webpage into a DataFrame.

    Parameters
    ----------
    url : str
        URL of the page containing the table.
    table_index : int
        Index of the table to extract (0 = first table). By default 0.
    headers : dict, optional
        Request headers.

    Returns
    -------
    pd.DataFrame
        The table contents as a DataFrame.
    """
    response = requests.get(url, headers=headers or HEADERS, timeout=15)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")

    tables = soup.find_all("table")
    if not tables:
        raise ValueError("No tables found on the page")
    if table_index >= len(tables):
        raise ValueError(f"Table index {table_index} out of range (found {len(tables)} tables)")

    table = tables[table_index]

    # Extract column names
    thead = table.find("thead")
    header_tr = None
    if thead:
        col_names = [th.get_text(strip=True) for th in thead.find_all("th")]
    else:
        # No <thead>: infer column names from the first row
        header_tr = table.find("tr")
        header_cells = header_tr.find_all(["th", "td"]) if header_tr else []
        col_names = [cell.get_text(strip=True) for cell in header_cells]

    # Extract data rows
    rows = []
    tbody = table.find("tbody") or table
    for tr in tbody.find_all("tr"):
        if tr is header_tr:
            continue  # This row was already used for the column names
        cells = tr.find_all("td")
        if cells:
            rows.append([cell.get_text(strip=True) for cell in cells])

    # Trim column names to the row width; rows are assumed to have equal length
    df = pd.DataFrame(rows, columns=col_names[:len(rows[0])] if rows else col_names)
    return df

# pandas can often do this in one line for simple tables:
# dfs = pd.read_html(url)  # Returns list of all tables on page as DataFrames
# df = dfs[0]              # First table

Note: pd.read_html() is a powerful shortcut for HTML tables. By default it tries lxml first and falls back to BeautifulSoup with html5lib, and it always returns a list of DataFrames, one per table found. Use BeautifulSoup directly when you need more control or when the table has complex structure.
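
As a minimal sketch of that shortcut, assuming pandas and lxml are installed: the HTML string, the table ids, and the column names below are made up for illustration. Wrapping the string in io.StringIO avoids the deprecation warning that recent pandas versions emit for literal HTML, and the match argument keeps only tables whose text matches the given string or regex.

```python
import io

import pandas as pd

# Hypothetical page content with two tables (illustration only)
SAMPLE_HTML = """
<table id="pop">
  <thead><tr><th>Region</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>North</td><td>1,200</td></tr>
    <tr><td>South</td><td>950</td></tr>
  </tbody>
</table>
<table id="codes">
  <tr><th>Code</th><th>Label</th></tr>
  <tr><td>A</td><td>Alpha</td></tr>
</table>
"""

# Every <table> on the page comes back as its own DataFrame
all_tables = pd.read_html(io.StringIO(SAMPLE_HTML))

# Keep only tables containing the text "Population"
population_tables = pd.read_html(io.StringIO(SAMPLE_HTML), match="Population")
pop = population_tables[0]

# The default thousands="," setting turns "1,200" into the number 1200
print(pop)
```

Selecting by match (or by attrs={"id": "pop"}) tends to be less brittle than a positional index such as dfs[0], since page layouts often gain or lose tables over time.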

Structuring a Scraping Project

For serious scraping projects, organize your code properly:

Plaintext
scraping-project/
├── scrapers/
│   ├── __init__.py
│   ├── books_scraper.py      ← Main scraping logic
│   ├── detail_scraper.py     ← Detail page scraping
│   └── utils.py              ← Shared helpers (safe_text, urljoin, etc.)
├── data/
│   ├── raw/                  ← Raw scraped data (gitignored)
│   └── processed/            ← Cleaned data
├── notebooks/
│   └── 01_exploration.ipynb  ← Analysis of scraped data
├── tests/
│   └── test_scrapers.py      ← Tests against local HTML fixtures
├── .scrape_checkpoint.json   ← Auto-generated, gitignored
├── requirements.txt
└── README.md                 ← Documents what is scraped and why

Testing scrapers against local HTML files saved during development keeps the tests fast, offline, and repeatable, and lets you catch parsing regressions without hitting the live site:

Python
# tests/test_scrapers.py
# scrape_books_from_html is a variant of scrape_books_page that takes HTML directly
from scrapers.books_scraper import scrape_books_from_html

def test_scrape_books_page():
    """Test book scraping against a saved local HTML fixture."""
    with open("tests/fixtures/books_page_1.html", encoding="utf-8") as f:
        html = f.read()

    # Parse the saved HTML directly, so no network request is made
    # (alternatively, mock requests with the requests-mock library)
    books = scrape_books_from_html(html)

    assert len(books) == 20             # One page has 20 books
    assert all("title" in b for b in books)
    assert all("price_gbp" in b for b in books)
    assert all(isinstance(b["price_gbp"], float) for b in books)
    assert books[0]["rating"] in range(1, 6)
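
The test above relies on a scrape_books_from_html function that takes HTML directly instead of fetching a URL. One way such a function might look is sketched below. The CSS selectors (article.product_pod, p.price_color, p.star-rating) and the inline fixture are assumptions about the listing markup, not the article's confirmed implementation, and the stdlib html.parser backend is used so the sketch needs only BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Maps the word used in the star-rating CSS class to an integer (assumed markup)
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def scrape_books_from_html(html: str) -> list:
    """Parse one listing page's HTML into a list of book dicts (no network access)."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price_tag = card.select_one("p.price_color")
        rating_tag = card.select_one("p.star-rating")
        if link is None or price_tag is None:
            continue  # Skip malformed cards rather than crash

        # Price text looks like "£51.77": keep digits and the decimal point only
        raw_price = price_tag.get_text(strip=True)
        price_text = "".join(ch for ch in raw_price if ch.isdigit() or ch == ".")

        # The rating is encoded as a second CSS class, e.g. class="star-rating Three"
        rating_classes = rating_tag.get("class", []) if rating_tag else []
        rating = next((RATING_WORDS[c] for c in rating_classes if c in RATING_WORDS), None)

        books.append({
            "title": link.get("title") or link.get_text(strip=True),
            "price_gbp": float(price_text) if price_text else None,
            "rating": rating,
        })
    return books


# Tiny inline fixture standing in for tests/fixtures/books_page_1.html
FIXTURE = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

books = scrape_books_from_html(FIXTURE)
print(books)
```

Keeping parsing separate from fetching means the network-facing function can shrink to a thin wrapper that downloads the page and passes response.text to the parser, which is the only part the fixture tests need to exercise.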

Summary

Web scraping is an indispensable data collection technique for data scientists, providing access to a vast world of publicly available information that has no official API. The core workflow is consistent: inspect the HTML structure with browser DevTools, fetch the page with requests, parse it with BeautifulSoup, extract data by targeting elements with CSS selectors or tag/attribute searches, handle pagination to collect all records, and clean the results into analysis-ready DataFrames.

The distinguishing marks of a professional scraper versus a fragile one are robustness and responsibility. Robust code uses .get() with fallbacks at every extraction step, validates and cleans data as it’s collected, handles pagination and errors gracefully, and supports checkpointing for long runs. Responsible scrapers check robots.txt, read terms of service, add delays between requests, use descriptive User-Agents, and prefer APIs when they exist.
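
Two of the responsible habits listed above, checking robots.txt and spacing out requests, can be sketched with the standard library alone. The user-agent string, the sample robots.txt content, and the helper names build_robot_parser and polite_delay are all hypothetical, chosen for illustration. The rules are parsed from a string here so the example runs without network access.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical descriptive User-Agent; replace with your own project and contact
USER_AGENT = "example-research-bot/0.1 (contact: you@example.com)"

# Made-up robots.txt content, standing in for a downloaded file
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    return robots


def polite_delay(robots: RobotFileParser, low: float = 1.0, high: float = 3.0) -> float:
    """Return a jittered delay in seconds, never shorter than the site's Crawl-delay."""
    jitter = random.uniform(low, high)
    crawl_delay = robots.crawl_delay(USER_AGENT) or 0  # None when no Crawl-delay is set
    return max(jitter, float(crawl_delay))


robots = build_robot_parser(SAMPLE_ROBOTS)
allowed = robots.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = robots.can_fetch(USER_AGENT, "https://example.com/private/report.html")
delay = polite_delay(robots)

print(allowed, blocked, round(delay, 2))
```

Against a live site, calling robots.set_url() with the site's robots.txt URL followed by robots.read() would replace the parse() step, and the returned delay would be passed to time.sleep() between requests.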

The techniques in this article, static HTML parsing with BeautifulSoup and browser automation with Playwright for JavaScript-rendered pages, handle the vast majority of real-world scraping needs. Combined with the API skills from the previous article, they give you a complete toolkit for collecting data from most publicly accessible web sources.

Key Takeaways

  • Always check robots.txt and terms of service before scraping — urllib.robotparser can automate the robots.txt check; ToS must be read manually
  • requests.get() + BeautifulSoup(response.text, "lxml") is the core scraping pattern — fetch the HTML, parse it, then navigate with .find(), .find_all(), or .select() using CSS selectors
  • find() returns the first matching element or None; find_all() returns a list of all matches; always use .get("attr") instead of ["attr"] and check for None before calling .text to avoid crashes on missing elements
  • get_text(strip=True) extracts clean text from an element; .get("href") extracts an attribute value — these two methods cover most extraction needs
  • Pagination requires explicitly following “next page” links in a loop — detect the next URL from the HTML, request it, repeat until no next URL is found
  • Add time.sleep() with random jitter between requests — random.uniform(1.0, 3.0) is a reasonable default for polite scraping
  • When requests + BeautifulSoup returns empty content (JavaScript-rendered pages), first look for the underlying JSON API in browser DevTools Network tab; use Playwright only as a last resort
  • pd.read_html(url) is a one-line shortcut for extracting HTML tables — use BeautifulSoup when you need finer control over extraction, table selection, or data cleaning