Building a crawler to fetch ten static pages takes five minutes. Engineering a distributed system that maps dynamic e-commerce sites without triggering anti-bot bans takes months. Most tutorials ignore the latter, leaving you with brittle scripts that crash on infinite pagination loops.
This guide provides the complete upgrade path. You will learn how to make a web crawler in Python, starting with a solid baseline script. We then cover mandatory production upgrades: URL normalization, robots.txt caching, JavaScript rendering, rate limiting, and horizontal scaling. We also establish the exact inflection point where maintaining infrastructure becomes an economic burden, dictating a switch to a structured web data API.
What is a Python web crawler?
A Python web crawler is an automated script that discovers and maps websites. It starts from a seed URL, fetches HTML content, parses it to extract internal links, filters those links, and maintains a deduplicated queue to systematically navigate the entire site structure.
What a Python Web Crawler Does, and How It Differs From Scraping
Developers often conflate these terms, but they serve different system architectures. A crawler discovers URLs and navigates domains to map site structure. A scraper extracts specific structured data (like a price or author name) from a single page template.
You typically deploy a Python web crawler to handle:
- Documentation indexing for search engines.
- Competitive pricing and catalog monitoring.
- SEO site audits and dead-link detection.
- Product and inventory discovery.
- Lead enrichment pipelines.
- AI and RAG (Retrieval-Augmented Generation) document ingestion.
Build a Simple Python Web Crawler With Requests and BeautifulSoup
A basic crawler requires an HTTP client for fetching, an HTML parser for extraction, a queue for traversal, and a tracking set to prevent infinite loops.
Start with a synchronous, single-machine script to master the fundamental mechanics of discovery and retrieval. Five core components power a same-domain tutorial crawler:
- requests: Handles synchronous HTTP GET requests.
- BeautifulSoup: Parses HTML to extract anchor tags.
- collections.deque: Acts as an efficient, double-ended queue for the crawl frontier.
- urllib.parse: Converts relative links into absolute URLs.
- csv or json: Exports the gathered data.
Python web crawler code
This script starts from one seed URL, respects a depth limit, stays on the original domain, and uses a visited set to avoid fetching the same page twice. It also wraps calls in a requests.Session() to reuse TCP connections, significantly lowering latency.
```python
import csv
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_website(seed_url, max_depth=2):
    # Initialize queue and tracking sets
    queue = deque([(seed_url, 0)])
    visited = {seed_url}
    results = []

    # Initialize robots.txt parser
    rp = urllib.robotparser.RobotFileParser()
    parsed_seed = urlparse(seed_url)
    rp.set_url(f"{parsed_seed.scheme}://{parsed_seed.netloc}/robots.txt")
    try:
        rp.read()
    except Exception:
        pass  # Failsafe if robots.txt is unreachable

    # Use a session to reuse TCP connections
    with requests.Session() as session:
        session.headers.update({"User-Agent": "MyPythonCrawler/1.0"})

        while queue:
            current_url, depth = queue.popleft()

            # Enforce depth limit
            if depth > max_depth:
                continue

            # Check robots.txt permission
            if not rp.can_fetch("*", current_url):
                print(f"Blocked by robots.txt: {current_url}")
                continue

            try:
                response = session.get(current_url, timeout=5)
                if response.status_code != 200:
                    continue

                soup = BeautifulSoup(response.text, 'html.parser')
                title = soup.title.string.strip() if soup.title and soup.title.string else "No Title"

                results.append({
                    "url": current_url,
                    "title": title,
                    "depth": depth,
                    "status": response.status_code
                })
                print(f"Crawled [{depth}]: {current_url}")

                # Stop link extraction if we hit the max depth
                if depth == max_depth:
                    continue

                # Extract and filter links
                for link in soup.find_all('a', href=True):
                    absolute_url = urljoin(current_url, link['href'])
                    parsed_url = urlparse(absolute_url)

                    # Same-domain filtering and duplicate prevention
                    if parsed_url.netloc == parsed_seed.netloc and absolute_url not in visited:
                        visited.add(absolute_url)
                        queue.append((absolute_url, depth + 1))

            except requests.RequestException as e:
                print(f"Failed to fetch {current_url}: {e}")

    return results


# Execute and export
data = crawl_website("https://example.com", max_depth=2)
with open("crawl_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "depth", "status"])
    writer.writeheader()
    writer.writerows(data)
```
Export crawler results to CSV or JSON
Structure your output immediately. Raw HTML quickly consumes storage and remains difficult to query.
JSON Example Object:
```json
{
  "url": "https://example.com/about",
  "title": "About Us - Example",
  "depth": 1,
  "status": 200
}
```
Shortcut Strategy: If a site provides a valid sitemap.xml, parse it directly. A Python sitemap crawler bypasses HTML parsing entirely, saving compute cycles and revealing unlinked orphan pages.
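As a minimal sketch of that shortcut, the snippet below pulls URLs straight out of a sitemap with requests and Python's built-in XML parser. The /sitemap.xml location and the flat urlset structure are assumptions; some sites use a sitemap index that points to child sitemaps, which needs one extra level of recursion.

```python
import requests
import xml.etree.ElementTree as ET

def urls_from_sitemap(sitemap_url):
    # Fetch the sitemap and parse it as XML (assumes a standard urlset sitemap,
    # not a sitemap index file)
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()
    root = ET.fromstring(response.content)

    # Sitemap entries live in the sitemaps.org namespace
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text.strip() for loc in root.findall(".//sm:loc", ns)]

# Hypothetical target; verify where the site actually publishes its sitemap
print(urls_from_sitemap("https://example.com/sitemap.xml"))
```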
How the Crawler Works Internally
Understanding the underlying mechanics transforms a script into a reliable tool.
The crawl queue, frontier, and depth limit
The crawl frontier holds URLs waiting to be fetched. Without strict constraints, a crawler runs indefinitely. Using a deque, the script performs a Breadth-First Search (BFS), processing all links at Depth 1 before moving to Depth 2. The depth limit prevents the crawler from wandering into thousands of dynamically generated archive pages.
Link extraction and same-domain filtering
Raw href attributes rarely contain clean absolute URLs. They often look like /category/shoes. The urllib.parse.urljoin function converts relative paths into absolute links. Same-domain filtering ensures the crawler does not bleed onto external social media profiles or partner sites.
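A quick illustration of both steps, using an example page URL and relative href:

```python
from urllib.parse import urljoin, urlparse

page_url = "https://example.com/category/"
href = "/category/shoes"            # relative path as found in the anchor tag

absolute_url = urljoin(page_url, href)
print(absolute_url)                 # https://example.com/category/shoes

# Same-domain filter: compare the link's netloc against the seed's netloc
print(urlparse(absolute_url).netloc == urlparse(page_url).netloc)  # True
```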
URL normalization and duplicate prevention
Duplicate URLs waste queue space and skew datasets. You must normalize them before checking the visited set.
Before Normalization:
https://example.com/Page?sort=asc#reviews
https://example.com/page/?sort=asc
After Normalization:
https://example.com/page?sort=asc
Strip URL fragments (#), enforce consistent trailing slashes, lowercase hostnames, and sort query parameters to identify mathematically identical pages.
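A minimal normalization helper along those lines is sketched below. The exact rules are a project decision; this version strips fragments, lowercases the scheme and host, trims trailing slashes, and sorts query parameters, while leaving path case untouched.

```python
from urllib.parse import urlparse, urlencode, parse_qsl, urlunparse

def normalize_url(url):
    parts = urlparse(url)
    # Lowercase scheme and host; the path stays case-sensitive on most servers
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Trim the trailing slash so /page/ and /page collapse to one entry
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 match
    query = urlencode(sorted(parse_qsl(parts.query)))
    # Drop the fragment (#reviews) entirely
    return urlunparse((scheme, netloc, path, "", query, ""))

print(normalize_url("https://example.com/Page?sort=asc#reviews"))
# https://example.com/Page?sort=asc
print(normalize_url("https://example.com/page/?sort=asc"))
# https://example.com/page?sort=asc
```

Run every discovered URL through this helper before checking the visited set, so all variants map to a single canonical key.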
Why a Tutorial Crawler Breaks in the Real World
Basic scripts assume the internet is stable. Production crawling requires defensive engineering to handle infinite loops, unrendered JavaScript, and aggressive server throttling.
The gap between a tutorial and a production system exposes itself rapidly at scale.
Duplicate URL variants and crawl traps
Certain site architectures mathematically guarantee infinite crawling if unchecked. A blog with a "Next Month" link generates an endless loop of empty calendar URLs. E-commerce faceted navigation (sorting by size, color, brand, and price) creates millions of unique parameter combinations for the same fifty products.
Incomplete or inconsistent data
Data output degrades quickly due to server friction and client-side rendering.
| Symptom | Likely Cause | Fix |
|---|---|---|
| Empty HTML / Missing text | JavaScript-rendered content | Inspect XHR/Fetch network calls to find the JSON API |
| 403 / 429 Status Codes | Anti-bot blocking or rate limits | Implement adaptive throttling and residential proxies |
| NoneType errors | HTML selector drift | Add schema validation and fallback selectors |
Make the Crawler Polite and Reliable
Scaling up requires shifting from brute-force requests to defensive engineering. Networks fail. Servers drop connections. Your crawler must handle this autonomously.
Handle robots.txt in Python
robots.txt provides policy input, not complete permission. Check it before fetching. The standard RobotFileParser caches host rules via read() and evaluates URL eligibility using can_fetch(). Cache these rules at the host level rather than re-fetching the file for every new URL.
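A minimal sketch of that host-level caching, keeping one RobotFileParser per host in a dict so robots.txt is fetched once per domain rather than once per URL:

```python
import urllib.robotparser
from urllib.parse import urlparse

_robots_cache = {}  # host -> RobotFileParser

def allowed(url, user_agent="MyPythonCrawler/1.0"):
    host = urlparse(url).netloc
    # Fetch and parse robots.txt only for the first URL seen on this host
    if host not in _robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except Exception:
            # A 404 is already treated as allow-all inside read();
            # a network failure leaves the parser in a conservative state
            pass
        _robots_cache[host] = rp
    return _robots_cache[host].can_fetch(user_agent, url)

print(allowed("https://example.com/about"))
```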
Rate limiting without relying on static sleep
A fixed time.sleep(2) loop slows extraction needlessly on fast servers while still triggering bans on sensitive ones. Production crawlers use per-host concurrency limits, request timeouts, and adaptive throttling. Libraries like Scrapy automatically calculate latency-based delays to maintain politeness without sacrificing total throughput.
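As a rough sketch of per-host politeness, the class below tracks when each host may next be hit and backs off further after a 429. The one-second base delay and backoff factor are illustrative defaults, not tuned values.

```python
import time
from urllib.parse import urlparse

class HostThrottle:
    """Track per-host request timing and enforce a minimum delay per host."""

    def __init__(self, base_delay=1.0):
        self.base_delay = base_delay
        self.next_allowed = {}  # host -> earliest timestamp for the next request

    def wait(self, url):
        # Sleep only if this host was hit within the last base_delay seconds
        host = urlparse(url).netloc
        ready_at = self.next_allowed.get(host, 0.0)
        now = time.monotonic()
        if now < ready_at:
            time.sleep(ready_at - now)
        self.next_allowed[host] = time.monotonic() + self.base_delay

    def slow_down(self, url, factor=2.0):
        # Call after a 429: push the host's next allowed time further out
        host = urlparse(url).netloc
        self.next_allowed[host] = time.monotonic() + self.base_delay * factor

throttle = HostThrottle(base_delay=1.0)
throttle.wait("https://example.com/page-1")
```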
Retries, headers, and error handling
| Status Code | Action | Logic |
|---|---|---|
| 500 / 502 / 503 | Retry with backoff | Server error. Wait and retry up to 3 times. |
| 429 | Pause and backoff | Rate limit hit. Respect Retry-After headers if present. |
| 403 / 404 | Drop and log | Access denied or missing. Do not retry. |
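A hedged sketch of that policy with requests; the retry count, backoff times, and timeout are illustrative defaults rather than recommendations for any specific site.

```python
import time
import requests

def fetch_with_retries(session, url, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            response = session.get(url, timeout=10)
        except requests.RequestException:
            response = None  # Network error: treat as a retryable failure

        if response is not None:
            if response.status_code == 200:
                return response
            if response.status_code in (403, 404):
                return None  # Drop and log: do not retry
            if response.status_code == 429:
                # Respect Retry-After when the server provides a numeric value
                retry_after = response.headers.get("Retry-After")
                delay = int(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
                time.sleep(delay)
                continue
        # 5xx or network error: exponential backoff before the next attempt
        time.sleep(2 ** attempt)
    return None

with requests.Session() as session:
    page = fetch_with_retries(session, "https://example.com")
```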
Handle JavaScript-Heavy Websites the Cheap Way First
Do not blindly deploy a browser just because a page uses JavaScript. Headless browsers consume massive CPU and memory.
- Inspect Network Calls: Open Chrome DevTools, filter the Network tab by Fetch/XHR, and reload the page. If an underlying JSON API populates the product grid, call that endpoint directly in Python (see the sketch after this list). This bypasses the DOM entirely and accelerates extraction.
- Limit Browser Automation: Playwright and Selenium Python crawlers are expensive to run. Use them exclusively for authenticated workflows, complex user interactions, or heavily obfuscated single-page applications (SPAs).
- Avoid LLM Agents for Static HTML: A January 2026 arXiv benchmark (Beyond BeautifulSoup) confirms that end-to-end web agents incur a 15–20x runtime overhead on static HTML compared to traditional parsing tools. Reserve AI agents for bypassing CAPTCHAs or repairing selectors, not for the primary extraction loop.
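A minimal illustration of the first option. The /api/products endpoint, its query parameters, and the response field names are hypothetical stand-ins for whatever you actually find in the Fetch/XHR tab.

```python
import requests

# Hypothetical endpoint discovered in DevTools; the real path, parameters,
# and response shape will differ per site
API_URL = "https://example.com/api/products"

with requests.Session() as session:
    session.headers.update({"User-Agent": "MyPythonCrawler/1.0"})
    response = session.get(API_URL, params={"page": 1, "per_page": 48}, timeout=10)
    response.raise_for_status()

    for item in response.json().get("products", []):
        # Field names are assumptions: inspect the actual JSON payload first
        print(item.get("name"), item.get("price"))
```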
Scale Python Web Crawling Beyond One Machine
Single-threaded local scripts hit a ceiling fast. Scaling requires asynchronous networking, external shared queues, and memory-efficient deduplication.
When your seed list hits hundreds of thousands of domains, local memory limits will break your script.
Move to Async I/O: Synchronous requests block the thread while waiting for network responses. Swapping to an async architecture with aiohttp or httpx allows a single machine to multiplex hundreds of open connections concurrently.
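A stripped-down sketch of the async fetch loop with aiohttp; error handling, parsing, and frontier management are omitted so the concurrency shape stays visible, and the semaphore size is an arbitrary example.

```python
import asyncio
import aiohttp

async def fetch(session, url, semaphore):
    # The semaphore caps how many requests are in flight at once
    async with semaphore:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            return url, response.status, await response.text()

async def crawl(urls, concurrency=100):
    semaphore = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession(headers={"User-Agent": "MyPythonCrawler/1.0"}) as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        # return_exceptions=True keeps one failed URL from cancelling the batch
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(crawl(["https://example.com", "https://example.com/about"]))
```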
Externalize the Frontier: A local Python deque evaporates if the process crashes, destroying your crawl state. Move the frontier to an external data store. A Redis list acts as a durable shared queue, allowing multiple Python worker scripts to pull URLs and push discovered links simultaneously.
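One way to sketch that shared frontier with the redis-py client is shown below. The key names are arbitrary, and a production setup would normalize URLs before the seen check and add retry semantics for crashed workers.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def push_url(url):
    # SADD returns 1 only the first time a member is added, giving
    # atomic dedup across every worker sharing this Redis instance
    if r.sadd("crawler:seen", url):
        r.rpush("crawler:frontier", url)

def pop_url(timeout=5):
    # BLPOP blocks until a URL is available or the timeout expires
    item = r.blpop("crawler:frontier", timeout=timeout)
    return item[1] if item else None

push_url("https://example.com")
print(pop_url())
```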
Deploy Bloom Filters: Checking a local Python set() for duplicate URLs exhausts RAM at high volume. Implement a Bloom filter—a probabilistic data structure that trades a microscopic, tunable chance of false positives for massive memory reduction. This avoids revisiting identical pages without crashing your server.
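A from-scratch sketch to make the idea concrete; in practice you would reach for a maintained Bloom filter library or a Redis-backed structure, and the bit-array size and hash count below are illustrative, not sized for your crawl.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: a fixed bit array plus k hash positions per URL."""

    def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from one SHA-256 digest of the URL
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
seen.add("https://example.com/about")
print("https://example.com/about" in seen)    # True
print("https://example.com/missing" in seen)  # False (with high probability)
```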
Choose the Right Stack: BeautifulSoup, Scrapy, or an API
Match the tool to the operational reality of your project.
| Tool | Setup Time | JS Support | Crawl Scale | Maintenance Burden | Best Use Case |
|---|---|---|---|---|---|
| BeautifulSoup | Minutes | None | Low | High (brittle) | Single scripts |
| Scrapy | Hours | Requires plugins | High | Medium | Distributed static crawling |
| Playwright | Hours | Native | Medium (expensive) | High | Complex UI interaction |
| Web Data API | Minutes | Native | Very High | Zero | Production pipelines |
- Use requests + BeautifulSoup for low-volume tasks, predictable HTML, and maximum simplicity.
- Move to a Scrapy web crawler when you need built-in event-driven routing, resilient proxy middleware, and automated politeness controls.
- Transition to a structured web data API when maintaining crawler infrastructure and anti-bot proxies costs more engineering time than the data itself is worth.
Build vs Buy: When a Crawler Becomes an Infrastructure Project
Engineering time is not free. A script costs nothing to write, but a reliable pipeline costs thousands of dollars a month to maintain. Calculate your Total Cost of Ownership (TCO) across target page volume, proxy bandwidth, anti-bot bypass costs, and engineering hours spent repairing broken selectors.
Stop building in-house when you process tens of thousands of URLs per batch run, downstream systems require strictly structured JSON, or your pipeline demands scheduled daily execution.
Mapping the workflow to an API endpoint simplifies the architecture.
- Single-page extraction: Use Olostep Scrape for real-time HTML or structured JSON, natively bypassing anti-bot measures and handling JS-rendering.
- Website discovery: Use Olostep Maps to retrieve a site’s underlying URLs before scraping, and Olostep Crawls for recursive subpage retrieval with strict routing rules.
- Batch processing: Push lists directly to Olostep Batches to process up to 10k URLs in parallel within minutes.
- Structured data at scale: Apply Olostep Parsers to enforce deterministic JSON schemas, making extraction faster and cheaper than generic LLM prompts.
Is Web Crawling Legal, and Do You Need Permission?
This is operational guidance, not legal advice.
Public access does not remove all risk. Legality depends on the data collected, whether technical barriers were circumvented, and whether you aggregate personally identifiable information (PII).
- robots.txt acts purely as a polite crawl policy signal, not a binding legal contract.
- Website Terms of Service (ToS) can introduce contract liability if you explicitly agree to them (e.g., via a login gate).
- Privacy regulations (GDPR, CCPA) dictate strict compliance when collecting PII.
Avoid scraping PII without legal approval, never explicitly bypass login gates to scrape commercial data, and apply responsible rate limits to avoid disrupting the target server.
Frequently Asked Questions
Can Python be used for a web crawler?
Yes. Python is the industry standard for web crawling due to its large ecosystem of extraction libraries (BeautifulSoup, Scrapy) and asynchronous networking tools (aiohttp).
Should I use BeautifulSoup or Scrapy for web crawling?
Use BeautifulSoup for small, single-domain scripts where simplicity matters most. Switch to Scrapy when you need distributed pipelines, built-in retry middleware, and automated politeness controls.
How do I stop a crawler from visiting the same URL twice?
Maintain a visited set to filter out URLs before appending them to your queue. For large-scale crawling, implement URL normalization and replace the standard set with a memory-efficient Bloom filter.
How do I handle JavaScript-heavy websites in a Python crawler?
Inspect the browser's Network tab first to locate and call the hidden backend JSON API directly. If no API exists, use browser automation tools like Playwright or a managed web data API.
How do I make a Python crawler faster?
Upgrade from synchronous requests to an asynchronous architecture using aiohttp. Optimize your parser, reuse TCP connections via sessions, and scale workers horizontally using a Redis-backed queue.
Decide Your Operating Model
The hardest part of web data extraction is not writing the initial script to fetch fifty URLs. Real engineering begins when you hit client-side rendering, infinite pagination traps, and IP bans.
Now that you understand how to make a web crawler in Python, evaluate your operational reality. Keep it strictly DIY for small, static sites. Move to the Scrapy framework when your pipeline requires heavy concurrency and structured routing logic. Transition to a structured web data API when maintaining proxy fleets and headless browsers eclipses the value of the data itself.
