Aadithyan · Apr 19, 2026


Web Crawling With Python: From Simple Scripts to Production

You want to scale web crawling with Python, but your local script keeps blocking, looping, or crashing. Building a basic Python crawler is trivial. Managing state, navigating JavaScript-heavy DOMs, and structuring outputs for AI pipelines require a fundamentally different mental model.

What is web crawling in Python?

Web crawling with Python is the automated process of programmatically discovering, fetching, and prioritizing URLs across a website. While web scraping extracts specific data from a known page, a web crawler in Python builds a site map by managing a URL queue, following internal links, and enforcing traversal rules to systematically uncover new content.

Whether you need to crawl a website with Python to feed a Retrieval-Augmented Generation (RAG) pipeline or monitor competitor catalogs, you will leave this guide with a working baseline script, a framework for tool selection, and a clear path from a raw Python web spider to a scalable production architecture.

Crawling vs. Scraping: Understanding the Difference

TL;DR Summary Box:

  • Crawling maps site structures and discovers URLs.
  • Scraping extracts data from pages you already know.
  • Discovery crawling and ingestion crawling require distinct architectures.

What is the difference between web crawling and web scraping?

Python crawler tutorials often conflate web crawling and web scraping. They are related but distinct jobs. Crawling focuses on breadth and discovery. It maps domain structures, uncovers sitemaps, and builds URL inventories. Scraping focuses on depth and extraction. It parses the DOM of a targeted URL to pull specific fields, like pricing or documentation text.

In production, you split these workflows. Discovery crawling audits site structures. Ingestion crawling turns deep documentation trees into clean data for internal knowledge bases.

Choosing the Best Python Library for Web Crawling

  • Select your output format (HTML, Markdown, JSON) before picking a Python crawler library.
  • requests is for learning; HTTPX is for async speed.
  • Scrapy handles complex scheduling.
  • Crawl4AI and Crawlee manage AI-native workflows and hybrid browser needs.

Do not start by deciding between BeautifulSoup and Scrapy. Start by asking what your Python crawler must produce.

HTML, Markdown, and JSON

Your downstream consumer dictates the output format.

  • HTML: Use for archival snapshots and classic CSS-selector workflows. It captures everything but introduces massive DOM noise for AI applications.
  • Markdown: Use for AI engineering and RAG pipelines. Markdown provides low-noise text, preserves heading hierarchies, and keeps links intact.
  • Structured JSON: Use for analytical dashboards and stable database schemas. The NEXT-EVAL benchmark paper (2025) demonstrated that feeding LLMs Flat JSON yields superior extraction accuracy (F1 score of 0.9567) with minimal hallucination compared to Slimmed HTML or Hierarchical JSON.

Which Python library is best for web crawling?

Evaluate your specific I/O constraints to pick the right stack for web crawling in Python.

  • requests and BeautifulSoup: The learning baseline. Perfect for low-volume, public HTML pages. They are not a production default.
  • HTTPX: The modern async HTTP standard. Upgrade to HTTPX when your workload becomes I/O-bound. It provides native async APIs and HTTP/2 support.
  • Scrapy: The durable framework. Use Scrapy when architecture, request scheduling, and pipeline management matter.
  • Crawl4AI: The AI-native ingestion tool. It focuses heavily on Markdown generation and BM25 filtering for LLM pipelines.
  • Crawlee for Python: The hybrid orchestrator. It switches seamlessly between fast HTTP parsing and headless browser rendering.
  • Playwright: The rendering engine. Use Playwright when DOM elements only exist after client-side scripts execute. Treat the browser as a computational cost center.
  • curl_cffi: The middle ground. It impersonates legitimate browser TLS and HTTP2 fingerprints without the overhead of rendering a full page.

How to Build a Simple Web Crawler in Python

  • A reliable Python website crawler requires a fetcher, a parser, and a URL queue.
  • Link normalization prevents infinite duplicate loops.
  • Scope restriction enforces depth and domain boundaries.

Baseline Crawler Architecture

To build web crawler logic in Python, you need a fetch-and-parse loop. The fetcher retrieves the raw response. The parser extracts links and normalizes them into absolute URLs. The crawler checks these against a "seen" set, adding new internal links to a queue. Finally, scope rules restrict the crawl by domain and depth.

Web Crawler Python Code

This baseline script establishes the required frontier logic for a theoretical documentation site.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from collections import deque

def simple_docs_crawler(seed_url, max_pages=50):
    queue = deque([seed_url])
    seen = {seed_url}
    base_domain = urlparse(seed_url).netloc
    crawled_data = []

    while queue and len(crawled_data) < max_pages:
        current_url = queue.popleft()

        try:
            # 1. Fetch the page
            response = requests.get(current_url, timeout=5)
            if response.status_code != 200:
                continue

            # 2. Parse the content
            soup = BeautifulSoup(response.text, 'html.parser')
            page_title = soup.title.string if soup.title else "No Title"
            crawled_data.append({"url": current_url, "title": page_title})

            # 3. Find and normalize internal links
            for link in soup.find_all('a', href=True):
                next_url = urljoin(current_url, link['href']).split('#')[0]

                # 4. Restrict scope to the same domain and the /docs/ path
                if urlparse(next_url).netloc == base_domain and '/docs/' in next_url:
                    if next_url not in seen:
                        seen.add(next_url)
                        queue.append(next_url)

        except requests.RequestException:
            continue

    return crawled_data

# Example usage:
# data = simple_docs_crawler("https://example.com/docs/")
```

This Python web crawler works perfectly for 50 pages but fails at 50,000. It lacks async concurrency, meaning it wastes time waiting for network responses. It has no retry logic, persistence, or observability. It uses a basic set for deduplication, which consumes too much memory at scale.

Scaling a Python Crawler Without Getting Blocked

  • Crawling is bottlenecked by network waiting time, not CPU compute.
  • HTTPX with asyncio handles thousands of concurrent requests efficiently.
  • Throttle request rates dynamically based on server response times.

Upgrading from Sync to Async

Synchronous crawlers stall because they spend 95% of execution time waiting for target servers to respond. Transitioning to HTTPX and asyncio turns waiting time into productive time.

Use asyncio.Semaphore to cap concurrency strictly. Uncapped concurrency will crash your machine or trigger IP bans. Pair your async client with distinct timeout thresholds and exponential backoff for 429 (Too Many Requests) or 500-level errors.

Recent Quansight benchmarking showed an asyncio web-scraping workload scaling from 12 stories/sec on a single worker up to 80 stories/sec using the new free-threaded CPython build.

How do you scale a Python crawler without getting blocked?

Start with politeness, not stealth. Tie your crawl rate to the target server's response time. If response latency increases from 200ms to 800ms, your crawler must automatically throttle back concurrency. Implement circuit breakers. If a domain returns five consecutive 503 errors, pause the crawl for that specific domain.
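One way to encode these politeness rules is a small per-domain controller. The specific thresholds below (wait roughly twice the observed latency, trip after five consecutive 503s) mirror the guidance above but are tunable assumptions, not fixed constants:

```python
class DomainThrottle:
    """Per-domain politeness: slow down when latency rises, and trip a
    circuit breaker after repeated 503 errors."""

    def __init__(self, max_errors=5, base_delay=0.2):
        self.base_delay = base_delay
        self.max_errors = max_errors
        self.consecutive_errors = 0

    def next_delay(self, last_latency):
        # If responses slow from 200ms to 800ms, the inter-request delay
        # grows with them: wait about 2x the last observed response time.
        return max(self.base_delay, 2 * last_latency)

    def record(self, status_code):
        # Any successful response resets the error streak
        if status_code == 503:
            self.consecutive_errors += 1
        else:
            self.consecutive_errors = 0

    @property
    def tripped(self):
        # Circuit breaker: pause this domain after N consecutive 503s
        return self.consecutive_errors >= self.max_errors
```

Keep one instance per domain and check `tripped` before dequeuing that domain's URLs; a tripped domain sits out for a cooldown period while the rest of the crawl continues.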

When standard requests fail, escalate logically. Use curl_cffi to mimic real browser TLS handshakes before you jump to full headless browser rendering.

How to Crawl JavaScript-Heavy Websites with Python

  • Verify rendering is actually required before launching a headless browser.
  • Use Playwright for full session control.
  • Use lightweight action flows for discrete interactions.

Do not guess if a site requires rendering. Check the raw HTML response. Rendering is necessary if the content is completely missing from the raw DOM, appears only after XHR network requests resolve, sits behind an interactive gate, or relies on client-side infinite scroll.
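This check is cheap to automate before paying for a browser. A minimal sketch: fetch the raw HTML once, then look for a marker string you know appears in the rendered page (the marker is whatever content you choose, e.g. a product name):

```python
def needs_rendering(raw_html, marker_text):
    """Heuristic: if content visible in the browser (marker_text) is absent
    from the raw HTML response, the page likely builds it client-side and
    requires a headless browser."""
    return marker_text not in raw_html

# Usage sketch: check the raw response first, escalate only if needed
# raw = requests.get(url, timeout=10).text
# if needs_rendering(raw, "Add to cart"):
#     ...  # fall back to Playwright for this URL
```

Pages that fail the check go into a separate browser-rendering queue; everything else stays on the cheap HTTP path.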

To crawl JavaScript-heavy websites with Python, use Playwright. It handles complex DOMs through auto-waiting mechanisms and resilient locators, ensuring the elements actually exist before the script attempts extraction. Alternatively, use Crawlee's PlaywrightCrawler for a seamless HTTP-to-browser escalation path.

Reduce wasted browser compute. Often, a page only requires a single simulated interaction, like clicking a "Load More" button. Execute these discrete action layers instead of maintaining long, expensive browser sessions.

Managing the URL Frontier: How to Avoid Duplicate URLs

  • Normalize URLs immediately to prevent infinite loops.
  • Match your traversal strategy (BFS, DFS) to the site topology.
  • Upgrade to Bloom filters for massive URL sets.

The URL frontier acts as your crawler’s decision engine.

How do you avoid duplicate URLs in a crawler?

You avoid duplicate URLs through aggressive normalization. A crawler treats page.html#top and page.html#bottom as two separate URLs unless you strip the fragment identifiers. Normalize query parameters by ordering them alphabetically to ensure identical pages resolve to identical URLs. Always convert relative links into absolute links and respect canonical tags.
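The normalization rules above fit in one function built on the standard library (canonical-tag handling happens separately at parse time):

```python
from urllib.parse import urljoin, urlparse, urlencode, parse_qsl, urlunparse

def normalize_url(base_url, href):
    """Canonicalize a discovered link: absolutize it, strip the fragment,
    lowercase the host, and sort query parameters alphabetically so
    ?b=2&a=1 and ?a=1&b=2 dedupe to the same key."""
    absolute = urljoin(base_url, href)
    parts = urlparse(absolute)
    sorted_query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse((
        parts.scheme,
        parts.netloc.lower(),
        parts.path,
        parts.params,
        sorted_query,
        "",  # drop the #fragment entirely
    ))
```

Run every discovered link through this before checking the "seen" set; the frontier then stores one canonical key per page instead of one per link variant.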

Traversal Strategy

  • BFS (Breadth-First Search): Uncovers category maps quickly. Ideal for ecommerce catalogs.
  • DFS (Depth-First Search): Follows a single technical manual down to its deepest sub-page. Ideal for documentation.
  • Best-first: Prioritizes the queue based on relevance scores, ensuring search-query-guided AI ingestion crawls the highest-value URLs first.
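BFS and DFS differ only in which end of the frontier you pop from. A toy traversal over an in-memory link graph (a stand-in for real fetch-and-parse) makes the ordering difference concrete:

```python
from collections import deque

def traverse(graph, seed, strategy="bfs", max_depth=3):
    """Toy frontier illustrating BFS vs. DFS ordering. `graph` maps each
    URL to the links found on that page; the depth cap stops runaway loops."""
    frontier = deque([(seed, 0)])
    seen = {seed}
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:
            continue
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order
```

BFS visits every category page before any leaf; DFS drains one branch to its deepest page before touching the next. Best-first replaces the deque with a priority queue keyed on a relevance score.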

Enforce strict depth caps and maximum page budgets per run. Crawlers easily fall into infinite calendar generators or faceted navigation loops. At scale, a standard Python set() will exhaust your server memory. Replace sets with Bloom filters, a probabilistic data structure that drastically reduces deduplication memory usage.
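For illustration, a minimal Bloom filter can be built from hashed bit positions. Production crawlers typically use a tuned library implementation rather than this sketch, and the size and hash-count values below are arbitrary:

```python
import hashlib

class BloomFilter:
    """Sketch of a Bloom filter for URL deduplication. Memory stays fixed
    at `size` bits no matter how many URLs you add; the trade-off is a
    small false-positive rate (an unseen URL may be reported as seen and
    get skipped, but a seen URL is never reported as new)."""

    def __init__(self, size=1_000_000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, url):
        # Derive num_hashes independent bit positions from salted SHA-256
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

Swapping the `seen` set in the baseline crawler for this filter turns deduplication memory from linear in URL count to a fixed, pre-sized buffer.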

Respecting robots.txt and Legal Boundaries

  • Parse robots.txt for User-Agent rules and crawl delays.
  • Public data carries different risk profiles than authenticated data.
  • Compliance with standard protocols does not negate Terms of Service.

How do you respect robots.txt when crawling a website?

Parse the robots.txt file for specific User-Agent directives, Disallow paths, and Crawl-delay rules. The core protocol is codified in RFC 9309; Crawl-delay is a common de facto extension. Cache the file locally to avoid re-requesting it before every URL. If robots.txt is unreachable due to server errors, RFC 9309 advises assuming the entire site is disallowed.
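The standard library already implements this parsing. A minimal sketch with urllib.robotparser, using invented rules for illustration:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (invented for illustration); in practice you
# fetch it from https://<domain>/robots.txt and cache one parser per domain.
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check each candidate URL against the cached rules before fetching
print(parser.can_fetch("my-crawler", "https://example.com/docs/intro"))   # True
print(parser.can_fetch("my-crawler", "https://example.com/admin/users"))  # False
print(parser.crawl_delay("my-crawler"))                                   # 2
```

Wire `can_fetch` into the frontier as a gate and feed `crawl_delay` into your throttling logic, so compliance is enforced in code rather than by convention.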

Disclaimer: This is operational guidance, not legal advice.

Crawling public, unauthenticated data generally carries lower legal risk. However, scraping behind a login wall, ignoring clear cease-and-desist terms, or aggregating personally identifiable information fundamentally shifts your legal exposure. Recent reporting in AI Bots Are Now a Significant Source of Web Traffic indicated a 336% spike in sites blocking AI crawlers, while The 2026 State of AI Traffic & Cyberthreat Benchmark Report found AI scraper traffic grew 597% in 2025.

Provide a clear User-Agent string containing a contact email or project URL. Drop crawl rates dynamically if server response times spike.

When to Use Scrapy Instead of Requests and BeautifulSoup

  • Adopt Scrapy when request scheduling and middleware dictate the architecture.
  • Skip Scrapy for simple one-off AI ingestion tasks.
  • Map your custom script loops to Scrapy’s built-in engines.

When should you use Scrapy instead of Requests and BeautifulSoup?

Switch to Scrapy when crawler architecture dominates the workload. You should use Scrapy when your project requires managing dozens of concurrent spiders, complex request scheduling, custom downloader middleware, automated retries, and dedicated item exporters.

Skip Scrapy for narrow domain audits or Markdown-first AI ingestion workflows where async HTTPX or modern tools like Crawl4AI fit better.

If you are migrating your hand-rolled script:

  • Your custom URL queue becomes the Scrapy Scheduler.
  • Your requests fetcher becomes the Scrapy Downloader.
  • Your BeautifulSoup parsing logic moves inside the Scrapy Spider.
  • Your final data formatting shifts to a Scrapy Item Pipeline.

Bridging to Production: API Layers and Managed Parsers

  • Keep building in Python for narrow, stable jobs.
  • Outsource infrastructure when maintenance consumes your engineering sprints.
  • Use managed APIs to process massive URL lists quickly.

Signs you should keep building custom python crawler logic: narrow scope, low volume, stable HTML layouts, and low stakes for downtime.

Signs infrastructure is cheaper than code: massive URL volume, recurring daily monitoring, highly dynamic content, and failing schemas. PromptCloud notes that engineering time spent on scraper maintenance often exceeds the time spent actually using the data. A 403 Forbidden error is loud and easy to fix. Quietly collecting empty data due to schema drift is an operational disaster.

When hand-rolling infrastructure becomes a liability, managed API layers provide a direct bridge to production. Platforms like Olostep separate recursive crawling, URL discovery, batching, and parsing into dedicated endpoints designed for AI workflows.

  • Recursive Mapping: Replaces complex frontier logic. Use for docs ingestion and subdomain-limited site discovery.
  • Batches: Processes massive URL lists concurrently. Use for price monitoring sweeps and content refresh checks.
  • Parsers: Solve the schema drift problem by returning deterministic structured JSON for AI. For recurring extraction jobs, parsers prove faster and more cost-efficient than running broad LLM extractions on raw text.

Conclusion

You now have the framework to move beyond the toy script. Web crawling with Python scales predictably when you respect the fundamental rules of network I/O, queue management, and DOM structure.

If your job is narrow, do not over-engineer the solution. Copy the baseline crawler script provided above, define your frontier rules, and run the job. If your target list grows into the thousands, rewrite your fetcher using HTTPX and asyncio, or map your requirements to Scrapy's architecture. When your engineering team starts spending days adjusting selectors and fighting proxy bans, shift to API infrastructure.

Own your extraction layer first. Outsource network and rendering only when the maintenance curve turns against you.

About the Author

Aadithyan Nair

Founding Engineer, Olostep · Dubai, AE

Aadithyan is a Founding Engineer at Olostep, focusing on infrastructure and GTM. He's been hacking on computers since he was 10 and loves building things from scratch (including custom programming languages and servers for fun). Before Olostep, he co-founded an ed-tech startup, did first-author ML research at NYU Abu Dhabi, and shipped AI tools at Zecento and RAEN AI.
