You want high-resolution product photos. You write soup.find_all("img"). You execute the script. You receive a folder full of 1x1 tracking pixels, Base64 placeholders, and missing mobile thumbnails. Welcome to modern image scraping.
Most developers build Python image scrapers that fail immediately because the web no longer delivers static images. From 2022 to 2024, JPEG fell from 40% to 32% of web images while AVIF adoption grew 386%, according to the HTTP Archive's 2024 Web Almanac (Media chapter), and roughly 16% of pages still lazy-load their largest contentful paint (LCP) image, according to the Performance chapter of the same report. The real visual asset usually hides inside JSON payloads, responsive srcset attributes, or CSS background styles.
What is image scraping?
Image scraping is the automated extraction of image URLs, metadata, and physical image files from webpages. The process involves parsing HTML, interpreting JavaScript payloads, resolving CDN parameters, and exporting structured data (like alt text and dimensions) into a dataset or data pipeline.
How do you scrape images from a website with Python?
- Extract structured data (JSON-LD, Open Graph, Sitemaps) first.
- Parse the `src` attribute for standard static images.
- Target `data-src` or `data-srcset` attributes to bypass JavaScript lazy loading.
- Intercept network XHR/JSON payloads for dynamic galleries.
- Use Playwright or a managed API if the DOM requires full rendering.
To extract the right data, you need a strict extraction hierarchy. This guide explains how to scrape images from a website using Python, why common image scrapers fail, when to use BeautifulSoup versus Playwright, and how to scale reliable extraction using Olostep.
Choose the Output Before You Write Code
Many developers assume web scraping images always means writing binary files to disk. You must distinguish between image discovery and image downloading. I recommend storing URLs plus metadata as your default output for SEO auditing, asset monitoring, competitor tracking, and catalog enrichment.
This distinction impacts your storage costs, dataset freshness, legal copyright exposure, and data pipeline complexity.
Do I need to download images or just store image URLs?
You should store URLs plus metadata unless your specific use case requires analyzing the actual pixels. You only need to extract full image files for computer vision pipelines and perceptual hashing. For all other use cases, storing the URL captures the asset reference without incurring high storage or bandwidth costs.
Define a minimum reusable record shape before writing code. Your schema should look like this: page_url, image_url, source_type, alt_text, width, height, mime_type, checksum, and captured_at.
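Here is a minimal sketch of that record shape as a Python dataclass. The field names mirror the schema above; the example values are placeholders, not prescriptive choices:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ImageRecord:
    page_url: str
    image_url: str
    source_type: str                # e.g. "img_src", "json_ld", "og_image"
    alt_text: str = ""
    width: Optional[int] = None
    height: Optional[int] = None
    mime_type: Optional[str] = None
    checksum: Optional[str] = None  # SHA-256 of the file, if downloaded
    captured_at: str = ""

record = ImageRecord(
    page_url="https://example.com/product",
    image_url="https://cdn.example.com/img/123.avif",
    source_type="og_image",
    captured_at=datetime.now(timezone.utc).isoformat(),
)
print(asdict(record))
```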
Build a Python Image Scraper for Static HTML
For purely static web pages, the cleanest approach pairs httpx with BeautifulSoup. HTTPX provides a modern client with HTTP/2 support, connection pooling, and streaming capabilities.
Note: Install the required libraries via pip install httpx beautifulsoup4.
```python
import httpx
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_static_images(page_url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    response = httpx.get(page_url, headers=headers)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    image_records = []
    for img in soup.find_all("img"):
        raw_src = img.get("src")
        if not raw_src:
            continue
        absolute_url = urljoin(page_url, raw_src)
        image_records.append({
            "page_url": page_url,
            "image_url": absolute_url,
            "alt_text": img.get("alt", "").strip(),
            "source": "img_src",
        })
    return image_records
```

This method is fast but brittle. If you rely solely on `img.get("src")`, you will miss `data-src` (JS lazy loading), `<picture>` variants (responsive images), CSS `background-image` declarations, and JavaScript-injected image URLs.
The Discover, Resolve, Retrieve Framework
To build a Python image scraper that survives the modern web, abandon the idea of just parsing HTML. Use a three-phase model.
1. Discover
Image references exist in structured data, CSS stylesheets, rendered DOM nodes, or raw API responses. Your scraper must identify the correct delivery surface.
2. Resolve
Once you find an image reference, you must select the correct asset. This requires normalizing absolute URLs, choosing the highest resolution from a CDN parameter, and stripping useless tracking tags.
3. Retrieve
Downloading the file is strictly optional. You trigger a file download only when you need to inspect the binary data.
Treat structured data extraction as Level 0 image scraping. Before you write a single CSS selector, inspect the page source for JSON-LD, Open Graph tags, and sitemaps.
Start with Structured Data
Extracting metadata directly from semantic layers skips layout changes and A/B testing entirely. These sources are cleaner and significantly less brittle than DOM parsing.
JSON-LD and Schema.org ImageObject
Modern e-commerce and media sites embed Schema.org data. Look for ImageObject, contentUrl, image, or primaryImageOfPage. Schema implementations also bundle vital context like description, author, datePublished, and contentLocation.
Use the extruct library to pull JSON-LD, Open Graph, microdata, and RDFa simultaneously:
```python
import extruct
import httpx

response = httpx.get("https://example.com/product")
data = extruct.extract(response.text, base_url="https://example.com")

json_ld = data.get("json-ld", [])
for item in json_ld:
    if item.get("@type") == "Product":
        print("Product Image:", item.get("image"))
```

Open Graph image tags
Social sharing tags guarantee at least one high-quality representative image per page. You can easily extract og:image, og:image:width, og:image:height, and og:image:alt.
```python
og_image = soup.find("meta", property="og:image")
if og_image:
    print("OG Image URL:", og_image.get("content"))
```

Image sitemaps
Google Search Central explicitly documents Image sitemaps as a way to expose images crawlers might otherwise miss. Each <url> block can contain up to 1,000 <image:image> entries holding the <image:loc>. Map the site via sitemaps first rather than scraping blind.
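A short sketch of sitemap-first discovery follows. The sitemap URL is an assumption; check robots.txt for the real location, and note that large sites often split image sitemaps across an index file:

```python
import httpx
from xml.etree import ElementTree

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

def images_from_sitemap(sitemap_url):
    response = httpx.get(sitemap_url)
    response.raise_for_status()
    root = ElementTree.fromstring(response.content)
    records = []
    for url_block in root.findall("sm:url", NS):
        page_loc = url_block.findtext("sm:loc", default="", namespaces=NS)
        for image in url_block.findall("image:image", NS):
            records.append({
                "page_url": page_loc,
                "image_url": image.findtext("image:loc", default="", namespaces=NS),
            })
    return records

# Usage (hypothetical path): images_from_sitemap("https://example.com/sitemap.xml")
```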
Diagnose How a Site Delivers Images
Can you scrape all images from a website? You can only capture everything if you account for every delivery surface. When your Python image scraper fails, diagnose the frontend pattern before changing your library.
Native lazy loading vs JavaScript lazy loading
What is the best way to scrape images from lazy-loading pages?
Native lazy loading (loading="lazy") is static-scraper friendly. The browser defers the fetch, but the real URL remains directly inside the src attribute. JavaScript lazy-loading libraries hide the real URL to prevent automatic requests. They place a 1x1 pixel SVG or Base64 string in the src, and put the real image in a data-* attribute.
Implement a fallback chain in your code to detect the real URL:
```python
def extract_real_image_url(img_tag):
    # Ordered fallbacks: lazy-load attributes first, then srcset, then src.
    fallbacks = ["data-src", "data-original", "data-lazy", "srcset", "data-srcset", "src"]
    for attr in fallbacks:
        val = img_tag.get(attr)
        # Skip Base64/SVG placeholders injected by lazy-loading libraries.
        if val and not val.startswith("data:image"):
            return val
    return None
```

Responsive images with srcset and <picture>
The srcset attribute provides the browser with multiple image candidates. Naive scrapers grab the first URL they see, which is often a low-res mobile thumbnail. You must parse the string, evaluate the width descriptors (e.g., 800w), and select the highest resolution target variant.
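A minimal srcset parser is sketched below. It assumes width descriptors (`800w`) are present; density descriptors (`2x`) would need extra handling:

```python
def pick_largest_from_srcset(srcset_value):
    """Return the candidate URL with the largest width descriptor."""
    best_url, best_width = None, -1
    for candidate in srcset_value.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url = parts[0]
        width = -1
        # Width descriptors look like "800w"; strip the suffix to compare.
        if len(parts) > 1 and parts[1].endswith("w"):
            width = int(parts[1][:-1])
        if width > best_width:
            best_url, best_width = url, width
    return best_url

print(pick_largest_from_srcset("img-400.jpg 400w, img-800.jpg 800w"))
# -> img-800.jpg
```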
API payloads and JS-injected images
If the gallery is empty in the page source but visible in the browser, the images are injected via JS. Open the DevTools Network tab, filter by Fetch/XHR, and reload. You will often find a clean JSON payload containing the entire high-resolution image list.
Always inspect the Network tab before launching a headless browser. If a product gallery requests a JSON file when you scroll down, you can configure httpx to call that background JSON endpoint directly.
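Here is a hedged sketch of calling such a background endpoint directly. The URL and JSON shape are hypothetical placeholders you would copy from the Network tab; every site structures this differently:

```python
import httpx

# Hypothetical endpoint copied from DevTools; adjust to what the site actually calls.
api_url = "https://example.com/api/products/123/gallery?page=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
}

payload = httpx.get(api_url, headers=headers).json()
# Assumed structure: {"images": [{"url": ..., "alt": ...}, ...]}
for item in payload.get("images", []):
    print(item.get("url"), item.get("alt"))
```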
Resolve Image URLs, Alt Text, and Metadata
Knowing how to extract image URLs, alt text and metadata efficiently is the difference between a useless dataset and a production-ready one.
How do I extract image URLs, alt text and metadata?
Parse the alt text attribute from the DOM, but immediately resolve the URL using urllib.parse.urljoin() to handle cross-domain CDN paths. Capture nearby caption text, explicit width/height values, the MIME type, and the page title to ensure the final output is contextually useful.
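A sketch of that resolution step, assuming a BeautifulSoup `img` tag and an enclosing `<figure>` for caption context (both are assumptions about the page's markup):

```python
import mimetypes
from urllib.parse import urljoin

def resolve_image_metadata(img_tag, page_url, page_title):
    absolute_url = urljoin(page_url, img_tag.get("src", ""))
    # Pull a nearby caption if the image sits inside a <figure>.
    figure = img_tag.find_parent("figure")
    caption = ""
    if figure and figure.find("figcaption"):
        caption = figure.find("figcaption").get_text(strip=True)
    mime_type, _ = mimetypes.guess_type(absolute_url)
    return {
        "image_url": absolute_url,
        "alt_text": img_tag.get("alt", "").strip(),
        "caption": caption,
        "width": img_tag.get("width"),    # explicit attributes only; may be None
        "height": img_tag.get("height"),
        "mime_type": mime_type,           # guessed from extension; verify via Content-Type later
        "page_title": page_title,
    }
```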
Avoid downloading broken, low-quality or placeholder images
Base64 placeholders create massive false positives. Your scraper will extract 5,000-character strings instead of URLs. Always run a check to discard values starting with data:image/svg+xml;base64, or data:image/gif;base64,.
When dealing with CDNs like Cloudinary or Imgix, evaluate the URL parameters. CDNs use width params (w_800), quality params (q_auto), and format params (f_auto). To grab the highest resolution asset, run a simple string replacement on the extracted URL, changing w_400 to w_1200 before storing it.
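A hedged example of that parameter rewrite for Cloudinary-style URLs. The `w_400`/`w_1200` values are illustrative, and every CDN exposes different transformation knobs:

```python
import re

def upgrade_cdn_width(image_url, target_width=1200):
    """Rewrite a Cloudinary-style width parameter (w_400 -> w_1200)."""
    # Matches transformation segments like w_400; harmless no-op if absent.
    return re.sub(r"w_\d+", f"w_{target_width}", image_url)

url = "https://res.cloudinary.com/demo/image/upload/w_400,q_auto,f_auto/sample.jpg"
print(upgrade_cdn_width(url))
# https://res.cloudinary.com/demo/image/upload/w_1200,q_auto,f_auto/sample.jpg
```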
Build a Safe Download Pipeline
If you must retrieve the physical files, do not write a script that blindly force-saves everything as .jpg.
Build a pipeline that uses HTTP HEAD requests or streamed GET requests. Validate the Content-Type headers to ensure the file is actually an image. Map the content type to the correct file extension, enforce minimum dimension thresholds, and deduplicate your downloads immediately using a SHA-256 checksum.
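A compact sketch of that pipeline using streamed GETs. The 50 KB minimum stands in for the dimension checks you would run with a library like Pillow, and the accepted types are an illustrative subset:

```python
import hashlib
import httpx

EXTENSION_MAP = {"image/jpeg": ".jpg", "image/png": ".png",
                 "image/webp": ".webp", "image/avif": ".avif"}

def download_image(url, seen_checksums, min_bytes=50_000):
    with httpx.stream("GET", url, follow_redirects=True) as response:
        response.raise_for_status()
        # Validate Content-Type before reading the body.
        content_type = response.headers.get("content-type", "").split(";")[0]
        if content_type not in EXTENSION_MAP:
            return None  # not an image type we accept
        data = response.read()
    if len(data) < min_bytes:
        return None  # likely a placeholder or thumbnail
    checksum = hashlib.sha256(data).hexdigest()
    if checksum in seen_checksums:
        return None  # exact duplicate
    seen_checksums.add(checksum)
    # Map the validated MIME type to the correct extension.
    filename = checksum[:16] + EXTENSION_MAP[content_type]
    with open(filename, "wb") as f:
        f.write(data)
    return filename
```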
How to Scrape Images From Dynamic Websites
When basic HTTP clients fail, developers assume they need to emulate a human. Browser automation is not the first fix, but it is necessary for heavily obfuscated Single Page Applications (SPAs).
Should I use BeautifulSoup, Selenium, Playwright or an API?
Use BeautifulSoup for static HTML. Use Playwright for rendered DOMs, post-click galleries, and dynamic interaction. Use Selenium only if you are bound to legacy infrastructure. Use a managed API when anti-bot handling and proxy rotation become full-time infrastructure work.
Use Playwright for rendered DOMs
Playwright handles dynamic content better than older tools. Its auto-waiting mechanisms ensure the DOM mounts before extraction occurs.
```python
import time
from playwright.sync_api import sync_playwright

def scrape_dynamic_images(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Incremental scroll to trigger IntersectionObservers
        for _ in range(5):
            page.mouse.wheel(0, 1000)
            time.sleep(1)

        images = page.eval_on_selector_all("img", "elements => elements.map(e => e.src)")
        browser.close()
        return images
```

If you only capture the first 10 images of a 50-image gallery, scrolling to the bottom once is not enough. IntersectionObserver-triggered images require incremental scrolling with wait conditions to allow the frontend framework to fire the network requests.
Legal Boundaries and Anti-Bot Systems
Is it legal to scrape images from a website?
Collecting public image URLs and metadata for internal SEO or monitoring generally carries a low risk profile. Downloading the physical files increases risk, as you are creating a copy on your servers. Republishing those downloaded images almost certainly violates copyright unless you have a license or a strong fair use defense.
If you are scraping images to build visual AI datasets, involve your legal team. The U.S. Copyright Office's May 2025 pre-publication version of Copyright and Artificial Intelligence, Part 3: Generative AI Training explicitly stated that dataset preparation and training implicate the reproduction right, as developers make multiple copies by downloading, transferring, and converting works.
Why more sites block scrapers
If your scraper receives continuous 403 Forbidden errors, you are likely collateral damage in the war against AI crawlers. Cloudflare reported in "Regain control of AI crawlers" that from July 2024 to July 2025, raw requests from GPTBot rose 147% and Meta-ExternalAgent requests rose 843%. In response, more sites enabled aggressive AI crawler blocking.
If your download fails with a 403 Forbidden, you are likely hitting hotlink protection or advanced anti-bot systems. The CDN expects a valid Referer header matching the origin site. Pass the correct standard headers, or migrate to a managed scraping API.
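A sketch of the header fix, assuming the CDN only checks the Referer; real anti-bot stacks inspect far more than this:

```python
import httpx

image_url = "https://cdn.example.com/img/hero-1200.jpg"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://example.com/product",  # must match the origin page
    "Accept": "image/avif,image/webp,image/*,*/*;q=0.8",
}

response = httpx.get(image_url, headers=headers, follow_redirects=True)
response.raise_for_status()
```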
Operationalize Image Scraping at Scale with Olostep
Maintaining proxy pools, handling anti-bot blocks, and updating broken CSS selectors eventually costs more than paying for infrastructure. When your manual Python scraper becomes unreliable, you should move the workload to Olostep.
Olostep acts as a managed web data layer designed for teams that need reliable, structured web data at scale.
Match the endpoint to the job
- Use the Scrape Endpoint for rendered pages: The `/v1/scrapes` endpoint returns rendered HTML, text, screenshots, or structured JSON from any URL in real time. It handles JS-rendered sites and login flows automatically. A request sketch follows this list.
- Use Parsers for recurring schema extraction: If you repeatedly extract gallery arrays or product media fields from the same domains, use a Parser. Parsers turn unstructured pages into backend-compatible structured JSON.
- Use the Map Endpoint for discovery: If you need to discover product URLs before extracting their images, the `/v1/maps` endpoint returns URLs from a site and supports include/exclude patterns.
- Use the Batch Endpoint for scale: The `/v1/batches` endpoint supports up to 10,000 URLs per batch and runs multiple batches in parallel, generally processing in 5 to 8 minutes.
- Use Schedules for monitoring: If your workflow requires daily dataset refreshes, Schedules execute API calls at specific times. Webhooks automatically notify your system when async jobs finish.
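Here is a hypothetical request sketch against the `/v1/scrapes` endpoint described above. The JSON field names (`url_to_scrape`, `formats`) are illustrative assumptions, not the documented schema; check the Olostep API reference for the exact contract:

```python
import httpx

# Field names below are illustrative assumptions, not the documented schema.
response = httpx.post(
    "https://api.olostep.com/v1/scrapes",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "url_to_scrape": "https://example.com/product",
        "formats": ["html"],  # rendered HTML to feed into BeautifulSoup
    },
    timeout=120,
)
response.raise_for_status()
result = response.json()  # parse image URLs from the returned payload
```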
Olostep scales predictably. Start on the free tier (500 successful JS-rendered requests using residential IPs, no credit card required) to validate one representative domain before committing engineering time to a custom build.
End with the Right Default
To successfully scrape images from a website with Python, always choose the lightest reliable path. Do not start with a headless browser.
Follow this definitive order:
- Parse structured data (JSON-LD, OG tags, sitemaps).
- Read the DOM and custom `data-*` lazy-load attributes.
- Inspect `srcset`, CSS backgrounds, and underlying network JSON payloads.
- Escalate to Playwright only for obfuscated SPAs.
- Migrate to a managed API like Olostep when maintenance outpaces value.
Test one representative page manually using the Discover, Resolve, Retrieve framework. Once you map the delivery pattern, you can write the code.
