Aadithyan · Apr 30, 2026

Learn how to scrape and download images from a website using Python. Use BeautifulSoup for static pages, handle lazy loading, Base64, srcset, and dynamic sites.

How to Scrape Images From a Website Using Python

You write a quick script to scrape images from a website using Python. The code finishes in seconds. But when you open your folder, you don't find high-resolution photos. Instead, you stare at 500 identical 1x1 transparent GIFs, broken relative URLs, and 403 Forbidden errors.

Your Python syntax isn't broken. Your mental model of modern web delivery is.

Image scraping requires diagnosing delivery patterns before writing code. This guide shows you how to extract real URLs hidden in lazy-load attributes or Base64, download actual binary files safely, and know exactly when to escalate from an HTML parser to Playwright or a managed API.

How do I scrape images from a website using Python?

To scrape images from a website using Python, first inspect the webpage to locate the image URLs. For static pages, use the requests and BeautifulSoup libraries to parse the HTML and extract attributes like src or data-src. Finally, download the binary image files politely using requests.get() with stream=True and write them to your local disk in chunks.
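For orientation, here is a minimal sketch of that whole pipeline against a simple static page. The URL and filenames are placeholders, and every step is unpacked in detail below.

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

PAGE = "https://example.com/gallery"  # placeholder target

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

response = session.get(PAGE, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
os.makedirs("images", exist_ok=True)

# Download every <img> that carries a src attribute.
for idx, img in enumerate(soup.find_all("img", src=True)):
    url = urljoin(PAGE, img["src"])  # resolve relative paths
    with session.get(url, stream=True, timeout=15) as r:
        r.raise_for_status()
        with open(f"images/image_{idx}.jpg", "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)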

Step 1: Diagnose how the website delivers images

Always check the network tab and source code before choosing a scraping library. Identify if images use static HTML, dynamic injection, or CSS.

Most scripts fail because developers choose a Python library before checking how the website actually serves media. Diagnose the delivery pattern first.

Static HTML, lazy attributes, or a dynamic shell?

Open your target page in Chrome. Right-click and select View Page Source. Search for your target image URL.

If the URL is there, the page uses static HTML. You can use standard requests and BeautifulSoup.

If the URL is missing from the source but appears when you use Inspect Element, the image is injected dynamically. It might be hidden inside a lazy-load attribute, fetched via an API after you scroll, or generated by a JavaScript framework. An empty or extremely thin HTML response points directly to a client-side rendered application.
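You can script this check as well. A small sketch, assuming you have already copied an image URL from Inspect Element (both URLs below are placeholders): if the URL appears in the raw HTML, the page is static; if not, it is injected later.

import requests

page_url = "https://example.com/gallery"            # page you are diagnosing
image_url = "https://cdn.example.com/photos/a.jpg"  # copied from Inspect Element

html = requests.get(page_url, timeout=10).text

if image_url in html:
    print("Static HTML: requests + BeautifulSoup is enough.")
else:
    print("Injected dynamically: check lazy attributes, XHR, or render with a browser.")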

Attribute priority checklist before you write code

Do not blindly target the <img> tag's src attribute. Modern performance optimization means src often holds a low-resolution fallback or a placeholder. Check these attributes in order:

  • data-src / data-original: The real URL on lazy-loaded pages.
  • srcset: Multiple resolutions for responsive layouts.
  • src: The default fallback, often a thumbnail.
  • style: CSS background-image URLs for cards and banners (see the regex sketch after this list).
  • src="data:image...": Inline Base64 image data that requires decoding, not downloading.

Step 2: Extract image URLs with BeautifulSoup

Can BeautifulSoup scrape images from a website? Yes, if the URLs exist in the static HTML. Use it to target custom data attributes for lazy-loaded images, apply urljoin for relative paths, and parse srcset for high-resolution assets.

If your diagnosis confirms the URLs exist in the static HTML, use requests and BeautifulSoup.

How to extract image URLs with BeautifulSoup

This is the minimal reliable pattern for extracting image nodes from a static page. Use a Session to pool connections and apply a timeout to prevent hanging requests.

import requests
from bs4 import BeautifulSoup

# Reuse one session so TCP connections are pooled across requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# A timeout prevents the script from hanging on a slow or dead server.
response = session.get("https://example.com/gallery", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
images = soup.find_all("img")

# Print every attribute so you can see which ones hold the real URL.
for img in images:
    print(img.attrs)

How to handle lazy-loaded images when scraping

Many lazy-loaded pages do not require a headless browser. The real image URL simply sits in a custom attribute, waiting for JavaScript to swap it into the src attribute when you scroll.

You can bypass the browser entirely by targeting the first valid attribute.

def get_best_image_url(img_tag):
    # Check lazy-load attributes before the (often placeholder) src.
    candidates = ['data-src', 'data-original', 'src']
    for attr in candidates:
        url = img_tag.get(attr)
        # Skip inline Base64 data and common 1x1 placeholder URLs.
        if url and "data:image" not in url and "1x1" not in url:
            return url
    return None

Resolve relative URLs safely

Websites frequently use relative paths (/images/pic.jpg) or protocol-relative URLs (//example.com/pic.jpg). Passing these directly to a downloader makes requests raise a MissingSchema error. Use urllib.parse.urljoin to build absolute URLs safely.

from urllib.parse import urljoin

base_url = "https://example.com/gallery"
extracted_path = "/assets/photo.jpg"

# urljoin handles relative, root-relative, and protocol-relative paths.
absolute_url = urljoin(base_url, extracted_path)
# -> "https://example.com/assets/photo.jpg"

Extract the highest-resolution image from srcset

Responsive websites use srcset to serve different image sizes based on device width. If you only scrape src, you will likely download thumbnails. Parse the srcset string to find the candidate with the highest width descriptor (w).

def get_highest_res_from_srcset(srcset_string):
    if not srcset_string:
        return None

    candidates = []
    for item in srcset_string.split(','):
        parts = item.strip().split(' ')
        url = parts[0]
        # Width descriptors look like "800w"; density descriptors ("2x")
        # and bare URLs fall back to a width of 0.
        width = int(parts[1].replace('w', '')) if len(parts) > 1 and 'w' in parts[1] else 0
        candidates.append((url, width))

    # Highest declared width first.
    candidates.sort(key=lambda x: x[1], reverse=True)
    return candidates[0][0]
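One way to wire this into the earlier extraction loop, preferring srcset and falling back to get_best_image_url when it is absent:

for img in soup.find_all("img"):
    # Prefer the responsive candidates; fall back to lazy/src attributes.
    best = get_highest_res_from_srcset(img.get("srcset")) or get_best_image_url(img)
    if best:
        print(best)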

Step 3: Save and download images from a URL in Python

Never load large image files directly into memory. Use chunked streaming (stream=True) to download files politely and validate Content-Type headers to prevent saving HTML error pages as images.

Extracting a string is only the first half of the problem. You now need to download the actual binary files.

How do I save an image from a URL in Python?

Never use response.content to download large files directly into memory. It causes memory spikes and crashes during bulk operations. Use stream=True and write the file in chunks. Validate the Content-Type header to ensure you are saving an image, not an error page.

def download_image(url, filepath, session):
    # stream=True keeps the body out of memory until we read it in chunks.
    with session.get(url, stream=True, timeout=15) as r:
        r.raise_for_status()

        # Reject anything that is not an image (e.g. an HTML error page).
        content_type = r.headers.get('Content-Type', '')
        if not content_type.startswith('image/'):
            raise ValueError(f"Invalid content type: {content_type}")

        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)

How do I download all images from a webpage in Python?

Chain your extraction and download logic together into a modular pipeline. Ensure the target folder exists and gracefully skip duplicates to make your script rerun-safe.

import os

def bulk_download(urls, download_dir, session):
    os.makedirs(download_dir, exist_ok=True)
    downloaded = set()

    for idx, url in enumerate(urls):
        # Skip URLs already handled in this run.
        if url in downloaded:
            continue

        # Note: the extension is hardcoded; derive it from the
        # Content-Type header if you expect mixed formats.
        filepath = os.path.join(download_dir, f"image_{idx}.jpg")
        if os.path.exists(filepath):
            continue  # rerun-safe: keep files from previous runs

        try:
            download_image(url, filepath, session)
            downloaded.add(url)
            print(f"Saved: {filepath}")
        except Exception as e:
            print(f"Failed {url}: {e}")

How do I avoid broken image downloads in Python?

Scrapers easily ingest junk data. An HTTP 200 OK response does not guarantee a valid image file. Validate each file immediately to keep junk out of your download folder.

  • Zero-byte files: Check Content-Length before writing.
  • Corrupt files: Enforce Content-Type: image/* validation to stop HTML error pages from saving as .jpg.
  • Duplicate 1x1 files: Filter out URLs matching generic placeholder patterns.

Wrap your requests in a retry loop with exponential backoff to reduce failure rates without triggering rate limiting.
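A minimal sketch of that retry loop, wrapping the download_image helper from above. The delays and attempt count are illustrative and should be tuned per site.

import random
import time

def download_with_retry(url, filepath, session, max_attempts=4):
    for attempt in range(1, max_attempts + 1):
        try:
            download_image(url, filepath, session)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                print(f"Giving up on {url}: {exc}")
                return False
            # 1s, 2s, 4s... plus jitter so parallel workers do not sync up.
            time.sleep(2 ** (attempt - 1) + random.uniform(0, 0.5))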

Step 4: Handle common Python image scraper traps

Beware of 1x1 transparent placeholders and Base64 encoded strings. Check custom attributes for real URLs and decode Base64 data locally to avoid HTTP errors.

Real-world image scraping is messy. Learn to recognize these common failure states.

The 1x1 pixel placeholder trap

If your scraper successfully downloads 500 files, but they are all identical 1-kilobyte transparent squares, you hit the placeholder trap. Websites inject these tiny images into the src attribute to maintain layout structure until JavaScript loads the real asset.

Inspect the element. Find the data-src or data-original attribute holding the actual URL. Adjust your parser to target that attribute first.

The Base64 trap and the padding fix

Images formatted as data:image/jpeg;base64,/9j/4AAQ... are Base64-encoded directly into the HTML. You cannot download them via HTTP requests. You must decode the text locally.

Standard decoders fail here when the string has lost its trailing = padding or when + characters were converted to spaces during URL parsing.

import base64

def save_base64_image(data_uri, filepath):
    header, encoded = data_uri.split(",", 1)
    encoded = encoded.replace(" ", "+")
    
    padding_needed = len(encoded) % 4
    if padding_needed:
        encoded += "=" * (4 - padding_needed)
        
    image_data = base64.b64decode(encoded)
    with open(filepath, "wb") as f:
        f.write(image_data)

Why you only downloaded thumbnails

E-commerce and news sites rarely place the highest-quality image in the basic src attribute. They reserve it for a lightbox overlay or a srcset array.

If the downloaded image is blurry, look at the DOM tree surrounding the <img> tag. Search for a parent <picture> element, a <source> tag, or an adjacent hidden link containing the high-resolution asset.
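If the asset sits in a parent <picture> element, a helper along these lines can recover it. This sketch assumes the common <picture><source srcset=...><img></picture> markup and reuses get_highest_res_from_srcset from Step 2.

def get_picture_source_url(img_tag):
    # Walk up to the enclosing <picture>, if any.
    picture = img_tag.find_parent("picture")
    if not picture:
        return None
    # Each <source> may carry its own srcset of high-res candidates.
    for source in picture.find_all("source"):
        srcset = source.get("srcset")
        if srcset:
            return get_highest_res_from_srcset(srcset)
    return None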

Step 5: Scrape images from dynamic websites using Playwright

When images require complex JavaScript rendering or scroll events to load, escalate to Playwright to simulate browser interactions or intercept XHR network traffic directly.

When the real image URL is buried behind complex JavaScript rendering, scroll events, or user interactions, escalate to browser automation. Playwright interfaces directly with modern browsers via WebSockets, executing scripts natively and asynchronously, making it a powerful choice for new Python scraping pipelines.

Render, scroll, and wait for real image URLs

Dynamic galleries load images continuously as you scroll. This pattern opens the page, scrolls incrementally to trigger the lazy-load scripts, waits for the placeholders to vanish, and collects the finalized URLs.

from playwright.sync_api import sync_playwright
import time

def scrape_dynamic_images(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Scroll in steps so each batch of lazy-load scripts fires.
        for _ in range(5):
            page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
            time.sleep(1)  # crude wait; tune or replace with explicit waits

        # Read the finalized src values after JavaScript has swapped them in.
        images = page.eval_on_selector_all(
            "img", "elements => elements.map(el => el.src)"
        )
        browser.close()
        # Drop placeholders and data URIs; keep only absolute HTTP(S) URLs.
        return [img for img in images if img.startswith("http")]

Intercept XHR or Fetch responses

For API-driven React or Vue applications, reading the DOM is inefficient. Use Playwright to intercept the underlying network traffic and pull URLs directly from the JSON responses.

from playwright.sync_api import sync_playwright

def intercept_api_images(url):
    urls = []

    # "api/v1/gallery" and "high_res_image_url" are illustrative; find the
    # real endpoint and JSON keys in your browser's Network tab.
    def handle_response(response):
        if "api/v1/gallery" in response.url and response.status == 200:
            data = response.json()
            for item in data.get('items', []):
                urls.append(item['high_res_image_url'])

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Register the listener before navigation so no response is missed.
        page.on("response", handle_response)
        page.goto(url)
        page.wait_for_load_state("networkidle")
        browser.close()
    return urls

How do I scrape images from a dynamic website with Selenium?

If you are locked into Selenium, you can execute the scroll-and-wait pattern using driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"). However, migrating to Playwright eliminates the need for manual webdriver binaries and brittle explicit waits.
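For completeness, here is a minimal Selenium 4 version of the scroll-and-wait pattern; the loop count and sleep interval are rough placeholders.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

def scrape_images_selenium(url):
    driver = webdriver.Chrome()  # Selenium Manager fetches the driver binary
    try:
        driver.get(url)
        for _ in range(5):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(1)  # crude wait; tune for the target site
        return [el.get_attribute("src") for el in driver.find_elements(By.TAG_NAME, "img")]
    finally:
        driver.quit()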

Step 6: What is the best Python library for scraping images?

Match your tool to the image delivery architecture. Use BeautifulSoup for static DOMs, Playwright for dynamic rendering, and a managed API for Web Application Firewall (WAF) blocks.

There is no single best Python library for scraping images. The optimal choice depends entirely on where the site exposes the data.

| What you see | Where the image lives | Best tool | Common failure |
| --- | --- | --- | --- |
| Standard src | HTML DOM | BeautifulSoup | Hotlinking blocks |
| Incomplete path | HTML DOM | urljoin | Forgetting protocol prefixes |
| Lazy attribute | data-src / data-original | BeautifulSoup | Scraping the 1x1 placeholder |
| Responsive image | srcset | String parsing | Regex complexity |
| CSS background | style attribute | Regex | Incorrect regex syntax |
| Base64 string | data:image/... | base64.b64decode | Incorrect padding error |
| Empty HTML shell | API/JSON/GraphQL | Network tab / XHR | Dynamic tokens required |
| WAF block (403) | Protected server | Managed API | Basic header spoofing fails |

Step 7: When your Python image scraping script stops scaling

Manual scripts break as volume grows. Shift from local browser automation to an infrastructure API like Olostep to bypass bot protection and handle mass extraction seamlessly.

Local scripts break as project requirements grow. Maintainability plummets when you transition from scraping one tutorial page to thousands of production URLs.

Signs your script is becoming brittle

You have outgrown manual browser choreography if you experience:

  • Constant adjustments to CSS selectors due to A/B testing.
  • Memory leaks from long-running browser instances.
  • IP bans across thousands of URLs.
  • Difficulty handling complex pagination loops.
  • Excessive time spent maintaining infrastructure instead of analyzing data.

Use Olostep for reliable dynamic extraction

When manual scripting turns into an arms race with anti-bot systems or complex JavaScript rendering, shift the execution burden to an infrastructure layer.

Olostep provides a Scrape Endpoint at /v1/scrapes that processes dynamic sites, executes specific actions (like wait, click, fill_input, and scroll), and returns clean outputs in real time. Rather than maintaining local Playwright instances, you can request the URL and receive structured Markdown, HTML, text, screenshots, or JSON. This handles JS-rendered content and PDF parsing natively.
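A call might look roughly like the sketch below. The field names and response shape here are illustrative assumptions; confirm them against the current Olostep API reference.

import requests

API_KEY = "YOUR_OLOSTEP_API_KEY"  # placeholder

response = requests.post(
    "https://api.olostep.com/v1/scrapes",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url_to_scrape": "https://example.com/gallery",  # placeholder target
        "formats": ["html", "markdown"],                 # assumed field names
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())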

Olostep Batch and Parsers for large data jobs

High-volume workflows require asynchronous processing. The Olostep /v1/batches endpoint scales to process up to 10,000 URLs per batch, which makes it well suited to massive data ingestion tasks.

When recurring workflows demand precise structured data—like mapping product names to specific image URLs across an entire catalog—the Parser framework applies predefined or custom schemas to extract standardized JSON.

Step 8: Scrape ethically and keep your pipeline robust

Always check robots.txt and copyright terms before scraping. Add structured logging, idempotent checkpointing, and WebP compression to keep your scraper robust and storage costs low.

Protect your target site's operations and your local storage limits.

Respect robots.txt and licensing

Just because you can fetch an image does not mean you can lawfully reuse it. Check the site's robots.txt file and operational terms. Verify the licensing rights of the imagery before using scraped assets in commercial projects or public datasets. Implement polite request pacing to avoid overwhelming the host server.

Logging, retries, and rerun safety

A scraper will fail eventually. Make sure it can resume safely. Use structured logs to track which URLs succeed or fail. Implement checkpointing and idempotent naming conventions (using hashes or unique IDs) so you can restart a crashed script without downloading the entire dataset twice.
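One way to get idempotent naming is to derive the filename from a hash of the URL itself. A small sketch (the helper name and default extension are illustrative):

import hashlib
import os

def url_to_filepath(url, download_dir, extension=".jpg"):
    # The same URL always maps to the same path, so a rerun can skip
    # files that already exist without tracking state elsewhere.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(download_dir, f"{digest}{extension}")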

Compress scraped images to WebP

Uncompressed JPEGs and PNGs bloat storage costs rapidly for machine learning or analytics datasets. Convert scraped images to WebP immediately using the Pillow library to reduce file size by roughly 25% to 34% without noticeable quality loss, according to the WebP Compression Study.

from PIL import Image

def convert_to_webp(source_path, target_path):
    with Image.open(source_path) as img:
        # quality=80 is a common balance of file size and visual fidelity.
        img.save(target_path, "webp", quality=80)

Build your pipeline in this order

To build a reliable pipeline and successfully scrape images from a website using Python, stop writing code blindly and follow this structured sequence:

  1. Inspect the delivery pattern: Check the HTML source, Elements tab, and Network traffic.
  2. Extract the right source: Target data-src, srcset, or XHR payloads over the basic src.
  3. Normalize or decode: Resolve relative paths and apply padding fixes to Base64 data.
  4. Download politely: Use chunked streaming, backoff, and strict validation.
  5. Escalate only when needed: Add Playwright solely for JS-heavy flows.
  6. Use infrastructure at scale: If you need repeated extraction across many dynamic URLs, automate the heavy lifting with a managed API.

About the Author

Aadithyan Nair

Founding Engineer, Olostep · Dubai, AE

Aadithyan is a Founding Engineer at Olostep, focusing on infrastructure and GTM. He's been hacking on computers since he was 10 and loves building things from scratch (including custom programming languages and servers for fun). Before Olostep, he co-founded an ed-tech startup, did first-author ML research at NYU Abu Dhabi, and shipped AI tools at Zecento and RAEN AI.
