Web scraping in R no longer requires juggling brittle Selenium scripts. You just need to match the right extraction tool to the target site's architecture. Use rvest for static HTML, read_html_live() for JavaScript-rendered pages, and httr2 for hidden APIs.
How do you scrape data from a website using R?
To scrape data from a website using R, inspect the page architecture first. If the data is in the source code, use rvest and read_html() to parse the static HTML. If the page relies on JavaScript, use read_html_live() to load the rendered DOM. For hidden APIs, extract the JSON payload directly using the httr2 package.
This modern approach forms a decision system: analyze the page structure, then pick the exact tool built for that layer. The modern stack is broader than a static-only mental model: rvest's live scraping interface, backed by chromote, covers pages that generate their HTML with JavaScript.
Without that middle layer, the workflow collapses into a false choice between simple HTML parsing and heavy browser automation. That is why a script can work on one site and return empty nodes on the next: the underlying architecture shifts from static delivery to a rendered DOM.
The modern decision matrix
- Static HTML: use rvest with read_html().
- Dynamic JS pages: use rvest with read_html_live().
- Direct APIs: use httr2.
- Massive or blocked scrapes: hand off to production infrastructure like Olostep.
R web scraping tools: the modern stack
Modern data extraction relies on a stack of four packages rather than a single solution. Installing rvest alone does not provide a complete data pipeline.
What package is used for web scraping in R?
The primary package used for web scraping in R is rvest, which handles HTML parsing and live browser simulation. However, a production stack also requires httr2 for direct API requests and network control, alongside polite to manage session etiquette, rate limits, and robots.txt compliance.
- rvest (read_html): best for static HTML. Fails on JavaScript-rendered data.
- rvest (read_html_live): best for JavaScript-rendered DOMs. Requires a local browser footprint via chromote.
- httr2: best for APIs and request control. Use for JSON endpoints.
- polite: best for session etiquette. Obeys robots.txt strictly.
What is rvest in R?
The rvest package is the core HTML parsing layer in R. It allows you to read an HTML document, select specific CSS or XPath elements, and extract attributes, text, and tables directly into structured data frames.
When target data exists in the raw HTML payload sent by the server, the read_html() function remains the fastest extraction method available in R.
What are alternatives to rvest in R?
When rvest cannot fulfill your data extraction requirements, the best alternatives in R include httr2 for scraping direct API JSON payloads, chromote or selenider for heavy browser automation, and external infrastructure like Olostep to bypass strict anti-bot systems at scale.
Older workflows often suggested Rcrawler for scaling, but that package was removed from CRAN on April 4, 2025, as recorded by CRANberries. Additionally, many legacy scripts rely on the original httr package in R, but httr2 is now the modern standard for HTTP requests.
How to scrape data from a website using R: an rvest tutorial
This rvest tutorial uses a stable target—the standard books.toscrape.com sandbox—to walk you from a raw URL to a clean tibble.
- Always inspect the page architecture in your browser before coding.
- Prefer IDs and semantic tags over brittle, auto-generated CSS classes.
- Transform raw extracted strings into typed, validated data frames immediately.
Inspect the page before you code
Inspect the page architecture before writing any code to extract data from a website in R. The Chrome DevTools Network panel records requests and reveals exactly how the page loads.
Right-click the target page and select "View Page Source". If the data exists in the source code, use static HTML scraping. If the data is missing from the source but visible on the screen, inspect the live DOM using DevTools. Filter the Network panel by "Fetch/XHR". If a JSON payload contains your data, scrape the API.
Pick stable selectors
Target stable structural markers. Prefer IDs and semantic HTML tags over auto-generated CSS classes. Use XPath when you need to traverse the document tree upwards or select elements based on exact text content. You can use SelectorGadget to find a CSS selector visually, but always verify its stability in the inspector.
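The contrast between stable and brittle selectors can be shown offline. This is a minimal sketch against an inline HTML fragment (the class name and paths are illustrative, not from a real site):

```r
library(rvest)

# Inline fragment standing in for a real page
html <- minimal_html('
  <div id="catalog">
    <span class="css-x9f3k2">Auto-generated class: avoid targeting this</span>
    <a href="/item/1">Details</a>
  </div>
')

# Stable: anchor on the semantic id, then the tag
html |> html_element("#catalog a") |> html_attr("href")

# XPath: select by exact text content, which CSS selectors cannot do
html |> html_element(xpath = "//a[text()='Details']") |> html_text2()
```

If the site redesign renames `css-x9f3k2`, the id- and text-based selectors above keep working.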
Read and scrape HTML in R
When data sits in static HTML, read_html() acts as the entry point to scrape HTML in R. Extract individual matches with html_element() and collect all matches with html_elements().
The minimal working pattern involves reading the document, targeting a repeating container, and extracting fields from each container.
library(rvest)
url <- "http://books.toscrape.com/catalogue/category/books/science_22/index.html"
page <- read_html(url)
# Select the container holding each book
books <- page |> html_elements(".product_pod")
# Extract the specific title element within each container
titles <- books |>
  html_element("h3 a") |>
  html_attr("title")

Prefer html_element() inside purrr::map() or iterative loops: it returns exactly one match per container, keeping the output aligned with the input. rvest 1.0.0 renamed html_node() to html_element(), and using the modern names prevents future deprecation warnings.
Extract text, attributes, links, and tables
Different data types require distinct extraction functions. Use html_text2() to extract text exactly as it appears in a web browser, preserving line breaks. Use html_attr() to extract metadata, such as href values for links or src for images. Use html_table() to parse perfectly structured <table> elements directly into data frames.
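Of these, html_table() is easy to exercise offline. A minimal sketch against an inline fragment (the titles and prices are illustrative):

```r
library(rvest)

# Inline <table> standing in for a page fragment
html <- minimal_html('
  <table>
    <tr><th>Title</th><th>Price</th></tr>
    <tr><td>A Light in the Attic</td><td>£51.77</td></tr>
    <tr><td>Tipping the Velvet</td><td>£53.74</td></tr>
  </table>
')

# html_table() reads the header row and returns a tibble directly
books_tbl <- html |> html_element("table") |> html_table()
```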
prices <- books |>
  html_element(".price_color") |>
  html_text2()

links <- books |>
  html_element("h3 a") |>
  html_attr("href")

Turn raw extraction into tidy data
Raw vectors are not analysis-ready. Immediately transform scraped strings into a validated tibble.
Convert currency strings to numeric types. Standardize column names. Drop duplicate rows to prevent downstream pipeline errors. Count your rows and check for missing values to validate that your selectors accurately captured the intended targets.
library(tibble)
library(dplyr)
library(stringr)
clean_books <- tibble(
  title = titles,
  price = prices,
  url = links
) |>
  mutate(
    price = str_remove(price, "£") |> as.numeric(),
    url = paste0("http://books.toscrape.com/catalogue/", url)
  ) |>
  filter(!is.na(title))

How to scrape dynamic websites in R
Dynamic websites generate their content via JavaScript after the initial page load. The standard read_html() function parses only the initial HTML shell, returning empty nodes.
How to scrape dynamic websites in R?
To scrape dynamic websites in R, use the read_html_live() function from the rvest package. It runs a background headless browser via the chromote package, allowing R to execute JavaScript, wait for the DOM to populate, and simulate user interactions like scrolling and clicking.
- Switch to read_html_live() if static selectors return empty sets.
- Live HTML handles infinite scrolls and JS-rendered tables natively.
- Avoid escalating to Selenium unless forced by complex login workflows.
Spot JavaScript-rendered pages fast
You face a JavaScript-rendered page if target elements return empty node sets in R, despite being visible in the browser. Secondary signals include missing tables in the raw page source, content that appears only after scrolling, or form submissions that update the UI without changing the URL.
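The empty-node symptom can be reproduced offline. This sketch parses the kind of "app shell" a server might send for a JS-rendered page (the selector and script name are illustrative):

```r
library(rvest)

# The shell a server might send: the container exists but is empty,
# because the content is injected client-side by JavaScript
shell <- minimal_html('<div id="app"></div><script src="bundle.js"></script>')

nodes <- shell |> html_elements("#app .quote")
length(nodes)  # 0, despite content being visible in a real browser

# Decision rule: empty node set + visible content => switch to read_html_live()
needs_live_dom <- length(nodes) == 0
```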
Use read_html_live() with a live browser
The read_html_live() function returns a LiveHTML object. This architecture permits R to execute JavaScript and trigger methods like $click(), $type(), $scroll_to(), and $view().
library(rvest)
# Starts a live headless browser session
live_page <- read_html_live("http://quotes.toscrape.com/scroll")
# Scroll to trigger dynamic loading
live_page$scroll_to(top = 1000)
Sys.sleep(2) # Give the JS time to fetch and render
# Extract from the now-populated DOM
quotes <- live_page |>
  html_elements(".quote .text") |>
  html_text2()

Know when browser automation is the wrong step
Do not jump blindly to Selenium-class tools when you encounter JavaScript. Live-browser scraping is computationally expensive. Always check if the JavaScript simply fetches data from a hidden API. Escalating to heavy browser automation is an architectural error if a clean JSON endpoint is available.
Can R scrape data from APIs?
Yes, R can scrape data from APIs natively using the httr2 package. Scraping hidden backend APIs is significantly faster and more reliable than parsing HTML. Find the API endpoint in your browser's Network tab, rebuild the HTTP request in R, and parse the JSON payload into a data frame using jsonlite.
This R scraping API approach avoids brittle CSS selectors completely.
- APIs provide the cleanest, most resilient data extraction path.
- Use the DevTools Fetch/XHR filter to locate backend JSON payloads.
- Leverage httr2 to rebuild headers, query parameters, and handle retries.
Find hidden Fetch/XHR requests in DevTools
Navigate to the Chrome Network panel and click the "Fetch/XHR" filter. Watch for requests returning JSON payloads when you interact with dynamic page elements. Inspect the request headers, pagination parameters, and required cookies. You can right-click the specific request and select "Copy as cURL" to inspect exactly how the browser asked the server for the data.
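That copied cURL command does not have to be rebuilt by hand: httr2 ships a curl_translate() helper that converts it into the equivalent httr2 pipeline. A sketch with a placeholder endpoint:

```r
library(httr2)

# Paste the command copied from DevTools; curl_translate() prints
# ready-to-use httr2 code with the URL, query, and headers reconstructed
curl_translate(
  "curl 'https://api.example.com/data?page=1' -H 'Accept: application/json'"
)
```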
Rebuild the request with httr2
The httr2 package provides the exact control layer required to rebuild that browser request natively. Define the URL, append headers, add query parameters, and handle authentication.
library(httr2)
req <- request("https://api.example.com/data") |>
  req_url_query(page = 1, limit = 50) |>
  req_headers(Accept = "application/json") |>
  req_retry(max_tries = 3) |>
  req_throttle(rate = 1)

resp <- req_perform(req)

Use req_retry() to handle transient failures, req_throttle() to respect server rate limits, and req_cache() to avoid duplicate fetches during development.
Parse and validate JSON in R
Parse the JSON payload to flatten nested fields into a rectangular format. Validate the schema immediately to ensure required columns exist before passing data to your analysis pipeline.
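Schema validation can be sketched offline, using an inline payload in place of a live response body (the field names here are illustrative):

```r
library(jsonlite)
library(tibble)

# Inline payload standing in for resp_body_json() output
payload <- '[{"id": 1, "price": "51.77"}, {"id": 2, "price": "53.74"}]'

data <- fromJSON(payload) |> as_tibble()

# Fail fast if the schema drifted, before data enters the pipeline
required <- c("id", "price")
stopifnot(all(required %in% names(data)))
```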
library(jsonlite)
data <- resp |>
  resp_body_json(simplifyVector = TRUE) |>
  as_tibble()

Automate web scraping R workflows
Moving from a single script to an automated workflow requires pacing, failure handling, and concurrency management.
- Replace manual Sys.sleep() commands with httr2 throttling.
- Bind automated retries to 429 and 503 HTTP status codes.
- Restrict parallel requests strictly to avoid triggering IP bans.
Add throttling, retries, and caching
Ad hoc Sys.sleep() commands fail unpredictably. Bind req_retry() to 429 and 503 HTTP status codes to apply exponential backoff automatically when the server is overwhelmed. Add req_throttle() to enforce a hard request rate ceiling, preventing IP bans.
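The binding to 429 and 503 can be made explicit through req_retry()'s is_transient and backoff arguments. A sketch with a hypothetical endpoint; no request is actually sent:

```r
library(httr2)

req <- request("https://api.example.com/data") |>
  req_retry(
    max_tries = 5,
    # Retry only on rate limiting (429) and server overload (503)
    is_transient = function(resp) resp_status(resp) %in% c(429, 503),
    # Exponential backoff between attempts: 2, 4, 8, 16 seconds
    backoff = function(attempt) 2^attempt
  ) |>
  req_throttle(rate = 30 / 60)  # cap at 30 requests per minute
```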
Run safe parallel jobs
When processing a URL list, req_perform_parallel() sends concurrent requests. Restrict this to moderate-scale tasks. Blasting hundreds of concurrent requests blindly triggers server defenses. Parallel performance requires strict host-level etiquette.
urls <- c(
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3"
)
reqs <- lapply(urls, request)
# Perform safely with a concurrency cap
resps <- req_perform_parallel(reqs, max_active = 2)

Schedule recurring scrapes and store outputs
Decouple the extraction script from your local machine. Use the cronR package locally, or configure GitHub Actions to run your R script on a timer. Write to Parquet files when feeding downstream analytics pipelines, or push data directly into a database via DBI for recurring ingestion.
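A GitHub Actions timer can be sketched as a minimal workflow file; the script name, schedule, and action versions below are assumptions to adapt:

```yaml
# .github/workflows/scrape.yml
name: nightly-scrape
on:
  schedule:
    - cron: "0 3 * * *"   # every day at 03:00 UTC
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - run: Rscript scrape.R
```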
When data extraction in R breaks
Pipelines fail. Fixing them requires mapping the symptom to the correct failure mode.
Fix 403 errors and blocked requests
A 403 Forbidden error often means the server detected a bot. Update your User-Agent string; Best User Agent for Scraping explains why default rvest headers get blocked frequently. Add natural browser headers and manage session cookies accurately. If modifying headers fails, the target has likely deployed anti-bot protections like Cloudflare or Datadome.
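With httr2, swapping in a browser identity is a one-line fix via req_user_agent(). A sketch with a hypothetical target; the UA string and headers are examples, not guarantees against every anti-bot system:

```r
library(httr2)

req <- request("https://example.com/products") |>
  # Identify as a real browser instead of httr2's default UA string
  req_user_agent(paste(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
  )) |>
  req_headers(
    `Accept-Language` = "en-US,en;q=0.9",
    Referer = "https://example.com/"
  )
```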
Fix empty nodes and missing tables
If selectors return empty sets, JavaScript rendering or delayed content loads are the likely culprits. Switch extraction from read_html() to read_html_live(). If a form submission fails to return results, use a live browser session to trigger the underlying XHR request.
Handle selector drift
Websites redesign frequently. When class names change, html_element() fails silently. Combat selector drift by writing structural XPath queries instead of targeting fragile CSS classes. Implement automated alerting that stops the pipeline if a required field suddenly returns NA across a batch.
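The alerting guard can be a few lines of plain R. This sketch (the function name and threshold are illustrative) halts the pipeline when a suspicious share of a batch comes back NA:

```r
# Guard that halts the pipeline when a selector silently drifts
check_selector_health <- function(values, max_na_share = 0.5) {
  na_share <- mean(is.na(values))
  if (na_share > max_na_share) {
    stop("Selector drift suspected: ",
         round(na_share * 100), "% of extracted values are NA")
  }
  invisible(na_share)
}

# A healthy batch passes silently
check_selector_health(c("Book A", "Book B", "Book C"))

# A drifted batch would abort before corrupt rows reach storage:
# check_selector_health(c("Book A", NA, NA, NA))  # stops with an error
```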
Use LLM extraction only as a fallback
Passing raw HTML or screenshots to a Large Language Model (LLM) for extraction is expensive and slow. Deterministic parsers using selectors and APIs are drastically cheaper and more consistent. Reserve LLM extraction solely for layouts that change too frequently for CSS selectors to remain economically viable.
When local R scripts stop scaling
R excels at analysis, but local scripts face physical limitations when extraction requirements grow. The goal is not to abandon R, but to recognize when infrastructure maintenance becomes the primary bottleneck.
- Local purrr loops break when hitting strict anti-bot protections at scale.
- Handling headless browsers across thousands of URLs requires dedicated infrastructure.
- Olostep bypasses CAPTCHAs natively and returns structured JSON directly to R via API.
The scale gap
A local loop breaks when encountering JavaScript-heavy pages at scale, complex login flows, or distributed IP requirements. Managing headless browser instances and fighting CAPTCHAs rapidly consumes more engineering time than analyzing the actual data.
Introduce Olostep as the next step
When you outgrow manual scripts, you need production infrastructure. Olostep's Scrapes endpoint bypasses anti-bot systems natively. Olostep Parsers convert repeated target pages into highly structured JSON. You can use Olostep Batches to process vast URL lists asynchronously in one job, returning clean Markdown, HTML, or JSON.
Call Olostep from R with httr2
You do not need to leave R. Call the Olostep API via httr2 and keep your downstream analysis native.
library(httr2)
olo_req <- request("https://api.olostep.com/v1/scrapes") |>
  req_headers(Authorization = "Bearer YOUR_API_KEY") |>
  req_body_json(list(
    url = "https://complex-dynamic-site.com",
    format = "markdown"
  ))

olo_resp <- req_perform(olo_req) |> resp_body_json()

Legal and ethical web scraping
Capability does not equal authorization. The polite package operationalizes permission, pacing, and caching to keep scripts compliant.
Make polite your default wrapper
Make polite your default session wrapper. The bow() function introduces your script to the host, parses the robots.txt file, establishes a respectful crawl delay, and caches results so you never burden the server with duplicate requests.
library(polite)
library(rvest)
session <- bow(
  url = "http://books.toscrape.com",
  user_agent = "DataAnalysisBot - contact@example.com",
  delay = 2
)

page_data <- scrape(session) |>
  html_elements(".product_pod h3 a") |>
  html_text2()

Understand robots.txt and jurisdiction
The robots.txt protocol outlines which paths automated bots are permitted to visit. It is an operational request, not a technical access barrier, but you must respect it strictly to avoid IP bans. Jurisdiction matters: copyright, data privacy laws, and Terms of Service vary drastically by region. Bypassing authentications carries severe legal risk.
Is R good for web scraping?
R is exceptionally good for web scraping when the data extraction process feeds directly into statistical analysis, data cleaning, or visualization pipelines. The seamless integration between rvest, httr2, and the tidyverse allows data scientists to scrape, clean, and plot data in a single, reproducible script.
Where R wins vs. Python
R dominates the analysis-first workflow. Python maintains a broader crawling ecosystem (like Scrapy) and handles asynchronous, highly concurrent browser loops natively better than R. Use R for analysis-first workflows and moderate-scale automation. Move to dedicated infrastructure when proxy management and raw concurrency dominate your requirements.
Conclusion: Choose the right next step
The modern web scraping in R workflow relies on identifying your target's architecture:
- Static HTML: use rvest and read_html().
- Rendered DOM: switch to read_html_live().
- Hidden API: rebuild the request with httr2.
- Recurring, large-scale extraction: hand off to production infrastructure like Olostep.
Start with a single static page. Test the same logic on a JavaScript-heavy page, then locate an API endpoint. If requirements escalate to massive batch jobs, leverage the Batch Endpoint to handle the extraction layer while keeping your analysis firmly inside R.
