Olostep Crawl endpoint

Crawl many pages. Retrieve content on demand.

The Crawl endpoint discovers and processes subpages under your rules; then you pull HTML or markdown per page via /v1/retrieve.

How it works

POST a crawl with start_url and limits.

Track status until completed.

List pages and call /v1/retrieve per page.
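The three steps above can be sketched end to end in Python (stdlib only). The endpoint paths come from this page; the API base URL, Bearer-token auth, and the response field names (`pages`, `retrieve_id`) are assumptions for illustration, so check the API reference before relying on them.

```python
# Hypothetical end-to-end sketch of the three steps above.
# Assumed, not confirmed by this page: the base URL, Bearer-token
# auth, and response field names ("pages", "retrieve_id").
import json
import time
import urllib.request

BASE = "https://api.olostep.com"   # assumed base URL
API_KEY = "YOUR_API_KEY"           # assumed auth scheme


def build_crawl_payload(start_url, max_pages=100,
                        include_urls=("/**",), exclude_urls=()):
    """Build the POST /v1/crawls body with start_url and limits."""
    return {
        "start_url": start_url,
        "max_pages": max_pages,
        "include_urls": list(include_urls),
        "exclude_urls": list(exclude_urls),
    }


def call(method, path, body=None):
    """Minimal JSON-over-HTTP helper for the endpoints named above."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode() if body else None,
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def crawl_site(start_url):
    crawl = call("POST", "/v1/crawls", build_crawl_payload(start_url))  # step 1
    while crawl["status"] != "completed":                               # step 2
        time.sleep(5)
        crawl = call("GET", f"/v1/crawls/{crawl['id']}")
    return call("GET", f"/v1/crawls/{crawl['id']}/pages")               # step 3
```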

Trusted by teams worldwide

Merchkit
Podqi
Khoj
Finny AI
Contents
Athena HQ
CivilGrid
GumLoop
Plots
Uman
Verisave
Relay
OpenMart
Profound
Centralize
Use Bear

From start URL to many pages

Orchestrate multi-page extraction without building your own crawler.

Step 1

Start a crawl

POST start_url with include_urls, exclude_urls, max_pages, and optional max_depth.

Step 2

Poll or webhook

Poll GET /v1/crawls/{id} until status is completed, or pass webhook_url for a push notification.

Step 3

List pages & retrieve

Paginate GET /v1/crawls/{id}/pages, then fetch each retrieve_id via /v1/retrieve for markdown or HTML.
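If you poll rather than use webhook_url in Step 2, backing off between status checks keeps the loop polite. The schedule below is an illustrative choice, not something the API mandates.

```python
# Illustrative polling schedule for Step 2: back off exponentially
# between GET /v1/crawls/{id} status checks, capped at a maximum.
# These exact delays are an assumption, not an API requirement.
def poll_delay(attempt, base=2.0, cap=30.0):
    """Seconds to sleep before poll number `attempt` (0-indexed)."""
    return min(cap, base * (2 ** attempt))

# First few delays: 2, 4, 8, 16, 30, 30, ... seconds
```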

POST /v1/crawls

{
  "start_url": "https://example.com",
  "max_pages": 100,
  "include_urls": ["/**"],
  "exclude_urls": ["/admin/**"]
}

Built for multi-page jobs

The /v1/crawls API fits between one-off scrapes and huge batch URL lists.

Bounded exploration

Control depth and max_pages so crawls stay predictable.

Webhooks

Get a push notification at an optional webhook_url when the crawl completes.

Cursor pagination

Stream pages while the crawl is in progress or after completion.

Retrieve pipeline

Pair crawls with /v1/retrieve for markdown, html, or json formats.
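A small builder can validate the format choice before calling /v1/retrieve. The three format names come from this page; the parameter names (`retrieve_id`, `formats`) are assumptions about the request shape.

```python
# Hypothetical request builder for /v1/retrieve. The format names
# ("markdown", "html", "json") come from this page; the parameter
# names are assumptions about the API's shape.
VALID_FORMATS = {"markdown", "html", "json"}


def build_retrieve_params(retrieve_id, formats=("markdown",)):
    """Build query params for /v1/retrieve, rejecting unknown formats."""
    unknown = set(formats) - VALID_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return {"retrieve_id": retrieve_id, "formats": list(formats)}
```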

What you can build

Mirror sites, build datasets, and power research agents.

Documentation mirrors

Pull every docs page for offline search or RAG.

E-commerce catalogs

Crawl product sections with include patterns.

Compliance archives

Snapshot many pages under a domain with audit metadata.

Multi-page extraction

Start a crawl. Poll. List pages.

Use webhooks for completion signals; paginate pages with cursor and limit.
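Cursor pagination with `cursor` and `limit` can be wrapped in a generator. Here `fetch_page` stands in for the real GET /v1/crawls/{id}/pages call, and the response keys (`pages`, `next_cursor`) are assumed names for illustration.

```python
# Sketch of cursor pagination over GET /v1/crawls/{id}/pages using
# the `cursor` and `limit` parameters. `fetch_page` stands in for the
# real HTTP call; the response keys ("pages", "next_cursor") are
# assumptions about the payload shape.
def iter_pages(fetch_page, limit=50):
    """Yield page records across all cursor-paginated responses."""
    cursor = None
    while True:
        resp = fetch_page(cursor=cursor, limit=limit)
        yield from resp["pages"]
        cursor = resp.get("next_cursor")
        if not cursor:  # no further cursor means we reached the end
            break
```

Because the generator yields as each batch arrives, it works equally well while the crawl is still in progress or after completion.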

Request
POST /v1/crawls
{
  "start_url": "https://sugarbooandco.com",
  "max_pages": 100,
  "include_urls": ["/**"],
  "exclude_urls": ["/collections/**"],
  "include_external": false
}
Response
200 OK
{
  "id": "crawl_…",
  "object": "crawl",
  "status": "in_progress",
  "start_url": "https://sugarbooandco.com",
  "pages_count": 0
}

Frequently asked questions

Everything you need to know about the Crawl endpoint.

What is the Crawl endpoint for?

It walks a site (within your limits) and returns page records you can later retrieve with /v1/retrieve — ideal for multi-page extraction without orchestrating every URL yourself.

How do I get page content?

List pages from /v1/crawls/{id}/pages, then call /v1/retrieve with each page's retrieve_id. Prefer retrieve over deprecated inline content fields on pages.
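The mapping from listed pages to retrieve calls is a one-liner. The `retrieve_id` field name comes from this page; the rest of the page-record shape is assumed.

```python
# Tiny helper for the answer above: collect the retrieve_id values
# you pass to /v1/retrieve. The field name comes from this page; the
# rest of the record shape is an assumption.
def retrieve_ids(pages):
    """Collect retrieve_id values from /v1/crawls/{id}/pages records."""
    return [p["retrieve_id"] for p in pages if p.get("retrieve_id")]
```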

Can I filter which URLs are crawled?

Yes. Use include_urls and exclude_urls globs, search_query, top_n, and related parameters as documented.
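To illustrate how the globs select URLs, here is a toy matcher that mimics the semantics implied by the examples on this page (a trailing `/**` matches a prefix and everything below it). Olostep's actual matcher may differ, so treat this purely as a mental model.

```python
# Illustration of include_urls/exclude_urls glob selection. This
# mimics the semantics implied by the examples on this page; the
# real matcher may differ.
from fnmatch import fnmatch


def glob_match(path, pattern):
    """Match a URL path against one glob pattern."""
    if pattern.endswith("/**"):
        # Treat a trailing "/**" as "this prefix and anything below it".
        prefix = pattern[:-3]
        return path == prefix or path.startswith(prefix + "/")
    return fnmatch(path, pattern)


def is_crawled(path, include=("/**",), exclude=()):
    """Exclude patterns win; otherwise any include pattern admits the path."""
    if any(glob_match(path, p) for p in exclude):
        return False
    return any(glob_match(path, p) for p in include)
```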

How much does crawling cost?

Crawl is billed per page crawled (see current pricing in the docs).

Start crawling with Olostep

500 free credits to try — no credit card required.