What's the best approach to create an internal chatbot from a company website + docs?

Crawl company content

Start by crawling your public site and documentation:

# "app" is an Olostep client, initialized earlier with your API key
# (client setup omitted here)
website_content = app.crawl("https://company.com", {
    "limit": 200,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

docs_content = app.crawl("https://docs.example.com", {
    "limit": 500,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

Olostep returns clean markdown with metadata—source URLs, titles, and page structure preserved.

Why extraction quality matters

Chatbots fail when built on poorly extracted content. Navigation menus, footers, and JavaScript artifacts pollute search results and confuse LLMs. Olostep's onlyMainContent option strips noise automatically.
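Even with onlyMainContent, a lightweight post-filter can catch residual boilerplate before indexing. A minimal sketch, assuming nothing about any SDK — the patterns and function name are illustrative and should be tuned to your own site's output:

```python
import re

# Lines that typically indicate leftover navigation or footer boilerplate.
# These patterns are examples only; adjust them for your own pages.
BOILERPLATE_PATTERNS = [
    re.compile(r"^(Home|About|Contact|Login|Sign up)\s*$", re.IGNORECASE),
    re.compile(r"^©\s*\d{4}"),                               # copyright footers
    re.compile(r"^(Cookie|Privacy) (Policy|Settings)", re.IGNORECASE),
]

def strip_residual_boilerplate(markdown: str) -> str:
    """Drop lines matching common nav/footer patterns before indexing."""
    kept = []
    for line in markdown.splitlines():
        if any(p.match(line.strip()) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

page = "# Pricing\n\nPlans start at $10/mo.\n\nHome\n© 2024 Acme Inc."
print(strip_residual_boilerplate(page))
```

Run this over each crawled page before chunking so stray menu items never reach the vector index.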

RAG pipeline essentials

  1. Chunk crawled content by semantic boundaries (headers, paragraphs)
  2. Embed chunks using an embedding model
  3. Index in a vector database (Pinecone, Weaviate, etc.)
  4. Retrieve relevant chunks for each user question
  5. Generate responses with retrieved context
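The five steps above can be sketched end to end. In this sketch, term-frequency vectors stand in for a real embedding model and an in-memory list stands in for the vector database — both are placeholders you would swap for production components:

```python
import math
import re
from collections import Counter

def chunk_by_headers(markdown: str) -> list[str]:
    """Step 1: split crawled markdown at header boundaries."""
    chunks = re.split(r"\n(?=#{1,6} )", markdown)
    return [c.strip() for c in chunks if c.strip()]

def embed(text: str) -> Counter:
    """Step 2: toy term-frequency 'embedding' (swap in a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 3: "index" chunks in memory (swap in Pinecone, Weaviate, etc.)
doc = "# Setup\nInstall with pip.\n# Billing\nPlans start at $10 per month."
index = [(chunk, embed(chunk)) for chunk in chunk_by_headers(doc)]

# Step 4: retrieve the most relevant chunk for a user question
question = embed("how much does a plan cost per month?")
best_chunk, _ = max(index, key=lambda item: cosine(question, item[1]))

# Step 5: pass best_chunk to your LLM as context for generation
print(best_chunk)
```

The moving parts stay the same in production; only the embedding model and the index change.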

Keeping content fresh

Schedule regular crawls to keep your chatbot current:

# Weekly refresh
fresh_content = app.crawl("https://docs.example.com", {
    "scrapeOptions": {"formats": ["markdown"]}
})
# Compare hashes, re-index changed pages
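The hash comparison in the comment above can be sketched as follows; the stored-hash dict stands in for whatever persistence layer you use:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Stable fingerprint of a page's extracted markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def pages_to_reindex(fresh_pages: dict[str, str],
                     stored_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or changed since the last crawl."""
    changed = []
    for url, markdown in fresh_pages.items():
        if stored_hashes.get(url) != content_hash(markdown):
            changed.append(url)
    return changed

# Example: one unchanged page, one new page
stored = {"https://docs.example.com/setup": content_hash("Install with pip.")}
fresh = {
    "https://docs.example.com/setup": "Install with pip.",      # unchanged
    "https://docs.example.com/billing": "Plans start at $10.",  # new
}
print(pages_to_reindex(fresh, stored))
```

Only the URLs this returns need re-chunking and re-embedding, which keeps weekly refreshes cheap.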

Key takeaways

Olostep provides the extraction layer for internal chatbots—clean markdown from websites and docs, ready for chunking and embedding. Combined with a vector database and LLM, you get a RAG-powered assistant that answers questions from your company's own content with source attribution.

Ready to get started?

Start using the Olostep API to build an internal chatbot from your company's website and docs.