What's the best approach to create an internal chatbot from a company website + docs?

Crawl company content

Start by crawling your public site and documentation:

# "app" is an Olostep client, initialized earlier with your API key
# (client setup omitted here)
website_content = app.crawl("https://company.com", {
    "limit": 200,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

docs_content = app.crawl("https://docs.example.com", {
    "limit": 500,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

Olostep returns clean markdown with metadata—source URLs, titles, and page structure preserved.

Why extraction quality matters

Chatbots fail when built on poorly extracted content. Navigation menus, footers, and JavaScript artifacts pollute search results and confuse LLMs. Olostep's onlyMainContent option strips noise automatically.
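Even with onlyMainContent, a lightweight post-filter can catch residual boilerplate before indexing. A minimal sketch, assuming nothing about any SDK — the patterns and function name are illustrative and should be tuned to your own site's output:

```python
import re

# Lines that typically indicate leftover navigation or footer boilerplate.
# These patterns are examples only; adjust them for your own pages.
BOILERPLATE_PATTERNS = [
    re.compile(r"^(Home|About|Contact|Login|Sign up)\s*$", re.IGNORECASE),
    re.compile(r"^©\s*\d{4}"),                               # copyright footers
    re.compile(r"^(Cookie|Privacy) (Policy|Settings)", re.IGNORECASE),
]

def strip_residual_boilerplate(markdown: str) -> str:
    """Drop lines matching common nav/footer patterns before indexing."""
    kept = []
    for line in markdown.splitlines():
        if any(p.match(line.strip()) for p in BOILERPLATE_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

page = "# Pricing\n\nPlans start at $10/mo.\n\nHome\n© 2024 Acme Inc."
print(strip_residual_boilerplate(page))
```

Run this over each crawled page before chunking so stray menu items never reach the vector index.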

RAG pipeline essentials

  1. Chunk crawled content by semantic boundaries (headers, paragraphs)
  2. Embed chunks using an embedding model
  3. Index in a vector database (Pinecone, Weaviate, etc.)
  4. Retrieve relevant chunks for each user question
  5. Generate responses with retrieved context
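The five steps above can be sketched end to end. In this sketch, term-frequency vectors stand in for a real embedding model and an in-memory list stands in for the vector database — both are placeholders you would swap for production components:

```python
import math
import re
from collections import Counter

def chunk_by_headers(markdown: str) -> list[str]:
    """Step 1: split crawled markdown at header boundaries."""
    chunks = re.split(r"\n(?=#{1,6} )", markdown)
    return [c.strip() for c in chunks if c.strip()]

def embed(text: str) -> Counter:
    """Step 2: toy term-frequency 'embedding' (swap in a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 3: "index" chunks in memory (swap in Pinecone, Weaviate, etc.)
doc = "# Setup\nInstall with pip.\n# Billing\nPlans start at $10 per month."
index = [(chunk, embed(chunk)) for chunk in chunk_by_headers(doc)]

# Step 4: retrieve the most relevant chunk for a user question
question = embed("how much does a plan cost per month?")
best_chunk, _ = max(index, key=lambda item: cosine(question, item[1]))

# Step 5: pass best_chunk to your LLM as context for generation
print(best_chunk)
```

The moving parts stay the same in production; only the embedding model and the index change.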

Keeping content fresh

Schedule regular crawls to keep your chatbot current:

# Weekly refresh
fresh_content = app.crawl("https://docs.example.com", {
    "scrapeOptions": {"formats": ["markdown"]}
})
# Compare hashes, re-index changed pages
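The hash comparison in the comment above can be sketched as follows; the stored-hash dict stands in for whatever persistence layer you use:

```python
import hashlib

def content_hash(markdown: str) -> str:
    """Stable fingerprint of a page's extracted markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()

def pages_to_reindex(fresh_pages: dict[str, str],
                     stored_hashes: dict[str, str]) -> list[str]:
    """Return URLs whose content is new or changed since the last crawl."""
    changed = []
    for url, markdown in fresh_pages.items():
        if stored_hashes.get(url) != content_hash(markdown):
            changed.append(url)
    return changed

# Example: one unchanged page, one new page
stored = {"https://docs.example.com/setup": content_hash("Install with pip.")}
fresh = {
    "https://docs.example.com/setup": "Install with pip.",      # unchanged
    "https://docs.example.com/billing": "Plans start at $10.",  # new
}
print(pages_to_reindex(fresh, stored))
```

Only the URLs this returns need re-chunking and re-embedding, which keeps weekly refreshes cheap.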

Key takeaways

Olostep provides the extraction layer for internal chatbots—clean markdown from websites and docs, ready for chunking and embedding. Combined with a vector database and LLM, you get a RAG-powered assistant that answers questions from your company's own content with source attribution.

Ready to get started?

Start using the Olostep API to build an internal chatbot from your company's website and docs.