Build a Website Knowledge Chatbot Using Streamlit, ChromaDB, Olostep, and OpenAI

Ehsan

Have you ever wanted a smart AI assistant that understands your entire website and can answer questions like ChatGPT? In this tutorial, we’ll show you how to build it — without training your own LLM or managing any backend.

We'll use:

  • Olostep to crawl and extract website content
  • ChromaDB to store and search content embeddings with metadata
  • OpenAI GPT-4 and Embeddings API for context-aware Q&A
  • Streamlit to build a live chatbot interface

Perfect for product sites, documentation portals, landing pages, and more.

Here's the gist with the full code if you want to run it on your own: https://gist.github.com/olostep/40b5380e8f5b144d370656c3f5fdbd41


Requirements

Before we dive in, install the required Python packages:

pip install streamlit openai==1.7.6 chromadb requests

You'll also need:

  • An Olostep API key (used to crawl the site and retrieve page content)
  • An OpenAI API key (used for embeddings and GPT-4 answers)

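The snippets below hard-code the keys for readability. In practice you may prefer environment variables; here's a minimal sketch (the variable names are our own convention, not required by either service):

import os

# Export these before running, e.g. `export OLOSTEP_API_KEY=...`
OLOSTEP_API_KEY = os.environ["OLOSTEP_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
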
How It Works

  1. Crawl your website using Olostep (no scraping or parsing code to write)
  2. Extract clean markdown from each page
  3. Use OpenAI to embed each page’s content into vectors
  4. Store vectors + content + metadata in ChromaDB
  5. Ask a question in Streamlit
  6. Query top-matching documents
  7. Use GPT-4 to summarize an answer based on retrieved content

Step-by-Step Implementation

Step 1: Crawl Website with Olostep

We use Olostep's API to start crawling from the homepage:

import time
import requests

OLOSTEP_API_KEY = "YOUR_OLOSTEP_API_KEY"  # or load it from the environment as shown above

def start_crawl(url):
    # Kick off a crawl from the given start URL, capped so the demo stays small
    payload = {
        "start_url": url,
        "include_urls": ["/**"],  # follow every path under the start URL
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": f"Bearer {OLOSTEP_API_KEY}"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    return res.json()["id"]  # crawl ID, used below to poll status and fetch results

To wait for completion:

def wait_for_crawl(crawl_id):
    # Poll the crawl status every 30 seconds until Olostep reports completion
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}",
                           headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"})
        if res.json()["status"] == "completed":
            break
        time.sleep(30)

Get all crawled page IDs:

def get_pages(crawl_id):
    # List every page the crawl captured; each entry carries a retrieve_id and URL
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages",
                       headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"})
    return res.json()["pages"]

Step 2: Clean and Prepare Markdown

Once we retrieve the markdown, we strip its syntax (headings, links, emphasis) so only plain text gets embedded:

import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ ', '', markdown)                   # heading markers
    markdown = re.sub(r'\* ', '', markdown)                   # bullet markers
    markdown = re.sub(r'> ', '', markdown)                    # blockquote markers
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)   # links -> link text
    markdown = re.sub(r'`|\*\*|_', '', markdown)              # code ticks, bold, italics
    markdown = re.sub(r'\n{2,}', '\n', markdown)              # collapse blank lines
    return markdown.strip()
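
A quick sanity check on a made-up snippet:

sample = "# Pricing\n\nSee our [plans](https://example.com/plans) for **details**."
print(clean_markdown(sample))
# Pricing
# See our plans for details.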

Step 3: Initialize ChromaDB

import chromadb
from chromadb.utils import embedding_functions

# In-memory client: the index lives only as long as the process (see the note below)
chroma_client = chromadb.Client()

# Let Chroma call OpenAI's embedding endpoint for us on add() and query()
openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
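
Note that chromadb.Client() keeps everything in memory, so the index vanishes when the process exits. If you'd rather persist it between runs (see "What's Next" below), Chroma ships a drop-in persistent client; ./chroma_store is an arbitrary path of our choosing:

# Drop-in replacement that writes the index to disk
chroma_client = chromadb.PersistentClient(path="./chroma_store")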

Step 4: Index Content

def retrieve_markdown(retrieve_id):
    # Fetch the markdown Olostep captured for a single crawled page
    res = requests.get("https://api.olostep.com/v1/retrieve",
                       headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"},
                       params={"retrieve_id": retrieve_id, "formats": ["markdown"]})
    return res.json().get("markdown_content", "")

def index_content(pages):
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            if len(text) > 20:  # skip near-empty pages
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],  # keep the source URL for later
                    ids=[page["retrieve_id"]]          # retrieve_id doubles as a stable ID
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error on {page['url']}: {e}")

Step 5: Ask Questions & Get Summarized Answers

Query the Chroma collection:

def query_website(question, top_k=3):
    # Embed the question and return the top_k most similar page texts
    results = collection.query(query_texts=[question], n_results=top_k)
    return results["documents"][0]  # results are grouped per query; we sent one

Summarize with GPT-4:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    if not chunks:
        return "Sorry, I couldn't find enough information."

    # Join retrieved pages with blank lines so GPT-4 sees them as separate passages
    context = "\n\n".join(chunks)
    prompt = f'''
Use the following website content to answer this question:

{context}

Q: {question}
A:
'''

    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
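
Before wiring up the UI, you can exercise the retrieve-then-summarize loop from a plain Python session:

question = "What services does this company provide?"
chunks = query_website(question)
print(summarize_with_gpt(question, chunks))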

Step 6: Build the Streamlit UI

import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website's content.")

question = st.chat_input("Type your question here...")

if question:
    with st.chat_message("user"):
        st.markdown(question)

    with st.spinner("Thinking..."):
        chunks = query_website(question)
        final_answer = summarize_with_gpt(question, chunks)

    with st.chat_message("assistant"):
        st.markdown(final_answer)
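
The snippets above leave one thing implicit: when indexing runs. Streamlit re-executes the whole script on every interaction, so crawling inside the page body would repeat on each question. Here's a minimal sketch that indexes once per server process using st.cache_resource (START_URL is a placeholder of ours, not part of the tutorial's code):

START_URL = "https://example.com"  # placeholder: point this at your own site

@st.cache_resource
def build_index():
    # Executed once per process; later reruns reuse the cached result
    crawl_id = start_crawl(START_URL)
    wait_for_crawl(crawl_id)
    index_content(get_pages(crawl_id))
    return collection

build_index()  # call this near the top of the script, before handling questions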

Result

Now you can ask questions like:

  • What services does this company provide?
  • What are your support hours?
  • Where can I find your pricing?

And your chatbot will answer them with context pulled straight from your site's content.


Conclusion

Congratulations! You've now built a fully functional AI-powered chatbot that answers questions about your website, grounded in its actual content, all without training a model or hosting complex infrastructure.

This solution combines:

  • ⚙️ Olostep for fast and reliable website crawling
  • 📚 ChromaDB for powerful local document + vector storage
  • 🤖 OpenAI for embeddings and natural language summarization
  • 💬 Streamlit for a lightweight, real-time chatbot interface

Whether you're building an internal documentation assistant, a smart helpdesk bot, or simply want to turn your static website into an interactive experience — this stack gets you there quickly and efficiently.


What's Next?

To take your chatbot to the next level, consider:

  • Saving ChromaDB collections to skip re-indexing
  • Chunking long documents for higher recall accuracy
  • Adding memory to support multi-turn conversations
  • Showing source URLs alongside the GPT answers (see the sketch after this list)
  • Hosting your Streamlit app with password-protection or domain integration
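
For example, surfacing source URLs is mostly a matter of asking Chroma to return metadata alongside documents. Here's a minimal sketch of a variant of query_website (the function name is ours):

def query_website_with_sources(question, top_k=3):
    # include= tells Chroma which fields to return with each match
    results = collection.query(
        query_texts=[question],
        n_results=top_k,
        include=["documents", "metadatas"],
    )
    docs = results["documents"][0]
    urls = [meta["url"] for meta in results["metadatas"][0]]
    return docs, urls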

This is just the beginning. By combining structured retrieval with modern LLMs, you can build incredibly powerful tools for any knowledge-centric product.


Thank You

Feel free to share this tutorial or contribute to future improvements!
Got stuck or want help deploying this to production? Reach out — we're happy to help.

Happy building! 💡