Build a Website Knowledge Chatbot Using Streamlit, ChromaDB, Olostep, and OpenAI

Ehsan

Have you ever wanted a smart AI assistant that understands your entire website and can answer questions like ChatGPT? In this tutorial, we’ll show you how to build it — without training your own LLM or managing any backend.

We'll use:

  • Olostep to crawl and extract website content
  • ChromaDB to store and search content embeddings with metadata
  • OpenAI GPT-4 and Embeddings API for context-aware Q&A
  • Streamlit to build a live chatbot interface

Perfect for product sites, documentation portals, landing pages, and more.

Here's the gist with the full code if you want to run it on your own: https://gist.github.com/olostep/40b5380e8f5b144d370656c3f5fdbd41


Requirements

Before we dive in, install the required Python packages:

pip install streamlit openai==1.7.6 chromadb requests

You'll also need:

  • An Olostep API key (used to crawl the site and retrieve page content)
  • An OpenAI API key (used for embeddings and GPT-4 answers)

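The snippets below hard-code the keys for readability. In practice you may prefer environment variables; here's a minimal sketch (the variable names are our own convention, not required by either service):

import os

# Export these before running, e.g. `export OLOSTEP_API_KEY=...`
OLOSTEP_API_KEY = os.environ["OLOSTEP_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
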
How It Works

  1. Crawl your website using Olostep (no scraping or parsing code to write)
  2. Extract clean markdown from each page
  3. Use OpenAI to embed each page’s content into vectors
  4. Store vectors + content + metadata in ChromaDB
  5. Ask a question in Streamlit
  6. Query top-matching documents
  7. Use GPT-4 to summarize an answer based on retrieved content

Step-by-Step Implementation

Step 1: Crawl Website with Olostep

We use Olostep's API to start crawling from the homepage:

import time
import requests

OLOSTEP_API_KEY = "YOUR_OLOSTEP_API_KEY"  # or load it from the environment as shown above

def start_crawl(url):
    # Kick off a crawl from the given start URL, capped so the demo stays small
    payload = {
        "start_url": url,
        "include_urls": ["/**"],  # follow every path under the start URL
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": f"Bearer {OLOSTEP_API_KEY}"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    return res.json()["id"]  # crawl ID, used below to poll status and fetch results

To wait for completion:

def wait_for_crawl(crawl_id):
    # Poll the crawl status every 30 seconds until Olostep reports completion
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}",
                           headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"})
        if res.json()["status"] == "completed":
            break
        time.sleep(30)

Get all crawled page IDs:

def get_pages(crawl_id):
    # List every page the crawl captured; each entry carries a retrieve_id and URL
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages",
                       headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"})
    return res.json()["pages"]

Step 2: Clean and Prepare Markdown

Once we retrieve the markdown, we strip its syntax (headings, links, emphasis) so only plain text gets embedded:

import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ ', '', markdown)                   # heading markers
    markdown = re.sub(r'\* ', '', markdown)                   # bullet markers
    markdown = re.sub(r'> ', '', markdown)                    # blockquote markers
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)   # links -> link text
    markdown = re.sub(r'`|\*\*|_', '', markdown)              # code ticks, bold, italics
    markdown = re.sub(r'\n{2,}', '\n', markdown)              # collapse blank lines
    return markdown.strip()
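
A quick sanity check on a made-up snippet:

sample = "# Pricing\n\nSee our [plans](https://example.com/plans) for **details**."
print(clean_markdown(sample))
# Pricing
# See our plans for details.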

Step 3: Initialize ChromaDB

import chromadb
from chromadb.utils import embedding_functions

# In-memory client: the index lives only as long as the process (see the note below)
chroma_client = chromadb.Client()

# Let Chroma call OpenAI's embedding endpoint for us on add() and query()
openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
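
Note that chromadb.Client() keeps everything in memory, so the index vanishes when the process exits. If you'd rather persist it between runs (see "What's Next" below), Chroma ships a drop-in persistent client; ./chroma_store is an arbitrary path of our choosing:

# Drop-in replacement that writes the index to disk
chroma_client = chromadb.PersistentClient(path="./chroma_store")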

Step 4: Index Content

def retrieve_markdown(retrieve_id):
    # Fetch the markdown Olostep captured for a single crawled page
    res = requests.get("https://api.olostep.com/v1/retrieve",
                       headers={"Authorization": f"Bearer {OLOSTEP_API_KEY}"},
                       params={"retrieve_id": retrieve_id, "formats": ["markdown"]})
    return res.json().get("markdown_content", "")

def index_content(pages):
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            if len(text) > 20:  # skip near-empty pages
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],  # keep the source URL for later
                    ids=[page["retrieve_id"]]          # retrieve_id doubles as a stable ID
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error on {page['url']}: {e}")

Step 5: Ask Questions & Get Summarized Answers

Query the Chroma collection:

def query_website(question, top_k=3):
    # Embed the question and return the top_k most similar page texts
    results = collection.query(query_texts=[question], n_results=top_k)
    return results["documents"][0]  # results are grouped per query; we sent one

Summarize with GPT-4:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    if not chunks:
        return "Sorry, I couldn't find enough information."

    # Join retrieved pages with blank lines so GPT-4 sees them as separate passages
    context = "\n\n".join(chunks)
    prompt = f'''
Use the following website content to answer this question:

{context}

Q: {question}
A:
'''

    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
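
Before wiring up the UI, you can exercise the retrieve-then-summarize loop from a plain Python session:

question = "What services does this company provide?"
chunks = query_website(question)
print(summarize_with_gpt(question, chunks))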

Step 6: Build the Streamlit UI

import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website's content.")

question = st.chat_input("Type your question here...")

if question:
    with st.chat_message("user"):
        st.markdown(question)

    with st.spinner("Thinking..."):
        chunks = query_website(question)
        final_answer = summarize_with_gpt(question, chunks)

    with st.chat_message("assistant"):
        st.markdown(final_answer)
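
The snippets above leave one thing implicit: when indexing runs. Streamlit re-executes the whole script on every interaction, so crawling inside the page body would repeat on each question. Here's a minimal sketch that indexes once per server process using st.cache_resource (START_URL is a placeholder of ours, not part of the tutorial's code):

START_URL = "https://example.com"  # placeholder: point this at your own site

@st.cache_resource
def build_index():
    # Executed once per process; later reruns reuse the cached result
    crawl_id = start_crawl(START_URL)
    wait_for_crawl(crawl_id)
    index_content(get_pages(crawl_id))
    return collection

build_index()  # call this near the top of the script, before handling questions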

Result

Now you can ask questions like:

  • What services does this company provide?
  • What are your support hours?
  • Where can I find your pricing?

And your chatbot will answer them with context pulled straight from your site's content.


Conclusion

Congratulations! You've now built a fully functional AI-powered chatbot that answers questions about your website, grounded in its actual content, all without training a model or hosting complex infrastructure.

This solution combines:

  • ⚙️ Olostep for fast and reliable website crawling
  • 📚 ChromaDB for powerful local document + vector storage
  • 🤖 OpenAI for embeddings and natural language summarization
  • 💬 Streamlit for a lightweight, real-time chatbot interface

Whether you're building an internal documentation assistant, a smart helpdesk bot, or simply want to turn your static website into an interactive experience — this stack gets you there quickly and efficiently.


What's Next?

To take your chatbot to the next level, consider:

  • Saving ChromaDB collections to skip re-indexing
  • Chunking long documents for higher recall accuracy
  • Adding memory to support multi-turn conversations
  • Showing source URLs alongside the GPT answers (see the sketch after this list)
  • Hosting your Streamlit app with password-protection or domain integration
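
For example, surfacing source URLs is mostly a matter of asking Chroma to return metadata alongside documents. Here's a minimal sketch of a variant of query_website (the function name is ours):

def query_website_with_sources(question, top_k=3):
    # include= tells Chroma which fields to return with each match
    results = collection.query(
        query_texts=[question],
        n_results=top_k,
        include=["documents", "metadatas"],
    )
    docs = results["documents"][0]
    urls = [meta["url"] for meta in results["metadatas"][0]]
    return docs, urls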

This is just the beginning. By combining structured retrieval with modern LLMs, you can build incredibly powerful tools for any knowledge-centric product.


Thank You

Feel free to share this tutorial or contribute to future improvements!
Got stuck or want help deploying this to production? Reach out — we're happy to help.

Happy building! 💡