
Build a Website Chatbot Using Streamlit, ChromaDB, Olostep, and OpenAI (v1.7.6)

Ehsan
Have you ever wanted a smart AI assistant that understands your entire website and can answer questions like ChatGPT? In this tutorial, we’ll show you how to build it — without training your own LLM or managing any backend.
We'll use:
- Olostep to crawl and extract website content
- ChromaDB to store and search content embeddings with metadata
- OpenAI GPT-4 and Embeddings API for context-aware Q&A
- Streamlit to build a live chatbot interface
Perfect for product sites, documentation portals, landing pages, and more.
Here's the gist with the full code if you want to run it on your own: https://gist.github.com/olostep/40b5380e8f5b144d370656c3f5fdbd41
Requirements
Before we dive in, install the required Python packages:
pip install streamlit openai==1.7.6 chromadb requests
You'll also need:
- An Olostep API key (used in the crawl and retrieve requests below)
- An OpenAI API key (used for embeddings and GPT-4)
How It Works
- Crawl your website using Olostep (no scraping infrastructure or HTML parsing to maintain)
- Extract clean markdown from each page
- Use OpenAI to embed each page’s content into vectors
- Store vectors + content + metadata in ChromaDB
- Ask a question in Streamlit
- Query top-matching documents
- Use GPT-4 to summarize an answer based on retrieved content
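Concretely, the indexing half of those steps chains together into a single function. This is just a sketch of the glue; the helpers it calls (`start_crawl`, `wait_for_crawl`, `get_pages`, `index_content`) are defined in the steps that follow:

```python
def run_pipeline(start_url):
    # Assumes the helper functions defined in Steps 1-4 below.
    crawl_id = start_crawl(start_url)  # Step 1: kick off the crawl
    wait_for_crawl(crawl_id)           # poll until Olostep reports "completed"
    pages = get_pages(crawl_id)        # list the crawled pages
    index_content(pages)               # Steps 2-4: clean, embed, store in ChromaDB
```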
Step-by-Step Implementation
Step 1: Crawl Website with Olostep
We use Olostep's API to start crawling from the homepage:
```python
import requests

def start_crawl(url):
    payload = {
        "start_url": url,
        "include_urls": ["/**"],
        "max_pages": 10,
        "max_depth": 3
    }
    headers = {"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"}
    res = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    return res.json()["id"]
```
To wait for completion:
```python
import time

def wait_for_crawl(crawl_id):
    while True:
        res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}",
                           headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"})
        if res.json()["status"] == "completed":
            break
        time.sleep(30)
```
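One caveat: the loop above waits forever if a crawl never finishes. A timeout-aware variant can be sketched with a generic polling helper (`poll_until` is a hypothetical name, not part of the Olostep API; it takes any zero-argument callable that returns the current status string):

```python
import time

def poll_until(get_status, timeout=600, interval=30):
    # Poll get_status() until it returns "completed" or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "completed":
            return True
        time.sleep(interval)
    return False
```

You could then pass in a lambda that fetches the crawl's status from the API and handle a `False` return (timeout) explicitly instead of hanging.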
Get all crawled page IDs:
```python
def get_pages(crawl_id):
    res = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages",
                       headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"})
    return res.json()["pages"]
```
Step 2: Clean and Prepare Markdown
Once we retrieve the markdown, we clean it:
```python
import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ ', '', markdown)                  # strip heading markers
    markdown = re.sub(r'\* ', '', markdown)                  # strip bullet markers
    markdown = re.sub(r'> ', '', markdown)                   # strip blockquote markers
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)  # keep link text, drop URLs
    markdown = re.sub(r'`|\*\*|_', '', markdown)             # strip inline formatting
    markdown = re.sub(r'\n{2,}', '\n', markdown)             # collapse blank lines
    return markdown.strip()
```
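As a quick sanity check, here is the cleaner applied to a small sample (the function is repeated so the snippet runs on its own):

```python
import re

def clean_markdown(markdown):
    markdown = re.sub(r'#+ ', '', markdown)
    markdown = re.sub(r'\* ', '', markdown)
    markdown = re.sub(r'> ', '', markdown)
    markdown = re.sub(r'\[(.*?)\]\(.*?\)', r'\1', markdown)
    markdown = re.sub(r'`|\*\*|_', '', markdown)
    markdown = re.sub(r'\n{2,}', '\n', markdown)
    return markdown.strip()

sample = "## Pricing\n\nSee [our plans](https://example.com/pricing) for **details**.\n\n* Free tier"
print(clean_markdown(sample))
# Pricing
# See our plans for details.
# Free tier
```

Headings, links, bold markers, and bullet markers are gone, leaving plain text that embeds cleanly.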
Step 3: Initialize ChromaDB
```python
import chromadb
from chromadb.utils import embedding_functions

chroma_client = chromadb.Client()

openai_embed_fn = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-ada-002"
)

collection = chroma_client.get_or_create_collection(
    name="website-content",
    embedding_function=openai_embed_fn
)
```
Step 4: Index Content
```python
def retrieve_markdown(retrieve_id):
    res = requests.get("https://api.olostep.com/v1/retrieve",
                       headers={"Authorization": "Bearer YOUR_OLOSTEP_API_KEY"},
                       params={"retrieve_id": retrieve_id, "formats": ["markdown"]})
    return res.json().get("markdown_content", "")

def index_content(pages):
    for page in pages:
        try:
            markdown = retrieve_markdown(page["retrieve_id"])
            text = clean_markdown(markdown)
            if len(text) > 20:  # skip near-empty pages
                collection.add(
                    documents=[text],
                    metadatas=[{"url": page["url"]}],
                    ids=[page["retrieve_id"]]
                )
                print(f"✅ Indexed: {page['url']}")
        except Exception as e:
            print(f"⚠️ Error on {page['url']}: {e}")
```
Step 5: Ask Questions & Get Summarized Answers
Query the Chroma collection:
```python
def query_website(question, top_k=3):
    results = collection.query(query_texts=[question], n_results=top_k)
    return results["documents"][0]
```
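Chroma also hands back the metadata we stored in Step 4, which is handy if you later want answers to cite their source pages. A small formatting helper (hypothetical, not part of the Chroma API) pairs each retrieved document with its URL:

```python
def with_sources(results):
    # Pair each retrieved document with the URL stored in its metadata.
    # Expects the dict returned by collection.query(), which nests
    # documents and metadatas one level deep per query text.
    docs = results["documents"][0]
    urls = [m["url"] for m in results["metadatas"][0]]
    return list(zip(docs, urls))
```

Passing the raw `collection.query(...)` result through `with_sources` yields `(text, url)` pairs you can feed into the prompt or display under each answer.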
Summarize with GPT-4:
```python
import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

def summarize_with_gpt(question, chunks):
    if not chunks:
        return "Sorry, I couldn't find enough information."
    context = "\n\n".join(chunks)  # separate chunks so they don't run together
    prompt = f'''
Use the following website content to answer this question:

{context}

Q: {question}
A:
'''
    res = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )
    return res.choices[0].message.content
```
Step 6: Build the Streamlit UI
```python
import streamlit as st

st.set_page_config(page_title="Website Chatbot", page_icon="💬")
st.title("💬 Website Chatbot")
st.caption("Ask anything based on your website's content.")

question = st.chat_input("Type your question here...")
if question:
    with st.chat_message("user"):
        st.markdown(question)
    with st.spinner("Thinking..."):
        chunks = query_website(question)
        final_answer = summarize_with_gpt(question, chunks)
    with st.chat_message("assistant"):
        st.markdown(final_answer)
```
Result
Now you can ask questions like:
- What services does this company provide?
- What are your support hours?
- Where can I find your pricing?
And your chatbot will instantly generate intelligent, context-aware answers.
Conclusion
Congratulations! You've now built a fully functional AI-powered chatbot that answers questions about your website, grounded in its actual content, without training a model or hosting complex infrastructure.
This solution combines:
- ⚙️ Olostep for fast and reliable website crawling
- 📚 ChromaDB for powerful local document + vector storage
- 🤖 OpenAI for embeddings and natural language summarization
- 💬 Streamlit for a lightweight, real-time chatbot interface
Whether you're building an internal documentation assistant, a smart helpdesk bot, or simply want to turn your static website into an interactive experience — this stack gets you there quickly and efficiently.
What's Next?
To take your chatbot to the next level, consider:
- Saving ChromaDB collections to skip re-indexing
- Chunking long documents for higher recall accuracy
- Adding memory to support multi-turn conversations
- Showing source URLs in the GPT answers
- Hosting your Streamlit app with password-protection or domain integration
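The chunking idea from the list above can start as simply as a sliding character window (a sketch; `chunk_text` and its parameters are illustrative, and production systems often split on sentence or token boundaries instead):

```python
def chunk_text(text, max_chars=1000, overlap=100):
    # Split text into overlapping windows so no chunk exceeds max_chars.
    # The overlap keeps context that straddles a boundary retrievable.
    chunks = []
    step = max_chars - overlap  # must be positive, or the loop never advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += step
    return chunks
```

You would then call `collection.add` once per chunk, giving each chunk its own id (e.g. derived from the page's `retrieve_id` plus an index) and reusing the page URL as metadata.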
This is just the beginning. By combining structured retrieval with modern LLMs, you can build incredibly powerful tools for any knowledge-centric product.
Thank You
Feel free to share this tutorial or contribute to future improvements!
Got stuck or want help deploying this to production? Reach out — we're happy to help.
Happy building! 💡