
How to Discover Trending Blog Ideas Using Web Crawling and AI

Ehsan
Discover Trending Blog Ideas with Olostep and OpenAI
Keeping up with trending topics across top tech websites can be time-consuming. In this guide, you’ll learn how to automate the discovery of blog ideas from multiple sources like TechCrunch and The Verge using the Olostep web scraping API, enrich that content using OpenAI for summarization, and export everything to a structured CSV.
Objectives
- Crawl blogs from multiple tech websites.
- Retrieve the titles and content.
- Use OpenAI GPT to generate intelligent summaries.
- Export everything to a CSV file for further use in planning, research, or publishing.
Requirements
Before we begin, ensure the following are installed in your Python environment:
pip install requests beautifulsoup4 openai
You also need:
- A valid Olostep API Key.
- An OpenAI API Key for GPT-based summarization.
Step 1: Define Sources and API Keys
We begin by importing the libraries we need, specifying the blog sources we want to monitor, and configuring our API keys.
import requests
import time

API_KEY = "YOUR_OLOSTEP_API_KEY"
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

blog_sources = [
    "https://techcrunch.com",
    "https://www.theverge.com",
    "https://www.wired.com"
]
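If you prefer not to hardcode credentials, a minimal variation loads the keys from environment variables instead; the variable names below are just an example.
import os

# Assumes you have exported these variables in your shell beforehand.
API_KEY = os.environ.get("OLOSTEP_API_KEY", "YOUR_OLOSTEP_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")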
Step 2: Start a Crawl for Each Domain
This function initiates a crawl from the specified blog homepage, targeting articles from the past few years.
def start_crawl(start_url):
    payload = {
        "start_url": start_url,
        # Only follow URLs that look like dated article paths from recent years
        "include_urls": ["/2025/**", "/2024/**", "/2023/**"],
        "max_pages": 20,
        "max_depth": 1
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    return response.json()["id"]
Step 3: Monitor the Crawl Until Completion
Once started, we need to periodically check the crawl’s status. Crawls may take a few minutes depending on depth and site load.
def check_status(crawl_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}", headers=headers)
    return response.json()["status"]
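As an optional safety net (not part of the original pipeline), you could wrap this check in a small helper that gives up after a timeout, so a stalled crawl doesn't block the script indefinitely. The helper name and its parameters are illustrative additions, not Olostep API features.
def wait_for_completion(crawl_id, poll_seconds=30, timeout_seconds=900):
    # Poll the crawl status until it completes, or give up after timeout_seconds.
    waited = 0
    while waited < timeout_seconds:
        if check_status(crawl_id) == "completed":
            return True
        time.sleep(poll_seconds)
        waited += poll_seconds
    return False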
Step 4: Retrieve Crawled Pages
Once the crawl completes, we retrieve the individual pages that were crawled.
def get_pages(crawl_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages", headers=headers)
    return response.json()["pages"]
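As used in the later steps, each returned page is assumed to be a dictionary containing at least a url and a retrieve_id. A quick way to sanity-check what came back (the crawl ID below is a placeholder):
pages = get_pages("crawl_123")  # "crawl_123" is a placeholder crawl ID
for page in pages:
    print(page["url"], page["retrieve_id"])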
Step 5: Extract Raw HTML from the Page
We now fetch the HTML content for each page using its retrieve_id.
def retrieve_html(retrieve_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    params = {"retrieve_id": retrieve_id, "formats": ["html"]}
    response = requests.get("https://api.olostep.com/v1/retrieve", headers=headers, params=params)
    return response.json().get("html_content", "")
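Raw HTML usually contains scripts, styles, and navigation markup that add noise to the summary. As an optional extra step (not part of the original tutorial), you could strip those tags with BeautifulSoup before handing the text to the model in Step 7:
from bs4 import BeautifulSoup

def clean_html(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content tags before extracting the visible text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)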
Step 6: Summarize the Page Content with OpenAI
Using OpenAI’s GPT model, we convert raw HTML text into a digestible summary. This is useful for trend analysis and content briefs.
from openai import OpenAI

def summarize_text(text):
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this blog content in 1-2 sentences:\n\n{text}"
        }]
    )
    return response.choices[0].message.content.strip()
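A quick sanity check before wiring this into the pipeline (the sample text below is purely illustrative):
sample = "Apple announced a new mixed-reality headset aimed at developers and creators."
print(summarize_text(sample))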
Step 7: Extract Titles and Save Results to CSV
We extract each page’s title using BeautifulSoup, summarize its content, and write the result to a CSV file.
import csv
from bs4 import BeautifulSoup
def extract_and_save(pages, output_writer):
    for page in pages:
        try:
            html = retrieve_html(page["retrieve_id"])
            soup = BeautifulSoup(html, "html.parser")
            title = soup.title.string.strip() if soup.title and soup.title.string else "No Title"
            # Cap the text length to keep the summarization prompt small
            text = soup.get_text()[:3000]
            summary = summarize_text(text)
            output_writer.writerow([title, page["url"], summary])
        except Exception as e:
            print(f"Error with {page['url']}: {e}")
Step 8: Full Pipeline to Run the System
Finally, we put everything together in a single run loop that automates the entire process.
def run():
    with open("blog_trends.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "URL", "AI Summary"])
        for source in blog_sources:
            print(f"🚀 Crawling: {source}")
            crawl_id = start_crawl(source)
            # Poll every 30 seconds until the crawl finishes
            while check_status(crawl_id) != "completed":
                time.sleep(30)
            pages = get_pages(crawl_id)
            extract_and_save(pages, writer)

if __name__ == "__main__":
    run()
Output Example
The script will produce a blog_trends.csv file with the following structure:

| Title | URL | AI Summary |
| --- | --- | --- |
| Apple's Vision Pro Event | https://techcrunch.com/... | Apple reveals its Vision Pro headset... |
| Meta’s Latest AI Chip Launch | https://www.theverge.com/... | Meta introduces new AI architecture... |
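From here, the CSV can be loaded into any analysis tool. For example, with pandas (an extra dependency, not installed above), you could review the collected ideas in a notebook:
import pandas as pd

df = pd.read_csv("blog_trends.csv")
print(df.head())                 # preview the first few ideas
print(df["AI Summary"].iloc[0])  # read the first summary in full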
🧠 Final Thoughts
This workflow:
- Saves hours of manual blog browsing
- Produces AI-assisted summaries of each article
- Exports a CSV ready for SEO research, newsletters, or product content teams