How to Discover Trending Blog Ideas Using Web Crawling and AI

Ehsan

Keeping up with trending topics across top tech websites can be time-consuming. In this guide, you’ll learn how to automate the discovery of blog ideas from sources like TechCrunch and The Verge using the Olostep web scraping API, enrich that content with OpenAI-generated summaries, and export everything to a structured CSV file.


Objectives

  • Crawl blogs from multiple tech websites.
  • Retrieve the titles and content.
  • Use OpenAI GPT to generate intelligent summaries.
  • Export everything to a CSV file for further use in planning, research, or publishing.

Requirements

Before we begin, ensure the following are installed in your Python environment:

pip install requests beautifulsoup4 openai

You also need:

  • A valid Olostep API Key.
  • An OpenAI API Key for GPT-based summarization.

Step 1: Define Sources and API Keys

We begin by specifying the blog sources we want to monitor and configuring our API keys.

import requests  # used for every Olostep API call below
import time      # used in Step 8 to poll the crawl status

API_KEY = "YOUR_OLOSTEP_API_KEY"
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

# Blog homepages to monitor for new articles
blog_sources = [
    "https://techcrunch.com",
    "https://www.theverge.com",
    "https://www.wired.com"
]
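If you prefer not to hard-code credentials, you can read them from environment variables instead. Here is a minimal sketch; the variable names OLOSTEP_API_KEY and OPENAI_API_KEY are just examples, not required by either API:

import os

# Fall back to the placeholder strings if the variables are not set
API_KEY = os.environ.get("OLOSTEP_API_KEY", "YOUR_OLOSTEP_API_KEY")
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")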

Step 2: Start a Crawl for Each Domain

This function initiates a crawl from the specified blog homepage, targeting articles from the past few years.

def start_crawl(start_url):
    # Limit the crawl to dated article URLs and keep it small and shallow
    payload = {
        "start_url": start_url,
        "include_urls": ["/2025/**", "/2024/**", "/2023/**"],
        "max_pages": 20,
        "max_depth": 1
    }
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post("https://api.olostep.com/v1/crawls", headers=headers, json=payload)
    response.raise_for_status()  # surface auth or request errors immediately
    return response.json()["id"]
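As a quick sanity check, you can start a crawl for a single source and print the returned ID (this assumes the response carries an id field, as used above):

crawl_id = start_crawl(blog_sources[0])
print(f"Started crawl {crawl_id} for {blog_sources[0]}")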

Step 3: Monitor the Crawl Until Completion

Once started, we need to periodically check the crawl’s status. Crawls may take a few minutes depending on depth and site load.

def check_status(crawl_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}", headers=headers)
    return response.json()["status"]
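The full pipeline in Step 8 simply polls check_status in a loop. If you want to cap how long you are willing to wait, a small helper like the hypothetical wait_for_completion below (not part of the Olostep API) does the same thing with a timeout:

def wait_for_completion(crawl_id, timeout=900, interval=30):
    # Poll the crawl status until it completes or the timeout expires
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check_status(crawl_id) == "completed":
            return True
        time.sleep(interval)
    return False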

Step 4: Retrieve Crawled Pages

Once the crawl completes, we retrieve the individual pages that were crawled.

def get_pages(crawl_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    response = requests.get(f"https://api.olostep.com/v1/crawls/{crawl_id}/pages", headers=headers)
    return response.json()["pages"]
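Before processing, it can help to preview what came back. The short check below assumes each page entry carries the url and retrieve_id fields used in the next steps:

pages = get_pages(crawl_id)
print(f"Crawl returned {len(pages)} pages")
for page in pages[:5]:
    print(page["url"], "->", page["retrieve_id"])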

Step 5: Extract Raw HTML from the Page

We now fetch the HTML content for each page using its retrieve_id.

def retrieve_html(retrieve_id):
    headers = {"Authorization": f"Bearer {API_KEY}"}
    params = {"retrieve_id": retrieve_id, "formats": ["html"]}
    response = requests.get("https://api.olostep.com/v1/retrieve", headers=headers, params=params)
    return response.json().get("html_content", "")

Step 6: Summarize the Page Content with OpenAI

Using OpenAI’s GPT model, we turn the extracted page text into a short, digestible summary. This is useful for trend analysis and content briefs.

def summarize_text(text):
    from openai import OpenAI
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize this blog content in 1-2 sentences:\n\n{text}"
        }]
    )
    return response.choices[0].message.content.strip()
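OpenAI calls can occasionally fail for transient reasons such as rate limits or network hiccups. An optional retry wrapper, sketched below, keeps one flaky request from stopping a whole crawl (the pipeline in Step 8 does not use it by default):

def summarize_with_retry(text, retries=3, delay=5):
    # Retry the summarization a few times before giving up
    for attempt in range(retries):
        try:
            return summarize_text(text)
        except Exception as e:
            print(f"Summary attempt {attempt + 1} failed: {e}")
            time.sleep(delay)
    return "Summary unavailable"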

Step 7: Extract Titles and Save Results to CSV

We extract each page’s title using BeautifulSoup, summarize its content, and write the result to a CSV file.

import csv
from bs4 import BeautifulSoup

def extract_and_save(pages, output_writer):
    for page in pages:
        try:
            html = retrieve_html(page["retrieve_id"])
            soup = BeautifulSoup(html, "html.parser")
            # Guard against pages with a missing or empty <title> tag
            title = soup.title.string.strip() if soup.title and soup.title.string else "No Title"
            text = soup.get_text()[:3000]  # cap the text sent to the model
            summary = summarize_text(text)
            output_writer.writerow([title, page["url"], summary])
        except Exception as e:
            print(f"Error with {page['url']}: {e}")

Step 8: Full Pipeline to Run the System

Finally, we put everything together in a single run loop that automates the entire process.

def run():
    with open("blog_trends.csv", "w", newline='', encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "URL", "AI Summary"])

        for source in blog_sources:
            print(f"🚀 Crawling: {source}")
            crawl_id = start_crawl(source)

            # Poll every 30 seconds until the crawl finishes
            while check_status(crawl_id) != "completed":
                time.sleep(30)

            pages = get_pages(crawl_id)
            extract_and_save(pages, writer)

if __name__ == "__main__":
    run()

Output Example

The script will produce a blog_trends.csv with the following structure:

Title | URL | AI Summary
Apple's Vision Pro Event | https://techcrunch.com/... | Apple reveals its Vision Pro headset...
Meta’s Latest AI Chip Launch | https://www.theverge.com/... | Meta introduces new AI architecture...
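Once the file exists, downstream tooling can consume it with the standard csv module, for example to review the summaries in a terminal:

import csv

with open("blog_trends.csv", newline='', encoding="utf-8") as file:
    for row in csv.DictReader(file):
        print(f"{row['Title']}: {row['AI Summary']}")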

🧠 Final Thoughts

This workflow:

  • Saves hours of manual blog browsing
  • Produces concise, AI-assisted summaries of each article
  • Exports a CSV that is ready for SEO research, newsletters, or product content teams