SEO Pipeline Overview: 11-Step Query-to-Page Generation
This article provides a comprehensive overview of our SEO pipeline that transforms a large set of search queries into optimized query pages, related searches, and semantic search capabilities.
The Problem: Scaling SEO Content
Traditional SEO means creating and tuning a page for each search query by hand:
- Many queries × non-trivial time per page → manual work does not scale
- Each page needs a title, meta description, content, and internal links
- Pages must be updated when products or the catalog change
- Consistency and best practices must be maintained across all pages
Automation is required to handle this volume.
The Solution: Automated SEO Pipeline
The pipeline has 11 main steps. It:
- Fetches queries from several sources (Google Search Console, Google Ads, live traffic, and Algolia analytics)
- Clusters similar queries into semantic groups
- Matches queries to products using embeddings
- Generates query pages with optimized content
- Builds related searches for navigation
- Loads data into Valkey for the search service
End-to-end execution takes on the order of hours; it produces thousands of query pages.
Pipeline Architecture
graph TD
S0[Step 0: Embed Source Data
products, parts, articles]
S1a[Step 1a: Fetch GSC Queries]
S1b[Step 1b: Fetch Google Ads]
S1c[Step 1c: Fetch Keyword Ideas]
S1d[Step 1d: Fetch Live Queries]
S1e[Step 1e: Fetch Algolia Queries]
S2[Step 2: Combine Queries
merge all sources]
S3a[Step 3a: Generate Base
Phrase Mappings]
S3b[Step 3b: Embed Queries
convert to vectors]
S4[Step 4: Expand Phrase Mappings
semantic similarity]
S5[Step 5: Cluster Queries
all-to-all similarity]
S6[Step 6: Match Source Data
query-product matching]
S7[Step 7: Build Query Pages
generate HTML]
S8[Step 8: Generate Related Searches
3-tier strategy]
S9[Step 9: Migrate Descriptions
migration + enrichment]
S11[Step 11: Migrate to Valkey
load search service]
S0 --> S6
S1a --> S2
S1b --> S2
S1c --> S2
S1d --> S2
S1e --> S2
S2 --> S3b
S3a --> S4
S3b --> S4
S3b --> S5
S4 --> S6
S5 --> S6
S6 --> S7
S7 --> S8
S0 --> S9
S7 --> S11
S8 --> S11
Step 0: Embed Source Data
- Purpose: Produce embeddings for all products, parts, and articles.
- Input: Product catalog, parts data, and article content.
- Output: Source metadata (keys, descriptions, types) and a matrix of embedding vectors (fixed dimension per item).
- Process (sketched below):
  - Extract products, parts, and articles from their sources.
  - Build a description per item.
  - Embed descriptions with a sentence model (see Embedding Strategy).
  - Persist embeddings and metadata.
- See: Embed Source Data
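A minimal sketch of this step, assuming a sentence-transformers model and a simple list-of-dicts catalog; the model name, field names, and file paths are illustrative assumptions, not the actual implementation:

```python
# Sketch of Step 0: embed source descriptions and persist metadata + vectors.
# Model choice and data layout are assumptions for illustration.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_source_data(items, model_name="all-MiniLM-L6-v2"):
    """items: [{"key": ..., "type": "product"|"part"|"article", "description": ...}, ...]"""
    model = SentenceTransformer(model_name)
    descriptions = [item["description"] for item in items]
    # One fixed-dimension vector per item, L2-normalized so dot product equals cosine similarity.
    embeddings = model.encode(descriptions, normalize_embeddings=True)
    return np.asarray(embeddings, dtype=np.float32)

if __name__ == "__main__":
    items = [
        {"key": "sku-123", "type": "product", "description": "14-inch laptop, 16 GB RAM, 512 GB SSD"},
        {"key": "part-9", "type": "part", "description": "Replacement 65 W USB-C power adapter"},
    ]
    vectors = embed_source_data(items)
    np.save("source_embeddings.npy", vectors)      # embedding matrix
    with open("source_metadata.json", "w") as f:   # keys, types, descriptions
        json.dump(items, f)
```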
Step 1: Fetch Queries
- Purpose: Collect search queries from multiple sources.
- Substeps:
  - 1a: Fetch from Google Search Console (impressions, clicks, position)
  - 1b: Fetch from Google Ads (search terms, conversions)
  - 1c: Fetch keyword ideas from an ads API (search volume)
  - 1d: Fetch live query logs (real user traffic)
  - 1e: Fetch from Algolia (top searches from search analytics)
- Output: One query list per source (e.g. GSC, ads, live, Algolia).
- See: Fetch Queries
Step 2: Combine Queries
- Purpose: Merge queries from all sources and deduplicate.
- Input: All query outputs from Step 1.
- Output: A single combined query list.
- Process (sketched below):
  - Load every source.
  - Deduplicate by query text.
  - Aggregate scores (e.g. impressions and clicks).
  - Sort by total score and save.
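The merge logic can be illustrated with a small sketch; the field names and the score weighting are assumptions, not the production formula:

```python
# Sketch of Step 2: merge per-source query lists, deduplicate by normalized
# text, and sum traffic signals. Field names and weights are illustrative.
from collections import defaultdict

def combine_queries(*source_lists):
    merged = defaultdict(lambda: {"impressions": 0, "clicks": 0, "sources": set()})
    for source_name, queries in source_lists:
        for q in queries:
            key = q["query"].strip().lower()          # deduplicate by query text
            merged[key]["impressions"] += q.get("impressions", 0)
            merged[key]["clicks"] += q.get("clicks", 0)
            merged[key]["sources"].add(source_name)
    combined = [
        {"query": text,
         "impressions": data["impressions"],
         "clicks": data["clicks"],
         "sources": sorted(data["sources"]),
         "score": data["impressions"] + 10 * data["clicks"]}   # example weighting only
        for text, data in merged.items()
    ]
    return sorted(combined, key=lambda row: row["score"], reverse=True)

combined = combine_queries(
    ("gsc", [{"query": "gaming laptop", "impressions": 1200, "clicks": 40}]),
    ("ads", [{"query": "Gaming Laptop", "clicks": 15}]),
)
```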
Step 3a: Generate Base Phrase Mappings
- Purpose: Build rule-based phrase-to-filter mappings from catalog structure.
- Input: Product catalog and feature ordering.
- Output: A base set of phrase → filter mappings.
- Process (sketched below):
  - Take feature values from products.
  - Form phrases from heading, key, value, and unit.
  - Apply feature-specific rules for processor, memory, storage, connectivity, etc.
  - Persist base mappings.
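To make the rule-based construction concrete, here is a hedged sketch; the feature structure, filter format, and the memory rule shown are assumptions for illustration:

```python
# Sketch of Step 3a: form phrase -> filter mappings from product features.
# Data shapes and the feature-specific rule are illustrative assumptions.
def base_phrase_mappings(products):
    mappings = {}
    for product in products:
        for feature in product["features"]:           # e.g. {"key": "memory", "value": 16, "unit": "GB"}
            value, unit, key = feature["value"], feature.get("unit", ""), feature["key"]
            phrase = f"{value} {unit}".strip().lower()            # "16 gb"
            if key == "memory":
                # Feature-specific rule: memory phrases commonly include "ram".
                mappings[f"{phrase} ram"] = {"filter": "memory", "value": value}
            mappings[f"{phrase} {key}"] = {"filter": key, "value": value}
    return mappings

print(base_phrase_mappings([{"features": [{"key": "memory", "value": 16, "unit": "GB"}]}]))
# {'16 gb ram': {'filter': 'memory', 'value': 16}, '16 gb memory': {'filter': 'memory', 'value': 16}}
```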
Step 3b: Embed Queries
- Purpose: Turn the combined query list into embedding vectors.
- Input: Combined query list from Step 2.
- Output: Query embedding matrix (one vector per query).
- Process (sketched below):
  - Load combined queries.
  - Filter out blacklisted queries.
  - Embed with the same model as in Embedding Strategy.
  - Save embeddings (with incremental caching where supported).
- See: Embed Queries
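The incremental caching mentioned above could look roughly like this; the cache layout, hashing scheme, and model are assumptions, not the actual implementation:

```python
# Sketch of Step 3b: embed only queries that are not already cached on disk.
# Cache directory, hashing, and model choice are illustrative assumptions.
import hashlib
import os
import numpy as np
from sentence_transformers import SentenceTransformer

CACHE_DIR = "query_embedding_cache"

def embed_queries(queries, model_name="all-MiniLM-L6-v2", blacklist=frozenset()):
    os.makedirs(CACHE_DIR, exist_ok=True)
    model = SentenceTransformer(model_name)
    vectors = []
    for text in queries:
        if text in blacklist:                          # skip blacklisted queries
            continue
        path = os.path.join(CACHE_DIR, hashlib.sha1(text.encode()).hexdigest() + ".npy")
        if os.path.exists(path):
            vec = np.load(path)                        # reuse cached embedding
        else:
            vec = model.encode(text, normalize_embeddings=True)
            np.save(path, vec)                         # cache the new embedding
        vectors.append(vec)
    return np.vstack(vectors)                          # one row per query
```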
Step 4: Expand Phrase Mappings
- Purpose: Find more phrases using semantic similarity to the base set.
- Input: Base phrase mappings, combined queries, and query embeddings.
- Output: Expanded phrase mappings (larger phrase set).
- Process (sketched below):
  - Extract n-grams from queries.
  - Embed phrases and compare to search texts.
  - Accept matches above a similarity threshold.
  - Apply manual seeds and resolve conflicts (e.g. memory vs storage).
  - Save expanded mappings.
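A simplified sketch of the expansion loop follows; the threshold value, n-gram size, and helper names are assumptions, and conflict resolution and manual seeds are omitted:

```python
# Sketch of Step 4: extract query n-grams, embed them, and adopt any n-gram
# whose cosine similarity to a base phrase clears a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

def ngrams(text, max_n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for n in range(1, max_n + 1) for i in range(len(words) - n + 1)}

def expand_mappings(base_mappings, queries, threshold=0.80, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    candidates = sorted({g for q in queries for g in ngrams(q)} - set(base_mappings))
    base_phrases = list(base_mappings)
    base_vecs = model.encode(base_phrases, normalize_embeddings=True)
    cand_vecs = model.encode(candidates, normalize_embeddings=True)
    sims = cand_vecs @ base_vecs.T                     # cosine similarity (vectors are normalized)
    expanded = dict(base_mappings)
    for i, phrase in enumerate(candidates):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:                    # accept matches above the threshold
            expanded[phrase] = base_mappings[base_phrases[j]]
    return expanded
```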
Step 5: Cluster Queries
- Purpose: Group similar queries into semantic clusters.
- Input: Combined queries and query embeddings.
- Output: Cluster manifest (each query assigned to a cluster).
- Process (sketched below):
  - Load query embeddings.
  - Compute all-to-all similarity in batches.
  - Assign each query to the best-matching cluster.
  - Form singleton clusters for unclustered queries.
  - Sort clusters by traffic score and save.
- See: Query Clustering
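The assignment logic can be illustrated with a greedy leader-clustering sketch; the production step computes similarities in batches and may use a different algorithm and threshold, so treat this only as a minimal illustration:

```python
# Sketch of Step 5: greedily assign each query to the most similar existing
# cluster center, or start a new (possibly singleton) cluster.
import numpy as np

def cluster_queries(embeddings, threshold=0.85):
    """embeddings: L2-normalized (n, d) matrix; returns one cluster label per query."""
    n = embeddings.shape[0]
    labels = np.full(n, -1, dtype=int)
    centers = []                                       # indices of cluster-center queries
    for i in range(n):
        if centers:
            sims = embeddings[i] @ embeddings[centers].T   # cosine similarity to existing centers
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels[i] = best                       # join the best-matching cluster
                continue
        centers.append(i)                              # this query seeds a new cluster
        labels[i] = len(centers) - 1
    return labels, centers
```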
Step 6: Match Source Data
- Purpose: Match each query cluster to the best product, part, or article.
- Input: Cluster manifest, cluster embeddings, and source embeddings.
- Output: Cluster-to-source matches (one primary match per cluster).
- Process (sketched below):
  - Load cluster and source embeddings.
  - Compute similarity between clusters and sources.
  - Apply discovery scoring (combining semantic similarity with traffic and other signals).
  - Apply any catalog-specific boosts.
  - Pick one best match per cluster and save.
- See: Product Matching
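A hedged sketch of the blended "discovery score" idea follows; the 0.7/0.3 weighting and the boost vector are assumptions, not the real scoring formula:

```python
# Sketch of Step 6: blend semantic similarity with traffic and boosts, then
# pick one primary source match per cluster.
import numpy as np

def match_clusters(cluster_vecs, cluster_traffic, source_vecs, source_boosts=None):
    """All vectors L2-normalized; cluster_traffic is a 1-D array of raw traffic scores."""
    sims = cluster_vecs @ source_vecs.T                        # (clusters, sources) cosine similarity
    traffic = cluster_traffic / max(cluster_traffic.max(), 1e-9)
    scores = 0.7 * sims + 0.3 * traffic[:, None]               # blend similarity with traffic (example weights)
    if source_boosts is not None:
        scores += source_boosts[None, :]                       # catalog-specific boosts (e.g. in-stock items)
    best = scores.argmax(axis=1)                               # one primary match per cluster
    return best, scores[np.arange(len(best)), best]
```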
Step 7: Build Query Pages
- Purpose: Generate one HTML page per cluster.
- Input: Cluster manifest, source matches, and phrase mappings.
- Output: Routing data plus HTML files (one per cluster).
- Process (sketched below):
  - Load clusters and matches.
  - For each cluster: build URL slug, derive filters from center query, resolve product/part/article, generate title and meta description, render HTML.
  - Save routing data.
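The per-cluster slug, title, and meta description generation might look roughly like this; the length limits, URL prefix, and templates are illustrative assumptions:

```python
# Sketch of Step 7: derive a URL slug, title, and meta description for one cluster.
import re

def build_page(center_query, product_name):
    slug = re.sub(r"[^a-z0-9]+", "-", center_query.lower()).strip("-")
    title = f"{center_query.title()} | Buy {product_name}"[:60]          # stay within typical SERP title length
    meta = (f"Shop for {center_query}. Compare specs and prices for "
            f"{product_name} and similar models.")[:155]                # typical meta description limit
    return {"url": f"/search/{slug}", "title": title, "meta_description": meta}

print(build_page("16 gb ram laptop", "ExampleBook 14"))
```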
Step 8: Generate Related Searches
- Purpose: Create related-search links for every page.
- Input: Routing data, source metadata, and query embeddings.
- Output: Related-search store (one entry per page, several links each) plus a JSON checkpoint.
- Process (sketched below):
  - Enumerate all site pages (query pages, product pages, etc.).
  - Load page and query embeddings; sort pages by traffic.
  - For each page: compute similarity to all queries, pick top links using a 3-tier strategy (semantic, categorical, global), apply caps and deduplication, write to store.
  - Save checkpoint.
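The 3-tier selection for a single page can be sketched as follows; tier sizes, the cap, and the category fields are assumptions:

```python
# Sketch of Step 8: fill related links from semantic neighbours first, then
# same-category queries, then global top queries, with a cap and dedup.
import numpy as np

def related_searches(page_vec, query_vecs, query_texts, query_categories,
                     page_category, global_top, max_links=8):
    links, seen = [], set()

    def add(candidates):
        for text in candidates:
            if len(links) >= max_links:
                return
            if text not in seen:                       # deduplicate across tiers
                seen.add(text)
                links.append(text)

    sims = query_vecs @ page_vec                       # tier 1: semantically closest queries
    add([query_texts[i] for i in np.argsort(-sims)[:max_links]])
    add([t for t, c in zip(query_texts, query_categories)   # tier 2: same category
         if c == page_category])
    add(global_top)                                    # tier 3: global fallback
    return links
```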
Step 9: Migrate Descriptions
- Purpose: Migrate descriptions from legacy query pages to the current page store and enrich them with multi-language content.
- Input: Legacy query page data and the current page store.
- Output: Updated page store with migrated and enriched descriptions.
- Process:
  - Load legacy descriptions from the old query page table.
  - For each active page: migrate legacy descriptions if missing, then enrich with localized content from cache for all supported languages.
  - Save updated pages.
Step 11: Migrate to Valkey
- Purpose: Load data into Valkey so the search service can serve queries.
- Input: Combined query list, query embeddings, and phrase mappings.
- Output: Valkey instance with a RediSearch-style index for query embeddings, cached phrase mappings, cached popular queries, and an autocomplete index (e.g. sorted sets).
- Process (sketched below):
  - Connect to Valkey.
  - Index query embeddings for similarity search.
  - Cache phrase mappings and popular queries with a TTL.
  - Build prefix → query autocomplete index.
  - Verify data.
- See: Valkey Migration
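Because Valkey speaks the Redis protocol, the caching and autocomplete parts can be sketched with the standard Redis client; the key names, TTL, and data shapes are assumptions, and the vector index for query embeddings is omitted here:

```python
# Sketch of Step 11: cache phrase mappings and popular queries with a TTL and
# build a prefix -> query autocomplete index from sorted sets.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_valkey(phrase_mappings, popular_queries, ttl_seconds=7 * 24 * 3600):
    # Cache phrase -> filter mappings and the popular-query list with a TTL.
    r.set("seo:phrase_mappings", json.dumps(phrase_mappings), ex=ttl_seconds)
    r.set("seo:popular_queries", json.dumps(popular_queries), ex=ttl_seconds)

    # Autocomplete: every prefix of a query maps to a sorted set scored by traffic.
    pipe = r.pipeline()
    for query, score in popular_queries:
        text = query.lower()
        for end in range(1, len(text) + 1):
            pipe.zadd(f"seo:auto:{text[:end]}", {text: score})
    pipe.execute()

def autocomplete(prefix, limit=5):
    return r.zrevrange(f"seo:auto:{prefix.lower()}", 0, limit - 1)
```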
Pipeline Execution
Running the Pipeline
The pipeline is run as a single script that executes steps in sequence. From the project’s SEO scripts directory, the equivalent of:
./run_seo_pipeline.sh
runs the full pipeline. Individual steps can be invoked separately (e.g. Step 0, then Step 1, etc.); implementation details are omitted here.
Scheduling
Execution is scheduled as follows:
- Daily: Steps 1, 2, 3b, 5 (refresh queries and clusters).
- Weekly: Full pipeline (regenerate all pages and downstream data).
- On-demand: Any single step when needed.
Data Flow
flowchart LR
subgraph Sources
GSC[GSC]
Ads[Google Ads]
Live[Live]
Algolia[Algolia]
end
subgraph Pipeline
Combined[Combined queries]
Emb[Query embeddings]
Clusters[Clusters]
Matches[Matches]
Routing[Routing]
HTML[HTML pages]
end
subgraph Outputs
Related[Related-search store]
Valkey[Valkey]
end
GSC --> Combined
Ads --> Combined
Live --> Combined
Algolia --> Combined
Combined --> Emb
Emb --> Clusters
Clusters --> Matches
Matches --> Routing
Routing --> HTML
HTML --> Related
Routing --> Valkey
Related --> Valkey
Output Scale (Generic)
- Queries: Total count varies by connected sources; a subset is blacklisted; the rest are valid for clustering.
- Clusters: Many multi-member clusters plus singleton clusters for the rest.
- Matches: A high percentage of clusters receive a source match; similarity scores depend on catalog and query overlap.
- Pages: One query page per cluster; product and other pages add to total site size.
- Related searches: Each page gets a fixed number of related links; total link count scales with the number of pages.
Performance Characteristics
- Total time: A full pipeline run is on the order of hours; exact duration depends on data size and hardware.
- Incremental runs: Only steps that depend on changed data need to run; incremental updates are faster than a full run.
- Resources: CPU load is highest during the embedding and similarity steps; memory scales with embedding and similarity matrix size; disk usage is moderate for all artifacts.
Incremental Processing
The pipeline supports incremental updates:
- Embeddings: Only new or changed items need to be embedded; the cache is reused for the rest.
- Phrase mappings: New mappings can be merged with existing ones.
- Clusters: Existing cluster centers can be reused when appropriate.
- Related searches: Only pages whose data changed need to be recomputed.
This keeps daily or weekly refresh times lower than a full rebuild.
Error Handling
- API failures: Retries with exponential backoff (see the sketch after this list).
- Missing data: Skip and log; continue where possible.
- Invalid SKUs: Filter out and continue.
- Embedding failures: Use cached embeddings when available.
Steps write checkpoints so the run can resume from the last successful step.
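The retry policy for external API calls can be sketched as follows; the attempt count, base delay, and jitter are assumptions rather than the configured values:

```python
# Sketch of exponential backoff with jitter for flaky external API calls.
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:                       # in practice, catch the specific API error types
            if attempt == max_attempts:
                raise                                  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```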
Monitoring and Logging
The pipeline logs progress per step (e.g. “Step N started”, “Step N complete”). Log destination and format are determined by the deployment environment.
See Also
- Embedding Strategy — How embeddings are chosen and generated
- Query Clustering — How similar queries are grouped
- Phrase-to-Filter Mappings — How phrases map to filters
- Product Matching — How clusters are matched to products
- Related Search Generation — How related links are built
- Search Service Architecture — How Valkey is used in the live site
Summary
The SEO pipeline automates query-to-page generation in 11 steps:
- Input: A large set of queries from GSC, ads, live traffic, and Algolia analytics.
- Process: Embed source data; fetch and combine queries; build and expand phrase mappings; embed and cluster queries; match clusters to products/parts/articles; build query pages; generate related searches; migrate descriptions and Valkey data.
- Output: Thousands of query pages, expanded phrase mappings, a related-search store, and a Valkey-backed search service.
Benefits of the current setup: automated SEO page generation, semantic query handling, product matching driven by embeddings, broad internal linking via related searches, and a fast search service backed by Valkey.