SEO Pipeline Overview: 11-Step Query-to-Page Generation
This article provides a comprehensive overview of our SEO pipeline that transforms a large set of search queries into optimized query pages, related searches, and semantic search capabilities.
The Problem: Scaling SEO Content
Traditional SEO means creating and tuning a page for each search query by hand:
- Many queries × non-trivial time per page → manual work does not scale
- Each page needs a title, meta description, content, and internal links
- Pages must be updated when products or the catalog change
- Consistency and best practices must be maintained across all pages
Automation is required to handle this volume.
The Solution: Automated SEO Pipeline
The pipeline has 11 main steps. It:
- Fetches queries from several sources (Google Search Console, Google Ads, live traffic, and Algolia analytics)
- Clusters similar queries into semantic groups
- Matches queries to products using embeddings
- Generates query pages with optimized content
- Builds related searches for navigation
- Loads data into Valkey for the search service
End-to-end execution takes on the order of hours; it produces thousands of query pages.
Pipeline Architecture
graph TD
S0[Step 0: Embed Source Data
products, parts, articles]
S1a[Step 1a: Fetch GSC Queries]
S1b[Step 1b: Fetch Google Ads]
S1c[Step 1c: Fetch Keyword Ideas]
S1d[Step 1d: Fetch Live Queries]
S1e[Step 1e: Fetch Algolia Queries]
S2[Step 2: Combine Queries
merge all sources]
S3a[Step 3a: Generate Base
Phrase Mappings]
S3b[Step 3b: Embed Queries
convert to vectors]
S4[Step 4: Expand Phrase Mappings
semantic similarity]
S5[Step 5: Cluster Queries
all-to-all similarity]
S6[Step 6: Match Source Data
query-product matching]
S7[Step 7: Build Query Pages
generate HTML]
S8[Step 8: Generate Related Searches
3-tier strategy]
S9[Step 9: Migrate Descriptions
migration + enrichment]
S11[Step 11: Migrate to Valkey
load search service]
S0 --> S6
S1a --> S2
S1b --> S2
S1c --> S2
S1d --> S2
S1e --> S2
S2 --> S3b
S3a --> S4
S3b --> S4
S3b --> S5
S4 --> S6
S5 --> S6
S6 --> S7
S7 --> S8
S0 --> S9
S7 --> S11
S8 --> S11
Step 0: Embed Source Data
- Purpose: Produce embeddings for all products, parts, and articles.
- Input: Product catalog, parts data, and article content.
- Output: Source metadata (keys, descriptions, types) and a matrix of embedding vectors (fixed dimension per item).
- Process (sketched below):
  - Extract products, parts, and articles from their sources.
  - Build a description per item.
  - Embed descriptions with a sentence model (see Embedding Strategy).
  - Persist embeddings and metadata.
- See: Embed Source Data
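A minimal sketch of this step, assuming a sentence-transformers model and a simple list-of-dicts catalog; the model name, field names, and file paths are illustrative assumptions, not the actual implementation:

```python
# Sketch of Step 0: embed source descriptions and persist metadata + vectors.
# Model choice and data layout are assumptions for illustration.
import json
import numpy as np
from sentence_transformers import SentenceTransformer

def embed_source_data(items, model_name="all-MiniLM-L6-v2"):
    """items: [{"key": ..., "type": "product"|"part"|"article", "description": ...}, ...]"""
    model = SentenceTransformer(model_name)
    descriptions = [item["description"] for item in items]
    # One fixed-dimension vector per item, L2-normalized so dot product equals cosine similarity.
    embeddings = model.encode(descriptions, normalize_embeddings=True)
    return np.asarray(embeddings, dtype=np.float32)

if __name__ == "__main__":
    items = [
        {"key": "sku-123", "type": "product", "description": "14-inch laptop, 16 GB RAM, 512 GB SSD"},
        {"key": "part-9", "type": "part", "description": "Replacement 65 W USB-C power adapter"},
    ]
    vectors = embed_source_data(items)
    np.save("source_embeddings.npy", vectors)      # embedding matrix
    with open("source_metadata.json", "w") as f:   # keys, types, descriptions
        json.dump(items, f)
```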
Step 1: Fetch Queries
- Purpose: Collect search queries from multiple sources.
- Substeps:
  - 1a: Fetch from Google Search Console (impressions, clicks, position)
  - 1b: Fetch from Google Ads (search terms, conversions)
  - 1c: Fetch keyword ideas from an ads API (search volume)
  - 1d: Fetch live query logs (real user traffic)
  - 1e: Fetch from Algolia (top searches from search analytics)
- Output: One query list per source (e.g. GSC, ads, live, Algolia).
- See: Fetch Queries
Step 2: Combine Queries
- Purpose: Merge queries from all sources and deduplicate.
- Input: All query outputs from Step 1.
- Output: A single combined query list.
- Process (sketched below):
  - Load every source.
  - Deduplicate by query text.
  - Aggregate scores (e.g. impressions and clicks).
  - Sort by total score and save.
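The merge logic can be illustrated with a small sketch; the field names and the score weighting are assumptions, not the production formula:

```python
# Sketch of Step 2: merge per-source query lists, deduplicate by normalized
# text, and sum traffic signals. Field names and weights are illustrative.
from collections import defaultdict

def combine_queries(*source_lists):
    merged = defaultdict(lambda: {"impressions": 0, "clicks": 0, "sources": set()})
    for source_name, queries in source_lists:
        for q in queries:
            key = q["query"].strip().lower()          # deduplicate by query text
            merged[key]["impressions"] += q.get("impressions", 0)
            merged[key]["clicks"] += q.get("clicks", 0)
            merged[key]["sources"].add(source_name)
    combined = [
        {"query": text,
         "impressions": data["impressions"],
         "clicks": data["clicks"],
         "sources": sorted(data["sources"]),
         "score": data["impressions"] + 10 * data["clicks"]}   # example weighting only
        for text, data in merged.items()
    ]
    return sorted(combined, key=lambda row: row["score"], reverse=True)

combined = combine_queries(
    ("gsc", [{"query": "gaming laptop", "impressions": 1200, "clicks": 40}]),
    ("ads", [{"query": "Gaming Laptop", "clicks": 15}]),
)
```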
Step 3a: Generate Base Phrase Mappings
- Purpose: Build rule-based phrase-to-filter mappings from catalog structure.
- Input: Product catalog and feature ordering.
- Output: A base set of phrase → filter mappings.
- Process (sketched below):
  - Take feature values from products.
  - Form phrases from heading, key, value, and unit.
  - Apply feature-specific rules for processor, memory, storage, connectivity, etc.
  - Persist base mappings.
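To make the rule-based construction concrete, here is a hedged sketch; the feature structure, filter format, and the memory rule shown are assumptions for illustration:

```python
# Sketch of Step 3a: form phrase -> filter mappings from product features.
# Data shapes and the feature-specific rule are illustrative assumptions.
def base_phrase_mappings(products):
    mappings = {}
    for product in products:
        for feature in product["features"]:           # e.g. {"key": "memory", "value": 16, "unit": "GB"}
            value, unit, key = feature["value"], feature.get("unit", ""), feature["key"]
            phrase = f"{value} {unit}".strip().lower()            # "16 gb"
            if key == "memory":
                # Feature-specific rule: memory phrases commonly include "ram".
                mappings[f"{phrase} ram"] = {"filter": "memory", "value": value}
            mappings[f"{phrase} {key}"] = {"filter": key, "value": value}
    return mappings

print(base_phrase_mappings([{"features": [{"key": "memory", "value": 16, "unit": "GB"}]}]))
# {'16 gb ram': {'filter': 'memory', 'value': 16}, '16 gb memory': {'filter': 'memory', 'value': 16}}
```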
Step 3b: Embed Queries
- Purpose: Turn the combined query list into embedding vectors.
- Input: Combined query list from Step 2.
- Output: Query embedding matrix (one vector per query).
- Process (sketched below):
  - Load combined queries.
  - Filter out blacklisted queries.
  - Embed with the same model as in Embedding Strategy.
  - Save embeddings (with incremental caching where supported).
- See: Embed Queries
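The incremental caching mentioned above could look roughly like this; the cache layout, hashing scheme, and model are assumptions, not the actual implementation:

```python
# Sketch of Step 3b: embed only queries that are not already cached on disk.
# Cache directory, hashing, and model choice are illustrative assumptions.
import hashlib
import os
import numpy as np
from sentence_transformers import SentenceTransformer

CACHE_DIR = "query_embedding_cache"

def embed_queries(queries, model_name="all-MiniLM-L6-v2", blacklist=frozenset()):
    os.makedirs(CACHE_DIR, exist_ok=True)
    model = SentenceTransformer(model_name)
    vectors = []
    for text in queries:
        if text in blacklist:                          # skip blacklisted queries
            continue
        path = os.path.join(CACHE_DIR, hashlib.sha1(text.encode()).hexdigest() + ".npy")
        if os.path.exists(path):
            vec = np.load(path)                        # reuse cached embedding
        else:
            vec = model.encode(text, normalize_embeddings=True)
            np.save(path, vec)                         # cache the new embedding
        vectors.append(vec)
    return np.vstack(vectors)                          # one row per query
```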
Step 4: Expand Phrase Mappings
- Purpose: Find more phrases using semantic similarity to the base set.
- Input: Base phrase mappings, combined queries, and query embeddings.
- Output: Expanded phrase mappings (larger phrase set).
- Process (sketched below):
  - Extract n-grams from queries.
  - Embed phrases and compare to search texts.
  - Accept matches above a similarity threshold.
  - Apply manual seeds and resolve conflicts (e.g. memory vs storage).
  - Save expanded mappings.
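A simplified sketch of the expansion loop follows; the threshold value, n-gram size, and helper names are assumptions, and conflict resolution and manual seeds are omitted:

```python
# Sketch of Step 4: extract query n-grams, embed them, and adopt any n-gram
# whose cosine similarity to a base phrase clears a threshold.
import numpy as np
from sentence_transformers import SentenceTransformer

def ngrams(text, max_n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for n in range(1, max_n + 1) for i in range(len(words) - n + 1)}

def expand_mappings(base_mappings, queries, threshold=0.80, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    candidates = sorted({g for q in queries for g in ngrams(q)} - set(base_mappings))
    base_phrases = list(base_mappings)
    base_vecs = model.encode(base_phrases, normalize_embeddings=True)
    cand_vecs = model.encode(candidates, normalize_embeddings=True)
    sims = cand_vecs @ base_vecs.T                     # cosine similarity (vectors are normalized)
    expanded = dict(base_mappings)
    for i, phrase in enumerate(candidates):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:                    # accept matches above the threshold
            expanded[phrase] = base_mappings[base_phrases[j]]
    return expanded
```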
Step 5: Cluster Queries
- Purpose: Group similar queries into semantic clusters.
- Input: Combined queries and query embeddings.
- Output: Cluster manifest (each query assigned to a cluster).
- Process (sketched below):
  - Load query embeddings.
  - Compute all-to-all similarity in batches.
  - Assign each query to the best-matching cluster.
  - Form singleton clusters for unclustered queries.
  - Sort clusters by traffic score and save.
- See: Query Clustering
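The assignment logic can be illustrated with a greedy leader-clustering sketch; the production step computes similarities in batches and may use a different algorithm and threshold, so treat this only as a minimal illustration:

```python
# Sketch of Step 5: greedily assign each query to the most similar existing
# cluster center, or start a new (possibly singleton) cluster.
import numpy as np

def cluster_queries(embeddings, threshold=0.85):
    """embeddings: L2-normalized (n, d) matrix; returns one cluster label per query."""
    n = embeddings.shape[0]
    labels = np.full(n, -1, dtype=int)
    centers = []                                       # indices of cluster-center queries
    for i in range(n):
        if centers:
            sims = embeddings[i] @ embeddings[centers].T   # cosine similarity to existing centers
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                labels[i] = best                       # join the best-matching cluster
                continue
        centers.append(i)                              # this query seeds a new cluster
        labels[i] = len(centers) - 1
    return labels, centers
```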
Step 6: Match Source Data
- Purpose: Match each query cluster to the best product, part, or article.
- Input: Cluster manifest, cluster embeddings, and source embeddings.
- Output: Cluster-to-source matches (one primary match per cluster).
- Process (sketched below):
  - Load cluster and source embeddings.
  - Compute similarity between clusters and sources.
  - Apply discovery scoring (combining semantic similarity with traffic and other signals).
  - Apply any catalog-specific boosts.
  - Pick one best match per cluster and save.
- See: Product Matching
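A hedged sketch of the blended "discovery score" idea follows; the 0.7/0.3 weighting and the boost vector are assumptions, not the real scoring formula:

```python
# Sketch of Step 6: blend semantic similarity with traffic and boosts, then
# pick one primary source match per cluster.
import numpy as np

def match_clusters(cluster_vecs, cluster_traffic, source_vecs, source_boosts=None):
    """All vectors L2-normalized; cluster_traffic is a 1-D array of raw traffic scores."""
    sims = cluster_vecs @ source_vecs.T                        # (clusters, sources) cosine similarity
    traffic = cluster_traffic / max(cluster_traffic.max(), 1e-9)
    scores = 0.7 * sims + 0.3 * traffic[:, None]               # blend similarity with traffic (example weights)
    if source_boosts is not None:
        scores += source_boosts[None, :]                       # catalog-specific boosts (e.g. in-stock items)
    best = scores.argmax(axis=1)                               # one primary match per cluster
    return best, scores[np.arange(len(best)), best]
```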
Step 7: Build Query Pages
- Purpose: Generate one HTML page per cluster.
- Input: Cluster manifest, source matches, and phrase mappings.
- Output: Routing data plus HTML files (one per cluster).
- Process (sketched below):
  - Load clusters and matches.
  - For each cluster: build URL slug, derive filters from center query, resolve product/part/article, generate title and meta description, render HTML.
  - Save routing data.
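The per-cluster slug, title, and meta description generation might look roughly like this; the length limits, URL prefix, and templates are illustrative assumptions:

```python
# Sketch of Step 7: derive a URL slug, title, and meta description for one cluster.
import re

def build_page(center_query, product_name):
    slug = re.sub(r"[^a-z0-9]+", "-", center_query.lower()).strip("-")
    title = f"{center_query.title()} | Buy {product_name}"[:60]          # stay within typical SERP title length
    meta = (f"Shop for {center_query}. Compare specs and prices for "
            f"{product_name} and similar models.")[:155]                # typical meta description limit
    return {"url": f"/search/{slug}", "title": title, "meta_description": meta}

print(build_page("16 gb ram laptop", "ExampleBook 14"))
```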
Step 8: Generate Related Searches
- Purpose: Create related-search links for every page.
- Input: Routing data, source metadata, and query embeddings.
- Output: Related-search store (one entry per page, several links each) plus a JSON checkpoint.
- Process (sketched below):
  - Enumerate all site pages (query pages, product pages, etc.).
  - Load page and query embeddings; sort pages by traffic.
  - For each page: compute similarity to all queries, pick top links using a 3-tier strategy (semantic, categorical, global), apply caps and deduplication, write to store.
  - Save checkpoint.
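The 3-tier selection for a single page can be sketched as follows; tier sizes, the cap, and the category fields are assumptions:

```python
# Sketch of Step 8: fill related links from semantic neighbours first, then
# same-category queries, then global top queries, with a cap and dedup.
import numpy as np

def related_searches(page_vec, query_vecs, query_texts, query_categories,
                     page_category, global_top, max_links=8):
    links, seen = [], set()

    def add(candidates):
        for text in candidates:
            if len(links) >= max_links:
                return
            if text not in seen:                       # deduplicate across tiers
                seen.add(text)
                links.append(text)

    sims = query_vecs @ page_vec                       # tier 1: semantically closest queries
    add([query_texts[i] for i in np.argsort(-sims)[:max_links]])
    add([t for t, c in zip(query_texts, query_categories)   # tier 2: same category
         if c == page_category])
    add(global_top)                                    # tier 3: global fallback
    return links
```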
Step 9: Migrate Descriptions
- Purpose: Migrate descriptions from legacy query pages to the current page store and enrich them with multi-language content.
- Input: Legacy query page data and the current page store.
- Output: Updated page store with migrated and enriched descriptions.
- Process:
  - Load legacy descriptions from the old query page table.
  - For each active page: migrate legacy descriptions if missing, then enrich with localized content from cache for all supported languages.
  - Save updated pages.
Step 11: Migrate to Valkey
- Purpose: Load data into Valkey so the search service can serve queries.
- Input: Combined query list, query embeddings, and phrase mappings.
- Output: Valkey instance with a RediSearch-style index for query embeddings, cached phrase mappings, cached popular queries, and an autocomplete index (e.g. sorted sets).
- Process (sketched below):
  - Connect to Valkey.
  - Index query embeddings for similarity search.
  - Cache phrase mappings and popular queries with a TTL.
  - Build prefix → query autocomplete index.
  - Verify data.
- See: Valkey Migration
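Because Valkey speaks the Redis protocol, the caching and autocomplete parts can be sketched with the standard Redis client; the key names, TTL, and data shapes are assumptions, and the vector index for query embeddings is omitted here:

```python
# Sketch of Step 11: cache phrase mappings and popular queries with a TTL and
# build a prefix -> query autocomplete index from sorted sets.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_valkey(phrase_mappings, popular_queries, ttl_seconds=7 * 24 * 3600):
    # Cache phrase -> filter mappings and the popular-query list with a TTL.
    r.set("seo:phrase_mappings", json.dumps(phrase_mappings), ex=ttl_seconds)
    r.set("seo:popular_queries", json.dumps(popular_queries), ex=ttl_seconds)

    # Autocomplete: every prefix of a query maps to a sorted set scored by traffic.
    pipe = r.pipeline()
    for query, score in popular_queries:
        text = query.lower()
        for end in range(1, len(text) + 1):
            pipe.zadd(f"seo:auto:{text[:end]}", {text: score})
    pipe.execute()

def autocomplete(prefix, limit=5):
    return r.zrevrange(f"seo:auto:{prefix.lower()}", 0, limit - 1)
```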
Pipeline Execution
Running the Pipeline
The pipeline is run as a single script that executes steps in sequence. From the project’s SEO scripts directory, the equivalent of:
./run_seo_pipeline.sh
runs the full pipeline. Individual steps can be invoked separately (e.g. Step 0, then Step 1, etc.); implementation details are omitted here.
Scheduling
Execution is scheduled as follows:
- Daily: Steps 1, 2, 3b, 5 (refresh queries and clusters).
- Weekly: Full pipeline (regenerate all pages and downstream data).
- On-demand: Any single step when needed.
Data Flow
flowchart LR
subgraph Sources
GSC[GSC]
Ads[Google Ads]
Live[Live]
Algolia[Algolia]
end
subgraph Pipeline
Combined[Combined queries]
Emb[Query embeddings]
Clusters[Clusters]
Matches[Matches]
Routing[Routing]
HTML[HTML pages]
end
subgraph Outputs
Related[Related-search store]
Valkey[Valkey]
end
GSC --> Combined
Ads --> Combined
Live --> Combined
Algolia --> Combined
Combined --> Emb
Emb --> Clusters
Clusters --> Matches
Matches --> Routing
Routing --> HTML
HTML --> Related
Routing --> Valkey
Related --> Valkey
Output Scale (Generic)
- Queries: Total count varies by connected sources; a subset is blacklisted; the rest are valid for clustering.
- Clusters: Many multi-member clusters plus singleton clusters for the rest.
- Matches: A high percentage of clusters receive a source match; similarity scores depend on catalog and query overlap.
- Pages: One query page per cluster; product and other pages add to total site size.
- Related searches: Each page gets a fixed number of related links; total link count scales with the number of pages.
Performance Characteristics
- Total time: A full pipeline run is on the order of hours; exact duration depends on data size and hardware.
- Incremental runs: Only steps that depend on changed data need to run; incremental updates are faster than a full run.
- Resources: CPU load is highest during the embedding and similarity steps; memory scales with embedding and similarity matrix size; disk usage is moderate for all artifacts.
Incremental Processing
The pipeline supports incremental updates:
- Embeddings: Only new or changed items need to be embedded; the cache is reused for the rest.
- Phrase mappings: New mappings can be merged with existing ones.
- Clusters: Existing cluster centers can be reused when appropriate.
- Related searches: Only pages whose data changed need to be recomputed.
This keeps daily or weekly refresh times lower than a full rebuild.
Error Handling
- API failures: Retries with exponential backoff (see the sketch after this list).
- Missing data: Skip and log; continue where possible.
- Invalid SKUs: Filter out and continue.
- Embedding failures: Use cached embeddings when available.
Steps write checkpoints so the run can resume from the last successful step.
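The retry policy for external API calls can be sketched as follows; the attempt count, base delay, and jitter are assumptions rather than the configured values:

```python
# Sketch of exponential backoff with jitter for flaky external API calls.
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:                       # in practice, catch the specific API error types
            if attempt == max_attempts:
                raise                                  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
```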
Monitoring and Logging
The pipeline logs progress per step (e.g. “Step N started”, “Step N complete”). Log destination and format are determined by the deployment environment.
See Also
- Embedding Strategy — How embeddings are chosen and generated
- Query Clustering — How similar queries are grouped
- Phrase-to-Filter Mappings — How phrases map to filters
- Product Matching — How clusters are matched to products
- Related Search Generation — How related links are built
- Search Service Architecture — How Valkey is used in the live site
Summary
The SEO pipeline automates query-to-page generation in 11 steps:
- Input: A large set of queries from GSC, ads, live traffic, and Algolia analytics.
- Process: Embed source data; fetch and combine queries; build and expand phrase mappings; embed and cluster queries; match clusters to products/parts/articles; build query pages; generate related searches; migrate descriptions and Valkey data.
- Output: Thousands of query pages, expanded phrase mappings, a related-search store, and a Valkey-backed search service.
Benefits of the current setup: automated SEO page generation, semantic query handling, product matching driven by embeddings, broad internal linking via related searches, and a fast search service backed by Valkey.