Source Data Embedding (Step 0): Building the Foundation
This article explains how we embed all discoverable source content (products, parts, and on-site pages) into a shared vector space to support semantic matching across the SEO and search pipeline.
The Problem: Matching Queries to Content by Meaning
Users describe intent with many different phrasings. Keyword matching breaks when query wording does not overlap with catalog wording.
We use semantic embeddings so both queries and source content can be compared by meaning (typically via cosine similarity).
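Concretely, the comparison reduces to cosine similarity between a query vector and each source vector. A minimal sketch, assuming embeddings are NumPy arrays with nonzero norm (the function and variable names here are illustrative, not the pipeline's actual API):

```python
import numpy as np

def cosine_similarity(query_vec: np.ndarray, source_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of a source matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    m = source_matrix / np.linalg.norm(source_matrix, axis=1, keepdims=True)
    return m @ q  # one score per source item, in [-1, 1]
```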
What Gets Embedded
We embed all content that a user can discover and that the pipeline can route to:
- Products: The sellable catalog items (title + feature text + any curated/AI copy).
- Parts: Sellable components and accessories (names + specs).
- On-site pages: Articles (/a/) and other navigable pages that can appear in search and internal links.
Embedding all discoverable content enables cross-type retrieval (e.g., an article can rank for a product-oriented query, and a product can rank for an informational query when appropriate).
High-Level Architecture
```mermaid
flowchart TD
    A[Catalog + parts + pages] --> B[Normalize & build descriptions]
    B --> C[Deduplicate by canonical URL]
    C --> D[Incremental embedding]
    D --> E[Source embedding matrix + key index]
    E --> F[Downstream: matching, routing, related searches, search service]
```

The Embedding Pipeline
Step A: Consolidate Source Data
- Purpose: Build one canonical dataset of source items.
- Inputs: Product catalog, parts dataset, and page content.
- Outputs: A consolidated source dataset with:
  - Keys: Canonical URLs (used as stable identifiers)
  - Descriptions: Text used for embedding
  - Types: Product / part / article / page (for downstream routing)
- Process (a minimal sketch follows this list):
  - Extract items from each source.
  - Build a description per item.
  - Deduplicate by canonical URL so each URL is represented once.
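A minimal sketch of consolidation under these assumptions: the source loaders and field names (`canonical_url`, `description`) are hypothetical; only the shape of the output mirrors the description above.

```python
def consolidate(products, parts, pages):
    """Build one record per canonical URL: {url: {"description": ..., "type": ...}}."""
    consolidated = {}
    for item_type, items in [("product", products), ("part", parts), ("page", pages)]:
        for item in items:
            url = item["canonical_url"]  # stable identifier
            if url in consolidated:
                continue  # first writer wins; see Deduplication Rules below
            consolidated[url] = {
                "description": item["description"],  # text used for embedding
                "type": item_type,                   # for downstream routing
            }
    return consolidated
```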
Step B: Incremental Embedding
We compute embeddings incrementally: only content that has changed is re-embedded, and everything else is reused from a cache.

- Purpose: Embed only new/changed items while reusing cached embeddings for unchanged items.
- Inputs: Consolidated source dataset and an embedding cache (previous run).
- Outputs: Updated embedding matrix and key index.
- Process (a minimal sketch follows this list):
  - Load cached keys and embeddings.
  - Compute a change signal for each item (based on the item's embedding text).
  - Embed only new/changed items using the configured model (see Embedding Strategy).
  - Merge cached and newly computed embeddings into a stable order keyed by URL.
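A sketch of the incremental merge, assuming a content hash of the embedding text serves as the change signal and `embed_batch` wraps the configured model (both names are illustrative assumptions, not the pipeline's actual interface):

```python
import hashlib
import numpy as np

def change_signal(text: str) -> str:
    """Content hash of the embedding text; changes exactly when the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_embed(consolidated, cache, embed_batch):
    """Reuse cached vectors for unchanged items; embed only new/changed ones."""
    urls = sorted(consolidated)  # stable order keyed by URL
    to_embed = [
        u for u in urls
        if cache.get(u, {}).get("hash") != change_signal(consolidated[u]["description"])
    ]
    new_vecs = embed_batch([consolidated[u]["description"] for u in to_embed]) if to_embed else []
    for u, vec in zip(to_embed, new_vecs):
        cache[u] = {"hash": change_signal(consolidated[u]["description"]), "vector": vec}

    matrix = np.stack([cache[u]["vector"] for u in urls])
    return matrix, urls  # embedding matrix + key index
```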
Step C: Persist Artifacts
Embeddings are stored in an efficient numeric format (commonly via NumPy) along with a separate key index so downstream steps can map vector rows back to canonical URLs.
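Persistence might look like the following sketch (file names and paths are illustrative; the actual pipeline's formats may differ):

```python
import json
import numpy as np

def persist(matrix: np.ndarray, urls: list[str], out_dir: str = ".") -> None:
    """Store the embedding matrix and the row->URL key index side by side."""
    np.save(f"{out_dir}/source_embeddings.npy", matrix)
    with open(f"{out_dir}/source_keys.json", "w", encoding="utf-8") as f:
        json.dump(urls, f)  # row i of the matrix corresponds to urls[i]
```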
Description Construction (What Text We Embed)
Products
Product descriptions are built from:
- Simplified name: Product name and key identifiers (e.g., SKU/series naming).
- Feature text: Human-readable representation of product features (see SKU Structure).
- Long-form copy: Curated or AI-generated copy where available (see Content AI Generation).
Example (illustrative):

```
<product name>
<short description>
<key feature highlights>
```
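The template above could be assembled roughly as follows. The field names (`simplified_name`, `short_description`, `feature_highlights`) are assumptions for illustration, not the catalog's actual schema:

```python
def build_product_description(product: dict) -> str:
    """Concatenate name, short description, and feature highlights into embedding text."""
    lines = [product["simplified_name"]]                 # name + key identifiers
    if product.get("short_description"):
        lines.append(product["short_description"])       # curated or AI copy
    lines.extend(product.get("feature_highlights", []))  # human-readable features
    return "\n".join(lines)
```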
Parts
Part descriptions are built from:
- Customer-facing name
- Internal name (technical identifier)
- Category
- Specs/attributes rendered as text (a rendering sketch follows this list)
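Rendering specs as text might look like this sketch, a hypothetical helper; sorting keys keeps the text stable across runs so the change signal only fires when a spec value actually changes:

```python
def render_specs(specs: dict) -> str:
    """Render spec key/value pairs as stable, human-readable lines."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(specs.items()))
```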
On-site pages
Page descriptions are built from:
- Title
- Primary body snippet (intro/lead section)
- Contextual identifiers (category/family labels where applicable)
The goal is stable, content-representative text that changes when page meaning changes.
Deduplication Rules
Each canonical URL appears once in the consolidated dataset.
- Why: Multiple sources can describe the same URL through different generation paths; duplicates break routing, matching, and downstream indexing.
- How: Build a dictionary keyed by canonical URL and enforce uniqueness at consolidation time. If duplicates are found, the script logs their count and removes them (see the sketch below).
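A sketch of that uniqueness check (the record shape is illustrative):

```python
def dedupe_by_canonical_url(records: list[dict]) -> dict:
    """Keep the first record per canonical URL and report how many were dropped."""
    unique = {}
    for record in records:
        unique.setdefault(record["canonical_url"], record)
    duplicates = len(records) - len(unique)
    if duplicates:
        print(f"Removed {duplicates} duplicate record(s) by canonical URL")
    return unique
```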
How Other Steps Use Source Embeddings
- Product matching: Clusters or queries are matched to the closest source items by semantic similarity (see SEO Product Matching; a lookup sketch follows this list).
- Related searches: The related-search generator uses embedding similarity to propose relevant navigational links (see SEO Related Searches).
- Search service: Online search can use vector similarity over indexed embeddings (see Search Service Architecture).
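For example, a downstream consumer could resolve the top-k closest items against the persisted matrix and key index like this (names remain illustrative):

```python
import numpy as np

def top_k_matches(query_vec: np.ndarray, matrix: np.ndarray, urls: list[str], k: int = 5):
    """Return the k canonical URLs whose embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per source item
    top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return [(urls[i], float(scores[i])) for i in top]
```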
See Also
- SEO Embedding Strategy — Model choice and embedding conventions
- SEO Product Matching — Using embeddings for routing to products/pages
- SEO Related Searches — Embeddings applied to internal-link generation
- Search Service Architecture — How embeddings are served online
- SEO Pipeline Overview — End-to-end pipeline context
Summary
- What: Embed products, parts, and discoverable pages into a shared vector space.
- How: Consolidate and deduplicate by canonical URL, then incrementally embed only changed items.
- Why: This enables semantic retrieval and routing across the pipeline (matching, related searches, and online vector search).