Source Data Embedding (Step 0): Building the Foundation
This article explains how we embed all discoverable source content (products, parts, and on-site pages) into a shared vector space to support semantic matching across the SEO and search pipeline.
The Problem: Matching Queries to Content by Meaning
Users describe intent with many different phrasings. Keyword matching breaks when query wording does not overlap with catalog wording.
We use semantic embeddings so both queries and source content can be compared by meaning (typically via cosine similarity).
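Concretely, the comparison reduces to cosine similarity between a query vector and each source vector. A minimal sketch, assuming embeddings are NumPy arrays with nonzero norm (the function and variable names here are illustrative, not the pipeline's actual API):

```python
import numpy as np

def cosine_similarity(query_vec: np.ndarray, source_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of a source matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    m = source_matrix / np.linalg.norm(source_matrix, axis=1, keepdims=True)
    return m @ q  # one score per source item, in [-1, 1]
```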
What Gets Embedded
We embed all content that a user can discover and that the pipeline can route to:
- Products: The sellable catalog items (title + feature text + any curated/AI copy).
- Parts: Sellable components and accessories (names + specs).
- On-site pages: Articles (/a/) and other navigable pages that can appear in search and internal links.
Embedding all discoverable content enables cross-type retrieval (e.g., an article can rank for a product-oriented query, and a product can rank for an informational query when appropriate).
High-Level Architecture
```mermaid
flowchart TD
    A[Catalog + parts + pages] --> B[Normalize & build descriptions]
    B --> C[Deduplicate by canonical URL]
    C --> D[Incremental embedding]
    D --> E[Source embedding matrix + key index]
    E --> F[Downstream: matching, routing, related searches, search service]
```

The Embedding Pipeline
Step A: Consolidate Source Data
- Purpose: Build one canonical dataset of source items.
- Inputs: Product catalog, parts dataset, and page content.
- Outputs: A consolidated source dataset with:
  - Keys: Canonical URLs (used as stable identifiers)
  - Descriptions: Text used for embedding
  - Types: Product / part / article / page (for downstream routing)
- Process (a minimal sketch follows this list):
  - Extract items from each source.
  - Build a description per item.
  - Deduplicate by canonical URL so each URL is represented once.
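A minimal sketch of consolidation under these assumptions: the source loaders and field names (`canonical_url`, `description`) are hypothetical; only the shape of the output mirrors the description above.

```python
def consolidate(products, parts, pages):
    """Build one record per canonical URL: {url: {"description": ..., "type": ...}}."""
    consolidated = {}
    for item_type, items in [("product", products), ("part", parts), ("page", pages)]:
        for item in items:
            url = item["canonical_url"]  # stable identifier
            if url in consolidated:
                continue  # first writer wins; see Deduplication Rules below
            consolidated[url] = {
                "description": item["description"],  # text used for embedding
                "type": item_type,                   # for downstream routing
            }
    return consolidated
```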
Step B: Incremental Embedding
We compute embeddings incrementally: only content that has changed is re-embedded, and everything else is reused from a cache.

- Purpose: Embed only new/changed items while reusing cached embeddings for unchanged items.
- Inputs: Consolidated source dataset and an embedding cache (previous run).
- Outputs: Updated embedding matrix and key index.
- Process (a minimal sketch follows this list):
  - Load cached keys and embeddings.
  - Compute a change signal for each item (based on the item's embedding text).
  - Embed only new/changed items using the configured model (see Embedding Strategy).
  - Merge cached and newly computed embeddings into a stable order keyed by URL.
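A sketch of the incremental merge, assuming a content hash of the embedding text serves as the change signal and `embed_batch` wraps the configured model (both names are illustrative assumptions, not the pipeline's actual interface):

```python
import hashlib
import numpy as np

def change_signal(text: str) -> str:
    """Content hash of the embedding text; changes exactly when the text changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_embed(consolidated, cache, embed_batch):
    """Reuse cached vectors for unchanged items; embed only new/changed ones."""
    urls = sorted(consolidated)  # stable order keyed by URL
    to_embed = [
        u for u in urls
        if cache.get(u, {}).get("hash") != change_signal(consolidated[u]["description"])
    ]
    new_vecs = embed_batch([consolidated[u]["description"] for u in to_embed]) if to_embed else []
    for u, vec in zip(to_embed, new_vecs):
        cache[u] = {"hash": change_signal(consolidated[u]["description"]), "vector": vec}

    matrix = np.stack([cache[u]["vector"] for u in urls])
    return matrix, urls  # embedding matrix + key index
```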
Step C: Persist Artifacts
Embeddings are stored in an efficient numeric format (commonly via NumPy) along with a separate key index so downstream steps can map vector rows back to canonical URLs.
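Persistence might look like the following sketch (file names and paths are illustrative; the actual pipeline's formats may differ):

```python
import json
import numpy as np

def persist(matrix: np.ndarray, urls: list[str], out_dir: str = ".") -> None:
    """Store the embedding matrix and the row->URL key index side by side."""
    np.save(f"{out_dir}/source_embeddings.npy", matrix)
    with open(f"{out_dir}/source_keys.json", "w", encoding="utf-8") as f:
        json.dump(urls, f)  # row i of the matrix corresponds to urls[i]
```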
Description Construction (What Text We Embed)
Products
Product descriptions are built from:
- Simplified name: Product name and key identifiers (e.g., SKU/series naming).
- Feature text: Human-readable representation of product features (see SKU Structure).
- Long-form copy: Curated or AI-generated copy where available (see Content AI Generation).
Example (illustrative):

```
<product name>
<short description>
<key feature highlights>
```
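The template above could be assembled roughly as follows. The field names (`simplified_name`, `short_description`, `feature_highlights`) are assumptions for illustration, not the catalog's actual schema:

```python
def build_product_description(product: dict) -> str:
    """Concatenate name, short description, and feature highlights into embedding text."""
    lines = [product["simplified_name"]]                 # name + key identifiers
    if product.get("short_description"):
        lines.append(product["short_description"])       # curated or AI copy
    lines.extend(product.get("feature_highlights", []))  # human-readable features
    return "\n".join(lines)
```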
Parts
Part descriptions are built from:
- Customer-facing name
- Internal name (technical identifier)
- Category
- Specs/attributes rendered as text (a rendering sketch follows this list)
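Rendering specs as text might look like this sketch, a hypothetical helper; sorting keys keeps the text stable across runs so the change signal only fires when a spec value actually changes:

```python
def render_specs(specs: dict) -> str:
    """Render spec key/value pairs as stable, human-readable lines."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(specs.items()))
```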
On-site pages
Page descriptions are built from:
- Title
- Primary body snippet (intro/lead section)
- Contextual identifiers (category/family labels where applicable)
The goal is stable, content-representative text that changes when page meaning changes.
Deduplication Rules
Each canonical URL appears once in the consolidated dataset.
- Why: Multiple sources can describe the same URL through different generation paths; duplicates break routing, matching, and downstream indexing.
- How: Build a dictionary keyed by canonical URL and enforce uniqueness at consolidation time. If duplicates are found, the script logs their count and removes them (see the sketch below).
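A sketch of that uniqueness check (the record shape is illustrative):

```python
def dedupe_by_canonical_url(records: list[dict]) -> dict:
    """Keep the first record per canonical URL and report how many were dropped."""
    unique = {}
    for record in records:
        unique.setdefault(record["canonical_url"], record)
    duplicates = len(records) - len(unique)
    if duplicates:
        print(f"Removed {duplicates} duplicate record(s) by canonical URL")
    return unique
```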
How Other Steps Use Source Embeddings
- Product matching: Clusters or queries are matched to the closest source items by semantic similarity (see SEO Product Matching; a lookup sketch follows this list).
- Related searches: The related-search generator uses embedding similarity to propose relevant navigational links (see SEO Related Searches).
- Search service: Online search can use vector similarity over indexed embeddings (see Search Service Architecture).
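For example, a downstream consumer could resolve the top-k closest items against the persisted matrix and key index like this (names remain illustrative):

```python
import numpy as np

def top_k_matches(query_vec: np.ndarray, matrix: np.ndarray, urls: list[str], k: int = 5):
    """Return the k canonical URLs whose embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q                       # cosine similarity per source item
    top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return [(urls[i], float(scores[i])) for i in top]
```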
See Also
- SEO Embedding Strategy — Model choice and embedding conventions
- SEO Product Matching — Using embeddings for routing to products/pages
- SEO Related Searches — Embeddings applied to internal-link generation
- Search Service Architecture — How embeddings are served online
- SEO Pipeline Overview — End-to-end pipeline context
Summary
- What: Embed products, parts, and discoverable pages into a shared vector space.
- How: Consolidate and deduplicate by canonical URL, then incrementally embed only changed items.
- Why: This enables semantic retrieval and routing across the pipeline (matching, related searches, and online vector search).