Product Matching: Semantic Similarity with Discovery Scoring

This article explains how we match search query clusters to products, parts, and articles using semantic similarity with an optional discovery scoring model that balances relevance, traffic, and page rank.

The Problem: Finding the Best Match

When users search for "mini pc," we need to determine which product page best represents that query. We have thousands of products, parts, and articles—which one should we show?

The challenge is balancing multiple factors:

  • Semantic relevance: How well does the page match the query meaning?

  • Query traffic: How much search traffic does this query get?

  • Page popularity: How much traffic does the page already receive?

A naive approach (pure semantic similarity) might match "mini pc" to an obscure product with perfect similarity but zero traffic. A better approach considers all three factors.

Two Matching Modes

We support two matching modes:

1. Bulk Matching (Original)

Every cluster finds its best match independently. Multiple clusters can match the same page:

  • Cluster A: "mini pc" → Product X (similarity 0.95)

  • Cluster B: "small computer" → Product X (similarity 0.93)

  • Cluster C: "compact desktop" → Product X (similarity 0.91)

This creates redundancy but ensures every cluster gets its best match.

2. Iterative Exclusive Routing (1:1 Mapping)

Clusters are processed by traffic score (highest first). Once a page is claimed, it's removed from the pool:

  • Cluster A (10K traffic): "mini pc" → Product X (claimed)

  • Cluster B (5K traffic): "small computer" → Product Y (X unavailable)

  • Cluster C (2K traffic): "compact desktop" → Product Z (X, Y unavailable)

This creates unique query pages with automatic fallback to next-best matches.

The Algorithm: Semantic Similarity

Step 1: Load Embeddings

We load pre-computed embeddings for:

  1. Query clusters: Center query from each cluster
  2. Source pages: Products, parts, articles (from Step 0)

Both use the same all-mpnet-base-v2 model, ensuring comparable embeddings.

Step 2: Compute Similarity

For each cluster, we compute cosine similarity to all source pages:

similarities = util.cos_sim(cluster_embedding, source_embeddings)[0]

This produces a similarity score for every source page. Cosine similarity ranges from -1.0 to 1.0 in general, but for these text embeddings scores fall almost entirely between 0.0 and 1.0.
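The same computation can be sketched without sentence-transformers, using plain NumPy (a simplified sketch, not the pipeline's actual code):

```python
import numpy as np

def cosine_similarities(cluster_embedding, source_embeddings):
    """Cosine similarity of one cluster vector against every source page vector.

    Equivalent to sentence_transformers.util.cos_sim(cluster, sources)[0].
    """
    a = np.asarray(cluster_embedding, dtype=np.float64)
    b = np.asarray(source_embeddings, dtype=np.float64)
    return (b @ a) / (np.linalg.norm(b, axis=1) * np.linalg.norm(a))

# Toy vectors: the first source page points the same way as the query.
sims = cosine_similarities([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
```

In production the BLAS-backed matrix multiply does this for all clusters at once, which is why the step is fast despite the large candidate pool.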

Step 3: Apply Discovery Scoring (Optional)

If discovery scoring is enabled, we combine three factors:

50:30:20 Discovery Scoring Model:

discovery_score = (similarity * 0.5) + (query_score * 0.3) + (page_rank * 0.2)

Where:

  • Similarity (50%): Semantic relevance (cosine similarity)

  • Query Score (30%): Normalized query traffic (impressions + clicks)

  • Page Rank (20%): Normalized page traffic (logarithmic scale)

This balances relevance with traffic potential.
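With illustrative component values (the numbers below are made up for the example), the weighting plays out like this:

```python
def discovery_score(similarity, norm_query_score, norm_page_rank):
    # 50:30:20 weighting of relevance, query traffic, and page traffic.
    return similarity * 0.5 + norm_query_score * 0.3 + norm_page_rank * 0.2

# Obscure product: near-perfect similarity, but no traffic anywhere.
obscure = discovery_score(0.98, 0.0, 0.0)
# Popular product: slightly lower similarity, strong traffic signals.
popular = discovery_score(0.95, 0.8, 0.6)
```

Here the popular product wins (0.835 vs 0.49) despite the lower raw similarity, which is exactly the behavior discovery scoring is designed to produce.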

Step 4: Normalize Scores

Before combining, we normalize each component to [0, 1]:

Query Score Normalization:

max_query_score = max(cluster.total_score for cluster in clusters)
norm_query_score = query_score / max_query_score

Page Rank Normalization (logarithmic to prevent high-traffic pages from dominating):

max_page_rank = max(traffic_index.values())
norm_page_rank = log1p(page_rank) / log1p(max_page_rank)

Logarithmic scaling prevents the home page (highest traffic) from dominating all matches.
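The query score normalization can be worked through with illustrative cluster scores (page rank normalization follows the log1p formula above):

```python
# Illustrative cluster traffic scores (impressions + clicks).
clusters = [{"query": "mini pc", "total_score": 10_000},
            {"query": "compact desktop", "total_score": 2_000}]

max_query_score = max(c["total_score"] for c in clusters)
norm_query_scores = {c["query"]: c["total_score"] / max_query_score
                     for c in clusters}
# The top cluster normalizes to 1.0; the smaller one to 0.2.
```

Linear normalization is fine for query scores because cluster traffic spans a narrower range than page traffic.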

Step 5: Apply Chassis Boost (Products Only)

For product matches, we boost scores based on chassis type:

  • Treo chassis: +10% (newest, most popular)

  • S-chassis: +5% (compact, high demand)

  • H-chassis: No boost (older, less popular)

if matched_type == "product":
    # Derive the chassis family from the SKU: "/p/Treo-N100-..." -> "Treo".
    sku = matched_key.replace("/p/", "")
    chassis_prefix = sku.split("-")[0]
    if chassis_prefix.startswith("Treo"):
        similarity *= 1.10  # Treo chassis boost: +10%
    elif chassis_prefix.startswith("S"):
        similarity *= 1.05  # S-chassis boost: +5%

This ensures newer products are prioritized when similarity is close.

Step 6: Select Best Match

Bulk Matching:

best_idx = int(similarities.argmax())
if similarities[best_idx] >= threshold:
    matches.append((cluster, source_pages[best_idx]))

Exclusive Routing:

# Sort clusters by traffic (highest first)
sorted_clusters = sorted(clusters, key=lambda c: c.total_score, reverse=True)

# ... (implementation details omitted)
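The omitted claiming loop can be sketched as follows. This is a simplified illustration of the behavior described above (greedy, highest-traffic first, with fallback), not the production implementation:

```python
def exclusive_route(sorted_clusters, similarity_rows, source_pages, threshold=0.80):
    """Greedy 1:1 routing: highest-traffic clusters claim pages first.

    similarity_rows[i][j] is the (possibly boosted) score of cluster i
    against source page j; claimed pages are skipped for later clusters.
    """
    claimed = set()
    matches = {}
    for i, cluster in enumerate(sorted_clusters):
        # Rank this cluster's candidates best-first.
        ranked = sorted(range(len(source_pages)),
                        key=lambda j: similarity_rows[i][j], reverse=True)
        for j in ranked:
            if j in claimed:
                continue  # fall back to the next-best unclaimed page
            if similarity_rows[i][j] < threshold:
                break  # remaining candidates are weaker still; no match
            claimed.add(j)
            matches[cluster] = source_pages[j]
            break
    return matches

# The second cluster falls back to /p/Y because /p/X is already claimed.
routes = exclusive_route(
    ["mini pc", "small computer"],
    [[0.95, 0.93], [0.94, 0.90]],
    ["/p/X", "/p/Y"],
)
```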

Configurable Threshold

The matching threshold determines how strict matching is:

  • 0.80 (default): Moderate strictness, most clusters match

  • 0.85: Stricter, fewer but higher-quality matches

  • 0.75: More lenient, more clusters match

The threshold can be adjusted without code changes via the config file.
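Loading the threshold might look like the sketch below. The file name and the `matching_threshold` key are hypothetical; the article doesn't show the real config layout:

```python
import json

def load_threshold(path, default=0.80):
    """Read the matching threshold from a JSON config file.

    Falls back to the default when the file is absent, so the pipeline
    still runs with sensible behavior out of the box.
    """
    try:
        with open(path) as fh:
            return json.load(fh).get("matching_threshold", default)
    except FileNotFoundError:
        return default

threshold = load_threshold("nonexistent_config.json")  # falls back to 0.80
```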

Threshold Analysis Mode

Before running the full pipeline, we can analyze threshold impact:

python 6_match_source_data.py --analyze-threshold

This generates samples across similarity ranges:

  • 0.90-1.00: Perfect matches

  • 0.80-0.90: Strong matches

  • 0.70-0.80: Moderate matches

  • 0.60-0.70: Weak matches

  • 0.50-0.60: Very weak matches

  • 0.40-0.50: Poor matches
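The sampling step amounts to bucketing matches by similarity band; a minimal sketch (page paths are illustrative):

```python
def bucket_samples(matches, edges=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Group (query, page, similarity) tuples into the review bands above."""
    buckets = {f"{lo:.2f}-{hi:.2f}": [] for lo, hi in zip(edges, edges[1:])}
    for query, page, sim in matches:
        for lo, hi in zip(edges, edges[1:]):
            # The top band is closed so a perfect 1.0 match isn't dropped.
            if lo <= sim < hi or (hi == 1.0 and sim == 1.0):
                buckets[f"{lo:.2f}-{hi:.2f}"].append((query, page, sim))
                break
    return buckets

b = bucket_samples([("mini pc", "/p/X", 0.95), ("tiny pc", "/p/Y", 0.72)])
```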

A web UI displays these samples for manual review. After selecting a threshold, resume the pipeline:

python 6_match_source_data.py --resume-after-threshold

This loads the threshold from the UI decision and completes matching.

Incremental Embedding

We cache embeddings for both queries and source pages. When new data arrives:

  1. Load existing embeddings
  2. Embed only new items
  3. Append to cache

This avoids re-embedding unchanged data. See Embedding Strategy for details.
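The cache logic above can be sketched like this (the fake length-based embedder stands in for the real model call):

```python
def embed_incrementally(items, cache, embed_fn):
    """Embed only items missing from the cache, then append them.

    `cache` maps item text -> embedding vector; `embed_fn` embeds a batch
    (a SentenceTransformer.encode call in the real pipeline).
    """
    new_items = [item for item in items if item not in cache]
    if new_items:
        for item, vector in zip(new_items, embed_fn(new_items)):
            cache[item] = vector
    return [cache[item] for item in items]

# Fake embedder for illustration: the "vector" is just the text length.
cache = {"mini pc": [7.0]}
vectors = embed_incrementally(["mini pc", "small computer"], cache,
                              lambda batch: [[float(len(t))] for t in batch])
```

Only "small computer" is embedded on this run; "mini pc" is served from the cache.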

Output Format

The matching produces a JSON file with statistics and matches:

{
  "stats": {
    "threshold": 0.80,
    ...
  },
  "matches": [
    ...
  ]
}

Matches are sorted by similarity (highest first).

Why Discovery Scoring?

Pure semantic similarity has limitations:

Problem 1: Obscure Products

  • Query: "mini pc" (10K traffic)

  • Best match: Obscure product (0.98 similarity, 0 traffic)

  • Better match: Popular product (0.95 similarity, 5K traffic)

Problem 2: Traffic Mismatch

  • High-traffic query → Low-traffic page (wasted opportunity)

  • Low-traffic query → High-traffic page (no added benefit; the page already ranks)

Discovery Scoring Solution:

  • Balances relevance (50%) with traffic potential (30% + 20%)

  • High-traffic queries match high-traffic pages

  • Low-traffic queries match niche pages

  • Maximizes overall traffic distribution

Performance Characteristics

On a typical server:

  • Processing time: ~20 minutes for 12.5K clusters × 5K source pages

  • Memory usage: ~1 GB (embeddings + similarity matrix)

  • CPU usage: High during similarity computation

The process is CPU-bound. Using NumPy with BLAS acceleration speeds up matrix operations significantly.

Integration with SEO Pipeline

Product matching is Step 6 in the SEO pipeline:

  1. Step 0: Embed Source Data - Products, parts, articles
  2. Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
  3. Step 2: Combine Queries - Merge all sources
  4. Step 3a: Generate Base Phrase Mappings - Initial filters
  5. Step 3b: Embed Queries - Convert to vectors
  6. Step 4: Expand Phrase Mappings - Find similar phrases
  7. Step 5: Cluster Queries - Group into pages
  8. Step 6: Match Products ← You are here
  9. Step 7: Build Query Pages - Generate HTML
  10. Step 8: Generate Related Searches - Find related queries
  11. Step 11: Migrate to Valkey - Load into search service

See SEO Pipeline Overview for the complete flow.

Bulk vs Exclusive: Which to Use?

Bulk Matching (original):

  • ✅ Every cluster gets its best match

  • ✅ Simple, predictable

  • ❌ Redundant query pages (multiple clusters → same product)

  • ❌ Wasted traffic potential

Exclusive Routing (1:1):

  • ✅ Unique query pages (no redundancy)

  • ✅ Automatic fallback to next-best matches

  • ✅ Better traffic distribution

  • ❌ Lower-traffic clusters may get suboptimal matches

  • ❌ More complex logic

Recommendation: Use Exclusive Routing for production. It creates unique, high-quality query pages with better traffic distribution.

Traffic Index and Logarithmic Scaling

The traffic index tracks page views for all source pages:

{
  "/p/Treo-N100-8-256-2H-W6-11P": 5000,
  "/p/S-i5-16-512-2H-W6-11P": 3000,
  "/": 50000
}

We use logarithmic scaling to prevent high-traffic pages from dominating:

norm_page_rank = log1p(page_rank) / log1p(max_page_rank)

Without this, the home page (50K traffic) would match every query. Logarithmic scaling compresses the range, giving all pages a fair chance.
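A quick numeric check with the traffic index above shows the compression at work:

```python
from math import log1p

traffic_index = {"/p/Treo-N100-8-256-2H-W6-11P": 5000, "/": 50000}
max_rank = max(traffic_index.values())
product_rank = traffic_index["/p/Treo-N100-8-256-2H-W6-11P"]

linear = product_rank / max_rank          # 0.10: the product barely registers
logged = log1p(product_rank) / log1p(max_rank)  # ~0.79: still competitive
```

A page with 10x less traffic than the home page scores 0.10 under linear scaling but roughly 0.79 under log scaling, so it can still win when its semantic similarity is strong.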

Chassis Boost Rationale

We boost Treo and S-chassis products because:

  • Treo: Newest chassis, best features, highest demand

  • S-chassis: Compact form factor, popular for mini PCs

  • H-chassis: Older, being phased out

When similarity is close (e.g., 0.90 vs 0.89), the boost ensures newer products win. This aligns with business priorities.


Summary

We match query clusters to products using semantic similarity with optional discovery scoring:

Semantic Similarity:

  • Compute cosine similarity between query and source embeddings

  • Threshold-based matching (default 0.80)

  • Chassis boost for Treo (+10%) and S-chassis (+5%)

Discovery Scoring (50:30:20):

  • 50% semantic relevance (cosine similarity)

  • 30% query traffic (normalized impressions + clicks)

  • 20% page rank (logarithmic scaling)

Two Modes:

  • Bulk: Every cluster gets best match (redundancy allowed)

  • Exclusive: 1:1 mapping, high-traffic clusters first (no redundancy)

Threshold Analysis:

  • Generate samples across similarity ranges

  • Manual review via web UI

  • Resume pipeline with selected threshold

The result is high-quality query-to-product matching that balances semantic relevance with traffic potential, creating effective SEO-optimized query pages.

