Product Matching: Semantic Similarity with Discovery Scoring
This article explains how we match search query clusters to products, parts, and articles using semantic similarity with an optional discovery scoring model that balances relevance, traffic, and page rank.
The Problem: Finding the Best Match
When users search for "mini pc," we need to determine which product page best represents that query. We have thousands of products, parts, and articles—which one should we show?
The challenge is balancing multiple factors:
- Semantic relevance: How well does the page match the query meaning?
- Query traffic: How much search traffic does this query get?
- Page popularity: How much traffic does the page already receive?
A naive approach (pure semantic similarity) might match "mini pc" to an obscure product with perfect similarity but zero traffic. A better approach considers all three factors.
Two Matching Modes
We support two matching modes:
1. Bulk Matching (Original)
Every cluster finds its best match independently. Multiple clusters can match the same page:
- Cluster A: "mini pc" → Product X (similarity 0.95)
- Cluster B: "small computer" → Product X (similarity 0.93)
- Cluster C: "compact desktop" → Product X (similarity 0.91)
This creates redundancy but ensures every cluster gets its best match.
2. Iterative Exclusive Routing (1:1 Mapping)
Clusters are processed by traffic score (highest first). Once a page is claimed, it's removed from the pool:
- Cluster A (10K traffic): "mini pc" → Product X (claimed)
- Cluster B (5K traffic): "small computer" → Product Y (X unavailable)
- Cluster C (2K traffic): "compact desktop" → Product Z (X, Y unavailable)
This creates unique query pages with automatic fallback to next-best matches.
The Algorithm: Semantic Similarity
Step 1: Load Embeddings
We load pre-computed embeddings for:
- Query clusters: Center query from each cluster
- Source pages: Products, parts, articles (from Step 0)
Both use the same all-mpnet-base-v2 model, ensuring comparable embeddings.
Step 2: Compute Similarity
For each cluster, we compute cosine similarity to all source pages:
similarities = util.cos_sim(cluster_embedding, source_embeddings)[0]
This produces a similarity score for every source page. Cosine similarity can in principle range from -1.0 to 1.0, but with these sentence embeddings the scores fall effectively between 0.0 and 1.0.
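The `util.cos_sim` call comes from the sentence-transformers library; conceptually it reduces to a normalized dot product. A minimal NumPy sketch with toy vectors (not real embeddings) illustrates the computation:

```python
import numpy as np

def cos_sim(query: np.ndarray, pages: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of page vectors."""
    q = query / np.linalg.norm(query)
    p = pages / np.linalg.norm(pages, axis=1, keepdims=True)
    return p @ q  # one score per source page

cluster_embedding = np.array([1.0, 0.0, 1.0])
source_embeddings = np.array([
    [1.0, 0.0, 1.0],   # same direction  -> similarity 1.0
    [0.0, 1.0, 0.0],   # orthogonal      -> similarity 0.0
])
scores = cos_sim(cluster_embedding, source_embeddings)
print(scores.round(2))  # [1. 0.]
```

In the pipeline the same operation runs over the real 768-dimensional mpnet embeddings.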
Step 3: Apply Discovery Scoring (Optional)
If discovery scoring is enabled, we combine three factors:
50:30:20 Discovery Scoring Model:
discovery_score = (similarity * 0.5) + (query_score * 0.3) + (page_rank * 0.2)
Where:
- Similarity (50%): Semantic relevance (cosine similarity)
- Query Score (30%): Normalized query traffic (impressions + clicks)
- Page Rank (20%): Normalized page traffic (logarithmic scale)
This balances relevance with traffic potential.
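The 50:30:20 blend can be written directly as a function. The weights are the ones stated above; the inputs are assumed to already be normalized to [0, 1] (normalization is covered in Step 4):

```python
def discovery_score(similarity: float, query_score: float, page_rank: float) -> float:
    """50:30:20 blend of relevance, query traffic, and page traffic.
    All three inputs are expected to be normalized to [0, 1]."""
    return similarity * 0.5 + query_score * 0.3 + page_rank * 0.2

# An obscure page with near-perfect similarity but zero traffic...
obscure = discovery_score(similarity=0.98, query_score=0.8, page_rank=0.0)
# ...loses to a slightly less similar but popular page.
popular = discovery_score(similarity=0.95, query_score=0.8, page_rank=0.6)
print(obscure, popular)  # 0.73 vs 0.835
```

This is exactly the "mini pc" scenario from the introduction: a small page-rank contribution is enough to flip the ranking when similarity is close.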
Step 4: Normalize Scores
Before combining, we normalize each component to [0, 1]:
Query Score Normalization:
max_query_score = max(cluster.total_score for cluster in clusters)
norm_query_score = query_score / max_query_score
Page Rank Normalization (logarithmic to prevent high-traffic pages from dominating):
max_page_rank = max(traffic_index.values())
norm_page_rank = log1p(page_rank) / log1p(max_page_rank)
Logarithmic scaling prevents the home page (highest traffic) from dominating all matches.
Step 5: Apply Chassis Boost (Products Only)
For product matches, we boost scores based on chassis type:
- Treo chassis: +10% (newest, most popular)
- S-chassis: +5% (compact, high demand)
- H-chassis: No boost (older, less popular)
if matched_type == "product":
    sku = matched_key.replace("/p/", "")
    chassis_prefix = sku.split("-")[0]
    if chassis_prefix.startswith("Treo"):
        similarity *= 1.10
    elif chassis_prefix.startswith("S"):
        similarity *= 1.05
This ensures newer products are prioritized when similarity is close.
Step 6: Select Best Match
Bulk Matching:
best_idx = similarities.argmax()
if similarities[best_idx] >= threshold:
    matches.append(cluster → source_pages[best_idx])
Exclusive Routing:
# Sort clusters by traffic (highest first)
sorted_clusters = sorted(clusters, key=lambda c: c.total_score, reverse=True)
# ... (implementation details omitted)
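The omitted claiming loop can be sketched as follows. This is a simplified illustration, not the production code: `Cluster` is a hypothetical dataclass, and the precomputed `candidates` dict stands in for the similarity lookup described in Step 2:

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    total_score: float            # query traffic score
    candidates: dict = field(default_factory=dict)  # page -> similarity

def exclusive_route(clusters, threshold=0.80):
    """Greedy 1:1 routing: highest-traffic clusters claim pages first."""
    claimed, matches = set(), {}
    for cluster in sorted(clusters, key=lambda c: c.total_score, reverse=True):
        # Best still-unclaimed page above the threshold, if any
        available = {p: s for p, s in cluster.candidates.items()
                     if p not in claimed and s >= threshold}
        if available:
            page = max(available, key=available.get)
            claimed.add(page)
            matches[cluster.name] = page
    return matches

clusters = [
    Cluster("mini pc", 10_000, {"X": 0.95, "Y": 0.90}),
    Cluster("small computer", 5_000, {"X": 0.93, "Y": 0.91}),
]
print(exclusive_route(clusters))  # {'mini pc': 'X', 'small computer': 'Y'}
```

Note the automatic fallback: "small computer" would prefer X, but X is claimed, so it takes its next-best match Y.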
Configurable Threshold
The matching threshold determines how strict matching is:
- 0.80 (default): Moderate strictness, most clusters match
- 0.85: Stricter, fewer but higher-quality matches
- 0.75: More lenient, more clusters match
The threshold can be adjusted without code changes via the config file.
Threshold Analysis Mode
Before running the full pipeline, we can analyze threshold impact:
python 6_match_source_data.py --analyze-threshold
This generates samples across similarity ranges:
- 0.90-1.00: Perfect matches
- 0.80-0.90: Strong matches
- 0.70-0.80: Moderate matches
- 0.60-0.70: Weak matches
- 0.50-0.60: Very weak matches
- 0.40-0.50: Poor matches
A web UI displays these samples for manual review. After selecting a threshold, resume the pipeline:
python 6_match_source_data.py --resume-after-threshold
This loads the threshold from the UI decision and completes matching.
Incremental Embedding
We cache embeddings for both queries and source pages. When new data arrives:
- Load existing embeddings
- Embed only new items
- Append to cache
This avoids re-embedding unchanged data. See Embedding Strategy for details.
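The caching logic can be sketched as below. The `embed` callable stands in for the real all-mpnet-base-v2 encoder, and the cache here is a plain dict keyed by text; the production pipeline persists its cache to disk:

```python
def embed_incremental(texts, cache, embed):
    """Embed only texts missing from the cache, then append them."""
    new_texts = [t for t in texts if t not in cache]
    if new_texts:
        for text, vector in zip(new_texts, embed(new_texts)):
            cache[text] = vector
    return [cache[t] for t in texts]

# Toy encoder standing in for the sentence-transformers model;
# it records each batch so we can see what actually gets embedded.
calls = []
def fake_embed(batch):
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]

cache = {}
embed_incremental(["mini pc", "small computer"], cache, fake_embed)
embed_incremental(["mini pc", "compact desktop"], cache, fake_embed)
print(calls)  # [['mini pc', 'small computer'], ['compact desktop']]
```

The second run only embeds the one new item; "mini pc" is served from the cache.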
Output Format
The matching produces a JSON file with statistics and matches:
{
"stats": {
"threshold": 0.80,
# ... (implementation details omitted)
Matches are sorted by similarity (highest first).
Why Discovery Scoring?
Pure semantic similarity has limitations:
Problem 1: Obscure Products
- Query: "mini pc" (10K traffic)
- Best match: Obscure product (0.98 similarity, 0 traffic)
- Better match: Popular product (0.95 similarity, 5K traffic)
Problem 2: Traffic Mismatch
- High-traffic query → Low-traffic page (wasted opportunity)
- Low-traffic query → High-traffic page (unnecessary)
Discovery Scoring Solution:
- Balances relevance (50%) with traffic potential (30% + 20%)
- High-traffic queries match high-traffic pages
- Low-traffic queries match niche pages
- Maximizes overall traffic distribution
Performance Characteristics
On a typical server:
- Processing time: ~20 minutes for 12.5K clusters × 5K source pages
- Memory usage: ~1 GB (embeddings + similarity matrix)
- CPU usage: High during similarity computation
The process is CPU-bound. Using NumPy with BLAS acceleration speeds up matrix operations significantly.
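The BLAS speedup comes from expressing all cluster-page similarities as a single matrix product over pre-normalized embeddings, instead of looping per cluster. A sketch with random stand-in embeddings (real ones come from the mpnet model):

```python
import numpy as np

rng = np.random.default_rng(0)
clusters = rng.normal(size=(100, 768))   # stand-in for 12.5K cluster embeddings
pages = rng.normal(size=(50, 768))       # stand-in for 5K source-page embeddings

# L2-normalize rows so a plain dot product equals cosine similarity
clusters /= np.linalg.norm(clusters, axis=1, keepdims=True)
pages /= np.linalg.norm(pages, axis=1, keepdims=True)

# One BLAS-backed matmul computes the full similarity matrix
similarities = clusters @ pages.T        # shape (100, 50)
best_idx = similarities.argmax(axis=1)   # best page per cluster
print(similarities.shape, best_idx.shape)  # (100, 50) (100,)
```

At the real scale (12.5K × 5K × 768) this is a single large matrix multiplication, which is exactly the operation BLAS libraries optimize.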
Integration with SEO Pipeline
Product matching is Step 6 in the SEO pipeline:
- Step 0: Embed Source Data - Products, parts, articles
- Step 1: Fetch Queries - GSC, Google Ads, live, Algolia
- Step 2: Combine Queries - Merge all sources
- Step 3a: Generate Base Phrase Mappings - Initial filters
- Step 3b: Embed Queries - Convert to vectors
- Step 4: Expand Phrase Mappings - Find similar phrases
- Step 5: Cluster Queries - Group into pages
- Step 6: Match Products ← You are here
- Step 7: Build Query Pages - Generate HTML
- Step 8: Generate Related Searches - Find related queries
- Step 11: Migrate to Valkey - Load into search service
See SEO Pipeline Overview for the complete flow.
Bulk vs Exclusive: Which to Use?
Bulk Matching (original):
- ✅ Every cluster gets its best match
- ✅ Simple, predictable
- ❌ Redundant query pages (multiple clusters → same product)
- ❌ Wasted traffic potential
Exclusive Routing (1:1):
- ✅ Unique query pages (no redundancy)
- ✅ Automatic fallback to next-best matches
- ✅ Better traffic distribution
- ❌ Lower-traffic clusters may get suboptimal matches
- ❌ More complex logic
Recommendation: Use Exclusive Routing for production. It creates unique, high-quality query pages with better traffic distribution.
Traffic Index and Logarithmic Scaling
The traffic index tracks page views for all source pages:
{
  "/p/Treo-N100-8-256-2H-W6-11P": 5000,
  "/p/S-i5-16-512-2H-W6-11P": 3000,
  "/": 50000
}
We use logarithmic scaling to prevent high-traffic pages from dominating:
norm_page_rank = log1p(page_rank) / log1p(max_page_rank)
Without this, the home page (50K traffic) would match every query. Logarithmic scaling compresses the range, giving all pages a fair chance.
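A quick check with the traffic numbers from the index above shows the compression: under linear scaling a 5K-traffic Treo page scores only 0.1 of the home page, while under log scaling it scores roughly 0.79:

```python
from math import log1p

traffic = {"/": 50_000, "/p/Treo-N100-8-256-2H-W6-11P": 5_000}
max_rank = max(traffic.values())

# Linear normalization: the home page dwarfs everything else
linear = {p: v / max_rank for p, v in traffic.items()}
# Logarithmic normalization: the gap is compressed
logged = {p: log1p(v) / log1p(max_rank) for p, v in traffic.items()}

print(round(linear["/p/Treo-N100-8-256-2H-W6-11P"], 2))  # 0.1
print(round(logged["/p/Treo-N100-8-256-2H-W6-11P"], 2))  # 0.79
```

With only a 20% weight on page rank, a 0.79 vs 1.0 gap is small enough that semantic relevance still decides most matches.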
Chassis Boost Rationale
We boost Treo and S-chassis products because:
- Treo: Newest chassis, best features, highest demand
- S-chassis: Compact form factor, popular for mini PCs
- H-chassis: Older, being phased out
When similarity is close (e.g., 0.90 vs 0.89), the boost ensures newer products win. This aligns with business priorities.
References
Technical Concepts
- Semantic Similarity - Wikipedia
- Cosine Similarity - Wikipedia
- Logarithmic Scale - Wikipedia
- NumPy - Official documentation
- BLAS - Wikipedia
Model Documentation
- all-mpnet-base-v2 - Hugging Face
- Sentence Transformers - Official docs
Related Articles
- Embedding Strategy - How we generate embeddings
- Query Clustering - Grouping similar queries
- SEO Pipeline Overview - Complete pipeline architecture
- Related Search Generation - Finding related queries
- Embed Source Data - Embedding products, parts, articles
Summary
We match query clusters to products using semantic similarity with optional discovery scoring:
Semantic Similarity:
- Compute cosine similarity between query and source embeddings
- Threshold-based matching (default 0.80)
- Chassis boost for Treo (+10%) and S-chassis (+5%)
Discovery Scoring (50:30:20):
- 50% semantic relevance (cosine similarity)
- 30% query traffic (normalized impressions + clicks)
- 20% page rank (logarithmic scaling)
Two Modes:
- Bulk: Every cluster gets best match (redundancy allowed)
- Exclusive: 1:1 mapping, high-traffic clusters first (no redundancy)
Threshold Analysis:
- Generate samples across similarity ranges
- Manual review via web UI
- Resume pipeline with selected threshold
The result is high-quality query-to-product matching that balances semantic relevance with traffic potential, creating effective SEO-optimized query pages.