Building the Best Semantic Search Engine
By Shinhe Cho
Jan 18, 2023
Have you heard of semantic search? Or semantic understanding? If you’re responsible for driving online revenue for your business, chances are you’ve heard these terms before.
At Bloomreach, we define “semantic search” as a search engine built with semantic understanding. This goes beyond natural language processing (NLP), which is the ability of machines to understand text the way humans can. Semantic understanding is the ability to parse each word of a search query into its respective attribute and also make sense of ambiguous words. For example, for the query "women shirt dress," "shirt" is the style and "dress" is the product type. Conversely, for the query "women dress shirt," "dress" is the style and "shirt" is the product type.
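The word-order example above can be illustrated with a toy parser. This is only a sketch under a simplifying assumption: the last recognized product-type word is treated as the product type, and any earlier product-type word is treated as a style. The vocabularies and the `parse_query` function are hypothetical and are not Bloomreach's implementation.

```python
# Hypothetical vocabularies, for illustration only.
PRODUCT_TYPES = {"shirt", "dress", "shoes"}
AUDIENCES = {"women", "men", "kids"}


def parse_query(query: str) -> dict:
    """Assign each query word to an attribute using a head-noun heuristic.

    Assumption: the last product-type word in the query is the product type;
    earlier product-type words describe its style.
    """
    tokens = query.lower().split()
    type_positions = [i for i, t in enumerate(tokens) if t in PRODUCT_TYPES]
    head = type_positions[-1] if type_positions else None

    parsed = {"audience": None, "style": [], "product_type": None, "other": []}
    for i, token in enumerate(tokens):
        if i == head:
            parsed["product_type"] = token
        elif token in PRODUCT_TYPES:
            parsed["style"].append(token)
        elif token in AUDIENCES:
            parsed["audience"] = token
        else:
            parsed["other"].append(token)
    return parsed


print(parse_query("women shirt dress"))
print(parse_query("women dress shirt"))
```

The same two words swap roles depending on their position, which is the ambiguity a keyword-only engine cannot resolve.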
Now, the reality is that the market is full of noise on this topic. If you’re doing a Google search for “semantic search,” you’re likely looking for a new site search solution. But what you’ll find on the results page is a litany of vendors who tell you they have semantic search while trying to sell you their products. How are buyers expected to properly evaluate site search solutions without visibility into the search engine itself?
That’s the problem we’re solving for today.
Bloomreach has been a market leader in product discovery solutions for e-commerce for 10+ years. We are proud to have some of the largest enterprise customers in retail and distribution, with a growing share of mid-market customers to boot. Our customers stick with us year after year because our powerful semantic search engine lets us deliver real value in revenue dollars. But how does it actually work? And what’s really inside the engine? Read along to "go under the hood" and learn more about what really powers semantic search at Bloomreach.
How Semantic Search Works
Let’s start with a quick overview of how semantic search engines work. At the most basic level, search has two functions: retrieval and ranking.
- Retrieval — The process of finding the set of products in a catalog that matches what the user is searching for
- Ranking — The process of ordering the retrieved products in a particular sequence
The best search engines will optimize for both retrieval and ranking when trying to determine search intent. First, the search engine retrieves the most relevant products for a user query. Then, the products are ranked in such a way that meets customer and business needs.
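The two-stage flow described above can be sketched as follows. This is a minimal illustration, assuming a simple "every query term appears in the title" retrieval rule and a single `popularity` field for ranking; the product shape and the function names are hypothetical.

```python
def retrieve(query: str, catalog: list[dict]) -> list[dict]:
    """Retrieval: keep products whose title contains every query term."""
    terms = query.lower().split()
    return [
        product for product in catalog
        if all(term in product["title"].lower().split() for term in terms)
    ]


def rank(products: list[dict]) -> list[dict]:
    """Ranking: order the retrieved products, here by a single score."""
    return sorted(products, key=lambda product: product["popularity"], reverse=True)


def search(query: str, catalog: list[dict]) -> list[dict]:
    recall_set = retrieve(query, catalog)  # stage 1: what to show
    return rank(recall_set)                # stage 2: in what order


catalog = [
    {"title": "Black Leather Shoes", "popularity": 0.4},
    {"title": "Black Running Shoes", "popularity": 0.9},
    {"title": "Black Dress", "popularity": 0.7},
]
print([p["title"] for p in search("black shoes", catalog)])
```

Keeping the two stages separate means the matching rule and the ordering rule can be tuned independently.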
Now, let’s dive a bit deeper into these component parts.
Retrieval is generally evaluated along two dimensions: recall and precision.
- Recall — This measures how much of the relevant catalog the recall set captures for the user query. Generally, it is the number of relevant products retrieved divided by the total number of relevant products available. This metric does not penalize irrelevant products in the results.
- Precision — This measures how precise that recall set is for user intent. Generally, it is the number of relevant products retrieved divided by the total number of retrieved products. Precision has to be balanced against recall. For example, if you have 100 relevant products and your search delivers only 10 of them and nothing else, then you have 100% precision but only 10% recall, which isn’t ideal from a user experience perspective.
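Both metrics can be computed directly from two sets of product IDs. The sketch below reproduces the 100-relevant, 10-retrieved example; the `recall_and_precision` name and the `sku-N` identifiers are made up for illustration.

```python
def recall_and_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for one query."""
    hits = len(retrieved & relevant)  # relevant products that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


relevant = {f"sku-{i}" for i in range(100)}  # 100 relevant products in the catalog
retrieved = {f"sku-{i}" for i in range(10)}  # search returns only 10 of them

recall, precision = recall_and_precision(retrieved, relevant)
print(recall, precision)  # 0.1 1.0
```

Every returned product is relevant, so precision is perfect, but the shopper sees only a tenth of what they could have been shown.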
Ranking is based on a number of signals, from customer rules (e.g., boosting a specific brand) to product performance to what the algorithms are set up to optimize for. These signals lead to product scores that then determine the order of products.
Other search engines in the market today focus on various aspects of the search process. Some search vendors will apply basic algorithms that impact ranking (i.e., ranking optimization), which changes search results based on user behavior. Others are heavily rule-based or keyword-based, and focus on retrieval. These solutions typically have no actual query understanding or built-in intelligence. A team of developers will have to heavily curate and manually tune for things like synonyms, misspellings, assortment gaps, etc.
In both of the examples above, the AI investment is in one part of the search process or in one type of AI. This is like painting in a single color — so for example, while blue is a very popular color with a lot of versatility, if you only paint with blue, you're going to limit what you can achieve on a canvas.
In contrast, using a broad set of colors (or algorithms) and applying them where they make the most sense lets you achieve more. This is what Bloomreach has done with our core engine, and it’s how we’ve maintained our market-leading position in commerce search for more than 10 years.
How Bloomreach Does Semantic Search
Bloomreach’s search experts have set the standard for driving search revenue through a combination of intelligent algorithms and features that touch every part of the search process, with true semantic understanding at the core.
Let’s start from the beginning. There are three key inputs into the search engine that impact its performance:
- Customer inputs — Think of your business’ product data, merchandising rules, business priorities, etc.
- Bloomreach algorithms — Our proprietary AI-driven algorithms that we’ve developed over years of commerce experience and continually optimize
- Other inputs — This is data that we’re able to collect from your customers’ buying patterns and your product’s performance on your site
All three inputs are critical in ensuring an optimal search experience for your customers — but the second input, Bloomreach’s algorithms, is what sets our solution apart from the rest.
In particular, our customers start with day zero learnings. Even before a user starts typing a query into the search bar, you have already benefited from years of commerce-specific data that is incorporated into our algorithms and informs our search retrieval and ranking processes. There’s no waiting for a pixel to collect data or for your merchandisers to create rules. Our semantic engine is already parsing out attributes and applying known synonyms due to our vast commerce dataset.
Two Modes of Retrieval
Our many years in digital commerce have taught us that there is a fine balance to strike between recall and precision.
On the one hand, you always want to improve your recall set (i.e., the list of products you show a customer after they hit “enter”) by serving up the largest number of relevant results possible. On the other hand, you want to make sure each of those products is actually relevant to the customer’s search.
There are a number of reasons why a business would prioritize recall over precision, and vice versa. This is why we’ve created two distinct modes of retrieval that our customers can apply at the query level based on their unique business goals.
Bloomreach’s default retrieval mode prioritizes better recall. This mode takes your product data and applies specific algorithms to enhance the recall set, specifically:
- Semantic understanding — We apply semantic understanding from the beginning of the retrieval process by understanding customer intent and parsing product types and attributes from two sources:
  - Search query, or what the customer types into the search bar
  - Product catalog, or the data provided in your product catalog
- Spell correct — This set of algorithms is triggered when the original query has zero results but results exist for a similar query. The two algorithms are as follows:
  - Term frequency — This is the default mode and considers a term that appears more frequently in your catalog as the likely candidate for spell correction.
  - Closest match — This uses “edit distance,” or the minimum number of character edits needed to get from one term to another, to determine the candidate for spell correction.
- Query relaxation — This algorithm is activated once there are no exact matches found from the user’s initial search query. Semantic understanding recognizes the product type of the query and relaxes the query matching criteria from “match on all terms” to “match on one term” (e.g., the product type), thus reducing null results.
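The two spell-correct modes and query relaxation can be sketched in a few lines. This is an illustration under stated assumptions, not Bloomreach's algorithm: Levenshtein distance stands in for "edit distance," `max_distance` is an invented cutoff for what counts as a similar term, and the function names and catalog shape are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric): minimum inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def spell_correct(term, term_counts, mode="term_frequency", max_distance=2):
    """Pick a correction among catalog terms within max_distance (assumed cutoff)."""
    candidates = [t for t in term_counts if edit_distance(term, t) <= max_distance]
    if not candidates:
        return None
    if mode == "closest_match":
        return min(candidates, key=lambda t: edit_distance(term, t))
    # Default mode: prefer the term that appears most often in the catalog.
    return max(candidates, key=lambda t: term_counts[t])


def relax_query(query_terms, product_type, catalog):
    """Match on all terms; if nothing matches, fall back to the product type only."""
    exact = [p for p in catalog if all(t in p["terms"] for t in query_terms)]
    return exact or [p for p in catalog if product_type in p["terms"]]


term_counts = {"shirt": 120, "shorts": 300, "skirt": 80}  # term -> catalog frequency
print(spell_correct("shirtt", term_counts, mode="closest_match"))
print(spell_correct("shirtt", term_counts))  # default term-frequency mode

catalog = [
    {"id": 1, "terms": {"red", "dress"}},
    {"id": 2, "terms": {"blue", "dress"}},
    {"id": 3, "terms": {"red", "shoes"}},
]
print([p["id"] for p in relax_query(["purple", "dress"], "dress", catalog)])
```

In this toy data the two modes disagree: the closest match favors "shirt" (one edit away), while term frequency favors "shorts" because it is more common in the catalog. The relaxed query for "purple dress" returns both dresses instead of a null result.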
Bloomreach’s precision mode starts with the default mode and adds on layers of algorithms to prioritize better precision in the recall set, specifically:
- Search recall precision — This set of algorithms helps remove noisy product data from search results.
- Product type precision — Utilizing the product type uncovered by our semantic understanding, this algorithm identifies a set of product types that should be retained in the recall set for every query. For example, the recall set for “black shoes” will include all products that match the product type "shoes", the product type "boots" extracted from the synonym rule "shoes → boots", and the product types "heels" and "pumps" identified from user data.
- Category precision — This algorithm filters the recall set based on product type match and dominant category. Dominant categories are determined by the categories that the top products in the recall set belong to.
- Facet precision — This algorithm targets and removes facet noise in search results by relying on the products in the dominant categories of that query. For example, the top 50 products for the search query “dress” may belong to the “evening dress,” “maxi dress,” and “cocktail dress” categories. Perhaps the 78th product is a pair of dress shoes that fall under the “men’s shoes” and “women’s shoes” categories. With facet precision turned on, “men’s shoes” and “women’s shoes” would not appear as a facet in the recall set.
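The dominant-category idea behind category and facet precision can be sketched like this. It is a simplified illustration that assumes the recall set is already ranked and that "top products" means the first `top_n` results (50, as in the example above); the function names and product shape are hypothetical.

```python
def dominant_categories(ranked_products: list[dict], top_n: int = 50) -> set[str]:
    """Categories that the top-ranked products in the recall set belong to."""
    return {cat for product in ranked_products[:top_n] for cat in product["categories"]}


def category_facet_values(ranked_products: list[dict], top_n: int = 50) -> list[str]:
    """Category facet values to display, limited to the dominant categories."""
    dominant = dominant_categories(ranked_products, top_n)
    values = []
    for product in ranked_products:
        for cat in product["categories"]:
            if cat in dominant and cat not in values:
                values.append(cat)
    return values


dress_categories = ["evening dress", "maxi dress", "cocktail dress"]
# 77 dresses, followed by a pair of dress shoes as the 78th result for "dress".
recall_set = [{"categories": [dress_categories[i % 3]]} for i in range(77)]
recall_set.append({"categories": ["men's shoes", "women's shoes"]})

print(category_facet_values(recall_set))
```

Because the shoe categories never appear among the top 50 products, they are left out of the facet even though a shoe product is still in the recall set.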
Ranking Optimized for RPV
The ordering of the recall set, or ranking, is a critical part of the search process that aligns customer experience with your business goals. We know that users on your site are expecting a personalized search experience that delivers the results they want with each search. We also know that merchandisers are juggling multiple priorities that weigh inventory, promotions, brand requests, conversion goals, and more. Bloomreach’s ranking process takes into account these variables to produce numerical scores that help you understand how and why products are ranked the way they are.
First, the signals:
- Customer rules (hard boost/bury) — You may want all winter clothing boosted to the top row for every search query. Or, you may want produce that is nearly out of stock pushed down to the bottom of the results page. Bloomreach takes in those signals set by our customers and ranks products accordingly.
- Product performance and global performance — How well your products perform (product views, add-to-carts, conversions, and revenue) for given search queries is taken into account in our optimized ranking. Global performance consists of how well your products perform sitewide, regardless of search query.
- Personalization signals (1:1 and segment-based personalization) — Each visitor has a unique profile that gets updated in real time based on their on-site search, browse, and purchase behavior. These signals can be taken 1:1 or based on a user’s segment.
- Semantic understanding — Our underlying semantic understanding ranks the most relevant products at the top of the recall set by boosting products that match on both product type and product attributes.
- SKU-level intelligence — By collecting user behavior data at the SKU level, and not just at the product level, we are able to respond quickly to variant changes. For example, if a popular SKU goes out of stock, the corresponding ranking for the overall product can be lowered to account for the lack of availability.
These signals then produce several scores, each normalized from 1 to 100, that allow business users to quickly and efficiently understand the ranking algorithm at work and make adjustments as needed.
- The Performance Score — Based on the product performance and global performance signals, a “performance score” assesses how well a product has performed. Recent performance data is valued higher.
- The Relevance Score — Powered by our semantic understanding, the relevance score measures the match of the query to the product.
- The Personalization Score — With real-time 1:1 personalization, each visitor’s usage patterns are used to compute a personalization score that indicates a strong pattern. For example, if a user always engages with men’s products, then men’s products in the recall set will have a higher personalization score and those products will be ranked higher.
- The Merchandising Score — For customers using Bloomreach for merchandising, a merchandising score is attached to a product based on boost/bury rules. When you boost a product, Bloomreach increases the product's score, pushing it closer to the beginning of search results. Bloomreach doesn't ignore the product's performance data. Your boost or bury rule is just another signal that is used by the search algorithms to determine the final order of products in the grid.
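To make the "boost is just another signal" point concrete, here is a minimal sketch that blends the four scores into one ranking score. The article does not publish how the scores are combined, so the weighted average, the weights themselves, and the convention that a merchandising score of 50 means "no rule applied" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """The four scores described above, each on a 1-100 scale."""
    performance: float
    relevance: float
    personalization: float
    merchandising: float  # assumed: 50 is neutral, boosts raise it, buries lower it


# Hypothetical weights; not taken from the article.
WEIGHTS = {"performance": 3, "relevance": 4, "personalization": 2, "merchandising": 1}


def final_score(scores: Scores, weights: dict = WEIGHTS) -> float:
    """Weighted average of the four scores (assumed combination rule)."""
    total_weight = sum(weights.values())
    weighted = sum(getattr(scores, name) * weight for name, weight in weights.items())
    return weighted / total_weight


strong_seller = Scores(performance=80, relevance=70, personalization=50, merchandising=50)
boosted_item = Scores(performance=60, relevance=70, personalization=50, merchandising=100)

print(final_score(strong_seller))  # 67.0
print(final_score(boosted_item))   # 66.0
```

With these made-up weights, the boost lifts the second product's score but does not override the first product's stronger performance, which matches the behavior described for the merchandising score.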
From the beginning of a search query to the ongoing ranking optimization happening with each additional input, your search experience with Bloomreach is truly complete.
Further Developing The Best Semantic Search Engine
But that’s not all.
Beyond the core engine, our team of search experts has continued to develop features that solve the most common use cases for our customers. Read through the additional problems we’ve solved in search:
- Partial Part Number Search
- Automatic Query Filtering
- Relevance by Segment
- SKU Select
Let Bloomreach Drive Your Semantic Search
While we are proud of the engine we’ve built over the last decade, we’re also excited for what’s to come at Bloomreach.
We’re thrilled to continue to apply the latest machine learning technologies to our core engine. These developments will help our customers continue to create magical moments in e-commerce while maintaining a competitive edge in a crowded marketplace.
If you’re looking to uncover missed opportunities in search revenue and reduce manual tuning, you need a solution that is built for commerce and powered by the most intelligent algorithms in the market today. Schedule a personalized demo of Bloomreach Discovery for the best-of-breed tools in search, merchandising, recommendations, and search engine optimization, all in one unified solution.