Building Your Own Search Engine: 7 Things You Need to Know Before Starting
By Mike Cassidy
Oct 15, 2019
21 min read
So, you’d like to build your own search engine. We’d like to save you the trouble. And it is trouble, as you no doubt know.
Of course it’s possible to go DIY on a search engine project. There are powerful starter kits out there — Solr, for instance. You can build a fine site search engine with Solr, provided you have the right people, sufficient time and enough money.
Oh, and a tolerance for risk and opportunity cost. Building your own site search takes time, meaning you are likely losing out on revenue as you design, build and tune your site search engine.
As you work out the bugs and as the system you’ve built plugs along, it’s likely the search experience you’re offering will be subpar, meaning dissatisfied customers and the need to win them back once your search engine is running at an acceptable level.
In short, here are 7 things you need to know before building your own search engine:
Manually Building a Search Engine’s Brain Takes Time
A good way to think about building your own site search engine is to consider what you get with Solr out of the box. Sure, it’s scalable right off the shelf. It’s also a proven performer.
But think about search and what it takes to power the kind of results that are relevant and personalized down to the one-to-one level. Reaching that optimal level of customer experience requires:
sophisticated algorithms
vast amounts of data
a cloud-based infrastructure that is custom designed for your particular search system
You know what you won’t find in the bottom of your big, new box of Solr? Sophisticated algorithms, vast amounts of data and the infrastructure you need to build a powerful site search engine.
In fact, developing the algorithms, gathering the data and designing the system to effectively use these to anticipate the intent of digital consumers is what puts the “do” in a do-it-yourself Solr search engine.
Solr on its own isn’t optimized to rank by revenue. It can’t rank by using personalization based on customer intent, behavior and affinities. It’s not designed to provide discovery beyond site search. It doesn’t come loaded with data regarding products, synonyms, buyer intent. It can’t extract content. In fact, it’s fair to say that out-of-the-box Solr will get you about 20 percent of the way to where you need to be to do search right.
If you want a Solr site search engine to do all those necessary things — rank by revenue, personalize, achieve semantic understanding, understand user behavior — you need to tell it how. You need to build the engine’s brain. Or more likely, a team of people needs to build the engine’s brain.
And that’s done with algorithms. Manually building a search engine’s brain takes time. A lot of time — a lot of building and trying and testing.
Manually Building Your Search Engine Can Quickly Exhaust All Your Resources
Take synonyms, for instance. A robust synonym thesaurus is key to site search. When a consumer types “crimson, knee-length, spandex, party dress” into a site search box, the system needs to know that for that individual, a Herve Leger, thin-strap, bandage dress with strappy leather harness belt is one of the products that the customer would be highly interested in.
In fact, when we asked consumers to describe that very Herve Leger dress, 500 people came up with a mind-boggling series of combinations that included 129 words for “red,” 275 different descriptions of the belt, 105 descriptions of the length and 216 words to name the occasion at which one would wear the dress.
And you know how a Solr search system knows that? You tell it. Or a team of people you hire works on teaching the machine that “deep rouge” can mean red and that “corset belt” can mean strappy leather harness belt.
Never mind that there aren’t enough hours in the day for humans to come up with hundreds of variations of a half dozen or so words that consumers might use to describe a dress. Wouldn’t you rather they were doing something better with their time?
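To make the manual approach concrete, here is a minimal sketch of a hand-curated synonym table in Python. The table contents and the `normalize_query` helper are assumptions for illustration only, not Solr's API; the point is that every phrase has to be entered by a person.

```python
import re

# Hypothetical, hand-curated synonym table: each shopper phrase maps to the
# canonical term used in the product catalog.
SYNONYMS = {
    "deep rouge": "red",
    "crimson": "red",
    "corset belt": "strappy leather harness belt",
}

def normalize_query(query: str, synonyms: dict[str, str] = SYNONYMS) -> str:
    """Replace every known shopper phrase with its canonical catalog term."""
    normalized = query.lower()
    # Replace longer phrases first so multi-word entries win over shorter ones.
    for phrase in sorted(synonyms, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        normalized = re.sub(pattern, synonyms[phrase], normalized)
    return normalized

print(normalize_query("Deep rouge dress with corset belt"))
```

Three entries cover three phrasings. Covering the 129 words for “red” from the survey above means 129 entries for one color of one dress, which is where the time goes.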
Personalized Search Results Are Key for Success
Solr is an impressive, scalable site search platform — as far as it is able to go. The challenge with using it as the foundation of a do-it-yourself site search engine is that it doesn’t go far enough.
In particular, Solr out-of-the-box doesn’t include the algorithms, data and infrastructure needed to build the kind of search that consumers demand today. Today’s digital consumers want personalized and relevant search results. They expect the Google experience on every site: type in what you’re looking for, using your own words to describe something that can be described in a million different ways, and presto — just what you were looking for.
Now let’s take a look at the data that a best-in-class search engine needs to provide the sort of customer experience that consumers expect.
An earlier Internet Retailer report uncovered that the number one challenge experienced with current site search was that, “Customers often see irrelevant results or results in the wrong order” and listed “Personalized results” as the top feature needed in a modern search solution.
In order to provide each individual user with personalized and relevant results, a search engine needs to understand user intent, product data and user behavior. The best search engines come to that understanding by constantly learning based on data in each of those three categories.
A search engine needs to be able to understand synonyms, because consumers use different words to describe the same thing and they often use words that are different from a retailer’s product descriptions. Shoes for instance can be “low-cut boots” or “high-top sneakers.”
The engine needs to know that word stems can come with all sorts of attachments (“ing,” “ed,” “s”) that can dramatically change their meanings, or not. “Linens,” for instance, are not two “linen.” In fact, linens are a product and linen is a product attribute. It also needs to understand all the ways your customers search, including numerical searches.
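A deliberately crude sketch shows why suffix handling needs product knowledge. The `PROTECTED` list and the `naive_stem` function are assumptions made up for this example; production stemmers are far more careful, but they face the same “linens” problem.

```python
# Hypothetical exception list: plural-looking words that name their own product.
PROTECTED = {"linens", "jeans", "glasses"}

def naive_stem(word: str) -> str:
    """Strip a common suffix unless the word is a protected product term."""
    word = word.lower()
    if word in PROTECTED:
        return word
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for term in ("sneakers", "linens", "linen"):
    print(term, "->", naive_stem(term))
```

Without the exception list, “linens” would collapse into “linen” and a shopper looking for bedding would get linen shirts.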
And what about acronyms, slang, spelling errors and words whose definitions depend on context? Are you looking for a “dress,” a “dress shirt” or a “shirt dress,” which is also spelled “shirtdress”?
Which brings us to the importance of product data. Out-of-the-box Solr isn’t big on brands or colors or sizes. When a customer searches for a “red valentino dress,” is red a color? Is valentino a style? Or is “Red Valentino” a brand?
It makes a difference.
And given the price range of Red Valentino products, you want your search engine to know it’s a brand.
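One common way to handle this is to match the longest known phrase first. The sketch below assumes small hypothetical dictionaries (`BRANDS`, `COLORS`, `PRODUCT_TYPES`) and a made-up `tag_query` helper; a real system would derive these vocabularies from the catalog and from behavior data.

```python
# Hypothetical dictionaries; a real system would derive these from the catalog.
BRANDS = {"red valentino", "herve leger"}
COLORS = {"red", "crimson", "black"}
PRODUCT_TYPES = {"dress", "shirt dress", "dress shirt", "shirtdress"}

# Order matters when two dictionaries match a span of equal length.
DICTIONARIES = [("brand", BRANDS), ("product_type", PRODUCT_TYPES), ("color", COLORS)]

def tag_query(query: str) -> list[tuple[str, str]]:
    """Label query tokens, preferring the longest dictionary phrase at each position."""
    tokens = query.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest span starting at token i first.
        for end in range(len(tokens), i, -1):
            phrase = " ".join(tokens[i:end])
            for label, vocabulary in DICTIONARIES:
                if phrase in vocabulary:
                    match = (phrase, label, end)
                    break
            if match:
                break
        if match:
            phrase, label, end = match
            tagged.append((phrase, label))
            i = end
        else:
            tagged.append((tokens[i], "unknown"))
            i += 1
    return tagged

print(tag_query("red valentino dress"))
print(tag_query("red dress"))
```

With the brand dictionary in place, “red valentino dress” is read as a brand plus a product type, while “red dress” is still read as a color plus a product type.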
Here’s why: It turns out consumers who use site search — especially those searching for Red Valentino — are among an enterprise’s most valuable customers, according to Econsultancy. But they won’t stick around if they become frustrated by poor site search, says RealDecoy in its report, “Endeca vs. Bloomreach: taking site search to a new level.”
The report cites Forrester research that found that 90 percent of site searchers do not read past the first page of results — and that searchers will often just give up if they are frustrated by poor results.
You Need a Self-Learning Site Search
Even if a site search engine is revving full throttle on user intent data and product data, you’re still not even close unless your search engine is also processing behavioral data that provides insights into product performance and experience personalization.
Knowing how visitors are engaging with your site is crucial to understanding how to best serve them. Are they browsing or using site search? How did they arrive at your site in the first place? What search queries are they using? What products are they looking at? What other products are they viewing in the same session? What are they adding to their carts? What are they buying?
The answers to those questions begin to build a pile of information, such as:
The most popular queries — on your site, on the web, on mobile and on social.
The most popular products, again broken down by channel.
The performance of each individual product on your digital properties.
The way a product performs for a given query.
A list of products that are similar to each other. This helps a site serve up recommendations, including products that are too new to have a digital track record.
Products that are popular within specific categories.
The most often rewritten queries.
The most clicked-on links on the site.
Without the ability to gather and process the data that provides that detail of information, your search engine will not be able to continuously learn and constantly improve. Face it, nobody wants a dumb search engine.
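As a rough illustration of what gathering and processing that data involves, the sketch below rolls a handful of clickstream events up into query popularity and per-query product performance. The event format, the sample events and both function names are assumptions for this example.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (query, product_id, action).
EVENTS = [
    ("red dress", "sku-1", "view"),
    ("red dress", "sku-1", "add_to_cart"),
    ("red dress", "sku-2", "view"),
    ("red dress", "sku-1", "purchase"),
    ("sandals", "sku-3", "view"),
]

def summarize(events):
    """Count events per query and actions per (query, product) pair."""
    events_per_query = Counter()
    performance = defaultdict(Counter)
    for query, product_id, action in events:
        events_per_query[query] += 1
        performance[(query, product_id)][action] += 1
    return events_per_query, performance

def conversion_rate(performance, query, product_id):
    """Purchases divided by views for one product under one query (0.0 if never viewed)."""
    actions = performance.get((query, product_id), Counter())
    views = actions["view"]
    return actions["purchase"] / views if views else 0.0

events_per_query, performance = summarize(EVENTS)
print(events_per_query.most_common(1))
print(conversion_rate(performance, "red dress", "sku-1"))
```

This toy version runs in memory on five events. The hard part of the real thing is doing the same roll-up continuously, across channels, on every event your site generates.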
Superior Site Search Requires Superior Infrastructure
In some ways we’ve been a little hard on Solr in this blog about whether it is better to build or buy when it comes to transforming your site search for the current digital age.
Solr was never meant to be a site search engine — at least not without a lot of work and a number of add-on modules. Which, when you think about it, is exactly the point of parts one and two of this series.
A site search engine, you see, does not live by Solr alone.
Previously, we talked about the need to add algorithms and data to your brand-new, out-of-the-box Solr. Another big building block in creating a site search engine that will keep your customers happy and coming back is infrastructure.
OK, it’s not the sexiest thing on earth. But it’s vital if your Solr system is going to be successful. Solr is a search server. As such, it’s a key part of a search engine, with the emphasis on part.
Think of Solr out-of-the-box as an elegant and beautifully framed house — without the main systems or finishing touches. The house needs rooms — a kitchen, bathrooms, bedrooms. Maybe you’re a study, wine-cellar and home-theater type. If so, it will need those, too.
Setting your level of luxury aside, the point is, just like that fabulously framed house needs rooms, Solr out-of-the-box needs modules.
Enterprise search requires much more than Solr alone offers. Solr on its own is not a cluster management system. It’s not a configuration management system. It does not provide basic relevancy out-of-the-box.
Solr Doesn't Come with Merchandising Analytics
Solr does not provide an interface to boost and bury products — moves merchandisers make after relying on their knowledge, experience and intuition. Solr doesn’t come with merchandising analytics.
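The logic behind boost and bury can be sketched in a few lines; the interface, auditing and analytics around it are the real work. The rule format (a score multiplier per product ID) and the `apply_merchandising_rules` name below are assumptions for illustration, not a Solr feature.

```python
# Hypothetical rule format: a multiplier per product ID. Values above 1.0 boost
# a product, values between 0 and 1.0 bury it.
RULES = {"sku-2": 2.0, "sku-1": 0.5}

def apply_merchandising_rules(results, rules=RULES):
    """Re-rank (product_id, relevance_score) pairs using boost and bury multipliers."""
    adjusted = [
        (product_id, score * rules.get(product_id, 1.0))
        for product_id, score in results
    ]
    # Highest adjusted score first; sorted() is stable, so ties keep their original order.
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

engine_results = [("sku-1", 0.9), ("sku-2", 0.6), ("sku-3", 0.7)]
print([product_id for product_id, _ in apply_merchandising_rules(engine_results)])
```

Here the merchandiser's rules move sku-2 from last place to first and push sku-1 to the bottom, regardless of what the engine's relevance score said.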
All those features need to be designed and built. If you are building your own Solr-based search, you need experts who understand the infrastructure, experts who understand that Solr requires a lot of management in order to scale.
You need a team of engineers to build modules to move, store and process all that data we talked about earlier.
One way to think about the necessary modules is to break their functions down into three general areas:
Data Science: A site search engine can’t learn without leaning on analytics. Search engines rely on user behavior data and models. When users search for “shoe” is it really a sandal that they’re after — or maybe a pump?
The engine needs natural language processing models so that it can understand the business the site is in and understand the feed of products it’s working with. Beyond Solr, a series of machines is required to store and process all this seemingly random, but vitally important, information.
Understanding users’ intent and building a site that can respond to that intent requires a way to capture real-time clickstream activity. And none of the data is any good if there is not an efficient way to load it into the system, so the system can learn from it.
Merchandising Teams Need Tools to Act
Merchandising: Teams responsible for promoting conversions on a digital site need the tools to translate their strategies into action. They need a system for writing dynamic business rules that they can quickly modify when needed. They need testing platforms that help them determine whether the moves they make are the right ones. They need diagnostic tools to monitor site performance and to determine the root cause of issues and trends.
Systems need to be built to manage fluctuations in site traffic. Think of an e-commerce site during the holiday shopping season. The infrastructure supporting digital sites needs to be able to scale up quickly and scale back down when extra capacity is no longer needed.
Inventory: Solr on its own needs an ETL — or extract, transform and load — system to gather data from say, a retailer’s catalog, and enter it into the search engine in a process called feed ingestion. It will also need a scalable data storage system, one that can cope with constantly changing data feeds because of constantly changing catalogs.
It needs to be a distributed system that can automatically handle the needs of a dynamic market, one in which a retailer, for instance, might suddenly find that it needs to triple the size of its catalog overnight.
A site search engine also needs an indexing system that can work at top speed. The nature of commerce today means that product details — prices for instance, and inventory itself, are constantly changing. To keep up with consumers and the competition, a site needs to be able to quickly respond to changing inventory to remain relevant.
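A toy version of the feed-ingestion path described above might look like the following. Everything here is an assumption for illustration: the CSV feed layout, the field names and the `InMemoryIndex` class, which stands in for a real search index and does not use Solr's API.

```python
import csv
import io
from collections import defaultdict

# Hypothetical catalog feed; a real feed would come from a file or an API.
FEED = """id,title,price,in_stock
sku-1,Red Bandage Dress,1290.00,yes
sku-2,Linen Shirt Dress,89.50,no
"""

def extract(feed_text):
    """Extract: parse raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(feed_text)))

def transform(row):
    """Transform: normalize types so the index stores clean values."""
    return {
        "id": row["id"],
        "title": row["title"].strip(),
        "price": float(row["price"]),
        "in_stock": row["in_stock"].strip().lower() == "yes",
    }

class InMemoryIndex:
    """Toy stand-in for a search index: documents plus an inverted index on title terms."""

    def __init__(self):
        self.documents = {}
        self.postings = defaultdict(set)

    def load(self, document):
        """Load: add or replace a document, so a repeated feed updates prices and stock."""
        old = self.documents.get(document["id"])
        if old:
            for term in old["title"].lower().split():
                self.postings[term].discard(old["id"])
        self.documents[document["id"]] = document
        for term in document["title"].lower().split():
            self.postings[term].add(document["id"])

    def search(self, term):
        """Return the sorted IDs of documents whose title contains the term."""
        return sorted(self.postings.get(term.lower(), set()))

def ingest(feed_text, index):
    """Run extract, transform and load over one feed."""
    for row in extract(feed_text):
        index.load(transform(row))

index = InMemoryIndex()
ingest(FEED, index)
print(index.search("dress"))
```

Running `ingest` again with a changed feed replaces the affected documents in place. Doing that quickly, for a catalog that triples overnight, is where the distributed storage and fast indexing come in.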
Building such a system can be done, but it’s a daunting task. By our calculations, a mid- to large-size retailer building its own high-quality, Solr-based site search engine would need 30 to 40 engineers for as long as two years.
Of course that represents a tremendous cost in time and money. But it also represents a tremendous opportunity cost. While the work of building a high-quality search engine is going on, the business’ customers are suffering with an inferior search experience. And the search engine cannot begin to fully learn from customer interactions until the engine is up and running.
None of which is a trivial matter. Econsultancy has said that up to 30 percent of consumers use site search. And those consumers are among an enterprise’s most valuable customers, given their higher propensity to convert. On the flip side, customers who get no results for their site search queries are three times more likely to leave a site than are others.
It’s no wonder that teams responsible for selling products and providing content on the web agonize over major changes to their site search — and particularly over the decision to build or buy when it comes time for a major overhaul.