How to Build a Search Engine: 7 Things You Need to Know

Mike Cassidy

So, you’d like to build your own search engine. We’d like to save you the trouble. And it is trouble, as you no doubt know.

Of course it’s possible to go DIY on a search engine project. There are powerful starter kits out there — Solr, for instance. You can build a fine site search engine with Solr, provided you have the right people, sufficient time, and enough money.

But you also need a tolerance for risk and opportunity cost. Building your own site search takes time, meaning you are likely losing out on revenue as you design, build, and tune your site search engine.

As you work out the bugs and as the system you’ve built chugs along, it’s likely the search experience you’re offering will be subpar, meaning dissatisfied customers and the need to win them back once your search engine is running at an acceptable level.

In short, here’s everything you need to know before building your own search engine:

Manually Building a Search Engine’s Brain Takes Time

The best way to think about building your own site search engine is to consider what you actually get with out-of-the-box solutions. Sure, Solr is scalable right off the shelf. It’s also a proven performer.

But think about search and what it takes to power the kind of results that are relevant and personalized down to the one-to-one level. Reaching that optimal level of customer experience requires:

  • Sophisticated algorithms

  • Vast amounts of data

  • A cloud-based infrastructure that is customized for your particular search system

You know what you won’t find in the bottom of your big, new box of Solr? Sophisticated algorithms, vast amounts of data, and the infrastructure you need to build a powerful site search engine.

In fact, developing the algorithms, gathering the data, and designing the system to effectively use these to anticipate the intent of digital consumers is what puts the “do” in a do-it-yourself Solr search engine.

Solr on its own isn’t optimized to rank by revenue. It can’t rank by using personalization based on customer intent, behavior, and affinities. It’s not designed to provide discovery beyond site search. It doesn’t come loaded with data regarding products, synonyms, or buyer intent. And it can’t extract content. In fact, it’s fair to say that out-of-the-box Solr will get you about 20% of the way to where you need to be to do search right.

If you want a Solr site search engine to do all those necessary things — rank by revenue, personalize, achieve semantic understanding, understand user behavior — you need to tell it how. You need to build the engine’s brain. Or more likely, a team of people needs to build the engine’s brain.

And that’s done with algorithms. Manually building a search engine’s brain takes time — a lot of time. 

Building Your Search Engine Can Quickly Exhaust All Your Resources

Take synonyms, for instance. Obviously a robust synonym thesaurus is key to site search. When a consumer types “crimson, knee-length, spandex party dress” into a site search box, the system needs to know that for that individual, a Herve Leger, thin-strap, bandage dress with a strappy leather harness belt is one of the products that the customer would be highly interested in.

In fact, when we asked consumers to describe that very dress, 500 people came up with a mind-boggling series of combinations that included 129 words for “red,” 275 different descriptions of the belt, 105 descriptions of the length, and 216 words to name the occasion at which one would wear the dress.

[Figure: Data collected from users trying to describe a red dress for search results]

And you know how a Solr search system knows that? You tell it. Or a team of people you hire works on teaching the machine that “deep rouge” can mean red and that “corset belt” can mean strappy leather harness belt.
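To make the "you tell it" part concrete, here is a minimal sketch of hand-maintained synonym normalization. The `SYNONYMS` table and the `normalize_query` function are hypothetical names invented for this illustration, not part of Solr or any other product, and a real table would need thousands of entries rather than three.

```python
import re

# Hypothetical, hand-curated synonym table (phrase -> canonical term).
# Every entry here is something a person had to think of and type in.
SYNONYMS = {
    "deep rouge": "red",
    "crimson": "red",
    "corset belt": "strappy leather harness belt",
}


def normalize_query(query: str, synonyms: dict[str, str] = SYNONYMS) -> str:
    """Rewrite known phrases to their canonical form, longest phrase first."""
    text = " ".join(query.lower().split())
    for phrase in sorted(synonyms, key=len, reverse=True):
        canonical = synonyms[phrase]
        # Word boundaries keep "crimson" from matching inside a longer word.
        pattern = rf"\b{re.escape(phrase)}\b"
        text = re.sub(pattern, lambda _m, c=canonical: c, text)
    return text


print(normalize_query("Deep Rouge dress with corset belt"))
# red dress with strappy leather harness belt
```

The code is the easy part. The table is where the hours go, because each new way of saying "red" is another row someone has to add by hand.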

Never mind that there aren’t enough hours in the day for humans to come up with hundreds of variations of a half dozen words that consumers might use to describe a dress. Wouldn’t you rather they were doing something better with their time?

We think so, too.

Personalized Search Results Are Key for Success

Solr is an impressive, scalable site search platform — as far as it is able to go. The challenge with using it as the foundation of a do-it-yourself site search engine is that it doesn’t go far enough.

In particular, Solr out of the box doesn’t include the algorithms, data, and infrastructure needed to build the kind of search that consumers demand today. Today’s digital consumers want personalized and relevant search results. They expect the Google experience on every site: Type in what you’re looking for using your own words to describe something that can be described in a million different ways, and presto — just what you were looking for.

Now let’s take a look at the data that a best-in-class search engine needs to provide the sort of customer experience that consumers expect.

A 2018 Internet Retailer report uncovered that the number one challenge experienced with current site search was that “customers often see irrelevant results or results in the wrong order” and listed “personalized results” as the top feature needed in a modern search solution.

[Figure: Poll results for the types of features users feel are needed for a modern search solution]

In order to provide each individual user with personalized and relevant results, a search engine needs to understand user intent, product data, and user behavior. The best search engines come to that understanding by constantly learning from data in each of those three categories.

A search engine needs to be able to understand synonyms because consumers use different words to describe the same thing, and they often use words that are different from a retailer’s product descriptions. Shoes, for instance, can be “low-cut boots” or “high-top sneakers.”

The engine needs to know that word stems can come with all sorts of attachments — “ing,” “ed,” “s” — that can dramatically change their meanings. “Linens,” for instance, are not two “linen.” In fact, linens are a product and linen is a product attribute. It also needs to understand all the ways your customers search, including numerical searches. 
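The linens example can be sketched in a few lines. This is a deliberately crude suffix stripper, assuming a hypothetical `PROTECTED` set of terms that must never be stemmed. The names `naive_stem` and `analyze` are made up for this illustration, and production stemmers are considerably more careful.

```python
# Hypothetical list of terms whose suffix carries meaning and must be kept.
PROTECTED = {"linens"}


def naive_stem(token: str) -> str:
    """Strip one common suffix, leaving at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(query: str, protected: set[str] = PROTECTED) -> list[str]:
    """Lowercase, split on whitespace, and stem everything not protected."""
    return [t if t in protected else naive_stem(t) for t in query.lower().split()]


print(naive_stem("linens"))      # linen  (the product became an attribute)
print(analyze("linens"))         # ['linens']
print(analyze("linen shirts"))   # ['linen', 'shirt']
```

Without the protected list, a search for the product "linens" collapses into the attribute "linen." With it, someone has to decide, word by word, which terms belong on the list.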

[Figure: A breakdown of a search result that identifies items, materials, and dimensions as different attributes]

And what about acronyms, slang, and spelling errors, the definitions of which depend on context? Are you looking for a “dress,” a “dress shirt,” or a “shirt dress” (which is also spelled “shirtdress”)?

Which brings us to the importance of commerce data. Out-of-the-box Solr isn’t big on brands or colors or sizes. When a customer searches for a “red valentino dress,” is red a color? Is valentino a style? Or is “Red Valentino” a brand?

It makes a difference. And given the price range of Red Valentino products, you want your search engine to know it’s a brand.
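One common way to settle the "red valentino" question is longest-match tagging against a dictionary of known phrases. The sketch below assumes a hypothetical `LEXICON` that someone has already populated with brands, colors, and product types. The names and the labels are ours, chosen for illustration only.

```python
# Hypothetical commerce lexicon: phrase -> attribute label.
LEXICON = {
    "red valentino": "brand",
    "valentino": "brand",
    "red": "color",
    "dress": "product_type",
    "dress shirt": "product_type",
    "shirt dress": "product_type",
}
MAX_PHRASE_LEN = max(len(phrase.split()) for phrase in LEXICON)


def tag_query(query: str) -> list[tuple[str, str]]:
    """Greedy left-to-right tagging that always prefers the longest phrase."""
    tokens = query.lower().split()
    tagged = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i : i + n])
            if phrase in LEXICON:
                tagged.append((phrase, LEXICON[phrase]))
                i += n
                break
        else:
            # No dictionary entry starts at this token.
            tagged.append((tokens[i], "unknown"))
            i += 1
    return tagged


print(tag_query("red valentino dress"))
# [('red valentino', 'brand'), ('dress', 'product_type')]
print(tag_query("red dress shirt"))
# [('red', 'color'), ('dress shirt', 'product_type')]
```

Because the two-word entry wins over the one-word entry, "red" is read as part of the brand in the first query and as a color in the second. The catch is the same as before: the tagger only knows what the lexicon tells it.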

Here’s why: It turns out consumers who use site search — especially those searching for Red Valentino — are among an enterprise’s most valuable customers. But they won’t stick around if they become frustrated by poor site search, according to RealDecoy in its report, Endeca vs. Bloomreach: increasing conversions with site search.

The report cites Forrester research that found that 90 percent of site searchers do not read past the first page of results — and that searchers will often just give up if they are frustrated by poor results.

You Need a Self-learning Site Search

Even if a site search engine is revving full throttle on user-intent data and product data, you’re still not even close to peak performance unless your search engine is also processing behavioral data that provides insights into product performance and experience personalization.

Knowing how visitors are engaging with your site is crucial to understanding how to best serve them. Are they browsing or using site search? How did they arrive at your site in the first place? What search queries are they using? What products are they looking at? What other products are they viewing in the same session? What are they adding to their carts? What are they buying?

The answers to those questions begin to build a pile of information, such as:

  • The most popular queries — on your site, on the web, on mobile, and on social.

  • The most popular products, again broken down by channel.

  • The performance of each individual product on your digital properties.

  • The way a product performs for a given query.

  • A list of products that are similar to each other. This helps a site serve up recommendations, including products that are too new to have a digital track record.

  • Products that are popular within specific categories.

  • The most often rewritten queries.

  • The most clicked-on links on the site.

Without the ability to gather and process the data that provides that detail of information, your search engine will not be able to continuously learn and constantly improve. The goal is above-and-beyond search relevance, and without all these insights you’re stuck with a subpar search engine. 
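As a rough illustration of where that pile of information comes from, here is a minimal sketch that rolls raw behavioral events up into two of the items above: the most popular queries and how each product performs for a given query. The `Event` record and its field names are assumptions made for this example, and a real pipeline would handle sessions, channels, and far larger volumes.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical clickstream record."""
    query: str
    action: str                      # "search", "click", or "purchase"
    product_id: Optional[str] = None


def popular_queries(events, top_n=3):
    """Count how often each query was searched and return the top ones."""
    counts = Counter(e.query for e in events if e.action == "search")
    return counts.most_common(top_n)


def query_product_stats(events):
    """Clicks and purchases for every (query, product) pair."""
    stats = defaultdict(lambda: {"click": 0, "purchase": 0})
    for e in events:
        if e.action in ("click", "purchase"):
            stats[(e.query, e.product_id)][e.action] += 1
    return dict(stats)


events = [
    Event("red dress", "search"),
    Event("red dress", "click", "p1"),
    Event("red dress", "purchase", "p1"),
    Event("red dress", "click", "p2"),
    Event("shoe", "search"),
    Event("red dress", "search"),
]
print(popular_queries(events, top_n=1))
# [('red dress', 2)]
print(query_product_stats(events)[("red dress", "p1")])
# {'click': 1, 'purchase': 1}
```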

Superior Site Search Requires Superior Infrastructure

In some ways, we’ve been a little hard on Solr in terms of whether it’s better to build or buy to transform your site search for the current digital age.

Solr was never meant to be a site search engine — at least not without a lot of work and a number of add-on modules. Which, when you think about it, is exactly the point of this article: You need the right infrastructure to make your search engine work. Because a site search engine does not live on a platform like Solr alone.

We’ve already talked about the need to add algorithms and data to your brand new, out-of-the-box Solr. But another big element in creating a site search engine that will keep your customers happy and coming back is the addition of modules — systems that help you index, configure, and rank your search results in the right way to provide an excellent customer experience.

Think of out-of-the-box Solr as an elegant and beautifully framed house, but without the main systems or finishing touches. The house needs rooms — a kitchen, bathrooms, bedrooms. Maybe you’re a study, wine-cellar, and home-theater type. If so, it will need those, too.

Setting your level of luxury aside, the point is, just like that fabulously framed house needs rooms, Solr as is needs modules.

Enterprise search requires much more than Solr alone offers. Solr on its own is not a cluster management system. It’s not a configuration management system. It does not provide basic relevancy out of the box. And these are the kind of tools that make site search really shine.

You Can’t Scale and Manage Your Search Without Merchandising Analytics

A search engine needs to be able to scale with your business and sharpen its ability to deliver better, more meaningful results, which is impossible to do without merchandising analytics. Solr doesn’t have these capabilities from the get-go, and it does not provide an interface to boost and bury products, important moves merchandisers make to improve site search that rely on their knowledge, experience, and intuition.
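Stripped of any interface, boost and bury comes down to reordering the engine's results according to a merchandiser's rules. The sketch below is one simple way to do that, assuming a hypothetical `apply_merchandising` function and plain product IDs. It is not how any particular product implements the feature.

```python
def apply_merchandising(ranked, boost=(), bury=()):
    """Move boosted products to the top and buried products to the bottom.

    Python's sort is stable, so products within each group keep the order
    the search engine gave them. If an ID appears in both lists, boost wins.
    """
    boost, bury = set(boost), set(bury)

    def bucket(product_id):
        if product_id in boost:
            return 0
        if product_id in bury:
            return 2
        return 1

    return sorted(ranked, key=bucket)


engine_order = ["a", "b", "c", "d"]
print(apply_merchandising(engine_order, boost=["c"], bury=["a"]))
# ['c', 'b', 'd', 'a']
```

The reordering is trivial. The real work is the tooling around it: letting merchandisers define the rules, preview the effect, and measure whether the move helped.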

All those features need to be designed and built. In the case of building your own with Solr, you need experts who understand the infrastructure — experts who understand that Solr requires a lot of management in order to scale.

To do this on your own, you need a team of engineers to build modules that move, store, and process all that data we talked about earlier.

One way to think about the necessary modules is to break their functions down into three general areas: data science, merchandising, and inventory.

Data science covers all the learning and results-sharpening capabilities of a search engine that can only be derived by leaning on analytics. Search engines rely on user behavior data and models. Your engine needs to infer from all the historic searches and previous query performances whether users searching for “shoe” are really looking for a sandal, or if they’re searching for a pump or sneaker, instead.
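The "shoe" inference can be sketched as a simple frequency estimate over past behavior. This assumes a hypothetical history of `(query, category)` pairs, where the category is whatever the shopper went on to click or buy. Real models are far richer, but the idea is the same.

```python
from collections import Counter


def category_intent(history, query):
    """Share of past outcomes per category for one query (empty if unseen)."""
    counts = Counter(category for q, category in history if q == query)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical history: what shoppers ended up choosing after each query.
history = [
    ("shoe", "sneaker"),
    ("shoe", "sneaker"),
    ("shoe", "sandal"),
    ("shoe", "pump"),
    ("boot", "boot"),
]
print(category_intent(history, "shoe"))
# {'sneaker': 0.5, 'sandal': 0.25, 'pump': 0.25}
```

An engine could use those shares to decide which categories to rank first for "shoe." Note that it has nothing to say about a query it has never seen, which is why the volume of data matters so much.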

The engine needs natural language processing models so that it can understand the business the site is in and understand the feed of products it’s working with. Beyond Solr, a series of machines is required to store and process all this seemingly random, but vitally important, information.

Understanding user intent, and building a site that can respond to that intent, requires a way to capture real-time clickstream activity. And none of the data is any good if there is not an efficient way to load it into the system, so the system can learn from it.

Merchandising Teams Need the Right Tools to Act 

The other two general functions of modules — merchandising and inventory — are just as crucial for your search engine.

Merchandising is one of the most important strategies for ecommerce, and it’s one that you should invest in to get right. The merchandising teams responsible for promoting conversions on a digital site need the right tools to translate their strategies into action. They need a system for writing dynamic business rules that they can quickly modify when needed. They need testing capabilities that help them determine whether the moves they make are the right ones. And they need diagnostic tools to monitor site performance and to determine the root cause of issues and trends.

Systems also need to be built to manage fluctuations in site traffic. Think of an ecommerce site during the holiday shopping season — the infrastructure supporting digital sites needs to be able to scale up quickly and scale back down when extra capacity is no longer needed.  

And inventory modules are just as crucial for your merchandising teams. Without the most up-to-date and relevant data on your stock, they can’t optimize promotions or customer journeys with any degree of certainty.

If you’re building your own engine with a platform like Solr, it needs an ETL (extract, transform, and load) system to gather data from sources, like a retailer’s catalog, and enter it into the search engine in a process called feed ingestion. It will also need a scalable data storage system, one that can cope with constantly changing data feeds because of constantly changing catalogs. Plus, it needs to be a distributed system that can automatically handle the needs of a dynamic market, one in which a retailer, for instance, might suddenly find that it needs to triple the size of its catalog overnight.
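For a sense of what feed ingestion involves at its very smallest, here is a toy extract, transform, and load pass. It assumes a hypothetical CSV feed with `sku`, `title`, `price`, and `stock` columns and uses a plain dictionary as a stand-in for the search index. A production system would add validation, error handling, and distributed storage on top of this.

```python
import csv
import io


def extract(feed_text):
    """Extract: parse the raw CSV feed into one dict per row."""
    return list(csv.DictReader(io.StringIO(feed_text)))


def transform(row):
    """Transform: clean up strings and convert fields to usable types."""
    return {
        "id": row["sku"].strip(),
        "title": row["title"].strip(),
        "price": float(row["price"]),
        "in_stock": int(row["stock"]) > 0,
    }


def load(index, docs):
    """Load: upsert each document, so re-ingested products are overwritten."""
    for doc in docs:
        index[doc["id"]] = doc
    return index


def ingest(index, feed_text):
    return load(index, (transform(row) for row in extract(feed_text)))


feed = "sku,title,price,stock\nA1, Red Dress ,120.50,3\nB2,Linen Sheet,40,0\n"
index = ingest({}, feed)
print(index["A1"])
# {'id': 'A1', 'title': 'Red Dress', 'price': 120.5, 'in_stock': True}

# A later feed changes the price; the same pass picks it up.
ingest(index, "sku,title,price,stock\nA1,Red Dress,99,1\n")
print(index["A1"]["price"], len(index))
# 99.0 2
```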

A site search engine also needs an indexing system that can work at top speed. The nature of digital commerce today means that product details, like prices and stock levels, are constantly changing. To keep up with consumers and the competition, a site needs to be able to quickly respond to changing inventory to remain relevant.  

Building such a system can be done, but it’s a daunting task. By our calculations, for a mid- to large-size retailer trying to build its own high-quality, Solr-based site search engine, it would take 30 to 40 engineers as long as two years. 

That’s obviously a tremendous investment of both time and money. But it also represents a tremendous opportunity cost. While the work of building a high-quality search engine is going on, the business’ customers are suffering with an inferior search experience. And the search engine can’t begin to fully learn from customer interactions until the engine is up and running.

None of this is a trivial matter, especially when up to 30% of consumers use site search. And those consumers are among an enterprise’s most valuable customers, given their higher propensity to convert. On the flip side, customers who get no results for their site search queries are three times more likely to leave a site than are others.

It’s no wonder that teams responsible for selling products and providing content on the web agonize over major changes to their site search, and particularly over the decision to build or buy when it comes time for a major overhaul.

Bloomreach Completes the Commerce Experience

If you’re still considering building your search engine, let Bloomreach save you the trouble. Bloomreach Discovery offers a powerful combination of AI-powered site search, SEO, recommendations, and product merchandising so you can deliver the perfect results to your customers — all without having to build it yourself. 

If you’re interested in learning more, schedule a personalized demo today.


Mike Cassidy

Lead Storyteller

Mike has defined the voice for fast-growing businesses establishing new markets in ways that elevate their brands and attract buyers at every stage of the buying experience.

He has also written stories about the rapid evolution of ecommerce, the power of machine learning, and the fascinating challenges that face enterprises today.
