{"id":30174,"date":"2024-05-27T21:13:15","date_gmt":"2024-05-27T20:57:00","guid":{"rendered":"https:\/\/www.bloomreach.com\/library\/%library_type%\/evaluating-ai-your-guide-to-using-golden-test-sets"},"modified":"2024-06-14T19:52:01","modified_gmt":"2024-06-14T19:52:01","slug":"evaluating-ai-your-guide-to-using-golden-test-sets","status":"publish","type":"library","link":"https:\/\/www.bloomreach.com\/en\/blog\/evaluating-ai-your-guide-to-using-golden-test-sets","title":{"rendered":"Evaluating AI: Your Guide to Using Golden Test Sets"},"content":{"rendered":"<p dir=\"ltr\">With the influx of new AI capabilities that vendors are adding to their software, ecommerce leaders need a better way to evaluate which tech fits their use cases. There\u2019s no such thing as the best AI for all companies \u2014 there\u2019s only the AI that\u2019s best for you, your data, and your customers.<\/p>\n<p dir=\"ltr\">Most companies are still using RFPs to evaluate AI tools, but the standard feature checklist (with an explanation of each feature) isn\u2019t effective at measuring whether the AI is delivering actual value. All vendors have access to essentially the same foundational models \u2014 the difference will come from how they have fine-tuned those models and applied them to problems worth solving.<\/p>\n<p dir=\"ltr\">The normal tactic of asking for case studies and references doesn\u2019t necessarily work either, not only because these features are new and unproven, but also because their results depend heavily on the product and customer data you feed them.<\/p>\n<p dir=\"ltr\">Instead of trying to make a better RFP, ecommerce leaders should borrow the concept of a golden test set from AI model companies and adapt it for their business. 
Let\u2019s dive in.<\/p>\n<p dir=\"ltr\"><img decoding=\"async\" src=\"https:\/\/www.bloomreach.com\/wp-content\/uploads\/2024\/06\/evaluation-ai-in-article-3.png\" \/><\/p>\n<h2 dir=\"ltr\">What Is a Golden Test Set?<\/h2>\n<p dir=\"ltr\">A golden test set is a collection of test scenarios that has been curated by human experts and is used to evaluate AI model performance. The set covers the key scenarios that the AI is designed to handle. So, whenever a company releases a new version of its AI model, it tests the model\u2019s performance against this test set so the company can evaluate progress over time in a standardized and objective way. This makes it easier to make informed decisions about which model is the best to deploy.<\/p>\n<p dir=\"ltr\">To make this work for ecommerce product discovery, companies should create their own golden set of test questions and scenarios that they can test vendors against. While this is immediately useful for anyone planning to replace their search technology soon, it\u2019ll also be useful for future-proofing your processes. As vendors get more aggressive in reaching out about their latest AI innovations, this will be an easy way to benchmark them against current performance.<\/p>\n<p dir=\"ltr\">Perhaps most importantly, it\u2019ll help ground you as current and future vendors tell you how they\u2019re incorporating new AI models into their technology. By testing against your own golden set, you can more easily determine whether the vendor\u2019s technology actually creates business results for you.<\/p>\n<h2 dir=\"ltr\">How To Create Your Golden Set<\/h2>\n<p dir=\"ltr\">When creating your own version of a golden set for product discovery, you can certainly start from scratch.
However, if you\u2019d like some guidance and advice, try using Bloomreach\u2019s recommended set (detailed below) and tweaking it for your needs.<\/p>\n<h3 dir=\"ltr\">Step 1: Analyze Your Queries<\/h3>\n<p dir=\"ltr\">Most search engines are sophisticated enough to apply different AI strategies to different types of queries. A simple example is head queries vs. long-tail queries. Head queries with one or two words are easy to return results for with semantic or even keyword search, whereas tail queries benefit from broad approaches like vector embeddings.<\/p>\n<p dir=\"ltr\">Therefore, your golden set should include a broad set of query types that represent the searches that actually happen on your site. You can analyze your search data to figure this out. If you need a starting place, here are some common types of queries:<\/p>\n<ul>\n<li value=\"1\">Top queries by search volume<\/li>\n<li value=\"2\">Top queries by most revenue<\/li>\n<li value=\"3\">Tail and torso (2+ words in the query)<\/li>\n<li value=\"4\">Random sampling<\/li>\n<li value=\"5\">Null searches<\/li>\n<li value=\"6\">High exit rate queries<\/li>\n<li value=\"7\">High bounce rate queries<\/li>\n<li value=\"8\">Low\/no revenue queries<\/li>\n<li value=\"9\">Low ATC rate search queries<\/li>\n<li value=\"10\">Queries with no product clicks<\/li>\n<li value=\"11\">Queries with recent increases in traffic<\/li>\n<li value=\"12\">Queries with recent decreases in performance<\/li>\n<\/ul>\n<h3 dir=\"ltr\">Step 2: Determine Your Split<\/h3>\n<p dir=\"ltr\">Once you\u2019ve analyzed your query types, you\u2019ll need to decide how many examples of each one you want to include in your golden set.<\/p>\n<p dir=\"ltr\">Your first instinct might be to simply go by percentage. That is, if 80% of your queries are head queries, then that\u2019s how many should appear in your golden set. 
However, it\u2019s not quite that straightforward.<\/p>\n<p dir=\"ltr\">Some query types are trickier for AI technology to get right than others, and those are the ones you want to make sure have good representation within your golden set.<\/p>\n<p dir=\"ltr\">To return to our head queries example \u2014 most search engine technology can handle these well even without sophisticated AI. You should include some in your golden set just to be sure, but you certainly don\u2019t need 80%.<\/p>\n<p dir=\"ltr\">Here\u2019s how Bloomreach typically creates a golden set when running relevance reports for our customers:<\/p>\n<ul>\n<li value=\"1\">Top 500 no revenue queries<\/li>\n<li value=\"2\">Top 500 null search queries<\/li>\n<li value=\"3\">Top 1,000 queries by revenue<\/li>\n<li value=\"4\">100 torso queries &#8211; random sampling<\/li>\n<li value=\"5\">100 long-tail queries &#8211; random sampling<\/li>\n<li value=\"6\">200 misspelled queries<\/li>\n<li value=\"7\">Bottom 100 queries by revenue<\/li>\n<\/ul>\n<p><img decoding=\"async\" src=\"https:\/\/www.bloomreach.com\/wp-content\/uploads\/2024\/06\/evaluating-ai-in-article-1.png\" \/><\/p>\n<p dir=\"ltr\">Once your golden set is complete, you can move on to running your evaluation.<\/p>\n<h2 dir=\"ltr\">How To Run Your Evaluation<\/h2>\n<h3 dir=\"ltr\">Methodology<\/h3>\n<ol dir=\"ltr\">\n<li>Run your current technology against the test set and create your benchmark. Note: some of the benchmark results will return a negative result by design (e.g., null search queries). This is intentional. In the next step, you\u2019ll see whether the new technology you\u2019re evaluating can beat that zero benchmark.<\/li>\n<li>We recommend using what we call the \u201ceyeball test\u201d to score your results. Type in a query \u2014 when you look at the search results it returns, are they what you expect to see? Another way to think of it is, are these the results a rational human would expect to come back for this query? 
If so, give it a score of 1. If not, give it a score of 0.<\/li>\n<li>Score all vendors you\u2019re evaluating against the same golden set. To read the results, simply look for the highest scores for each query type. There won\u2019t necessarily be a clear winner \u2014 some vendors excel at particular query types but have weak performance in others. You\u2019ll need to decide which types are most critical for your business needs. Since you\u2019re using your own data, you should be able to make a strong guess at which improvements will translate into the most revenue for you (and the vendor can likely provide guidance here as well).<\/li>\n<\/ol>\n<p><img decoding=\"async\" src=\"https:\/\/www.bloomreach.com\/wp-content\/uploads\/2024\/06\/evaluating-ai-in-article-2.png\" \/><\/p>\n<p><b><strong>Example scorecard:<\/strong><\/b> This demonstrates that, across most categories, Vendor A consistently returns better results than the benchmark.<\/p>\n<p dir=\"ltr\">The above process is somewhat subjective, but then again, so is relevance. What one person considers a good result for their customers, another might find confusing. It\u2019s okay to pick the result that\u2019s biased toward the customer experience you want to achieve for your business.<\/p>\n<p dir=\"ltr\">A note on ranking \u2014 you may be wondering at this point how to evaluate ranking optimizations and tools. All vendors should offer some level of AI to automatically optimize ranking for a particular goal (usually revenue). This is easier to evaluate with an RFP since the vendor should offer specific tools that you can see in a demo or try for yourself in a POC environment. 
After you test a vendor\u2019s AI against your golden set, you can highlight the weakest areas and ask the vendor to show you how the tools they provide will help you optimize those areas further.<\/p>\n<h2 dir=\"ltr\">Working With Vendors\u2019 Evaluation Processes<\/h2>\n<p dir=\"ltr\">Not all vendors offer the same testing capabilities in pre-sales. Here are some tips for asking a vendor to work with your golden set during the evaluation process.<\/p>\n<p dir=\"ltr\">If you\u2019re offered a sandbox, run the test yourself without configuring any rules. Remember, the goal of the golden test set is to evaluate the core AI. Every solution will offer some options to improve on top of the AI foundation, but by comparing out-of-the-box (OOTB) results without configuration, you can compare the AI performance apples to apples across vendors.<\/p>\n<p dir=\"ltr\">If you\u2019re offered a POC \u2014 something where the vendor stands up a demo site to show your results with their AI \u2014 ask them either to run the golden test set or to give you access to the demo so you can run it yourself.<\/p>\n<p dir=\"ltr\">If you\u2019re only offered demos that the vendor controls, try asking if they\u2019ve ever done any internal benchmarks of their AI against any of the scenarios you\u2019ve chosen to focus on. If the answer is no, maybe AI isn\u2019t a core function of that company.<\/p>\n<h2 dir=\"ltr\">Put Your Data to the Test Against Bloomreach Discovery<\/h2>\n<p dir=\"ltr\">If you\u2019re looking for a new product discovery solution, <a href=\"https:\/\/www.bloomreach.com\/en\/products\/discovery\">Bloomreach Discovery<\/a> is your best bet to drive impactful results fast.
Powered by <a href=\"https:\/\/www.bloomreach.com\/en\/products\/loomi\">Loomi<\/a>, an AI built for ecommerce and fine-tuned with 14+ years of data, Bloomreach Discovery empowers you to improve the metrics that matter most for your business.<\/p>\n<p dir=\"ltr\">See how your golden test set fares with Bloomreach Discovery by requesting a <a href=\"https:\/\/visit.bloomreach.com\/search-impact-validation-bloomreach\" target=\"_blank\" rel=\"noopener\">search impact validation<\/a> today.<\/p>\n<p dir=\"ltr\">&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>With the influx of new AI capabilities that vendors are adding to their software, ecommerce leaders need a better way to evaluate which tech fits their use cases. There\u2019s no such thing as the best AI for all companies \u2014 there\u2019s only the AI that\u2019s best for you, your data, and your customers. Most companies [&hellip;]<\/p>\n","protected":false},"author":101,"featured_media":30175,"template":"","ew-regions":[],"ew-solutions":[],"library_type":[513],"library_blog_tag":[362,357],"industry":[],"channel":[],"topic":[283],"class_list":["post-30174","library","type-library","status-publish","has-post-thumbnail","hentry","library_type-blog","library_blog_tag-ai-and-innovation","library_blog_tag-best-practices","topic-ai"],"acf":{"library_blog_banner_content":"","library_blog_banner_cta1_text":"","library_blog_banner_cta1_href":"","library_blog_banner_cta1_new_tab":false,"library_blog_banner_cta2_text":"","library_blog_banner_cta2_href":"","library_blog_banner_cta2_new_tab":false,"library_blog_banner_bg_color":"#EAF7FE","library_blog_banner_cta_text_color":"#FFF","library_blog_banner_cta_bg_color":"#019ACE","library_blog_banner_cta2_text_color":"#000","library_blog_banner_cta2_bg_color":"#FFF","library_blog_chatgpt_content":"","library_blog_chatgpt_cta_href":"","library_blog_chatgpt_cta_text":"Ask 
ChatGPT"},"_links":{"self":[{"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/library\/30174","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/library"}],"about":[{"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/types\/library"}],"author":[{"embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/users\/101"}],"version-history":[{"count":0,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/library\/30174\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/media\/30175"}],"wp:attachment":[{"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/media?parent=30174"}],"wp:term":[{"taxonomy":"ew_regions","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/ew-regions?post=30174"},{"taxonomy":"ew_solutions","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/ew-solutions?post=30174"},{"taxonomy":"library_type","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/library_type?post=30174"},{"taxonomy":"library_blog_tag","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/library_blog_tag?post=30174"},{"taxonomy":"industry","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/industry?post=30174"},{"taxonomy":"channel","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/channel?post=30174"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/www.bloomreach.com\/en\/wp-json\/wp\/v2\/topic?post=30174"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}