Padmini Jaikumar

Jul 2, 2018

Inside the BloomReach Algorithm - Using Machine Learning to Understand Page Templates

 

In this article, Apurva Pathak and Padmini Jaikumar talk about using machine learning to automatically extract the unique underlying templates on an e-commerce website from millions of webpages.

E-commerce today is the undisputed driver for growth in the retail industry, with the number of pages on some merchants’ sites reaching over a million pages. BloomReach’s platform helps bring more shoppers to the merchant’s site by improving SEO, and then helps shoppers find the products they are interested in by understanding their intent better and personalizing the search experience.

BloomReach’s platform works by building a deep understanding of the merchant’s site. The BloomReach bots crawl the merchant’s site every day. The BloomReach parser then runs on the pages that were crawled, extracting information of interest such as the product title, category trees, description, inventory etc.

Parsing logic for a merchant needs to work across the merchant’s site which may have hundreds of thousands of pages. To add to the complexity, the structure of the pages is constantly changing. This blog post talks about BloomReach’s work in automatically learning the underlying templates on a merchant’s site. Pages belonging to a template can then be parsed using the same parsing logic.

At a high level, the pipeline first extracts features of interest from pages that are crawled. Feature cleaning is then performed to improve clustering quality. The cleaned features are clustered to determine sets of pages with a similar structure. Finally, outlier pages are removed, i.e. pages that are not too similar to the cluster they have been assigned. These outliers are then clustered again so that rare templates can be found - rare templates are the templates that account for very few pages on the site. It is important that such rare templates are found, since they may account for a significant fraction of the traffic on the site, for example, the home page or the category pages that can be navigated from the home page. The pipeline outputs clusters of pages with the same underlying template. The pipeline also sends notifications when templates change (which causes our parsing coverage to change), new templates are added or templates are deprecated.

The sections below discuss each step of the algorithm in greater detail.

 

Site Template Extraction

To give a flavour of the algorithm, visual results are presented below of the templates extracted for a large e-commerce merchant with tens of thousands of web-pages.

The tags for the cluster (Category Pages with Product Listings) etc., are manually written here for explanation purposes.

 

Template Cluster 1: Category Pages with Product Listings

Sample Page 1

Screen Shot 2016-09-06 at 10.43.28 PM.png

Sample Page 2

Screen Shot 2016-09-06 at 10.43.46 PM.png

Sample Page 3

Screen Shot 2016-09-06 at 10.44.09 PM.png

Sample Page 4

Screen Shot 2016-09-06 at 10.44.25 PM.png


Template 2: Category pages with different sub-categories for a category

Sample Page 1

Screen Shot 2016-09-06 at 11.23.04 PM.png

Sample Page 2

Screen Shot 2016-09-06 at 11.22.54 PM.png

Sample Page 3

Screen Shot 2016-09-06 at 11.22.45 PM.png

Sample Page 4

Screen Shot 2016-09-06 at 11.22.33 PM.png

 

Template 3: Product Pages with options

This template cluster brings together product pages which have multiple options for selection.

Sample Page 1

Screen Shot 2016-09-06 at 11.06.16 PM.png

Sample Page 2

Screen Shot 2016-09-06 at 11.06.06 PM.png

Sample Page 3

Screen Shot 2016-09-06 at 11.05.55 PM.png

Sample Page 4

Screen Shot 2016-09-06 at 11.05.47 PM.png

 

Template 4: Product Pages with no selection criteria

This template brings together product pages that don’t have selection criteria - ex: size, colour etc.

Sample Page 1

Screen Shot 2016-09-06 at 10.54.48 PM.png

Sample Page 2

Screen Shot 2016-09-06 at 10.55.01 PM.png

Sample Page 3

Screen Shot 2016-09-06 at 10.55.09 PM.png

Sample Page 4

Screen Shot 2016-09-06 at 10.55.20 PM.png

 

Inside the BloomReach Algorithm

 

The template learning algorithm comprises of the following steps:

  • Feature Extraction

  • Learn Main Templates

  • Learn Rare Templates

 

Feature Extraction

The aim with feature extraction is to obtain features with good quality and coverage to cluster webpages that have similar templates together. The features are extracted from the static HTML of the pages. Multiple features were considered for clustering including ID names and class names from the HTML tags, number of links on the page, CSS style-sheet names on the page. After studying the coverage of these features (i.e. the number of pages that had these features, and their variability across different types of pages), we settled on ID names and class names from HTML tags as the feature set. Intuitively, this feature set makes sense as pages with the same template would need to have the same HTML elements since they have the same styling.

 

Feature Cleaning

The features extracted had a lot of noise, primarily coming from two cases.

 

  • Dynamic id and class name generation
    • Some merchants appended the product name/id to the base HTML ID/class name tag. For example, there were IDs created like: ‘ImageLinkROS’, ImageLinkBNU, ‘ImageLinkPKV’, ‘ImageLinkMER’, etc . These expanded strings need to be reduced the base string: ImageLink, for effective clustering.

    • To reduce such strings to the base string, we use a simple algorithm based on computing the Longest Common Subsequence.

      • 1. Obtain clusters of ID strings of the same length.

      • 2. Within each cluster iteratively obtain the most frequent Longest Common Subsequence (LCS) L such that L satisfies some length and frequency constraints.

      • 3. Remove all the strings in the cluster that have L as a substring, and proceed back to step 2.

      • 4. Each cluster of strings is reduced in this way to a smaller set of LCS strings L.

 

  • Features that appear in a large/small fraction of pages:

    • Features displaying common facets like “Add to Cart”, header and footer appeared in a large fraction of pages and were detrimental to clustering quality.

    • Features appearing in a very small fraction of pages were hurting clustering quality as well. For example, there were some id strings that appeared in exactly 1 page - these were features that appeared to be a hashcode of the base page.

    • Removal of features that appear in a large/small fraction of pages was achieved with straightforward percentage based thresholding.

 

Feature Vectorization

For each page, a binary vector with dimension equal to the number of unique IDs and classes obtained across all webpages (after feature cleaning) is generated.  0/1 at each location indicates the absence/presence of that feature in the page’s raw HTML.

 

Learn Main Templates

Once clean features are extracted, the next step is to cluster these features to obtain clusters of pages sharing the same underlying template.

 

Clustering

K-means was used for clustering, with K ranging from K = 2 to K = 30. Clustering is repeated a few times and clustering result with the lowest sum of squared error is chosen. Initialization is performed using the K-means++ algorithm.

 

Choosing the number of K

In this unsupervised clustering problem, the number of page templates, or K-means clusters is not known beforehand. To determine the optimal K,  the “Elbow in SSE” curve method is used.

SSE, or the sum of squared error, is computed for every K as below:

 

 

The elbow in the SSE values method is used to determine the optimum value of K. Increasing K will reduce the error, but this will cause the model to overfit.

A sample SSE curve for a clustering result can be seen below, where the optimum value of K is 12.

 

 

We algorithmically determine the elbow using a simple rule that looks at the reduction in SSE for increasing K.
 

Learn Rare Templates

For BloomReach’s use case it is important to detect page clusters that are relatively uncommon, i.e. the template cluster has very few pages in it. Even though these clusters have very few pages, they could account for high traffic. High-level category pages, i.e. category pages which contain listings of other category pages receive substantial traffic but there are very few such pages.

In K-means, outliers are usually merged with main clusters. Consider the example below which illustrates the issue. As shown in Fig 2, cluster centers detected for K=3 all belong to the main cluster. The points in the two outlier clusters are tagged as belonging to one of the three clusters.

Fig 2: K-mean cluster centers for K=3

To achieve clustering of rare templates, we first remove outliers from the main model and then cluster only the outlier points using K-means.

 

Outlier Detection

The algorithm computes the standard deviation of all points from their cluster centers as:

 

 

All points that are greater than 3 away from their cluster center are classified as outliers.

 

Outlier Clustering

The outlier points detected are then clustered using K-means as described earlier. Clustering is performed for K = 1 to 15, using Kmeans++ initialization and the optimal K is chosen using the “elbow in SSE” method.

 

Learning Templates at Scale

BloomReach’s template extraction was tested across hundreds of e-commerce websites. The number of webpages across these e-commerce sites varied from a few thousand to millions. The feature vector length, which is the set of reduced ID and class names in the HTML tags, ranged from a few hundred to a few thousand. The number of main templates was typically in the range of 5 - 20, and the number of rare templates was typically lesser than 10. The number of templates found was in line with the number of templates our QA team identified by eyeballing the site. We also observed that parsing coverage across a template cluster was very consistent - typically 90% or higher, which provided additional validation for our results.

The template extraction pipeline is now deployed to run regularly across our e-commerce merchants’ websites to continuously detect new, updated or deprecated templates as they are changed by the merchant. We found that running the pipeline on a fraction of the pages was sufficient to detect changes. The pipeline is lightweight and the algorithm scales well across different site structures, verticals and a large number of pages.