Identifying Duplicate Pages on a Website (Part 2)
In our previous post, we described the problem of duplicate pages for large websites. Many versions of the same page can end up in a search engine index for a number of reasons, including extraneous URL parameters and multiple navigation paths, along with other undesirable pages, such as out-of-stock products. This leads to low search rankings for these pages and ultimately lost sales.
BloomReach helps its customers manage potential duplicate content by creating clusters of equivalent pages and nominating a “canonical” page to represent them all, directing all BloomReach links for that content to the canonical page. These additional links to the canonical page provide a better user experience while improving its page rank and relevance score.
Additionally, BloomReach makes the identity of these canonical pages available to its customers, giving them the option to automatically forward all pages in the cluster to the canonical page. This forwarding effectively eliminates all duplicate pages from the site, ensuring that the canonical page gets the most traffic, is most likely to be indexed, is most relevant to search queries, and is most likely to lead to a conversion when visited by a user.
At BloomReach, we use a variety of algorithms to build clusters of mutually equivalent web pages. Since most of our customers are e-commerce sites, we employ strategies for two specific page types: product pages, each containing all the information on a single product, and category pages, which list products and link to the product pages. We frame the problem as a general forwarding capability from less-desirable pages to more-desirable pages. Each algorithm runs individually over a customer’s site, building its own clusters. Finally, we merge these clusters to produce the final equivalence classes and canonical pages. Figure 1 depicts our process.
The challenges in this process include:
- Scale – clustering content across more than 100M pages is hard. Standard clustering algorithms do not work at this scale.
- Merging – combining clusters from different algorithms requires associative merging, which we have implemented using the Hadoop framework. While it scales, the work grows linearly with the number of associative steps required; we describe this in more detail later.
- Noise – the data is noisy and at times subjective. A site might be undergoing maintenance, during which every page redirects to the home page; that does not mean we should group every page together.
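The associative merging mentioned above can be sketched in memory with a union-find structure; the production version runs as iterative Hadoop jobs, and the class and function names here are illustrative:

```python
# In-memory sketch of associative cluster merging via union-find.
# The production version described in the post runs as iterative
# Hadoop jobs, with work growing in the number of associative steps.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_clusters(cluster_lists):
    """Merge overlapping clusters from several algorithms into
    final equivalence classes."""
    uf = UnionFind()
    for cluster in cluster_lists:
        first = cluster[0]
        for page in cluster[1:]:
            uf.union(first, page)
    groups = {}
    for page in uf.parent:
        groups.setdefault(uf.find(page), set()).add(page)
    return list(groups.values())

# Two algorithms produce overlapping clusters {a,b} and {b,c},
# plus an independent cluster {d,e}; merging yields {a,b,c} and {d,e}.
merged = merge_clusters([["a", "b"], ["b", "c"], ["d", "e"]])
```

Because union is associative, partial merges can be computed independently on different machines and combined later, which is what makes the Hadoop formulation possible.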
For example, one algorithm for clustering operates on the content of product pages, identifying redundant products. It follows a four-step process: first, it filters and parses the raw data to isolate relevant content. Next, it extracts features from each product, defined as weighted n-grams present on the page. From these n-grams, the algorithm employs MinHash to compute a cluster id as a list of hash codes. Finally, all products with the same cluster id are treated as equivalent.
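A minimal sketch of the MinHash step, assuming unweighted word n-grams as features (the production algorithm weights them) and an illustrative four-hash signature:

```python
# Illustrative MinHash over page n-grams. Pages whose signatures match
# on every hash are treated as near-duplicates. The feature weighting
# and signature length used in production are assumptions here.
import hashlib

def ngrams(text, n=3):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(features, num_hashes=4):
    """Compute a MinHash cluster id as a tuple of per-seed hash minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16)
            for feat in features
        ))
    return tuple(signature)

page_a = "red cotton t shirt crew neck short sleeve"
page_b = "red cotton t shirt crew neck short sleeve"
sig_a = minhash_signature(ngrams(page_a))
sig_b = minhash_signature(ngrams(page_b))
# Identical content yields identical signatures, i.e. the same cluster id.
```

Because the signature is a fixed-size key, grouping pages by cluster id reduces to a simple group-by, which scales far better than pairwise comparison of 100M pages.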
A second algorithm operates on the contents of category pages, clustering pages based on the products listed. This algorithm consists of three steps, starting with filtering and parsing to list the products on the page. All pages with a similar set of products constitute a cluster, and the canonical page is the one with all the products. This is an expensive calculation but easily parallelizable.
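A simplified, in-memory sketch of this idea, using Jaccard similarity over product sets; the 0.8 threshold and the greedy assignment are illustrative assumptions, not BloomReach's actual method:

```python
# Sketch of category-page clustering by product overlap. Pages whose
# product sets are sufficiently similar fall into one cluster, and the
# page listing the most products is nominated as canonical.
# The 0.8 threshold and greedy assignment are illustrative only.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_category_pages(pages, threshold=0.8):
    """pages: dict of url -> set of product ids.
    Returns a list of (canonical_url, cluster_urls) pairs."""
    clusters = []
    for url, products in pages.items():
        for cluster in clusters:
            representative = pages[cluster[0]]
            if jaccard(products, representative) >= threshold:
                cluster.append(url)
                break
        else:
            clusters.append([url])
    result = []
    for cluster in clusters:
        canonical = max(cluster, key=lambda u: len(pages[u]))
        result.append((canonical, cluster))
    return result

pages = {
    "/shoes?page=all": {"p1", "p2", "p3", "p4"},
    "/shoes?sort=price": {"p1", "p2", "p3", "p4"},
    "/hats": {"p9"},
}
clusters = cluster_category_pages(pages)
```

The pairwise set comparisons are what make this calculation expensive, but each comparison is independent, which is why it parallelizes easily.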
A final example is redirect clustering. Every page that is redirected via HTTP 301 or 302 is linked to its final page. There can be a sequence of redirects, forming a chain or tree. A cluster consists of all connected pages, and the canonical page is the final destination they all reach.
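Resolving such chains amounts to following each redirect to its final destination; a small sketch with illustrative names and a guard against redirect loops:

```python
# Sketch of redirect clustering: follow each page's 301/302 target to
# its final destination, which becomes the cluster's canonical page.
# The loop guard handles malformed redirect cycles in crawled data.

def resolve_redirects(redirects):
    """redirects: dict of url -> redirect target (from 301/302 responses).
    Returns dict of url -> final destination."""
    canonical = {}
    for start in redirects:
        seen = {start}
        url = start
        while url in redirects:
            url = redirects[url]
            if url in seen:  # redirect loop; stop at the last page reached
                break
            seen.add(url)
        canonical[start] = url
    return canonical

# /old -> /temp -> /new forms a chain; /legacy joins the same cluster,
# so all three pages share /new as their canonical destination.
redirects = {"/old": "/temp", "/temp": "/new", "/legacy": "/new"}
canonical = resolve_redirects(redirects)
```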
Cluster merging is primarily concerned with resolving conflicts. If two pages fall in the same cluster under one algorithm but not under another, the merge uses proprietary algorithms to estimate the strength of each algorithm and the likelihood that the clustering or non-clustering is due to noise. The merging takes place in a series of Hadoop steps.
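Since the actual conflict-resolution logic is proprietary, the following is only a generic weighted-vote illustration of the idea: each algorithm's opinion is weighted by its estimated strength, and two pages merge only if the weighted evidence clears a threshold:

```python
# Generic weighted-vote illustration of conflict resolution. The
# algorithm names, weights, and 0.5 threshold are all hypothetical;
# the proprietary method is not shown here.

def should_merge(votes, weights, threshold=0.5):
    """votes: dict of algorithm -> True/False/None (None = no opinion).
    weights: dict of algorithm -> estimated strength in [0, 1].
    Returns True if the weighted evidence favors merging."""
    score = total = 0.0
    for algo, vote in votes.items():
        if vote is None:
            continue  # this algorithm has no opinion on the pair
        weight = weights[algo]
        total += weight
        if vote:
            score += weight
    return total > 0 and score / total >= threshold

# The content algorithm says "same", the redirect algorithm says
# "different", and the category algorithm abstains.
votes = {"content": True, "category": None, "redirect": False}
weights = {"content": 0.9, "category": 0.6, "redirect": 0.7}
decision = should_merge(votes, weights)
```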
The final result of our extensive website analysis is a subset of pages that we judge to be non-redundant and of the highest quality. We target our products to these pages, and we provide services that let our customers prune out the unnecessary pages, increasing the signal-to-noise ratio of the rest of the site and raising the relevance score of the best pages.