How to use machine learning to automate near-duplicate content detection?

During an SEO on-site audit, whenever you suspect that different URLs display the same or similar content, you should investigate immediately. Duplicate content can hurt search engine rankings and lead to traffic loss. Most often, duplication is an effect of URL variations, but it can also happen when similar content is produced by mistake and published under a different URL. Whatever the cause, we explain how duplicate content detection can be automated.

A solution to the problem of duplicated content depends mostly on two factors:

  1. text length,
  2. our definition of duplicates.

Two documents can be marked as duplicates if they are identical, i.e. they contain the same phrases in the same order. This situation is straightforward to detect and does not require any sophisticated tools: a simple checksum comparison is enough.
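As a minimal sketch of the checksum idea (the function name and the normalization step are our own choices, not part of any particular tool): identical texts map to the same fingerprint, so duplicates can be found by grouping fingerprints instead of comparing every pair of documents.

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Checksum of the normalized text: case folded, whitespace collapsed."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Exact duplicates (up to case and spacing) share a fingerprint.
a = content_fingerprint("Effective   weight loss methods")
b = content_fingerprint("effective weight loss methods")
c = content_fingerprint("Best weight reduction strategies")
```

Note that even a one-character difference produces a completely different checksum, which is exactly why this approach cannot handle near-duplicates.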

How to deal with near-duplicate content?

In this article we would like to cover a different and more common scenario, which occurs when two documents use different language constructs but cover exactly the same topic, such as Effective weight loss methods and Best weight reduction strategies. The complexity of this problem largely depends on the length of the documents. Typically, long documents contain enough information that semantics and context do not play an important role. In this case, we can reliably detect duplicates by relying on two popular representations of text.

Why is having near-duplicate content on your website an SEO issue?

A search engine's aim is to present varied results in answer to the user's query. This means that in the vast majority of cases, for popular queries, our site may appear on the first page of search results only once. When we have many pages covering the same topic in very similar ways, the ranking signal for a given query is spread between them, so that no single page has a chance to rank high enough to generate traffic.

Moreover, extensive content duplication harms the overall rating of the domain. As a rule of thumb, keep your site tidy and free of duplicate content.

Method #1: Shingles

This approach treats the document as a set of character- or word-level shingles. Shingles are all possible fixed-size, contiguous substrings of a string.

Example: the word marketing is decomposed into 4-character shingles in the following way:

  • mark,
  • arke,
  • rket,
  • keti,
  • etin,
  • ting.

The same idea applies with words as the atomic components. Once the text is mapped into a set of shingles, similarity is measured with the Jaccard index, which takes values between 0 (totally dissimilar documents) and 1 (perfect duplicates).
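A minimal sketch of character shingling and the Jaccard index in Python (the helper names are our own):

```python
def shingles(text: str, k: int = 4) -> set:
    """All contiguous k-character substrings of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 0 for disjoint sets, 1 for identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(sorted(shingles("marketing")))
# ['arke', 'etin', 'keti', 'mark', 'rket', 'ting']
```

Two near-duplicate texts share most of their shingles, so their Jaccard index is close to 1 even when a few words differ.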

The higher the overlap between two sets (the more elements they have in common), the higher the Jaccard index. The technique is straightforward to implement, but an exhaustive comparison of all documents in a collection becomes a time- and resource-consuming process beyond a few thousand items. For this reason, an efficient hashing technique called MinHash was developed to speed up the computation at the cost of slightly lower precision [Broder, Andrei Z. (1997), “On the resemblance and containment of documents”].
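The core of MinHash can be sketched in a few lines (this is a simplified illustration, not the production variant; the hash family built from seeded SHA-1 digests is our own choice): each set is compressed into a short signature of per-function minima, and the fraction of matching signature positions is an unbiased estimate of the Jaccard index.

```python
import hashlib

NUM_HASHES = 64  # signature length; more hashes -> more accurate estimate

def _h(seed: int, item: str) -> int:
    """Deterministic hash family: one hash function per seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items: set) -> list:
    """For each hash function, keep the minimum hash value over the set."""
    return [min(_h(seed, x) for x in items) for seed in range(NUM_HASHES)]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions approximates the Jaccard index."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The key saving is that signatures are short and can be bucketed (e.g. with locality-sensitive hashing), so candidate pairs are found without comparing every document against every other.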

Case Study #1

We applied the method to automatically detect near-duplicates on e-commerce sites. Our client was a large company selling products all over the world, with over a dozen local domains (with national TLDs such as .nz, .au, etc.). In order not to mislead Google’s bot with a multiplication of very similar pages in the same language but for different countries, we needed hreflang annotations for each location. Product descriptions differed slightly between countries, as local marketing teams could make changes to the general texts.

What is important in this case is that the descriptions were rather short, which is why we decided to use shingle coverage rather than the statistical techniques described below.

We divided the texts into five-character shingles (including spaces). In this way, some shingles contained fragments of two consecutive words, which also reduced the impact of inflectional variation. As described above, each page was represented by a set of shingles and the Jaccard index measured the similarity between texts. Using the MinHash technique, we could do this in a reasonable amount of time. Comparing thousands of URLs from a dozen or so domains without hashing would mean over a billion pairwise operations, whereas MinHash reduces the work to roughly linear time.

As a result, we could automatically assign hreflang annotations to each address. The specificity of hreflang is that each country-specific URL should be annotated with pointers to all of its equivalents. Had we done it manually, we would have needed over 1.5 billion comparisons (a dozen or so services, with a dozen national domains for each of them and thousands of URLs in each domain).

Of course, some of these comparisons would be obvious to a human (such as matching a product page against the home page), but the overall amount of work would still be enormous.

Method #2: TF-IDF

In the second approach, a document is represented as a high-dimensional vector of word features with appropriate weights. In most cases, term frequency (TF) combined with inverse document frequency (IDF) is used as the weighting scheme. Words that appear often in a given document but are rare in the whole collection receive large weights. The similarity between documents then translates into the cosine distance between vectors, which takes values between 0 (perfect duplicates) and 1 (totally dissimilar documents). As in the shingling approach, an efficient hashing technique called SimHash [Charikar, Moses S. (2002), “Similarity estimation techniques from rounding algorithms”] was developed to quickly detect similar vectors and thus mark possible duplicates.
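A self-contained sketch of the TF-IDF weighting and cosine comparison (simple whitespace tokenization and a basic IDF variant, chosen here for brevity; real systems use more careful preprocessing):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (count / len(tokens)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; distance = 1 - similarity."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Documents sharing distinctive terms end up with a high cosine similarity (low cosine distance), while documents with disjoint vocabularies score near zero.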

Case Study #2

We used the TF-IDF representation to mark possible duplicates in a collection of 10,255 Polish documents of 500-1000 words each.

The method allowed us to narrow down the search from about 100 million possible pairs to around a thousand.

The system works by converting each text into a vector of more than 40,000 entries, one for each word appearing anywhere in the collection.

Then the text vectors are exhaustively compared with each other; pairs that are sufficiently close (under a threshold on the cosine distance) are marked as possible duplicates and additionally confirmed by a human reviewer. We decided to use an exhaustive search to be sure that all documents were checked. The SimHash technique, although very efficient computationally, performs only an approximate search and does not guarantee complete identification.

Method #3: Word2Vec

The methods described above work well when the content of the document is not too short. Their main drawbacks are the large memory costs associated with constructing document vectors (hundreds of thousands of entries) and the lack of semantic and contextual similarity between words. Terms such as intelligent, smart, bright, and brilliant are all synonyms and carry the same meaning, yet in the TF-IDF scheme they are treated as separate features, which may result in more documents falsely classified as not being duplicates.

A more human-like vector representation of text can be built with the help of machine learning. Words do not occur alone but always appear in some context. Given some center word, the probability of predicting the surrounding words is not uniform, but specific to the language and context.

For example, in the sentence “Quick solution for dealing with the paper shortage”, given the word “dealing” there is a high chance that the next word will be “with” and the preceding word “for”. The probability of generating more distant words like “solution” or “paper” will certainly be lower, and that of unrelated words like “duck” or “goat” lower still.

If we treat each word as a fixed-size vector, we can optimize its entries in such a way that natural co-occurrences and preferences are preserved. Words that appear in the same contexts end up close together in the vector space. This is the basis of the Word2Vec technique [Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013), “Distributed representations of words and phrases and their compositionality”, Advances in Neural Information Processing Systems], which is typically trained on massive text corpora like Wikipedia, Google News, or Common Crawl. As a result, for the word president, nearby words in the vector space are chairman, CEO, executive, and prez, which matches how humans would label similar words.

The Word2Vec representation is fixed-size and captures the semantic similarity between words. The same basic idea can be extended to a dense representation of sentences and, finally, whole documents. Representation on the word level may already be enough to capture the similarity between short pieces of text like titles, forum threads, and questions, where the amount of information embedded in the text is very limited.
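A common simple baseline for comparing short texts is to average the word vectors of each text and compare the averages by cosine similarity. The sketch below uses tiny hand-made 3-dimensional “embeddings” purely for illustration; in practice the vectors would come from a Word2Vec model trained on a large corpus.

```python
import math

# Toy embeddings, invented for this example only. Real Word2Vec vectors
# have hundreds of dimensions and are learned, not hand-written.
EMBEDDINGS = {
    "effective":  [0.9, 0.1, 0.0],
    "best":       [0.8, 0.2, 0.1],
    "weight":     [0.1, 0.9, 0.1],
    "loss":       [0.2, 0.8, 0.3],
    "reduction":  [0.3, 0.7, 0.2],
    "methods":    [0.1, 0.2, 0.9],
    "strategies": [0.2, 0.1, 0.8],
    "duck":       [-0.8, 0.1, -0.5],  # unrelated word, different direction
}

def doc_vector(text: str) -> list:
    """Average the embeddings of all known words in the text."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

sim = cosine(doc_vector("Effective weight loss methods"),
             doc_vector("Best weight reduction strategies"))
```

Because synonyms like loss and reduction point in similar directions, the two titles come out as highly similar even though they share only one word, which is exactly what the TF-IDF scheme misses.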


The good news is that you can avoid or fix the majority of duplicate content issues. You need to learn how duplicate content affects your website’s position in search engine rankings and then take action to detect and resolve it. As a result, you reinforce your position in the online world. As Google’s algorithm gets smarter and more sophisticated, it is crucial to align your SEO strategy with an equally advanced approach that makes use of machine learning and data science.