SIM-HASH: Detection of Duplicate Texts

saurav omar
2 min readJul 6, 2019

--

The problem of finding duplicates in a document is a pretty complex task,

It is used in different fields like plagiarism, search engines likewise they were used in many places. There are many ways to do solve these problems.some of them are:

  • Minhash,
  • SimHash
  • Fuzzy Search
  • Latent Semantic Indexing
  • Standard Boolean Model

and many more ways we can solve this problem but in this article, we are going to discuss SIM-HASH.

What is SIM-HASH?

SimHash is a hashing function and its property is that the more similar the text inputs are, the smaller the Hamming distance of their hashes.

  • The algorithm works by splitting the text into chunks and hashing each chunk with a hashing function of your choice.
  • Each hashed chunk is represented as a binary vector and the bit values are transformed into +1 or -1 depending on whether the bit value is 1 or 0.

Properties of SIM HASH:

  • The fingerprint of a document is a hash of its features.
  • Similar documents has similar hash values.

Let’s see how it works?

  • First string or text split into chunks and remove stop words(lemming, stemming)
 TEXT: How can I be an engineer, what should I do to be an engineer.How:1, can:1, I:2 be:2, an:2, engineer:2, what:1, should:1, do:1, to:1
  • Convert words to 8-bit hash values.
How: 00001010, can:11001100, I:00111100 be:10101010, an:10101011, engineer:10101011 ......
  • Convert 0 bit to -1 and multiply by a frequency which is calculated above and sum up with column-wise.
For first column How: -1, can: 1,I: -1, be: 1 an:1 engineer: 1which is equal to 
-1+1+(-1*2) + 1*2 + 1*2 + 1*2 + 1*2 = 6
Similarily we can do for other columns
  • If the values are greater than 1 for column then take as 1 otherwise take 0.
let sum after the calcuation 
6411-17-1-2
which is equivalent to 11110100O because of negative
1 because of positive

Now we have the 8 bit vectors of the one document similarly we can calculate for other documents differentiate using hamming distance.

That’s It.

Happy learning.

--

--

saurav omar
saurav omar

Written by saurav omar

Geek and Always ready to give more than 100%

Responses (1)