cascading-simhash a library to cluster by minhashes in Hadoop

By Nate Murray | Published: May 9, 2011

simhashing

Say you have a large corpus of web documents and you want to group them together by some notion of “similarity”. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.

In this scenario, it’s impractical to do a pairwise comparison of all documents. Fortunately, we can use simhashing.

Broadly speaking, simhashing is a algorithm that calculates a “cluster id” (the minimum hash, or minhash) from the content. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.

上述资讯来自网友投稿，如有侵犯了您的权益，请来信告诉我们：liujun100@vip.qq.com

cascadingsimhash a library to cluster by minhashes in Hadoop

cascading-simhash a library to cluster by minhashes in Hadoop

simhashing