I have a large PostgreSQL database containing documents. Every document is represented as a row in a table. When a new document is added to the database, I need to check for duplicates. But I can't just use a select
to find an exact match: two documents can vary slightly and still be considered duplicates, for example if some minor fields differ while all other fields are equal.
I researched this problem and found a method to solve it: it is possible to calculate a MinHash
signature for every document and construct an inverted index to query similar documents from the database. But I can't understand how to map MinHash
onto a relational database.
As I understand it, a MinHash
signature is a list of N hashes, where N is the number of attributes. Similarity is calculated as follows:
# Given 2 signatures Sa and Sb, each containing N hashes,
# count the number of positions where the hashes are equal.
number_of_equal_hashes = 0
for ix in range(N):
    if Sa[ix] == Sb[ix]:
        number_of_equal_hashes += 1
similarity = float(number_of_equal_hashes) / N
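The same logic, wrapped into a self-contained function with example signatures (the hash values are made-up illustration data):

```python
def minhash_similarity(sa, sb):
    """Fraction of positions at which two equal-length signatures agree."""
    assert len(sa) == len(sb)
    number_of_equal_hashes = sum(1 for a, b in zip(sa, sb) if a == b)
    return float(number_of_equal_hashes) / len(sa)

# Signatures agree in 3 of 4 positions.
print(minhash_similarity([17, 42, 7, 99], [17, 42, 8, 99]))  # 0.75
```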
This is simple if you already have two signatures; the problem is to find all documents (with corresponding signatures) in the database with similarity greater than or equal to some value.
For example, I can create a table with multiple columns like this:
| minhash0 | minhash1 | minhash2 | docid |
Each minhashX
column corresponds to the MinHash of one of the document's attributes, and docid
is the document's identifier.
I can query similar records this way:
select * from invidx
where ((case when minhash0 = minhash2search0 then 1 else 0 end) +
       (case when minhash1 = minhash2search1 then 1 else 0 end) +
       (case when minhash2 = minhash2search2 then 1 else 0 end))::float / N > THRESHOLD
where the minhash2searchX
values are the MinHashes of the new document and THRESHOLD is the minimal similarity (the cast to float is needed, otherwise integer division truncates the ratio). But this approach requires a full table scan. Is there any method to speed up this algorithm?
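The query above is, in effect, a linear pass over every row. A minimal in-memory sketch of that full scan (table rows, threshold, and signature values are made-up illustration data):

```python
# Each row: (minhash0, minhash1, minhash2, docid) -- hypothetical sample data.
invidx = [
    (11, 22, 33, "doc-a"),
    (11, 22, 99, "doc-b"),
    (55, 66, 77, "doc-c"),
]
N = 3
THRESHOLD = 0.5
query = (11, 22, 33)  # signature of the new document

matches = []
for *signature, docid in invidx:  # full scan: every row is examined
    equal = sum(1 for q, s in zip(query, signature) if q == s)
    if float(equal) / N > THRESHOLD:
        matches.append(docid)

print(matches)  # ['doc-a', 'doc-b']
```

The cost grows with the number of rows regardless of how selective the threshold is, which is exactly why an inverted index is attractive.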
To use the advantages of an inverted index, I'd suggest a full-text search engine for your purposes, e.g. Lucene or Solr (which is based on Lucene).
You can construct a "document" (in Lucene terms) whose fields hold the MinHashes of your documents (db records). Note that you can index numeric fields as well (you just need to describe the field types in the schema). Also, store the primary key of each document, to map Lucene "documents" back onto records from your db.
Index your entire collection of documents.
To find documents similar to a given document, calculate the MinHash for each field and query Lucene for similar documents:
field1:MinHash1 OR field2:MinHash2 OR ...
The more fields a document matches, the higher its rank will be. So you can take a few documents with the highest rank and decide whether they are really similar in your case.
Also, boosting of fields may be useful for you.
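The ranking behaviour described above can be sketched with a toy in-memory inverted index (sample signatures are made-up illustration data; Lucene implements this far more efficiently, but the principle is the same):

```python
from collections import defaultdict

# Hypothetical sample signatures: docid -> list of per-field MinHashes.
docs = {
    "doc-a": [11, 22, 33],
    "doc-b": [11, 22, 99],
    "doc-c": [55, 66, 77],
}

# Inverted index: (field position, hash value) -> set of docids.
index = defaultdict(set)
for docid, signature in docs.items():
    for field, h in enumerate(signature):
        index[(field, h)].add(docid)

def rank(query_signature):
    """Score candidates by the number of matching fields,
    mimicking the OR-query field0:h0 OR field1:h1 OR ..."""
    scores = defaultdict(int)
    for field, h in enumerate(query_signature):
        for docid in index.get((field, h), ()):
            scores[docid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank([11, 22, 33]))  # [('doc-a', 3), ('doc-b', 2)]
```

Note that only documents sharing at least one (field, hash) pair with the query are ever touched, so non-candidates like doc-c are skipped entirely instead of being scanned.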