[SOLVED] LSH - Binary matrix representation from shingles

I have a large dataset of news articles, 48000 to be precise. I have generated n-grams for each article with n = 3. My n-grams look like this:

[[(tikro, enters, into), (enter, into, research), (into, research, and),...]]

Now I need to build a binary matrix with one row per shingle and one column per article:

          article1 article2 article3
shingle1     1        0        0
shingle2     1        0        1
shingle3     0        1        0

First I put all the shingles into a single list. Then I tried the following to check whether it works:

for art in article:
    for sh in ngrams:
        if sh in art:
            print('found')

Since one is a set and the other is a string, it does not work. Any suggestions on how to make it work, or on another approach?

Thank you.

Solution

Before searching for shingles in the articles, you could use join to concatenate the words of each shingle into a three-word phrase.

For example, suppose we have n-grams like:

ngrams = [('tikro', 'enters', 'into'),
          ('enter', 'into', 'research'),
          ('into', 'research', 'and')]

Then we concatenate the words of each shingle into a phrase:

shingles = [' '.join(x) for x in ngrams]

After the transformation, shingles looks something like:

['tikro enters into', 
 'enter into research', 
 'into research and']

These are strings that you can search for in your articles.
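To connect this back to the binary matrix in the question, here is a minimal sketch of the whole step. It assumes each article is a plain string that has been normalised the same way as the n-grams (same casing and spacing). The three sample articles are made up for illustration, and `matrix` is just a name chosen here.

```python
# Hypothetical sample data; the real articles and n-grams come from your own pipeline.
articles = [
    "tikro enters into research and development",
    "the company will enter into research next year",
    "markets closed higher on friday",
]
ngrams = [('tikro', 'enters', 'into'),
          ('enter', 'into', 'research'),
          ('into', 'research', 'and')]

# Join each 3-word tuple into one searchable phrase.
shingles = [' '.join(x) for x in ngrams]

# One row per shingle, one column per article:
# 1 if the phrase occurs in the article text, 0 otherwise.
matrix = [[1 if sh in art else 0 for art in articles] for sh in shingles]

for sh, row in zip(shingles, matrix):
    print(sh, row)
```

Note that `in` on strings is a substring test, so a phrase such as 'enter into research' would also match inside 'center into research'. If that matters, comparing a set of word tuples per article against each n-gram tuple avoids it. With 48000 articles a dense list of lists may also get large, so a sparse representation could be worth considering.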