python pandas n-gram lsh

LSH - Binary matrix representation from shingles


I have a large dataset of news articles, 48000 to be precise. I have generated n-grams for each article with n = 3. My n-grams look like this:

[[(tikro, enters, into), (enter, into, research), (into, research, and),...]] 

Now I need to build a binary matrix with one row per shingle and one column per article:

          article1 article2 article3
shingle1     1        0        0
shingle2     1        0        1
shingle3     0        1        0

First, I collected all the shingles in a single list. Then I tried the following to check whether it works:

for art in article:
    for sh in ngrams:
        if sh in art:
            print('found')

Since one is a set and the other is a string, it does not work. Any suggestions on how to make it work, or on another approach?

Thank you.


Solution

  • Before searching for shingles in the articles, you could use str.join to concatenate the words of each shingle into a three-word phrase.

    For example we have ngrams like:

    ngrams = [('tikro', 'enters', 'into'),
              ('enter', 'into', 'research'),
              ('into', 'research', 'and')]
    

    Then we join the words of each shingle into a single phrase:

    shingles = [' '.join(x) for x in ngrams]
    

    After the transformation, shingles looks something like this:

    ['tikro enters into', 
     'enter into research', 
     'into research and']
    

    which are plain strings that you can search for in the text of your articles.
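
To connect this back to the binary matrix in the question, here is a minimal sketch that tests each joined shingle against each article and collects the results in a pandas DataFrame. The `articles` list is made-up toy data standing in for the real 48000 articles, and the column names `article1`, `article2`, ... are just an assumption about how you want to label them.

```python
import pandas as pd

# Hypothetical toy articles (plain strings), standing in for the real dataset.
articles = [
    "tikro enters into research and development",
    "the company will enter into research",
    "into research and beyond",
]

ngrams = [('tikro', 'enters', 'into'),
          ('enter', 'into', 'research'),
          ('into', 'research', 'and')]

# Join each 3-word tuple into a single phrase, as described above.
shingles = [' '.join(x) for x in ngrams]

# One row per shingle, one column per article:
# 1 if the phrase occurs in the article text, 0 otherwise.
rows = [[int(sh in art) for art in articles] for sh in shingles]

matrix = pd.DataFrame(
    rows,
    index=shingles,
    columns=[f"article{i + 1}" for i in range(len(articles))],
)
print(matrix)
```

Note that `sh in art` is a plain substring test, so it ignores word boundaries (for example, 'enter into research' would also match inside 'center into research') and it is sensitive to case and whitespace. A dense DataFrame like this is fine for a small sample; for all 48000 articles the matrix will probably be too large to hold densely, so a sparse representation may be a better fit.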