pythonpandasdataframeshingles

Fastest Way to Shingle from Pandas Column


I need the fastest possible way to shingle strings from a data frame and then create a master list.

Given the following data frame:

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
    strings
0   Hello
1   Helloworld

I'd like to generate a list shingled strings (of length 3) like this: (All possible 3-letter combinations are included.)

[['Hel', 'ell', 'llo'],['Hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']]

... and a master list of all unique values like this:

['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

I can do it as follows, but I suspect there is a much faster way:

#Shingle into strings of exactly 3
def shingle(word):
    r = [word[i:i + 3] for i in range(len(word) - 3 + 1)]
    return [''.join(t) for t in r]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
r=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(r))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq

['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

Thanks in advance!


Solution

  • I'm 4 years late, but here's an answer. I don't think that it's possible to determine the "fastest" way since that's heavily dependent on hardware and algorithms. (It might fall under something similar to Kolmogorov complexity.)

    But, I needed to shingle over 11 million files. I put each of the words in a numpy array and ran the following code.

    shingles = set()
    
    for i in range(words.shape[0] - w + 1):
        a = words[i:i + w]
        shingles.add(tuple(a))
    

    This code processed 27.2 billion words in about 6 hours.