I need the fastest possible way to shingle strings from a data frame and then create a master list.
Given the following data frame:
import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
      strings
0       Hello
1  Helloworld
I'd like to generate a list of shingled strings (of length 3), like this (every possible 3-letter substring is included):
[['Hel', 'ell', 'llo'],['Hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']]
... and a master list of all unique values like this:
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']
I can do it as follows, but I suspect there is a much faster way:
#Shingle into strings of exactly 3
def shingle(word):
    # Slices of a string are already strings, so no join is needed
    return [word[i:i + 3] for i in range(len(word) - 3 + 1)]
#Shingle each string (e.g. "Hello" -> "Hel", "ell", "llo")
r = [shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq = list(itertools.chain.from_iterable(r))
#Remove duplicates:
colsunq = list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']
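For comparison, the same pipeline can be written without itertools by flattening and deduplicating in a single set comprehension. This is only a sketch of an alternative, not a benchmarked claim that it is faster:

```python
import pandas as pd

def shingle(word, k=3):
    # All k-character substrings of word
    return [word[i:i + k] for i in range(len(word) - k + 1)]

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})
r = [shingle(w) for w in f['strings']]

# Flatten the nested list and dedupe in one pass
colsunq = {s for row in r for s in row}
```

Note that a set is unordered, so convert with `list(colsunq)` if you need list semantics; the element order may differ from run to run.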
Thanks in advance!
I'm 4 years late, but here's an answer. I don't think that it's possible to determine the "fastest" way since that's heavily dependent on hardware and algorithms. (It might fall under something similar to Kolmogorov complexity.)
But I needed to shingle over 11 million files. I put all of the words into a NumPy array (`words` below) and ran the following code, with `w` as the shingle width.
shingles = set()
for i in range(words.shape[0] - w + 1):
    a = words[i:i + w]
    shingles.add(tuple(a))
This code processed 27.2 billion words in about 6 hours.
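A self-contained toy version of that loop, for anyone who wants to run it: the array contents and `w = 3` here are made up for illustration (the original ran over an 11-million-file corpus):

```python
import numpy as np

# Words from the corpus, in order; repeats produce duplicate shingles
words = np.array(['the', 'quick', 'brown', 'fox', 'the', 'quick', 'brown'])
w = 3  # shingle width in words

shingles = set()
for i in range(words.shape[0] - w + 1):
    a = words[i:i + w]          # sliding window of w consecutive words
    shingles.add(tuple(a))      # tuples are hashable, so the set dedupes
```

The key detail is converting each window to a `tuple` before adding it: NumPy slices are not hashable, but tuples are, so the set silently drops repeated windows.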