I searched a lot for a solution and I indeed have found similar questions. This answer gives back the longest sequence of CHARACTERS that might NOT belong in all of the strings in the input list. This answer gives back the longest common sequences of WORDS that MUST belong to all of the strings in the input list.
I am looking for a combination of the above solutions. That is, I want the longest sequence of common WORDS that might NOT appear in all of the words/phrases of the input list.
Here are some examples of what is expected:
['exterior lighting', 'interior lighting']
--> 'lighting'
['ambient lighting', 'ambient light']
--> 'ambient'
['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light']
--> 'turn signal lamp'
['ambient lighting', 'infrared light']
--> ''
Thank you
A rather crude solution but I think it works:
from nltk.util import everygrams
import pandas as pd
def get_word_sequence(phrases):
ngrams = []
for phrase in phrases:
phrase_split = [ token for token in phrase.split()]
ngrams.append(list(everygrams(phrase_split)))
ngrams = [i for j in ngrams for i in j] # unpack it
counts_per_ngram_series = pd.Series(ngrams).value_counts()
counts_per_ngram_df = pd.DataFrame({'ngram':counts_per_ngram_series.index, 'count':counts_per_ngram_series.values})
# discard the pandas Series
del(counts_per_ngram_series)
# filter out the ngrams that appear only once
counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1]
if not counts_per_ngram_df.empty:
# populate the ngramsize column
counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()
# sort by ngramsize, ngram_char_length and then by count
counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace = True, ascending = [False, False])
# get the top ngram
top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)
return top_ngram
return ''