pythonlongest-substring

Find longest sequence of common words from list of words in python


I searched a lot for a solution and I indeed have found similar questions. This answer gives back the longest sequence of CHARACTERS that might NOT belong in all of the strings in the input list. This answer gives back the longest common sequences of WORDS that MUST belong to all of the strings in the input list.

I am looking for a combination of the above solutions. That is, I want the longest sequence of common WORDS that might NOT appear in all of the words/phrases of the input list.

Here are some examples of what is expected:

['exterior lighting', 'interior lighting'] --> 'lighting'

['ambient lighting', 'ambient light'] --> 'ambient'

['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] --> 'turn signal lamp'

['ambient lighting', 'infrared light'] --> ''

Thank you


Solution

  • A rather crude solution but I think it works:

    from nltk.util import everygrams
    import pandas as pd
    
    def get_word_sequence(phrases):
    
        ngrams = []
    
        for phrase in phrases:        
            phrase_split = [ token for token in phrase.split()]
            ngrams.append(list(everygrams(phrase_split)))
    
        ngrams = [i for j in ngrams for i in j]  # unpack it    
    
        counts_per_ngram_series = pd.Series(ngrams).value_counts()
    
        counts_per_ngram_df = pd.DataFrame({'ngram':counts_per_ngram_series.index, 'count':counts_per_ngram_series.values})
    
        # discard the pandas Series
        del(counts_per_ngram_series)
    
        # filter out the ngrams that appear only once
        counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1]
    
        if not counts_per_ngram_df.empty:    
            # populate the ngramsize column
            counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()
    
            # sort by ngramsize, ngram_char_length and then by count
            counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace = True, ascending = [False, False])
    
            # get the top ngram
            top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)
    
            return top_ngram
    
        return ''