Tags: python, pandas, dataframe, nlp, n-gram

Summarizing n-grams efficiently in Python on big data


I have a very large dataset of roughly 6 million records; it looks like this snippet:

import pandas as pd

data = pd.DataFrame({
    'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'TEXT': [
        "Mouthwatering BBQ ribs cheese, and coleslaw.",
        "Delicious pizza with pepperoni and extra cheese.",
        "Spicy Thai curry with cheese and jasmine rice.",
        "Tiramisu dessert topped with cocoa powder.",
        "Sushi rolls with fresh fish and soy sauce.",
        "Freshly baked chocolate chip cookies.",
        "Homemade lasagna with layers of cheese and pasta.",
        "Gourmet burgers with all the toppings and extra cheese.",
        "Crispy fried chicken with mashed potatoes and extra cheese.",
        "Creamy tomato soup with a grilled cheese sandwich."
    ],
    'DATE': [
        '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-01', '2023-02-02',
        '2023-02-02', '2023-02-01', '2023-02-01', '2023-02-02', '2023-02-02'
    ]
})

I want to generate bigrams and trigrams from the column 'TEXT'. For both the bigrams and the trigrams, I'm interested in two groups: those that start with 'extra' and those that don't. Once I have them, I want to summarize the n-grams by unique 'DATE', counting for each n-gram the number of unique IDs it appears in. That means that if an n-gram appears in an ID more than once, I count it only once, because I want to know in how many different IDs it ultimately appeared.
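
To make the aggregation concrete: if I already had one row per ID/DATE/n-gram, I believe the summary I'm after would boil down to something like the snippet below (long_df is just a hypothetical long-format frame, not something I actually have). My real problem is generating and filtering the n-grams efficiently in the first place.

# long_df: hypothetical frame with one row per (ID, DATE, ngram) occurrence;
# counting each ID at most once per n-gram and date is an nunique aggregation
summary = (long_df.groupby(['DATE', 'ngram'])['ID']
                  .nunique()
                  .reset_index(name='nunique'))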

I'm very new to Python. I come from the R world, where there is a library called quanteda that is written in C and uses parallel computing. Searching for those n-grams looks something like this:

corpus_food %>%
  tokens(remove_punct = TRUE) %>% 
  tokens_ngrams(n = 2) %>% 
  tokens_select(pattern = "^extra", valuetype = "regex") %>%
  dfm() %>%
  dfm_group(groups = lubridate::date(DATE)) %>%
  textstat_frequency()

yielding my desired results:

       feature frequency rank docfreq group
1 extra_cheese         3    1       2   all

My desired result would look like this:

ngram                   nunique  group
cheese and              3        1/02/2023
and extra               2        1/02/2023
extra cheese            2        1/02/2023
and extra cheese        2        1/02/2023
mouthwatering bbq       1        1/02/2023
bbq ribs                1        1/02/2023
ribs cheese             1        1/02/2023
and coleslaw            1        1/02/2023
mouthwatering bbq ribs  1        1/02/2023
bbq ribs cheese         1        1/02/2023
ribs cheese and         1        1/02/2023
cheese and coleslaw     1        1/02/2023
delicious pizza         1        1/02/2023
pizza with              1        1/02/2023
with pepperoni          1        1/02/2023
pepperoni and           1        1/02/2023
delicious pizza with    1        1/02/2023
pizza with pepperoni    1        1/02/2023
with pepperoni and      1        1/02/2023
pepperoni and extra     1        1/02/2023
spicy thai              1        1/02/2023
thai curry              1        1/02/2023

I am in no way comparing the two languages; Python and R are both amazing. At the moment I'm simply interested in a straightforward and fast way to achieve my results in Python, and since I'm new to the language I'm open to any approach that is faster or more efficient.

So far I have found a way to create the bigrams and trigrams, but I have no idea how to select the ones that start with "extra" and the ones that don't. On top of that, just creating the n-grams is taking over an hour, so I'll take any advice on how to reduce that time.

Workaround:

import nltk
from nltk.util import bigrams, trigrams
from nltk.tokenize import word_tokenize

# word_tokenize needs the 'punkt' tokenizer data: nltk.download('punkt')
data['bigrams'] = data['TEXT'].apply(lambda x: list(bigrams(word_tokenize(x))))
data['trigrams'] = data['TEXT'].apply(lambda x: list(trigrams(word_tokenize(x))))

Reading through some posts, some people suggest using the gensim library. Would that be a good direction?


Solution

  • It is easy to find n-grams with sklearn's CountVectorizer using the ngram_range argument.

    You can create a document-term matrix with n-grams of size 2 and 3 only, append it to your original dataset, and then do the pivoting and aggregation with pandas to find what you need.

    First we'll get the document-term matrix and append it to our original data:

    # Perform the count vectorization, keeping bigrams and trigrams only
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(ngram_range=(2,3))
    X = cv.fit_transform(data['TEXT'])
    
    # Create dataframe of document-term matrix
    cv_df = pd.DataFrame(X.todense(), columns=cv.get_feature_names_out())
    
    # Append to original data
    df = pd.concat([data, cv_df], axis=1)
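
    If memory is already a concern at this step, the matrix can be kept sparse instead of calling .todense() (a sketch, assuming a pandas version that has the sparse accessor; note that the later groupby/stack steps may densify it again):

    # Optional: build a sparse-backed dataframe instead of a dense one
    cv_df = pd.DataFrame.sparse.from_spmatrix(X, columns=cv.get_feature_names_out())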
    

    Then we group by ID and DATE, stack the n-gram columns into long format, and keep the rows where the count is greater than 0; that gives the ID-DATE combinations in which each 2- or 3-gram appears, and from there we count the unique IDs for each:

    # Group and pivot (drop the raw TEXT column so only the n-gram counts are summed)
    pivoted_df = df.drop(columns='TEXT').groupby(['ID','DATE']).sum().stack().reset_index()
    pivoted_df.columns = ['ID', 'DATE', 'ngram', 'count']

    # Keep n-grams which appear for each ID-DATE combo, then count unique IDs per DATE
    pivoted_df = pivoted_df[pivoted_df['count']>0]
    pivoted_df.groupby(['DATE','ngram'])['ID'].nunique().reset_index(name='nunique')
    

    Finally, we can create additional columns for the n-gram size and for whether or not the n-gram starts with "extra", and use them for filtering:

    # Add additional columns for ngram size 
    pivoted_df['ngram_size'] = pivoted_df['ngram'].str.split().str.len()
    
    # Add additional column for starting with extra
    pivoted_df['extra'] = pivoted_df['ngram'].str.startswith('extra')
    
    # Find all the 2-grams that start with "extra"
    pivoted_df[(pivoted_df['extra']) & (pivoted_df['ngram_size']==2)]
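
    Putting the pieces together, the per-date unique-ID summary from the question would then look roughly like this for the two groups of n-grams:

    # Unique-ID counts per DATE and n-gram, for n-grams starting with "extra"
    extra_summary = (pivoted_df[pivoted_df['extra']]
                     .groupby(['DATE', 'ngram'])['ID']
                     .nunique()
                     .reset_index(name='nunique'))

    # And the same for n-grams that do not start with "extra"
    other_summary = (pivoted_df[~pivoted_df['extra']]
                     .groupby(['DATE', 'ngram'])['ID']
                     .nunique()
                     .reset_index(name='nunique'))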
    

    That being said, with 6M records you have a large dataset, and with this approach you will almost certainly run into memory issues. You will probably want to filter your data down to what you are most interested in to start with, and make sure you use the min_df parameter of CountVectorizer so that the vocabulary (and therefore the document-term matrix) stays tractable.
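
    As a rough sketch of a more memory-friendly variant (same data frame; this assumes each row is a distinct ID, as in your example, and min_df=2 is only an illustrative threshold that you would raise substantially for 6M records), you can keep the matrix sparse, ask CountVectorizer for binary presence per document, and sum it per DATE group instead of stacking everything into a long frame:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    # binary=True records presence (0/1) per document; min_df prunes rare n-grams
    cv = CountVectorizer(ngram_range=(2, 3), binary=True, min_df=2)
    X = cv.fit_transform(data['TEXT'])        # stays a scipy sparse matrix
    vocab = cv.get_feature_names_out()

    frames = []
    for date, idx in data.groupby('DATE').indices.items():
        # Summing the binary rows belonging to one DATE gives, for each n-gram,
        # the number of documents (here: unique IDs) it appeared in on that date
        counts = X[idx].sum(axis=0).A1
        frames.append(pd.DataFrame({'DATE': date, 'ngram': vocab, 'nunique': counts}))

    summary = pd.concat(frames, ignore_index=True)
    summary = summary[summary['nunique'] > 0]
    summary['extra'] = summary['ngram'].str.startswith('extra')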