python, pandas, dataframe, performance, vectorization

Improve pandas and vectorization performance on a large dataset


CONTEXT

I have a large dataset (100-250 MB) in a CSV file and need to assign groupings to a population of people. The groupings are based on a dynamic ruleset defined in another CSV file. For ease of reproduction, I've included sample data and sample 'rulesets' / query strings below.

DATA

 # Data looks like this:
 ID    Gender    Age     Country
 1     Male      60      USA
 2     Female    25      UK
 3     Male      30      Australia    

CURRENT CODE

import pandas as pd
import numpy as np

query1 = '(Gender in ["Male","Female"]) & (Country=="USA")'
query2 = '(Country in ["USA", "UK"]) & (Gender=="Male")'
query3 = '(Age > 40) & (Gender=="Male")'

query_list = [query1, query2, query3]
query_names = ['USA', 'MALE_USA_UK', 'MALE_OVER_40']

def assign_name(row, id_list, name, column_list):
    person_id = row['ID']
    # Keep an existing 'Yes' if this column was already assigned
    if name in column_list:
        if row[name] == 'Yes':
            return 'Yes'
    if person_id in id_list:
        return 'Yes'
    return 'No'

# Create a dataframe with random data
data = {
    'ID': range(1, 101),
    'Gender': ['Male', 'Female'] * 50,
    'Age': np.random.randint(18, 70, size=100),
    'Country': ['USA', 'Canada', 'UK', 'Australia'] * 25
}

df = pd.DataFrame(data)
tmp = df.copy()

for query in query_list:
    name = query_names[query_list.index(query)]
    out = tmp.query(query)
        
    # Create a list of people that were derived in out.  These are 'yes'
    person_list = out['ID'].to_list()
    column_list = out.columns.to_list()

    # Give them a 'Yes' or 'No' based on them being in the 'out' df
    df[name] = df.apply(
            lambda row: assign_name(row, person_list, name, column_list), axis=1)

PROBLEM

With larger datasets of 200k+ rows and 50+ different classification groups, this process takes a long time to run, and I often get the 'DataFrame is highly fragmented' PerformanceWarning on .insert. I would like help building a solution that is quicker and more efficient.


Solution

  • I have four suggestions.

    First, you're pairing queries with names by calling query_list.index(query) on every iteration. That performs a linear search through query_list each time, and it silently misbehaves if the list ever contains the same query twice, since index() always returns the first match. This is both a latent bug and a performance problem.

    What you want is to loop over the queries and names in lockstep, matching up entries at the same position. The Python builtin zip() does exactly that.

    Example:

    for query, name in zip(query_list, query_names):
    

    Second, DataFrame.query() essentially does two things. It evaluates your expression into a boolean mask using DataFrame.eval(), then uses DataFrame.loc[] to select the rows where that mask is true.

    In the code after this, you undo the effects of DataFrame.loc[] by searching for the matching IDs again. Instead of selecting rows and then unselecting them, it is faster to work with the boolean mask from DataFrame.eval() directly.
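
    To see why, note that for a boolean expression (with expr standing in for any of the query strings above), df.query(expr) is roughly equivalent to building the mask and applying it yourself:

    # query() = eval() + loc[]: eval builds the boolean mask, loc selects rows
    mask = df.eval(expr)
    matched = df.loc[mask]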

    Example:

    for query, name in zip(query_list, query_names):
        in_query = tmp.eval(query)
    
        # Give them a 'Yes' or 'No' based on in_query being true or false
        df[name] = np.where(in_query, 'Yes', 'No')
    

    Third, I would suggest using booleans instead of 'Yes'/'No' strings. A boolean column stores one byte per row instead of a Python string object, works directly as a filter mask, and combines cleanly with the &, | and ~ operators.

    Example:

    for query, name in zip(query_list, query_names):
        in_query = tmp.eval(query)
        df[name] = in_query
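
    As a quick illustration of those benefits, using the group columns created above:

    # Boolean columns work directly as masks, with no string comparison
    over_40 = df[df['MALE_OVER_40']]

    # They combine with the usual logical operators
    usa_or_over_40 = df[df['USA'] | df['MALE_OVER_40']]

    # And they aggregate naturally, since True counts as 1
    print(df['MALE_OVER_40'].sum())   # how many people matched
    print(df['MALE_OVER_40'].mean())  # what fraction matched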
    

    The fourth suggestion pertains to your fragmentation warning.

    I often get the 'DataFrame is highly fragmented' PerformanceWarning on .insert.

    You get this warning because of a pandas internal structure called the BlockManager, which stores columns of the same dtype together in contiguous blocks; inserting columns one at a time splinters those blocks. You don't need to know the details, though: pd.concat() solves the problem.

    The way to do that is to build up a list of new columns, then add them all at once using pd.concat(). Adding them in a single operation avoids the fragmentation.

    new_columns = []
        
    for query, name in zip(query_list, query_names):
        in_query = df.eval(query)
        in_query.name = name
        new_columns.append(in_query)
        
    df = pd.concat([df, *new_columns], axis=1)
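
    Note that pd.concat() returns a new DataFrame rather than modifying df in place, which is why the result is assigned back to df.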
    

    Finally, here is the full rewritten code with all of these suggestions applied.

    import pandas as pd
    import numpy as np
    
    query1 = '(Gender in ["Male","Female"]) & (Country=="USA")'
    query2 = '(Country in ["USA", "UK"]) & (Gender=="Male")'
    query3 = '(Age > 40) & (Gender=="Male")'
    
    query_list = [query1, query2, query3]
    query_names = ['USA', 'MALE_USA_UK', 'MALE_OVER_40']
    
    # Create a dataframe with random data
    data = {
        'ID': range(1, 101),
        'Gender': ['Male', 'Female'] * 50,
        'Age': np.random.randint(18, 70, size=100),
        'Country': ['USA', 'Canada', 'UK', 'Australia'] * 25
    }
    
    df = pd.DataFrame(data)
    
    new_columns = []
        
    # Evaluate each query to a boolean Series named after its group
    for query, name in zip(query_list, query_names):
        in_query = df.eval(query)
        in_query.name = name
        new_columns.append(in_query)

    # Attach all the new columns in one operation to avoid fragmentation
    df = pd.concat([df, *new_columns], axis=1)
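
    A quick sanity check (exact counts will vary because the ages are random):

    # Each group column is boolean, so summing gives the match count per group
    print(df[query_names].sum())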