pythonpandasmachine-learningparquet

Resolve infinite time for parquet file loading in pandas


!python -V
#%%
import pandas as pd
import pickle
from sklearn.metrics import root_mean_squared_error
from sklearn.feature_extraction import DictVectorizer
import mlflow

def read_dataframe1(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.apply(lambda td: td.total_seconds()/60)

    df = df[df.duration >= 1] & (df.duration <= 60)

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

#%%
train_df = read_dataframe1("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet") 

When I run the last line for loading the parquet file with the function read_dataframe1, it does not load even after waiting over 30 minutes and the cell keeps running

What can be the reason?


Solution

  • I suspect that your issue is here:

    df = df[df.duration >= 1] & (df.duration <= 60)
    

    This is creating a result that is n * m rows and columns. Where n is the number of rows matching df[df.duration >= 1] and m is the number of rows in df. I strongly suspect what you want to do is filter df for rows matching that combined criteria (duration between 1 and 60), so you likely wanted to do:

    df = df[(df.duration >= 1) & (df.duration <= 60)]
    

    The following will likely do the same in the event you find it simpler:

    df = df[df.duration.between(1, 60)]
    

    For Reference:

    import pandas
    
    df = pandas.DataFrame([
        {"foo": -1}, {"foo": -2}, {"foo": -3},
        {"foo": 1}, {"foo": 2}, {"foo": 3},
        {"foo": 11}, {"foo": 12}, {"foo": 13},
    ])
    
    print("-------------------")
    print("Rows matching df.foo <= 10")
    print(df.foo <= 10)
    print("-------------------\n")
    
    print("-------------------")
    print("Rows matching df[df.foo >= 1]")
    print(df[df.foo >= 1])
    print("-------------------\n")
    
    print("-------------------")
    print("Your current result")
    print("Rows matching df[df.foo >= 1] & (df.foo <= 10)")
    print(df[df.foo >= 1] & (df.foo <= 10))
    print("-------------------\n")
    
    print("-------------------")
    print("Likely your objective")
    print("Rows matching df[(df.foo >= 1) & (df.foo <= 10)]")
    print(df[(df.foo >= 1) & (df.foo <= 10)])
    print("-------------------\n")
    

    That will give you:

    -------------------
    Rows matching df.foo <= 10
    0     True
    1     True
    2     True
    3     True
    4     True
    5     True
    6    False
    7    False
    8    False
    Name: foo, dtype: bool
    -------------------
    
    -------------------
    Rows matching df[df.foo >= 1]
       foo
    3    1
    4    2
    5    3
    6   11
    7   12
    8   13
    -------------------
    
    -------------------
    Your current result
    Rows matching df[df.foo >= 1] & (df.foo <= 10)
         foo      0      1      2      3      4      5      6      7      8
    3  False  False  False  False  False  False  False  False  False  False
    4  False  False  False  False  False  False  False  False  False  False
    5  False  False  False  False  False  False  False  False  False  False
    6  False  False  False  False  False  False  False  False  False  False
    7  False  False  False  False  False  False  False  False  False  False
    8  False  False  False  False  False  False  False  False  False  False
    -------------------
    
    -------------------
    Likely your objective
    Rows matching df[(df.foo >= 1) & (df.foo <= 10)]
       foo
    3    1
    4    2
    5    3
    -------------------