!python -V
#%%
import pandas as pd
import pickle
from sklearn.metrics import root_mean_squared_error
from sklearn.feature_extraction import DictVectorizer
import mlflow
def read_dataframe1(filename):
df = pd.read_parquet(filename)
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
df.duration = df.duration.apply(lambda td: td.total_seconds()/60)
df = df[df.duration >= 1] & (df.duration <= 60)
categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].astype(str)
return df
#%%
train_df = read_dataframe1("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet")
When I run the last line for loading the parquet file with the function read_dataframe1, it does not load even after waiting over 30 minutes and the cell keeps running
What can be the reason?
I suspect that your issue is here:
df = df[df.duration >= 1] & (df.duration <= 60)
This is creating a result that is n * m
rows and columns. Where n
is the number of rows matching df[df.duration >= 1]
and m
is the number of rows in df
. I strongly suspect what you want to do is filter df
for rows matching that combined criteria (duration between 1 and 60), so you likely wanted to do:
df = df[(df.duration >= 1) & (df.duration <= 60)]
The following will likely do the same in the event you find it simpler:
df = df[df.duration.between(1, 60)]
For Reference:
import pandas
df = pandas.DataFrame([
{"foo": -1}, {"foo": -2}, {"foo": -3},
{"foo": 1}, {"foo": 2}, {"foo": 3},
{"foo": 11}, {"foo": 12}, {"foo": 13},
])
print("-------------------")
print("Rows matching df.foo <= 10")
print(df.foo <= 10)
print("-------------------\n")
print("-------------------")
print("Rows matching df[df.foo >= 1]")
print(df[df.foo >= 1])
print("-------------------\n")
print("-------------------")
print("Your current result")
print("Rows matching df[df.foo >= 1] & (df.foo <= 10)")
print(df[df.foo >= 1] & (df.foo <= 10))
print("-------------------\n")
print("-------------------")
print("Likely your objective")
print("Rows matching df[(df.foo >= 1) & (df.foo <= 10)]")
print(df[(df.foo >= 1) & (df.foo <= 10)])
print("-------------------\n")
That will give you:
-------------------
Rows matching df.foo <= 10
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 False
8 False
Name: foo, dtype: bool
-------------------
-------------------
Rows matching df[df.foo >= 1]
foo
3 1
4 2
5 3
6 11
7 12
8 13
-------------------
-------------------
Your current result
Rows matching df[df.foo >= 1] & (df.foo <= 10)
foo 0 1 2 3 4 5 6 7 8
3 False False False False False False False False False False
4 False False False False False False False False False False
5 False False False False False False False False False False
6 False False False False False False False False False False
7 False False False False False False False False False False
8 False False False False False False False False False False
-------------------
-------------------
Likely your objective
Rows matching df[(df.foo >= 1) & (df.foo <= 10)]
foo
3 1
4 2
5 3
-------------------