I have a number of parquet files with pricing data; the bid and ask prices and sizes are stored as lists of float values, e.g.
bidprices \
0 [4.51088, 4.51079, 4.51065, 4.51051, 4.51011, ...
1 [4.51088, 4.51079, 4.51065, 4.51051, 4.51011, ...
2 [4.51073, 4.51052, 4.51029, 4.51002]
3 [4.51049, 4.51049, 4.51039]
4 [4.51049, 4.51039]
... ...
633621 [4.52003, 4.52001, 4.51988, 4.5195]
bidsizes \
0 [1000000, 5000000, 10000000, 20000000, 4000000...
1 [1000000, 5000000, 10000000, 20000000, 4000000...
2 [1000000, 4000000, 5000000, 10000000]
3 [500000, 1000000, 3000000]
4 [1000000, 3000000]
... ...
633621 [500000, 500000, 2000000, 7000000]
I am using boto3 to connect to an AWS S3 bucket and read the files into a DataFrame. There are no connectivity or permission issues; the code has been tested and works when run from a Windows machine.
import io

import boto3
import pandas as pd

session = boto3.Session(profile_name='aws-profile')
s3 = session.client('s3')
files = []
for key in key_name:
    # Download the object and read the parquet bytes into a DataFrame
    response = s3.get_object(Bucket=bucket, Key=key + '/' + self.symbol + '_' + x + '.parquet')
    content = response['Body'].read()
    file_obj = io.BytesIO(content)
    df = pd.read_parquet(file_obj)
    files.append(df)
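The frames in files are combined later; that step isn't shown (and I don't think it matters here), but assume something like:

df = pd.concat(files, ignore_index=True)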
However, when I run from my machine (macOS Sequoia 15.1 (24B83), Python 3.9.6), the DataFrame has empty columns where the lists should be. The same thing happens when the file is stored locally.
df.isnull().all()
gives
[1739342 rows x 11 columns]
time False
sym False
provider False
valuedate False
received False
bid False
ask False
bidprices True
bidsizes True
askprices True
asksizes True
dtype: bool
I have tried updating Python versions, checking permissions, and verifying that the files aren't broken. The strangest thing is that I have one file saved locally that doesn't lose the list values when read into a DataFrame, but I can't see any difference in how it is stored compared with the other local files that don't work.
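One way I checked that the files aren't broken was to inspect a file's schema directly with pyarrow, bypassing pandas entirely (the file path below is a placeholder for one of the affected local files):

import pyarrow.parquet as pq

pf = pq.ParquetFile("prices.parquet")  # placeholder path to an affected file
print(pf.schema_arrow)  # the list columns should show up as e.g. list<element: double>
# Read just one list column to confirm the values are actually in the file
print(pf.read(columns=["bidprices"]).column("bidprices")[:3])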
I haven't included the full code as it doesn't appear to be the cause here, but I'm happy to include it if necessary. Any help greatly appreciated.
Had the same issue just now. Try using a different engine: fastparquet didn't work for me, but pyarrow did.
df = pd.read_parquet(file_obj, engine="pyarrow")
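If I understand pandas' behaviour correctly, the default engine="auto" tries pyarrow first and falls back to fastparquet only when pyarrow isn't installed, which would explain why the same code behaves differently across machines. A quick way to confirm on the affected machine is to read one of the local files with each engine explicitly (assuming both engines are installed; the path is a placeholder):

import pandas as pd

for engine in ("pyarrow", "fastparquet"):
    df = pd.read_parquet("prices.parquet", engine=engine)  # placeholder path
    print(engine, df["bidprices"].isnull().all())

If fastparquet reports True (all nulls) and pyarrow reports False, pinning engine="pyarrow" as above should fix it.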