python-3.x, list, macos, boto3, parquet

Lists read from parquet files into a dataframe show as None on macOS but work on Windows


I have a number of parquet files with pricing data. The bid and ask prices and sizes are stored as lists of float values, e.g.

                                                bidprices  \
0       [4.51088, 4.51079, 4.51065, 4.51051, 4.51011, ...   
1       [4.51088, 4.51079, 4.51065, 4.51051, 4.51011, ...   
2                    [4.51073, 4.51052, 4.51029, 4.51002]   
3                             [4.51049, 4.51049, 4.51039]   
4                                      [4.51049, 4.51039]   
...                                                   ...      
633621                [4.52003, 4.52001, 4.51988, 4.5195]   

                                                 bidsizes  \
0       [1000000, 5000000, 10000000, 20000000, 4000000...   
1       [1000000, 5000000, 10000000, 20000000, 4000000...   
2                   [1000000, 4000000, 5000000, 10000000]   
3                              [500000, 1000000, 3000000]   
4                                      [1000000, 3000000]   
...                                                   ...      
633621                 [500000, 500000, 2000000, 7000000]   

I am using boto3 to connect to an AWS S3 bucket and read the files into a dataframe. There are no connectivity or permission issues; the code has been tested and works when run from a Windows machine.

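import io

import boto3
import pandas as pd

# Excerpt from a method: key_name, bucket, self.symbol, x and files are defined elsewhere
session = boto3.Session(profile_name='aws-profile')
s3 = session.client('s3')
for key in key_name:
    response = s3.get_object(Bucket=bucket, Key=f'{key}/{self.symbol}_{x}.parquet')
    content = response['Body'].read()
    file_obj = io.BytesIO(content)  # wrap the downloaded bytes in a file-like object
    df = pd.read_parquet(file_obj)
    files.append(df)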

However, when I run it on my own machine (macOS Sequoia 15.1 (24B83), Python 3.9.6), the dataframe has empty columns where the lists should be. The same thing happens when the file is stored locally.

df.isnull().all() gives

[1739342 rows x 11 columns]
time         False
sym          False
provider     False
valuedate    False
received     False
bid          False
ask          False
bidprices     True
bidsizes      True
askprices     True
asksizes      True
dtype: bool

I have tried updating Python versions, checking permissions, and verifying that the files aren't broken. The strangest thing is that I have one file saved locally that doesn't lose the list values when read into a dataframe, but I can't see any difference in how it is stored compared to the other local files that don't work.
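For anyone wanting to repeat the "files aren't broken" check, here is a minimal sketch of a superficial test using only the standard library. It relies on the parquet layout, where a file starts and ends with the 4-byte magic `PAR1` and the 4 bytes before the trailing magic hold the footer length. The function name `looks_like_complete_parquet` is made up for illustration. Passing this check only rules out a truncated download; it says nothing about whether a given engine can decode the list columns.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(path):
    """Superficial check: PAR1 magic at both ends and a plausible footer length."""
    size = os.path.getsize(path)
    # Smallest possible layout: header magic (4) + footer length (4) + trailing magic (4)
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)
        footer_len = int.from_bytes(f.read(4), "little")
        tail = f.read(4)
    # The footer must fit between the header magic and the final 8 bytes
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC and footer_len <= size - 12
```

Running it over both the working file and a non-working file should return True for each if the problem is in how the data is read rather than in the files themselves.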

I haven't included the full code as it doesn't appear to be the cause here, but I am happy to include it if necessary. Any help is greatly appreciated.


Solution

  • Had the same issue right now. Try using a different engine. fastparquet didn't work for me but pyarrow did.

    df = pd.read_parquet(file_obj, engine="pyarrow")
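    A possible reason the same code behaves differently on two machines: with the default engine="auto", pandas is documented to try pyarrow first and fall back to fastparquet if pyarrow is unavailable, so the engine actually used depends on what is installed. The sketch below is one way to check this using only the standard library. The helper names `installed_parquet_engines` and `default_engine` are made up for illustration, and the result is only a guess at what pandas would pick, since pandas can also reject an installed engine that is too old.

```python
from importlib import metadata


def installed_parquet_engines():
    """Return {package: version or None} for pandas and the two parquet engines."""
    versions = {}
    for package in ("pandas", "pyarrow", "fastparquet"):
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None  # package is not installed in this environment
    return versions


def default_engine(versions):
    """Mirror pandas' documented engine="auto" order: pyarrow first, then fastparquet."""
    for engine in ("pyarrow", "fastparquet"):
        if versions.get(engine) is not None:
            return engine
    return None


if __name__ == "__main__":
    found = installed_parquet_engines()
    print(found)
    print("engine='auto' would likely pick:", default_engine(found))
```

    Running this on both the Windows and the macOS machine and comparing the output would show whether they are reading the files with different engines. Passing engine="pyarrow" explicitly, as above, removes that dependence on the environment.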