python, pandas, google-cloud-platform, google-cloud-storage, spss-files

GCP AI Platform cannot read .SAV file stored in Google Cloud Storage (Python)


I have an AI Platform VM instance set up with a Python 3 notebook. I also have a Google Cloud Storage bucket that contains numerous .CSV and .SAV files. I have no difficulty using standard Python packages like pandas to read in data from the CSV files, but my notebook appears unable to locate the .SAV files in my storage bucket.

Does anyone know what is going on here and/or how I can resolve this issue?

import numpy as np
import pandas as pd
import pyreadstat

df = pd.read_spss("gs://<STORAGE_BUCKET>/datafile.sav")

---------------------------------------------------------------------------
PyreadstatError                           Traceback (most recent call last)
<ipython-input-10-30836249273f> in <module>
----> 1 df = pd.read_spss("gs://<STORAGE_BUCKET>/datafile.sav")

/opt/conda/lib/python3.7/site-packages/pandas/io/spss.py in read_spss(path, usecols, convert_categoricals)
     41 
     42     df, _ = pyreadstat.read_sav(
---> 43         path, usecols=usecols, apply_value_formats=convert_categoricals
     44     )
     45     return df

pyreadstat/pyreadstat.pyx in pyreadstat.pyreadstat.read_sav()

pyreadstat/_readstat_parser.pyx in pyreadstat._readstat_parser.run_conversion()

PyreadstatError: File gs://<STORAGE_BUCKET>/datafile.sav does not exist!

Solution

  • The read_spss function can only read from a local file path:

    path : str or Path - File path.

    Compare that with the read_csv function:

    filepath_or_buffer : str, path object or file-like object - Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected.
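  • One workaround, sketched below: download the .sav file to local disk first, then pass the local path to `read_spss`. The helper names (`split_gs_uri`, `read_spss_from_gcs`) and the `/tmp` destination are my own choices, and the snippet assumes the `google-cloud-storage` package is installed and that the VM's default credentials can read the bucket:

    ```python
    def split_gs_uri(uri):
        """Split 'gs://bucket/path/to/file.sav' into (bucket name, blob path)."""
        bucket, _, blob_path = uri[len("gs://"):].partition("/")
        return bucket, blob_path

    def read_spss_from_gcs(gs_uri, local_path="/tmp/datafile.sav"):
        """Download a .sav file from GCS to local disk, then read it with pandas."""
        # Imports are inside the function so the URI helper above has no dependencies.
        from google.cloud import storage
        import pandas as pd

        bucket_name, blob_path = split_gs_uri(gs_uri)
        storage.Client().bucket(bucket_name).blob(blob_path).download_to_filename(local_path)
        return pd.read_spss(local_path)

    # df = read_spss_from_gcs("gs://<STORAGE_BUCKET>/datafile.sav")
    ```

    Note that calling `pyreadstat.read_sav` directly has the same limitation, so the download step is needed either way.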