python pandas google-cloud-datalab

Read CSV file to Datalab from Google Cloud Storage and convert to pandas dataframe


I am trying to read a CSV file saved in Google Cloud Storage into a dataframe for analysis.

I have followed these steps without success:

mybucket = storage.Bucket('bucket-name')
data_csv = mybucket.object('data.csv')
df = pd.read_csv(data_csv)

This doesn't work because data_csv is not a path, which is what pd.read_csv expects. I also tried:

%%gcs read --object $data_csv --variable data
#result: %gcs: error: unrecognized arguments: Cloud Storage Object gs://path/to/file.csv

How can I read my file into a dataframe for analysis?

Thanks


Solution

  • %%gcs read returns a bytes object. To read it with pandas, wrap it in BytesIO from the io module (Python 3):

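    from io import BytesIO
    import pandas as pd
    import google.datalab.storage as storage  # assumed import path; older Datalab releases use datalab.storage

    mybucket = storage.Bucket('bucket-name')
    data_csv = mybucket.object('data.csv')

    %%gcs read --object $data_csv --variable data

    # data now holds the file contents as bytes
    df = pd.read_csv(BytesIO(data), sep = ';')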
    

    If your CSV file is comma-separated, there is no need to pass sep = ',' because that is the default. Read more about the io library here: Core tools for working with streams
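
The BytesIO step can be tried on its own, without a bucket. This is only a minimal sketch: the bytes literal below is a made-up stand-in for whatever `%%gcs read ... --variable data` would store, and the column names are hypothetical.

```python
from io import BytesIO

import pandas as pd

# Hypothetical stand-in for the bytes that `%%gcs read ... --variable data` stores.
data = b"name;score\nalice;1\nbob;2\n"

# read_csv accepts any file-like object, so wrap the raw bytes in BytesIO.
# sep=';' is needed here only because the sample uses semicolons.
df = pd.read_csv(BytesIO(data), sep=';')
print(df)
```

Passing the bytes directly to pd.read_csv would not work, since it expects a path or a file-like object; BytesIO gives the bytes the file-like interface.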