python-3.x, google-cloud-platform, google-cloud-storage, avro, google-cloud-datalab

Writing to GCS using AVRO within AI Notebook


Summary:

1) How to write a Pandas DataFrame to GCS (Google Cloud Storage) from within a Jupyter notebook (like an AI Platform Notebook)

2) In the same notebook, how to load that object into a new dataset in BigQuery

Problem

I have an object that is big enough to make it unfeasible to download locally and then write it to GCS -> BQ. However, the object is not big enough to warrant processing with Apache Beam. I brought it into the notebook using the BQ magic. After making some transformations, I want to send the object back to my data repositories. Therefore, I am trying to use AVRO to copy it, but I cannot figure out how to make it work. I have tried following this guide (https://github.com/ynqa/pandavro), but I have not yet figured out how the function should be spelt.

I'm doing this:

OUTPUT_PATH='{}/resumen2008a2019.avro'.format('gcs://xxxx')
pdx.to_avro(OUTPUT_PATH,df4)

That returns the following error: FileNotFoundError: [Errno 2] No such file or directory: 'gcs://xxxx'

Why not Parquet? It is not able to transform the data correctly into JSON: ArrowInvalid: ('Could not convert with type str: tried to convert to double', 'Conversion failed for column salario with type object')

Why not directly? I tried using this post as a guide (Write a Pandas DataFrame to Google Cloud Storage or BigQuery), but it is three years old and much of it no longer works that way.

Should I surrender and just write a classic ol' CSV?


Solution

  • Writing a DataFrame to BigQuery directly is very much supported and works without a hitch.

    Assuming you are using a Google Cloud AI Platform notebook (so that we don't need to set up service accounts or install the BigQuery packages), you can do the following to write from a DataFrame to a BQ table:

        from google.cloud import bigquery
        import pandas

        client = bigquery.Client(location="US")
        dataset_id = 'your_new_dataset'
        dataset = client.create_dataset(dataset_id)  # Create the dataset that will hold the table.
    
        records = [
            {"title": "The Meaning of Life", "release_year": 1983},
            {"title": "Monty Python and the Holy Grail", "release_year": 1975},
            {"title": "Life of Brian", "release_year": 1979},
            {"title": "And Now for Something Completely Different", "release_year": 1971},
        ]
    
        # Optionally set explicit indices.
        # If indices are not specified, a column will be created for the default
        # indices created by pandas.
        index = ["Q24980", "Q25043", "Q24953", "Q16403"]
        df = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
    
        table_ref = dataset.table("monty_python")
        job = client.load_table_from_dataframe(df, table_ref, location="US")
    
        job.result()  # Waits for table load to complete.
        print("Loaded dataframe to {}".format(table_ref.path))
    

    If you do want to use pandavro, then you will need to modify the output path: "gs://" is not a local path and will not be understood by tools that can only write to the local file system. You will essentially have to split this into the following steps (a sketch follows the list below):

    1. Write the file to a local directory
    2. Run a load job with the resultant avro file into BigQuery
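
    A minimal sketch of those two steps, assuming pandavro and the google-cloud-storage client are installed, and that the bucket, file, dataset and table names below are placeholders for your own:

        import pandavro as pdx
        from google.cloud import bigquery, storage

        # 1) Write the DataFrame (df4 from the question) to a local Avro file.
        LOCAL_PATH = '/tmp/resumen2008a2019.avro'
        pdx.to_avro(LOCAL_PATH, df4)

        # Upload the local Avro file to GCS (bucket and object names are placeholders).
        storage_client = storage.Client()
        bucket = storage_client.bucket('your-bucket')
        bucket.blob('resumen2008a2019.avro').upload_from_filename(LOCAL_PATH)

        # 2) Run a load job from the GCS URI into a BigQuery table.
        bq_client = bigquery.Client(location="US")
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = bigquery.SourceFormat.AVRO
        load_job = bq_client.load_table_from_uri(
            'gs://your-bucket/resumen2008a2019.avro',
            'your_new_dataset.resumen2008a2019',   # destination table ID is a placeholder
            job_config=job_config,
        )
        load_job.result()  # Waits for the load job to complete.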