Summary:
1) How to write a Pandas DataFrame to GCS (Google Cloud Storage) from within a Jupyter notebook (such as an AI Platform Notebook)
2) In the same notebook, how to load that object into a new dataset in BigQuery
Problem
I have an object that is big enough to make it unfeasible to download it locally and then write it to GCS -> BQ. However, the object is not big enough to justify processing it with Apache Beam. I brought it into the notebook using the BQ magic. After making some transformations, I want to send the object back to my data repositories. Therefore, I am trying to copy it as AVRO, but I cannot figure out how to make that work. I have tried following this guide (https://github.com/ynqa/pandavro), but I have not yet figured out how the call should be written.
I'm doing this:
OUTPUT_PATH='{}/resumen2008a2019.avro'.format('gcs://xxxx')
pdx.to_avro(OUTPUT_PATH,df4)
That returns the following error: FileNotFoundError: [Errno 2] No such file or directory: 'gcs://xxxx'
Why not Parquet? It fails to convert the data: ArrowInvalid: ('Could not convert with type str: tried to convert to double', 'Conversion failed for column salario with type object')
Why not directly? I tried using this post as a guide (Write a Pandas DataFrame to Google Cloud Storage or BigQuery), but it is three years old and much of it no longer works that way.
Should I surrender and just write a classic ol' CSV?
Writing a DataFrame to BigQuery directly is very much supported and works without a hitch.
Assuming you are using a Google Cloud AI Platform notebook (so that we don't need to set up service accounts or install BigQuery packages), you can do the following to write a DataFrame to a BigQuery table:
from google.cloud import bigquery
import pandas

client = bigquery.Client(location="US")
dataset_id = 'your_new_dataset'
dataset = client.create_dataset(dataset_id)  # Create the destination dataset.
records = [
    {"title": "The Meaning of Life", "release_year": 1983},
    {"title": "Monty Python and the Holy Grail", "release_year": 1975},
    {"title": "Life of Brian", "release_year": 1979},
    {"title": "And Now for Something Completely Different", "release_year": 1971},
]
# Optionally set explicit indices.
# If indices are not specified, a column will be created for the default
# indices created by pandas.
index = ["Q24980", "Q25043", "Q24953", "Q16403"]
df = pandas.DataFrame(records, index=pandas.Index(index, name="wikidata_id"))
table_ref = dataset.table("monty_python")
job = client.load_table_from_dataframe(df, table_ref, location="US")
job.result() # Waits for table load to complete.
print("Loaded dataframe to {}".format(table_ref.path))
If you do want to use Pandavro, then you will need to modify the output path: "gs://" is not a local path and will not be understood by tools that can only write to the local file system. You will essentially have to split this into the following steps: