pythonpandasgoogle-drive-apipydrive

Google spreadsheet to Pandas dataframe via Pydrive without download


How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?


I think gspread or df2gspread may be good shots, but I've been working with pydrive so far and got close to the solution.

With Pydrive I managed to get the export link of my spreadsheet, either as .csv or .xlsx file. After the authentication process, this looks like


    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    drive = GoogleDrive(gauth)
    
    # choose whether to export csv or xlsx
    data_type = 'csv'
    
    # get list of files in folder as dictionaries
    file_list = drive.ListFile({'q': "'my-folder-ID' in parents and 
    trashed=false"}).GetList()
    
    export_key = 'exportLinks'
    
    excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    csv_key = 'text/csv'
    
    if data_type == 'excel':
        urls = [ file[export_key][excel_key] for file in file_list ]
    
    elif data_type == 'csv':
        urls = [ file[export_key][csv_key] for file in file_list ]

The type of url I get for xlsx is

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx

and similarly for csv

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv

Now, if I click on these links (or visit them with webbrowser.open(url)), I download the file, that I can then normally read into a Pandas dataframe with pandas.read_excel() or pandas.read_csv(), as described here.

How can I skip the download, and directly read the file into a dataframe from these links?

I tried several solutions:

    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

Interestingly these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, hinting that the script is trying to read not what it is intended to.

    ValueError: Excel file format cannot be determined, you must specify an engine manually.

and specifying e.g. engine = 'openpyxl' gives

zipfile.BadZipFile: File is not a zip file

    r = requests.get(url)
    data = r.content
    df = pd.read_csv(BytesIO(data))

still gives


    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

If I print(data) I get hundreds of lines of html code


    b'\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n  <meta charset="utf-8">\n  <meta content="width=300, initial-scale=1" name="viewport">\n 
    ...
    ...
     </script>\n  </body>\n</html>\n'


Solution

  • In your situation, how about the following modification? In this case, by retrieving the access token from gauth, the Spreadsheet is exported as XLSX data, and the XLSX data is put into the dataframe.

    Modified script:

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    
    url = "https://docs.google.com/spreadsheets/export?id={spreadsheetId}&exportFormat=xlsx"
    res = requests.get(url, headers={"Authorization": "Bearer " + gauth.attr['credentials'].access_token})
    values = pd.read_excel(BytesIO(res.content))
    print(values)