pythongoogle-sheets-api

Fetching only the non-hidden, visible data of a Google spreadsheet using Python and Google Sheets API?


I am using Google Sheets API to retrieve data from online spreadsheets and to read them into a pandas dataframe. I have successfully set up the script to fetch the data but this default implementation fetches everything, even hidden rows/columns. There are many rows on the spreadsheet that have been hidden. I do not want to retrieve those, as the hidden state of a row means it is irrelevant. Therefore, I am looking for an approach to fetch a spreadsheet without the manually hidden rows/columns. Or, alternatively, an approach to fetch a spreadsheet content and exclude the hidden cells afterwards. So far I haven't managed to figure out whether such a feature is implemented in the Google Sheets API.

My current, working implementation is as follows. The problem here is that this includes even hidden rows/columns:

def getSpreadsheetData(name, spreadsheet_id, sheet_id=None):

    global values_input, service
    creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file_path, SCOPES)
    service = build('sheets', 'v4', credentials=creds)

    sheet = service.spreadsheets()
    data_table = sheet.values().get(spreadsheetId=spreadsheet_id,
                                      range=name).execute()
    data_values = data_table.get('values', [])

    if not data_values:
        print('No data found.')
        return -1

    else:
        df = pd.DataFrame(data_values)
        return df

Edit: This problem, in my opinion, is different from merely filtering the spreadsheet based on a column value, as described here. I want to fetch only those rows of a spreadsheet that are not hidden. The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link.

Edit 2: I updated the post to make it more clear that my current working implementation returns all hidden cells and this is exactly what I do not want. I want an approach that can exclude those rows/columns that have manually been hidden by a user, as hidden cells are unimportant and irrelevant. Hence, I do not need them in my output dataframe.

Edit 3: I put together a small sample dataset shared below. Please copy the content to a Google Sheet. With this data, I found that the parser doesn't manage to distinguish what is supposed to be header and what is supposed to be in the first row. Note that the sample dataset has three visible rows and one hidden row. The function is required to return these three visible rows in a pandas dataframe. Instead, I get one header and two rows only, where the header is actually a combination of the true column names and the first row content. Dataset:

Lesson  Date/Time       German        English                   Completed
Hallo!  Oct 21st, 2021  nicht         not   
Hallo!  Oct 21st, 2021  nicht so gut  not so good               10/22
Hallo!  Oct 21st, 2021  oder          or    
Hallo!  Oct 21st, 2021  schön         nice; beautiful; pretty   10/22
    

Note: row 3 (the one containing 'oder' and 'or') was hidden.

Code:

def getSpreadsheetData(spreadsheet_id, sheet_id=None):

    global values_input, service
    creds_file_path = ""        # add your service account path here 
    SCOPES = ['https://www.googleapis.com/auth/spreadsheets', 'https://www.googleapis.com/auth/drive']

    creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file_path, SCOPES)
    service = build('sheets', 'v4', credentials=creds)
    access_token = creds.get_access_token().access_token

    url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/gviz/tq?tqx=out:csv&gid=' + sheet_id
    res = requests.get(url, headers={'Authorization': 'Bearer ' + access_token})
    df = pd.read_csv(io.StringIO(res.text), sep=',')
    return df

Incorrect output:

  Lesson Hallo! Date/Time Oct 21st, 2021  German nicht              English not Completed   Unnamed: 5
0        Hallo!           Oct 21st, 2021  nicht so gut              not so good      10/22         NaN
1        Hallo!           Oct 21st, 2021         schön  nice; beautiful; pretty      10/22         NaN

As it can be seen, the header and first row is combined, which is incorrect. Including the header=0 parameter in pd.read_csv() didn't help either, same incorrect output is returned.



Solution

  • About your following goal,

    This problem, in my opinion, is different from merely filtering the spreadsheet based on a column value, as described here. I want to fetch only those rows of a spreadsheet that are not hidden. The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link.

    My sample script retrieves the showing rows from the sheet with the hidden rows using Query Language. So, about I want to fetch only those rows of a spreadsheet that are not hidden., this can be achieved by the sample script. From The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link., I thought that you have tried to use Sheets API. In this case, all rows are retrieved even when the hidden rows are existing.

    And from read them into a pandas dataframe., I thought that your goal can be achieved by modifying the sample script of this answer.

    So, in order to achieve your goal, the sample script is as follows.

    Sample script:

    Sample script:

    service = build('sheets', 'v4', credentials=creds) # This is from your script.
    
    spreadsheet_id = "###" # Please set the Spreadsheet ID.
    sheet_id = "###"  # Please set the sheet name.
    
    # Removed ---> url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/gviz/tq?tqx=out:csv&gid=' + sheet_id
    url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/pub?output=csv&gid=' + sheet_id
    res = requests.get(url, headers={'Authorization': 'Bearer ' + creds.token})
    df = pd.read_csv(io.StringIO(res.text), sep=',')
    

    Reference:

    Added 1:

    When the above script couldn't resolve your issue, please test the following script. In this script, the values are retrieved with Sheets API. At that time, at first, the row numbers filtered by the basic filter are retrieved. And, using the row numbers, the showing rows are retrieved.

    Sample script:

    service = build('sheets', 'v4', credentials=creds) # This is from your script.
    
    spreadsheet_id = "###" # Please set the Spreadsheet ID.
    sheet_name = "Sheet1"  # Please set the sheet name.
    
    fields = 'sheets(data(rowMetadata(hiddenByFilter)))'
    res = service.spreadsheets().get(spreadsheetId=spreadsheet_id, ranges=sheet_name, fields=fields).execute()
    rowMetadata = res["sheets"][0]["data"][0]["rowMetadata"]
    filteredRows = {"shownRows": [], "hiddenRows": []}
    for i, r in enumerate(rowMetadata):
        filteredRows["hiddenRows" if "hiddenByFilter" in r and r["hiddenByFilter"] else "shownRows"].append(i + 1)
    
    result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=sheet_name).execute()
    values = result.get("values", [])
    v = []
    for e in filteredRows["shownRows"]:
        if e - 1 < len(values):
            v.append(values[e - 1])
        else:
            break
    df = pd.DataFrame(v[1:], columns=v[0])
    print(df)
    

    Added 2:

    When the rows are hidden by the basic filter and the manual operation, you can retrieve the showing rows using the following script.

    Sample script:

    service = build('sheets', 'v4', credentials=creds) # This is from your script.
    
    spreadsheet_id = "###" # Please set the Spreadsheet ID.
    sheet_name = "Sheet1"  # Please set the sheet name.
    
    fields = 'sheets(data(rowMetadata(hiddenByFilter,hiddenByUser)))'
    res = service.spreadsheets().get(spreadsheetId=spreadsheet_id, ranges=sheet_name, fields=fields).execute()
    rowMetadata = res["sheets"][0]["data"][0]["rowMetadata"]
    filteredRows = {"shownRows": [], "hiddenRows": []}
    for i, r in enumerate(rowMetadata):
        filteredRows["hiddenRows" if ("hiddenByFilter" in r and r["hiddenByFilter"]) or ("hiddenByUser" in r and r["hiddenByUser"]) else "shownRows"].append(i + 1)
    
    result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=sheet_name).execute()
    values = result.get("values", [])
    v = []
    for e in filteredRows["shownRows"]:
        if e - 1 < len(values):
            v.append(values[e - 1])
        else:
            break
    df = pd.DataFrame(v[1:], columns=v[0])
    print(df)