I am using Google Sheets API to retrieve data from online spreadsheets and to read them into a pandas dataframe. I have successfully set up the script to fetch the data but this default implementation fetches everything, even hidden rows/columns. There are many rows on the spreadsheet that have been hidden. I do not want to retrieve those, as the hidden state of a row means it is irrelevant. Therefore, I am looking for an approach to fetch a spreadsheet without the manually hidden rows/columns. Or, alternatively, an approach to fetch a spreadsheet content and exclude the hidden cells afterwards. So far I haven't managed to figure out whether such a feature is implemented in the Google Sheets API.
My current, working implementation is as follows. The problem here is that this includes even hidden rows/columns:
def getSpreadsheetData(name, spreadsheet_id, sheet_id=None):
global values_input, service
creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file_path, SCOPES)
service = build('sheets', 'v4', credentials=creds)
sheet = service.spreadsheets()
data_table = sheet.values().get(spreadsheetId=spreadsheet_id,
range=name).execute()
data_values = data_table.get('values', [])
if not data_values:
print('No data found.')
return -1
else:
df = pd.DataFrame(data_values)
return df
Edit: This problem, in my opinion, is different from merely filtering the spreadsheet based on a column value, as described here. I want to fetch only those rows of a spreadsheet that are not hidden. The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link.
Edit 2: I updated the post to make it more clear that my current working implementation returns all hidden cells and this is exactly what I do not want. I want an approach that can exclude those rows/columns that have manually been hidden by a user, as hidden cells are unimportant and irrelevant. Hence, I do not need them in my output dataframe.
Edit 3: I put together a small sample dataset shared below. Please copy the content to a Google Sheet. With this data, I found that the parser doesn't manage to distinguish what is supposed to be header and what is supposed to be in the first row. Note that the sample dataset has three visible rows and one hidden row. The function is required to return these three visible rows in a pandas dataframe. Instead, I get one header and two rows only, where the header is actually a combination of the true column names and the first row content. Dataset:
Lesson Date/Time German English Completed
Hallo! Oct 21st, 2021 nicht not
Hallo! Oct 21st, 2021 nicht so gut not so good 10/22
Hallo! Oct 21st, 2021 oder or
Hallo! Oct 21st, 2021 schön nice; beautiful; pretty 10/22
Note: row 3 (the one containing 'oder' and 'or') was hidden.
Code:
def getSpreadsheetData(spreadsheet_id, sheet_id=None):
global values_input, service
creds_file_path = "" # add your service account path here
SCOPES = ['https://www.googleapis.com/auth/spreadsheets', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file_path, SCOPES)
service = build('sheets', 'v4', credentials=creds)
access_token = creds.get_access_token().access_token
url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/gviz/tq?tqx=out:csv&gid=' + sheet_id
res = requests.get(url, headers={'Authorization': 'Bearer ' + access_token})
df = pd.read_csv(io.StringIO(res.text), sep=',')
return df
Incorrect output:
Lesson Hallo! Date/Time Oct 21st, 2021 German nicht English not Completed Unnamed: 5
0 Hallo! Oct 21st, 2021 nicht so gut not so good 10/22 NaN
1 Hallo! Oct 21st, 2021 schön nice; beautiful; pretty 10/22 NaN
As it can be seen, the header and first row is combined, which is incorrect. Including the header=0
parameter in pd.read_csv()
didn't help either, same incorrect output is returned.
About your following goal,
This problem, in my opinion, is different from merely filtering the spreadsheet based on a column value, as described here. I want to fetch only those rows of a spreadsheet that are not hidden. The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link.
My sample script retrieves the showing rows from the sheet with the hidden rows using Query Language. So, about I want to fetch only those rows of a spreadsheet that are not hidden.
, this can be achieved by the sample script. From The default API call to fetch content of a spreadsheet (see my code above) fetches all the rows, even those that have manually been hidden and hence are not visible for people opening the spreadsheet via a link.
, I thought that you have tried to use Sheets API. In this case, all rows are retrieved even when the hidden rows are existing.
And from read them into a pandas dataframe.
, I thought that your goal can be achieved by modifying the sample script of this answer.
So, in order to achieve your goal, the sample script is as follows.
service = build('sheets', 'v4', credentials=creds) # This is from your script.
spreadsheet_id = "###" # Please set the Spreadsheet ID.
sheet_id = "###" # Please set the sheet name.
# Removed ---> url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/gviz/tq?tqx=out:csv&gid=' + sheet_id
url = 'https://docs.google.com/spreadsheets/d/' + spreadsheet_id + '/pub?output=csv&gid=' + sheet_id
res = requests.get(url, headers={'Authorization': 'Bearer ' + creds.token})
df = pd.read_csv(io.StringIO(res.text), sep=',')
In this sample, import csv
, import io
and import requests
are also used. And, the access token is retrieved from creds
of service = build('sheets', 'v4', credentials=creds)
.
When this script is run, the showing rows are retrieved from the sheet with the hidden rows, and the retrieved values are put in the dataframe.
In this method, the showing rows can be retrieved from not only the sheet with the manually hidden rows but also the sheet with the filtered rows by the basic filter.
When the above script couldn't resolve your issue, please test the following script. In this script, the values are retrieved with Sheets API. At that time, at first, the row numbers filtered by the basic filter are retrieved. And, using the row numbers, the showing rows are retrieved.
service = build('sheets', 'v4', credentials=creds) # This is from your script.
spreadsheet_id = "###" # Please set the Spreadsheet ID.
sheet_name = "Sheet1" # Please set the sheet name.
fields = 'sheets(data(rowMetadata(hiddenByFilter)))'
res = service.spreadsheets().get(spreadsheetId=spreadsheet_id, ranges=sheet_name, fields=fields).execute()
rowMetadata = res["sheets"][0]["data"][0]["rowMetadata"]
filteredRows = {"shownRows": [], "hiddenRows": []}
for i, r in enumerate(rowMetadata):
filteredRows["hiddenRows" if "hiddenByFilter" in r and r["hiddenByFilter"] else "shownRows"].append(i + 1)
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=sheet_name).execute()
values = result.get("values", [])
v = []
for e in filteredRows["shownRows"]:
if e - 1 < len(values):
v.append(values[e - 1])
else:
break
df = pd.DataFrame(v[1:], columns=v[0])
print(df)
When the rows are hidden by the basic filter and the manual operation, you can retrieve the showing rows using the following script.
service = build('sheets', 'v4', credentials=creds) # This is from your script.
spreadsheet_id = "###" # Please set the Spreadsheet ID.
sheet_name = "Sheet1" # Please set the sheet name.
fields = 'sheets(data(rowMetadata(hiddenByFilter,hiddenByUser)))'
res = service.spreadsheets().get(spreadsheetId=spreadsheet_id, ranges=sheet_name, fields=fields).execute()
rowMetadata = res["sheets"][0]["data"][0]["rowMetadata"]
filteredRows = {"shownRows": [], "hiddenRows": []}
for i, r in enumerate(rowMetadata):
filteredRows["hiddenRows" if ("hiddenByFilter" in r and r["hiddenByFilter"]) or ("hiddenByUser" in r and r["hiddenByUser"]) else "shownRows"].append(i + 1)
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=sheet_name).execute()
values = result.get("values", [])
v = []
for e in filteredRows["shownRows"]:
if e - 1 < len(values):
v.append(values[e - 1])
else:
break
df = pd.DataFrame(v[1:], columns=v[0])
print(df)