I need to read the urls from hyperlinks in Google Sheets (where applicable) to use in other parts of my analysis. I was reading my data with read_sheet() or range_read(), but that only gives me the value of the cell (what's visible, with no formulas or urls). I turned to range_read_cells() and had some success, but now I'm stuck. Consider the following:
library(googledrive)
library(googlesheets4)
library(dplyr)
df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")
View(df_test) # no urls - text only
df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                             sheet = "Sheet1",
                             range = "A1:C5",
                             cell_data = "full")
View(df_test2) # urls are deep within a list-column called 'cell'
x <- unlist(df_test2$cell)
x["hyperlink"]
The x["hyperlink"] call lets me access the url, but only the first one:
> x["hyperlink"]
hyperlink
"https://www.google.com/"
How can I apply this to the whole sheet? I need to read each url from the sheet and write them to a new column. Note that not all rows will have a url, but I assume I can handle that case conditionally.
I've shared a Google sheet with sample data at https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0. It's public, so you should be able to see it.
Thanks in advance!
I was trying to solve this exact problem today. I wrote the following function, which can be used in conjunction with purrr::map_df to do what you're looking to do:
library(googlesheets4)
library(dplyr)
library(purrr)
library(stringr)
# Pull the displayed text and (if present) the hyperlink out of one
# element of the 'cell' list-column.
extract_hyperlink_data <- function(cell) {
  cell_unlisted <- unlist(cell)
  list(
    text = cell_unlisted[["formattedValue"]],
    hyperlink = if ("hyperlink" %in% names(cell_unlisted)) cell_unlisted[["hyperlink"]] else NA_character_
  )
}
df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")
df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                             sheet = "Sheet1",
                             range = "A1:C5",
                             cell_data = "full")
URL_data <- df_test2 %>%
  filter(stringr::str_starts(loc, "C") & loc != "C1") %>% # keep column C, minus the header row
  pull(cell) %>%
  purrr::map_df(extract_hyperlink_data) %>%
  rename(Website = text, Website_URL = hyperlink)
df_test %>%
left_join(URL_data) %>%
View
Hope this helps - let me know if you need to troubleshoot. You could probably build a custom parser to use in conjunction with this, so that you don't need to know in advance which columns contain hyperlinks, but I haven't done that yet.
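One way to sketch that column-agnostic parser: parse every element of the 'cell' list-column and keep only the rows where a hyperlink was found; the loc column then tells you which cells carried links. This is a sketch under the same assumptions as above (the structure returned by range_read_cells(..., cell_data = "full")); the mock tibble below just imitates that structure so the example runs without sheet access:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Parse one element of the 'cell' list-column into a one-row tibble.
parse_cell <- function(cell) {
  flat <- unlist(cell)
  tibble(
    text      = if ("formattedValue" %in% names(flat)) flat[["formattedValue"]] else NA_character_,
    hyperlink = if ("hyperlink" %in% names(flat)) flat[["hyperlink"]] else NA_character_
  )
}

# Mock cells imitating range_read_cells() output (real cells have more fields).
mock <- tibble(
  loc  = c("A2", "C2"),
  cell = list(
    list(formattedValue = "Plain text"),
    list(formattedValue = "Google", hyperlink = "https://www.google.com/")
  )
)

hyperlinks <- mock %>%
  mutate(parsed = map(cell, parse_cell)) %>%
  unnest(parsed) %>%
  filter(!is.na(hyperlink)) %>%
  select(loc, text, hyperlink)
```

With a real sheet you would swap mock for the output of range_read_cells(ss, cell_data = "full") and go from there.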