google-sheets, google-sheets-api, googlesheets4

How to read hyperlinks from range_read_cells() in googlesheets4


I need to read the URLs from hyperlinks in a Google Sheet (where applicable) to use in other parts of my analysis. I was reading my data with read_sheet() or range_read(), but that only gives me the value of each cell (what's visible; no formulas or URLs). I turned to range_read_cells() and had some success, but now I'm stuck. Consider the following:

library(googledrive)
library(googlesheets4)
library(dplyr)

df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")

View(df_test) # no urls - text only

df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                             sheet = "Sheet1", 
                             range = "A1:C5", 
                             cell_data = "full")

View(df_test2) # urls are deep within a list-column called 'cell'

x <- unlist(df_test2$cell)
x["hyperlink"]

The x["hyperlink"] call lets me access the url, but only the first one:

> x["hyperlink"]
                hyperlink 
"https://www.google.com/" 

How can I apply this to the whole sheet? I need to read each URL from the sheet and write them to a new column. Note that not all rows will have a URL, but I assume I can handle that case conditionally.

I've shared a Google sheet with sample data at https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0. It's public, so you should be able to see it.

Thanks in advance!


Solution

  • I was trying to solve this exact problem today. I wrote the following function, which can be used in conjunction with purrr::map_df to do what you're looking to do:

    library(googlesheets4)
    library(dplyr)
    library(purrr)
    library(stringr)
    
    extract_hyperlink_data <- function(cell) {
      # flatten the nested cell data into a named vector
      cell_unlisted <- unlist(cell)
      list(
        text = cell_unlisted[["formattedValue"]],
        # not every cell carries a hyperlink, so fall back to NA
        hyperlink = if ("hyperlink" %in% names(cell_unlisted)) cell_unlisted[["hyperlink"]] else NA_character_
      )
    }
    
    df_test <- read_sheet("https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0")
    
    df_test2 <- range_read_cells(ss = "https://docs.google.com/spreadsheets/d/1oLj5YyBG2-ucvhhMUrl2K0Ko8jJkX4uvDChVfSfxT0o/edit#gid=0",
                                 sheet = "Sheet1", 
                                 range = "A1:C5", 
                                 cell_data = "full")
    URL_data <- df_test2 %>%
      # keep only the cells in column C, dropping the header cell C1
      filter(stringr::str_starts(loc, "C") & loc != "C1") %>%
      pull(cell) %>%
      purrr::map_df(extract_hyperlink_data) %>%
      rename(Website = text, Website_URL = hyperlink)
    
    df_test %>%
      left_join(URL_data, by = "Website") %>%
      View()
    
    

    Hope this helps - let me know if you need to troubleshoot. You could probably build a custom parser to use with this, so that you wouldn't need to know in advance which columns contain hyperlinks, but I haven't done that yet.
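
    Following up on that last idea, here is a minimal sketch of a column-agnostic extractor, assuming the `cell` list-column has the same nested shape as in the example above. The `extract_all_hyperlinks` helper and the mock data are hypothetical, not part of googlesheets4; the helper returns one URL per cell, with NA where a cell has no hyperlink:

```r
# Hypothetical helper: pull the hyperlink (if any) from every cell in the
# 'cell' list-column returned by range_read_cells(..., cell_data = "full").
extract_all_hyperlinks <- function(cells) {
  vapply(cells, function(cell) {
    flat <- unlist(cell)  # flatten the nested cell data into a named vector
    if ("hyperlink" %in% names(flat)) flat[["hyperlink"]] else NA_character_
  }, character(1))
}

# Mock cells standing in for API output (structure assumed from the example above):
mock_cells <- list(
  list(formattedValue = "Google", hyperlink = "https://www.google.com/"),
  list(formattedValue = "plain text")
)
extract_all_hyperlinks(mock_cells)
#> [1] "https://www.google.com/" NA
```

    You could then run something like `df_test2$url <- extract_all_hyperlinks(df_test2$cell)` and use the `loc` column to map the URLs back to the sheet's shape, without knowing in advance which columns hold hyperlinks.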