rregexdataframe

Extract displayed text from an hyperlink string stored in a data frame


I'm getting crazy with a (apparently) very simple problem.

I stored in a data frame Excel formulas as strings. Some of these strings are hyperlinks. From these hyperlink formulas, I want to extract only the displayed text and modify in place the cell of the data frame accordingly.

Here is a reprex:

DF <- data.frame(
  x = c('HYPERLINK("[Individual files/NetworkData4MassiveExport_Individual_1.xlsx]1_2016_Nodes!A3", "\'Nodes\' tab")', 'NA()'),
  y = c('NA()', 'HYPERLINK("[Individual files/NetworkData4MassiveExport_Individual_2.xlsx]2_2019_Nodes!A3", 1)'))
DF

DF[] <- lapply(DF, function(x)
  ifelse(substr(x, 1, 9) == 'HYPERLINK', trimws(gsub('[\")]', '', unlist(strsplit(x, ','))[2])), gsub('"', '', x)))
DF

Solution

  • Maybe a single sub() command can handle it.

    lapply(DF, \(x) sub("HYPERLINK\\(.+, (.+)\\)", "\\1", x)) |>
      as.data.frame()
    
    #               x    y
    # 1 "'Nodes' tab" NA()
    # 2          NA()    1