
How to extract specific characters using str_extract() in R


Context

I have a character vector a.

I want to extract the text between the last slash (/) and the .nc extension using the str_extract() function.

I tried str_extract(a, "(?=/).*(?=.nc)"), but it did not return what I expected.

Question

How can I get the text between the last slash and .nc in the character vector a?

Reproducible code

library(stringr)

a = c(
  'data/temp/air/pm2.5/pm2.5_year_2014.nc',
  'data/temp/air/pm10/pm10_year_2014.nc',
  'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)

# My solution (failed)

str_extract(a, "(?=/).*(?=.nc)")
# [1] "/temp/air/pm2.5/pm2.5_year_2014"       
# [2] "/temp/air/pm10/pm10_year_2014"         
# [3] "/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe"


# The expected output should look like this:

# [1] "pm2.5_year_2014"       
# [2] "pm10_year_2014"         
# [3] "ss_fef_10233_dfdfe"


Solution

  • Here is a regex replacement approach:

    a = c(
        'data/temp/air/pm2.5/pm2.5_year_2014.nc',
        'data/temp/air/pm10/pm10_year_2014.nc',
        'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
    )
    output <- gsub(".*/|\\.[^.]+$", "", a)
    output
    
    [1] "pm2.5_year_2014"    "pm10_year_2014"     "ss_fef_10233_dfdfe"
    

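    Here is the regex logic:

    .*/        match everything up to and including the last slash
    |          OR
    \.[^.]+$   match the final dot and the extension that follows it

    We then replace these matches with the empty string to remove them, leaving behind the file names without the directory or the extension.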
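
Since the question asks specifically about str_extract(), here is a minimal sketch of one way to stay with that function, assuming the stringr package is installed and every path ends in .nc. The original pattern seems to fail for two reasons: (?=/) is a zero-width lookahead that already succeeds right before the first slash, and the greedy .* then runs on from there, so the match starts at the first slash instead of after the last one; in addition, the unescaped . in .nc matches any character rather than a literal dot. The names via_stringr and via_base are illustrative only and do not come from the question.

```r
library(stringr)

a <- c(
  'data/temp/air/pm2.5/pm2.5_year_2014.nc',
  'data/temp/air/pm10/pm10_year_2014.nc',
  'efcv/asdfe/weewr/rtrkhh/ss_fef_10233_dfdfe.nc'
)

# [^/]+       one or more characters that are not a slash, so the match
#             cannot span a directory separator
# (?=\\.nc$)  lookahead: the match must be followed by a literal ".nc"
#             at the end of the string (the dot is escaped)
via_stringr <- str_extract(a, "[^/]+(?=\\.nc$)")
via_stringr

# Base R alternative without lookarounds (an assumption that a non-stringr
# route is acceptable): drop the directory part, then strip the trailing ".nc"
via_base <- sub("\\.nc$", "", basename(a))
via_base
```

Both calls should return the three file names listed as the expected output in the question. The lookahead version keeps the .nc check out of the returned text, while the basename() version avoids regular-expression lookarounds altogether.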