r parsing conditional-statements lapply

Conditionally parse number from text string and assign to new column


I am attempting to conditionally parse numbers from text strings within a dataframe and then assign that parsed number to the corresponding row within the last column. The condition is grepl("apple", df$col1). Not every row will meet the condition, so the corresponding cell in the last column can be NA. Perhaps easier seen than explained:

col1 <- c("I have 1 apple", NA, "I have 2 apples", NA, "I have 3 apples", NA, "I have 4 apples")
col2 <- c(7:13)
df <- as.data.frame(cbind(col1, col2))

df$col3 <- NA

This gets close to my desired result:

df$col3 <- unlist(lapply(df$col1, readr::parse_number))

However, I only want to parse, and assign to df$col3, the rows that meet the condition grepl("apple", df$col1), because my actual dataset contains numbers within text strings that I do not want to parse. Would an if_else with lapply be a solution?


Solution

  • A solution with tidyverse:

    library(tidyverse)
    col1 <- c("I have 1 apple", NA, "I have 2 apples", NA, "I have 3 apples", NA, "I have 4 pears")
    col2 <- c(7:13)
    df <- as.data.frame(cbind(col1, col2))
    
    df %>% 
      mutate(col3 = ifelse(str_detect(col1, "apple"), 
                           str_extract(col1, "\\d+"), NA))
    #>              col1 col2 col3
    #> 1  I have 1 apple    7    1
    #> 2            <NA>    8 <NA>
    #> 3 I have 2 apples    9    2
    #> 4            <NA>   10 <NA>
    #> 5 I have 3 apples   11    3
    #> 6            <NA>   12 <NA>
    #> 7  I have 4 pears   13 <NA>
    

    Created on 2024-12-31 with reprex v2.0.2

    The function str_extract applies a regex pattern and returns the first match, here the first run of digits in each string. If your numbers are floating-point with a decimal point, the pattern can be expanded to "[\\d\\.]+".
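
    As a minimal sketch of the floating-point case: the vector x below is made up for illustration, and the stricter pattern "\\d+(?:\\.\\d+)?" is my own substitution for "[\\d\\.]+", which would also match a lone dot. Since str_extract returns character, as.numeric is added on the assumption that a numeric column is wanted.

    ```r
    library(stringr)

    # Hypothetical strings with decimal quantities (not from the question's data)
    x <- c("I have 1.5 apples", NA, "I have 2 apples", "I have 4 pears")

    # Digits, optionally followed by a dot and more digits
    num_pattern <- "\\d+(?:\\.\\d+)?"

    # Keep the number only where "apple" occurs; NA rows stay NA
    col3 <- ifelse(str_detect(x, "apple"),
                   as.numeric(str_extract(x, num_pattern)),
                   NA_real_)
    col3
    #> [1] 1.5  NA 2.0  NA
    ```

    The same expression should drop into the mutate call above in place of the plain str_extract.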

    If you know that "apple(s)" always directly follows the number, you can use a lookahead assertion in the regex:

    df %>% 
      mutate(col3 = str_extract(col1, "\\d+(?= apple)"))
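
    To illustrate what the lookahead changes, here is a small sketch on made-up strings (the vector x and the variable names are hypothetical, not from the question's data). It compares the plain digit pattern with the lookahead version when a string contains more than one number:

    ```r
    library(stringr)

    # Hypothetical strings, not from the question's data
    x <- c("I have 2 pears and 3 apples", "I have 4 pears", NA)

    # Plain pattern: first run of digits, whatever follows it
    first_number <- str_extract(x, "\\d+")
    first_number
    #> [1] "2" "4" NA

    # Lookahead: only digits that are immediately followed by " apple"
    before_apple <- str_extract(x, "\\d+(?= apple)")
    before_apple
    #> [1] "3" NA  NA

    # str_extract() returns character; convert if a numeric column is needed
    as.numeric(before_apple)
    #> [1]  3 NA NA
    ```

    With the lookahead, no separate ifelse is needed, because strings without a number directly before " apple" yield NA.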