how to apply varying gsub pattern (variable function) to each row of data.table in R

I've got a data.table DT with a string column and a numeric column that indicates how many words from the start of the string should be extracted.

    > require(data.table)
    > DT <- data.table(string_col = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF") 
                     , first_n_words = c(2, 3, 3, 1))
    > DT
                string_col first_n_words
    1:            A BB CCC             2
    2:     DD EEE FFFF GDG             3
    3: AB DFD EFGD ABC DBC             3
    4:             ABC DEF             1

I'd like to add a new column with the first-n-words of the string_col, as following:

> output_DT
            string_col first_n_words output_string_col
1:            A BB CCC             2              A BB
2:     DD EEE FFFF GDG             3       DD EEE FFFF
3: AB DFD EFGD ABC DBC             3       AB DFD EFGD
4:             ABC DEF             1               ABC

This is the gsub syntax that can be used:

> gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)

I basically need to create this gsub function for every row, using first_n_words of that row before applying it to string_col of that row. I'm only interested in a data.table syntax solution as it's a very large data set. a gsub solution would be most desired.

Edit: I've tried the following and it doesn't work

> DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)]
Warning message:
In gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),  :
  argument 'pattern' has length > 1 and only the first element will be used
>## This is not the desired output    
> DT 
                string_col first_n_words output_string_col
    1:            A BB CCC             2              A BB
    2:     DD EEE FFFF GDG             3            DD EEE
    3: AB DFD EFGD ABC DBC             3            AB DFD
    4:             ABC DEF             1           ABC DEF

This is not the desired output

Solution

An answer to keep your use of data.table is to use a grouping operation, as you need a value in gsub, not a vector:

DT[,line := .I]
DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col),by = line]

> DT
            string_col first_n_words line output_string_col
1:            A BB CCC             2    1              A BB
2:     DD EEE FFFF GDG             3    2       DD EEE FFFF
3: AB DFD EFGD ABC DBC             3    3       AB DFD EFGD
4:             ABC DEF             1    4               ABC

Edit

As @Franck remarqued the grouping should be on first_n_wordsto be more efficient

DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]

the benchmark with this modified version gives faster results :

library(microbenchmark)

denis <- function(x){
  x[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]
}



Tim <- function(x){
  x[, output_string_col := apply(x, 1, function(x) {
    gsub(paste0("^((\\w+\\W+){", as.numeric(x[2]) - 1, "}\\w+).*$"), "\\1", x[1])
  })]
}

miss <- function(x){
  x[, output_string_col := stringr::word(string_col, end = first_n_words)]
}

DT <- DT[sample(1:4, 1000, replace = TRUE),]

microbenchmark(
  Tim(DT),
  miss(DT),
  denis(DT)
)

Unit: milliseconds
      expr       min        lq      mean    median        uq
   Tim(DT) 56.851716 57.836126 60.435164 58.714486 60.753051
  miss(DT) 11.042056 11.516928 12.427029 11.871800 12.617031
 denis(DT)  1.993437  2.355283  2.555936  2.615181  2.680001
        max neval
 111.169277   100
  20.916932   100
   3.530668   100