I've got a data.table DT with a string column and a numeric column that indicates how many words from the start of the string should be extracted.
> require(data.table)
> DT <- data.table(string_col = c("A BB CCC", "DD EEE FFFF GDG", "AB DFD EFGD ABC DBC", "ABC DEF")
, first_n_words = c(2, 3, 3, 1))
> DT
string_col first_n_words
1: A BB CCC 2
2: DD EEE FFFF GDG 3
3: AB DFD EFGD ABC DBC 3
4: ABC DEF 1
I'd like to add a new column with the first-n-words of the string_col, as following:
> output_DT
string_col first_n_words output_string_col
1: A BB CCC 2 A BB
2: DD EEE FFFF GDG 3 DD EEE FFFF
3: AB DFD EFGD ABC DBC 3 AB DFD EFGD
4: ABC DEF 1 ABC
This is the gsub syntax that can be used:
> gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)
I basically need to create this gsub function for every row, using first_n_words of that row before applying it to string_col of that row. I'm only interested in a data.table syntax solution as it's a very large data set. a gsub solution would be most desired.
Edit: I've tried the following and it doesn't work
> DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col)]
Warning message:
In gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"), :
argument 'pattern' has length > 1 and only the first element will be used
>## This is not the desired output
> DT
string_col first_n_words output_string_col
1: A BB CCC 2 A BB
2: DD EEE FFFF GDG 3 DD EEE
3: AB DFD EFGD ABC DBC 3 AB DFD
4: ABC DEF 1 ABC DEF
This is not the desired output
An answer to keep your use of data.table is to use a grouping operation, as you need a value in gsub, not a vector:
DT[,line := .I]
DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words - 1, "}\\w+).*$"),"\\1", string_col),by = line]
> DT
string_col first_n_words line output_string_col
1: A BB CCC 2 1 A BB
2: DD EEE FFFF GDG 3 2 DD EEE FFFF
3: AB DFD EFGD ABC DBC 3 3 AB DFD EFGD
4: ABC DEF 1 4 ABC
As @Franck remarqued the grouping should be on first_n_words
to be more efficient
DT[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]
the benchmark with this modified version gives faster results :
library(microbenchmark)
denis <- function(x){
x[, output_string_col := gsub(paste0("^((\\w+\\W+){", first_n_words[1] - 1, "}\\w+).*$"),"\\1", string_col),by = first_n_words]
}
Tim <- function(x){
x[, output_string_col := apply(x, 1, function(x) {
gsub(paste0("^((\\w+\\W+){", as.numeric(x[2]) - 1, "}\\w+).*$"), "\\1", x[1])
})]
}
miss <- function(x){
x[, output_string_col := stringr::word(string_col, end = first_n_words)]
}
DT <- DT[sample(1:4, 1000, replace = TRUE),]
microbenchmark(
Tim(DT),
miss(DT),
denis(DT)
)
Unit: milliseconds
expr min lq mean median uq
Tim(DT) 56.851716 57.836126 60.435164 58.714486 60.753051
miss(DT) 11.042056 11.516928 12.427029 11.871800 12.617031
denis(DT) 1.993437 2.355283 2.555936 2.615181 2.680001
max neval
111.169277 100
20.916932 100
3.530668 100