r string data.table

Extract extension from variable filename with data.table


I have a data.table storing heterogeneous file names in a string column. I want to extract the extension from that column, always taking the characters after the last dot, because a filename may contain more than one dot.

I tried:

files0=data.table(filename=c("simple_file.csv","file with.two dots.xls"))
files0[,chunks:=length(tstrsplit(filename,"\\."))]
files0[,extension:=tstrsplit(filename,"\\.")[chunks]]

How do I make sure that tstrsplit is only applied to each row so that this approach works?

PS: I also managed to generate a column storing the correct number of text "chunks" with str_count, but the problem remains: when I create the "extension" column, the entire "filename" column seems to be used for each row.


Solution

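  • There is no need for sapply or strsplit; they add unnecessary complexity and inefficiency. Both options below are vectorised over the whole column: tools::file_ext (from the tools package that ships with R) or a plain sub that deletes everything up to and including the last dot.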

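    # files0 is the data.table from the question
    files0[, tools::file_ext(filename)]
    # [1] "csv" "xls"
    # greedy ".*" matches up to the last dot, so only the extension is left
    files0[, sub(".*\\.", "", filename)]
    # [1] "csv" "xls"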
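To store the result as a new column rather than just print it, either expression can go on the right-hand side of `:=`. The sketch below rebuilds the question's table (here called `files0`) and adds one hypothetical row, `"README"`, which is not in the original data. It is only there to show that the two approaches differ when a name has no dot. The column names `ext_file_ext` and `ext_sub` are illustrative as well.

```r
library(data.table)

# Question's data plus one hypothetical dot-less name ("README") for illustration
files0 <- data.table(
  filename = c("simple_file.csv", "file with.two dots.xls", "README")
)

# Both calls are vectorised, so each row gets its own extension
files0[, ext_file_ext := tools::file_ext(filename)]
files0[, ext_sub := sub(".*\\.", "", filename)]

print(files0)
```

For the two original filenames both columns contain "csv" and "xls". For the dot-less name, `tools::file_ext` returns an empty string, while `sub` finds nothing to replace and returns the name unchanged, so `tools::file_ext` is probably the safer choice if such names can occur.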