I am trying to load a very large tab file by data.table::fread
, my input data may have a extra tab before last col,like:
#The entire table is tab split, we just show finally tab as \t
col1 col2 col3\t # a tab here, I have many cols in fact (>3000)
1 2 9\t
1 3 3\t
3 9 6\t
I noted fread
will infer colname if have extra col, but its not correctly
# fread infered as
V1 col1 col2 col3
I dont want V1 exists in first
I have tried
shift and reset colname, it could worked, but now I plan to use select to select columns when I read the file, so I can't do that
use data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col)
, Its worked, but seems very slower than process gz by data.table::fread
Starting with this result:
fread("input.tab", sep="\t")
# col1 col2 col3 V4
# <int> <int> <int> <lgcl>
# 1: 1 2 9 NA
# 2: 1 3 3 NA
# 3: 3 9 6 NA
if you know how many columns exist ahead of time, then you can simply select=
them:
fread("input.tab", select=1:3)
# col1 col2 col3
# <int> <int> <int>
# 1: 1 2 9
# 2: 1 3 3
# 3: 3 9 6
Alternatively, if it's a large file and you want to be flexible, then you can do something like this:
# read in just enough rows so that you are confident-enough
# that the last column is all NA, a symptom of the trailing-tab
tmp <- fread("input.tab", sep="\t", nrows=10)
tmp
# col1 col2 col3 V4
# <int> <int> <int> <lgcl>
# 1: 1 2 9 NA
# 2: 1 3 3 NA
# 3: 3 9 6 NA
ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
ncols
# [1] 3
fread("input.tab", select=seq(ncols))
# col1 col2 col3
# <int> <int> <int>
# 1: 1 2 9
# 2: 1 3 3
# 3: 3 9 6
If you have predetermined column indices in select_col
, then you can use
fread(..., select = select_col[select_col < ncols])
This naive heuristic can be fooled if the tab-problem does not exist but the first n
rows of the table are legitimately empty/null.