rdata.tablefread

how to make data.table omit extra col


I am trying to load a very large tab file by data.table::fread, my input data may have a extra tab before last col,like:

#The entire table is tab split, we just show finally tab as \t
col1    col2    col3\t # a tab here, I have many cols in fact (>3000)
1    2    9\t
1    3    3\t
3    9    6\t

I noted fread will infer colname if have extra col, but its not correctly

# fread infered as 
V1   col1    col2   col3

I dont want V1 exists in first

I have tried

  1. shift and reset colname, it could worked, but now I plan to use select to select columns when I read the file, so I can't do that

  2. use data.table::fread(sep='\t', cmd=paste0('pigz -d -c ', input, ' | sed \'s/\t$//\''), select = select_col), Its worked, but seems very slower than process gz by data.table::fread


Solution

  • Starting with this result:

    fread("input.tab", sep="\t")
    #     col1  col2  col3     V4
    #    <int> <int> <int> <lgcl>
    # 1:     1     2     9     NA
    # 2:     1     3     3     NA
    # 3:     3     9     6     NA
    

    if you know how many columns exist ahead of time, then you can simply select= them:

    fread("input.tab", select=1:3)
    #     col1  col2  col3
    #    <int> <int> <int>
    # 1:     1     2     9
    # 2:     1     3     3
    # 3:     3     9     6
    

    Alternatively, if it's a large file and you want to be flexible, then you can do something like this:

    # read in just enough rows so that you are confident-enough 
    # that the last column is all NA, a symptom of the trailing-tab
    tmp <- fread("input.tab", sep="\t", nrows=10)
    tmp
    #     col1  col2  col3     V4
    #    <int> <int> <int> <lgcl>
    # 1:     1     2     9     NA
    # 2:     1     3     3     NA
    # 3:     3     9     6     NA
    ncols <- ncol(tmp) - all(is.na(tmp[[ncol(tmp)]]))
    ncols
    # [1] 3
    fread("input.tab", select=seq(ncols))
    #     col1  col2  col3
    #    <int> <int> <int>
    # 1:     1     2     9
    # 2:     1     3     3
    # 3:     3     9     6
    

    If you have predetermined column indices in select_col, then you can use

    fread(..., select = select_col[select_col < ncols])
    

    This naive heuristic can be fooled if the tab-problem does not exist but the first n rows of the table are legitimately empty/null.