miller

Miller returns nothing to stdout


I am currently working with a huge TSV file (~5,000 columns and 500,000 records) structured approximately as follows:

f.ID    f.1.0.0    f.2.0.0    f.3.0.1    f.3.0.2
1    A    22    B32    -1    
2    F    38    B1    65 

I cannot inspect it manually, but I have a sister file that should be in the same file format (with the join key f.ID in common).

Everything works fine on the sister file:

$ mlr --itsv cut -f f.ID file1.tab | head -n2
f.ID=1
f.ID=2

But when I try to subset the large file on a known column (e.g. f.ID), Miller returns nothing:

$ mlr --itsv cut -f f.ID file2.tab | head -n2

I am having a hard time diagnosing what is going on with this file; I suspect it is formatted in a non-standard way. Is there a way to see what Miller is doing for each record, or to find where it is failing?


Solution

  • If you can use another tool, try the DuckDB CLI and run

    duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') FROM read_csv_auto('input.tsv');" >output.csv
    

    On a file this large, start with a limited number of rows:

    duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') FROM read_csv_auto('input.tsv') LIMIT 1000;" >output_1000.csv
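
To inspect the file itself before switching tools, a rough sketch with standard Unix utilities may help. It assumes the problem is either a damaged header line (for example a byte-order mark or Windows line endings) or records with a varying number of fields; the question does not confirm either, so treat this as a starting point. The function names `field_count_summary` and `header_bytes` are hypothetical helpers, not part of Miller or DuckDB.

```shell
# Hypothetical helper: count how many lines have each number of
# tab-separated fields. A well-formed TSV prints a single row;
# several rows suggest ragged records.
# Output format: "<fields per line> <number of lines>"
field_count_summary() {
  awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' "$1" | sort -n
}

# Hypothetical helper: dump the first bytes of the header line.
# A UTF-8 byte-order mark shows up as "357 273 277" at the start,
# and Windows line endings show up as "\r" before "\n".
header_bytes() {
  head -n 1 "$1" | od -c | head -n 5
}

# Usage (file names taken from the question):
#   field_count_summary file2.tab
#   header_bytes file2.tab
```

Running both helpers on file1.tab and file2.tab and comparing the output should show whether the two files really share the same layout.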