I am currently working with a huge TSV file (~5,000 columns and 500,000 records) structured approximately as follows:
f.ID f.1.0.0 f.2.0.0 f.3.0.1 f.3.0.2
1 A 22 B32 -1
2 F 38 B1 65
I cannot inspect it manually, but I have a sister file that should be in the same format (with the join key f.ID in common).
Everything works fine on the sister file:
$ mlr --itsv cut -f f.ID file1.tab | head -n2
f.ID=1
f.ID=2
But when I try to subset the main file on known columns (e.g. f.ID), Miller returns nothing:
$ mlr --itsv cut -f f.ID file2.tab | head -n2
I am having a hard time diagnosing what is going on with this file, and I suspect it is formatted in a non-standard way. Is there a way to see what Miller is doing for each record, or to find where it is failing?
If you can use another tool, try the DuckDB CLI and run:
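duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') from read_csv_auto('input.tsv');" >output.csv
The dots are escaped so that the regular expression matches only the literal column name f.1.0.0; an unescaped dot matches any character, so it could also select a column such as f.120.0.
Start with a limited number of rows:
duckdb --csv -c "SELECT COLUMNS('^f\.1\.0\.0$') from read_csv_auto('input.tsv') limit 1000;" >output_1000.csv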
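Before switching tools, it may also be worth checking what the header line of the problem file really contains. If the fields are not separated by tabs, a TSV reader sees the whole header as a single column name, and a lookup for f.ID then finds nothing. The sketch below is one possible way to check, using only od, head and awk. The function names show_header_bytes and count_fields are made up for this example and are not part of Miller or DuckDB.

```shell
# Hypothetical helper functions (names are illustrative only).

# Dump the first 200 bytes with control characters made visible.
# Tabs appear as \t and carriage returns as \r; a UTF-8 byte-order
# mark would appear as the octal bytes 357 273 277 at the very start.
show_header_bytes() {
  LC_ALL=C od -c -N 200 "$1"
}

# Print the number of tab-separated fields on each of the first N lines
# (N defaults to 5). A count of 1 means the line contains no tabs, so
# the file is probably delimited by something else.
count_fields() {
  head -n "${2:-5}" "$1" | awk -F'\t' '{ print NF }'
}
```

For example, running count_fields file1.tab and count_fields file2.tab side by side should give roughly 5,000 on every line for both files if they truly share a format. If file2.tab reports 1, show_header_bytes file2.tab should reveal which separator or line ending is actually in use.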