bash awk

Use awk to output each line from a file to a new filename based on specific separators


I have the following file with tabs as the field separators:

header1 header2 header3 header4 header5
1field1 1field2 1field3 1field4 1field5
2field1 2field2 2field3 2field4 2field5
3field1 3field2 3field3 3field4 3field5
4field1 4field2 4field3 4field4 4field5

I would like to write each line to its own file, skipping the header line. Each new file should be named from the 1st and 5th fields joined by an underscore. The file for the first data line (line 2 of the input) would be named "1field1_1field5.txt" and contain all the fields from that line, and so on. I have the following awk command, which prints the correct filenames to standard output:

awk -v FS='\t' -v OFS='_' 'NR>1 {print ($1,$5 ".txt") }'

but when I try to redirect the output into files with those names instead

awk -v FS='\t' -v OFS='_' 'NR>1 {print > ($1,$5 ".txt") }'

I get the following error

awk: cmd. line:1: NR>1 { print > ($1,$5 ".txt") }
awk: cmd. line:1:                               ^ syntax error

I have copied and pasted from 10 other articles to get this far, but I can't work out what is wrong with my syntax.


Solution

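  • The syntax error occurs because the target of an output redirection must be a single string expression. ($1,$5 ".txt") is a comma-separated expression list, which print accepts as its arguments but which is not a string, and OFS is only inserted between print's arguments, so the filename has to be built by explicit concatenation. Using any awk, you should do the following if the combination of $1 and $5 is unique per row: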

    awk -F '\t' 'NR>1 { out=$1 "_" $5 ".txt"; print > out; close(out) }'
    

    and this otherwise:

    awk -F '\t' 'NR>1 { out=$1 "_" $5 ".txt"; if (!seen[out]++) printf "" > out; print >> out; close(out) }'
    

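    The close() releases each output file as soon as its line has been written, so you don't end up with a "too many open files" error if your input is large. The printf "" > out empties (or creates) the output file the first time its name is seen, in case it already existed before your script ran. The print >> out then appends, which is necessary because the file is closed and reopened for every line and reopening it with > would truncate it again.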

    With GNU awk you could get away without the close():

    awk -F '\t' 'NR>1 { print > ($1 "_" $5 ".txt") }'
    

    but the script will slow down significantly for large input, because gawk then has to manage opening and closing all of the output files internally as needed.
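
To see the duplicate-safe variant in action, here is a minimal sketch you can run in a scratch directory. The file name input.tsv, the sample rows and the pre-existing a_x.txt are all made up for the demonstration and are not from the question. Two of the rows share the same $1 and $5, and a stale output file is planted beforehand to show that it gets emptied rather than appended to.

```shell
#!/bin/sh
# Hypothetical demo: input.tsv, the sample rows and the stale file are
# invented for illustration only.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Tab-separated sample; the first and third data rows share $1 and $5.
printf 'h1\th2\th3\th4\th5\n'  > input.tsv
printf 'a\t1\t2\t3\tx\n'      >> input.tsv
printf 'b\t4\t5\t6\ty\n'      >> input.tsv
printf 'a\t7\t8\t9\tx\n'      >> input.tsv

# Leftover from an imaginary earlier run; it should be emptied, not kept.
printf 'stale\n' > a_x.txt

awk -F '\t' 'NR>1 {
    out = $1 "_" $5 ".txt"
    if (!seen[out]++) printf "" > out   # empty the file on first use of this name
    print >> out                        # append the current line
    close(out)                          # release the file before the next line
}' input.tsv

# Show how many lines ended up in each output file.
wc -l a_x.txt b_y.txt
```

After the run, a_x.txt should hold both of the "a" rows and no "stale" line, and b_y.txt should hold the single "b" row.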