bashawksunos

Can I split a CSV file in multiple output files, adding a header to each file, in one go with awk?


I have a CSV file like this

>cat test.in
1|aaa|bbb
1|ccc|ddd
2|xxx|yyy
2|www|zzz
2|www|ttt

and I want to split it in separate files:

Is there a neat way to do it in one go with awk?

So far I've managed to have awk create the output files but I can't make it add the header, so I just loop over the output files and add it afterwards

>cat script.sh
#!/bin/bash

FIELD_SEPARATOR="|"
OUTPUT_HEADER="Key|Value"

awk '{FS=OFS="'${FIELD_SEPARATOR}'"; print $2,$3> "prefix." $1 "." $2 ".out"}' test.in

# add the header to all the output files
echo $OUTPUT_HEADER > header
for filename in $(ls prefix.*.out 2>/dev/null); do
  cat header $filename > $filename.tmp && mv $filename.tmp $filename
done
rm header

which gives the expected output

>ls prefix.*.out
prefix.1.aaa.out  prefix.1.ccc.out  prefix.2.www.out  prefix.2.xxx.out

>cat prefix.1.aaa.out
Key|Value
aaa|bbb

>cat prefix.1.ccc.out
Key|Value
ccc|ddd

>cat prefix.2.www.out
Key|Value
www|zzz
www|ttt

>cat prefix.2.xxx.out
Key|Value
xxx|yyy

Solution

  • A simple way to do this in awk is keep an array of the filenames created. If the filename isn't already in the array, output the header and then append your field-2 and field-3 as contents. A check that the number of fields is 3 helps ignore blank lines, etc.

    You can write your script as:

    awk -F"|" '
      BEGIN { hdr="Key|Value"; OFS=FS } 
      NF==3 { 
        ofn="prefix." $1 "." $2 ".out"
        if (! (ofn in arr)) { 
          print hdr > ofn 
        }
        arr[ofn] = 1
        print $2,$3 >> ofn 
      }
    ' test.in
    

    Or if you like long 1-liners:

    awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in
    

    Example Use/Output

    $ awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in
    

    Result:

    $ l
    total 28
    drwxr-xr-x 2 david david 4096 Nov 29 14:07 .
    drwxr-xr-x 7 david david 4096 Nov 29 13:57 ..
    -rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.1.aaa.out
    -rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.1.ccc.out
    -rw-r--r-- 1 david david   36 Nov 29 14:07 prefix.2.www.out
    -rw-r--r-- 1 david david   18 Nov 29 14:07 prefix.2.xxx.out
    -rw-r--r-- 1 david david   50 Nov 29 13:58 test.in
    

    with, e.g.

    $ for i in prefix*; do printf "\nfile: %s\n" "$i"; cat "$i"; done
    
    file: prefix.1.aaa.out
    Key|Value
    aaa|bbb
    
    file: prefix.1.ccc.out
    Key|Value
    ccc|ddd
    
    file: prefix.2.www.out
    Key|Value
    www|zzz
    www|ttt
    
    file: prefix.2.xxx.out
    Key|Value
    xxx|yyy
    

    A single awk command is all you need. Let me know if you have questions.


    Alternative for Ancient Sun SparcStation without indx in arr Running SunOS (5.10)

    Per your comment about the Old Sun SparcStation not implementing indx in array in the awk that is on the machine, an alternative is to build an indexed array and then scan the array to determine whether that output filename already exists and set a flag to indicate the status. This will be slightly less efficient than the indx in array approach, but shouldn't be off by much.

    Since the version of awk on SunOS (5.10) will not even response to awk --version, there is no telling how old it is. As mentioned, the last time I worked on a Sun SparkStation was somewhere in 1991-1993, great machines for the time before the internet :).

    Give the following a try:

    awk -F"|" '
      BEGIN { hdr="Key|Value"; OFS=FS } 
      NF==3 { 
        ofn = "prefix." $1 "." $2 ".out"
        exists = 0
        for (i=0; i<n; i++) {     # loop index arr checking for ofn
          if (arr[i] == ofn) {    # if already in arr
            exists = 1            # set exists flag, break
            break
          }
        }
        if (exists == 0) {        # if not in arr, output header, add to arr
          print hdr > ofn
          arr[n++] = ofn
        }
        print $2,$3 >> ofn        # output data to file
      }
    ' test.in
    

    (operation and output is the same)