I have a CSV file like this
>cat test.in
1|aaa|bbb
1|ccc|ddd
2|xxx|yyy
2|www|zzz
2|www|ttt
and I want to split it in separate files:
prefix.FIELD1.FIELD2.out
Is there a neat way to do it in one go with awk
?
So far I've managed to have awk
create the output files but I can't make it add the header, so I just loop over the output files and add it afterwards
>cat script.sh
#!/bin/bash
FIELD_SEPARATOR="|"
OUTPUT_HEADER="Key|Value"
awk '{FS=OFS="'${FIELD_SEPARATOR}'"; print $2,$3> "prefix." $1 "." $2 ".out"}' test.in
# add the header to all the output files
echo $OUTPUT_HEADER > header
for filename in $(ls prefix.*.out 2>/dev/null); do
cat header $filename > $filename.tmp && mv $filename.tmp $filename
done
rm header
which gives the expected output
>ls prefix.*.out
prefix.1.aaa.out prefix.1.ccc.out prefix.2.www.out prefix.2.xxx.out
>cat prefix.1.aaa.out
Key|Value
aaa|bbb
>cat prefix.1.ccc.out
Key|Value
ccc|ddd
>cat prefix.2.www.out
Key|Value
www|zzz
www|ttt
>cat prefix.2.xxx.out
Key|Value
xxx|yyy
A simple way to do this in awk
is keep an array of the filenames created. If the filename isn't already in the array, output the header and then append your field-2 and field-3 as contents. A check that the number of fields is 3 helps ignore blank lines, etc.
You can write your script as:
awk -F"|" '
BEGIN { hdr="Key|Value"; OFS=FS }
NF==3 {
ofn="prefix." $1 "." $2 ".out"
if (! (ofn in arr)) {
print hdr > ofn
}
arr[ofn] = 1
print $2,$3 >> ofn
}
' test.in
Or if you like long 1-liners:
awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in
Example Use/Output
$ awk -F"|" 'BEGIN {hdr="Key|Value"; OFS=FS} NF==3 { ofn="prefix." $1 "." $2 ".out"; if (! (ofn in arr)) { print hdr > ofn } arr[ofn] = 1; print $2,$3 >> ofn }' test.in
Result:
$ l
total 28
drwxr-xr-x 2 david david 4096 Nov 29 14:07 .
drwxr-xr-x 7 david david 4096 Nov 29 13:57 ..
-rw-r--r-- 1 david david 18 Nov 29 14:07 prefix.1.aaa.out
-rw-r--r-- 1 david david 18 Nov 29 14:07 prefix.1.ccc.out
-rw-r--r-- 1 david david 36 Nov 29 14:07 prefix.2.www.out
-rw-r--r-- 1 david david 18 Nov 29 14:07 prefix.2.xxx.out
-rw-r--r-- 1 david david 50 Nov 29 13:58 test.in
with, e.g.
$ for i in prefix*; do printf "\nfile: %s\n" "$i"; cat "$i"; done
file: prefix.1.aaa.out
Key|Value
aaa|bbb
file: prefix.1.ccc.out
Key|Value
ccc|ddd
file: prefix.2.www.out
Key|Value
www|zzz
www|ttt
file: prefix.2.xxx.out
Key|Value
xxx|yyy
A single awk
command is all you need. Let me know if you have questions.
Alternative for Ancient Sun SparcStation without indx in arr
Running SunOS (5.10)
Per your comment about the Old Sun SparcStation not implementing indx in array
in the awk
that is on the machine, an alternative is to build an indexed array and then scan the array to determine whether that output filename already exists and set a flag to indicate the status. This will be slightly less efficient than the indx in array
approach, but shouldn't be off by much.
Since the version of awk
on SunOS (5.10) will not even response to awk --version
, there is no telling how old it is. As mentioned, the last time I worked on a Sun SparkStation was somewhere in 1991-1993, great machines for the time before the internet :)
.
Give the following a try:
awk -F"|" '
BEGIN { hdr="Key|Value"; OFS=FS }
NF==3 {
ofn = "prefix." $1 "." $2 ".out"
exists = 0
for (i=0; i<n; i++) { # loop index arr checking for ofn
if (arr[i] == ofn) { # if already in arr
exists = 1 # set exists flag, break
break
}
}
if (exists == 0) { # if not in arr, output header, add to arr
print hdr > ofn
arr[n++] = ofn
}
print $2,$3 >> ofn # output data to file
}
' test.in
(operation and output is the same)