bash, csv, sed, gnu-parallel, gnu-findutils

GNU parallel + sed to edit both CSV header and contents


I'm trying to use command line tools to edit some CSV I have in the following format for several year folders:

I want to append each file's name to its contents by adding a new column called filename whose value (for example ./year_1/csv_filename_1.csv) is written to every row. After that, I would gzip the file.

Because of the number of year folders (almost 100) and the number of CSV files in them (100k+ in total), I plan to use GNU parallel to run it.

I was trying to use sed, doing something like

fname="1929/csv_filename_1.csv" &&          \ # to simulate parallel's parameterization
    sed -E -e '1s/$/,filename/'             \ # append ",filename" to CSV header
           -e '2,\$s/$/,${fname}/' ${fname} \ # append the filename string to the content

But I can't get sed to work with the second expression: I either get "${fname}" written as-is to the file, or a sed error such as "sed: -e expression #1, char 6: unknown command: '\'" complaining about a comma or the slash. I have also tried grouping the expressions, like -e '1{s/$/,filename/};2,\${s/$/,${fname}/}', to no avail.

For now I have given up on sed and started trying awk, but not knowing why it didn't work bothers me, so I came to ask why it fails and how to make it work.

Just one more piece of info regarding how I intend to run this thing. It would be something like

find ~/dataset -iname "*csv" -print0 | parallel -0 -j0 '<the whole command here (sed + gz)>'

How could I do this? What am I forgetting? Thanks, folks!

PS: I just got it working with awk:

awk -v d="csv_filename_1.csv" -F"," 'FNR==1{a="filename"} FNR>1{a=d} {print $0","a}' csv_filename_1.csv | less

Solution

  • This might work for you (GNU parallel and sed):

    find . -type f -name '*.csv' | parallel sed -i \''1s/$/,filename/;1!s#$#,{}#'\' {}
    

    Use find to deliver the filename to the parallel command.

    Use sed to append ,filename to the header line of each file, and the file name held in {} to every other line of the file.

    N.B. The second sed command uses the alternative delimiters s#...#...# so that the slashes in the file name do not clash with the delimiter. Also, find should be run from inside the dataset directory.
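
As for why the standalone attempt in the question failed, here is a minimal sketch (not part of the original answer). Inside single quotes the shell does not expand ${fname}, so sed writes it out literally. The backslash in \$ is also passed through to sed unchanged, so sed no longer sees a plain $ (last line) address. Once the quoting is fixed, the / in the path still ends the s/.../.../ command early unless another delimiter is used. The sample directory, the sample rows and the variable names workdir and result are assumptions made only for illustration.

```shell
# Hypothetical sample data standing in for one of the real CSV files.
workdir=$(mktemp -d)
mkdir -p "$workdir/1929"
printf 'a,b\n1,2\n3,4\n' > "$workdir/1929/csv_filename_1.csv"

fname="1929/csv_filename_1.csv"

# First expression: nothing for the shell to expand, so single quotes are fine.
# Second expression: double quotes let the shell expand ${fname}; each \$
# reaches sed as a plain "$" (last-line address, then end-of-line anchor).
# "#" is the s delimiter, so the "/" in the path does not end the command.
result=$(sed -e '1s/$/,filename/' \
             -e "2,\$s#\$#,${fname}#" "$workdir/$fname")

printf '%s\n' "$result"
```

This prints the header with ,filename appended and each data row with ,1929/csv_filename_1.csv appended, leaving the file itself untouched (add -i to edit in place, as the answer does). Note that a file name containing #, & or a backslash would still need escaping, because those characters are special in the replacement part of s#...#...#.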