Tags: bash, awk, sed, cat, unix-head

Concatenating CSV files in bash preserving the header only once


Imagine I have a directory containing many subdirectories each containing some number of CSV files with the same structure (same number of columns and all containing the same header).

I am aware that I can run from the parent folder something like

find ./ -name '*.csv' -exec cat {} \; > ~/Desktop/result.csv

This works, except that the header is repeated once for each file.

I'm also aware that I can do something like sed 1d <filename> or tail -n +<N+1> <filename> to skip the first line of a file.
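To make the two header-skipping commands concrete, here is a small sketch. The sample file and the variable names are made up for this illustration only; with a one-line header, N is 1, so the tail form becomes tail -n +2.

```shell
# Hypothetical sample file, created only for this illustration.
sample=$(mktemp)
printf 'A,B,C\n1,2,3\n4,5,6\n' > "$sample"

# sed: delete line 1 and print everything else.
without_header_sed=$(sed 1d "$sample")

# tail: start output at line 2 (one header line, so +<N+1> is +2).
without_header_tail=$(tail -n +2 "$sample")

printf '%s\n' "$without_header_sed"
printf '%s\n' "$without_header_tail"
```

Both commands print the same two data rows and leave the file itself unchanged.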

But in my case, it seems a bit more specialised. I want to preserve the header once for the first file and then skip the header for every file after that.

Is anyone aware of a way to achieve this using standard Unix tools (like find, head, tail, sed, awk etc.) and bash?

For example input files

   /folder1
            /file1.csv
            /file2.csv
   /folder2
            /file1.csv

Each file has the header A,B,C and a single data row 1,2,3.

The desired output would be:

A,B,C
1,2,3
1,2,3
1,2,3

Marked As Duplicate

I feel this is different to other questions like this and this specifically because those solutions reference file1 and file2 in the solution. My question asks about a directory structure with an arbitrary number of files where I would not want to type out each file one by one.


Solution

  • You may use this find + xargs + awk:

    find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1'
    

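    The condition NR==1 || FNR>1 is true for the very first line of the combined input (NR counts lines across all files) and for every line of each file other than its first (FNR restarts at 1 for each file), so the header is printed only once.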
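As a rough check, the following sketch rebuilds the example layout from the question in a scratch directory and runs the same pipeline against it. The names workdir and outfile are placeholders chosen here, not part of the original answer, and the output is written outside the searched tree so that find cannot pick up the result file itself.

```shell
# Build the example layout from the question in a scratch directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/folder1" "$workdir/folder2"
for f in folder1/file1.csv folder1/file2.csv folder2/file1.csv; do
    printf 'A,B,C\n1,2,3\n' > "$workdir/$f"
done

# Write the result outside the searched tree.
outfile=$(mktemp)

# NR counts lines across all files; FNR restarts at 1 for each file,
# so only the first header survives and every data row is kept.
(cd "$workdir" && find . -name '*.csv' -print0 | xargs -0 awk 'NR==1 || FNR>1') > "$outfile"

cat "$outfile"
```

For this layout the result is the header followed by three data rows. One caveat: with a very large number of files, xargs may split the list across several awk invocations; NR then restarts in each invocation, and the header would appear once per batch rather than once overall.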