bash csv awk

Split a CSV file into smaller files while keeping the header?


I have a huge CSV file with about 1 million lines. Is there a way to split it into smaller files while keeping the first line (the CSV header) in every file?

It seems split is very fast but also very limited: you cannot add a suffix such as .csv to the filenames.

split -l11000 products.csv file_

Is there an effective way to do this task in just bash? A one-line command would be great.


Solution

  • Yes, this is possible with AWK.

    The idea is to remember the header and write the remaining lines to files named like filename.00001.csv, each starting with that header:

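    awk -v l=11000 '
        # Line 1 is the CSV header: remember it and move on.
        NR == 1 { header = $0; next }
        # Lines 2, 2+l, 2+2l, ... each start a new chunk of l data lines.
        # (NR-2)%l==0 also works for l=1 and l=2, where NR%l==2 never matches.
        (NR - 2) % l == 0 {
            close(file)                                     # close the finished chunk
            file = sprintf("%s.%0.5d.csv", FILENAME, ++c)   # file.csv.00001.csv
            sub(/csv[.]/, "", file)                         # -> file.00001.csv
            print header > file                             # each chunk starts with the header
        }
        { print > file }' file.csv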
    

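    This works in the following way:

      • The first rule handles line 1, the CSV header: it is saved in a variable and next skips the remaining rules, so it is not written out as data.
      • The second rule fires on lines 2, 2+l, 2+2l, ..., that is, on the first data line of every chunk of l lines. There it closes the file it has just finished, builds the next filename with sprintf (the sub call removes the inner "csv." so the name ends up as file.00001.csv), and writes the saved header to the new file.
      • The last rule has no condition, so every data line, including the one that triggered the switch, is written to the current file.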
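
To see the behaviour on something small, here is a minimal sketch that runs the same approach on a made-up eight-line file with chunks of 3 data rows. The file name sample.csv and its contents are assumptions for illustration only. The chunk condition is written as (NR - 2) % l == 0, which selects lines 2, 2+l, 2+2l, ...

```shell
# Build a tiny sample: one header line plus seven data rows (made-up data).
printf 'id,name\n' > sample.csv
for i in 1 2 3 4 5 6 7; do
    printf '%s,item%s\n' "$i" "$i" >> sample.csv
done

# Split into chunks of 3 data rows, repeating the header in every chunk.
awk -v l=3 '
    NR == 1 { header = $0; next }
    (NR - 2) % l == 0 {
        close(file)
        file = sprintf("%s.%0.5d.csv", FILENAME, ++c)
        sub(/csv[.]/, "", file)
        print header > file
    }
    { print > file }' sample.csv

# Show what was written.
for f in sample.0*.csv; do
    echo "== $f"
    cat "$f"
done
```

With this input the command leaves sample.00001.csv and sample.00002.csv with three data rows each and sample.00003.csv with the remaining row, and all three start with id,name.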

    Note: if you don't need the output files to end in .csv, you can use this shorter version, which names them file.csv.00001, file.csv.00002, and so on:

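    awk -v m=100 '
        NR == 1 { h = $0; next }    # remember the header
        # (NR-2)%m==0 starts a new chunk on lines 2, 2+m, 2+2m, ... for any m >= 1
        (NR - 2) % m == 0 { close(f); f = sprintf("%s.%0.5d", FILENAME, ++c); print h > f }
        { print > f }' file.csv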
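
After splitting a large file it can be worth checking that nothing was lost. The following is a rough sketch of such a check for the short version, run on a made-up ten-row input. The names data.csv and rejoined.txt are hypothetical, and the check assumes no other files matching data.csv.0* exist in the directory.

```shell
# Made-up input: a header plus 10 data rows.
awk 'BEGIN { print "id,name"; for (i = 1; i <= 10; i++) print i ",item" i }' > data.csv

# Short version of the split, with 4 data rows per chunk.
awk -v m=4 '
    NR == 1 { h = $0; next }
    (NR - 2) % m == 0 { close(f); f = sprintf("%s.%0.5d", FILENAME, ++c); print h > f }
    { print > f }' data.csv

# Every chunk should start with the original header.
for f in data.csv.0*; do
    [ "$(head -n 1 "$f")" = "$(head -n 1 data.csv)" ] || echo "missing header: $f"
done

# The chunks, minus their headers, should reproduce the original data rows.
for f in data.csv.0*; do
    tail -n +2 "$f"
done > rejoined.txt
tail -n +2 data.csv | cmp -s - rejoined.txt && echo "chunks match the original"
```

The zero-padded counter keeps the shell glob in chunk order, which is what makes the concatenation comparable to the original.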