Tags: unix, split, grep, csplit

Splitting a large file in two while keeping the header


I have a very large text file (ca. 1.8 TB) that I need to split at a certain entry. I know which line this entry is on, but I can also identify it via a grep command. I only care about the part of the file from this entry onward.

I saw that certain Unix commands like csplit would do just that. However, the file also has an important header (30 lines long), and it is important that the newly created file(s) also contain this header. As there's no way to prepend to a file, I'm kind of stumped about how to do this. csplit and split don't seem to offer an option to append their output to an existing file, and the file is far too large to open in a text editor.

I would appreciate any advice!


Solution

  • I tested these commands on a file with 10 million lines, and I hope you will find them useful.

    Extract the header (the first 30 lines of your file) into a separate file, header.txt:

    perl -ne 'print; exit if $. == 30' 1.8TB.txt > header.txt
    

    Now you can edit header.txt to add an empty line or two at its end, as a visual separator between the header and the rest of the file.
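If you'd rather not open even the small header file in an editor, the separator can be appended straight from the shell (note that >> creates header.txt if it doesn't exist yet):

```shell
# append two empty lines to header.txt as a visual separator
printf '\n\n' >> header.txt
```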

    Now copy your huge file, from the 5-millionth line to the end, into a new file, 0.9TB.txt. Instead of the number 5000000, enter the number of the line you want to start copying from, since you say you know it:

    perl -ne 'print if $. >= 5000000' 1.8TB.txt > 0.9TB.txt
    

    Be patient; this can take a while. You can run the top command to see what's going on, or watch the output file grow with tail -f 0.9TB.txt
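Since you said the entry can also be found with grep, the line number doesn't have to be typed in by hand, and tail -n +K (print from line K to the end) is a common alternative to the perl filter. A small sketch on a generated stand-in file — sample.txt and the pattern ENTRY are made up for the demonstration:

```shell
# stand-in for the huge file: 100 lines, with the entry of interest on line 60
seq 100 | sed '60s/.*/ENTRY/' > sample.txt

# grep -n prefixes each match with its line number; -m1 stops at the first match
start=$(grep -n -m1 'ENTRY' sample.txt | cut -d: -f1)

# tail -n +K prints from line K (inclusive) to the end of the file
tail -n +"$start" sample.txt > tail_part.txt
```

On the real file this would be start=$(grep -n -m1 'your pattern' 1.8TB.txt | cut -d: -f1) followed by tail -n +"$start" 1.8TB.txt > 0.9TB.txt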

    Now merge the header.txt and 0.9TB.txt:

    perl -ne 'print' header.txt 0.9TB.txt > header_and_0.9TB.txt
    
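Plain cat performs the same concatenation and is the conventional tool for it. A miniature demonstration with stand-in files (the demo_* names are made up here, so the real header.txt and 0.9TB.txt are left untouched):

```shell
# tiny stand-ins for header.txt and 0.9TB.txt
printf 'h1\nh2\n' > demo_header.txt
printf 'b1\nb2\n' > demo_body.txt

# cat concatenates its arguments in order, same as the perl pass-through
cat demo_header.txt demo_body.txt > demo_merged.txt
```

On the real files: cat header.txt 0.9TB.txt > header_and_0.9TB.txt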

    Let me know if this solution worked for you.

    Edit: Steps 2 and 3 can be combined into one:

    perl -ne 'print if $. >= 5000000' 1.8TB.txt >> header.txt
    mv header.txt 0.9TB.txt
    

    Edit 26.05.21: I tested this solution with split and it was orders of magnitude faster:

    If you don't have perl, use head to extract the header:

    head -n30 1.8TB.txt > header.txt
    
    # chunks of 4999999 lines, so the second chunk starts at line 5000000
    split -l 4999999 1.8TB.txt 0.9TB.txt
    

    (Note the file with the extension *.txtab, created by split)

    cat 0.9TB.txtab >> header.txt
    
    mv header.txt header_and_0.9TB.txt
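The chunk-size arithmetic is easy to get wrong: split -l N puts lines 1 through N in the first output file, so the second one begins at line N+1, and the trick only yields exactly two files when the remainder is smaller than N. Here is a miniature run of the whole recipe on a generated 100-line file (header in lines 1-3, wanted entry on line 60; all file names here are made up):

```shell
# 100-line stand-in for the 1.8TB file
seq 100 > mini.txt

# step 1: extract the header (first 3 lines)
head -n3 mini.txt > mini_header.txt

# step 2: chunks of 59 lines, so the second chunk starts at line 60;
# the remainder (41 lines) is smaller than 59, hence exactly two chunks
split -l 59 mini.txt mini_part.

# step 3: append the second chunk (suffix "ab") and rename
cat mini_part.ab >> mini_header.txt
mv mini_header.txt mini_merged.txt
```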