I am trying to split a huge text file (~500 million lines) that is pretty regular and looks like this:
-- Start --
blah blah
-- End --
-- Start --
blah blah
-- End --
...
where ... implies a repeating pattern and "blah blah" is of variable length (roughly 2,000 lines). I want to split off the first
-- Start --
blah blah
-- End --
block into a separate file and delete it from the original file in the FASTEST possible way (in terms of runtime, since I will run this MANY times).
The ideal solution would cut the initial block from the original file and paste it into the new file without loading the tail of the huge initial file.
I attempted csplit in the following way:
csplit file.txt /End/+1
which does the job, but is not very time-efficient.
EDIT: Is there a solution if we remove the last "start-end" block from the file instead of the first one?
If you want the beginning removed from the original file, you have no choice but to read and write the whole rest of the file. Removing the end (as you suggest in your edit) can be much more efficient:
use File::ReadBackwards;
use File::Slurp 'write_file';

# Read records from the end of the file, using "-- End --\n" as the record
# separator, so the first record returned is the final start/end block.
my $fh = File::ReadBackwards->new( 'inputfile', "-- End --\n" )
    or die "couldn't read inputfile: $!\n";
my $last_chunk = $fh->readline
    or die "file was empty\n";

# tell() is now the offset where that last block begins; truncating there
# removes the block from the original file without rewriting anything else.
my $position = $fh->tell;
$fh->close;

truncate( 'inputfile', $position );
write_file( 'lastchunk', $last_chunk );
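For contrast, here is a minimal sketch of the "remove the first block" case, assuming each block ends with a "-- End --" line; the file names ('firstchunk', 'inputfile.tmp') are placeholders. It shows why that direction is slower: everything after the first block has to be copied to a new file, which then replaces the original.

use strict;
use warnings;

open my $in,    '<', 'inputfile'     or die "couldn't read inputfile: $!\n";
open my $first, '>', 'firstchunk'    or die "couldn't write firstchunk: $!\n";
open my $rest,  '>', 'inputfile.tmp' or die "couldn't write inputfile.tmp: $!\n";

# Copy lines into 'firstchunk' until the first "-- End --" line has been
# written, then stream the entire remainder into the temporary file.
my $in_first_block = 1;
while ( my $line = <$in> ) {
    if ($in_first_block) {
        print {$first} $line;
        $in_first_block = 0 if $line =~ /^-- End --/;
    }
    else {
        print {$rest} $line;
    }
}

close $_ for $in, $first, $rest;

# Replacing the original with the temporary file is what forces the full
# read/write of the tail; there is no way around that cost for this case.
rename 'inputfile.tmp', 'inputfile' or die "couldn't replace inputfile: $!\n";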