bash file unix

How can I split a large text file into smaller files with an equal number of lines?


I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).


Solution

  • Have a look at the split command:

    Help output from split (GNU coreutils) 8.32:

    $ split --help
    Usage: split [OPTION]... [FILE [PREFIX]]
    Output pieces of FILE to PREFIXaa, PREFIXab, ...;
    default size is 1000 lines, and default PREFIX is 'x'.
    
    With no FILE, or when FILE is -, read standard input.
    
    Mandatory arguments to long options are mandatory for short options too.
      -a, --suffix-length=N   generate suffixes of length N (default 2)
          --additional-suffix=SUFFIX  append an additional SUFFIX to file names
      -b, --bytes=SIZE        put SIZE bytes per output file
      -C, --line-bytes=SIZE   put at most SIZE bytes of records per output file
      -d                      use numeric suffixes starting at 0, not alphabetic
          --numeric-suffixes[=FROM]  same as -d, but allow setting the start value
      -x                      use hex suffixes starting at 0, not alphabetic
          --hex-suffixes[=FROM]  same as -x, but allow setting the start value
      -e, --elide-empty-files  do not generate empty output files with '-n'
          --filter=COMMAND    write to shell COMMAND; file name is $FILE
      -l, --lines=NUMBER      put NUMBER lines/records per output file
      -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
      -t, --separator=SEP     use SEP instead of newline as the record separator;
                                '\0' (zero) specifies the NUL character
      -u, --unbuffered        immediately copy input to output with '-n r/...'
          --verbose           print a diagnostic just before each
                                output file is opened
          --help     display this help and exit
          --version  output version information and exit
    
    The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
    Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
    Binary prefixes can be used, too: KiB=K, MiB=M, and so on.
    
    CHUNKS may be:
      N       split into N files based on size of input
      K/N     output Kth of N to stdout
      l/N     split into N files without splitting lines/records
      l/K/N   output Kth of N to stdout without splitting lines/records
      r/N     like 'l' but use round robin distribution
      r/K/N   likewise but only output Kth of N to stdout
    
    GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
    Full documentation <https://www.gnu.org/software/coreutils/split>
    or available locally via: info '(coreutils) split invocation'

    You could do something like this:

    split -l 200000 filename
    

    This creates files named xaa, xab, xac, ..., each with 200000 lines; the last file holds whatever lines remain.
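
To make the output names easier to work with, the same `split -l` call can take a prefix and numeric suffixes. The following is a minimal sketch on a tiny sample file; the `split_demo` directory, `sample.txt`, and the `part_` prefix are hypothetical names chosen for illustration, and `--additional-suffix` assumes GNU split.

```shell
# Hypothetical demo directory and sample input (10 numbered lines).
mkdir -p split_demo
seq 1 10 > split_demo/sample.txt

# 4 lines per piece; -d switches to numeric suffixes (00, 01, ...),
# --additional-suffix (GNU split only) appends an extension,
# and "split_demo/part_" is the output prefix.
split -l 4 -d --additional-suffix=.txt split_demo/sample.txt split_demo/part_

# Show the line count of each piece.
wc -l split_demo/part_*.txt
```

With 10 input lines and 4 lines per piece, this should leave three files (part_00.txt, part_01.txt, part_02.txt) holding 4, 4, and 2 lines.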

    Another option is to split by output file size; with -C, lines are still kept whole:

    split -C 20m --numeric-suffixes input_filename output_prefix
    

    This creates files named output_prefix00, output_prefix01, output_prefix02, ..., each at most 20 MiB (20*1024*1024 bytes) in size.
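
If the goal is a fixed number of pieces rather than a fixed number of lines, one approach is to count the lines first and derive the `-l` value from that. This is only a sketch under assumptions: the `chunks_demo` directory, `input.txt`, the `chunk_` prefix, and the variable names are all hypothetical.

```shell
# Hypothetical demo directory and sample input (25 numbered lines).
mkdir -p chunks_demo
seq 1 25 > chunks_demo/input.txt

pieces=4
total=$(wc -l < chunks_demo/input.txt)

# Ceiling division, so split produces at most $pieces files
# and only the last one can be shorter.
per_file=$(( (total + pieces - 1) / pieces ))

split -l "$per_file" -d chunks_demo/input.txt chunks_demo/chunk_

# Show the line count of each piece.
wc -l chunks_demo/chunk_*
```

Here 25 lines in 4 pieces gives 7 lines per file, so the pieces hold 7, 7, 7, and 4 lines. GNU split also offers `-n l/4`, which keeps lines whole but divides the input by byte size, so the line counts per file can differ.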