bash file awk sed split

How to split a file by percentage of the number of lines?


Let's say I want to split my file into three parts (60%/20%/20%). I could do this manually:

$ wc -l brown.txt 
57339 brown.txt

$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339

$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
   34398 part1.txt
   11466 part2.txt
   11475 part3.txt
   57339 total

But I'm sure there's a better way!


Solution

  • There is a utility that takes as arguments the line numbers that should become the first line of each respective new file: csplit. The script below is a wrapper that accepts only the options of the POSIX version:

    #!/bin/bash
    
    usage () {
        printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
    }
    
    # Collect csplit options
    while getopts "ksf:n:" opt; do
        case "$opt" in
            k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
            f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
            *) usage; exit 1 ;;
        esac
    done
    shift $(( OPTIND - 1 ))
    
    # Need a file name and at least one ratio
    if (( $# < 2 )); then
        usage
        exit 1
    fi
    
    fname=$1
    shift
    ratios=("$@")
    
    len=$(wc -l < "$fname")
    
    # Sum of ratios and array of cumulative ratios
    for ratio in "${ratios[@]}"; do
        (( total += ratio ))
        cumsums+=("$total")
    done
    
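    # Don't need the last element (negative index requires Bash 4.3+;
    # quoted so the shell cannot treat the brackets as a glob pattern)
    unset 'cumsums[-1]'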
    
    # Array of numbers of first line in each split file
    for sum in "${cumsums[@]}"; do
        linenums+=( $(( sum * len / total + 1 )) )
    done
    
    csplit "${args[@]}" "$fname" "${linenums[@]}"
    

    After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,

    percsplit brown.txt 60 20 20
    percsplit brown.txt 6 2 2
    percsplit brown.txt 3 1 1
    

    are all equivalent.
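
To see why these are equivalent, the line-number arithmetic of the script can be pulled out on its own. This is only a sketch: `split_points` is a hypothetical helper name that is not part of the script above, and it assumes non-negative integer ratios with a non-zero sum.

```shell
#!/bin/bash
# Hypothetical helper (not part of percsplit): given a line count and a list
# of ratios, print the first line number of every split file but the first.
split_points () {
    local len=$1 total=0 ratio i
    shift
    local -a cumsums=() points=()
    for ratio in "$@"; do
        (( total += ratio ))
        cumsums+=("$total")
    done
    # The last cumulative sum equals the total, so it is skipped
    for (( i = 0; i < ${#cumsums[@]} - 1; i++ )); do
        points+=( $(( cumsums[i] * len / total + 1 )) )
    done
    echo "${points[*]}"
}

split_points 57339 60 20 20   # 34404 45872
split_points 57339 3 1 1      # 34404 45872
```

Only the proportions enter the formula, so scaling all ratios by the same factor gives the same line numbers, up to integer truncation for ratios that do not divide the line count evenly.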

    Usage similar to the case in the question is as follows:

    $ percsplit -s -f part -n 1 brown.txt 60 20 20
    $ wc -l part*
     34403 part0
     11468 part1
     11468 part2
     57339 total
    

    Numbering starts at zero, though, and there is no .txt extension. GNU csplit supports a --suffix-format option (short form -b) that would allow for a .txt extension. It could be added to the accepted arguments, but parsing the long form would require something more elaborate than getopts.
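
If the extension matters, one way to avoid extra option parsing is to rename the pieces after the split. The following is a minimal sketch, not part of the script above: `add_extension` is a hypothetical helper, and it assumes the output files are named with the prefix followed by digits, which is how csplit names them by default.

```shell
#!/bin/bash
# Hypothetical helper: append an extension to files named <prefix><digits>.
add_extension () {
    local prefix=$1 ext=$2 f
    for f in "$prefix"[0-9]*; do
        [[ -e $f ]] || continue            # the glob matched nothing
        [[ $f == *".$ext" ]] && continue   # already has the extension
        mv -- "$f" "$f.$ext"
    done
}
```

After a run such as the one shown earlier, `add_extension part txt` would turn part0, part1 and part2 into part0.txt, part1.txt and part2.txt.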

    This solution plays nicely with very short files (for example, it splits a file of two lines into two files of one line each), and the heavy lifting is done by csplit itself.
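
The two-line case can be checked directly with csplit. The sketch below is illustrative only: `demo_short_split` and the file name two.txt are made up for the example, and it calls csplit with the line number that ratios 1 1 would produce for a two-line file (1 * 2 / 2 + 1 = 2).

```shell
#!/bin/bash
# Illustrative check: split a two-line file at the point that ratios 1 1 yield.
demo_short_split () {
    local dir f
    dir=$(mktemp -d) || return 1
    printf '%s\n' first second > "$dir/two.txt"
    (
        cd "$dir" || exit 1
        # Second file starts at line 2; -s suppresses the byte counts
        csplit -s -f part two.txt 2 || exit 1
        for f in part00 part01; do
            printf '%s: %s\n' "$f" "$(cat "$f")"
        done
    )
    rm -rf "$dir"
}

demo_short_split
```

Each output file should hold one of the two lines. The names part00 and part01 come from csplit's default of two digits when -n is not given.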