Tags: bash, split, file-size, line-endings, csplit

Split file by context and size in bash


I have a set of large files that have to be split into 100MB parts. The problem I am running into is the fact that lines are terminated by the ^B ASCII control character (0x02, i.e. \u0002) rather than a newline.

Thus, I need to be able to get 100MB parts (plus or minus a few bytes, obviously) that also account for these line endings.

Example file:

000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B000111222333...nnn^B

The size of a "line" can vary.
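
For experimenting, a small ^B-delimited input in this shape can be generated like so (a sketch; the name `sample.dat` and the record lengths are arbitrary):

```shell
#!/bin/bash
# Build a tiny test input: five records of varying length, each
# terminated by the ^B (0x02) byte. 'sample.dat' is a placeholder name.
{
  for len in 10 25 7 40 13; do
    printf '%*s' "$len" '' | tr ' ' 'x'   # one record: $len bytes of 'x'
    printf '\x02'                         # the ^B record terminator
  done
} > sample.dat
```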

I know of split and csplit, but couldn't wrap my head around combining the two.

#!/bin/bash
split -b 100m filename                          # splitting by size
csplit filename "/$(printf '\x02')/+1" "{*}"    # splitting by context

Any suggestions on how I can get 100MB chunks that keep the lines intact? As a side note, I am not able to change the line endings to \n, because that would corrupt the file: the data between ^B delimiters may itself contain newline characters that must be preserved.


Solution

  • The following implements your splitting logic in native bash -- not very fast to execute, but it will work anywhere bash can be installed, with no third-party tools:

    #!/bin/bash
    
    prefix=${1:-"out."}                        # first optional argument: output file prefix
    max_size=${2:-$(( 1024 * 1024 * 100 ))}    # 2nd optional argument: size in bytes
    
    LC_ALL=C                                   # make ${#piece} count bytes, not multibyte characters
    cur_size=0                                 # running count: size of current chunk
    file_num=1                                 # current numeric suffix; starting at 1
    exec >"$prefix$file_num"                   # open first output file
    
    while IFS= read -r -d $'\x02' piece; do    # as long as there's new input...
      printf '%s\x02' "$piece"                 # write it to our current output file      
      cur_size=$(( cur_size + ${#piece} + 1 )) # add its length to our counter
      if (( cur_size > max_size )); then       # if our counter is over our maximum size...
        (( ++file_num ))                       # increment the file counter
        exec >"$prefix$file_num"               # open a new output file
        cur_size=0                             # and reset the output size counter
      fi
    done
    
    if [[ $piece ]]; then  # if the end of input had content without a \x02 after it...
      printf '%s' "$piece" # ...write that trailing content to our output file.
    fi
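
A quick sanity check of that loop on a tiny input, using a 20-byte limit instead of 100MB (the names `in.dat` and `part.` are illustrative):

```shell
#!/bin/bash
# Exercise the splitting loop above on three short records with a
# 20-byte limit; the parts must concatenate back to the original bytes.
printf 'aaaaaaaaaa\x02bbbbbbbbbbbbbbb\x02ccc\x02' > in.dat

prefix=part. max_size=20 cur_size=0 file_num=1
exec 3>"$prefix$file_num"                  # fd 3 stands in for stdout here
while IFS= read -r -d $'\x02' piece; do
  printf '%s\x02' "$piece" >&3
  cur_size=$(( cur_size + ${#piece} + 1 ))
  if (( cur_size > max_size )); then
    (( ++file_num ))
    exec 3>"$prefix$file_num"
    cur_size=0
  fi
done < in.dat
[[ $piece ]] && printf '%s' "$piece" >&3   # trailing content with no ^B
exec 3>&-

cat "$prefix"* | cmp -s - in.dat && echo "round-trip OK"
```

(`cat part.*` relies on lexicographic glob order, which is safe below ten parts.)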
    

    A version that relies on dd (the GNU version here; it could be adapted to be portable), which should be much faster on large inputs:

    #!/bin/bash
    
    prefix=${1:-"out."}                        # first optional argument: output file prefix
    
    file_num=1                                 # current numeric suffix; starting at 1
    exec >"$prefix$file_num"                   # open first output file
    
    while true; do
      dd bs=1M count=100 iflag=fullblock status=none   # GNU dd: copy 100MiB from stdin to stdout (fullblock avoids short reads from pipes)
      if IFS= read -r -d $'\x02' piece; then   # read in bash to the next boundary
        printf '%s\x02' "$piece"               # write that segment to stdout
        exec >"$prefix$((++file_num))"         # re-open stdout to point to the next file
      else
        [[ $piece ]] && printf '%s' "$piece"   # write what's left after the last boundary
        break                                  # and stop
      fi
    done
    
    # if our last file is empty, delete it.
    [[ -s $prefix$file_num ]] || rm -f -- "$prefix$file_num"
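
Whichever version you use, the parts should concatenate back to the original byte-for-byte, which is easy to verify. A minimal check with stand-in files (`original.dat` and the `out.N` parts are illustrative; with real data the parts come from the scripts above):

```shell
#!/bin/bash
# Demonstrate the lossless-reassembly check on stand-in files.
printf 'one\x02two\x02' > original.dat
printf 'one\x02'        > out.1
printf 'two\x02'        > out.2

# Parts must concatenate byte-for-byte back to the original.
# (Glob order is lexicographic, so this is safe below ten parts.)
cat out.* | cmp -s - original.dat && echo "parts match original"
```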