Tags: bash, shell, split, csplit

Splitting a Big File into Smaller Chunks in Shell Scripting


I need to split a big file into smaller chunks based on the last occurrence of a pattern in the file, using a shell script. For example:

Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/> 

"Pattern 1 = 00003" is searched; the output file sample_00003.txt must contain:

NORTH EAST|0004|00001|Fost|Weaather|<br/> 
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 
SOUTH|0003|00003|Haet|Summer|<br/> 

"Pattern 2 = 00112" is searched; the output file sample_00112.txt must contain:

EAST|0007|00016|uytr|kert|<br/> 
EAST|0007|00016|uytr|kert|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 
WEST|0002|00112|WERT|fersg|<br/> 

I used

awk -F'|' -v 'pattern="00003"' '$3~pattern big_file' > smallfile

and grep commands, but both were very time-consuming, since the file is 300+ MB in size.


Solution

  • Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.
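    For reference, here is what went wrong in your attempt: the file name big_file sits inside the single-quoted awk program text (where it parses as an empty variable, so awk ends up reading standard input), and -v 'pattern="00003"' stores the literal double quotes in the variable, so $3 never matches. A fixed form of that one-liner, sketched against a two-row stand-in for big_file:

```shell
# Two-row stand-in for the question's big_file (contents are illustrative).
printf '%s\n' \
  'SOUTH|0003|00003|Haet|Summer|<br/>' \
  'EAST|0007|00016|uytr|kert|<br/>' > big_file

# Corrected quoting: the pattern value carries no embedded quotes, and
# big_file is passed as a file operand, outside the program text.
awk -F'|' -v pattern='00003' '$3 ~ pattern' big_file > smallfile
```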

    It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.

    ndx=0; fromRow=1
    for val in '00003' '00112' '|'; do  # 2 sample values to match, plus dummy value
      chunkFile="smallfile$(( ++ndx ))"
      # awk copies rows fromRow..last-occurrence-of-val to chunkFile, then prints
      # the number of the first row past that chunk as the next starting row.
      fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
        NR < fromRow { next }  # skip rows already written to earlier chunks
        { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
      ' big_file)
    done
    

    Note that the dummy value | ensures that any rows remaining after the last true match are saved to a chunk file too.
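    To sanity-check the loop, here is a self-contained run against an abbreviated version of the sample data (the smallfile<n> names are the same scheme as above; ++ndx is spelled in its portable form):

```shell
# Abbreviated version of the question's sample data.
cat > big_file <<'EOF'
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>
EOF

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 values to match, plus dummy value
  ndx=$((ndx + 1)); chunkFile="smallfile$ndx"  # POSIX spelling of smallfile$(( ++ndx ))
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }
    { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
  ' big_file)
done

# smallfile1 ends with the 00003 row, smallfile2 with the 00112 row,
# and smallfile3 holds the trailing SOUTHWEST row.
```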


    However, moving all the logic into a single awk script should be much faster, because big_file then only has to be read once:

    awk -F'|' -v vals='00003|00112' '
      BEGIN { split(vals, val); outFile="smallfile" ++ndx }  # split() uses FS ("|")
      {
        if ($3 != val[ndx]) {
          # Just past the last row matching the current value: start the next chunk.
          if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
        } else {
          p=1  # inside the run of rows whose 3rd field matches the current value
        }
        print > outFile
      }
    ' big_file
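    Run against an abbreviated version of the sample data, the single pass yields the same chunk layout (a sketch; file names as above):

```shell
# Abbreviated version of the question's sample data.
cat > big_file <<'EOF'
NORTH EAST|0004|00001|Fost|Weaather|<br/>
SOUTH|0003|00003|Haet|Summer|<br/>
EAST|0007|00016|uytr|kert|<br/>
WEST|0002|00112|WERT|fersg|<br/>
SOUTHWEST|3456|01134|GDFSG|EWRER|<br/>
EOF

# Single pass: each chunk ends at the last row whose 3rd field
# equals the current lookup value.
awk -F'|' -v vals='00003|00112' '
  BEGIN { split(vals, val); outFile="smallfile" ++ndx }
  {
    if ($3 != val[ndx]) {
      if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
    } else {
      p=1
    }
    print > outFile
  }
' big_file

wc -l smallfile1 smallfile2 smallfile3   # 2, 2 and 1 rows, respectively
```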