I need to split a big file into smaller chunks based on the last occurrence of a pattern in the file, using a shell script. For example:
Sample.txt (the file is sorted on the third field, the one the pattern is searched on)
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
SOUTHWEST|3456|01134|GDFSG|EWRER|
"Pattern 1 = 00003 " to be searched output file must contain sample_00003.txt
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
"Pattren 2 = 00112" to be searched output file must contain sample_00112.txt
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
I used
awk -F'|' -v pattern="00003" '$3~pattern' big_file > smallfile
and grep commands, but it was very time-consuming since the file is 300+ MB in size.
Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching. It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.
ndx=0; fromRow=1
for val in '00003' '00112' '|'; do # 2 sample values to match, plus dummy value
  chunkFile="smallfile$(( ++ndx ))"
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }  # skip rows already written by earlier iterations
    {
      # p is set while inside the block of rows matching val; the first
      # non-matching row after that block is reported on stdout as the
      # next starting row, and we stop.
      if ($3 != val) { if (p) { print NR; exit } } else { p=1 }
    }
    { print > outFile }    # all other rows go to the current chunk file
  ' big_file)
done
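With the sample file above, this produces smallfile1 (the two NORTH EAST rows plus the three SOUTH rows, i.e. everything through the last 00003 row), smallfile2 (the EAST and WEST rows, through the last 00112 row), and smallfile3 (the trailing SOUTHWEST row).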
Note that the dummy value | ensures that any remaining rows after the last true value to match are saved to a chunk file too.
Note that moving all the logic into a single awk script should be much faster, because big_file would only have to be read once:
awk -F'|' -v vals='00003|00112' '
  BEGIN { split(vals, val); outFile = "smallfile" ++ndx }  # val[1]="00003", val[2]="00112"
  {
    if ($3 != val[ndx]) {
      # first row past the block matching the current value: switch chunk files
      if (p) { p=0; close(outFile); outFile = "smallfile" ++ndx }
    } else {
      p=1  # inside the block of rows matching the current value
    }
    print > outFile
  }
' big_file
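If you want the chunk files named for the searched value, as in the question (sample_00003.txt, sample_00112.txt), the same single-pass script can build each name from the value at hand. Here's a minimal sketch of that variant; sample_rest.txt is a made-up name for any rows remaining after the last value:
awk -F'|' -v vals='00003|00112' '
  BEGIN { n = split(vals, val); outFile = "sample_" val[++ndx] ".txt" }
  {
    if ($3 != val[ndx]) {
      if (p) {
        p=0; close(outFile)
        # advance to the next value, if any; sample_rest.txt is a placeholder
        outFile = (ndx < n) ? "sample_" val[++ndx] ".txt" : "sample_rest.txt"
      }
    } else {
      p=1
    }
    print > outFile
  }
' big_file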