awk file splitting

awk command to split an 8GB file into multiple files basis number of rows with new filename and header in each file


I have an 8GB file with 26 column headers. I need to split it into multiple files of 400000 lines each, including the header, which means every output file should start with the header line as well.

I have tried multiple commands, and although the output is otherwise what I want, there is one small but weird problem.

Besides appearing on the 1st line, the header is inserted again after every 50000 lines. For example, the command below produced the file FileName_28062021_1.txt, and when I open that file I see the header on lines 1, 50001, 100001, 150001, and so on. I am not sure how to resolve it. Original command tried:

awk '
    NR==1{header=$0; count=1; print header > "FileName_28062021_" count ".txt"; next }
    !( (NR-1) % 399999){count++; print header > "FileName_28062021_" count ".txt";}
    {print $0 > "FileName_28062021_" count ".txt"}
' FileName_28062021-SourceFile.txt
    
SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021-NonSplit.txt
46646575 FileName_28062021-NonSplit.txt

Second AWK command tried

SERVERIF:/data1/tempCheckAWK $ vi tempAWK.sh
awk '
NR==1 { header = $0 }
(NR % 400000) == 1 {
close(out)
out = "FileName_28062021_" (++count) ".txt"
print header > out
}
NR>1 { print > out }
' FileName_28062021-NonSplit.txt

SERVERIF:/data1/tempCheckAWK $ sh tempAWK.sh
SERVERIF:/data1/tempCheckAWK $ ls -ltr
Jun 10 13:43 FileName_28062021-NonSplit.txt
Jun 28 23:56 tempAWK.sh
Jun 28 23:59 FileName_28062021_1.txt
Jun 28 23:59 FileName_28062021_2.txt

....

SERVERIF:/data1/tempCheckAWK $ wc -l FileName_28062021_1.txt
400000 FileName_28062021_1.txt

SERVERIF:/data1/tempCheckAWK $ grep "Transactions Id" FileName_28062021_1.txt
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code
Transactions Id|Transaction Date|Investment Id|External Code

I have tried other solutions provided on Stack Overflow. Still no luck: the header keeps repeating after every 50000 lines.


Solution

  • When I ran the command below to check how many times the header occurs in the input file, it returned many matches, as shown below. So the issue was not in the awk command but in the input file itself: the header is already repeated there after every 50000 lines.
    
    SERVERIF:/data1/tempCheckAWK $ grep -n "Transactions Id" FileName_28062021-NonSplit.txt
        1:Transactions Id|Transaction Date|Investment Id|External Code
        50001:Transactions Id|Transaction Date|Investment Id|External Code
        100001:Transactions Id|Transaction Date|Investment Id|External Code
        150001:Transactions Id|Transaction Date|Investment Id|External Code
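
Since the repeated headers come from the input, one possible workaround is to drop them while splitting. The following is only a minimal sketch, not a tested fix for the real file: it assumes every embedded header line is an exact copy of line 1, and the file names (demo_input.txt, demo_part_N.txt), the variables rows and prefix, and the tiny sample data are all made up for illustration. For the real file, rows would be 399999 so that each output file has 400000 lines including its header.

```shell
# Hypothetical demo data: a header that repeats after every 3 data lines,
# standing in for the real file where it repeats after every 50000 lines.
printf '%s\n' 'Id|Date' 1 2 3 'Id|Date' 4 5 6 'Id|Date' 7 > demo_input.txt

# rows   = data lines per output file (the header is not counted)
# prefix = illustrative output file name prefix
awk -v rows=3 -v prefix="demo_part_" '
NR == 1      { header = $0; next }   # remember the first line as the header
$0 == header { next }                # skip embedded copies of the header
(n++ % rows) == 0 {                  # every "rows" data lines, start a new file
    if (out != "") close(out)        # close the previous file to free its handle
    out = prefix (++count) ".txt"
    print header > out               # each output file starts with one header
}
{ print > out }                      # write the data line to the current file
' demo_input.txt
```

With the sample data this should produce demo_part_1.txt and demo_part_2.txt with one header plus three data lines each, and demo_part_3.txt with the header plus the last data line. Note that any genuine data line identical to the header would also be dropped, so this only fits files where that cannot happen.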