I would like to extract and split fields from BED files to generate a new BED file with the rearranged data.
I want to go from here:
1 15903 rs557514207 G G,A RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858
1 11012 rs544419019 C G RS=544419019;RSPOS=11012;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000020005150024000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP;VLD;G5;KGPhase3;CAF=0.9119,0.08806;COMMON=1
1 15903 rs557514207 G G,C RS=557514207;RSPOS=15903;dbSNPBuildID=142;SSR=0;SAO=0;VP=0x050000000005150026000200;GENEINFO=WASH7P:653635;WGT=1;VC=DIV;ASP;VLD;G5;KGPhase3;CAF=0.5589,.,0.4411;COMMON=1;TOPMED=0.30307084607543323,0.00039022680937818,0.69653892711518858
To here:
1 15903 rs557514207 G G CAF=0.5589,.
1 15903 rs557514207 G A CAF=0.5589,0.4411
1 11012 rs544419019 C G CAF=0.9119,0.08806
1 15903 rs557514207 G G CAF=0.5589,.
1 15903 rs557514207 G C CAF=0.5589,0.4411
So I want to split column 5 on commas, producing one new line per value, and to pull the CAF= entry out of column 6, keeping on each new line the CAF values that correspond to that line's column 5 value. Column 6 is a string of entries joined by semicolons. I'm interested in the part ;CAF=value1,value2; between the semicolons. In the first example line, G,A is split into two new lines, one for G and one for A, and each of them gets its own CAF= entry.
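awk -F'\t' -v OFS='\t' '
{
    # Split column 6 on three separators: everything up to "CAF=", each comma,
    # and everything from the next ";" to the end. c6[1] is empty, so the CAF
    # values start at c6[2].
    split($6, c6, /^.*CAF=|,|;.*$/)
    # Split column 5 on commas; n is the number of values
    n = split($5, c5, /,/)
    # Print one line per column 5 value: columns 1-4, that value, then the
    # first CAF value followed by the CAF value at the matching position
    for (i = 1; i <= n; i++)
        print $1, $2, $3, $4, c5[i], "CAF=" c6[2] "," c6[2+i]
}
' infile >outfile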
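
As a variation on the command above, here is a minimal sketch that finds the CAF entry by walking the semicolon-separated entries of column 6 instead of using one combined split regex. The function name split_caf and the shortened sample records are made up for illustration, and skipping records that have no CAF entry is an assumption about what you would want, not something stated in the question. It also assumes the input is tab-separated.

```shell
#!/bin/sh
# split_caf is a hypothetical helper name; it reads tab-separated records on stdin.
split_caf() {
    awk -F'\t' -v OFS='\t' '
    {
        # Find the CAF entry among the semicolon-separated entries of column 6
        caf = ""
        m = split($6, info, ";")
        for (j = 1; j <= m; j++)
            if (info[j] ~ /^CAF=/) caf = substr(info[j], 5)
        if (caf == "") next        # assumption: skip records without a CAF entry
        split(caf, freq, ",")      # freq[1] is the first CAF value
        n = split($5, alt, ",")    # one output line per column 5 value
        for (i = 1; i <= n; i++)
            print $1, $2, $3, $4, alt[i], "CAF=" freq[1] "," freq[1 + i]
    }'
}

# Two sample records with column 6 shortened for readability
printf '1\t15903\trs557514207\tG\tG,A\tRS=557514207;VC=DIV;CAF=0.5589,.,0.4411;COMMON=1\n1\t11012\trs544419019\tC\tG\tRS=544419019;CAF=0.9119,0.08806;COMMON=1\n' | split_caf
```

To run it on a real file, call it as split_caf <infile >outfile. Matching on entries that start with CAF= means commas in other entries such as TOPMED never enter the split, which may be easier to adapt if you later need a second entry from column 6.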