[SOLVED] Bash: text processing command

I have managed to do what I want one command per line, but I know there must be a more elegant way. Please tell me how you would do it. I would like to learn more sophisticated ways of processing text files.

The original file is a VCF file that looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20180307
##source=PLINKv1.90
##contig=<ID=1,length=249214117>
##contig=<ID=2,length=242842533>
##contig=<ID=3,length=197896741>
...
...
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
22  16258171    22:16258171:D:3 A   .   .   .   .   GT
22  16258174    22:16258174:T:C T   .   .   .   .   GT
22  16258183    22:16258183:A:T A   .   .   .   .   GT
22  16258189    22:16258189:G:T G   .   .   .   .   GT

My goal is to generate a file that looks like this:

22  16258171  16258171  D  3
22  16258174  16258174  T  C
22  16258183  16258183  A  T
22  16258189  16258189  G  T
22  16258211  16258211  A  G
22  16258211  16258211  A  T
22  16258220  16258220  T  G
22  16258221  16258221  C  T
22  16258224  16258224  C  T
22  16258227  16258227  G  A

I took the following steps to reach that goal, but the result is cumbersome and ugly:

#remove comments
sed '/^[[:blank:]]*#/d;s/#.*//' chr22.vcf > no_comment_chr22.vcf

#take out the third columns for splitting
cut -d $'\t' -f 3 no_comment_chr22.vcf > no_comment_chr22.col3_to_split.txt

#Split string by delimiter and get N-th element, use as col4
cut -d':' -f3 no_comment_chr22.col3_to_split.txt > chr22_as_col4.txt

#Split string by delimiter and get N-th element, use as col5
cut -d':' -f4 no_comment_chr22.col3_to_split.txt > chr22_as_col5.txt

#get first 2 columns
cut -d $'\t' -f 1-2 no_comment_chr22.vcf > no_comment_chr22.col1to2.txt

#get the second column as col3 
cut -d $'\t' -f 2 no_comment_chr22.vcf > no_comment_chr22.ascol3.txt

#Combine files column-wise
paste no_comment_chr22.col1to2.txt no_comment_chr22.ascol3.txt chr22_as_col4.txt chr22_as_col5.txt | column -s $'\t' -t  > chr22_input_5cols.txt

I was able to get what I need, but this is so ugly. Please tell me what people do to advance their text-processing skills and how to improve things like this. Thank you!

Solution

Using awk:

awk -F'(:| +)' '/^#/ {next} {print $1,$2,$4,$5,$6}' sample.vcf


22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T

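This specifies a regular expression as the field delimiter (-F), so each line is split on colons as well as on runs of spaces. The ID column 22:16258171:D:3 therefore becomes fields 3 to 6. Comment lines (^#) are skipped with next, and for every other line fields 1, 2, 4, 5 and 6 are printed. Note that the pattern does not match tabs, so it assumes the space-separated layout shown above; for a tab-delimited file the delimiter would also need to match tabs, for example -F'[:\t]'.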
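
If the real file is tab-delimited, as the cut -d $'\t' calls in the question suggest, a variant that splits lines on tabs only and then breaks up the ID column with awk's split() may be more robust. The following is a minimal sketch, assuming the ID column always has four colon-separated parts as in the sample. The temporary sample file and the variable names (tmpdir, sample, out) are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a small tab-delimited sample (hypothetical file) to run against.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.vcf"
printf '##fileformat=VCFv4.2\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\n' >> "$sample"
printf '22\t16258171\t22:16258171:D:3\tA\t.\t.\t.\t.\tGT\n' >> "$sample"
printf '22\t16258174\t22:16258174:T:C\tT\t.\t.\t.\t.\tGT\n' >> "$sample"

# Split lines on tabs only, skip header lines, then split the ID column on ":".
out=$(awk -F'\t' -v OFS='\t' '
    /^#/ { next }                      # skip ## metadata and the #CHROM header
    {
        split($3, id, ":")             # id[3] and id[4] are the last two parts of the ID
        print $1, $2, $2, id[3], id[4] # POS is printed twice, as in the desired output
    }
' "$sample")

printf '%s\n' "$out"
rm -rf "$tmpdir"
```

The output is tab-separated; piping it through column -t, as in the question, would align the columns for display.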