I have managed to do what I want with a series of commands, one command per line, but I know there must be a more elegant way to do it. Please tell me what your methods are. I would like to learn more sophisticated ways of processing text files.
The original file is a VCF file that looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20180307
##source=PLINKv1.90
##contig=<ID=1,length=249214117>
##contig=<ID=2,length=242842533>
##contig=<ID=3,length=197896741>
...
...
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
22 16258171 22:16258171:D:3 A . . . . GT
22 16258174 22:16258174:T:C T . . . . GT
22 16258183 22:16258183:A:T A . . . . GT
22 16258189 22:16258189:G:T G . . . . GT
My goal is to generate a file that looks like this:
22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T
22 16258211 16258211 A G
22 16258211 16258211 A T
22 16258220 16258220 T G
22 16258221 16258221 C T
22 16258224 16258224 C T
22 16258227 16258227 G A
I used the following steps to reach that goal, but they are cumbersome and ugly:
#remove comments
sed '/^[[:blank:]]*#/d;s/#.*//' chr22.vcf > no_comment_chr22.vcf
#take out the third columns for splitting
cut -d $'\t' -f 3 no_comment_chr22.vcf > no_comment_chr22.col3_to_split.txt
#Split string by delimiter and get N-th element, use as col4
cut -d':' -f3 no_comment_chr22.col3_to_split.txt > chr22_as_col4.txt
#Split string by delimiter and get N-th element, use as col5
cut -d':' -f4 no_comment_chr22.col3_to_split.txt > chr22_as_col5.txt
#get first 2 columns
cut -d $'\t' -f 1-2 no_comment_chr22.vcf > no_comment_chr22.col1to2.txt
#get the second column as col3
cut -d $'\t' -f 2 no_comment_chr22.vcf > no_comment_chr22.ascol3.txt
#Combine files column-wise
paste no_comment_chr22.col1to2.txt no_comment_chr22.ascol3.txt chr22_as_col4.txt chr22_as_col5.txt | column -s $'\t' -t > chr22_input_5cols.txt
I was able to get what I need, but this is so ugly. Please tell me what people do to advance their text-processing skills and how to improve things like this. Thank you!
Using awk:
awk -F'(:| +)' '/^#/ {next} {print $1,$2,$4,$5,$6}' sample.vcf
22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T
This specifies a regular expression as the field delimiter (-F), so each line is split on a colon or on a run of spaces. It then skips the comment lines (^#) and, for every other line, prints the corresponding fields (1, 2, 4, 5, 6).
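One caveat: the sample above is shown with spaces, but the cut -d $'\t' commands in the question suggest the real file is tab-separated, as VCF files normally are. With a delimiter of ':' or spaces only, tabs would not split the columns. The following is a minimal sketch for that case, assuming tab-delimited input and an ID column of the form CHROM:POS:A1:A2. The file names sample_tab.vcf and sample_5cols.txt are made up for illustration, and the sample is built inline only so that the snippet runs on its own.

```shell
# Build a tiny tab-separated sample (hypothetical file name, two records from the question).
printf '##fileformat=VCFv4.2\n' > sample_tab.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\n' >> sample_tab.vcf
printf '22\t16258171\t22:16258171:D:3\tA\t.\t.\t.\t.\tGT\n' >> sample_tab.vcf
printf '22\t16258174\t22:16258174:T:C\tT\t.\t.\t.\t.\tGT\n' >> sample_tab.vcf

# Split each record on tabs, then split only the ID column (field 3) on ":".
awk -F'\t' -v OFS='\t' '
    /^#/ { next }                 # skip the ## meta lines and the #CHROM header
    {
        split($3, id, ":")        # assumed layout: id[1]=chrom, id[2]=pos, id[3] and id[4]=alleles
        print $1, $2, $2, id[3], id[4]
    }
' sample_tab.vcf > sample_5cols.txt
```

Here the position is printed twice from field 2 rather than taken from the ID, which gives the same result as long as the ID really embeds the position. Splitting only field 3 on ':' also keeps colons in other columns from shifting the field numbers.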