I'm trying to write a simple script to extract particular data from a VCF file, which displays variants in genome sequences.
The script needs to extract the header from the file, as well as SNVs while omitting any indels. Variants are displayed in 2 columns, the ALT and REF. Each column is separated by white space. Indels will have 2 characters in the ALT or REF, SNVs will always have 1.
What I have so far extracts the headers (which always begin with ##), but not any of the variant data.
original_file = open('/home/user/Documents/NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')
for line in original_file:
if '##' in line:
extracted_file.write(line)
# Extract SNVs while omitting indels
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively
for line in original_file:
ref = str.split()[3]
alt = str.split()[4]
if len(ref) == 1 and len(alt) == 1:
extracted_file.write(line)
original_file.close()
extracted_file.close()
original_file = open('NA12878.vcf', 'r')
extracted_file = open('NA12878_SNV.txt', 'w+')
i=0
for line in original_file:
if '##' in line:
extracted_file.write(line)
else:
ref = line.split(' ')[3]
alt = line.split(' ')[4]
if len(ref) == 1 and len(alt) == 1:
extracted_file.write(line)
# Extract SNVs while omitting indels
# Indels will have multiple entries in the REF or ALT column
# The ALT and REF columns appear at position 4 & 5 respectively
original_file.close()
extracted_file.close()
There Was two issues:
So i set the code to execute with only one loop and placed tab as the split parameter.