I am trying to convert a .vcf file into the correct format for BayeScan. I tried PGDSpider, as recommended, but my .vcf file is too big, so I run into a memory issue.
I then found a Perl script on GitHub that may be able to convert my file even though it is really big. The script can be found here. However, it does not correctly identify the number of populations I have: it finds only 1 population, whereas I have 30.
The top of my population file looks like this, following the example format in the Perl script:
index01_barcode_10_PA-1-WW-10 pop1
index02_barcode_29_PA-5-Ferm-19 pop2
index01_barcode_17_PA-1-WW-17 pop1
index02_barcode_20_PA-5-Ferm-10 pop2
index03_barcode_16_PA-7-CA-14 pop3
I have also tried the script with a sorted population file. I have no experience with Perl, so I am struggling to work out why the script is not working.
I think it is to do with this section of the script but cannot be sure:
# read and process pop file
while (<POP>){
    chomp $_;
    @line = split /\t/, $_;
    $pops{$line[0]} = $line[1];
}
close POP;
# Get populations and sort them
my @upops = sort { $a cmp $b } uniq ( values %pops );
print "found ", scalar @upops, " populations\n";
Apologies, as I am not sure how to make this a reproducible example, but I am hoping someone could at least help me understand what this part of the code is doing and whether there is a way to adapt it. Is the problem that my individual names include _ and -?
Thank you so much for your advice and help in advance :)
Firstly, thank you to @toolic for his help and guidance :) While I was trying to create a reproducible example, the script started working, and I think the problem was how I made my populations file.
Previously I used:
paste sample_names pops | column -s $'\t' -t > pop_file.txt
to output the file printed in the question.
However, it works if I simply use:
paste sample_names pops > pop_file.txt
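The likely reason is the delimiter. column -t pads the columns with spaces, while the script splits each line on a tab (split /\t/). With no tab on a line there is no second field, so every sample ends up with the same empty population value and only 1 population is counted. Below is a minimal sketch that mimics the counting logic in awk rather than Perl; the function name count_pops and the two demo files are made up for illustration and are not part of the script.

```shell
# Count distinct population labels roughly the way the script's loop does:
# split each line on a tab and collect the unique second fields.
count_pops() {
    awk -F '\t' '{ seen[$2] = 1 } END { n = 0; for (p in seen) n++; print n }' "$1"
}

# Hypothetical demo files: the same two samples, different delimiters.
demo_dir=$(mktemp -d)
printf 'sampleA   pop1\nsampleB   pop2\n' > "$demo_dir/padded.txt"  # spaces, as column -t writes
printf 'sampleA\tpop1\nsampleB\tpop2\n'   > "$demo_dir/tabbed.txt"  # tabs, as plain paste writes

padded_count=$(count_pops "$demo_dir/padded.txt")
tabbed_count=$(count_pops "$demo_dir/tabbed.txt")
echo "space-padded: $padded_count population(s)"
echo "tab-delimited: $tabbed_count population(s)"
rm -r "$demo_dir"
```

On the space-padded file the count is 1, and on the tab-delimited file it is 2, which matches what I saw with my real data (1 instead of 30).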
Also, I now give the full path to the .vcf file instead of a path relative to the current directory.
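If a population file has already been written with space padding, it can probably be checked and repaired rather than regenerated. This is only a sketch: the function names check_pop_file and fix_pop_file and the output name pop_file.tsv are hypothetical, and it assumes that neither sample names nor population labels contain whitespace.

```shell
# Print every line that does not have exactly two tab-separated fields.
# No output means the file has the shape the script's split /\t/ expects.
check_pop_file() {
    awk -F '\t' 'NF != 2 { printf "line %d: %s\n", NR, $0 }' "$1"
}

# Rewrite a whitespace-padded file as tab-delimited.
# Assumption: names and labels contain no spaces or tabs themselves.
fix_pop_file() {
    awk '{ print $1 "\t" $2 }' "$1"
}

# Hypothetical example input, padded with spaces.
work_dir=$(mktemp -d)
printf 'sampleA   pop1\nsampleB   pop2\n' > "$work_dir/pop_file.txt"

fix_pop_file "$work_dir/pop_file.txt" > "$work_dir/pop_file.tsv"
bad_before=$(check_pop_file "$work_dir/pop_file.txt" | wc -l | tr -d ' ')
bad_after=$(check_pop_file "$work_dir/pop_file.tsv" | wc -l | tr -d ' ')
echo "malformed lines before: $bad_before, after: $bad_after"
rm -r "$work_dir"
```

Underscores and hyphens in the sample names are not a problem for either function, since only whitespace is treated as a separator.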
I hope this helps anyone who comes across this issue in the future :)