bioinformaticsfastadna-sequence

Compare hg38 fasta to primate common ancestor fasta and get vcf


I'm trying to get human-specific snp and I download hg38 fasta and ensembl primate common ancestor fasta (homo_sapiens_ancestor_GRCh38.tar.gz). Is there a tool to compare them and output a vcf that contains the difference between hg38 and primate common ancestor? Thanks for your help.


Solution

  • So what you'll need is to align the primate genome to the hg38 genome using some sort of aligner. Then with this aligned bam (or sam) file you can run mpileup using samtools, passing that to bcftools to call variants (generate the vcf file).

    Usually this is done with sequencing data, but perhaps you can emulate that by splitting up the primate fasta into 'reads'. This can be done in a variety of languages, which I won't get into right now. Then the 'reads' file (either fast or fastq [fastq has quality scores from sequencing]) can be aligned to the hg38 genome via bowtie2 or bwa-mem, then the resultant sam or bam (binary version of sam) can be run through mpileup into bcftools to determine the variants and output a vcf. Here's a helpful website. http://samtools.sourceforge.net/mpileup.shtml There's a lot to unpack about this process so please comment more about it if you need more guidance.