
vcf to ped format: redefine non-dbSNPs


When I convert a VCF file to PED format (with vcftools or with the VCF to PED converter from 1000G), variants that don't have a dbSNP ID get their base pair position as the ID. Example of a couple of variants:

1   rs35819278  0   23333187
1   23348003    0   23348003
1   23381893    0   23381893
1   rs18325622  0   23402111
1   rs23333532  0   23408301
1   rs55531117  0   23810772
1   23910834    0   23910834

However, I would like the variants without a dbSNP ID to get the format "chr:basepairposition". So the example above would look like:

1   rs35819278  0   23333187
1   chr1:23348003   0   23348003
1   chr1:23381893   0   23381893
1   rs18325622  0   23402111
1   rs23333532  0   23408301
1   rs55531117  0   23810772
1   chr1:23910834   0   23910834

It would be great if anyone could explain which command or script I can use to change the 2nd column for the variants without a dbSNP ID.

Thanks!


Solution

  • This can be done with sed. Since tabs are involved, the exact syntax may vary a bit depending on which sed is installed on your system; the following should work with GNU sed on Linux:

    sed 's/^\([0-9]*\)\t\([0-9]\)/\1\tchr\1:\2/' [.map filename] > [new filename]
    

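    This looks for lines starting with [number][tab][digit], and makes them start with [number][tab]chr[number]:[digit] instead, while leaving other lines unchanged. Lines whose second column is an rs ID start with [number][tab]r, so they do not match. Note that the pattern only matches numeric chromosome codes; a chromosome written as X or Y would be left as is.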

    OS X is a bit more painful: the bundled sed typically does not treat \t as a tab, so you'll need to enter a literal tab (Ctrl-V followed by Tab) or match it with [[:blank:]].
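
As an alternative that sidesteps the tab-handling differences between sed versions, the same renaming can be sketched with awk. This is only a minimal sketch, not part of the original answer: the file names example.map and example.renamed.map are hypothetical, the sample lines are taken from the question, and it assumes the .map file is tab-separated with the chromosome in column 1 and the variant ID in column 2.

```shell
# Hypothetical file names; substitute your own .map file.
in_map="example.map"
out_map="example.renamed.map"

# Small tab-separated sample (chromosome, ID, genetic distance, position),
# using lines from the question.
printf '1\trs35819278\t0\t23333187\n' >  "$in_map"
printf '1\t23348003\t0\t23348003\n'   >> "$in_map"
printf '1\trs18325622\t0\t23402111\n' >> "$in_map"

# If column 2 consists only of digits (no dbSNP ID), rewrite it as
# chr<column 1>:<column 2>; every other line is printed unchanged.
awk 'BEGIN { FS = OFS = "\t" }
     $2 ~ /^[0-9]+$/ { $2 = "chr" $1 ":" $2 }
     { print }' "$in_map" > "$out_map"

cat "$out_map"
```

Because the test is made on column 2 only, this variant also renames variants on chromosomes that are not coded as numbers, which may or may not be what you want.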