When I convert a VCF file to PED format (with vcftools or with the VCF to PED converter from 1000G), variants that don't have a dbSNP ID get their base pair position as the ID. Example of a couple of variants:
1 rs35819278 0 23333187
1 23348003 0 23348003
1 23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 23910834 0 23910834
However, I would like the variants without a dbSNP ID to get the format "chr:basepairposition". So the example above would look like:
1 rs35819278 0 23333187
1 chr1:23348003 0 23348003
1 chr1:23381893 0 23381893
1 rs18325622 0 23402111
1 rs23333532 0 23408301
1 rs55531117 0 23810772
1 chr1:23910834 0 23910834
It would be great if anyone could explain which command or script I can use to change this 2nd column for the variants without a dbSNP ID.
Thanks!
This can be done with sed. Since tabs are involved, the exact syntax may vary a bit depending on what sed is installed on your system; the following should work for Linux:
cat [.map filename] | sed 's/^\([0-9]*\)\t\([0-9]\)/\1\tchr\1:\2/g' > [new filename]
This looks for lines starting with [number][tab][digit], and makes them start with [number][tab]chr[number]:[digit] instead, while leaving other lines unchanged.
OS X is a bit more painful: the stock sed there may not interpret \t as a tab, so you'll need to type a literal tab (ctrl-V, then Tab) or match it with [[:blank:]].
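If the tab handling in sed is a nuisance, awk is another option that should behave the same on Linux and OS X, since it splits on whitespace by itself. The following is only a sketch: the file names are made up for illustration, and it assumes the .map file has the variant ID in column 2 and that an ID consisting only of digits is one that needs the "chr" prefix. It also works for non-numeric chromosome codes such as X, which the sed pattern above would skip.

```shell
# Hypothetical file names; replace with your own .map file.
in_map="example.map"
out_map="example.fixed.map"

# Small tab-separated sample taken from the question, for demonstration only.
printf '1\trs35819278\t0\t23333187\n1\t23348003\t0\t23348003\n' > "$in_map"

# If column 2 is purely numeric, rewrite it as chr<column 1>:<column 2>.
# Lines with an rs ID are printed unchanged.
awk 'BEGIN { OFS = "\t" } $2 ~ /^[0-9]+$/ { $2 = "chr" $1 ":" $2 } { print }' "$in_map" > "$out_map"

cat "$out_map"
```

Note that awk rebuilds any modified line with tabs between the columns, so if your .map file is space-separated you may want to set OFS to a single space instead.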