bash, bioinformatics, bcftools

Replace 2nd and 3rd occurrence of a character with another character, for each line, Bash


I am trying to reformat the reference legend files to make them compatible with bcftools.

Essentially, I need to go from this:

id position a0 a1 TYPE AFR AMR EAS EUR SAS ALL
1:123:A:T 123 A T SNP 0.01 0.01 0 0 0 0.01
1:679:A:T 123 A T SNP 0.01 0.01 0 0 0 0.01

to this:

id position a0 a1 TYPE AFR AMR EAS EUR SAS ALL
1:123_A_T 123 A T SNP 0.01 0.01 0 0 0 0.01
1:679_A_T 123 A T SNP 0.01 0.01 0 0 0 0.01

ideally using bash.


Solution

  • If sed is an option:

    sed 's/:/_/2; s/:/_/2' file > reformatted_file
    

    (The first s/:/_/2 replaces the second ":" on each line with an underscore. The second s/:/_/2 then replaces what was originally the third ":". By that point the original second ":" is gone, so the original third ":" is now the second one on the line, which is why both commands use the flag 2.)
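To see why the same expression is used twice, it can help to run the two substitutions one at a time. This is only an illustrative sketch on one record copied from the question; the variable names `line`, `step1`, and `step2` are made up for the demonstration.

```bash
# One sample record from the question's legend file.
line='1:123:A:T 123 A T SNP 0.01 0.01 0 0 0 0.01'

# First pass: only the second ":" on the line is replaced.
step1=$(printf '%s\n' "$line" | sed 's/:/_/2')
printf '%s\n' "$step1"    # 1:123_A:T 123 A T SNP 0.01 0.01 0 0 0 0.01

# Second pass: the former third ":" is now the second one, so it is replaced.
step2=$(printf '%s\n' "$step1" | sed 's/:/_/2')
printf '%s\n' "$step2"    # 1:123_A_T 123 A T SNP 0.01 0.01 0 0 0 0.01
```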

    Or with only bash:

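    while IFS= read -r line
    do
        # Turn every ":" on the line into "_" ...
        tmp="${line//:/_}"
        # ... then turn the first "_" back into ":".
        printf '%s\n' "${tmp/_/:}"
    done < file > reformatted_file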
    

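    (This works with your example, but it replaces every ":" on the line with an underscore and then changes the first underscore back to a ":", which can have unintended effects on other files. For example, a header that contains an underscore would have its first underscore turned into a ":", and any ":" in the other columns would also become an underscore.)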
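
If those side effects are a concern, one possible pure-bash variant is to touch only the first column. The sketch below assumes the id is the first whitespace-separated field and that everything after its first ":" should use underscores (so it would also convert a fourth or fifth ":" in the id, unlike the sed command). The function name `reformat_id_line` is hypothetical, and the sample input is copied from the question.

```bash
# Hypothetical helper: rewrite only the first field of a line, keeping its
# first ":" and turning every later ":" in that field into "_".
reformat_id_line() {
    local line=$1
    local id=${line%%[[:space:]]*}   # first field: text before the first whitespace
    local rest=${line#"$id"}         # remainder of the line, separator included
    case $id in
        *:*)
            local chrom=${id%%:*}    # text before the first ":"
            local tail=${id#*:}      # text after the first ":"
            printf '%s\n' "${chrom}:${tail//:/_}${rest}"
            ;;
        *)
            # No ":" in the first field (e.g. the header): print unchanged.
            printf '%s\n' "$line"
            ;;
    esac
}

# Demonstration on two lines from the question, fed in through a here-document.
result=$(
    while IFS= read -r line; do
        reformat_id_line "$line"
    done <<'EOF'
id position a0 a1 TYPE AFR AMR EAS EUR SAS ALL
1:123:A:T 123 A T SNP 0.01 0.01 0 0 0 0.01
EOF
)
printf '%s\n' "$result"
```

To process a real file, the here-document can be swapped for the same redirection used above, i.e. `done < file > reformatted_file`.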