awk sed separator

Separating the last number in each line from the letters


I have a long file with provisional SNP IDs and alleles, which looks like this:

14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG

I would like to separate the last number in each line from the letters (that is, the SNP ID from the alleles), so that the file looks like this:

14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG

I tried both awk and sed, but the underscore keeps causing problems. For example:

sed 's/^[0-9][0-9]*/& /' File1 > File2

gave me

14 _611646T,C
14 _881226CT,C
14 _861416.1GGC,GGCGCGCGCG

Can anyone help me?


Solution

  • Think about the simplest way to achieve this.

    It's better to avoid a regex that matches the whole line; instead, match only the portion that needs to change.

    Using sed with -E (extended regular expressions):

    sed -E 's/^[0-9_.]+/& /' file
    

    Yields:

    14_611646 T,C
    14_881226 CT,C
    14_861416.1 GGC,GGCGCGCGCG
    

    The regular expression matches as follows:

    Node        Explanation
    ^           anchors the match at the beginning of the line
    [0-9_.]+    one or more characters from '0' to '9', '_', or '.' (greedy: matches as many as possible)

    In the replacement part of sed's substitution, & stands for whatever the pattern matched.
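Since the question also mentions awk, the same idea can be sketched there with match(), which sets RLENGTH to the length of the matched prefix. This is only an illustration, not part of the original answer: the function name split_id is made up for the example, and it assumes every SNP ID consists solely of digits, underscores, and dots.

```shell
# Hypothetical helper: reads lines on stdin and inserts a space after the
# leading run of digits, underscores, and dots (assumed to be the SNP ID).
split_id() {
  awk '
    match($0, /^[0-9_.]+/) {
      # RLENGTH is the length of the matched ID prefix
      print substr($0, 1, RLENGTH) " " substr($0, RLENGTH + 1)
      next
    }
    { print }   # lines without a leading ID pass through unchanged
  '
}

# Sample lines taken from the question
printf '%s\n' '14_611646T,C' '14_881226CT,C' '14_861416.1GGC,GGCGCGCGCG' | split_id
```

For a one-off job the sed one-liner is shorter; the awk form may be handier if more per-line processing is needed later.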

    Bonus

    sed 's/[[:upper:]]/ &/' file
    

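    [[:upper:]] is a POSIX character class that matches any uppercase letter. Without the g flag, sed replaces only the first match, so the space is inserted before the first uppercase letter on each line.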
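The two sed commands take different routes to the same split, so a quick side-by-side check on the sample lines from the question may help in choosing between them. The sketch below is an illustration under stated assumptions: the sample function and the variable names are made up for the example, and the lowercase-allele line at the end is hypothetical input, not something from the question.

```shell
# Sample lines from the question (helper name is made up for this sketch)
sample() {
  printf '%s\n' '14_611646T,C' '14_881226CT,C' '14_861416.1GGC,GGCGCGCGCG'
}

# Variant 1: add a space after the leading digits/underscores/dots
by_prefix=$(sample | sed -E 's/^[0-9_.]+/& /')

# Variant 2: add a space before the first uppercase letter
by_upper=$(sample | sed 's/[[:upper:]]/ &/')

if [ "$by_prefix" = "$by_upper" ]; then
  echo "both variants agree"
fi

# Hypothetical line with lowercase alleles: only variant 1 splits it,
# because variant 2 finds no uppercase letter to anchor on
printf '%s\n' '14_611646t,c' | sed -E 's/^[0-9_.]+/& /'
printf '%s\n' '14_611646t,c' | sed 's/[[:upper:]]/ &/'
```

On the question's data both variants give the same result; the first depends only on the shape of the ID, while the second depends on the alleles being uppercase.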