I have a long file with provisional SNP IDs and alleles, which looks like this:
14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG
I would like to separate the last number in each line from the letters (that is, separate the SNP ID from the alleles), so that it looks like this:
14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
I tried both awk and sed; however, the underscore keeps causing problems. For example:
sed 's/^[0-9][0-9]*/& /' File1 > File2
gave me
14 _611646T,C
14 _881226CT,C
14 _861416.1GGC,GGCGCGCGCG
Can anyone help me?
Think about what the smartest way to achieve this is.
It is better to avoid a regex that matches the whole line; instead, match only the portion that needs to change.
With sed and -E (Extended Regular Expressions):
sed -E 's/^[0-9_.]+/& /' file
14_611646 T,C
14_881226 CT,C
14_861416.1 GGC,GGCGCGCGCG
Node | Explanation |
---|---|
^ | the beginning of the string anchor |
[0-9_.]+ | any character of: '0' to '9', '_', '.' (1 or more times (matching the most amount possible)) |
In the right part of sed's substitution, & is what matched in the left part.
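To check the command without touching the real file, the sketch below feeds it the three sample lines from the question. The variable names input and result are only for illustration, and it assumes a sed that accepts -E (GNU and BSD sed both do).

```shell
# Sample lines copied from the question
input='14_611646T,C
14_881226CT,C
14_861416.1GGC,GGCGCGCGCG'

# Match the leading run of digits, underscores and dots,
# then put it back (&) followed by a single space
result=$(printf '%s\n' "$input" | sed -E 's/^[0-9_.]+/& /')

printf '%s\n' "$result"
```

Because the pattern is anchored with ^ and stops at the first character outside the bracket list, the allele part of each line is never touched.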
sed 's/[[:upper:]]/ &/' file
[[:upper:]] is a POSIX regex class meant for all upper case letters.
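This version relies on the substitution having no g flag, so only the first uppercase letter on each line gets a space in front of it. A small sketch on one of the sample lines shows the difference (the variable names line, first_only and every are made up for the example):

```shell
# One sample line from the question
line='14_861416.1GGC,GGCGCGCGCG'

# No g flag: only the first uppercase letter is prefixed with a space
first_only=$(printf '%s\n' "$line" | sed 's/[[:upper:]]/ &/')

# With the g flag: every uppercase letter is prefixed, which is not wanted here
every=$(printf '%s\n' "$line" | sed 's/[[:upper:]]/ &/g')

printf '%s\n%s\n' "$first_only" "$every"
```

It also assumes the ID part never contains an uppercase letter; if it did, the space would land inside the ID.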