I have a file with 2 space-delimited columns that looks like this:
chr1.21.imputed_info:1 100880328
chr1.31.imputed_info:1 10566215
chr1.23.imputed_info:--- 110198129
chr1.23.imputed_info:--- 114445880
chr1.24.imputed_info:--- 118141492
chr1.25.imputed_info:--- 120257110
chr1.25.imputed_info:1 121280613
chr1.30.imputed_info:--- 121287994
chr1.30.imputed_info:--- 145604302
I want to extract the number following "chr" (which ranges from 1 to 22) along with the second column. So my output would look like this:
1 100880328
1 10566215
1 110198129
1 114445880
1 118141492
1 120257110
1 121280613
1 121287994
1 145604302
A few important considerations:
The number after the dot that follows chr1, chr2, etc. can go up to 50. So you could have chr1.50, for example, or chr2.45.
The "info:" part at the end of column 1 may look like info:1, info:2, ..., info:22, or info:---
I have come up with this in Bash:
cat file.txt | sed 's/chr//g' | sed 's/.imputed_info://g'
This gets me very close, but the output looks like this:
1.211 100880328
1.31 10566215
1.23--- 110198129
1.23--- 114445880
1.24--- 118141492
1.25--- 120257110
1.251 121280613
1.25--- 121287994
1.30--- 145604302
1.301 149906413
I know there are ways to do this in R and Python, but this is a huge file, so doing it in Bash would be a great time saver. If anyone has a nice (and ideally clean) solution, it would be great; I do realise my sed command is kind of ugly. Thanks.
Shorter way:
sed 's/^chr//;s/\..* / /' filename
EDIT:
Translation: remove the leading "chr" (if it's there), and replace everything from the first '.' to the last space (that is, a '.' followed by anything, followed by ' ') with a single space.
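To see the two substitutions in action, here is a small sketch that runs the same sed command on a few sample lines. The first two lines come from the question; the third (chr22.50, info:22) is made up to exercise the "up to 50" and "info:22" cases. The function names extract_chr_pos and extract_chr_pos_awk are just illustrative, and the awk version is an alternative I am assuming you may prefer if you would rather split on fields than rely on greedy matching.

```shell
#!/bin/sh
# Sample input: two lines from the question plus one hypothetical line
# (chr22.50 / info:22) to cover the stated edge cases.
sample='chr1.21.imputed_info:1 100880328
chr1.23.imputed_info:--- 110198129
chr22.50.imputed_info:22 145604302'

# Same sed command as above, reading from stdin instead of a file.
# 1) strip a leading "chr"; 2) replace everything from the first "."
#    through the last space with a single space.
extract_chr_pos() {
    sed 's/^chr//;s/\..* / /'
}

# Alternative sketch in awk: split column 1 on ".", strip "chr" from
# the first piece, then print it with column 2.
extract_chr_pos_awk() {
    awk '{ split($1, a, "."); sub(/^chr/, "", a[1]); print a[1], $2 }'
}

result=$(printf '%s\n' "$sample" | extract_chr_pos)
result_awk=$(printf '%s\n' "$sample" | extract_chr_pos_awk)

printf '%s\n' "$result"
```

Both versions stream the file line by line, so they should be fine on a huge file. For the real data you would run the sed command directly on the file, as shown above.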