OK, I've found similar answers on SO, but my sed / grep / awk fu is so poor that I couldn't quite adapt them to my task, which is this: given the file "test.gff":
accn|CP014704 RefSeq CDS 403 915 . + 0 ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704 RefSeq CDS 928 2334 . + 0 ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704 RefSeq CDS 2355 2585 . + 0 ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein
I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to the end of the line OR a semicolon (since, as you can see, one of the lines also has a "gene=" value).
So I want something like this:
ID product
AZ909_00020 transcriptional regulator
AZ909_00025 FAD/NAD(P)-binding oxidoreductase
AZ909_00145 gamma-glutamyl-phosphate reductase
This is as far as I got:
printf "ID\tproduct\n"
sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff
Thanks!
Try the following:
sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff
Compared to your attempt, I changed the way you match the product. Since we don't know whether the field ends with a ";" or at EOL, we just match the largest possible number of non-";" characters. I also added a ".*" at the end to match any leftover characters after the product. This way, when we do the substitution, the entire line will match and we will be able to rewrite it completely.
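To put the header and the substitution together, here is a minimal sketch. It makes two assumptions that are not in the command above. First, it adds -n and a trailing p so that lines where the substitution does not match (comment lines, for example) are skipped instead of being printed unchanged. Second, it inserts a literal tab through a shell variable, since "\t" in the replacement is reliably understood by GNU sed but may not be by other seds. The function name extract_id_product and the sample file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a header, then "ID<TAB>product" for each matching line.
extract_id_product() {
    tab=$(printf '\t')            # literal tab, so we do not depend on sed's \t
    printf 'ID\tproduct\n'
    # -n suppresses default output; the p flag prints only lines that matched.
    sed -n "s/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1${tab}\2/p" "$1"
}

# Sample input: two lines from the question plus a comment line (an assumption).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# a comment line with no attributes
accn|CP014704 RefSeq CDS 403 915 . + 0 ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
EOF

extract_id_product "$sample"
rm -f "$sample"
```

Note that this still requires "ID=" to come before "product=" on the line, just like the original command.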
If you want something slightly more robust, here's a perl one-liner:
perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff
This extracts the two fields separately using regular expressions. It will work correctly, even if the fields appear in reverse order.
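If Perl is not available, the same idea (look up each key on its own, so the order does not matter) can be sketched in awk. This is an alternative I am swapping in, not part of the one-liner above. The names attr and extract_attrs are hypothetical, and the sample line with "product=" before "ID=" is invented to exercise the reversed case. Lines without an ID are skipped, which is a choice of this sketch.

```shell
#!/bin/sh
# Hypothetical helper: look up ID and product independently on each line.
extract_attrs() {
    awk '
    # Return the value of "key=" up to the next ";" (or end of line), or "".
    function attr(line, key,    rest) {
        # The key must start the line or follow ";", a space, or a tab,
        # so that "ID=" does not match inside some other attribute name.
        if (!match(line, "(^|[; \t])" key "=")) return ""
        rest = substr(line, RSTART + RLENGTH)
        sub(/;.*/, "", rest)
        return rest
    }
    BEGIN { print "ID\tproduct" }
    {
        id = attr($0, "ID")
        if (id != "") print id "\t" attr($0, "product")
    }' "$1"
}

# Sample input: one line from the question, one with the fields reversed (invented).
sample=$(mktemp)
cat > "$sample" <<'EOF'
accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704 RefSeq CDS 2355 2585 . + 0 product=hypothetical protein;ID=AZ909_00030
EOF

extract_attrs "$sample"
rm -f "$sample"
```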