bash text sed grep gff

use sed to extract two pieces of text at once from a line


OK, I've found similar answers on SO, but my sed / grep / awk fu is so poor that I couldn't quite adapt them to my task, which is this. Given the file "test.gff":

accn|CP014704   RefSeq  CDS 403 915 .   +   0   ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704   RefSeq  CDS 928 2334    .   +   0   ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704   RefSeq  CDS 31437   32681   .   +   0   ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704   RefSeq  CDS 2355    2585    .   +   0   ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein

I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to the end of the line OR a semicolon (since, as you can see, one of the lines also has a "gene=" value).

So I want something like this:

ID    product
AZ909_00020    transcriptional regulator
AZ909_00025    FAD/NAD(P)-binding oxidoreductase
AZ909_00145    gamma-glutamyl-phosphate reductase

This is as far as I got:

printf "ID\tproduct\n"

sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff

Thanks!


Solution

  • Try the following:

    sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff
    

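    Compared to your attempt, I changed how the two values are captured. Since we don't know whether the product field ends with a ; or at the end of the line, we match the longest possible run of non-; characters with [^;]*. The ID capture gets the same treatment, because a greedy (.*) does not stop at the first semicolon: it runs on to the last one that still lets the rest of the pattern match, so it would swallow the locus_tag field as well. I also added a .* at the end to match any characters left over after the product. This way the pattern covers the entire line, and the substitution rewrites it completely.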
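
    Putting the command together with the header line from the question might look like the sketch below. The function name extract_id_product is made up for illustration, and the two input lines are copied from the question with their column separators written as single spaces. A literal tab is spliced into the sed script because \t in the replacement is a GNU sed extension, and -n with the p flag is an added choice so that lines without both fields are skipped rather than printed unchanged.

    ```shell
    # Hypothetical helper name; reads GFF lines on stdin, writes "ID<TAB>product".
    extract_id_product() {
        tab=$(printf '\t')   # literal tab, so the script does not depend on GNU sed's \t
        # -n with the trailing p flag prints only the lines where the substitution matched
        sed -n "s/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1${tab}\2/p"
    }

    # Two lines from the question: one ends after product=, one continues with ;gene=
    line1='accn|CP014704 RefSeq CDS 403 915 . + 0 ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator'
    line2='accn|CP014704 RefSeq CDS 31437 32681 . + 0 ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA'

    printf 'ID\tproduct\n'
    printf '%s\n' "$line1" "$line2" | extract_id_product
    ```

    On the real file, the last line would become extract_id_product < test.gff.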

    If you want something slightly more robust, here's a perl one-liner:

    perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff
    

    This extracts the two fields separately, each with its own regular expression. Because neither match depends on the other, it also works when the fields appear in the opposite order.
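
    The same idea, matching each field on its own, can also be sketched with sed alone. The function name extract_any_order and the reordered input line are invented for illustration. Spawning two sed processes per line is slow on large files, so treat this as a demonstration of the principle rather than a replacement for the perl one-liner.

    ```shell
    # Hypothetical helper: pull ID and product out of each line independently,
    # so their relative order in the attribute column does not matter.
    extract_any_order() {
        while IFS= read -r line; do
            id=$(printf '%s\n' "$line" | sed -n 's/.*ID=\([^;]*\).*/\1/p')
            prod=$(printf '%s\n' "$line" | sed -n 's/.*product=\([^;]*\).*/\1/p')
            printf '%s\t%s\n' "$id" "$prod"
        done
    }

    # Made-up attribute string with product= placed before ID=
    reordered='product=hypothetical protein;locus_tag=AZ909_00030;ID=AZ909_00030'
    printf '%s\n' "$reordered" | extract_any_order
    # prints: AZ909_00030<TAB>hypothetical protein
    ```

    The single-expression sed command above would leave this reordered line unchanged, since its pattern requires ID= to come before product=.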