sorting grep pipe uniq

How can I use grep with a pipe to sort and deduplicate lines from a GFF file?


I am taking a fourth-year bioinformatics course. In the current assignment, the prof has given us a GFF file in which all the miRNA genes in the human genome are annotated as gene-MIR. We are supposed to use grep, along with a regular expression and other command-line tools, to generate a list of unique miRNA names in the human genome. It seems fairly straightforward and I understand how to do most of it, but I am having trouble sorting the output and removing the repeated lines. We are supposed to do all of this in a single command line.

This is the grep command I used to generate a list of gene-MIR names:

    grep -Eo "(\gene-MIR)\d*\w*" file.gff

But this only generates a huge list with multiple repeats. So I tried:

    grep -Eo "(\gene-MIR)\d*\w*" file.gff > file2 | sort < file2 | uniq -c > file3

But this did not work either. I have tried many variations of the above, but I am unsure what to do next.

Can anyone offer any help/advice?


Solution

  • You can use

    grep -o 'gene-MIR[[:alnum:]_]*' file.gff | sort -u > file3
    

    Details:
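
As a rough illustration of how the pipeline above behaves, here is a minimal sketch run against a tiny made-up input. The file names `sample.gff` and `unique_mirna.txt` and the sample lines are hypothetical and simplified (space-separated rather than tab-separated); they are not taken from the real assignment file.

```shell
# Hypothetical, simplified sample data (made-up lines, not the real GFF file).
cat > sample.gff <<'EOF'
chr1 src gene 100 180 . + . ID=gene-MIR21;Name=MIR21
chr1 src exon 100 180 . + . Parent=gene-MIR21
chr2 src gene 500 560 . - . ID=gene-MIR155;Name=MIR155
chr3 src gene 900 990 . + . ID=gene-TP53;Name=TP53
chr2 src exon 500 560 . - . Parent=gene-MIR155
EOF

# grep -o prints only the matched text, one match per line.
# The pipe feeds those matches straight into sort; -u sorts and drops duplicates.
# Only the final result is redirected to a file.
grep -o 'gene-MIR[[:alnum:]_]*' sample.gff | sort -u > unique_mirna.txt

cat unique_mirna.txt
```

On this sample the result file contains `gene-MIR155` and `gene-MIR21`, each listed once. Compared with the attempt in the question: `> file2` sends grep's output to a file, so nothing travels through the first pipe, and `sort < file2` may start reading before `file2` is completely written. `uniq -c` would also prefix each name with a count, which is probably not wanted in a plain list of names. Note too that `\d` is a Perl-style class that `grep -E` does not interpret as "digit", and that the bracket expression here does not include `-`, so a name containing a hyphen would be cut off at the hyphen unless `-` is added to the class.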