
Fuzzy search / approximate string matching with standard Unix tools


I'm working with Prokka annotation files, which give me the protein product of each gene as found in the UniProt database. Unfortunately, many genes are linked to multiple, very similar product names, e.g.

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2 phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl CoA epoxidase%2C subunit A
1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

whereas the following entries are genuinely different products:

1%2C2-phenylacetyl-CoA epoxidase%2C subunit A
1%2C2-phenylacetyl-CoA epoxidase%2C subunit B
1%2C2-phenylacetyl-CoA epoxidase%2C subunit C
1%2C2-phenylacetyl-CoA epoxidase%2C subunit E

To avoid trouble when mapping my genes to their respective products, I decided to replace all ambiguous or problematic characters such as "-", " " and "/" with "@" and to convert all strings to lower case.
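As a rough sketch of that normalization (the function name `normalize` is only an illustrative choice, and the character set is limited to the three characters mentioned above), something like the following could be used:

```shell
# normalize: hypothetical helper that lower-cases its input and replaces
# "-", "/" and " " with "@". Reads from stdin, writes to stdout.
normalize() {
    tr '[:upper:]' '[:lower:]' | sed 's|[-/ ]|@|g'
}

printf '%s\n' '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' | normalize
# prints: 1%2c2@phenylacetyl@coa@epoxidase%2c@subunit@a
```

With this, the four spelling variants of subunit A collapse to the same key, while subunits B, C and E stay distinct.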

But would there be a way to search e.g. for

1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A

including possible, closely related entries, using standard Unix tools such as grep? I could not find an answer so far.


Solution

  • If you want true fuzzy search based on string distance metrics, check out tre-agrep. For your application, I would use grep with case-insensitive matching (-i) and the period (.) as a single-character wildcard:

    grep -i "1.2C2.phenylacetyl.CoA.epoxidase.2C subunit A" drugNames.txt
    

    Each period matches any single character, and -i makes grep ignore case, which is what you want. Because "subunit A" is still spelled out at the end, the entries for subunits B, C and E are not matched.
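The same idea can be automated so that the pattern does not have to be typed by hand. The following is only a sketch: `fuzzy_pattern` and `list_products` are hypothetical names, and the sample lines are taken from the question in place of a real annotation file.

```shell
# fuzzy_pattern: hypothetical helper that turns a query into a grep pattern
# by replacing every non-alphanumeric character with "." (any one character).
fuzzy_pattern() {
    printf '%s\n' "$1" | sed 's/[^[:alnum:]]/./g'
}

# list_products: stand-in for the real file; prints sample lines from the question.
list_products() {
    printf '%s\n' \
        '1%2C2-phenylacetyl-CoA epoxidase%2C subunit A' \
        '1%2C2 phenylacetyl-CoA epoxidase%2C subunit A' \
        '1%2C2-phenylacetyl CoA epoxidase%2C subunit A' \
        '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A' \
        '1%2C2-phenylacetyl-CoA epoxidase%2C subunit B'
}

pattern=$(fuzzy_pattern '1%2C2-Phenylacetyl CoA Epoxidase%2C subunit A')

# -i ignores case; the four subunit A variants match, subunit B does not.
list_products | grep -i -- "$pattern"
```

With a real file, `list_products | grep -i -- "$pattern"` would become `grep -i -- "$pattern" yourfile.txt`. Note that each period stands for exactly one character, so a variant that drops a separator entirely would not be matched by this pattern.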