I am writing regular expressions to extract dosage instructions from a pharmaceutical catalog. I am getting information from many different brands, and the formatting is not consistent even within a brand, so my expression has to be fairly lenient. The regular expressions are being implemented in Ruby (but not by me).
My regex is as follows:
/(take|chew\s|usage:|use:|intake:|dosage:?|dose:|directions:|recommendations:|adults:)\s*(.*take\s+|.*chew\s+|.*mix\s+|.*supplement,\s+)?(?<dosage_amount>\S+(\sto\s\S+)?(\sor\s\S+)?(\s\(\d+\)\s)?\b)[\s,](?<dosage_format>\S+\b(\s\([\w\-\.]+\))?)?[\s,]*?(?<dosage_frequency>[\S\s]*(daily|per day|a day|needed|morning|evening))?[\s,]?\s?(daily\s)?(?<dosage_permutation>(with|on|at|in|before|after|taken)[,\w\s\-]*)?(?=or as|\.)?/
An example of the code working correctly would be with the following description --
"Suggested use: As a dietary supplement, take 1-3 capsules daily,in divided doses, before a meal."
-- where I get dosage_amount= 1-3, dosage_format= capsule, dosage_frequency= once per day, and dosage_permutation= "in divided doses, before a meal".
However, I am getting problems with descriptions like:
"Directions: For adults, take one (1) tablet daily, preferably with a meal or follow the advice of your health care professional. Let tablets dissolve on tongue before swallowing. As a reminder, discuss the supplements and medications you take with your health care providers. "
The problem is where the word "take" is used more than once in the description. I will get dosage_amount= with, and dosage_format= your. (It is matching the second 'take', and not the first.)
Is there a way to force the regex to match only the first 'take' in the description? I have tried experimenting with making it greedy vs. non-greedy as outlined here, but I cannot get it to work.
Thank you.
Try to replace the greedy part here:
.*take
with a non-greedy version:
.*?take
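The first variant is greedy: it consumes as many characters as possible, so it runs to the end of the line and backtracks only as far as the last "take". The second is non-greedy: it consumes as few characters as possible, so it stops at the first "take". You will likely want the same change for the other alternatives in that group (.*chew, .*mix, .*supplement,), since they are greedy in the same way.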
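To see the difference in isolation, here is a minimal Ruby sketch. It uses a deliberately simplified pattern (just a "directions:" prefix, the take part, and a one-word dosage_amount capture), not your full expression, and the sample text is a shortened version of your second description. The names greedy, lazy and text are only for illustration.

```ruby
# Shortened version of the problem description from the question.
text = "Directions: For adults, take one (1) tablet daily, preferably with a meal. " \
       "As a reminder, discuss the supplements and medications you take with your health care providers."

# Simplified patterns (an assumption for illustration, not the full regex).
# The only difference between them is .* versus .*? before "take".
greedy = /directions:\s*.*take\s+(?<dosage_amount>\S+)/i
lazy   = /directions:\s*.*?take\s+(?<dosage_amount>\S+)/i

# Greedy .* runs to the end of the line and backtracks to the LAST "take".
greedy_amount = text.match(greedy)[:dosage_amount]

# Lazy .*? expands one character at a time and stops at the FIRST "take".
lazy_amount = text.match(lazy)[:dosage_amount]

puts greedy_amount  # => with
puts lazy_amount    # => one
```

Note that in Ruby the dot does not match a newline unless the /m flag is set, so both variants only look within the current line of the description.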