Tags: python, regex, python-re

How to extract the volume from a string using a regular expression?


I need to extract the volume with a regular expression from strings like "Candy BAR 350G" (volume = 350G),

"Gin Barrister 0.9ml" (volume = 0.9ml),

"BAXTER DRY Gin 40% 0.5 ml" (volume = 0.5 ml),

"SWEET CORN 340G/425ML GLOBUS" (volume = 340G/425ML)

I tried using '\d+\S*[gGMmLl]'

and it worked well, but then I ran into strings like "Candies 2x150G" (the volume I need is 150G, but I get 2x150G) or

"FOOD DYES 3COL.9G" (I need 9G, but I get 3COL.9G).

I don't know what else to add to the regular expression.


Solution

  • Let's start with the full code and then break it down into smaller pieces:

    import re
    
    fluids = [
        "Candy BAR 350G",
        "Gin Barrister 0.9ml",
        "BAXTER DRY Gin 40% 0.5 ml",
        "SWEET CORN 340G/425ML GLOBUS",
        "Candies 2x150G",
        "FOOD DYES 3COL.9G"
    ]
    
    pattern = r"(\d[\d.]{0,})\s?(ml|g)"
    
    for fluid in fluids:
        print(re.findall(pattern, fluid, flags=re.IGNORECASE))
    

    which produces

    [('350', 'G')]
    [('0.9', 'ml')]
    [('0.5', 'ml')]
    [('340', 'G'), ('425', 'ML')]
    [('150', 'G')]
    [('9', 'G')]
    

    Note first that we make our lives simpler by passing the regex flag re.IGNORECASE, so the pattern only has to list the units in lowercase. We also write the pattern as a raw string, r"...", so that Python passes the backslashes in \d and \s to the regex engine unchanged instead of trying to interpret them as string escape sequences.
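As a small illustration of both points, the sketch below compares a raw-string pattern with its doubled-backslash equivalent and runs the same search with and without re.IGNORECASE. The simplified pattern and the variable names are chosen for this example only and are not part of the original answer.

```python
import re

# A raw string keeps each backslash, so the regex engine receives \d and \s as written.
raw_pattern = r"\d+\s?g"
escaped_pattern = "\\d+\\s?g"  # the same pattern written with doubled backslashes

same_text = raw_pattern == escaped_pattern

# Without the flag, the lowercase "g" in the pattern cannot match the uppercase "G".
without_flag = re.findall(raw_pattern, "Candy BAR 350G")
with_flag = re.findall(raw_pattern, "Candy BAR 350G", flags=re.IGNORECASE)

print(same_text)     # True
print(without_flag)  # []
print(with_flag)     # ['350G']
```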

    Anything in a regex pattern wrapped in plain (...) parentheses, without a prefix such as ?:, ?= or ?!, is a capturing group. Capturing groups tell the regex method which parts of the match you want returned. Here they ensure that we don't capture the optional whitespace (matched by \s?) and instead grab only the quantity (\d[\d.]{0,}) and the unit (ml|g). Because the pattern contains more than one capturing group, re.findall returns each match as a tuple with one item per group.
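To make the effect of capturing groups on re.findall concrete, here is a short sketch that runs three variants of the pattern on one of the sample strings. The variants using the non-capturing form (?:...) are added here for comparison and do not appear in the original answer.

```python
import re

text = "SWEET CORN 340G/425ML GLOBUS"
flag = re.IGNORECASE

# No capturing groups: findall returns the whole matched text.
no_groups = re.findall(r"\d[\d.]{0,}\s?(?:ml|g)", text, flags=flag)

# One capturing group: findall returns only that group, as plain strings.
one_group = re.findall(r"(\d[\d.]{0,})\s?(?:ml|g)", text, flags=flag)

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"(\d[\d.]{0,})\s?(ml|g)", text, flags=flag)

print(no_groups)   # ['340G', '425ML']
print(one_group)   # ['340', '425']
print(two_groups)  # [('340', 'G'), ('425', 'ML')]
```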

    The numbers are captured by \d[\d.]{0,}, which says: look for something that starts with a digit (\d), followed by any combination of digits and full stops ([\d.]) repeated zero or more times ({0,}, which is equivalent to *).
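The quantifier {0,} means "zero or more", the same as the more common *. The sketch below checks that both spellings find the same numbers in a few of the sample strings, and also shows that the character class is fairly permissive. The "v1.2.3." input is made up for this illustration.

```python
import re

samples = ["Candies 2x150G", "FOOD DYES 3COL.9G", "Gin Barrister 0.9ml"]

# The number part of the pattern on its own, written with {0,} and with *.
with_braces = [re.findall(r"\d[\d.]{0,}", s) for s in samples]
with_star = [re.findall(r"\d[\d.]*", s) for s in samples]

print(with_braces)               # [['2', '150'], ['3', '9'], ['0.9']]
print(with_braces == with_star)  # True

# [\d.] accepts any run of digits and dots, so it does not validate the number.
loose_match = re.findall(r"\d[\d.]*", "v1.2.3.")
print(loose_match)               # ['1.2.3.']
```

If stricter number matching is needed, something like \d+(?:\.\d+)? could be substituted, at the cost of a slightly longer pattern.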

    The units are captured with ml|g, which tells the regex engine to match either ml or g in the second capture group.
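If the goal is a single string per product, in the form listed in the question (for example 340G/425ML), one possible approach is to join the full matches instead of working with the tuples. This is a minimal sketch: the helper name extract_volume and the choice of "/" as the separator are assumptions made for this example.

```python
import re

# Same pattern as above, compiled once for reuse.
VOLUME = re.compile(r"(\d[\d.]{0,})\s?(ml|g)", flags=re.IGNORECASE)


def extract_volume(name):
    """Join every matched quantity-plus-unit with "/" (hypothetical helper)."""
    # m.group(0) is the whole match, including any space before the unit.
    return "/".join(m.group(0) for m in VOLUME.finditer(name))


products = [
    "Candy BAR 350G",
    "BAXTER DRY Gin 40% 0.5 ml",
    "SWEET CORN 340G/425ML GLOBUS",
    "Candies 2x150G",
    "FOOD DYES 3COL.9G",
]

for product in products:
    print(extract_volume(product))
# 350G
# 0.5 ml
# 340G/425ML
# 150G
# 9G
```

A name with no match yields an empty string, which the caller may want to handle separately.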

    Hope this helps.