regex

Ensure regex does not pull from numbers in named materials


I have a regex to grab all sorts of numbers; however, I notice it also grabs numbers that immediately follow text.

For instance, using a test sentence like: The block of Nylon-12 was 1.23 by 4E-56 by -7.89, I would like to extract the 1.23, 4E-56 and -7.89. It also appears to be grabbing the -12 from the Nylon-12.

I'm fairly new to regex syntax. How should I start my expression to ensure it does not grab a number from a word? If there is a space between the text characters and the number characters, that is fine, but when there is no space, as in Nylon-12, I do not want to capture the number.

The regex I made to grab numbers is provided here:

[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?(?:0|[1-9]\d*)(?:\.\d+)?)?

Visualized with regexper, it looks like this: [image: regex visualized]

EDIT:

This seems to be related to the optional +/- sign at the start of the pattern. I tried putting a \b at the start to see the results.

If I feed Nylon12, Nylon-13, or Nylon+14 to that regex, it returns -13 and +14, but not the 12.


Solution

  • You can use a so-called "positive lookbehind assertion" to achieve this.

    In principle, you want to match numbers that are either preceded by whitespace or at the start of the string.

    In most regex dialects, you can use this syntax:

    (?<=SOMETHING)
    

    Lookbehind assertions are a little confusing because they don't consume characters or include them in the match. Rather, they assert that the text immediately before your match matches some pattern. There are positive and negative versions (the adjacent text must match or must not match), and there are lookbehind and lookahead versions (the text before or after your match is checked).

    This article does a good job of explaining them.

    Here's your same expression with the assertion added:

    (?<=^|\s)[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?(?:0|[1-9]\d*)(?:\.\d+)?)?
    

    The lookbehind pattern I used is ^|\s, which means the match must be preceded by either the beginning of the input (^) or a whitespace character (\s). Your example doesn't show it, but I assume that in a case like this:

    37 blocks of Nylon-12 was 1.23 by 4E-56 by -7.89
    

    The 37 should be returned as well. The ^ alternative handles that, since the 37 sits at the very start of the input and isn't preceded by whitespace.
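
For anyone trying this in Python, here is a minimal sketch of the same idea. One caveat: Python's built-in re module requires lookbehind patterns to have a fixed width, so as far as I know it rejects (?<=^|\s), where ^ has width 0 and \s has width 1. The sketch therefore assumes the negative lookbehind (?<!\S), meaning "not preceded by a non-whitespace character", which expresses the same condition. The names NUMBER, STANDALONE_NUMBER, and extract_numbers are made up for illustration.

```python
import re

# Number pattern from the question, unchanged.
NUMBER = r"[+\-]?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?(?:0|[1-9]\d*)(?:\.\d+)?)?"

# Assumption: (?<!\S) stands in for (?<=^|\s). It succeeds at the start of
# the input or right after whitespace, and fails after any other character.
STANDALONE_NUMBER = re.compile(r"(?<!\S)" + NUMBER)


def extract_numbers(text):
    """Return every number in text that does not directly follow a non-space character."""
    # All groups in the pattern are non-capturing, so findall returns full matches.
    return STANDALONE_NUMBER.findall(text)


print(extract_numbers("37 blocks of Nylon-12 was 1.23 by 4E-56 by -7.89"))
# ['37', '1.23', '4E-56', '-7.89']
print(extract_numbers("Nylon12 Nylon-13 Nylon+14"))
# []
```

The lookbehind only checks what comes before the number. A number immediately followed by letters (say, 12abc after a space) would still match; a trailing lookahead could be added if that matters for your data.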