
Extract SKU values which may be numeric or alphanumeric and must be 4 to 20 characters long


I am open to including more code than just a regular expression.

I am writing some code that takes a picture, runs a couple of Imagick filters, then a Tesseract OCR pass, to output text.

From that text, I am using a regex with PHP to extract a SKU (the model number of a product) and output the results into an array, which is then inserted into a table.

All is well, except that with the expression I'm using now:

\w[^a-z\s\/?!@#-$%^&*():;.,œ∑´®†¥¨ˆøπåß∂ƒ©˙∆˚¬Ω≈ç√∫˜µ≤≥]{4,20}

I will still get back some strings which contain ONLY letters.

The ultimate goal:

- strings that may contain uppercase letters and numbers,
- strings that contain only numbers,
- strings that do not contain only letters,
- strings that do not contain any lowercase letters,
- these strings must be between 4 and 20 characters long

as an example:

a SKU could be 5209, or it could also be WRE5472UFG5621.


Solution

  • Until the regex maestros show up, a lazy person such as myself would just do two rounds on this and keep it simple. First, match all strings that consist only of A-Z and 0-9 (rather than crafting massive no-lists or lookarounds). Then, use preg_grep() with the PREG_GREP_INVERT flag to remove all strings that are A-Z only. Finally, filter for unique matches to eliminate repeat noise.

    $str = '-9 Cycles 3 Temperature Levels Steam Sanitizet+ -Sensor Dry | ALSO AVAILABLE (PRICES MAY VARY) |- White - 1258843 - DVE45R6100W {+ Platinum - 1501 525 - DVE45R6100P desirable: 1258843 DVE45R6100W';
    
    $wanted = [];
    
    // First round: Get all A-Z, 0-9 substrings (if any)
    if(preg_match_all('~\b[A-Z0-9]{6,24}\b~', $str, $matches)) {
    
        // Second round: Filter all that are A-Z only
        $wanted = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);
    
        // And remove duplicates:
        $wanted = array_unique($wanted);
    }
    

    Result:

    array(3) {
        [2] => string(7) "1258843"
        [3] => string(11) "DVE45R6100W"
        [4] => string(11) "DVE45R6100P"
    }
    

    Note that I've increased the match length to {6,24} even though you speak of a 4-character match, since your sample string has 4-digit substrings that were not in your "desirable" list.
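
    If the 4-to-20 length from the question is what you actually need, the same two rounds can be wrapped in a small function that takes the bounds as parameters. This is only a minimal sketch: extractSkus() and its parameter names are made up for illustration and do not come from the question.

    ```php
    <?php
    // Hypothetical helper; the name and parameters are illustrative only.
    function extractSkus(string $text, int $min = 4, int $max = 20): array
    {
        // First round: runs of A-Z / 0-9 within the requested length.
        $pattern = sprintf('~\b[A-Z0-9]{%d,%d}\b~', $min, $max);
        if (!preg_match_all($pattern, $text, $matches)) {
            return [];
        }

        // Second round: drop the strings that are A-Z only.
        $wanted = preg_grep('~^[A-Z]+$~', $matches[0], PREG_GREP_INVERT);

        // Remove duplicates and reindex the result from 0.
        return array_values(array_unique($wanted));
    }

    print_r(extractSkus('SKU 5209 or WRE5472UFG5621, NOT LETTERS'));
    ```

    For the input shown, this returns 5209 and WRE5472UFG5621. Be aware that with the default 4-20 bounds, the sample string above would also yield the 4-digit fragment 1501, which is exactly why I raised the minimum.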

    Edit: I've moved the preg_match_all() into a conditional construct containing the remaining ops, and set $wanted as an empty array by default. You can conveniently both capture matches and evaluate if matched in one go (rather than e.g. have if(!empty($matches))).

    Update: Following @mickmackusa's answer with a more eloquent regex using a lookahead, I was curious about the performance of a "plain" regex with filtering vs. the use of a lookahead. Hence, a test case (only 1 iteration at 3v4l so as not to bomb them; use your own server for more!).

    The test case used 100 generated strings with potential matches, run at 5000 iterations using both approaches. Matching results returned are identical. The single-step regex with lookahead took 0.83 sec on average, while the two-step "plain" regex took 0.69 sec on average. It appears that using a lookahead is marginally more costly than the more "blunt" approach.
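
    For anyone who wants to repeat the comparison, a rough harness could look like the following. Mind the assumptions: the lookahead pattern is my own guess at a single-step equivalent and not necessarily the exact expression from @mickmackusa's answer, the function names are placeholders, and the short input string stands in for the 100 generated strings. Timings will differ per machine.

    ```php
    <?php
    // Two-step: plain match, then filter out the A-Z only strings.
    function twoStep(string $str): array
    {
        if (!preg_match_all('~\b[A-Z0-9]{6,24}\b~', $str, $m)) {
            return [];
        }
        $kept = preg_grep('~^[A-Z]+$~', $m[0], PREG_GREP_INVERT);
        return array_values(array_unique($kept));
    }

    // Single-step: the lookahead demands at least one digit in the token.
    // (Assumed pattern, for illustration only.)
    function singleStep(string $str): array
    {
        preg_match_all('~\b(?=[A-Z]*\d)[A-Z\d]{6,24}\b~', $str, $m);
        return array_values(array_unique($m[0]));
    }

    $str = 'ALSO AVAILABLE - 1258843 - DVE45R6100W - 1501 525 - DVE45R6100P 1258843';

    foreach (['twoStep', 'singleStep'] as $fn) {
        $start = hrtime(true);              // nanoseconds, PHP 7.3+
        for ($i = 0; $i < 5000; $i++) {
            $result = $fn($str);
        }
        $ms = (hrtime(true) - $start) / 1e6;
        printf("%s: %.2f ms, %d matches\n", $fn, $ms, count($result));
    }
    ```

    Both functions should return the same three matches for this input; only the elapsed time is expected to differ.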