regex ruby matching

Capturing all matches of a string value from an array of regex patterns, while prioritizing closest matches


Let's say I have an array of names, along with a regex union of them:

match_array = [/Dan/i, /Danny/i, /Daniel/i]
match_values = Regexp.union(match_array)

I'm using a regex union because the actual data set I'm working with contains strings that often have extraneous characters, whitespace, and varied capitalization.

I want to iterate over a series of strings to see if they match any of the values in this array. If I use .scan, only the first matching element is returned:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["dan"]
'daniel'.scan(match_values) # => ["dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["dan"]

I want to be able to capture all of the matches (which is why I thought to use .scan instead of .match), but I want to prioritize the closest/most exact matches first. If none are found, then I'd want to default to the partial matches. So the results would look like this:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["danny","dan"]
'daniel'.scan(match_values) # => ["daniel","dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["danny","dan"]

Is this possible?


Solution

  • You can do something like this:

    match_array = [/Dan/i, /Danny/i, /Daniel/i]
    
    strings=['dan','danny','daniel','dannnniel','dannyel']
    
    p strings.
        map{|s| [s, match_array.filter{|m| s=~m}]}.to_h
    

    Prints:

    {"dan"=>[/Dan/i], 
     "danny"=>[/Dan/i, /Danny/i], 
     "daniel"=>[/Dan/i, /Daniel/i], 
     "dannnniel"=>[/Dan/i], 
     "dannyel"=>[/Dan/i, /Danny/i]}
    

    And you can convert the regexes to strings of any case if desired:

    p strings.
        map{|s| [s, match_array.filter{|m| s=~m}.
           map{|r| r.source.downcase}]}.to_h
    

    Prints:

    {"dan"=>["dan"], 
     "danny"=>["dan", "danny"], 
     "daniel"=>["dan", "daniel"], 
     "dannnniel"=>["dan"], 
     "dannyel"=>["dan", "danny"]}
    
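If you would rather report the text each pattern actually matched in the input (keeping the input's own capitalization) than the pattern's source, here is a minimal sketch. It assumes Ruby 2.7+ for filter_map, and the sample strings and the matched_text name are made up for illustration:

```ruby
match_array = [/Dan/i, /Danny/i, /Daniel/i]
strings = ['Dan', 'DANNY', 'daniel ', 'dannnniel']

# String#[] with a Regexp returns the matched substring, or nil when the
# pattern does not match; filter_map drops those nils.
matched_text = strings.to_h do |s|
  [s, match_array.filter_map { |m| s[m] }]
end

p matched_text['DANNY']     # => ["DAN", "DANNY"]
p matched_text['dannnniel'] # => ["dan"]
```

Unlike r.source.downcase, this returns what was found in the string, which matters once the patterns are more than plain literals.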

    Then, if 'closest' is equivalent to 'longest', just sort by the length of the regex source (i.e., Dan in the regex /Dan/i), longest first:

    p strings.
        map{|s| [s, match_array.filter{|m| s=~m}.
            map{|r| r.source.downcase}.
                sort_by(&:length).reverse]}.to_h 
    

    Prints:

    {"dan"=>["dan"], 
     "danny"=>["danny", "dan"], 
     "daniel"=>["daniel", "dan"], 
     "dannnniel"=>["dan"], 
     "dannyel"=>["danny", "dan"]}
    
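The same filter, convert, and sort steps can be wrapped in a small method so they can be reused per string. This is only a sketch: the method name matches_longest_first and the names variable are hypothetical, and it sorts on the negated length instead of calling reverse afterwards:

```ruby
# Hypothetical helper: returns the downcased source of every pattern that
# matches the string, longest source first.
def matches_longest_first(string, patterns)
  patterns
    .select { |pattern| pattern.match?(string) }
    .map { |pattern| pattern.source.downcase }
    .sort_by { |source| -source.length }
end

names = [/Dan/i, /Danny/i, /Daniel/i]

p matches_longest_first('dannyel', names)   # => ["danny", "dan"]
p matches_longest_first('dannnniel', names) # => ["dan"]
p matches_longest_first('bob', names)       # => []
```

An empty array signals that nothing matched, so the caller can decide what the fallback should be.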

    But that only works when every pattern is a literal string. What would you expect from "dannnniel"=~/.*/, which matches the whole string and is therefore a 'closer' match than "dannnniel"=~/Dan/i?

    Suppose by 'closest' you mean the longest substring returned by the regex match, so that something like /.*/, which matches the entire string, ranks at least as high as any other pattern. You can do:

    match_array = [/Dan/i, /Danny/i, /Daniel/i, /.{3}/, /.*/]
    
    strings=['dan','danny','daniel','dannnniel','dannyel']
    
    p strings.
        map{|s| [s, match_array.filter{|m| s=~m}.
            sort_by{|m| s[m].length}.reverse]}.to_h
    

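    This now sorts on the length of the matched text rather than the length of the regex source. Patterns whose matches have the same length are tied, and sort_by does not guarantee their relative order: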

    {"dan"=>[/.*/, /.{3}/, /Dan/i], 
     "danny"=>[/.*/, /Danny/i, /.{3}/, /Dan/i],
     "daniel"=>[/.*/, /Daniel/i, /.{3}/, /Dan/i], 
     "dannnniel"=>[/.*/, /.{3}/, /Dan/i],
     "dannyel"=>[/.*/, /Danny/i, /.{3}/, /Dan/i]}