Let's say I have an array of names, along with a regex union of them:
match_array = [/Dan/i, /Danny/i, /Daniel/i]
match_values = Regexp.union(match_array)
I'm using a regex union because the actual data set I'm working with contains strings that often have extraneous characters, whitespace, and varied capitalization.
I want to iterate over a series of strings to see if they match any of the values in this array. If I use .scan, only the first matching element is returned:
'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["dan"]
'daniel'.scan(match_values) # => ["dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["dan"]
I want to be able to capture all of the matches (which is why I thought to use .scan instead of .match), but I want to prioritize the closest/most exact matches first. If none are found, then I'd want to default to the partial matches. So the results would look like this:
'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["danny","dan"]
'daniel'.scan(match_values) # => ["daniel","dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["danny","dan"]
Is this possible?
You can do something like this:
match_array = [/Dan/i, /Danny/i, /Daniel/i]
strings=['dan','danny','daniel','dannnniel','dannyel']
p strings.
map{|s| [s, match_array.filter{|m| s=~m}]}.to_h
Prints:
{"dan"=>[/Dan/i],
"danny"=>[/Dan/i, /Danny/i],
"daniel"=>[/Dan/i, /Daniel/i],
"dannnniel"=>[/Dan/i],
"dannyel"=>[/Dan/i, /Danny/i]}
And you can convert the regexes to strings of any case if desired:
p strings.
map{|s| [s, match_array.filter{|m| s=~m}.
map{|r| r.source.downcase}]}.to_h
Prints:
{"dan"=>["dan"],
"danny"=>["dan", "danny"],
"daniel"=>["dan", "daniel"],
"dannnniel"=>["dan"],
"dannyel"=>["dan", "danny"]}
Then, if 'closest' is equivalent to 'longest', just sort by the length of the regex source (i.e., Dan in the regex /Dan/i):
p strings.
map{|s| [s, match_array.filter{|m| s=~m}.
map{|r| r.source.downcase}.
sort_by(&:length).reverse]}.to_h
Prints:
{"dan"=>["dan"],
"danny"=>["danny", "dan"],
"daniel"=>["daniel", "dan"],
"dannnniel"=>["dan"],
"dannyel"=>["danny", "dan"]}
But that only works when each regex is a literal string. What would you expect from "dannnniel"=~/.*/, which is a 'closer' match than "dannnniel"=~/Dan/i?
Suppose by 'closest' you mean the longest substring returned by the regex match, so something like /.*/ (which matches the whole string) is at least as long as any other match. You can do:
match_array = [/Dan/i, /Danny/i, /Daniel/i, /.{3}/, /.*/]
strings=['dan','danny','daniel','dannnniel','dannyel']
p strings.
map{|s| [s, match_array.filter{|m| s=~m}.
sort_by{|m| s[m].length}.reverse]}.to_h
This now sorts on the length of the matched substring rather than the length of the regex source:
{"dan"=>[/.*/, /.{3}/, /Dan/i],
"danny"=>[/.*/, /Danny/i, /.{3}/, /Dan/i],
"daniel"=>[/.*/, /Daniel/i, /.{3}/, /Dan/i],
"dannnniel"=>[/.*/, /.{3}/, /Dan/i],
"dannyel"=>[/.*/, /Danny/i, /.{3}/, /Dan/i]}