Capturing all matches of a string value from an array of regex patterns, while prioritizing closest matches
B

5

5

Let's say I have an array of names, along with a regex union of them:

match_array = [/Dan/i, /Danny/i, /Daniel/i]
match_values = Regexp.union(match_array)

I'm using a regex union because the actual data set I'm working with contains strings that often have extraneous characters, extra whitespace, and varied capitalization.

I want to iterate over a series of strings to see if they match any of the values in this array. If I use .scan, only the first matching element is returned:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["dan"]
'daniel'.scan(match_values) # => ["dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["dan"]

I want to be able to capture all of the matches (which is why I thought to use .scan instead of .match), but I want to prioritize the closest/most exact matches first. If none are found, then I'd want to default to the partial matches. So the results would look like this:

'dan'.scan(match_values) # => ["dan"]
'danny'.scan(match_values) # => ["danny","dan"]
'daniel'.scan(match_values) # => ["daniel","dan"]
'dannnniel'.scan(match_values) # => ["dan"]
'dannyel'.scan(match_values) # => ["danny","dan"]

Is this possible?

Boast answered 9/7 at 2:52 Comment(2)
What would you expect with "dannnniel"=~/.*/ which is a 'closer' match than "dannnniel"=~/Dan/i? - Contribute
What would be the desired return value if the string were "dan and daniel"? - Langer
C
2

You can do something like this:

match_array = [/Dan/i, /Danny/i, /Daniel/i]

strings=['dan','danny','daniel','dannnniel','dannyel']

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}]}.to_h

Prints:

{"dan"=>[/Dan/i], 
 "danny"=>[/Dan/i, /Danny/i], 
 "daniel"=>[/Dan/i, /Daniel/i], 
 "dannnniel"=>[/Dan/i], 
 "dannyel"=>[/Dan/i, /Danny/i]}

And you can convert the regexes to strings of any case if desired:

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
       map{|r| r.source.downcase}]}.to_h

Prints:

{"dan"=>["dan"], 
 "danny"=>["dan", "danny"], 
 "daniel"=>["dan", "daniel"], 
 "dannnniel"=>["dan"], 
 "dannyel"=>["dan", "danny"]}

Then, if 'closest' is equivalent to 'longest', just sort by the length of the regex source (i.e., Dan in the regex /Dan/i):

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
        map{|r| r.source.downcase}.
            sort_by(&:length).reverse]}.to_h 

Prints:

{"dan"=>["dan"], 
 "danny"=>["danny", "dan"], 
 "daniel"=>["daniel", "dan"], 
 "dannnniel"=>["dan"], 
 "dannyel"=>["danny", "dan"]}

But that only works when the patterns are literal strings. What would you expect with "dannnniel"=~/.*/, which is a 'closer' match than "dannnniel"=~/Dan/i?

Suppose by 'closest' you mean the longest substring returned by the regex match, so that something like /.*/, which matches the whole string, ranks at least as high as any other pattern. You can do:

match_array = [/Dan/i, /Danny/i, /Daniel/i, /.{3}/, /.*/]

strings=['dan','danny','daniel','dannnniel','dannyel']

p strings.
    map{|s| [s, match_array.filter{|m| s=~m}.
        sort_by{|m| s[m].length}.reverse]}.to_h

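Which now sorts on the length of the matched text rather than the length of the regex source (patterns with equally long matches may appear in any order, because sort_by is not guaranteed to be stable):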

{"dan"=>[/.*/, /.{3}/, /Dan/i], 
 "danny"=>[/.*/, /Danny/i, /.{3}/, /Dan/i],
 "daniel"=>[/.*/, /Daniel/i, /.{3}/, /Dan/i], 
 "dannnniel"=>[/.*/, /.{3}/, /Dan/i],
 "dannyel"=>[/.*/, /Danny/i, /.{3}/, /Dan/i]}
Contribute answered 9/7 at 15:21 Comment(1)
Note that if 'mundane' is appended to strings, the key-value pair "mundane"=>[/Dan/i] would be added to the hash. This primarily reflects the vagueness of the question. - Langer
C
3
match_array = [/Daniel/i, /Danny/i, /Dan/i]

def prioritized_scan(string, match_array)
  matches = []
  match_array.each do |pattern|
    string.scan(pattern) do |match|
      matches << match unless matches.include?(match)
    end
  end
  matches
end

p prioritized_scan('dan', match_array)
p prioritized_scan('danny', match_array)
p prioritized_scan('daniel', match_array)
p prioritized_scan('dannnniel', match_array)
p prioritized_scan('dannyel', match_array)

Output

["dan"]
["danny", "dan"]
["daniel", "dan"]
["dan"]
["danny", "dan"]
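
The same loop can be written more compactly with flat_map and uniq. This is only a sketch (the method name prioritized_scan_compact is made up here), and it assumes the patterns contain no capture groups, because scan returns arrays of groups rather than strings when they do:

```ruby
# Patterns are listed longest first so the most specific names come first.
match_array = [/Daniel/i, /Danny/i, /Dan/i]

# Hypothetical compact variant: scan with each pattern in priority order,
# concatenate the results, and drop repeated matches.
def prioritized_scan_compact(string, patterns)
  patterns.flat_map { |pattern| string.scan(pattern) }.uniq
end

p prioritized_scan_compact('dannyel', match_array)        # => ["danny", "dan"]
p prioritized_scan_compact('dan and daniel', match_array) # => ["daniel", "dan"]
```

Because every pattern scans the whole string, a name that occurs several times is still reported once, and the order of the result follows the order of the patterns rather than the position in the string.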
Crony answered 9/7 at 3:3 Comment(0)
H
3

I think you could do the following:

  1. Sort your regexes by the length of their source, longest first (unless you prefer to order them manually):

    match_array = [/Dan/i, /Danny/i, /Daniel/i]
    sorted_regexes = match_array.sort_by{|x| -x.source.length}
    
    p sorted_regexes
    

    Output:

    [/Daniel/i, /Danny/i, /Dan/i]
    
  2. Iterate over the sorted array to collect matches. Because the longest regexes are checked first, the best match comes first:

    def find_matches(string, sorted_regexes)
      sorted_regexes.reduce([]) do |acc, regex|
        match = string.match(regex)
        acc.push(match[0]) if match
        acc
      end
    end
    
    p find_matches('dan', sorted_regexes)
    p find_matches('danny', sorted_regexes)
    p find_matches('daniel', sorted_regexes)
    p find_matches('dannnniel', sorted_regexes)
    p find_matches('dannyel', sorted_regexes)
    

    Output:

    ["dan"]
    ["danny", "dan"]
    ["daniel", "dan"]
    ["dan"]
    ["danny", "dan"]
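
Sorting longest-first also explains the behaviour in the question. Regexp.union builds an alternation, and alternation tries its alternatives from left to right, so with /Dan/i listed first the union stops at "dan" and never reaches the longer names. A small sketch of the difference, assuming the same three patterns:

```ruby
match_array = [/Dan/i, /Danny/i, /Daniel/i]

# Union in the original order: /Dan/i is tried first and always wins.
shortest_first = Regexp.union(match_array)

# Union of the same patterns, longest source first.
longest_first = Regexp.union(match_array.sort_by { |regex| -regex.source.length })

p 'danny'.scan(shortest_first)          # => ["dan"]
p 'danny'.scan(longest_first)           # => ["danny"]
p 'dan and daniel'.scan(longest_first)  # => ["dan", "daniel"]
```

Unlike the per-pattern loop, the union reports only one alternative at each position, so it returns ["danny"] rather than ["danny", "dan"].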
    
    
Heyerdahl answered 9/7 at 3:22 Comment(0)
D
3

While this does not use a union or your list, I thought I would provide another option that uses a subexpression call (\g<1>) for the "root" of "dan": /(dan)?(\g<1>(?:iel|ny)?)/i

This assumes that each derivative should only appear once. For instance:

  • "dandan" will only show ["dan"] rather than ["dan","dan"]; and
  • "dandannydaniel" will be ["dan","danny","daniel"] rather than ["dan","dan","danny","dan","daniel"]

Example:

a = %w[dan
danny
daniel
dannnniel
dannyel
dandan
dandannydaniel]

a.map {|s| {s => s.scan(/(dan)?(\g<1>(?:iel|ny)?)/i).flatten.uniq} }
#=> [{"dan"=>["dan"]}, 
#    {"danny"=>["dan", "danny"]}, 
#    {"daniel"=>["dan", "daniel"]}, 
#    {"dannnniel"=>["dan"]}, 
#    {"dannyel"=>["dan", "danny"]}, 
#    {"dandan"=>["dan"]}, 
#    {"dandannydaniel"=>["dan", "danny", "daniel"]}]
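
For readers unfamiliar with \g<1>: it is a subexpression call, which runs the pattern of group 1 again, whereas a backreference such as \1 only re-matches the text that group 1 captured. A small sketch with digits (not the names from the question) shows the difference:

```ruby
# \1 must repeat the exact text captured by group 1.
backreference = /\A(\d)\1\z/

# \g<1> runs the group's pattern (\d) again, so any digit will do.
subexpression_call = /\A(\d)\g<1>\z/

p backreference.match?('11')       # => true
p backreference.match?('12')       # => false
p subexpression_call.match?('12')  # => true
```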
Delfinadelfine answered 9/7 at 19:4 Comment(1)
Thanks for the edit. I see I was inconsistent with the word boundary. I've now removed it. - Langer
L
3

You could write

rgx = /^(?=(dan))(?=(daniel|danny))?/i

Then

["dan", "danny", "daniel", "dannnniel", "dannyel", "dannyboy", "dandan"].each do |str|
  puts "#{str}: #{str.scan(rgx)}"
end

displays

dan: [["dan", nil]]
danny: [["dan", "danny"]]
daniel: [["dan", "daniel"]]
dannnniel: [["dan", nil]]
dannyel: [["dan", "danny"]]
dannyboy: [["dan", "danny"]]
dandan: [["dan", nil]]
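
Each scan result is an array holding one [short, long] pair, with nil where no longer name matched. If the flat, longest-first list from the question is wanted, the pair can be post-processed. A sketch, using some of the results displayed above as literal input rather than re-running the regex:

```ruby
# Results copied from the output shown above (a subset).
scan_results = {
  'dan'       => [['dan', nil]],
  'danny'     => [['dan', 'danny']],
  'dannnniel' => [['dan', nil]],
  'dannyboy'  => [['dan', 'danny']]
}

# Flatten the nested pairs, drop nils, and put the longer name first.
flattened = scan_results.transform_values do |groups|
  groups.flatten.compact.reverse
end

p flattened['danny']  # => ["danny", "dan"]
p flattened['dan']    # => ["dan"]
```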

Ruby demo | Regex demo

Note that, to make it self-documenting, I've expressed the regular expression at the "Regex demo" link in free-spacing mode.

Langer answered 9/7 at 20:57 Comment(0)
