How do I get the match data for all occurrences of a Ruby regular expression in a string?

I need the MatchData for each occurrence of a regular expression in a string. This differs from the scan method suggested in "Match All Occurrences of a Regex", since scan only gives me an array of strings; I need the full MatchData to get begin and end information, etc.

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

numbers.match input # #<MatchData "12"> (only the first match)
input.scan numbers  # ["12", "34", "567"] (all matches, but only the strings)

I suspect there is some method that I've overlooked. Suggestions?

Fear answered 24/7, 2011 at 2:16 Comment(3)
I want the begin and end positions for each match. But that is irrelevant to my question. MatchData exists for a reason, doesn't it? If I can get it for the first match, it follows that it would be useful for all matches.Fear
Ok, I want more than one thing, in a convenient package, for each match.Fear
You have the convenient package, as you call it, in the solution I gave below (from which you can get begin, end, or whatever match data you need). Or is there anything else you are looking for?Naphtha
N
77

You want

"abc12def34ghijklmno567pqrs".to_enum(:scan, /\d+/).map { Regexp.last_match }

which gives you

[#<MatchData "12">, #<MatchData "34">, #<MatchData "567">] 

The "trick" is, as you see, to build an enumerator in order to get each last_match.
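As a small usage sketch (the variable names `matches` and `spans` are only illustrative), the resulting MatchData objects can then be queried for the begin and end positions the question asks about:

```ruby
input = "abc12def34ghijklmno567pqrs"

# The block runs right after scan finds each match, so
# Regexp.last_match is the MatchData for that occurrence.
matches = input.to_enum(:scan, /\d+/).map { Regexp.last_match }

# Collect the matched text with its begin and (exclusive) end offsets.
spans = matches.map { |m| [m[0], m.begin(0), m.end(0)] }
spans.each { |text, first, last| puts "#{text}: #{first}...#{last}" }
```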

Naphtha answered 24/7, 2011 at 15:29 Comment(2)
This should be on apidock.com or similar. You saved me from at least 10 new grey hairs :)Schuck
It's unbelievable that there isn't a built-in method for this, that we have to resort to a hack like this.Millsap
F
9

My current solution is to add an each_match method to Regexp:

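class Regexp
  # Yields the MatchData for each non-overlapping match of self in str.
  def each_match(str)
    start = 0
    while matchdata = self.match(str, start)
      yield matchdata
      # Step past an empty match so patterns like /\d*/ cannot loop forever.
      start = matchdata.end(0) > matchdata.begin(0) ? matchdata.end(0) : matchdata.end(0) + 1
    end
  end
end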

Now I can do:

numbers.each_match input do |match|
  puts "Found #{match[0]} at #{match.begin(0)} until #{match.end(0)}"
end

Tell me there is a better way.
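One possible variation, as a rough sketch rather than an established idiom: the same loop can be wrapped in an Enumerator so that Regexp does not have to be reopened and the results can be chained with map, select, and so on. The helper name `match_enumerator` is made up for this example.

```ruby
# Hypothetical helper, not part of Ruby's core library.
# Returns an Enumerator over the MatchData of every match of regexp in str.
def match_enumerator(regexp, str)
  Enumerator.new do |yielder|
    pos = 0
    while pos <= str.length && (m = regexp.match(str, pos))
      yielder << m
      # Move past an empty match so the loop always advances.
      pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
    end
  end
end

input = "abc12def34ghijklmno567pqrs"
offsets = match_enumerator(/\d+/, input).map { |m| m.offset(0) }
```

Here `offsets` holds one `[begin, end]` pair per match, taken from `MatchData#offset`.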

Fear answered 24/7, 2011 at 2:19 Comment(5)
this should actually be appended to your original question, unless you intend it to be the answer.Conscript
Also, while matchdata = self.match(str, start) is considered a very hard to maintain construct because it is difficult to know if this is an error or intentional.Conscript
Why should it be appended to the question? It's an answer. I'm just hoping there is a better answer, which is why I didn't just accept my own. If a better answer isn't found, then eventually I will mark it as the answer.Fear
Please reread what I wrote. Append it UNLESS you intend it to be the answer. Stack Overflow prefers that information added by the original poster be appended to your original question, however answers provided by the OP can be added as an answer. stackoverflow.com/faq#howtoaskConscript
It's clean, it's easy to read and it works just fine. You could write it as an enumerator if you wish. I didn't notice your answer before writing mine. They're basically the same.Buckshee
C
9

I’ll put it here to make the code available via a search:

input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/
input.gsub(numbers) { |m| p $~ }

The result is as requested:

⇒ #<MatchData "12">
⇒ #<MatchData "34">
⇒ #<MatchData "567">

See "input.gsub(numbers) { |m| p $~ } Matching data in Ruby for all occurrences in a string" for more information.
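If the MatchData objects should be collected rather than printed, a minimal sketch along the same lines is the block form of scan, which also sets `$~` for each match and does not build a replacement string the way gsub does. The array name `matches` is only illustrative.

```ruby
input = "abc12def34ghijklmno567pqrs"
numbers = /\d+/

matches = []
# With a block, scan sets $~ to the MatchData of the current match.
input.scan(numbers) { matches << $~ }

starts = matches.map { |m| m.begin(0) }
```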

Candelaria answered 2/2, 2013 at 13:34 Comment(7)
Thanks for doing that, works perfectly, especially as I wanted to actually use gsub anyway.Oceanid
Rather than do this, use scan if all you intend to do is get the MatchData. It communicates the intention more clearly.Night
@justin, the question explicitly says that scan does not return MatchData's, but just an array of matched strings.Camisole
@Camisole it's been a while, but iirc, $~ is the MatchData for the last match, which would make my comment relevant stillNight
@Justin, technically, you are right. $~ is, indeed, the MatchData for the last match. However, there is a little trick: since gsub sets $~ anew for each match, { |m| p $~ } prints a different MatchData on each iteration. Besides, I'm not sure I understand how scan can be useful in getting MatchData objects. Can you explain this part, please?Camisole
@Camisole as a drop in replacement for gsub here. ideone.com/tRfi12Night
@Night oh! I see. Thanks, now I get what you mean.Camisole
D
4

I'm surprised nobody mentioned the amazing StringScanner class included in Ruby's standard library:

require 'strscan'

s = StringScanner.new('abc12def34ghijklmno567pqrs')

while s.skip_until(/\d+/)
  num, offset = s.matched.to_i, [s.pos - s.matched_size, s.pos - 1]

  # ..
end

No, it doesn't give you the MatchData objects, but it does give you an index-based interface into the string.
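To make the loop above concrete, here is a sketch that fills in the elided body by collecting each match with its offsets. The names `scanner` and `found` are illustrative, and note that `pos` and `matched_size` are byte-based, so the offsets equal character indices only for single-byte text like this input.

```ruby
require 'strscan'

scanner = StringScanner.new('abc12def34ghijklmno567pqrs')
found = []

while scanner.skip_until(/\d+/)
  finish = scanner.pos                    # byte offset just past the match
  start  = finish - scanner.matched_size  # byte offset where the match begins
  found << [scanner.matched, start, finish]
end
```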

Decasyllabic answered 23/11, 2017 at 1:58 Comment(0)
D
0
input = "abc12def34ghijklmno567pqrs"
n = Regexp.new("\\d+")
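# Resume each search at the end of the previous match; adding 1 would skip a character and can miss adjacent matches.
[n.match(input)].tap { |a| a << n.match(input, a.last.end(0)) until a.last.nil? }[0..-2]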

=> [#<MatchData "12">, #<MatchData "34">, #<MatchData "567">]
Dehorn answered 23/11, 2017 at 0:55 Comment(0)
