Ruby regex extracting words

I'm struggling to come up with a regex that splits a string into words, where a word is either a sequence of characters surrounded by whitespace or text enclosed in double quotes. I'm using String#scan.

For instance, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes by using:

/"([^\"]*)"/

but I can't figure out how to also handle the words surrounded by whitespace, so that I get 'hello', 'is', and 'Tom' without breaking up 'my name'.
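
To illustrate the problem, here is a minimal sketch of what the quote-only pattern returns with String#scan. The method name `quoted_only` is made up for this example:

```ruby
# Illustration only: the quote-only pattern picks up the quoted words
# and skips the bare ones entirely.
def quoted_only(text)
  # With one capture group, scan returns one single-element array per match.
  text.scan(/"([^"]*)"/).flatten
end

p quoted_only('   hello "my name" is    "Tom"')
# => ["my name", "Tom"]
```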

Any help with this would be appreciated!

Caisson answered 17/11, 2011 at 5:15 Comment(0)
22
result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It will print

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty strings.

Explanation

\s            # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
   (?:           # Match the regular expression below
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
   )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^"]          # Match any character that is NOT a “"”
      *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $             # Assert position at the end of a line (at the end of the string or before a line break character)
)

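Put differently, a run of whitespace is treated as a split point only when an even number of double quotes follows it, which means it is not inside a quoted section. A sketch of the same pattern in Ruby's free-spacing (`/x`) mode, where `SPLIT_RE` is a name made up for this example:

```ruby
# Sketch: the split pattern written with inline comments.
# SPLIT_RE is a hypothetical constant name, not part of the original answer.
SPLIT_RE = /
  \s+                       # a run of whitespace ...
  (?=                       # ... but only if what follows is
    (?: [^"]* " [^"]* " )*  # zero or more complete pairs of quotes
    [^"]*                   # then text without any quote
    $                       # up to the end of the line
  )
/x

p '   hello "my name" is    "Tom"'.split(SPLIT_RE)
# => ["", "hello", "\"my name\"", "is", "\"Tom\""]
```

The space inside "my name" is followed by three quotes, so the lookahead fails there and the quoted words stay together.
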
You can use reject like this to drop the empty strings:

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

which prints

=> ["hello", "\"my name\"", "is", "\"Tom\""]
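
If the surrounding quotes are not wanted in the result, one possible follow-up is to strip them after splitting. This is only a sketch; the method name `split_words` is made up here, and it assumes each quoted word carries exactly one quote at each end:

```ruby
# Sketch: split on whitespace outside quotes, drop empty strings,
# then remove one leading and one trailing double quote from each word.
def split_words(text)
  text.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
      .reject(&:empty?)
      .map { |word| word.gsub(/\A"|"\z/, '') }
end

p split_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```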
Linetta answered 17/11, 2011 at 5:27 Comment(3)
Great dissection of the regex. Very helpful. - Isolate
great solution if you need to keep the quoted words together! +1 - Program
An impressive use of regular expressions! How would you adapt this answer so as to not retain the quotes on my name and Tom? -- i.e. such that the resulting array looks like ["hello", "my name", "is", "Tom"] rather than ["hello", "\"my name\"", "is", "\"Tom\""] -- With all due respect, I believe the solution presented by @DarkCastle is better for a couple of reasons. See my comment on that answer. - Tamper
4
text = '   hello "my name" is    "Tom"'

text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}

Produces:

hello
my name
is
Tom

Explanation:

0 or more spaces followed by

either

some words within double-quotes OR

a single word

followed by 0 or more spaces
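
A variant of the same idea, as a sketch: since `\w+` only matches letters, digits, and underscores, a bare word such as `it's` would be cut at the apostrophe. Assuming a bare word is any run of non-whitespace characters, the pattern can use `\S+` instead and capture the two alternatives in separate groups. The method name `extract_words` is made up for this example:

```ruby
# Sketch: each match fills exactly one of the two capture groups,
# either the text inside quotes or a bare run of non-whitespace.
def extract_words(text)
  text.scan(/"([^"]*)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

p extract_words('   hello "my name" is    "Tom"')
# => ["hello", "my name", "is", "Tom"]
```

The quoted alternative comes first so that a word starting with a double quote is tried as a quoted word before falling back to `\S+`.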

Toandfro answered 17/11, 2011 at 5:36 Comment(3)
What the OP is asking is not possible w/o lookahead. - Ley
I had meant it for the original solution, where just a regex is used to split. Any after processing was not what I had in mind. - Ley
This solution is better (easier to read; does not need much of an explanation; and does not retain the quotes) and faster (by about one second over a million iterations) if slightly modified as such: text.scan(/\s*("([^"]+)"|\w+)\s*/).map { |match| match[1].nil? ? match[0] : match[1] } -- result: ["hello", "my name", "is", "Tom"] - Tamper
1

You can try this regex:

/\b(\w+)\b/

which uses \b to match word boundaries. The web site http://rubular.com/ is also helpful for testing regexes.
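
For comparison, a quick sketch of what this pattern returns on the string from the question. The method name `boundary_words` is made up for this example:

```ruby
# Sketch: \b(\w+)\b matches every run of word characters,
# so the quotes are ignored and "my name" comes back as two words.
def boundary_words(text)
  text.scan(/\b(\w+)\b/).flatten
end

p boundary_words('   hello "my name" is    "Tom"')
# => ["hello", "my", "name", "is", "Tom"]
```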

Jameejamel answered 30/7, 2012 at 13:44 Comment(1)
This does not work. It makes no attempt to capture the text between the quotes as a single match. - Actable
