How can I fix my regex to not match too much with a greedy quantifier? [duplicate]
S

6

4

I have the following line:

"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"

I parse this by using a simple regexp:

if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}

But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?

Structuralism answered 1/11, 2008 at 17:35 Comment(0)
B
18

The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.

What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:

(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)
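To see the two behaviours side by side, here is a minimal sketch that runs the original greedy pattern and the non-greedy one against the line from the question. The variable names `@greedy` and `@lazy` are only illustrative.

```perl
use strict;
use warnings;

my $line = '14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)';

# Greedy: each (.*) takes as much as it can, so the separators end up
# being the LAST three semicolons in the line.
my @greedy = $line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/;

# Lazy: each (.*?) takes as little as it can, so the separators are
# the FIRST three semicolons after "say;".
my @lazy = $line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/;

print join(',', map { "[$_]" } @greedy), "\n";
print join(',', map { "[$_]" } @lazy), "\n";
```

The first line printed should show the hash and pid fused into one capture and a lone ")" as the last capture; the second should show the five fields as intended, with "hello ;)" intact.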
Barbecue answered 1/11, 2008 at 17:44 Comment(5)
That's great! Can you quickly tell me the difference between .*? and .* here? Thanks! :) - Structuralism
The difference is that .*? stops at the first place where the rest of the pattern can match, whereas .* stops at the last such place. - Airtoair
Ah, great, folks! Appreciate it! :-) - Structuralism
The ? modifies the * operator to make it non-greedy. You can also use ? with + to make it non-greedy. - Barbecue
Very good general-case answer, but, for this specific question, I would favor [^;]* over .*? because the boundary which terminates the match is a single character. There are cases where .*? is what you need, but I find it best to avoid .* entirely whenever possible. - Chaing
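The last comment's preference for [^;]* can be illustrated with a small sketch. The input string and the anchored patterns below are made up for the demonstration and are not from the question: a lazy dot will still cross a semicolon when that is the only way for the overall pattern to match, whereas a negated character class cannot.

```perl
use strict;
use warnings;

# Hypothetical line with one field more than expected before the number.
my $input = 'say;alpha;beta;42';

# Lazy dot: (.*?) prefers the shortest match, but it will still expand
# across a semicolon if that is the only way the whole pattern can match.
my ($lazy) = $input =~ /^say;(.*?);(\d+)$/;

# Negated class: ([^;]*) can never cross a semicolon, so the pattern
# fails instead of silently capturing two fields as one.
my ($strict) = $input =~ /^say;([^;]*);(\d+)$/;

print 'lazy:   ', (defined($lazy)   ? "[$lazy]"   : 'no match'), "\n";
print 'strict: ', (defined($strict) ? "[$strict]" : 'no match'), "\n";
```

Here the lazy version should report "[alpha;beta]" while the negated-class version reports no match, which is usually the more useful outcome for malformed input.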
I
7
(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)

should work better

Ingrown answered 1/11, 2008 at 17:40 Comment(2)
I think you have an extra ([^;]*); group. The last part is a comment with a smiley: "Hello ;)". - Peag
Ady: Right, the last part can be as simple as (.*) to get the rest of the line. Fixed. - Ingrown
A
7

Although a regex can easily do this, I'm not sure it's the most straightforward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.

Instead, I'd suggest something like this:

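my $x = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Peel off the timestamp first, then split the rest on semicolons.
if (my ($ts, $rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
    # A limit of 5 keeps any semicolons in the quote inside the last field.
    my ($command, $hash, $pid, $handle, $quote) = split /;/, $rest, 5;
    print join(",", map { "[$_]" } $ts, $command, $hash, $pid, $handle, $quote), "\n";
}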

This results in:

[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]

I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse more easily, and have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
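If the maintainability argument appeals, one possible way to package the split-based approach is a small helper that returns a hash reference. This is only a sketch: `parse_say_line` is a hypothetical name, the hash keys simply reuse the variable names from the question, and the check that the command is "say" is an added assumption.

```perl
use strict;
use warnings;

# parse_say_line is a hypothetical helper; returns a hash reference,
# or nothing (undef in scalar context) if the line does not fit.
sub parse_say_line {
    my ($line) = @_;

    # Timestamp first, then everything after the whitespace.
    my ($ts, $rest) = $line =~ /^(\d+:\d+)\s+(.*)/ or return;

    # Limit of 5 so semicolons inside the quote stay in the last field.
    my @fields = split /;/, $rest, 5;
    return unless @fields == 5 && $fields[0] eq 'say';

    my %record = (ts => $ts);
    @record{qw(command hash pid handle quote)} = @fields;
    return \%record;
}

my $rec = parse_say_line('14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)');
print "$rec->{handle}: $rec->{quote}\n" if $rec;
```

Adding or renaming a field then only touches the key list, which is the kind of change the paragraph above has in mind.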

Ashcraft answered 1/11, 2008 at 18:6 Comment(0)
H
3

Try making the first 3 (.*) ungreedy (.*?)

Housebreaker answered 1/11, 2008 at 17:39 Comment(0)
F
3

If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.

(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)
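A quick check of this pattern against the line from the question may be useful, because the quote there does contain a semicolon. The sketch below runs the pattern as written and then a variant whose last field and anchors are an assumption added here, not part of the answer.

```perl
use strict;
use warnings;

my $line = '14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)';

# Field-specific classes as in the pattern above. Without an end anchor,
# the last group stops at the first semicolon inside the quote.
my @strict = $line =~ /(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)/;

# Variant (an assumption, not part of the answer): let the final field
# contain semicolons and anchor both ends of the line.
my @whole = $line =~ /^(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^\r\n]*)$/;

print "[$strict[4]]\n";
print "[$whole[4]]\n";
```

With the sample line, the first pattern should capture only "hello " as the quote, while the anchored variant keeps "hello ;)". Which one is right depends on whether the last field is allowed to contain semicolons.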
Finnigan answered 2/11, 2008 at 1:21 Comment(1)
Jan, if you want something to be marked up as source code, each line has to start with four spaces. And welcome to SO. - Lustrum
B
2

You could make * non-greedy by appending a question mark:

$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/

or you can match everything except a semicolon in each part except the last:

$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/
Bcd answered 1/11, 2008 at 17:47 Comment(0)
