PHP/Regex: simple regex for bbcode [s] or [strike] fails to work
For a silly BBCode parser I wanted to combine two definitions into one. My original definition for preg_replace was this:

'#\[s\](.*?)\[/s\]#si', '<strike>\\1</strike>'

This works. I wanted the user to be able to use either [s] or [strike] for struck-through text, so I naturally changed it to something like this, thinking it would work:

'#\[(s|strike)\](.*?)\[/(s|strike)\]#si', '<strike>\\1</strike>'

Unfortunately that fails. Instead of what you would expect, both [s] and [strike] (used properly) produce the struck-through tag name, s or strike, regardless of what is inside the tags.

Why does it replace the inner text with the tag name instead? Is adding parentheses around s|strike the problem? I am probably doing this all wrong.

Hurried asked 22/11, 2010 at 11:38
Comments (26):
BBCode is not regular. Use a BBCode parser. - Arlberg
@Gordon, for many reasons I want to use regular expressions for this project. I am combining them with a lot of filtering and similar things. - Hurried
@Hurried Regex aint no Golden Hammer. - Arlberg
@Gordon, what better way to learn regex? Besides, you are really limited using PHP's bbcode (I mean if you want to get creative with some sort of syntax features). - Hurried
Regexes limit BBCode too (just think of nested blocks), but I think using regex for simple BBCodes is fine. After all, BBCode has no fixed definition that has to be followed... - Slumgullion
@poke, I am using a [syntax=cpp][/syntax]-like bbcode regex to pass source code to geshi, so it is quite fun to use this instead. :) - Hurried
@Hurried well, a better way to learn Regex would be to find a problem they are good at solving. Use them for matching, not for parsing. - Arlberg
@poke, nested blocks? I mean I tried a crazy one like [u][strike][b]underbold[/u] boldandstrike[/b][u]onlyunderline&strike[/strike][/u]should be clear and it worked. I do not use more complex stuff so I don't think it is a problem. - Hurried
@John: Yeah, it works because the browser which then interprets the resulting html is clever (or rather stupid?) enough to ignore that the nesting isn't correct. Nevertheless I meant things like code blocks with special inside features, or even lists, or quotes or... For simple formatting, I think it is fine as a simple replacement for direct html (so you can strip <> out). - Slumgullion
@Gordon: You’re barking up the wrong tree. Modern regexes have next to nothing to do with REGULAR languages and compatibility classes. Regular expressions haven’t been REGULAR since Ken Thompson first put (.)\1 into his backtracking NFA code in grep: the language described by (.)\1 is not REGULAR in that st00pid textbook REGULARity definition that nobody uses and which does not apply to modern regexes. - Russon
@Russon see here please. - Arlberg
@Gordon: That article is wrong! I can easily make a pattern he can’t break. He’s not talking about modern regexes, only about textbook REGULAR regular expressions, something that nobody uses. Even egrep can match (.)\1, which is not REGULAR. See here, here, and here, &c&c&c! - Russon
@Russon please put a comment at that page and tell Kore, as I am not mathematically skilled enough to discuss this topic in the detail it probably requires. However, even if Kore is wrong, it is still (in most cases) not feasible to reinvent a BBCode parser with Regex when native parsers are already readily available. Same for HTML. That last link of yours is a fine example of that. With all due respect to your skill, all these lines of code for something you could just use a dedicated DOM parser for? - Arlberg
@Gordon: You don’t mean not feasible; you mean not practical, or perhaps not expedient. I certainly do not advise reïnventing perfectly good wheels. I am just sick and tired of people mindlessly parroting this slap-down refrain, “You cannot do X with regexes”, when they really mean one or more of “We do not know how to do so”, “Do not do so”, or “There are easier ways to accomplish your goal.” It’s dismissive and disingenuous, even dishonest. But the querents should understand there’s no moral superiority in fitting everything into a single regex; indeed, it has numerous drawbacks. - Russon
@Russon So basically you agree with me: BBCode is not regular, Regex aint no Golden Hammer, and OP should use a BBCode parser instead? :) - Arlberg
@Gordon: On the contrary, I vehemently and vociferously disagree with you. The misapplication of the high-brow term REGULAR has nothing to do with real pattern matching. It has a highly irregular and utterly counterintuitive meaning that deceives anybody but an ivory-tower egghead. I am sick and tired of hearing you and everybody else pretending that regular expressions are REGULAR. They are not, and it is even required that they not be: notice that even POSIX BREs must support backrefs, thereby putting the lie to all your REGULAR pontificating. \((?:[^()]*+|(?0))*\) is a beautiful regex. - Russon
@Russon well, they should be renamed Irregular Expressions then :) I mean, seriously, I am happy you clarified that for me. I am not being sarcastic or ironic here. I promise next time I will just say "use a proper parser", because ultimately, I don't care if Regex can do X. Using them for BBCode or HTML parsing when there are dedicated parsers readily available is indeed not practical. And using them in that case is using them as a Golden Hammer. As for beauty: that's in the eye of the beholder. I find all Regex plain ugly. - Arlberg
@Gordon: Now I agree; it doesn’t matter whether regexes can do something given the availability of alternatives that are shorter, easier, and more robust. On legibility, all serious patterns I write have careful whitespace, indentation, comments, alphabetic subroutines, and a separation of declaration from execution. Look at the patterns in the 3 refs I gave earlier, plus here and here. See? - Russon
@Russon thanks for the links. I will read them, though I doubt it will make me change my mind about the readability. I just prefer a method with a nice name telling me what it does. - Arlberg
@Gordon: check out the html chunker at the very start of that posting, where I pull off head tokens one by one within a for/given loop. How much more readable could you get? :) - Russon
@Russon that's an exception to the madness you usually see with Regex ;) - Arlberg
@Gordon: That sort of thing should be terminated with extreme prejudice. Plus RFC 822 is dead; the modern way to write that is like this, which I think even you would agree is a tremendous improvement, eh? - Russon
@Russon I know, just wanted to counter with an extremely unreadable example. I agree the other one looks much better. I would even dare to say it is intriguing, but some other part of me thinks all Regex are perverted low level smut that should be locked away behind nice and clean APIs. - Arlberg
@Gordon, Modern regular expressions written in a lucid and literate style have about as much in common with the old grep style line-noise atrocities as (insert your favorite very-high-level programming language) has with machine language. A modern regex can be very close to a BNF grammar spec, for example. There is absolutely no reason to accept these /☕✷⅋⋙$⚣™‹ª∞¶⌘̤℈⁑‽#♬˘$π❧/ tortured abominations from anybody in this day and age. Modern regexes are a whole new world, if not numinous then at least luminous. :) - Russon
@tchrist, I refreshed this page and thoroughly enjoyed that argument. It changed my opinion on these things (I was scared of using it for bbcode just before I read), hehe. Maybe I'll learn more and decide if I want to keep it like that. - Hurried
@John, @Gordon: Here’s yet another recent treatise of mine on writing maintainable regexes. - Russon

The problem is that you added two new capture groups: (s|strike) in the opening tag and (s|strike) in the closing tag. Groups are numbered by their opening parenthesis, so \1 now refers to the tag name, and your resulting code contains s or strike. You can fix that by simply using the correct group number, 2, for the inner text.
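
As a rough illustration of the numbering, here is a minimal sketch that runs the pattern from the question with both replacement strings. The sample input and the variable names are made up for the example; only the pattern and the `<strike>` replacement come from the question.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = 'This is [s]old[/s] and [strike]gone[/strike].';

// Groups are numbered by their opening parenthesis, left to right:
//   \1 = tag name in the opening tag
//   \2 = inner text
//   \3 = tag name in the closing tag
$pattern = '#\[(s|strike)\](.*?)\[/(s|strike)\]#si';

// Referencing \1 reproduces the problem: the tag name ends up inside <strike>.
$buggy = preg_replace($pattern, '<strike>\\1</strike>', $input);

// Referencing \2 inserts the inner text instead.
$fixed = preg_replace($pattern, '<strike>\\2</strike>', $input);

echo $buggy, "\n"; // This is <strike>s</strike> and <strike>strike</strike>.
echo $fixed, "\n"; // This is <strike>old</strike> and <strike>gone</strike>.
```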

Another way would be to make the new groups non-capturing by adding ?: at the beginning, so that (.*?) stays group 1 and your original replacement keeps working (though I guess the first solution is easier to understand):

#\[(?:s|strike)\](.*?)\[/(?:s|strike)\]#si
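
A minimal sketch of the non-capturing variant in use, assuming the same `<strike>\\1</strike>` replacement as in the question; the sample input is invented for the example.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = '[s]old[/s] and [STRIKE]gone[/STRIKE]';

// (?:...) groups without capturing, so (.*?) is still group 1
// and the original replacement string can stay unchanged.
$pattern = '#\[(?:s|strike)\](.*?)\[/(?:s|strike)\]#si';

$html = preg_replace($pattern, '<strike>\\1</strike>', $input);

echo $html, "\n"; // <strike>old</strike> and <strike>gone</strike>
```

Note that this pattern, like the one in the question, still accepts a mismatched pair such as [s]...[/strike].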
Slumgullion answered 22/11, 2010 at 11:42
Comments (4):
Ah, thank you, this helps my understanding. I thought only (.*?) would capture a group; I totally forgot (anything) can too. EDIT: but does the first (s|strike) make a group too? Why just the second one? Is the first one \0? Confuses me, but I may get it after sleep :P - Hurried
Every (..) is a capture group (unless it begins with ?:). But groups are numbered starting with 1 because “group” 0 usually represents the whole matched string (in this case [s]some text[/s]). - Slumgullion
Oh!!.. This makes complete sense to me now. Thank you :) - Hurried
Named groups are also captivating (sorry ☺) as used in (?<GROUP_NAME> … ). They are also numbered, but the preferred way to access them is \k<GROUP_NAME> from within the pattern and $+{GROUP_NAME} from without. There are a few situations where you can refer to numbered or named groups without backref notation, mostly in the (?(CONDITION) YES_PART | NO_PART) condition test of the conditional pattern. You can write (?(2)…|…) or (?(<GROUP_NAME>)…|…). There are also some recursion tests where you don’t use the backslash to talk about the group. Named groups are superior to numbered ones. - Russon
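
For PHP specifically, the named-group idea from the last comment could look roughly like the sketch below. The `$+{GROUP_NAME}` notation is Perl; in PHP the names show up as keys of the match array, and as far as I know preg_replace replacement strings only take numbered references, so this sketch uses preg_replace_callback instead. The group names `tag` and `text` and the sample input are made up for the example.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = 'Keep [s]old[/s], drop [strike]gone[/strike], ignore [s]mixed[/strike].';

// (?<tag>...) and (?<text>...) are named groups; \k<tag> is a named
// backreference, so the closing tag has to repeat the opening tag's name.
$pattern = '#\[(?<tag>s|strike)\](?<text>.*?)\[/\k<tag>\]#si';

// Named groups are numbered too: 'tag' is also group 1, 'text' is also group 2,
// and index 0 holds the whole match.
preg_match($pattern, $input, $m);
echo $m[0], "\n";      // [s]old[/s]
echo $m['tag'], "\n";  // s
echo $m['text'], "\n"; // old

// The callback reads the inner text by name instead of by number.
$html = preg_replace_callback(
    $pattern,
    function (array $match): string {
        return '<strike>' . $match['text'] . '</strike>';
    },
    $input
);

echo $html, "\n";
```

One design difference from the patterns in the answer: because of the backreference, a mismatched pair like [s]mixed[/strike] is left untouched here, which may or may not be what you want.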
