PHP/Regex: simple regex for bbcode [s] or [strike] fails to work

Asked 22/11, 2010 at 11:38 Answered 22/11, 2010 at 11:42

For a silly BBCode parser I wanted to merge two definitions into one. My original definition for preg_replace was this:

'#\[s\](.*?)\[/s\]#si', '<strike>\\1</strike>'

And this works. I wanted the user to be able to use either [s] or [strike] to start text in that format, so I naturally added something like this, thinking it would work:

'#\[(s|strike)\](.*?)\[/(s|strike)\]#si', '<strike>\\1</strike>'

Unfortunately that fails. Instead of what you would expect, both [s] and [strike] (used properly) produce ~~s~~ and ~~strike~~ (my markdown correctly shows the real result: it displays s or strike regardless of what is inside the tags).

Why does it replace the inner text with the tag name instead? Is my adding parentheses around the s|strike the problem? I am probably doing this all wrong..

Hurried answered 22/11, 2010 at 11:38 Comment(26)

BBcode is not regular. Use a BBCode parser – Arlberg 22/11, 2010 at 11:39

@Gordon, for many reasons I am wanting to use regular expressions for this project. I am combining it with many filtering/etc things. – Hurried 22/11, 2010 at 11:42

@Hurried Regex aint no Golden Hammer. – Arlberg 22/11, 2010 at 11:47

@Gordon, what a better way to learn regex? besides, you are really limited using php's bbcode (I mean if you want to get creative with some sort of syntax features) – Hurried 22/11, 2010 at 11:50

Regex limits BBCode too (just think of nested blocks), but I think using regex for simple BBCode is fine. After all, BBCode has no fixed definition that has to be followed... – Slumgullion 22/11, 2010 at 11:52

@poke, I am using a [syntax=cpp][/syntax] like bbcode regex to pass source code to geshi, so it is quite fun to use this instead. :) – Hurried 22/11, 2010 at 11:54

@Hurried well, a better way to learn Regex would be to find a problem they are good at solving. Use them for matching, not for parsing. – Arlberg 22/11, 2010 at 11:56

@poke, nested blocks? I mean I tried a crazy one like [u][strike][b]underbold[/u] boldandstrike[/b][u]onlyunderline&strike[/strike][/u]should be clear and it worked. I do not use more complex stuff so I don't think it is a problem. – Hurried 22/11, 2010 at 12:4

@John: Yeah, it works because the browser which then interprets the resulting html is clever (or rather stupid?) enough to ignore that the nesting isn't correct. Nevertheless I meant things like code blocks with special inside features, or even lists, or quotes or... For simple formatting, I think it is fine enough as a simple replacement for direct html (so you can strip <> out). – Slumgullion 22/11, 2010 at 12:8

@Gordon: You’re barking up the wrong tree. Modern regexes have next to nothing to do with REGULAR languages and compatibility classes. Regular expressions haven’t been REGULAR since Ken Thompson first put (.)\1 into his backtracking NFA code in grep: the language described by (.)\1 is not REGULAR in that st00pid textbook REGULARity definition that nobody uses and which does not apply to modern regexes. – Russon 22/11, 2010 at 13:47

@Russon see here please. – Arlberg 22/11, 2010 at 15:1

@Gordon: That article is wrong! I can easily make a pattern he can’t break. He’s not talking about modern regexes, only about textbook REGULAR regular expressions, something that nobody uses. Even egrep can match (.)\1, which is not REGULAR. See here, here, and here — &c&c&c! – Russon 22/11, 2010 at 15:56

@Russon please put a comment at that page and tell Kore please, as I am not mathematically skilled enough to discuss this topic in the detail it probably requires. However, even if Kore is wrong, it is still (in most cases) not feasible to reinvent a BBCode parser with Regex when native parsers are already readily available. Same for HTML. That last link of yours is a fine example of that. With all due respect to your skill, all these lines of code for something you could just use a dedicated DOM parser for? – Arlberg 22/11, 2010 at 16:15

@Gordon: You don’t mean not feasible; you mean not practical, or perhaps not expedient. I certainly do not advise reïnventing perfectly good wheels. I am just sick and tired of people mindlessly parroting this slap-down refrain, “You cannot do X with regexes”, when they really mean one or more of “We do not know how to do so”, “Do not do so”, or “There are easier ways to accomplish your goal.” It’s dismissive and disingenuous, even dishonest. But the querents should understand there’s no moral superiority in fitting everything into a single regex; indeed, it has numerous drawbacks. – Russon 22/11, 2010 at 17:42

@Russon So basically you agree with me: BBCode is not regular. Regex aint no Golden Hammer and OP should use a BBCode parser instead? :) – Arlberg 22/11, 2010 at 18:6

@Gordon: On the contrary, I vehemently and vociferously disagree with you. The misapplication of the high-brow term REGULAR has nothing to do with real pattern matching. It has a highly irregular and utterly counterintuitive meaning that deceives anybody but an ivory-tower egghead. I am sick and tired of hearing you and everybody else pretending that regular expressions are REGULAR. They are not, and it is even required that they not be: notice that even POSIX BREs must support backrefs, thereby putting the lie to all your REGULAR pontificating. \((?:[^()]*+|(?0))*\) is a beautiful regex. – Russon 22/11, 2010 at 19:35

@Russon well, they should be renamed Irregular Expressions then :) I mean, seriously, I am happy you clarified that for me. I am not sarcastic or ironic here. I promise next time I will just say "use a proper parser", because ultimately, I don't care if Regex can do X. Using them for BBCode or HTML parsing when there are dedicated parsers readily available is indeed not practical. And using them in that case is using them as a Golden Hammer. As for beauty: that's in the eye of the beholder. I find all Regex plain ugly. – Arlberg 22/11, 2010 at 20:14

@Gordon: Now I agree; it doesn’t matter whether regexes can do something given the availability of alternatives that are shorter, easier, and more robust. On legibility, all serious patterns I write have careful whitespace, indentation, comments, alphabetic subroutines, and a separation of declaration from execution. Look at the patterns in the 3 refs I gave earlier, plus here and here. See? – Russon 22/11, 2010 at 20:41

@Russon thanks for the links. I will read them, though I doubt it will make me change my mind about the readability. I just prefer a method with a nice name telling me what it does. – Arlberg 22/11, 2010 at 20:58

@Gordon: check out the html chunker at the very start of that posting, where I pull off head tokens one by one within a for/given loop. How much more readable could you get? :) – Russon 22/11, 2010 at 21:1

@Russon that's an exception to the madness you usually see with Regex ;) – Arlberg 22/11, 2010 at 21:12

@Gordon: That sort of thing should be terminated with extreme prejudice. Plus RFC 822 is dead; the modern way to write that is like this, which I think even you would agree is a tremendous improvement — eh? – Russon 22/11, 2010 at 21:33

@Russon I know, just wanted to counter with an extremely unreadable example. I agree the other one looks much better. I would even dare to say, it is intriguing, but some other part of me thinks all Regex are perverted low level smut that should be locked away behind nice and clean APIs. – Arlberg 22/11, 2010 at 22:0

@Gordon, Modern regular expressions written in a lucid and literate style have about as much in common with the old grep style line-noise atrocities as (insert your favorite very-high-level programming language) has with machine language. A modern regex can be very close to a BNF grammar spec, for example. There is absolutely no reason to accept these /☕✷⅋⋙$⚣™‹ª∞¶⌘̤℈⁑‽#♬˘$π❧/ tortured abominations from anybody in this day and age. Modern regexes are a whole new world, if not numinous then at least luminous. :) – Russon 22/11, 2010 at 22:1

@tchrist, I refreshed this page and thoroughly enjoyed that argument, changed my opinion on these things (I was scared of using it for bbcode just before I read), hehe -- maybe I'll learn more and decide if I want to keep it like that. – Hurried 23/11, 2010 at 14:52

@John, @Gordon: Here’s yet another recent treatise of mine on writing maintainable regexes. – Russon 23/11, 2010 at 15:3

The problem is that you added two new capturing groups: (s|strike) in the opening tag and (s|strike) in the closing tag. The tag name is now group 1, so \\1 inserts s or strike into the result. The inner text has moved to group 2, so you can fix that by simply using the correct group number, 2.

Another way would be to make those new groups non-capturing by adding ?: at the beginning, so the inner text stays group 1, although I guess the first solution is easier to understand:

#\[(?:s|strike)\](.*?)\[/(?:s|strike)\]#si
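Both fixes can be compared side by side. This is a minimal sketch, not part of the original answer: the sample input string and the variable names `$input`, `$byNumber`, and `$nonCapturing` are made up for illustration.

```php
<?php
// Hypothetical sample input containing both tag spellings.
$input = 'a [s]one[/s] b [strike]two[/strike] c';

// Fix 1: keep the capturing groups and reference group 2 (the inner text).
$byNumber = preg_replace(
    '#\[(s|strike)\](.*?)\[/(s|strike)\]#si',
    '<strike>\\2</strike>',
    $input
);

// Fix 2: make the tag-name groups non-capturing so the text stays group 1.
$nonCapturing = preg_replace(
    '#\[(?:s|strike)\](.*?)\[/(?:s|strike)\]#si',
    '<strike>\\1</strike>',
    $input
);

echo $byNumber, PHP_EOL;      // a <strike>one</strike> b <strike>two</strike> c
echo $nonCapturing, PHP_EOL;  // a <strike>one</strike> b <strike>two</strike> c
```

Note that either pattern also accepts a mismatched pair such as [s]text[/strike], because the opening and closing alternations are independent of each other.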

Slumgullion answered 22/11, 2010 at 11:42 Comment(4)

Ah, thank you, this helps my understanding. I thought only (.*?) would capture a group, I totally forgot (anything) can too. EDIT: but does the first (s|strike) make a group too? why just the second one? is the first one \0? Confuses me, but I may get it after sleep :P – Hurried 22/11, 2010 at 11:48

All (..) are capturing groups (unless they begin with ?:). But groups are numbered starting with 1 because the “group” 0 usually represents the whole matched string (in this case [s]some text[/s]). – Slumgullion 22/11, 2010 at 11:54
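To see the numbering for yourself, you can dump the match array for the pattern from the question. A small sketch; the variable name `$m` is arbitrary, and the sample string is the one mentioned in the comment above.

```php
<?php
// Run the question's three-group pattern against a sample string.
preg_match('#\[(s|strike)\](.*?)\[/(s|strike)\]#si', '[s]some text[/s]', $m);

print_r($m);
// $m[0] is the whole match:      [s]some text[/s]
// $m[1] is the opening tag name: s
// $m[2] is the inner text:       some text
// $m[3] is the closing tag name: s
```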

Oh!!.. This makes complete sense to me now. Thank you :) – Hurried 22/11, 2010 at 11:56

Named groups are also captivating (sorry ☺) as used in (?<GROUP_NAME> … ). They are also numbered, but the preferred way to access them is \k<GROUP_NAME> from within the pattern and $+{GROUP_NAME} from without. There are a few situations where you can refer to numbered or named groups without backref notation. Mostly in the (?(CONDITION) YES_PART | NO_PART) condition test of the conditional pattern. You can write (?(2)…|…) or (?(<GROUP_NAME>)…|…). There are also some recursion tests where you don’t use the backslash to talk about the group. Named groups are superior to numbered ones. – Russon 22/11, 2010 at 13:51
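The $+{GROUP_NAME} form above is Perl syntax. In PHP, a rough equivalent is to read named groups as string keys of the match array, for instance inside preg_replace_callback (as far as I know, the replacement string of preg_replace only accepts numbered references). The sketch below assumes hypothetical group names `tag` and `text` and a made-up input; it also uses \k<tag> so the closing tag must repeat the opening one.

```php
<?php
// Named groups: "tag" is the tag name, "text" is the inner text.
// \k<tag> requires the closing tag to match the opening tag.
$pattern = '#\[(?<tag>s|strike)\](?<text>.*?)\[/\k<tag>\]#si';

$html = preg_replace_callback(
    $pattern,
    function (array $m): string {
        // Named groups are available under their names in the match array.
        return '<strike>' . $m['text'] . '</strike>';
    },
    '[s]one[/s] and [strike]two[/strike] but not [s]three[/strike]'
);

echo $html, PHP_EOL;
// <strike>one</strike> and <strike>two</strike> but not [s]three[/strike]
```

The mismatched [s]three[/strike] pair is left untouched here, which may or may not be what you want for lenient user input.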
