Coldfusion ReReplace "&" but not htmlspecialchars

I need to replace all & with &amp; in a string like this:

Übung 1: Ü & Ä

or in html

&Uuml;bung 1: &Uuml; & &Auml;

As you can see, there are htmlspecialchars in the string (but the & is not written as &amp;), so I need to exclude them from my replace. I'm not very familiar with regular expressions. All I need is an expression that does the following:

Search for an & that is either followed by a space, or is not followed by some characters (excluding a space) that end with a ;. Then replace that & with &amp;.

I tried something like this:

<cfset data = ReReplace(data, "&[ ]|[^(?*^( ));]", "&amp;", "ALL") />

but that replaces every char with &amp;... ^^'

Sorry, I really don't get regex.

Undine answered 4/1, 2013 at 8:51 Comment(2)
@AlEverett: please tell me if you know another way. But remember that Ü&Ä (no spaces) also must work. - Undine
Al, "The XY problem is asking about your attempted solution rather than your actual problem." - but Toby explicitly stated his actual problem in the first line "I need to replace all & with &amp; in a string" - and then went on to provide an example and explain how he was trying to solve it. There's nothing wrong with that - indeed, it is absolutely preferable to do that instead of just asking a question without showing any attempt to solve it. - Kraigkrait
10

Problem with existing attempt

Your attempted pattern &[ ]|[^(?*^( ));] fails primarily because the | has no enclosing group. This means you are replacing either &[ ] or [^(?*^( ));], and the latter is a negated character class that matches almost any single character. You are also misunderstanding how character classes work.
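
To see the effect of the ungrouped |, here is a minimal sketch in JavaScript rather than CFML. It assumes both regex flavors treat alternation and character classes the same way for this particular pattern; the variable names and the sample string are made up for illustration.

```javascript
// The asker's pattern, unchanged: "&[ ]" OR "[^(?*^( ));]".
// The second alternative is a negated character class, so it matches any
// single character that is not one of ( ? * ^ ) ; or a space.
const attempted = /&[ ]|[^(?*^( ));]/g;

const sample = "a & b";
const result = sample.replace(attempted, "&amp;");

// "a" and "b" are matched by the second alternative, "& " by the first.
// Only the space after "a" is left untouched.
console.log(result); // "&amp; &amp;&amp;"
```

Every ordinary character becomes &amp;, which matches the behavior described in the question.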

Inside [..] (a character class) there are a few simple rules:

  • if it starts with a ^ it is negated, otherwise the ^ is literal.
  • if there is a hyphen between two characters it is treated as a range (e.g. a-z or 1-5)
  • if there is a backslash, it either marks a shorthand class (e.g. \w), or escapes the following character (inside a char class this is only required for [ ] ^ - \).
  • you are only matching a single character (subject to any quantifiers); there is no ordering/sequence inside the class, and duplicates of the same character are ignored.

Also, you don't need to put a space inside a character class - a literal space works fine (unless you are in free-spacing comment mode, which needs to be explicitly enabled).
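
The character class rules above can be checked with a few one-liners. This is a sketch in JavaScript, assuming its character classes follow the same rules as ColdFusion's for these simple cases; all names are illustrative.

```javascript
// A leading ^ negates the class; anywhere else it is a literal caret.
const notA = /^[^a]$/;
const aOrCaret = /^[a^]$/;
console.log(notA.test("b"), notA.test("a")); // true false
console.log(aOrCaret.test("^"));             // true

// A hyphen between two characters forms a range.
const oneToFive = /^[1-5]$/;
console.log(oneToFive.test("3"), oneToFive.test("7")); // true false

// Order and duplicates are ignored: these two classes accept the same characters.
const ordered = /^[abc]$/;
const shuffled = /^[cbaab]$/;
const sameVerdict = ["a", "b", "c", "d"].every(
  (ch) => ordered.test(ch) === shuffled.test(ch)
);
console.log(sameVerdict); // true

// A class matches exactly one character, never a sequence.
console.log(ordered.test("ab")); // false

// A bare space and [ ] behave the same here.
const spaceInClass = "a b".replace(/[ ]/, "_");
const bareSpace = "a b".replace(/ /, "_");
console.log(spaceInClass === bareSpace); // true
```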

Hopefully that helps you understand what was going wrong?

As for actually solving your problem...

Solution

To match an ampersand that does not start a HTML entity, you can use:

&(?![a-z][a-z0-9]+;|#(?:\d+|x[\dA-F]+);)

That is, an ampersand, followed by a negative lookahead for either of:

  • a letter, then one or more letters or digits, then a semicolon - i.e. a named entity reference

  • a hash, then either a number, or an x followed by a hex number, and finally a semicolon - i.e. a numeric entity reference.
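
The lookahead described above can be tried outside ColdFusion as well. The following is a sketch in JavaScript, assuming its lookahead syntax behaves like ColdFusion's for this pattern; the i flag stands in for case-insensitive matching, and the function name and the sample inputs are illustrative.

```javascript
// The pattern from this answer, with "g" (all matches) and "i" (ignore case).
const bareAmpersand = /&(?![a-z][a-z0-9]+;|#(?:\d+|x[\dA-F]+);)/gi;

// Escapes every & that does not start a named or numeric entity.
function escapeBareAmpersands(text) {
  return text.replace(bareAmpersand, "&amp;");
}

console.log(escapeBareAmpersands("&Uuml;bung 1: &Uuml; & &Auml;"));
// "&Uuml;bung 1: &Uuml; &amp; &Auml;"
console.log(escapeBareAmpersands("&Uuml;&&Auml;"));
// "&Uuml;&amp;&Auml;"
console.log(escapeBareAmpersands("K&R &#220; &#xDC;"));
// "K&amp;R &#220; &#xDC;"
```

Because the check is a lookahead, only the ampersand itself is consumed, so the no-space case Ü&Ä from the question is handled too.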

To use this in CFML to replace & with &amp; (note the doubled ##, which escapes the hash inside a CFML string):

<cfset data = rereplaceNoCase( data , '&(?![a-z][a-z0-9]+;|##(?:\d+|x[\dA-F]+);)' , '&amp;' , 'all' ) />
Kraigkrait answered 4/1, 2013 at 13:27 Comment(3)
Well that's what I expected as an answer, thx! Oh, and if I understand correctly you forgot to add the ; in '&amp;', 'all'). - Undine
Doh, yep, missed that semicolon. :( btw, just added a few notes on your initial attempt which are hopefully useful... - Kraigkrait
Thx Peter, that helps a lot! - Undine
4

I think it would be easier to simply replace all occurrences of & with &amp;, and then replace the wrongly replaced ones again:

<cfset data = ReReplace(ReReplace(data, "&", "&amp;", "ALL"), "&amp;([^;&]*;)", "&\1", "ALL") />

I haven't tested this in ColdFusion (since I have no clue how to), but it should work, because in JavaScript, the regex itself works:

var s = "I we&nt out on 1 se&123;p 2012 and& it was be&tter & than 15 jan 2012"
console.log(s.replace(/&/g, '&amp;').replace(/&amp;([^;&]*;)/g, '&$1'));
//"I we&amp;nt out on 1 se&123;p 2012 and&amp; it was be&amp;tter &amp; than 15 jan 2012"

So I assume the regex will also do its trick in CF.
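
One case worth checking with this approach, sketched with the same two JavaScript replaces: a bare & that is later followed by a ; with no other & or ; in between is treated as an entity by the second pass. The function name and the second sample input are made up for illustration.

```javascript
// Two-pass approach from this answer: escape every &, then undo the
// escape wherever "&amp;" is followed by text that ends in ";".
function twoPassEscape(text) {
  return text.replace(/&/g, "&amp;").replace(/&amp;([^;&]*;)/g, "&$1");
}

console.log(twoPassEscape("&Uuml; & &Auml;")); // "&Uuml; &amp; &Auml;"

// Here "&T;" looks like an entity to the second pass, so that & stays bare.
console.log(twoPassEscape("AT&T; K&R")); // "AT&T; K&amp;R"
```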

Manas answered 4/1, 2013 at 9:18 Comment(4)
Could the person that downvoted me please explain why? If I made a mistake, I'd like to learn from it. - Manas
I changed the accepted answer to the one above, because yours is just a workaround (that works, of course). Hope you understand... ^^ - Undine
Just to say the downvote was not from me - I downvoted the other answer for being broken, but not this one since it's a working workaround, if non-ideal. - Kraigkrait
@PeterBoughton: I appreciate it. Toby: that's okay, the best answer should be the accepted one on SO. - Manas
0

Another option is not to use REGEX at all. For the sample string you listed, you are simply trying to replace the HTML ampersand ("&") without affecting the HTML entities. This can be accomplished using just REPLACE.

Remember that in an entity there are no spaces around the ampersand character, whereas a literal ampersand that needs converting to an HTML entity typically has a leading and a trailing space. REPLACE will find every occurrence of " & " and update it, without affecting any of the "&Uuml;" strings (which have no leading and trailing space).

<cfset html = "&Uuml;bung 1: &Uuml; & &Auml;">
<cfset parsedHtml = REPLACE(html," & ", " &amp; ","All")>
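
As a quick check of the plain-string approach, here is a JavaScript sketch that mirrors the Replace call above. The function name is illustrative, and the last two inputs are the cases raised in the comments on this answer.

```javascript
// Plain-string replace of " & ", mirroring Replace(html, " & ", " &amp; ", "All").
function replaceSpacedAmpersands(html) {
  return html.split(" & ").join(" &amp; ");
}

console.log(replaceSpacedAmpersands("&Uuml;bung 1: &Uuml; & &Auml;"));
// "&Uuml;bung 1: &Uuml; &amp; &Auml;"

// Without surrounding spaces the ampersand is not found, so the input is unchanged.
console.log(replaceSpacedAmpersands("&Uuml;&&Auml;")); // "&Uuml;&&Auml;"
console.log(replaceSpacedAmpersands("K&R"));           // "K&R"
```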
Ensilage answered 4/1, 2013 at 12:40 Comment(2)
Fails for input such as K&R, :&, &c. - Kraigkrait
I thought of that option myself, but it does not work if you have something like &Uuml;bung 1: &Uuml;&&Auml; (Übung 1: Ü&Ä). - Undine
-1

For a fast, trouble-free approach, just use the decimal code point, like so:

<cfset html = Replace(html, Chr(38), "&amp;", "all")>
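
Chr(38) is the ampersand character itself, so this call is a plain replace of every &. A JavaScript sketch of the equivalent, with String.fromCharCode standing in for Chr and an illustrative function name, shows what that does to a string that already contains entities.

```javascript
// Code point 38 is "&", in CFML's Chr(38) and in String.fromCharCode(38) alike.
const ampersand = String.fromCharCode(38);
console.log(ampersand === "&"); // true

// Replaces every ampersand, including the ones that start an entity.
function escapeEveryAmpersand(html) {
  return html.split(ampersand).join("&amp;");
}

console.log(escapeEveryAmpersand("&Uuml; & &Auml;"));
// "&amp;Uuml; &amp; &amp;Auml;"
```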
Ataraxia answered 15/4, 2018 at 17:24 Comment(0)
