Problem with existing attempt
Your attempted pattern &[ ]|[^(?*^( ));] fails primarily because the | is not inside any group, so the alternation splits the whole pattern: you are replacing either &[ ] OR [^(?*^( ));], and the latter will match most things. You are also misunderstanding how character classes work.
Inside [..] (a character class) there are a few simple rules:
- if it starts with a ^ it is negated, otherwise the ^ is literal.
- if there is a hyphen between two characters it is treated as a range (e.g. a-z or 1-5)
- if there is a backslash, it either marks a shorthand class (e.g. \w), or escapes the following character (inside a char class this is only required for [ ] ^ - \).
- you are only matching a single character (subject to any quantifiers); there is no ordering/sequence inside the class, and duplicates of the same character are ignored.
Also, you don't need to put a space inside a character class - a literal space works fine (unless you are in free-spacing comment mode, which needs to be explicitly enabled).
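The rules above can be checked quickly with a short script. This is only a sketch in Python (chosen so it can be run standalone; it is not CFML, and regex engines differ in some details), using made-up sample strings. It first shows what the attempted pattern actually matches, then exercises each character class rule in turn.

```python
import re

# The attempted pattern: the top-level | makes "&[ ]" and "[^(?*^( ));]"
# two complete, independent alternatives.
attempt = re.compile(r"&[ ]|[^(?*^( ));]")

# Any character other than ( ? * ^ space ) ; satisfies the second alternative.
print(attempt.findall("a & b;"))

# A leading ^ negates the class; anywhere else it is a literal caret.
print(re.findall(r"[^ab]", "abc^"))
print(re.findall(r"[a^b]", "abc^"))

# A hyphen between two characters is a range.
print(re.findall(r"[a-c]", "abcd"))

# A backslash marks a shorthand class or escapes the next character.
print(re.findall(r"[\w]", "a b!"))
print(re.findall(r"[\[\]\^\-\\]", r"[x]^-\ "))

# Order and duplicates inside the class make no difference.
print(re.findall(r"[abc]", "cab") == re.findall(r"[ccbbaa]", "cab"))
```

Running the attempted pattern through a replace on the same sample shows why the output looked wrong: almost every character is treated as a match.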
Hopefully that helps you understand what was going wrong?
As for actually solving your problem...
Solution
To match an ampersand that does not start a HTML entity, you can use:
&(?![a-z][a-z0-9]+;|#(?:\d+|x[\dA-F]+);)
That is, an ampersand, followed by a negative lookahead for either of:
- a letter, then one or more letters or digits, then a semicolon - i.e. a named entity reference
- a hash, then either a decimal number, or an x followed by a hex number, and finally a semicolon - i.e. a numeric entity reference.
To use this in CFML, replacing each such & with &amp; would be (the doubled ## is how CFML escapes a literal # inside a string):
<cfset data = rereplaceNoCase( data , '&(?![a-z][a-z0-9]+;|##(?:\d+|x[\dA-F]+);)' , '&amp;' , 'all' ) />
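To sanity-check the pattern outside CFML, here is a minimal sketch in Python. It assumes re.IGNORECASE is a fair stand-in for rereplaceNoCase, and it uses a single # because the doubled ## is only CFML string escaping. The function name escape_bare_ampersands and the sample strings are hypothetical, chosen just for illustration.

```python
import re

# Same pattern as the CFML call, compiled case-insensitively.
BARE_AMP = re.compile(
    r"&(?![a-z][a-z0-9]+;|#(?:\d+|x[\dA-F]+);)",
    re.IGNORECASE,
)

def escape_bare_ampersands(text: str) -> str:
    """Replace each & that does not start an entity reference with &amp;."""
    return BARE_AMP.sub("&amp;", text)

samples = [
    "Tom & Jerry",              # ampersand followed by a space
    "Ü&Ä",                      # ampersand between non-ASCII letters, no spaces
    "R&D",                      # letters follow, but no closing semicolon
    "&lt;b&gt; &#169; &#xA9;",  # named and numeric references stay untouched
]

for s in samples:
    print(s, "->", escape_bare_ampersands(s))
```

Because existing references such as &amp; are skipped by the lookahead, running the function over already-escaped text should leave it unchanged, so it is safe to apply more than once.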
Ü&Ä (no spaces) also must work. – Undine