Remove simple HTML-Tags from String in Oracle via RegExp, Explanation needed
Asked Answered
A

4

9

I do not understand, why my columns reg1 and reg2 remove "bbb" from my string, and only reg3 works as expected.

WITH t AS (SELECT 'aaa <b>bbb</b> ccc' AS teststring FROM dual)

SELECT
  teststring,
  regexp_replace(teststring, '<.+>') AS reg1,
  regexp_replace(teststring, '<.*>') AS reg2,
  regexp_replace(teststring, '<.*?>') AS reg3
FROM t


TESTSTRING             REG1        REG2          REG3
aaa <b>bbb</b> ccc     aaa ccc     aaa ccc       aaa bbb ccc

Thanks a lot!

Aronarondel answered 10/6, 2015 at 12:51 Comment(0)
L
19

Because regex is greedy by default. I.e. the expressions .* or .+ try to take as many characters as possible. Therefore <.+> will span from the first < to the last >. Make it lazy by using the lazy operator ?:

regexp_replace(teststring, '<.+?>')

or

regexp_replace(teststring, '<.*?>')

Now, the search for > will stop at the first > encountered.

Note that . includes > as well, therefore the greedy variant (without ?) swallows all the > but the last.

Lipcombe answered 10/6, 2015 at 12:56 Comment(2)
Thanks a lot! I thought, ? stands only for "one or zero".Aronarondel
? stands for "one or zero" unless when occuring after a quanitfier, where it stands as lazy operator. See: Lazy quantification.Lipcombe
S
1

You can remove HTML tags from string

REGEXP_REPLACE (teststring,'<[^>]*>',' ')

Scotland answered 6/9, 2022 at 9:1 Comment(1)
Although your answer is correct in the sense of obtaining the result expected by the OP, the OP was asking a why some of his code worked and some didn't work, he already had an answer, and was looking for an explanation. The explanation was provided 7 years ago, and your answer doesn't answer that question, and is posted in an old thread.Mccourt
H
0

Because the first one and the second one are finding this match: <b>bbb</b> - in this case b>bbb</b matches both .* and .+

The third one also won't do what you need. You are looking for something like this: <[^>]*>. But you also need to replace all matches with ""

Hershey answered 10/6, 2015 at 13:0 Comment(1)
The third one do exacly what I need. I didn't understand, why. @Olivier gave the useful answer.Aronarondel
A
-1

If you are merely trying to display the string without all the HTML tags, you can use the function: utl_i18n.unescape_reference(column_name)

Alow answered 2/11, 2018 at 0:11 Comment(1)
A late response but unescape_reference only replaces the tags for &lt;, etc. It doesn't strip out any other HTML (e.g. <p>).Coincide

© 2022 - 2024 — McMap. All rights reserved.