How to strip HTML tags from a string in SQL Server?
Asked Answered
M

12

144

I've got data in SQL Server 2005 that contains HTML tags and I'd like to strip all that out, leaving just the text between the tags. Ideally also replacing things like &lt; with <, etc.

Is there an easy way to do this or has someone already got some sample T-SQL code?

I don't have the ability to add extended stored procs and the like, so would prefer a pure T-SQL approach (preferably one backwards compatible with SQL 2000).

I just want to retrieve the data with stripped out HTML, not update it, so ideally it would be written as a user-defined function, to make for easy reuse.

So for example converting this:

<B>Some useful text</B>&nbsp;
<A onclick="return openInfo(this)"
   href="http://there.com/3ce984e88d0531bac5349"
   target=globalhelp>
   <IMG title="Source Description" height=15 alt="Source Description" 
        src="/ri/new_info.gif" width=15 align=top border=0>
</A>&gt;&nbsp;<b>more text</b></TD></TR>

to this:

Some useful text > more text
Mallis answered 19/1, 2009 at 14:17 Comment(0)
E
196

There is a UDF that will do that described here:

User Defined Function to Strip HTML

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Edit: note this is for SQL Server 2005, but if you change the keyword MAX to something like 4000, it will work in SQL Server 2000 as well.

Exenterate answered 19/1, 2009 at 14:24 Comment(10)
Great, thanks. Comments there link to an improved version: lazycoders.blogspot.com/2007/06/… which deals with more html entities.Mallis
Note that as a string-intensive UDF in SQL Server 2005 or later, this is a perfect candidate for implementing a CLR UDF function for a massive performance boost. More info on doing so here: #35009Exieexigency
Note the lazycoders post has two typos. Remove the single quotes from around the CHAR(13) + CHAR(10) in two of the sections that have these. Subtle enough I didn't catch it until it exceeded the length of a short field (interestingly, and required for me, all replacements are shorter than the original string).Ibnsaud
What about html encoded values? would need them decoded. Thanks.Ringside
I used the lazycoders, plus the typo fix from @Ibnsaud above - works great. To save time, the lazycoders blog version is here: lazycoders.blogspot.com/2007/06/…Tartaglia
This worked for me after I used nvarchar instead of varchar because I use unicode characters in html tagsHobbs
Why does SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) recalculate CHARINDEX('<',@HTMLText) instead of reusing @Start, which has only just been calculated to be that exact same thing on the preceding statement?Wyattwyche
@RedFilter That does not remove Embedded CSS Styles.Stomacher
after using the above, i had some weird characters added to my text. changing the varchar(max) parameter to nvarchar(max) fixed it.Reata
This answer does not remove predefined entities like &nbsp; and it will strip out the > that was requested in the original question. The answer by Patrick Honorez does handle the original question and is a more complete solution. YMMV.Cythera
W
30

Derived from @Goner Doug answer, with a few things updated:
- using REPLACE where possible
- conversion of predefined entities like &eacute; (I chose the ones I needed :-)
- some conversion of list tags <ul> and <li>

ALTER FUNCTION [dbo].[udf_StripHTML]
--by Patrick Honorez --- www.idevlop.com
--inspired by https://mcmap.net/q/158788/-how-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br/>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br />',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<li>','- ')
set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10))

set @HTMLText = replace(@htmlText, '&rsquo;' collate Latin1_General_CS_AS, ''''  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euro;' collate Latin1_General_CS_AS, '€'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oelig;' collate Latin1_General_CS_AS, 'oe'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&nbsp;' collate Latin1_General_CS_AS, ' '  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&copy;' collate Latin1_General_CS_AS, '©'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&laquo;' collate Latin1_General_CS_AS, '«'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&reg;' collate Latin1_General_CS_AS, '®'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&plusmn;' collate Latin1_General_CS_AS, '±'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup2;' collate Latin1_General_CS_AS, '²'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&sup3;' collate Latin1_General_CS_AS, '³'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&micro;' collate Latin1_General_CS_AS, 'µ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&middot;' collate Latin1_General_CS_AS, '·'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ordm;' collate Latin1_General_CS_AS, 'º'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&raquo;' collate Latin1_General_CS_AS, '»'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac14;' collate Latin1_General_CS_AS, '¼'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac12;' collate Latin1_General_CS_AS, '½'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&frac34;' collate Latin1_General_CS_AS, '¾'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ccedil;' collate Latin1_General_CS_AS, 'Ç'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Egrave;' collate Latin1_General_CS_AS, 'È'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Eacute;' collate Latin1_General_CS_AS, 'É'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ecirc;' collate Latin1_General_CS_AS, 'Ê'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Ouml;' collate Latin1_General_CS_AS, 'Ö'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&agrave;' collate Latin1_General_CS_AS, 'à'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&acirc;' collate Latin1_General_CS_AS, 'â'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&auml;' collate Latin1_General_CS_AS, 'ä'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&aelig;' collate Latin1_General_CS_AS, 'æ'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ccedil;' collate Latin1_General_CS_AS, 'ç'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&egrave;' collate Latin1_General_CS_AS, 'è'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&eacute;' collate Latin1_General_CS_AS, 'é'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ecirc;' collate Latin1_General_CS_AS, 'ê'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&euml;' collate Latin1_General_CS_AS, 'ë'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&icirc;' collate Latin1_General_CS_AS, 'î'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ocirc;' collate Latin1_General_CS_AS, 'ô'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ouml;' collate Latin1_General_CS_AS, 'ö'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&divide;' collate Latin1_General_CS_AS, '÷'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&oslash;' collate Latin1_General_CS_AS, 'ø'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ugrave;' collate Latin1_General_CS_AS, 'ù'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uacute;' collate Latin1_General_CS_AS, 'ú'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&ucirc;' collate Latin1_General_CS_AS, 'û'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&uuml;' collate Latin1_General_CS_AS, 'ü'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&lsaquo;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&rsaquo;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)


-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END
Washday answered 16/9, 2016 at 13:3 Comment(5)
I've used this and love it, but I did add one more replace to the top group: </p> I changed to a char 13 + char 10 also since the end of a paragraph tag would typically indicate a new line. It worked perfectly in my particular scenarioLately
This answer worked great for the most part, but there is an assumption that all of your HTML tags are valid. In my case, there was a truncation issue on the VARCHAR upload that eliminated some closing tags. A simple PATINDEX RTrim did the trick to remove everything else.Mightily
In addition to the change @Lately made (plus a few more that needed carriage returns), I also moved the replaces that result in < and > to the very end. Otherwise they got removed with the tags.Phifer
I have added the following line in the first block: set @HTMLText = replace(@htmlText, '&#x0D;',CHAR(13) + CHAR(10))Cardiff
Adding to a really old post. But you just need twolines, assuming your HTML is well formed. So you can use the above code up until where you start removing "Anything between style tags:", and just add and return here: DECLARE @bodyXML xml = CONVERT(xml, @HTMLText) RETURN @BodyXML.value('.','varchar(max)')Polymeric
P
11

Here's an updated version of this function that incorporates the RedFilter answer (Pinal's original) with the LazyCoders additions and the goodeye typo corrections AND my own addition to handle in-line <STYLE> tags inside the HTML.

ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

-- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;amp;')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END
Pomeranian answered 31/8, 2016 at 15:28 Comment(3)
For my information, any reason using STUFF() instead of REPLACE() (which woudl be mush shorter IMO) ?Washday
I hadn't really thought about it. I simply copied/modified the original, as indicated. Replace might very-well be a better option. I wonder if there's a performance comparison between the two functions to consider...Pomeranian
@GonerDoug cheers for this, was reading through the accepted comments being like, this really needs updating.Iey
F
10

If your HTML is well formed, I think this is a better solution:

create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
    declare @textXML xml
    declare @result varchar(max)
    set @textXML = REPLACE( @text, '&', '' );
    with doc(contents) as
    (
        select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
    )
    select @result = contents.value('.', 'varchar(max)') from doc
    return @result
end
go

select dbo.StripHTML('This <i>is</i> an <b>html</b> test')
Fogg answered 12/8, 2011 at 21:49 Comment(7)
This worked for me. +1. But could you please explain your code, so that developers understand it more easily? :)Frigging
it looks like it loads the html as an xml document then selects all of the values out of it. Note: this code pukes on &nbsp;Ringside
Put a hack in for not bombing on HTML codes. Obviously just a quick hack for in-house use or whatever (just as with accepted UDF).Fogg
It does have to be well formed, so it's not as fault tolerant as RedFilter's.Minima
HTML is not a subset of XML. XHTML is, but HTML is not headed down that road any more.Carse
I was unable to get this to work using the sample HTML from the original question.Cythera
Yes, it does have to be well formed, which is a tyrant. Closest to op's snippet would be select dbo.StripHTML('<B>Some useful text</B>&nbsp; <A onclick="return openInfo(this)" href="http://there.com/3ce984e88d0531bac5349" target="globalhelp"> <IMG title="Source Description" height="15" alt="Source Description" src="/ri/new_info.gif" width="15" align="top" border="0" /> </A>&gt;&nbsp;<b>more text</b>')Fogg
A
7

Here is a version that doesn't require an UDF and works even if the HTML contains tags without matching closing tags.

TRY_CAST(REPLACE(REPLACE(REPLACE([HtmlCol], '>', '/> '), '</', '<'), '--/>', '-->') AS XML).value('.', 'NVARCHAR(MAX)')
Askari answered 14/9, 2019 at 6:49 Comment(2)
I was unable to get this method to work using the sample HTML from the original question.Cythera
This breaks on <br/> and <hr/> and <img/> and <input />Aerugo
C
4

This is not a complete new solution but a correction for afwebservant's solution:

--note comments to see the corrections

CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX))  
RETURNS VARCHAR(MAX)  
AS  
BEGIN  
 DECLARE @Start  INT  
 DECLARE @End    INT  
 DECLARE @Length INT  
 --DECLARE @TempStr varchar(255) (this is not used)  

 SET @Start = CHARINDEX('<',@HTMLText)  
 SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))  
 SET @Length = (@End - @Start) + 1  

 WHILE @Start > 0 AND @End > 0 AND @Length > 0  
 BEGIN  
   IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>')  
    begin  
      SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')  
      end  
-- this ELSE and SET is important
   ELSE  
      SET @Length = 0;  

-- minus @Length here below is important
   SET @Start = CHARINDEX('<',@HTMLText, @End-@Length)  
   SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))  
-- instead of -1 it should be +1
   SET @Length = (@End - @Start) + 1  
 END  

 RETURN RTRIM(LTRIM(@HTMLText))  
END  
Carse answered 21/5, 2013 at 19:56 Comment(1)
This worked for me after I used nvarchar instead of varchar because I use unicode characters inside html tagsHobbs
P
4

How about using XQuery with a one liner:

DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return ($x)//text()')
SET @StrippedText = CAST(@MalformedXML as varchar(max))

This loops through all elements and returns the text() only.

To avoid text between elements concatenating without spaces, use:

DECLARE @MalformedXML xml, @StrippedText varchar(max)
SET @MalformedXML = @xml.query('for $x in //. return concat((($x)//text())[1]," ")')
SET @StrippedText = CAST(@MalformedXML as varchar(max))

And to respond to "How do you use this for a column:

  SELECT CAST(html_column.query('for $x in //. return concat((($x)//text()) as varchar(max))
  FROM table

For the above code, ensure your html_column is of data type xml, if not, you need to save a casted version of the html as xml. I would do this as a separate exercise when you are loading HTML data, as SQL will throw an error if it finds malformed xml, e.g. mismatched start/end tags, invalid characters.

These are excellent for when you want to build seachh phrases, strip HTML, etc.

Just note that this returns type xml, so CAST or COVERT to text where appropriate. The xml version of this data type is useless, as it is not a well formed XML.

Polymeric answered 26/9, 2017 at 2:57 Comment(4)
Without the actual solution to cast from xml I feel like this is at best a partial solution.Earthshaker
CAST(@xml as varchar(max)). Or CONVERT(xml), @XML). Assumed most developers would figure that out.Polymeric
It is definitely reasonable to assume that developers know how to cast, but keep in mind that someone reading your answer may not directly see that 'simply' casting is all that needs to be done. Especially because it is mentioned that we can cast where appropriate. -- I am not trying to be negative, just hope this helps you in creation of answers that are more easy to recognize as being usefull!Earthshaker
So what part of this is the column name? Lets say I have a table called data with a column called html and I want to select all values in that column but strip the html tags how could I use your answer to achieve that?Cyte
H
3

Try this. It's a modified version of the one posted by RedFilter ... this SQL removes all tags except BR, B, and P with any accompanying attributes:

CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
 DECLARE @Start  INT
 DECLARE @End    INT
 DECLARE @Length INT
 DECLARE @TempStr varchar(255)

 SET @Start = CHARINDEX('<',@HTMLText)
 SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
 SET @Length = (@End - @Start) + 1

 WHILE @Start > 0 AND @End > 0 AND @Length > 0
 BEGIN
   IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B')
   BEGIN
      SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
   END
    
   SET @Start = CHARINDEX('<',@HTMLText, @End)
   SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
   SET @Length = (@End - @Start) - 1
 END

 RETURN RTRIM(LTRIM(@HTMLText))
END
Hyaluronidase answered 13/11, 2012 at 21:35 Comment(4)
didn't work for me SELECT dbo.StripHtml('<b>somestuff</b>'); returns that exact stringCloudberry
@ladieu, this is expected. Check very first line of the answer ("this SQL removes all tags except BR, B, and P with any accompanying attributes").Florescence
this SQL function is incorrect. Please refer to the answer below for the corrected function.Bushing
@Bushing using "below" and "above" as a reference for where to find answers on an SO page is nonsensical, because the answer order can be changed using the tabs at the top (and further more, voting can change answer order). Please specify an answer using the name of the author posting itWyattwyche
A
3

Try this if you don't want to use the UDF function.

SELECT COLUMN1, TRY_CONVERT(xml, COLUMN2).value('.', 'nvarchar(max)') as COL2, COLUMN3
FROM   DBO.TABLENAME        
Aggregate answered 15/9, 2020 at 16:14 Comment(1)
I was not able to get this answer to work on a simple query like <font color=black><font color=navy><b>SELECT</b></font> *<BR><font color=navy><b>FROM</b></font> Table<font>Cythera
I
2

While Arvin Amir's answer comes close to a full one-line solution you can drop in anywhere; he's got a slight bug in his select statement (missing the end of the line), and I wanted to handle the most common character references.

What I ended up doing was this:

SELECT replace(replace(replace(CAST(CAST(replace([columnNameHere], '&', '&amp;') as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max)), '&amp;', '&'), '&nbsp;', ' '), '&#x20;', ' ')
FROM [tableName]

Without the character reference code it can be simplified to this:

SELECT CAST(CAST([columnNameHere] as xml).query('for $x in //. return concat((($x)//text())[1]," ")') as varchar(max))
FROM [tableName]
Irvinirvine answered 5/2, 2020 at 16:58 Comment(0)
P
0

Patrick Honorez code needs a slight change.

It returns incomplete results for html that contains &lt; or &gt;

This is because the code below the section

-- Remove anything between tags

will in fact replace the < > to nothing. The fix is to the apply the below two lines at the end:

set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)
Parham answered 27/8, 2018 at 5:44 Comment(0)
E
0

From the result of the accepted answer, the text is not in human-readable format. It combines two words and shows like meaning less. Hence, we should add space between two words. And remove multiple spaces between two words.

Please find the updated answer below,

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText NVARCHAR(MAX))
RETURNS NVARCHAR(MAX) AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1

    WHILE @Start > 0 AND @End > 0 AND @Length > 0
        BEGIN
            SET @HTMLText = STUFF(LTRIM(RTRIM(@HTMLText)),@Start,@Length,' ')
            SET @Start = CHARINDEX('<',@HTMLText)
            SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
            SET @Length = (@End - @Start) + 1
        END

    set @HTMLText = LTRIM(RTRIM(@HTMLText));

    -- Replace multiple spaces with a single space repeatedly
    WHILE CHARINDEX('  ', @HTMLText) > 0
        BEGIN
            SET @HTMLText = REPLACE(@HTMLText, '  ', ' ');
        END

    RETURN LTRIM(RTRIM(@HTMLText));
END
GO
Effete answered 5/10, 2023 at 7:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.