SQL Server - Remove all non-printable ASCII characters
Asked Answered
Z

4

14

We recently migrated from SQL Server 2012 to SQL Server 2014 and all our FOR XML code started throwing errors about non-printable ASCII characters. I wrote this horrible function to remove non-printable ASCII characters as a quick fix. I want to replace it with something cleaner. Is there a way to do this?

ALTER FUNCTION [dbo].[remove_non_printable_chars]
(@input_string nvarchar(max))
RETURNS nvarchar(max)
BEGIN
    RETURN
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(@input_string,
        CHAR(1), ''),CHAR(2), ''),CHAR(3), ''),CHAR(4), ''),CHAR(5), ''),CHAR(6), ''),CHAR(7), ''),CHAR(8), ''),CHAR(9), ''),CHAR(10), ''),
        CHAR(11), ''),CHAR(12), ''),CHAR(13), ''),CHAR(14), ''),CHAR(15), ''),CHAR(16), ''),CHAR(17), ''),CHAR(18), ''),CHAR(19), ''),CHAR(20), ''),
        CHAR(21), ''),CHAR(22), ''),CHAR(23), ''),CHAR(24), ''),CHAR(25), ''),CHAR(26), ''),CHAR(27), ''),CHAR(28), ''),CHAR(29), ''),CHAR(30), ''),
        CHAR(31), ''), NCHAR(0) COLLATE Latin1_General_100_BIN2, '')
END

Here's the FOR XML code that broke. (I did not write this. It was already in the code base).

SELECT @HTMLTableData =
(
    SELECT  HTMLRows 
    FROM (
        SELECT N'<tr>' + HTMLRow + N'</tr>' AS HTMLRows 
        FROM @HTMLRowData
    ) mi            
    FOR XML PATH(''), TYPE
).value('/', 'NVARCHAR(MAX)')
Zimmermann answered 31/3, 2017 at 19:59 Comment(15)
This is somewhat cumbersome but nested replace is what is going to have to happen to remove these specific characters. Nested replace is also super fast. You might consider turning this into an inline table valued function instead of a scalar function. FWIW, I don't this function is horrible. It is a lot of ugly code that is needed....kind of the point of functions in the first place. :)Analects
I'd post this on codereview.stackexchange.comFordone
Web search yielded: sqlservercentral.com/Forums/Topic860321-338-1.aspxJudicature
@SeanLange Thanks for the kind words :)Zimmermann
This sentence makes me thinking: We recently migrated from SQL Server 2012 to SQL Server 2014 and all our FOR XML code started throwing errors about non-printable ASCII characters. Why and which errors? How are you creating your XMLs? This should not happen... Read about the XY-Problem It might help to show some example which worked in SS12 and does not work with SS14. I'm quite sure, that the problem - and its solution! - is somewhere else...Frontolysis
Btw: A while ago I answered a related question with a function to get namely foreign family names in plain latin. You might have a look into thisFrontolysis
@Shnugo The FOR XML is used to create emails after a job finishes. It uses data from tables and nothing else. The stored proc that sends out the emails just failed after we switched to 2014. We did not change anything else.Zimmermann
I still doubt this ... Please show some Code ...Frontolysis
@Shnugo Added the code that broke. There were multiple such blocks and all of them were giving errors. My suspicion is that encoding for nvarchar has changed between versions, but I have no way to validate that. I used the function to clean up the HTMLRow column in the innermost select.Zimmermann
@Zimmermann This code is obviously creating an HTML table. You might have a look at this generic solution, which will make things easier. But back to your problem: My magic crystal ball tells me, that the problem is not connected to this, but to the filling of @HtmlRowData. I'm pretty sure, that with your last server there was a different default collation. Each database has its own default collation, but the server's temp db is depending on the server's default collation. This can lead into hardly findeably troubles...Frontolysis
@Zimmermann ... continuing... You should never ever create XML (or HTML) on string level! This SELECT N'<tr>' + HTMLRow + N'</tr>' AS HTMLRows is a really bad approach!Frontolysis
@Shnugo It could be the collation issue you mentioned but I don't have access to the old server anymore to confirm. Also, this is not creating table, but just one row of the table. There's more such blocks later and then the final table is created as part of an email. I don't know why it has been done like this. There's many such issues in the entire code base. Like I mentioned, I am not the original developer, just the person who gets the fun task of making sure it runs as expected.Zimmermann
@Zimmermann Just imagine, there is a hi, Tom & Jerry in your @HtmlRowData. This will be concatenated as <tr>hi, Tom&Jerry</tr>, which is invalid (& must be escaped to &amp;)... Furthermore you should - for sure - not find the codes of non-printables like CHAR(1)... Your attempt with REPLACE might work, but is not more than an ugly repairment... The reason for all this is somewhere else...Frontolysis
@Shnugo I'm not disagreeing you with on any of this. I am sure the problem is somewhere else and I know what I have is an ugly hack (which was kind of the point of this question). Not sure what your point is here...Zimmermann
Let us continue this discussion in chat.Frontolysis
P
13

Another Option.

This function will replace control characters and correct any residual repeating spaces. For example Jane Smith{13}was here will not be returned as Jane Smithwas here, but rather Jane Smith was here

CREATE FUNCTION [dbo].[udf-Str-Strip-Control](@S varchar(max))
Returns varchar(max)
Begin
    ;with  cte1(N) As (Select 1 From (Values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) N(N)),
           cte2(C) As (Select Top (32) Char(Row_Number() over (Order By (Select NULL))-1) From cte1 a,cte1 b)
    Select @S = Replace(@S,C,' ')
     From  cte2

    Return ltrim(rtrim(replace(replace(replace(@S,' ','†‡'),'‡†',''),'†‡',' ')))
End
--Select [dbo].[udf-Str-Strip-Control]('Michael        '+char(13)+char(10)+'LastName')  --Returns: Michael LastName
Pathological answered 31/3, 2017 at 20:25 Comment(11)
Thanks. I'll try this out tomorrow. I'm not sure I want to add spaces though since the control characters might be within a word.Zimmermann
@Zimmermann fair enough, then just change Select @S = Replace(@S,C,' ') to Select @S = Replace(@S,C,'') however, you run the risk of concatinating wordsPathological
@scsimon Can you blame me? When I first saw it, my first thought was "Meh..." then it it dawned on me... F'n brilliant !Pathological
Shoot no. I still remember that answer and was blown away, hence my remembering it now!Fordone
So, if I understood this correctly, cte1(N) creates a table of 10 1's, cte2 converts them into the char(0) to char(100) and picks the top 32, the select replaces all the control chars and the expression in the return statement converts all multiple spaces into a single space. Since I want to preserve the input as is, I returned @S without any changes. This is also very slightly faster than the nested replaces for me, so I'll use this. Thank you!Zimmermann
@Zimmermann SqlZim correctly pointed out that TVF are usually the best way to go, however, in this case there were just too many replace commands... diminishing returns ...Pathological
@ScottShaw-Smith While I respect the fact that you "own" the downvote, it does indeed work. OP accepted answer and and others upvoted. I don't believe it is worthy of a downvotePathological
@ScottShaw-Smith I see that you downvoted SQLZim's as well. Just curious... what did we miss?Pathological
This is great! I used this code to create a function in our DB :)Interlard
@JohnCappelletti why '†‡' and '‡†' are included?Newell
@Newell Just a unique / rare combination of characters to remove any number of repeating spacesPathological
S
9

In-line version:

create function [dbo].[remove_non_printable_chars] (@input_string nvarchar(max))
returns table with schemabinding as return (
  select 
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
    replace(replace(@input_string collate latin1_general_100_bin2,
        char(1), ''),char(2), ''),char(3), ''),char(4), ''),char(5), ''),char(6), ''),char(7), ''),char(8), ''),char(9), ''),char(10), ''),
        char(11), ''),char(12), ''),char(13), ''),char(14), ''),char(15), ''),char(16), ''),char(17), ''),char(18), ''),char(19), ''),char(20), ''),
        char(21), ''),char(22), ''),char(23), ''),char(24), ''),char(25), ''),char(26), ''),char(27), ''),char(28), ''),char(29), ''),char(30), ''),
        char(31), ''), char(0) , '') 
     as clean_string
);
go

And use it like so:

select c.clean_string
from dbo.remove_non_printable_chars(@dirtystring) c

or

select ...
  , c.clean_string
from t
  cross apply dbo.remove_non_printable_chars(t.dirty_string) c

Reference:

Sub answered 31/3, 2017 at 20:8 Comment(5)
Thanks...I'll wait for some other answers. If not, I'll just switch to a table valued function.Zimmermann
Looks like you got downvoted by the same guy that downvoted me. I wonder what he knows that we missed. :)Pathological
@JohnCappelletti I guess we'll never knowSub
Can you elaborate on the use of collate latin1_general_100_bin2? At first, I tried to omit it because my database has a different collation, but I noticed that for a query with ~3000 rows, the function is much faster with collate latin1_general_100_bin2 (2 seconds vs 30 seconds)...and I don't understand why.Quadrate
@ChristianSpecht I'm concerned about the query taking 30 seconds to run over 3,000 rows even without the collation switch. How big are these strings you're using? I wouldn't have expected that big of a difference. Guidelines for Using BIN and BIN2 Collations and Paul White's answer about binary collation performanceSub
S
2

Just extending the previous answers a little bit

Below the white-list characters, all the others chars will be cleared

[ !`"#$%&'()\*+,\-\./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\[\]^``\\_abcdefghijklmnopqrstuvwxyz{|}~µº°¡¢£¤¥¦§¨©ª«¬­®¯±²³´¶·¸¹»¼½¾¿×÷ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]

I know it's ugly but it works.

CREATE FUNCTION [dbo].REPLACE_UNPRINT_CHARS(@VarString nvarchar(256))  
RETURNS nvarchar(256) 
AS    
BEGIN  
    RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@VarString, CHAR(0), ''), CHAR(1), ''), CHAR(2), ''), CHAR(3), ''), CHAR(4), ''), CHAR(5), ''), CHAR(6), ''), CHAR(7), ''), CHAR(8), ''), CHAR(9), ''), CHAR(10), ''), CHAR(11), ''), CHAR(12), ''), CHAR(13), ''), CHAR(14), ''), CHAR(15), ''), CHAR(16), ''), CHAR(17), ''), CHAR(18), ''), CHAR(19), ''), CHAR(20), ''), CHAR(21), ''), CHAR(22), ''), CHAR(23), ''), CHAR(24), ''), CHAR(25), ''), CHAR(26), ''), CHAR(27), ''), CHAR(28), ''), CHAR(29), ''), CHAR(30), ''), CHAR(31), ''), CHAR(127), ''), CHAR(128), ''), CHAR(129), ''), CHAR(130), ''), CHAR(131), ''), CHAR(132), ''), CHAR(133), ''), CHAR(134), ''), CHAR(135), ''), CHAR(136), ''), CHAR(137), ''), CHAR(138), ''), CHAR(139), ''), CHAR(140), ''), CHAR(141), ''), CHAR(142), ''), CHAR(143), ''), CHAR(144), ''), CHAR(145), ''), CHAR(146), ''), CHAR(147), ''), CHAR(148), ''), CHAR(149), ''), CHAR(150), ''), CHAR(151), ''), CHAR(152), ''), CHAR(153), ''), CHAR(154), ''), CHAR(155), ''), CHAR(156), ''), CHAR(157), ''), CHAR(158), ''), CHAR(159), ''), CHAR(160), ''); 
END;

Use for data clean up

UPDATE [dnName].[dbo].[tableName] 
SET FieldDirtyData= dbo.REPLACE_UNPRINT_CHARS(FieldDirtyData)
WHERE PATINDEX('%['+CHAR(1)+CHAR(2)+CHAR(3)+CHAR(4)+CHAR(5)+CHAR(6)+CHAR(7)+CHAR(8)+CHAR(9)+CHAR(10)+CHAR(11)+CHAR(12)+
CHAR(13)+CHAR(14)+CHAR(15)+CHAR(16)+CHAR(17)+CHAR(18)+CHAR(19)+CHAR(20)+
CHAR(21)+CHAR(22)+CHAR(23)+CHAR(24)+CHAR(25)+CHAR(26)+CHAR(27)+CHAR(28)+CHAR(29)+CHAR(30)+CHAR(31)+CHAR(127)+
CHAR(128)+CHAR(129)+CHAR(130)+CHAR(131)+CHAR(132)+CHAR(133)+CHAR(134)+CHAR(135)+CHAR(136)+CHAR(137)+CHAR(138)+
CHAR(139)+CHAR(140)+CHAR(141)+CHAR(142)+CHAR(143)+CHAR(144)+CHAR(145)+CHAR(146)+CHAR(147)+CHAR(148)+CHAR(149)+CHAR(150)+
CHAR(151)+CHAR(152)+CHAR(153)+CHAR(154)+CHAR(155)+CHAR(156)+CHAR(157)+CHAR(158)+CHAR(159)+CHAR(160)+']%', FieldDirtyData) <> 0

Adjust your datatype (nvarchar or varchar + max) as required

If you want to add more chars to clear use "select ASCII('char to remove here')" MSSQL command in order to get the ASCII code of the char and put it inside the replace instruction

i.g SELECT ASCII('¢') returns 162

so add one more "REPLACE(" after "RETURN" and "CHAR(162), '')" at the end of line but before the ";" sign.

Soubise answered 4/1, 2019 at 17:22 Comment(1)
Now let's just hope that only users of Latin alphabet-based languages ever use the application.Hibbitts
M
2

Here is a similar answer of the previous inline-table valued function answer (https://mcmap.net/q/804203/-sql-server-remove-all-non-printable-ascii-characters) for this question. The primary change is this uses a patindex guard first and is much faster when only a small percentage of the rows contain characters that need to be replaced.

Thus the two huge improvements from the original scalar function:

  • Use an inline table-valued function. This is much faster as it allows SQL Server to directly in-line the code in the query plan. I try to avoid scalar functions in queries designed to scale, as a normal scalar function can be a huge performance sap (even with schemabinding) and prevent optimizations such as parallelism.

  • Use patindex for an initial guard check. This changes the number of characters SQL must exmine when there are no control characters to replace from O(n * num_replace_calls) to ~O(n). Since most of the data data (in my case) contains no control characters, this can result in a huge performance increase.

-- Only accepts VARCHAR(8000) to avoid a conversion to VARCHAR(MAX);
-- use the suitable input type, which might even be NVARCHAR(MAX).
CREATE FUNCTION DropControlCharactersTv(@str VARCHAR(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS
    RETURN SELECT CleanedString = CASE
    -- No-op.
    WHEN @str IS NULL or @str = '' THEN @str
    -- If any of the non-wanted characters are found then go through the string and replace each occurrence of every character.
    -- This guard significantly improves the performance when very few strings need to be corrected.
    WHEN PATINDEX (
        '%[' + CHAR(0) + CHAR(1) + CHAR(2) + CHAR(3) + CHAR(4) + CHAR(5) + CHAR(6) + CHAR(7) + CHAR(8) + CHAR(11) + CHAR(12) + CHAR(14) + CHAR(15) + CHAR(16) + CHAR(17) + CHAR(18) + CHAR(19) + CHAR(20) + CHAR(21) + CHAR(22) + CHAR(23) + CHAR(24) + CHAR(25) + CHAR(26) + CHAR(27) + CHAR(28) + CHAR(29) + CHAR(30) + CHAR(31) + CHAR(127) + ']%',
        @str COLLATE Latin1_General_BIN) <> 0 THEN
        -- Replace, nested.
        -- See https://www.sqlservercentral.com/forums/topic/how-to-remove-characters-char0-to-char31
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        REPLACE(
        @str
        ,CHAR(0),'') COLLATE Latin1_General_BIN
        ,CHAR(1),'') COLLATE Latin1_General_BIN
        ,CHAR(2),'') COLLATE Latin1_General_BIN
        ,CHAR(3),'') COLLATE Latin1_General_BIN
        ,CHAR(4),'') COLLATE Latin1_General_BIN
        ,CHAR(5),'') COLLATE Latin1_General_BIN
        ,CHAR(6),'') COLLATE Latin1_General_BIN
        ,CHAR(7),'') COLLATE Latin1_General_BIN
        ,CHAR(8),'') COLLATE Latin1_General_BIN
        ,CHAR(9),'') COLLATE Latin1_General_BIN
        ,CHAR(10),'') COLLATE Latin1_General_BIN
        ,CHAR(11),'') COLLATE Latin1_General_BIN
        ,CHAR(12),'') COLLATE Latin1_General_BIN
        ,CHAR(13),'') COLLATE Latin1_General_BIN
        ,CHAR(14),'') COLLATE Latin1_General_BIN
        ,CHAR(15),'') COLLATE Latin1_General_BIN
        ,CHAR(16),'') COLLATE Latin1_General_BIN
        ,CHAR(17),'') COLLATE Latin1_General_BIN
        ,CHAR(18),'') COLLATE Latin1_General_BIN
        ,CHAR(19),'') COLLATE Latin1_General_BIN
        ,CHAR(20),'') COLLATE Latin1_General_BIN
        ,CHAR(21),'') COLLATE Latin1_General_BIN
        ,CHAR(22),'') COLLATE Latin1_General_BIN
        ,CHAR(23),'') COLLATE Latin1_General_BIN
        ,CHAR(24),'') COLLATE Latin1_General_BIN
        ,CHAR(25),'') COLLATE Latin1_General_BIN
        ,CHAR(26),'') COLLATE Latin1_General_BIN
        ,CHAR(27),'') COLLATE Latin1_General_BIN
        ,CHAR(28),'') COLLATE Latin1_General_BIN
        ,CHAR(29),'') COLLATE Latin1_General_BIN
        ,CHAR(30),'') COLLATE Latin1_General_BIN
        ,CHAR(31),'') COLLATE Latin1_General_BIN
        ,CHAR(127),'') COLLATE Latin1_General_BIN
    -- Did not match pattern: inherently valid
    ELSE @str END

In a query:

select
    Plucker.CleanedString
from Goose d
cross apply DropControlCharactersTv(d.Turkey) as Plucker
Modillion answered 21/6, 2021 at 19:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.