How can I find Unicode/non-ASCII characters in an NTEXT field in a SQL Server 2005 table?
Asked Answered
L

9

44

I have a table with a couple thousand rows. The description and summary fields are NTEXT, and sometimes have non-ASCII chars in them. How can I locate all of the rows with non ASCII characters?

Lapointe answered 26/3, 2009 at 18:15 Comment(0)
G
67

I have sometimes been using this "cast" statement to find "strange" chars

select 
    *
from 
    <Table>
where 
    <Field> != cast(<Field> as varchar(1000))
Gramophone answered 21/4, 2010 at 13:54 Comment(5)
This worked for me and was much faster then RBarryYoung's solution.Crew
It doesn't work if the collation is set to an "non-ASCII" collation.Carycaryatid
Excellent & simple solution :-)Backpedal
@Gramophone I'm getting the error The data types ntext and nvarchar(max) are incompatible in the not equal to operator. Ideas? - Using SQL Server 2005Funky
Very helpful if you need Unicode but non-ASCII characters; I've also been dealing with some Latvian which isn't contained in ISO 8859-1Tow
M
18

First build a string with all the characters you're not interested in (the example uses the 0x20 - 0x7F range, or 7 bits without the control characters.) Each character is prefixed with |, for use in the escape clause later.

-- Start with tab, line feed, carriage return
declare @str varchar(1024)
set @str = '|' + char(9) + '|' + char(10) + '|' + char(13)

-- Add all normal ASCII characters (32 -> 127)
declare @i int
set @i = 32
while @i <= 127
    begin
    -- Uses | to escape, could be any character
    set @str = @str + '|' + char(@i)
    set @i = @i + 1
    end

The next snippet searches for any character that is not in the list. The % matches 0 or more characters. The [] matches one of the characters inside the [], for example [abc] would match either a, b or c. The ^ negates the list, for example [^abc] would match anything that's not a, b, or c.

select *
from yourtable
where yourfield like '%[^' + @str + ']%' escape '|'

The escape character is required because otherwise searching for characters like ], % or _ would mess up the LIKE expression.

Hope this is useful, and thanks to JohnFX's comment on the other answer.

Mungovan answered 26/3, 2009 at 18:36 Comment(3)
You may want to add a few(or all) of the characters below 32 as well, especially important would be Carriage Return(13), Line Feed (10), and Tab (9).Ormolu
I found your technique to be the most reliable way to do this in SQL-Server.Torytoryism
We combined this with https://mcmap.net/q/99404/-is-there-a-way-to-loop-through-a-table-variable-in-tsql-without-using-a-cursor and decisivedata.net/blog/… to quickly identify data with invalid characters with the row id, position and unicode value. Thank you!Predispose
P
11

Here ya go:

SELECT *
FROM Objects
WHERE 
    ObjectKey LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'
Prevocalic answered 5/11, 2015 at 23:33 Comment(1)
I think this is the best solution. Elegant and performant! Thank you!Glassine
O
4

It's probably not the best solution, but maybe a query like:

SELECT *
FROM yourTable
WHERE yourTable.yourColumn LIKE '%[^0-9a-zA-Z]%'

Replace the "0-9a-zA-Z" expression with something that captures the full ASCII set (or a subset that your data contains).

Ormolu answered 26/3, 2009 at 18:36 Comment(5)
Wouldn't this just match rows that contain any ASCII character, as opposed only ASCII characters?Mungovan
The ^ marker at the front of the expression means NOT, so no. It would get any row that had at least one character that wasn't in the ranges specified.Perfunctory
How can I put the full ascii set in that expression? it's HTML data that I'm looking at so "/><' etc... is in there.Lapointe
The answer I placed checks against the full ascii set, it should work with >/< because those get escaped.Mungovan
@Lapointe - you would need to add the characters to the list (eg, '%[^0-9a-zA-Z<>/"'']%'). Andomar's solution programmatically builds a complete list, you could use that.Ormolu
A
3

Technically, I believe that an NCHAR(1) is a valid ASCII character IF & Only IF UNICODE(@NChar) < 256 and ASCII(@NChar) = UNICODE(@NChar) though that may not be exactly what you intended. Therefore this would be a correct solution:

;With cteNumbers as
(
    Select ROW_NUMBER() Over(Order By c1.object_id) as N
    From sys.system_columns c1, sys.system_columns c2
)
Select Distinct RowID
From YourTable t
    Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))
Where UNICODE(Substring(TXT, n.N, 1)) > 255
    OR UNICODE(Substring(TXT, n.N, 1)) <> ASCII(Substring(TXT, n.N, 1))

This should also be very fast.

Adjacency answered 2/6, 2009 at 20:55 Comment(8)
ASCII is only up to 127. Also your numbers cte is weird - the final solution should use a preexisting numbers table instead of it. Otherwise, this is how I would do it.Decal
FYI: "Not appearing weird" is not a criteria that I (or most of the uber-geeks on this site) really care about. And the advantage of not using a pre-existing numbers table, is that it works even if said table does not pre-exist (which it usually does not). Finally, if you test my solution, you will find that it performs comparable to and sometimes even better than a pre-existing numbers table.Adjacency
I don't really want to argue - I mostly agree with your solution. But "weird" is usually hard to read and unmaintainable. That's how I meant it. It's also not DRY if you ever have two different queries that need numbers - you have to write the unmaintainable cte twice. Reading row numbers from a join of system_columns on itself is not very straightforward, but reading numbers from a numbers table would be very easy to debug.Decal
I tried the above but found it to be much slower then CC1960's answer (5:34 compared to CC1960's 0:09). I did rewrite the query to move the "with" table inline: SELECT DISTINCT <RowID> FROM <Table> t JOIN (SELECT ROW_NUMBER() OVER(ORDER BY c1.object_id) as rNum From sys.system_columns c1, sys.system_columns c2) AS n ON n.rNum <= Len(<Field>) Where UNICODE(Substring(<Field>, n.rNum, 1)) > 255 OR UNICODE(Substring(<Field>, n.rNum, 1)) <> ASCII(Substring(<Field>, n.rNum, 1))Crew
I tried using the original query and found the same results (4:42). It should be noted that the original query does not work in SQL 2008 R2 since object_id is not resolved. It should also be noted that removing UNICODE(Substring(<Field>, n.rNum, 1)) > 255 OR from the where statement of my previous query resulted in a very small time reduction (5:12 vs 5:34).Crew
My answer is slower because it is correct. That's usually a pre-requisite for performance testing. So if you could, please correct or withdraw your false claims about my solution in this thread.Adjacency
What's interesting though is that you went out or your way to point out a minor typographical error in my query (which I usually appreciate), but did not bother to point out that the solution that you are praising will not compile or execute in any version of SQL Server. At least not for the Question stated in this thread. Makes me think that you are not exactly either qualified or objective in this testing.Adjacency
Perfect, just what I needed. Thank you very much!! A small typo: "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX))", missing a closing bracket. Needs to be "Join cteNumbers n ON n <= Len(CAST(TXT As NVarchar(MAX)))"Goop
B
2

I started with @CC1960's solution but found an interesting use case that caused it to fail. It seems that SQL Server will equate certain Unicode characters to their non-Unicode approximations. For example, SQL Server considers the Unicode character "fullwidth comma" (http://www.fileformat.info/info/unicode/char/ff0c/index.htm) the same as a standard ASCII comma when compared in a WHERE clause.

To get around this, have SQL Server compare the strings as binary. But remember, nvarchar and varchar binaries don't match up (16-bit vs 8-bit), so you need to convert your varchar back up to nvarchar again before doing the binary comparison:

select *
from my_table
where CONVERT(binary(5000),my_table.my_column) != CONVERT(binary(5000),CONVERT(nvarchar(1000),CONVERT(varchar(1000),my_table.my_column)))
Biolysis answered 17/4, 2013 at 2:15 Comment(0)
K
1

If you are looking for a specific unicode character, you might use something like below.

   select  Fieldname from 
     (
      select Fieldname,
             REPLACE(Fieldname COLLATE Latin1_General_BIN,
             NCHAR(65533) COLLATE Latin1_General_BIN,
             'CustomText123') replacedcol
      from table
     ) results where results.replacedcol like '%CustomText123%'
Knowle answered 22/9, 2015 at 15:6 Comment(0)
C
0

My previous answer was confusing UNICODE/non-UNICODE data. Here is a solution that should work for all situations, although I'm still running into some anomalies. It seems like certain non-ASCII unicode characters for superscript characters are being confused with the actual number character. You might be able to play around with collations to get around that.

Hopefully you already have a numbers table in your database (they can be very useful), but just in case I've included the code to partially fill that as well.

You also might need to play around with the numeric range, since unicode characters can go beyond 255.

CREATE TABLE dbo.Numbers
(
    number  INT NOT NULL,
    CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (number)
)
GO
DECLARE @i INT

SET @i = 0

WHILE @i < 1000
BEGIN
    INSERT INTO dbo.Numbers (number) VALUES (@i)

    SET @i = @i + 1
END
GO

SELECT *,
    T.ID, N.number, N'%' + NCHAR(N.number) + N'%'
FROM
    dbo.Numbers N
INNER JOIN dbo.My_Table T ON
    T.description LIKE N'%' + NCHAR(N.number) + N'%' OR
    T.summary LIKE N'%' + NCHAR(N.number) + N'%'
and t.id = 1
WHERE
    N.number BETWEEN 127 AND 255
ORDER BY
    T.id, N.number
GO
Chicky answered 26/3, 2009 at 18:52 Comment(3)
The way I understand it, ASCII is 7 bit and varchar is 8 bit. So varchar can still store a lot of characters that aren't ascii, like ä or é.Mungovan
Extended ASCII is 8 bit, which is what some people are referring to when they say "ASCII". I'll edit the post to limit to normal ASCII as well.Chicky
THis won't work for the % or _ characters? And isn't an inner join slower than a LIKE statement (like in my answer)?Mungovan
F
-1

-- This is a very, very inefficient way of doing it but should be OK for -- small tables. It uses an auxiliary table of numbers as per Itzik Ben-Gan and simply -- looks for characters with bit 7 set.

SELECT  *
FROM    yourTable as t
WHERE   EXISTS ( SELECT *
                 FROM   msdb..Nums as NaturalNumbers
                 WHERE  NaturalNumbers.n < LEN(t.string_column)
                        AND ASCII(SUBSTRING(t.string_column, NaturalNumbers.n, 1)) > 127)  
Fatback answered 28/5, 2009 at 15:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.