What's the best way to identify hidden characters in the result of a query in SQL Server (Query Analyzer)?
Asked Answered
K

6

48

When trying to identify erroneous data (often needing manual review and removal), I'd like an easy way of seeing hidden characters, such as TAB, Space, Carriage return and Line feed. Is there a built-in way for this?

In a similar question here on stackoverflow, regarding Oracle, a DUMP(fieldname) function was suggested, but I don't know if that woud make things easier even if a corresponding function would exist in SQL Server, since I need to see the Characters in their context.

The best idea I could come up with was replacing the expected hidden characters with visible ones, like this:

SELECT REPLACE(REPLACE(REPLACE(REPLACE(myfield, ' ', '˙'), CHAR(13), '[CR]'), CHAR(10), '[LF]'), CHAR(9), '[TAB]') FROM mytable

Is there a better way? I don't like this way since there might be other less common hidden characters that are not taken into account by me such as vertical TAB etc... Turning on "show hidden characters", as you can do in almost any text editor, would be such a nice feature in SQL Server Query Analyzer, so I almost expect that it can be done somehow in SQL server as well... or at least that someone has an even better idea than mine, to show this kind of white space info.

I just noticed that there is a built-in way to see "white space", not in SQL Query Analyzer, but in the part of the interface that once was the SQL Enterprise manager. Right-click a table in SQL Management Studio Object Explorer tree, and select "Edit top 200 rows". In the result white space (at least CR LF) is visible as empty squares.

Knotting answered 28/12, 2011 at 12:26 Comment(2)
Do you have a finite set of "good" characters? Can you use CLR to use a REGEX substitute?Duckbill
The hard part is that the "bad" characters could sometimes be allowed. That's why I need to observe all hidden characters in their textual contexts. The REPLACE code above actually achieves what I need for the moment (for this customer I'm fairly sure that there are no other white space characters than the ones hard-coded in the Replace above), but I'm was looking for a simpler, more omnipotent "show hidden characters" option, that will work on all occasions.Knotting
G
62

Create a function that addresses all the whitespace possibilites and enable only those that seem appropriate:

SELECT dbo.ShowWhiteSpace(myfield) from mytable

Uncomment only those whitespace cases you want to test. For example:


CREATE FUNCTION dbo.ShowWhiteSpace (@str varchar(8000))
RETURNS varchar(8000)
AS
BEGIN
     DECLARE @ShowWhiteSpace varchar(8000);
     SET @ShowWhiteSpace = @str
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(32), '[?]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(13), '[CR]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(10), '[LF]')
     SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(9),  '[TAB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(1),  '[SOH]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(2),  '[STX]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(3),  '[ETX]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(4),  '[EOT]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(5),  '[ENQ]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(6),  '[ACK]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(7),  '[BEL]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(8),  '[BS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(11), '[VT]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(12), '[FF]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(14), '[SO]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(15), '[SI]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(16), '[DLE]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(17), '[DC1]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(18), '[DC2]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(19), '[DC3]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(20), '[DC4]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(21), '[NAK]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(22), '[SYN]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(23), '[ETB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(24), '[CAN]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(25), '[EM]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(26), '[SUB]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(27), '[ESC]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(28), '[FS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(29), '[GS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(30), '[RS]')
--   SET @ShowWhiteSpace = REPLACE( @ShowWhiteSpace, CHAR(31), '[US]')
     RETURN(@ShowWhiteSpace)
END
Gentoo answered 28/12, 2011 at 15:3 Comment(2)
Uhm, does this support character 202C? I guess not.Tillage
You should include CHAR(0) [NUL] as well.Parsee
D
16

They way I did it was by selecting all of the data

select * from myTable and then right-clicking on the result set and chose "Save results as..." a csv file.

Opening the csv file in Notepad++ I saw the LF characters not visible in SQL Server result set.

Dressy answered 5/3, 2014 at 10:56 Comment(1)
best way ! just make sure you have newline CRLF visible enabledTarazi
D
11

To find them, you can use this

;WITH cte AS
(
   SELECT 0 AS CharCode
   UNION ALL
   SELECT CharCode + 1 FROM cte WHERE CharCode <31
)
SELECT
   *
FROM
   mytable T
     cross join cte
WHERE
   EXISTS (SELECT *
        FROM mytable Tx
        WHERE Tx.PKCol = T.PKCol
             AND
              Tx.MyField LIKE '%' + CHAR(cte.CharCode) + '%'
         )

Replacing the EXISTS with a JOIN will allow you to REPLACE them, but you'll get multiple rows... I can't think of a way around that...

Duckbill answered 28/12, 2011 at 12:44 Comment(6)
Ah, yes, this is indeed a nice way to find characters with low char numbers!Knotting
This doesn't appear to be valid SQLNope
@paulwhit: in what way?Duckbill
If you paste this into SQL Server Management Studio, it does not run. It balks on cte.Charcode. If I include Tx, cte in the FROM clause of the subquery, it will run but I'm not sure it's doing the same thing.Nope
I believe there needs to be a cross join added for this to work. Inside the final/innermost select statement, add this line after the FROM but before the WHERE: cross join cteRoshan
Edited to include the cross join cte. This is a really useful query.Goshorn
K
6

You can always use the DATALENGTH Function to determine if you have extra white space characters in text fields. This won't make the text visible but will show you where there are extra white space characters.

    SELECT DATALENGTH('MyTextData ') AS BinaryLength, LEN('MyTextData ') AS TextLength

This will produce 11 for BinaryLength and 10 for TextLength.

In a table your SQL would like this:

    SELECT * 
    FROM tblA
    WHERE DATALENGTH(MyTextField) > LEN(MyTextField)

This function is usable in all versions of SQL Server beginning with 2005.

Kronfeld answered 14/9, 2017 at 14:23 Comment(1)
FYI - this doesn't work for NVARCHAR fields. SELECT DATALENGTH(N'MyTextData ') AS BinaryLength, LEN(N'MyTextData ') AS TextLengthViehmann
W
2
select myfield, CAST(myfield as varbinary(max)) ...
Wharfinger answered 28/12, 2011 at 12:40 Comment(4)
Doesn't meet the "I need to see the Characters in their context" requirement.Bulletproof
I don't have any better ideas BTW. Unless there are any fonts that show glyphs for these.Bulletproof
@oleg: The varbinary thing is probably SQL Server's equivalent to the DUMP function of Oracle, but I find no good way to make use of this when ocularly reviewing records.Knotting
@Martin Smith: A font with glyph substitutes for all invisible characters would be splendid - then we would be able to view a result just as in a text editor - only that I'm not sure there is such a font.Knotting
P
0

I have faced the same problem with a character that I never managed to match with a where query - CHARINDEX, LIKE, REPLACE, etc. did not work. Then I have used a brute force solution which is awful, heavy but works:

Step 1: make a copy of the complete data set - keep track of the original names with an source_id referencing the pk of the source table (and keep this source id in all the subsequent tables). Step 2: LTRIM RTRIM the data, and replace all double spaces, tab, etc (basically all the CHAR(1) to CHAR(32) by one space. Lowercase the whole set as well. Step 3: replace all the special characters that you know (get the list of all the quotes, double quotes, etc.) by something from a-z (I suggest z). Basically replace everything that is not standard English characters by a z (using nested REPLACE of REPLACE in a loop). Step 4: split by word into a second copy, where each word is in a separate row - the split is a SUBSTRING based on the position of the space characters - at this point, we should miss the ones where there's a hidden space that we did not catche earlier. Step 5: split each word into a third copy, where each letter is in a separate row (I know it makes a very large table) - keep track of the charindex of each letter in a separate column. Step 6: Select everything in the above table which is not LIKE [a-z]. This is the list of the unidentified characters we want to exclude.

From the output of step 6 we have enough data to make a series of substring of the source to select everything but the unknown character we want to exclude.

Note 1: there are smart ways to optimize this, depending on the size of the original expression (steps 4, 5 and 6 can be made in one go).

Note 2: this is not very fast, but the fastest way to get this done for a large data set, because the split of lines into words and words into letters is made by substring, which slices all the table into one character slices. However, this is quite heavy to build. With a smaller set, it may be enough to parse each record one by one and search for character which is not in a list of all English characters plus all special characters.

Pavlish answered 15/1, 2018 at 22:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.