How to fix the embedded text qualifier issue while exporting data to CSV flat file?
Asked Answered
F

3

63

###RFC 4180:

RFC 4180 defines Common Format and MIME Type for Comma-Separated Values (CSV) Files. One of the requirements of the RFC 4180 is stated as below. This is the point #7 in the RFC link.

If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote.  For example:

"aaa","b""bb","ccc"

###SQL Server 2000:

DTS Export/Import Wizard in SQL Server 2000 seems to conform to the above mentioned standards even though the RFC 4180 itself seem to have been published only on October 2005. I am using the below stated SQL Server 2000 version.

Microsoft SQL Server  2000 - 8.00.2039 (Intel X86) 
May  3 2005 23:18:38 
Copyright (c) 1988-2003 Microsoft Corporation
Standard Edition on Windows NT 5.0 (Build 2195: Service Pack 4)

###SQL Server 2012:

SQL Server Import and Export Wizard in SQL Server 2012 does not export the data from table to CSV file according to the standard defined in RFC 4180. I am using the below stated SQL Server 2012 version.

Microsoft SQL Server 2012 - 11.0.2316.0 (X64) 
Apr  6 2012 03:20:55 
Copyright (c) Microsoft Corporation
Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1) (Hypervisor)

###Issue Simulation:

Here is a sample that I ran in both SQL Server 2000 and SQL Server 2012. I ran the below query to create a table and insert few records. The ItemDesc column has data with double-quotes in it. My intention is to export the data from both these SQL Server versions using their in-built export data wizard and compare the generated CSV files.

CREATE TABLE dbo.ItemInformation(
    ItemId nvarchar(20) NOT NULL,
    ItemDesc nvarchar(100) NOT NULL
) 
GO

INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('100338754', 'Crown Bolt 3/8"-16 x 1" Stainless-Steel Hex Bolt');
INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('202255836', 'Simpson Strong-Tie 5/8" SSTB Anchot Bolt');
INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('100171631', 'Grip-Rite #11 x 1-1/2" Electro-Galvanized Steel Roofing Nails');
INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('202210289', 'Crown Bolt 1/2" x 3" "Zinc-Plated" Universal Clevis Pin');
INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('100136988', 'Tapcon 3/16" x 1-3/4" Climaseal Steel "Flat-Head" Phillips Concrete Anchors (75-Pack)');
INSERT INTO dbo.ItemInformation (ItemId, ItemDesc) VALUES ('203722101', 'KwikTap 3/16" x 2-1/4" "Flat-Head" Concrete Screws (100-Pack)');
GO

On the DTS Export/Import Wizard in SQL Server 2000, I used the below settings to export the data to CSV file. I saved the file under the name SQLServer2000_ItemInformation.csv.

DTS Export/Import Wizard

On the SQL Server Import and Export Wizard in SQL Server 2012, I used the below settings to export the data to CSV file. I saved the file under the name SQLServer2012_ItemInformation.csv.

SQL Server Import and Export Wizard - Choose a destination

SQL Server Import and Export Wizard - Configure Flat File Destination

Here is the comparison between the two files using Beyond Compare. The left side contains the file generated by SQL Server 2000 and the right side contains the file generated by SQL Server 2012. You can notice that the left side file from SQL Server 2000 contains additional double-quotes to compensate the embedded quotes in the data column. This conforms to the standard specified in RFC 4180 but it is clearly missing from the file generated by SQL Server 2012

File Comparison

###Searches on the web:

I searched for this bug on the web and found the following links. Following are the bug reports on Microsoft Connect. All these issues seem to be related to importing a file but nothing about exporting data. All these bugs have been closed as Fixed.

Below post on MSDN blog states that changes have been made in SQL Server 2012 with respect to Flat file source supports embedded qualifiers and a variable number of columns per row

Another post on MSDN blog states the same under the section Embedded Qualifiers.

###Workaround that I know of:

I know a workaround to fix the issue by writing a query that would replace all double-quotes (") in my column data with two double-quotes ("") so that the exported file will end up with correct embedded qualifier data. This would avoid pulling the data directly from the table as it is.

###My questions:

  • I don't know if this issue has been truly fixed in SQL Server 2012. Has this issue been fixed only for importing files that have embedded text qualifiers and not for exporting data to CSV?

  • Probably, I am clearly doing something wrong and missing the obvious. Could someone please explain to me what I am doing wrong here?

###Microsoft Connect:

I have submitted a bug report on Microsoft Connect website to get their feedback. Here is the link to the bug report. If you agree that this is a bug, please visit the below link to vote up on Microsoft Connect website.

Embedded text qualifier during export to CSV does not conform to RFC 4180

Future answered 11/2, 2013 at 15:45 Comment(5)
I spent some time researching and reproducing and it does appear to simply be a bug.Metamorphose
looks like a bug to me as well. I didn't realize a rfc had been release but it's a while since I parsed csv. I always took ", and "\n as the closing for a block and accepted any " or "" in the field. Not that helps you though :(Indue
Sadly, the BI tools for SQL Server don't seem to get a lot of development attention. I've had a lot of grief with embedded delimiters in the past, and switching to | (pipe) as a delimiter standard for our import/export data has alleviated a lot of that - you may be able to do something similar. My experience with connect has been that if you're not able to get a few thousand people to vote up your bug, it's not going to go anywhere. The other option is opening a support incident, but that's still iffy, and even in the best scenario you may still have to wait months for a resolution.Liv
An observation: While it is certainly good for players in the industry to move towards common accepted best practices as outlined in the RFC you refer to above. However, just because that exists, does not mean any given vendor follows that convention. Maybe they should and do, and thus they would likely denote that 'our product supports RFC 4180 conventions for CSV files' which would make many of us much happier when having to deal with CSV files. Too many vendors do it however they feel inclined still as your research shows.Modestamodeste
why dont you use some other software to export?Neper
M
8

I wouldn't offer this answer except that you worked so hard to document it and it's been upvoted with no answer after a month. So, here goes. Your only choices appear to be to change the data or change the tool.

Probably, I am clearly doing something wrong and missing the obvious. Could someone please explain to me what I am doing wrong here?

When the tool is broken and the vendor doesn't care, it's mistake to keep trying. It's time to switch. You put a lot of effort into researching exactly how it's broken and demonstrating it violates not only the RFC but the tool's own prior version. How much more evidence do you need?

CSV is a boat anchor too. If you have the option, you're better off using an ordinary delimited file format. For lots of applications, tab-delimited is good. The best delimiter IMO is '\' because that character has no place in English text. (On the other hand it won't work for data containing Windows pathnames.)

CSV has two problems as an exchange format. First, it's not all that standard; different applications recognize different versions, whatever the RFC may say. Second (and related) is that it doesn't constitute a regular language in CS terms, which is why it can't be parsed as a regular expression. Compare with ^([^\t]*\t)*[\t]*$ for a tab-delimited line. The practical implication of the complexity of CSV's definition is (see above) the relative dearth of tools to handle them and their tendency to be incompatible, particularly during the wee hours.

If you give CSV and DTS the boot, you have good options, one of which is bcp.exe. It's very fast, and safe because Microsoft hasn't been tempted to update it for years. I don't know much about DTS, but in case you have to use it for automation, IIRC there is a way to invoke external utilities. Beware though, that bcp.exe does not return error status to the shell dependably.

If you're determined to use DTS and to stick with CSV, then really your best remaining option is to write a view that prepares the data appropriately for it. I would, if backed into that corner, create a schema called, say, "DTS2012CSV", so that I could write select * from DTS2012CSV.tablename, giving anyone who cares a fighting chance to understand it (because you'll document it, won't you, in comments in the view text?). If need be, others can copy its technique for other broken extracts.

HTH.

Marentic answered 14/3, 2013 at 0:42 Comment(2)
Thanks for your response. I am aware that this particular functionality was broken in SSIS 2005 and 2008 R2. I read in all the above links that this has been fixed in SSIS 2012 but I found no evidence of it. It looks like that Microsoft fixed importing files, which was a known issue too, but still haven't fixed exporting. I posted it here to find if I had overlooked anything obvious. The issue is not with DTS but with SSIS. No, they are not same. SSIS is much different than DTS and a lot better in my opinion.Future
| (pipe) character is a better delimiter IMO. It's hardly ever used anywhere in data.. in that sense it's rarer to find a | in data than it is to find a `` :)Choragus
G
2

I know this is two years old, but I am also now having this issue, as we need to use SQL Server 2008 for a contract we have (don't ask). After reading through this question, I realized I needed to do the replace suggestion, but when I went to do it in the query, I ran into truncation issues, because using the replace() function in the query itself would convert the text to a varchar(8000) by default.

However, I discovered I could do the same thing using a Derived Column step in between the DB Source and Flat File objects. For example, I have a column named "short_description," that could have quotes in it, so I just used the following function as the expression, and selected "Replace short_description" in the Derived Column:

REPLACE(short_description,"\"","\"\"")

This seems to have solved the issue for me.

Galla answered 5/8, 2015 at 14:37 Comment(1)
Furthermore, if your source column is a textstream, you actually need to use a script component to transform the bytes of the textsteam into an interim string variable, do the replace, then transform the string back to bytes for a textstream output. This site helped: mscrmtech.com/…Galla
G
0

Often the first and last name is in the same field and formatted (Last, First). This needs to be text qualified if you're using Tasks->Export Data right off the database (not via SSIS where you have more options) and you need to export to CSV as comma-delimited file.

This will help in your non-null selected fields that need double quoting...

CASE WHEN NOT PersonName IS NULL AND LEN(PersonName) > 0 THEN QUOTENAME(PersonName, '"') ELSE NULL END as 'PersonName'

Result:

PersonName

"COLLINS, ZACKERY E"

Gish answered 6/6, 2018 at 20:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.