How to properly escape a double quote in CSV?
Asked Answered
E

6

314

I have a line like this in my CSV:

"Samsung U600 24"","10000003409","1","10000003427"

Quote next to 24 is used to express inches, while the quote just next to that quote closes the field. I'm reading the line with fgetcsv but the parser makes a mistake and reads the value as:

Samsung U600 24",10000003409"

I tried putting a backslash before the inches quote, but then I just get a backslash in the name:

Samsung U600 24\"

Is there a way to properly escape this in the CSV, so that the value would be Samsung U600 24" , or do I have to regex it in the processor?

Elielia answered 23/7, 2013 at 11:10 Comment(0)
B
534

Use 2 quotes:

"Samsung U600 24"""

RFC-4180 says "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."

Bichloride answered 23/7, 2013 at 11:19 Comment(8)
RFC-4180, paragraph "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."Jasminejason
As tommed says you only need to add a single double quote to escape a double quote. You can use a command-line tool called csvfix to detect any lines which don't conform: csvfix check -nl -v [filename]Aristophanes
@SamCritchley I only see a single double quote being used to escape here. By "Use 2 quotes", user4035 means that 1 quote should be replaced with 2 quotes. By escaping double quotes with double quotes, you're effectively creating pairs of double quotes (2 double quotes). The final quote you see on the end is to terminate the field.Exultant
Really nice approach. But this requires me to modify the CSV file before parsing.Shostakovich
if my csv string like: world,",helloHomebred
@Homebred Try this: "world,"",hello"Bichloride
@codemicky that basically because your original file was not a properly formatted CSV. If you need to make that file work using an RFC-4180-compliant parser, you could try setting the quote character to something other than " (a null character works well) so that the parser will give you the unprocessed string "Samsung U600 24"", and then you could do the work to unquote it yourself.Rode
If quoting an HTML string for instance: "<a href=""#somewhere"" class=""something"">my link has some ""quoted"" text</a>"Vena
P
7

Not only double quotes, you will be in need for single quote ('), double quote ("), backslash (\) and NUL (the NULL byte).

Use fputcsv() to write, and fgetcsv() to read, which will take care of all.

Pleuron answered 5/2, 2014 at 11:37 Comment(8)
This comment on the documentation page of fputcsv() shows how you could use fputcsv() when you want to output in csv format to the browser instead of an actual file.Nicoline
@Angelin Nadar, could you please add a source to your claim about the need to double single quote, backslash and NUL? I didn't find it in RFC-4180.Cosine
You don't need to actually escape single quotes etc. A proper CSV file doesn't even need to add double quotes around a field which contains only single quotes. If the CSV reader is properly implemented it should read the file correctly even with those symbols.Melly
Why was this answer ever voted up? The comment about escaping characters was never backed up and the original question doesn't ask about PHP. This only seems to be true for the string delimiter (and only for the chosen delimiter) when a program, such as Open Office, allows you to change it.Chestnut
If you actually read it RFC4180 notes that there are various [other] specifications and implementations for the CSV format and lists at least 4. The OP does not specify which particular format, so I feel it's unfair to downvote this answer based on presumptions about a specific document.Augean
@cz The answer here is plainly wrong. It neither answers the original question, nor does it provide reasonable other context. For example, NONE of the usual "other CSV formats" requires escaping single quotes inside double quotes, but SOME of the common CSV readers won't unescape escaped single quotes. Escaping backslash MAY be necessary with some readers, but again not with all.Charbonneau
I wrote an answer that I think this question deserves: goes into more detail about the RFC and escaping.Ordinance
@DaveF, the question mentions fgetcsv(), which seems to be specific to PHP.Commander
O
6

CSV is, in theory, a simple format (tabular data delimited by commas), but regrettably there is no formal specification, so there are many subtly different implementations out there. This requires some care when importing/exporting. I will quote RFC 4180 for common implementations:

2.  Definition of the CSV Format

   While there are various specifications and implementations for the
   CSV format (for ex. [4], [5], [6] and [7]), there is no formal
   specification in existence, which allows for a wide variety of
   interpretations of CSV files.  This section documents the format that
   seems to be followed by most implementations:

   1.  Each record is located on a separate line, delimited by a line
       break (CRLF).  For example:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF

   2.  The last record in the file may or may not have an ending line
       break.  For example:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx

   3.  There maybe an optional header line appearing as the first line
       of the file with the same format as normal record lines.  This
       header will contain names corresponding to the fields in the file
       and should contain the same number of fields as the records in
       the rest of the file (the presence or absence of the header line
       should be indicated via the optional "header" parameter of this
       MIME type).  For example:

       field_name,field_name,field_name CRLF
       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF


   4.  Within the header and each record, there may be one or more
       fields, separated by commas.  Each line should contain the same
       number of fields throughout the file.  Spaces are considered part
       of a field and should not be ignored.  The last field in the
       record must not be followed by a comma.  For example:

       aaa,bbb,ccc

   5.  Each field may or may not be enclosed in double quotes (however
       some programs, such as Microsoft Excel, do not use double quotes
       at all).  If fields are not enclosed with double quotes, then
       double quotes may not appear inside the fields.  For example:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

So usually

  • A field may or may not be enclosed in double quotes. (The RFC from 2005 says Excel doesn't use double quotes, but I tested with Excel 2016 and it does.)
  • Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. (In particular, a CSV file may have multiple lines as they appear in a text editor correspond to one row of data.)
  • If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
    • So "" in the raw CSV field represents an empty string, and """" in a raw CSV represents a single quote, ".

(Usually not a problem: CRLF (Windows-style) or LF (Unix-style) line breaks; Whether or not the last line ends with a line break)

However you may encounter data that escapes quotes or other characters (delimiter, linebreaks, escape character itself) with an escape character like \. For example, in readr's read_csv(), this is controlled by escape_double and escape_backslash. Some unusual data use a comment character like # (default in R's read.table but not read.csv).

Ordinance answered 28/7, 2023 at 0:58 Comment(0)
B
2

I have written in Java.

public class CSVUtil {
    public static String addQuote(
            String pValue) {
        if (pValue == null) {
            return null;
        } else {
            if (pValue.contains("\"")) {
                pValue = pValue.replace("\"", "\"\"");
            }
            if (pValue.contains(",")
                    || pValue.contains("\n")
                    || pValue.contains("'")
                    || pValue.contains("\\")
                    || pValue.contains("\"")) {
                return "\"" + pValue + "\"";
            }
        }
        return pValue;
    }

    public static void main(String[] args) {
        System.out.println("ab\nc" + "|||" + CSVUtil.addQuote("ab\nc"));
        System.out.println("a,bc" + "|||" + CSVUtil.addQuote("a,bc"));
        System.out.println("a,\"bc" + "|||" + CSVUtil.addQuote("a,\"bc"));
        System.out.println("a,\"\"bc" + "|||" + CSVUtil.addQuote("a,\"\"bc"));
        System.out.println("\"a,\"\"bc\"" + "|||" + CSVUtil.addQuote("\"a,\"\"bc\""));
        System.out.println("\"a,\"\"bc" + "|||" + CSVUtil.addQuote("\"a,\"\"bc"));
        System.out.println("a,\"\"bc\"" + "|||" + CSVUtil.addQuote("a,\"\"bc\""));
    }
}
Blackamoor answered 12/10, 2021 at 5:32 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Mercer
Doesn't all that .Contains() scan the full string every time? Why not just always wrap the text in double quote? Also, the first pValue.contains("\"")) seems useless (the replace will check the double quote exists itself).Intercalate
T
-2

Since no one has mentioned the way I usually do it, I'll just type this down. When there's a tricky string, I don't even bother escaping it.

What I do is just base64_encode and base64_decode, that is, encode the value to Base64 before writing the CSV line and when I want to read it, decode.

For your example assuming it's PHP:

$csvLine = [base64_encode('Samsung U600 24"'),"10000003409","1","10000003427"];

And when I want to take the value, I do the opposite.

$value = base64_decode($csvLine[0])

I just don't like to go through the pain.

Travail answered 25/11, 2022 at 16:55 Comment(0)
M
-3

I know this is an old post, but here's how I solved it (along with converting null values to empty string) in C# using an extension method.

Create a static class with something like the following:

    /// <summary>
    /// Wraps value in quotes if necessary and converts nulls to empty string
    /// </summary>
    /// <param name="value"></param>
    /// <returns>String ready for use in CSV output</returns>
    public static string Q(this string value)
    {
        if (value == null)
        {
            return string.Empty;
        }
        if (value.Contains(",") || (value.Contains("\"") || value.Contains("'") || value.Contains("\\"))
        {
            return "\"" + value + "\"";
        }
        return value;
    }

Then for each string you're writing to CSV, instead of:

stringBuilder.Append( WhateverVariable );

You just do:

stringBuilder.Append( WhateverVariable.Q() );
Merchandising answered 27/4, 2020 at 19:33 Comment(1)
doesn't this miss the doubling up of embedded quotes?Within

© 2022 - 2024 — McMap. All rights reserved.