fgetcsv/fputcsv $escape parameter fundamentally broken
Overview

fgetcsv and fputcsv support an $escape argument; however, it's either broken or I'm not understanding how it's supposed to work. Ignore the fact that the $escape parameter isn't documented for fputcsv: it is supported in the PHP source, but a small bug prevents it from showing up in the documentation.

The functions also support $delimiter and $enclosure parameters, defaulting to a comma and a double quote respectively. I would expect that the $escape character is needed in order to write a field containing any one of those metacharacters (backslash, comma or double quote); however, this certainly isn't the case. (I now understand from reading Wikipedia that such fields are to be enclosed in double quotes.)

What I've tried

Take, for example, the pitfall that has affected numerous posters in the comments section of the fgetcsv documentation: the case where we'd like to write a single backslash to a field.

$r = fopen('/tmp/test.csv', 'w');
fwrite($r, '"\"');
fclose($r);

$r = fopen('/tmp/test.csv', 'r');
var_dump(fgetcsv($r));
fclose($r);

This returns false. I've also tried "\\", however that also returns false. Padding the backslash(es) with some nebulous text gives fgetcsv the boost it needs... "hi\\there" and "hi\there" both parse and have the same result, but the result has only 1 backslash, so what's the point of the $escape at all?
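A side note on the literals used above, as a small sketch: in a single-quoted PHP string, \\ collapses to one backslash, while a backslash before any character other than a backslash or a single quote is kept as-is. That is one plausible reason the variants above behave identically. The snippet only inspects string literals and makes no claim about fgetcsv itself; the variable names are illustrative.

```php
<?php
// '"\"' and '"\\"' are the same three characters: quote, backslash, quote.
$a = '"\"';
$b = '"\\"';
var_dump($a === $b);   // bool(true)
var_dump(strlen($a));  // int(3)

// Likewise, 'hi\there' and 'hi\\there' are the same string.
$c = 'hi\there';
$d = 'hi\\there';
var_dump($c === $d);   // bool(true)

// Two backslashes between the quotes need four in the source.
$e = '"\\\\"';
var_dump(strlen($e));  // int(4)
```

So to put two literal backslashes into the test file, the fwrite call would need the four-backslash form.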

I've observed the same behavior when not enclosing the backslash in double quotes. Writing a 'CSV' file containing the string \, and \\, have the same result when parsed by fgetcsv, 1 backslash.

Let's ask PHP how it would encode a backslash as a field in a CSV using fputcsv:

$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, array('\\'));
fclose($r);
echo file_get_contents('/tmp/test.csv');

The result is a single backslash enclosed in double quotes (I've tried 3 versions of PHP > 5.5.4, when $escape support was supposedly added to fputcsv). The hilarity of this is that fgetcsv can't even read it properly, per my notes above: it returns false. I'd expect fputcsv not to enclose the backslash in double quotes, or fgetcsv to be able to read "\" as fputcsv has written it, or really, in my apparently misconstrued mind, for fputcsv to write a pair of backslashes enclosed in double quotes and for fgetcsv to be able to parse it properly!

Reproducible Test

Try writing a single backslash to a file using fputcsv, then reading it back via fgetcsv.

$aBackslash = array('\\');

// Write a single backslash to a file using fputcsv
$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, $aBackslash);
fclose($r);

// Read the file using fgetcsv
$r = fopen('/tmp/test.csv', 'r');
$aFgetcsv = fgetcsv($r);
fclose($r);

// Compare the read value from fgetcsv to our original value
if(count(array_diff($aBackslash, $aFgetcsv)))
  echo "PHP CSV support is broken\n";

Questions

Taking a step back, I have some questions:

  • What's the point of the $escape parameter?
  • Given the loose definition of CSV files, can it be said PHP is supporting them correctly?
  • What's the 'proper' way to encode a backslash in a CSV file?

Background

I initially discovered this when a co-worker provided me with a CSV file produced by Python, which wrote out a single backslash enclosed by double quotes. After fgetcsv failed to read it, I had the gall to ask him if he could use a standard Python function. Little did I know the PHP CSV toolkit is a tangled mess! (FWIW: the Python dev tells me he's using the CSV writing module.)

Decor answered 10/11, 2014 at 9:14 Comment(4)
FWIW, the PHP string literal '"\\"' stands for the string "\". If you want two backslashes in your string you need to write '"\\\\"'. I think half of your complaints about a single backslash are based on this misunderstanding, no? - Carraway
These are characters in the CSV file, not in PHP code. PHP should interpret them based on the CSV format, not its internal string representation. Again, what is the purpose of the $escape argument in fgetcsv and fputcsv then? - Decor
I did just clean up some of the single quotes in those strings that I'm trying to indicate are coming from the file. I'll grant you those were probably misleading. - Decor
"\" is invalid CSV. It means an opening enclosure, followed by a literal double quote character, without a terminating enclosure. Unfortunately, since reading your question and the answers, I've done some experimenting and I've discovered that, contrary to what you'd expect, the returned string is not unescaped. So, to encode a single backslash, the CSV needs to be "\\" (four characters, or "\\\\" in PHP code) which will return `\`. It's then up to you to unescape the escape characters. It's actually not broken, but you have to realise how unintuitive it is first. - Aeroscope
F
6

From a quick look at Python's documentation on CSV Format Parameters, the escape character used within enclosed values (i.e. inside double quotes) is another double quote.

For PHP, the default escape character is a backslash (^); to match Python's behaviour you need to use this:

$data = fgetcsv($r, 0, ',', '"', '"');

(^) Actually fgetcsv() treats both $enclosure||$enclosure and $escape||$enclosure in the same way, so the $escape argument is used to avoid treating the backslash as a special character.

(^^) Setting the $length parameter to 0 instead of a fixed hard limit makes it less efficient.
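The call above can be tried without a file by feeding one line to str_getcsv. This is a minimal sketch, assuming a line shaped like what an RFC 4180-style writer might emit (a lone backslash in quotes, and embedded double quotes doubled); the sample line is made up for illustration and is not taken from the co-worker's file.

```php
<?php
// Raw CSV text: "\","say ""hi"""
// (in a single-quoted PHP literal, \\ is one backslash)
$line = '"\\","say ""hi"""';

// Pass the enclosure character as $escape, as in the fgetcsv call above,
// so the backslash is not treated as special.
$fields = str_getcsv($line, ',', '"', '"');

var_dump($fields);
```

With these arguments the backslash should come back as an ordinary one-character field, and the doubled quotes as single quotes.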

Fretted answered 14/11, 2014 at 4:39 Comment(5)
I don't think that's entirely accurate, and this is where the purpose of the $escape parameter is confusing. You see, without specifying the $escape parameter, when you write a single double quote out to a file, PHP will output 3 double quotes. It's one of those details of the vague CSV spec: double quotes are escaped with double quotes. I've seen this mentioned in comments on php.net and Wikipedia. This again begs the question of the $escape parameter, though: if a double quote is essentially a built-in escape character, why have another one? - Decor
@Decor Actually, fgetcsv() detects double enclosures by default, so the $escape argument is really to prevent it from treating the backslash as a special character. - Amenity
That doesn't make much sense either, since the escape character will obviously be treated as a special character. If it isn't treated as a special character, that could be the problem, right? But clearly the escape character is being treated specially, since it's placed within enclosures automatically by fputcsv. - Decor
Actually, your update has helped me to understand why the fgetcsv parser is breaking. It sees the opening double quote, but then doesn't see the closing one because it's escaped. This lends credence to my original thought that fputcsv could be made smarter. Why write out something that knowingly won't be readable on the other side of the fence? It's an edge case, but I think it has merit. - Decor
@Decor I didn't read the signature of fgetcsv() closely enough; the first argument is $length, which I had missed earlier on, so I've updated that. It now decodes data properly, but I agree that the default $escape value of fgetcsv() should be changed to match the behaviour of fputcsv(). - Amenity
S
4

EDIT 2

So after some sleep and another look at the code, it turns out fputcsv doesn't accept the escape parameter, and I was being stupid. I've updated the code below to proper working code. The same basic principle applies: the escape parameter is there to change the escape character, so you can load a CSV with backslashes without them being treated as escape characters. The trick is to use a character that isn't contained within the CSV. You can find one by grepping the file for candidate characters until you hit one that returns no matches.

EDIT

Ok, so the verdict is that it checks for the escape char, and then never stops checking. So, if it finds it, it's escaped. That simple.

That said, the purpose of the escape parameter is to allow for this exact situation, where you can alter the escape char to a character that isn't needed.

Here I've converted your example code into working code:

$aBackslash = array('\\');

// Write a single backslash to a file using fputcsv
$r = fopen('/tmp/test.csv', 'w');
fputcsv($r, $aBackslash, ',', '"'); // EDIT 2: Removed escape param that causes PHP Notice.
fclose($r);

// Read the file using fgetcsv
$r = fopen('/tmp/test.csv', 'r');
$aFgetcsv = fgetcsv($r, 0, ',', '"', '#'); // 2nd argument is $length (0 = no limit); '#' replaces the backslash as escape
fclose($r);

// Compare the read value from fgetcsv to our original value
if(count(array_diff($aBackslash, $aFgetcsv)))
  echo "PHP CSV support is broken\n";
else
  echo "PHP WORKS!\n";

One important caveat is that both fgetcsv and fputcsv must have the same parameters, otherwise the returned array will not match up to the original array.

ORIGINAL ANSWER

You are very correct. This is a failing with the language. I've tried every permutation of slashes that I can think of, and I've yet to actually achieve a successful response from the CSV. It always returns just as your example says.

I think what @deceze was getting at is that in your example you use array('\\'), which is actually the string literal "\"; PHP interprets it as such and passes "\" to the CSV, which is then returned that way. This returns the erroneous response \", which, as I stated above, is definitely wrong.

I did manage to find a work around, so that the result is actually appropriate:

First, for your example we'll either need to generate /tmp/test.csv with "\" as the body, or alter the array slightly. The easiest method is just changing the array to:

array('"\\\\"');

After that, we should change up the fgetcsv request a bit.

$aFgetcsv = fgetcsv($r);
$aFgetcsv = array_map('stripslashes', $aFgetcsv);

By doing this, we're telling PHP to strip the first slash, thus making the string within $aFgetcsv "\".

Shiism answered 14/11, 2014 at 2:32 Comment(14)
Regarding the array('\\') from my example, I have done that on purpose. The point is to illustrate that fputcsv writes out the single backslash amidst surrounding double quotes. Trying to read that with fgetcsv fails, which is definitely a bug imo. - Decor
The other question still stands too, which is: if things are more or less 'escaped' in a CSV by surrounding double quotes (or a custom $enclosure), then what is the point of the $escape in the first place? Seems like nonsense to me. My guess is the bug is really on the fgetcsv side, since fputcsv seems to 'correctly' surround the single backslash with double quotes. I've observed the same behavior from the Python CSV module (it writes a single backslash surrounded by double quotes). - Decor
My comment in the code even says "// Write a single backslash to a file using fputcsv"... - Decor
I agree that there's a bug in fgetcsv. I was simply explaining the confusion in the comments, and providing a workaround for this situation. I'd say submit a ticket to PHP. A friend of mine is already digging into the source, because you got him curious. - Shiism
Cool, appreciate you taking a look at it. I've also taken a look at the source. There's definitely a bug in fputcsv preventing the $escape argument from showing in the documentation and on the CLI with php --rf fputcsv. Want to take a look at fgetcsv too, but so busy atm lol. - Decor
It's a huge function. I was able to confirm that the same glitch happens in str_getcsv, with the hilarious effect of var_dump( str_getcsv('"\"') === str_getcsv('"\""') ); echoing bool( true ) - Shiism
Seems like it all boils down to the php_fgetcsv function, which both fgetcsv and str_getcsv call. - Shiism
On top of the fgetcsv bug though, I'm still wondering what the point of $escape is. Seems like it could be removed entirely lol. - Decor
@Decor I've added an edit to explain why this happens, and why it's not a bug, after reading more up on the code. - Shiism
So essentially what you're saying is that the $escape character is off limits for inclusion as a field value using fputcsv/fgetcsv? If we take your example and use a # as the escape character, now we can't write a # character with fputcsv that can be read by fgetcsv. Just moves the original problem to a new character. - Decor
That's the basic issue with escape characters. There will always be breakage if you have a file that contains all characters. The real solution is that you need to run my original suggestion: array_map('stripslashes', $aFgetcsv); - Shiism
Also I've added another edit, as fputcsv doesn't actually accept escape as a parameter, which may be part of the confusion. Passing it with display_errors on shows a PHP notice. - Shiism
There's a bug on fpetcsv like I was saying. The code is set up to support an $escape parameter, but there's a missing line in ext/standard/basic_functions.c. + ZEND_ARG_INFO(0, escape_char). I'll take a closer look at your answer tomorrow. Just too tired to give it proper attention today. Thanks for all your effort! - Decor
LOL fpetcsv, who's tired? The bug is in fputcsv. Once you add that line and recompile you'll see php --rf fputcsv on the command line starts behaving correctly. - Decor
T
1

Just had the same problem. The solution was to set $escape to false:

$row = ['a', '{"b":"single dquote=\""}', 'c'];
fputcsv($f, $row);                // invalid csv: a,"{""b"":""single dquote=\"""}",c
fputcsv($f, $row, ',', '"', false); // valid csv: a,"{""b"":""single dquote=\""""}",c
Tepefy answered 28/2, 2022 at 10:29 Comment(0)
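A round trip along the same lines can be sketched as follows. It assumes PHP 7.4 or later, where (as far as I know) an empty string is accepted for $escape and turns the escape mechanism off on both the writing and reading side. The helper name csv_round_trip and the sample row are made up for illustration.

```php
<?php
// Hypothetical helper: write one row to an in-memory stream and read it
// back, passing '' as $escape to both fputcsv and fgetcsv (PHP 7.4+ assumed).
function csv_round_trip(array $row): array
{
    $h = fopen('php://memory', 'w+');
    fputcsv($h, $row, ',', '"', '');
    rewind($h);
    $read = fgetcsv($h, 0, ',', '"', '');
    fclose($h);
    return $read;
}

// A lone backslash, embedded quotes, a comma, and a backslash-quote pair.
$row = ['\\', 'say "hi"', 'a,b', 'back\\"slash-quote'];
var_dump(csv_round_trip($row) === $row);
```

Passing the same delimiter, enclosure and escape values to both calls is what keeps the written and parsed rows in agreement.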
