Data gets garbled when writing to csv with fputcsv() / fgetcsv()
Asked Answered
M

5

1

There seems to be an encoding issue or bug in PHP with fputcsv() and fgetcsv().

The following PHP code:

$row_before = ['A', json_encode(['a', '\\', 'b']), 'B'];

print "\nBEFORE:\n";
var_export($row_before);
print "\n";

$fh = fopen($file = 'php://temp', 'rb+');

fputcsv($fh, $row_before);

rewind($fh);

$row_after = fgetcsv($fh);

print "\nAFTER:\n";
var_export($row_after);
print "\n\n";

fclose($fh);

Gives me this output:

BEFORE:
array (
  0 => 'A',
  1 => '["a","\\\\","b"]',
  2 => 'B',
)

AFTER:
array (
  0 => 'A',
  1 => '["a","\\\\',
  2 => 'b""]"',
  3 => 'B',
)

So clearly, the data is damaged on the way. Originally there were just 3 cells in the row, afterwards there are 4 cells in the row. The middle cell is split thanks to the backslash that is also used as an escape character.

See also https://3v4l.org/nc1oE Or here, with explicit values for delimiter, enclosure, escape_char: https://3v4l.org/Svt7m

Is there any way I can sanitize / escape my data before writing to CSV, to guarantee that the data read from the file will be exactly the same?

Is CSV a fully reversible format?

EDIT: The goal would be a mechanism to properly write and read ANY data as csv, so that after one round trip the data is still the same.

EDIT: I realize that I do not really understand the $escape_char parameter. See also fgetcsv/fputcsv $escape parameter fundamentally broken Maybe an answer to this would also bring us closer to a solution.

Michaeline answered 8/6, 2017 at 6:13 Comment(6)
CSV has loads of limitations, from embedded quotes to multiline data. This is why things like XML, JSON and various other formats are used.Friarbird
I have plenty of embedded quotes and line breaks in data that I export to CSV, and until now there was never a (visible) problem.Michaeline
CSV with line breaks in data isn't portable, if you just want data for your own application, why bother with CSV and just write any old format you want - and as your using all sorts of separators - you are effectively doing that.Friarbird
1. Depends on the application. E.g. LibreOffice Calc has no problem with it. It might be nonstandard, but the world does not care. 2. Because CSV (or PHP-flavoured CSV) is already supported natively, 3. I am using the standard separators mostly. The only reason I use different separators is to test if PHP will behave differently.Michaeline
Also, the application I am developing has all sorts of CSV from external sources already, so I prefer to stick to this one format instead of inventing new formats.Michaeline
@NigelRen I ended up writing my own userland CSV implementation that always replaces single enclosure characters in cells with double enclosure characters, and does not have a special "escape character". This seems to be bullet-proof. I processed a ton of data with it already, which all survives the round trip unaltered. I still use native fgetcsv() for csv files from external sources, since those often contain rogue single double quotes in cell contents.Michaeline
G
5

The culprit is that fputcsv() uses an escape character, which is a non-standard extension to CSV. (Well, as far as RFC 7111 can be regarded as standard.) Basically, this escape character would have to be disabled, but passing an empty string as $escape to fputcsv() doesn't work. Usually, passing a NUL character should give the desired results, however, see https://3v4l.org/MlluN.

Grist answered 21/9, 2017 at 11:21 Comment(1)
Hey! I think my answer is more up to date at this point. But if you are the cmb who followed up on the bug report, and if cmb means you authored this commit, then all praise to you! I happily keep the checkbox on your answer.Michaeline
M
2

UPDATE Jan 2020

Since PHP 7.4, passing an empty string as escape char fixes the problem! https://www.php.net/manual/en/function.fgetcsv.php

Demo https://3v4l.org/33Wja - see difference of PHP 7.4 vs older versions. (this is the same snippet as below, just with empty string as escape char)

Original answer

Contrary to what others are saying, I claim that this is a PHP bug. I am going to report it, and update this answer.

EDIT: Now reported here, https://bugs.php.net/bug.php?id=74713

Discussed in this answer:

  • Does changing the delimiter help? -> Not really.
  • Could fputcsv() be fixed? -> Yes.

Does changing the delimiter help?

It can be shown that this is reproducible with any combination of delimiter, enclosure and escape character.

https://3v4l.org/a29kR

$delimiter = 'X';
$enclosure = 'Y';
$escape_char = "Z";

$row_before = [
  'A',
  "[{$enclosure}a{$enclosure}{$delimiter}{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}{$enclosure}b{$enclosure}]",
  'B',
];

print "\nBEFORE:\n";
var_export($row_before);
print "\n";

$fh = fopen($file = 'php://temp', 'rb+');

fputcsv($fh,$row_before,$delimiter,$enclosure, $escape_char);

rewind($fh);

$row_plain = fread($fh, 1000);

print "\nPLAIN:\n";
var_export($row_plain);
print "\n";

rewind($fh);

$row_after = fgetcsv($fh, 500,$delimiter,$enclosure, $escape_char);

print "\nAFTER:\n";
var_export($row_after);
print "\n\n";

fclose($fh);

Output:

BEFORE:
array (
  0 => 'A',
  1 => '[YaYXYZZYXYbY]',
  2 => 'B',
)

PLAIN:
'AXY[YYaYYXYYZZYXYYbYY]YXB
'

AFTER:
array (
  0 => 'A',
  1 => '[YaYXYZZ',
  2 => 'bYY]Y',
  3 => 'B',
)

Could fputcsv() be fixed?

For this let's turn back to more common and readable delimiter, enclosure and escape character.

$delimiter = ',';
$enclosure = '"';
$escape_char = "@";

Here the result is:

BEFORE:
array (
  0 => 'A',
  1 => '["a","@@","b"]',
  2 => 'B',
)

PLAIN:
'A,"[""a"",""@@",""b""]",B
'

AFTER:
array (
  0 => 'A',
  1 => '["a","@@',
  2 => 'b""]"',
  3 => 'B',
)

We see that the '"@@"' part is exported as '""@@"', while it SHOULD have been exported as '""@@""'.

In fact, doing this manually with fwrite() instead of fputcsv() does fix the problem: https://3v4l.org/4U1CQ

Michaeline answered 8/6, 2017 at 16:47 Comment(0)
F
0

Using your code with specific delimiters but changing the following line will work...

$enclosure = "'";

I think it may be to do with thinking that the \ is escaping the following quote.

Friarbird answered 8/6, 2017 at 6:21 Comment(2)
This does work for the given example. But one could simply craft another string to break it, like so: 3v4l.org/LsTSTMichaeline
You can screw up any CSV export by putting a \n in the string, which is why my comment on the original post says that other formats are more flexible.Friarbird
M
0

As in php, \\ used to escape the backslash(link for PHP manual escape sequence),so for making it as string u need to use one more single quote(' ').

so your input array should be...

$row_before = ['A', json_encode(['a', "'\\'", 'b']), 'B'];
Mistrustful answered 8/6, 2017 at 7:9 Comment(1)
But the goal is to process arbitrary incoming data. If I have to rewrite the input, I need an algorithm that always works..Michaeline
C
0

This is not PHP bug. It seems that json_encode() use the same delimiter (,), enclosure (") and escape (\) which is the same as default delimiter, enclosure and escape for both fputcsv() and fgetcsv(). You may differentiate enclosure or escape, and delimiter if necessary.

As already answered, in this case it will work by specify enclosure with (') instead:

$row_before = ['A', json_encode(['a', '\\', 'b']), 'B'];

print "\nBEFORE:\n";
var_export($row_before);
print "\n";

$fh = fopen($file = 'php://temp', 'rb+');

fputcsv($fh, $row_before, ',', "'");

rewind($fh);

$row_after = fgetcsv($fh, 0, ',', "'");

print "\nAFTER:\n";
var_export($row_after);
print "\n\n";

fclose($fh);
Caty answered 8/6, 2017 at 7:32 Comment(3)
Same as for Nigel's answer: It does work for the given example, but what if your array contains "['a','\\\\','b']"? Or more universally, "[{$enclosure}a{$enclosure}{$delimiter}{$enclosure}{$escape_char}{$escape_char}{$enclosure}{$delimiter}{$enclosure}b{$enclosure}]" or ``Michaeline
You must not allow any (special) character that can cause bug.. You may try escape (") to Unicode Codepoint within json_encode(): json_encode($value, JSON_HEX_QUOT);. It will convert double quotes to \u0022.Caty
Or use strtr() to make temporary replacement.. dirty and it's commonCaty

© 2022 - 2024 — McMap. All rights reserved.