What is "ANSI as UTF-8" and how can I make fputcsv() generate UTF-8 w/BOM?

I made a PHP script that generates CSV files that were previously generated by another process. And then, the CSV files have to be imported by yet another process.

The import of the old CSV files works fine, but when importing the new CSV files there are issues with special characters.

When I open old CSVs with Notepad++, it says the encoding is UTF-8, and when I open the new CSVs with it, it says their encoding is 'ANSI as UTF-8'.

What's the difference between the two?

And how can I make fopen and fputcsv use the 'pure?' UTF-8 encoding?

Thanks!

Fresnel answered 4/9, 2009 at 17:57 Comment(6)
ANSI is the American National Standards Institute. I think you meant ASCII.Kaz
@Gumbo: No, Notepad++ uses "ANSI" the same way Microsoft does, to mean the default eight-bit encoding of the underlying OS. But "ANSI as UTF-8" is NPP's own, bizarre coinage.Gynecology
@Petruza: This question really has nothing to do with CSV, fopen(), or even PHP--it's all about Notepad++. I changed the title accordingly.Gynecology
@Alan Moore: I disagree with your edit. Petruza's main issue is the different handling of two almost identical CSV files by some process, and his main question is about what the difference might be. Notepad++ is just the tool he used to check for the difference, so I think your new title is a bit misleading.Undo
@Henrik Opel: You're right, I got carried away with the Notepad++ stuff.Gynecology
Yes, that's not at all what I asked. Anyway, I fixed it by using utf8_decode(), which changed the UTF-8 input to ANSI. This wouldn't work with Cyrillic characters, for example, but the problem here was accented Latin letters. Thanks all! ( @Gumbo: I mean what I say, and I know what ANSI is )Fresnel
G
42

There's nothing wrong with the file. "ANSI as UTF-8" means there's no BOM but Notepad++ has definitely identified the encoding as UTF-8 by analyzing byte patterns. I tested this by creating a file with Russian, Greek and Polish text in it and saving it as UTF-8 without a BOM. Here it is:

# Russian
Следующая

# Greek
Επόμενη

# Polish
Więcej

I did this in a different editor (EditPad Pro) and used hex mode to make sure the BOM wasn't there. When I opened it in NPP it showed the encoding as "ANSI as UTF-8" and all of the characters displayed correctly. Then, still in hex mode, I removed the first byte of the first Russian character. When I opened it in NPP again, it showed the encoding as "ANSI" and displayed the non-ASCII parts of the text as mojibake:

; Russian
¡Ð»ÐµÐ´ÑƒÑŽÑ‰Ð°Ñ

; Greek
Î•Ï€ÏŒÎ¼ÎµÎ½Î·

; Polish
WiÄ™cej

Back to EditPad, and this time I added a BOM but didn't repair the Cyrillic character. This time NPP reported the encoding as "UTF-8" and everything displayed correctly except that first Russian character, as shown below. "A1" is the hex representation of what should have been the second byte of that character in UTF-8. It was displayed in an inverted color scheme to indicate an error.

# Russian
A1ледующая

# Greek
Επόμενη

# Polish
Więcej

To summarize: In the absence of a BOM, Notepad++ looks for bytes that can't represent ASCII characters because their values are greater than 127 (or 7F hex). If it finds any, but they all conform to the patterns required by UTF-8, it decodes the file as UTF-8 and reports the encoding in the status bar as "ANSI as UTF-8".

But if it finds even one byte that doesn't toe the UTF-8 line, it decodes the file as "ANSI", meaning the default single-byte encoding for the underlying platform. If your file had been corrupted, that's what you would be seeing.
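The decision procedure summarized above can be sketched in PHP. This is only an illustration of the described behavior, not Notepad++'s actual detection code; the function name `classifyEncoding` is hypothetical, and the sketch assumes the mbstring extension is available for `mb_check_encoding()`.

```php
<?php
// Hypothetical helper: mimics the status-bar labels described above.
// This is NOT Notepad++'s actual detection code.
function classifyEncoding(string $bytes): string
{
    // A leading EF BB BF (the UTF-8 BOM) settles the question immediately.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }

    // Only bytes above 0x7F can distinguish UTF-8 from plain ASCII.
    $hasHighBytes = preg_match('/[\x80-\xFF]/', $bytes) === 1;

    // All high bytes must form well-formed UTF-8 sequences.
    if ($hasHighBytes && mb_check_encoding($bytes, 'UTF-8')) {
        return 'ANSI as UTF-8';
    }

    // Pure ASCII, or at least one byte that breaks the UTF-8 patterns.
    return 'ANSI';
}
```

Under these assumptions, a BOM-less file containing "Więcej" would be labeled "ANSI as UTF-8", while the same text with one lead byte removed would fall through to "ANSI".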

EDIT: Although your file is valid without it, you could add a BOM by manually writing the three bytes "EF BB BF" at the very beginning of the file--but there should be a better way. How are you generating the content now? Because it is UTF-8, with at least one non-ASCII character in there somewhere; otherwise, NPP would report it as "ANSI".
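Writing the three BOM bytes by hand could look roughly like this in PHP. It is a minimal sketch: the helper name `writeCsvWithBom` is hypothetical, and it assumes the row data is already valid UTF-8, since `fputcsv()` writes bytes as they are and does not convert encodings.

```php
<?php
// Hypothetical helper: writes rows as CSV, preceded by the UTF-8 BOM.
// Assumes every field in $rows is already UTF-8 encoded.
function writeCsvWithBom(string $path, array $rows): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    // The BOM must be the very first three bytes of the file.
    fwrite($handle, "\xEF\xBB\xBF");

    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly; these are
        // the historical defaults, spelled out to avoid relying on them.
        fputcsv($handle, $row, ',', '"', '\\');
    }

    fclose($handle);
}
```

If the consuming process turns out to work without the BOM, dropping the single `fwrite()` call is the only change needed.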

Another possibility to consider: if you have any influence over the process that consumes your CSV file, maybe you can configure it to expect UTF-8 without a BOM. Technically, any software that can decode UTF-8 with a BOM but not without one is broken. The Unicode Consortium actually discourages use of the UTF-8 BOM, not that anyone's listening.

Gynecology answered 5/9, 2009 at 3:56 Comment(5)
This is a good and well presented explanation of the 'ANSI as UTF-8' topic, but only a partial answer (at least until you edited the question title ;) We already covered that topic, albeit more briefly (see the comments to my answer below), so the main question left is why/if the absence/presence of the BOM would make a difference and if so, how to fix it.Undo
It sounded to me like Notepad++ with its "ANSI as UTF-8" silliness was the only problem--if not, it's doing a good job obscuring the problem. But you're right, I neglected to answer the second part of the question.Gynecology
+1 for Mojibake - there's actually a word for the weirdness I was experiencing.Homerhomere
"ANSI as UTF-8" is misleading. ANSI is not byte compatible with UTF-8. ASCII (0-127) is byte compatible with UTF-8. The ANSI range (128-255) which extends ASCII are control characters in UTF-8 indicating which page the next bytes should be looked up from. "ASCII as UTF-8" would be more correct - but still wrong. It's simply UTF-8 without BOM.Keratitis
I was about to report it, but this is yours, am I wrong @thomthom? xDExhilarate
U
6

According to the Notepad++ related threads here and here, 'ANSI as UTF-8' indicates UTF-8 without BOM, while a plain 'UTF-8' means UTF-8 with BOM. So maybe the process reading the CSV needs the Byte-order mark to correctly read the CSV as UTF-8.

But before going into that, make sure that your script actually writes UTF-8! When you open the new CSVs in Notepad++ (and it says 'ANSI as UTF-8'), are all 'special' characters displayed correctly? If not, you need to adapt your script to actually write UTF-8; if so, check for the BOM difference.

Undo answered 4/9, 2009 at 18:11 Comment(3)
Thanks! yes, both CSV files show special characters correctly in Notepad++Fresnel
Ok, so it might be a missing BOM. You could add one 'by hand' in Notepad++ (Convert to UTF-8 with BOM) and check if that resolves the issue. If it does, see php.net/manual/en/function.utf8-encode.php#68211 for how to create a BOM in PHPUndo
Nowadays Notepad++ shows "UTF-8 without BOM" instead of the confusing "ANSI as UTF-8". Good change! :)Chivalrous
T
1

Try saving your PHP script as UTF-8 too. Sometimes it is necessary (although it can be worked around) to have the script in the same character encoding as the data.

Similar problem: PHP: Explode using special characters

Truck answered 4/9, 2009 at 18:4 Comment(0)
A
0

It is worth noting that 'ANSI as UTF-8', i.e. UTF-8 without the BOM, is useful if you are saving your PHP files as UTF-8. If your PHP file outputs HTML to the browser and was saved with a BOM, the BOM is included in the HTML output, which the W3C validator explicitly warns against:

Byte-Order Mark found in UTF-8 File.

The Unicode Byte-Order Mark (BOM) in UTF-8 encoded files is known to cause problems for some text editors and older browsers. You may want to consider avoiding its use until it is better supported.

Further to this, I spotted that the BOM confuses Firefox's Firebug which now thinks that all your <head> content is actually in the <body> tag.
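When a BOM has already crept into a string (for example, content read from a file that an editor saved with one), it can be removed before the text is output or processed further. The following is a minimal sketch; `stripUtf8Bom` is a hypothetical helper name, not a built-in PHP function.

```php
<?php
// Hypothetical helper: removes a single leading UTF-8 BOM, if present.
function stripUtf8Bom(string $text): string
{
    // The UTF-8 BOM is always the byte sequence EF BB BF.
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($text, 3);
    }
    // No BOM at the start: return the input unchanged.
    return $text;
}
```

This only helps for data handled at runtime. A BOM at the start of the PHP source file itself is emitted before any code runs, so in that case the file has to be re-saved without one.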

Alp answered 6/3, 2012 at 22:14 Comment(0)
