C++: How to inspect a file's Byte Order Mark to determine whether it is UTF-8?

Asked 1/2, 2012 at 21:0 Answered 28/5, 2018 at 13:43

How can I inspect a file's Byte Order Mark in C++ to determine whether the file is UTF-8?

Bohlin answered 1/2, 2012 at 21:0 Comment(5)

What is the issue? All you need to do is compare it to 0xEF,0xBB,0xBF. I think you need to give more details on your problem. – Kathyrnkati 1/2, 2012 at 21:3

This might be of some relevance: en.wikipedia.org/wiki/Byte_order_mark#UTF-8 – Bloodyminded 1/2, 2012 at 21:4

The same way you do in any language. You get the first three bytes. If they look like the UTF-8 bytes for the Unicode Byte Order Mark, then it's UTF-8. If they don't, then it's not. Are you asking for someone to write the source code for you? – Donata 1/2, 2012 at 21:4

Nicol -- "If they don't then it's not" -- not true. If they don't look like a BOM, then it could very easily be UTF-8 still. There is no need for a Byte Order Mark with UTF-8 encoding. – Rend 1/2, 2012 at 21:17

@Nicol suggest you delete or edit that comment as it's wrong. A sequence of bytes which look like a Unicode BOM only tells you it might be Unicode data. It could mean "ï»¿". – Raphaelraphaela 1/2, 2012 at 21:22

In general, you can't.

The presence of a Byte Order Mark is a very strong indication that the file you are reading is Unicode. If you are expecting a text file, and the first four bytes you receive are:

0x00, 0x00, 0xfe, 0xff -- The file is almost certainly UTF-32BE
0xff, 0xfe, 0x00, 0x00 -- The file is almost certainly UTF-32LE
0xfe, 0xff, XX,   XX   -- The file is almost certainly UTF-16BE
0xff, 0xfe, XX,   XX   (but not 00, 00) -- The file is almost certainly UTF-16LE
0xef, 0xbb, 0xbf, XX   -- The file is almost certainly UTF-8 with a BOM

But what about anything else? If the bytes you get are anything other than one of these five patterns, then you can't say for certain that your file is or is not UTF-8.
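The five patterns above can be sketched as a small classifier. This is only an illustration: the `Bom` enum and the `detect_bom` name are made up here, and the function assumes the caller has already read up to four bytes from the start of the file in binary mode.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not part of any library.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Classifies the first bytes of a file. `n` is the number of valid
// bytes in `p` (a very short file may provide fewer than four).
inline Bom detect_bom(const unsigned char* p, std::size_t n)
{
    // Test the 4-byte UTF-32 patterns first, because FF FE 00 00
    // also begins with the UTF-16LE pattern FF FE.
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return Bom::Utf32BE;
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return Bom::Utf32LE;
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return Bom::Utf16BE;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return Bom::Utf16LE;
    // No BOM found: the data may still be UTF-8, or anything else.
    return Bom::None;
}
```

A result of `Bom::None` says nothing about the encoding; it only means none of the five patterns matched.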

In fact, any text document containing only ASCII characters from 0x00 to 0x7f is a valid UTF-8 document, as well as being a plain ASCII document.

There are heuristics that can try to infer, based on the particular characters that are seen, whether a document is encoded in, say, ISO-8859-1, or UTF-8, or CP1252, but in general, the first two, three, or four bytes of a file are not enough to say whether what you are looking at is definitely UTF-8.
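One simple heuristic of this kind is to check whether the whole buffer is well-formed UTF-8. The sketch below is an assumption-laden illustration, not a detector: `is_valid_utf8` is a hypothetical helper, and a `true` result only means the bytes are consistent with UTF-8 (pure ASCII passes too), while a `false` result rules UTF-8 out.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the n bytes at p are well-formed UTF-8.
inline bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        std::size_t len;       // length of the sequence starting at p[i]
        unsigned long cp;      // code point being decoded
        unsigned long min_cp;  // smallest code point allowed for this length
        if (b < 0x80)                { len = 1; cp = b;        min_cp = 0x0; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min_cp = 0x10000; }
        else return false;              // stray continuation or invalid lead byte

        if (len > n - i) return false;  // sequence is cut off at the end
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

Text in ISO-8859-1 or CP1252 that contains accented characters will usually fail this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.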

Rend answered 1/2, 2012 at 21:36 Comment(2)

Does the "(but not 00, 00)" apply to both UTF-16BE and UTF-16LE, or just to the latter? – Balance 1/2, 2012 at 22:20

fe ff 00 00 would be UTF-16BE, not UTF-32. In UTF-32, it would represent U+FFFE, which is a non-character, and shouldn't be present in any Unicode document. In UTF-16BE, it's a BOM followed by a null character – Rend 1/2, 2012 at 22:34

if (buffer[0] == '\xEF' && buffer[1] == '\xBB' && buffer[2] == '\xBF') {
    // UTF-8
}

It's better to use buffer[0] == '\xEF' instead of buffer[0] == 0xEF in order to avoid signed/unsigned char problems, see How do I represent negative char values in hexadecimal?
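To spell the problem out: where plain char is a signed 8-bit type, a byte holding 0xEF is promoted to the int value -17, which never equals 0xEF (239). Besides comparing against char literals, another option is to cast each byte to unsigned char first. A minimal sketch, with `starts_with_utf8_bom` as a hypothetical name and the assumption that `buffer` holds at least three bytes:

```cpp
// Hypothetical helper; assumes `buffer` points at three or more bytes.
inline bool starts_with_utf8_bom(const char* buffer)
{
    // Casting to unsigned char makes the comparison numeric (0..255),
    // so it behaves the same whether plain char is signed or unsigned.
    return static_cast<unsigned char>(buffer[0]) == 0xEF &&
           static_cast<unsigned char>(buffer[1]) == 0xBB &&
           static_cast<unsigned char>(buffer[2]) == 0xBF;
}
```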

Fishplate answered 30/7, 2013 at 8:5 Comment(1)

I had to use a combination of ifstream.read(..) (not get()) and the char literal for the BOM bytes to be matched. Cheers! – Silverpoint 27/5, 2016 at 4:15

The 0xEF,0xBB,0xBF ordering doesn't depend on endianness.

How you read the file with C++ is up to you. Personally I still use C-style file functions, because they are provided by the library I am coding with and I can be sure to specify binary mode and avoid unintended translations down the line.
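For the C-style route, a minimal sketch could look like the following. The name `skip_utf8_bom` is hypothetical, and the function assumes `f` was opened in binary mode and is still positioned at the start of the file.

```cpp
#include <cstdio>

// Hypothetical helper: consumes a UTF-8 BOM at the start of `f` if present.
// Assumes `f` was opened with mode "rb" and nothing has been read from it yet.
inline bool skip_utf8_bom(std::FILE* f)
{
    unsigned char bom[3];
    const std::size_t got = std::fread(bom, 1, sizeof bom, f);
    if (got == 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return true;   // BOM consumed; the next read starts at the content

    // No BOM (or a file shorter than three bytes): go back to the start.
    // rewind() also clears the end-of-file and error indicators.
    std::rewind(f);
    return false;
}
```

A caller would open the file with something like std::fopen("data.bin", "rb"), check the result for a null pointer, and then call skip_utf8_bom before reading the content.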

adapted from cs.vt.edu

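#include <fstream>
...
char buffer[100];
std::ifstream myFile("data.bin", std::ios::in | std::ios::binary);
myFile.read(buffer, 3);
if (!myFile) {
    // An error occurred, or the file holds fewer than 3 bytes.
    // myFile.gcount() returns the number of bytes read.
    // Calling myFile.clear() will reset the stream state
    // so it is usable again.
} else if (static_cast<unsigned char>(buffer[0]) == 0xEF &&
           static_cast<unsigned char>(buffer[1]) == 0xBB &&
           static_cast<unsigned char>(buffer[2]) == 0xBF) {
    // The file starts with a UTF-8 BOM.
    // (The casts matter: where plain char is signed,
    // buffer[0] == 0xEF would never be true.)
}
...
if (!myFile.read(buffer, 100)) {
    // Same check as above, folded into the read call
}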

Alternatively, many formats use UTF-8 by default if no other BOM (UTF-16 or UTF-32, for example) is present.

wiki for BOM

unicode.org.faq

Various answered 1/2, 2012 at 21:4 Comment(0)

This is my version in C++:

#include <fstream>

/* Reads a leading UTF-8 BOM from the file stream if one exists.
 * Returns true iff a BOM was found. */
bool ReadBOM(std::ifstream & is)
{
  /* Read the first byte. */
  char const c0 = is.get();
  if (c0 != '\xEF') {
    is.putback(c0);
    return false;
  }

  /* Read the second byte. */
  char const c1 = is.get();
  if (c1 != '\xBB') {
    is.putback(c1);
    is.putback(c0);
    return false;
  }

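  /* Read the third byte. It must be consumed, not peeked; otherwise the
   * stream would be left sitting on the last byte of the BOM. */
  char const c2 = is.get();
  if (c2 != '\xBF') {
    is.putback(c2);
    is.putback(c1);
    is.putback(c0);
    return false;
  }

  return true; // UTF-8 BOM found and consumed.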
}

Axel answered 28/5, 2018 at 13:43 Comment(1)

Maybe better to get the 3rd character and put it back. Otherwise you won't move past the BOM to the first valid character. As I understand it, with this function the stream pointer would potentially be sat in the middle of the BOM. – Relativize 31/3 at 6:59
