Ignore byte-order marks in C++, reading from a stream

I have a function to read the value of one variable (integer, double, or boolean) on a single line in an ifstream:

template <typename Type>
void readFromFile (ifstream &in, Type &val)
{
  string str;
  getline (in, str);
  stringstream ss(str);
  ss >> val;
}

However, it fails on text files created with editors that insert a BOM (byte-order mark) at the beginning of the first line, which unfortunately includes Notepad and WordPad. How can I modify this function to ignore the byte-order mark if it is present at the beginning of str?

Rill answered 16/1, 2012 at 13:17 Comment(2)
You mean the UTF-8 BOM? That's very arcane...Jinnyjinrikisha
Ahem.. the UTF-8 BOM isn't FEFF, it's EF BB BF, and UTF-8 is supposed to be endian-agnostic too. By the way, the UTF-8 BOM is pooh-poohed by the Unicode Consortium.Monique
D
16

(I'm assuming you're on Windows, since using U+FEFF as a signature in UTF-8 files is mostly a Windows thing and should simply be avoided elsewhere)

You could open the file as a UTF-8 file and then check whether the first character is U+FEFF. You can do this by opening a normal char-based fstream and then using wbuffer_convert to treat it as a series of code units in another encoding. VS2010 doesn't yet have great support for char32_t, so the following uses UTF-16 in wchar_t.

std::fstream fs(filename);
std::wbuffer_convert<std::codecvt_utf8_utf16<wchar_t>,wchar_t> wb(fs.rdbuf());
std::wistream is(&wb);
// if you don't do this on the stack remember to destroy the objects in reverse order of creation. is, then wb, then fs.
std::wistream::int_type ch = is.get();
const std::wistream::int_type ZERO_WIDTH_NO_BREAK_SPACE = 0xFEFF;
if(ZERO_WIDTH_NO_BREAK_SPACE != ch)
    is.putback(ch);

// now the stream can be passed around and used without worrying about the extra character in the stream.

int i;
readFromStream<int>(is,i);

Remember that this should be done on the file stream as a whole, not inside readFromFile on your stringstream, because ignoring U+FEFF should only be done if it's the very first character in the whole file, if at all. It shouldn't be done anywhere else.

On the other hand, if you're happy using a char-based stream and just want to skip U+FEFF if present, then James Kanze's suggestion seems good, so here's an implementation:

std::fstream fs(filename);
char a,b,c;
a = fs.get();
b = fs.get();
c = fs.get();
if (a != (char)0xEF || b != (char)0xBB || c != (char)0xBF) {
    fs.clear();   // a file shorter than three bytes leaves failbit set, which would make seekg a no-op
    fs.seekg(0);
} else {
    std::cerr << "Warning: file contains the so-called 'UTF-8 signature'\n";
}

Additionally if you want to use wchar_t internally the codecvt_utf8_utf16 and codecvt_utf8 facets have a mode that can consume 'BOMs' for you. The only problem is that wchar_t is widely recognized to be worthless these days* and so you probably shouldn't do this.

std::wifstream fin(filename);
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

* wchar_t is worthless because it is specified to do just one thing; provide a fixed size data type that can represent any code point in a locale's character repertoire. It does not provide a common representation between locales (i.e., the same wchar_t value can be different characters in different locales so you cannot necessarily convert to wchar_t, switch to another locale, and then convert back to char in order to do iconv-like encoding conversions.)

The fixed-size representation itself is worthless for two reasons. First, many code points have meaning only in combination with neighbouring code points, so understanding text means you have to process multiple code points anyway. Second, some platforms such as Windows use UTF-16 as the wchar_t encoding, which means a single wchar_t isn't even necessarily a code point value. (Whether using UTF-16 this way even conforms to the standard is ambiguous. The standard requires that every character supported by a locale be representable as a single wchar_t value; if no locale supports any character outside the BMP, then UTF-16 could be seen as conformant.)

Dextroglucose answered 16/1, 2012 at 15:20 Comment(0)
C
4

You have to start by reading the first byte or two of the stream and deciding whether they are part of a BOM. It's a bit of a pain, since you can only putback a single byte, whereas you will typically want to read four. The simplest solution is to open the file, read the initial bytes, note how many you need to skip, then seek back to the beginning and skip them.

Chapter answered 16/1, 2012 at 13:32 Comment(4)
The UTF8 BOM is three bytes long. I'm assuming the stream is byte-sized, since it's a char-stream, so it can't really be UTF16 or UTF32.Jinnyjinrikisha
@KerrekSB You can read UTF-16 and UTF-32 as char streams, provided you have the appropriate locale. On the other hand, I don't know what they would do with a BOM. (IMHO, the BOM should really be the responsibility of the stream. Or rather of the codecvt facet it uses.)Chapter
I had forgotten about the locales. Do you have to write your own, or is there a UTF-16 one in the standard?Jinnyjinrikisha
@KerrekSB The only locale in the standard is "C". For the rest, it's all implementation dependent. For Linux, you can see what locales are available by listing /usr/lib/locale. I don't know of any equivalent for Windows, however.Chapter
E
0

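With a not-so-clean solution, I solved it by removing every character outside the printable ASCII range:

// True for any byte outside printable ASCII (' ' through '~'). This also
// drops tabs and all non-ASCII text, not just the BOM.
bool isNotPrintableAscii(unsigned char c)
{
    return (c < ' ' || c > '~');
}

...

str.erase(remove_if(str.begin(), str.end(), isNotPrintableAscii), str.end());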
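
A narrower variant of the same idea, sketched here under the assumption that the input is UTF-8, removes only the three BOM bytes from the front of the string and leaves everything else alone. stripUtf8Bom is a hypothetical name, and it should only be applied to the first line read from the file.

```cpp
#include <string>

// Hypothetical helper: removes a UTF-8 BOM (EF BB BF) from the front of `s`.
// Returns true if a BOM was removed. All other bytes are left untouched.
bool stripUtf8Bom(std::string &s)
{
    static const std::string bom("\xEF\xBB\xBF");
    if (s.compare(0, bom.size(), bom) == 0) {
        s.erase(0, bom.size());
        return true;
    }
    return false;
}
```

Because it works on the string and not on the stream, this also works when the input cannot seek.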
Edsel answered 21/7, 2021 at 15:48 Comment(0)
A
-1

Here's a simple C++ function to skip the BOM on an input stream on Windows. This assumes byte-sized data, as in UTF-8:

// skip BOM for UTF-8 on Windows
void skip_bom(auto& fs) {
    const unsigned char boms[]{ 0xef, 0xbb, 0xbf };
    bool have_bom{ true };
    for(const auto& c : boms) {
        if((unsigned char)fs.get() != c) have_bom = false;
    }
    if(!have_bom) {
        fs.clear();   // hitting EOF on a short file sets failbit, which would block seekg
        fs.seekg(0);
    }
}

It simply checks the first three bytes for the UTF-8 BOM signature, and skips them if they all match. There's no harm if there's no BOM.

Edit: This works with a file stream, but not with cin. I found it did work with cin on Linux with GCC 11, but that's clearly not portable. See @Dúthomhas's comment below.

Auxesis answered 23/1, 2022 at 21:0 Comment(5)
This question is 10 years old, and your solution assumes a seekable stream delivering byte-sized data. For example, std::cin is not seekable, meaning that skip_bom() leaves the input state unsynchronized wrt what the program wants to get if the first characters are anything but a UTF-8 BOM. — The correct method is to open the stream as a byte stream to identify its type, close it, then return a new file object reading that particular stream type and returning code points, not byte characters.Wedlock
@Dúthomhas – that makes sense – thanks. It's interesting that this works on Linux with GCC 11, and rdstate is 0 after the seekg() call. But it doesn't seem to work with Clang on macOS. I'll take your suggestion and rethink my approach.Auxesis
Your approach works just fine if you open the file temporarily for the sole purpose of testing for a BOM. (This is actually a common approach for identifying a file’s type and format. Once the file’s type is identified, the caller can choose what to do with it.) So your function’s caller could determine that the file is a valid UTF-8 bytestream with or without a complete BOM. (A partial BOM means it is not UTF-8.)Wedlock
AFAIK, seeking back on cin is an implementation capability tied to compiler + OS + stream source. If cin is a human at the keyboard, though, there is no OS I know of that supports any kind of backwards seek. It may not report failure, but the input pointer does not actually change.Wedlock
@Dúthomhas - My initial test was with a piped file. GCC/Linux must be handling that as a file stream. Anyway, I've updated the answer to explain that it's not going to work with cin. Thanks again.Auxesis
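
The probe-then-reopen approach suggested in the comments might look something like the sketch below. Both function names are hypothetical, only the UTF-8 BOM is checked, and the file is assumed not to change between the two opens. It applies to named files only, so it does not address std::cin.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: opens `path` just long enough to check whether it
// starts with the UTF-8 BOM bytes EF BB BF.
bool fileHasUtf8Bom(const std::string &path)
{
    std::ifstream probe(path.c_str(), std::ios::binary);
    char head[3] = {0, 0, 0};
    probe.read(head, 3);
    return probe.gcount() == 3 &&
           head[0] == '\xEF' && head[1] == '\xBB' && head[2] == '\xBF';
}   // probe is closed when it goes out of scope

// Hypothetical helper: opens `path` for reading, positioned after the BOM
// if one is present. The caller should check is_open() on the result.
std::ifstream openSkippingBom(const std::string &path)
{
    const bool bom = fileHasUtf8Bom(path);
    std::ifstream in(path.c_str());
    if (bom)
        in.ignore(3);   // consume the three BOM bytes; no seek is needed
    return in;
}
```

The second stream never seeks backwards, so the reading code does not depend on the stream being seekable once it has been opened.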
