Read Text File in D

Asked 17/1, 2011 at 21:14 Answered 19/1, 2011 at 15:48

Is there any one-size-fits-all (more or less) way to read a text file in D?

The requirement is that the function would auto-detect the encoding and give me the entire data of the file in a consistent format, like a string or a dstring. It should auto-detect BOMs and interpret them as appropriate.

I tried std.file.readText() but it doesn't handle different encodings well.

(Of course, this will have a non-zero failure rate, and that's acceptable for my application.)

Cantara asked 17/1, 2011 at 21:14 Comment(0)

I believe that the only real options for file I/O in Phobos at this point (aside from calling C functions) are std.file.readText and std.stdio.File. readText will read in a file as an array of chars, wchars, or dchars (defaulting to immutable(char)[], i.e. string). I believe that the file's encoding must be UTF-8, UTF-16, or UTF-32 for chars, wchars, or dchars respectively, though I'd have to go digging in the source code to be sure. Any encoding which is compatible with one of those (e.g. ASCII is compatible with UTF-8) should work just fine.
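As a minimal sketch of the readText behaviour described above: the helper below reads a file as a string and reports whether validation succeeded. The name `tryReadUtf8` is hypothetical (it is not part of Phobos), and the sketch assumes readText signals invalid input with std.utf.UTFException.

```d
import std.file : readText;
import std.utf : UTFException;

/// Hypothetical helper: tries to read `path` as UTF-8 text.
/// Returns true and fills `text` on success, false if validation fails.
bool tryReadUtf8(string path, out string text)
{
    try
    {
        text = readText(path);   // S defaults to string, i.e. immutable(char)[]
        return true;
    }
    catch (UTFException)
    {
        return false;            // contents are not UTF-8 compatible
    }
}
```

A missing or unreadable file still throws (a FileException); only the encoding failure is turned into a return value here.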

If you use File, then you have several options for functions to read the file with - including readln and rawRead. However, you essentially read the file in using a UTF-8, UTF-16, or UTF-32 compatible encoding just like with readText, or you read it in as binary data and manipulate it yourself.
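To illustrate the two styles with File, here is a rough sketch: one function reads lines as UTF-8 text with readln, the other pulls in the raw bytes with rawRead and leaves interpretation to the caller. The function names `readLines` and `readRaw` are made up for this example.

```d
import std.stdio : File;

/// Hypothetical helper: reads every line as a UTF-8 string.
/// readln keeps the line terminator and returns an empty string at end of file.
string[] readLines(string path)
{
    auto f = File(path, "r");
    string[] lines;
    for (string line = f.readln(); line.length; line = f.readln())
        lines ~= line;
    return lines;
}

/// Hypothetical helper: reads the whole file as bytes, with no validation.
ubyte[] readRaw(string path)
{
    auto f = File(path, "rb");
    auto buf = new ubyte[cast(size_t) f.size];
    if (buf.length == 0)
        return buf;            // avoid passing an empty buffer to rawRead
    return f.rawRead(buf);     // returns the slice that was actually filled
}
```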

Since the character types in D are char, wchar, and dchar, which are UTF-8, UTF-16, and UTF-32 code units respectively, the file is going to have to be encoded in an encoding compatible with one of those three types of Unicode, unless you want to read the data in binary format. Given a string in a particular encoding, you can convert it to another encoding using the functions in std.utf. However, I'm not aware of any way to query a file for its encoding type other than using readText to try to read the file in a given encoding and see if it succeeds.
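For the conversion step, a small sketch using toUTF8, toUTF16, and toUTF32 from std.utf. The function name `roundTrip` is only for illustration.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

/// Hypothetical demo: converts UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
string roundTrip(string s)
{
    wstring w = toUTF16(s);   // array of UTF-16 code units
    dstring d = toUTF32(w);   // array of UTF-32 code units (one per code point)
    return toUTF8(d);         // back to UTF-8 code units
}
```

Note that .length counts code units, so the same text generally has a different length in each of the three string types.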

So, unless you want to process a file yourself and determine on the fly what encoding it's in, your best bet is probably to just use readText with each consecutive string type, using the first one which succeeds. However, since text files are normally in UTF-8 or a UTF-8 compatible encoding, I would expect that readText used with a normal string would almost always work just fine.
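The "try each string type in turn" idea could look roughly like this. It is a sketch, not a robust detector: `readAnyUtf` is a hypothetical name, the file may be read up to three times, readText does not convert endianness or strip a BOM, and it assumes a width mismatch surfaces as a UTFException (older Phobos versions may fail differently on files whose size is not a multiple of the code unit size).

```d
import std.file : readText;
import std.utf : UTFException, toUTF8;

/// Hypothetical helper: returns the file contents as UTF-8, trying
/// string, then wstring, then dstring, and keeping the first that validates.
string readAnyUtf(string path)
{
    try
        return readText!string(path);
    catch (UTFException) {}

    try
        return toUTF8(readText!wstring(path));
    catch (UTFException) {}

    // Last attempt: let any failure propagate to the caller.
    return toUTF8(readText!dstring(path));
}
```

One caveat: BOM-less UTF-16 text in the ASCII range consists of ASCII bytes and NUL bytes, which is also valid UTF-8, so the first attempt can succeed on the wrong encoding.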

Zeebrugge answered 18/1, 2011 at 0:4 Comment(7)

Hm... any idea what to do with the BOMs? – Cantara 18/1, 2011 at 0:23

@Lambert, I highly suggest using read() as it won't do any validation, but you can do it yourself and aren't reading the file in multiple times. For the BOM you can cast to ubyte and compare the first bytes, then do a cast for the rest of the slice... – Partridge 18/1, 2011 at 0:36

Hm... not the solution I was hoping for (I didn't want to manually check the BOM) but it's not too bad I guess; thanks. – Cantara 18/1, 2011 at 5:27

@Lambert: checking the BOM shouldn't be too bad. IIRC it's only 32 bits (or less) and there are only about half a dozen values (including big/little-endian). – Cretan 19/1, 2011 at 15:36

@BCS: o__O BOMs are 32 bits? What about UTF-8's 0xEF,0xBB,0xBF? – Cantara 19/1, 2011 at 15:38

@BCS: Haha sorry I missed that part. Thanks for the info!! :D – Cantara 19/1, 2011 at 15:40

Update: some time later std.encoding came into existence and D can now handle input in a few common non-unicode encodings, in addition to the unicode ones. – Osbourne 25/7, 2017 at 0:16

As for dealing with checking the BOM:

char[] ConvertViaBOM(ubyte[] data) {
  // One nested helper per encoding; each converts `data` to UTF-8 (bodies elided).
  char[] UTF8()   { /*...*/ }
  char[] UTF16LE(){ /*...*/ }
  char[] UTF16BE(){ /*...*/ }
  char[] UTF32LE(){ /*...*/ }
  char[] UTF32BE(){ /*...*/ }

  // Check the longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
  switch (data.length) {
    default:  // 4 or more bytes
    case 4:
      if (data[0..4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
      if (data[0..4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
      goto case 3;

    case 3:
      if (data[0..3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
      goto case 2;

    case 2:
      if (data[0..2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
      if (data[0..2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
      goto case 1;

    case 0:   // empty input must not reach the slices above
    case 1:
      // No BOM found (or input too short for one): assume UTF-8.
      return UTF8();
  }
}

Adding more obscure BOMs is left as an exercise for the reader.
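As one possible shape for the elided helper bodies in ConvertViaBOM, here is a sketch of the UTF-16LE case. The name `utf16leToUtf8` and the convention that the caller has already removed the 2-byte BOM are assumptions made for this example. The code units are assembled byte by byte so that the result does not depend on the host's endianness; a trailing odd byte is ignored.

```d
import std.utf : toUTF8;

/// Hypothetical helper: decodes UTF-16LE bytes (BOM already removed) to UTF-8.
char[] utf16leToUtf8(const(ubyte)[] payload)
{
    auto units = new wchar[payload.length / 2];
    foreach (i, ref u; units)
    {
        // Little-endian: low byte first, then high byte.
        u = cast(wchar)(payload[2 * i] | (payload[2 * i + 1] << 8));
    }
    return toUTF8(units).dup;   // toUTF8 returns a string; dup makes it mutable
}
```

The big-endian and UTF-32 variants would follow the same pattern with the byte order and code unit width changed.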

Cretan answered 19/1, 2011 at 15:48 Comment(0)
