How do I capture utf-8 decode errors in node.js?
I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores any decoding errors in its Buffer handling, silently replacing any non-utf8 characters with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the non-utf8 input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad input errors for you.

Example which doesn't fail but really ought to:

> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true

'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.
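To make the ambiguity concrete, a short sketch (using the modern `Buffer.from` API, which postdates the question) showing that a genuinely invalid byte and a legitimately encoded U+FFFD decode to the very same string:

```javascript
// A buffer containing a *legitimate* U+FFFD (bytes ef bf bd) and a buffer
// containing an invalid byte decode to the same string, so scanning the
// output for '\ufffd' cannot tell a real decode error from valid input.
const legit = Buffer.from([0xef, 0xbf, 0xbd]); // valid UTF-8 encoding of U+FFFD
const broken = Buffer.from([0x80]);            // not valid UTF-8
console.log(legit.toString('utf8') === broken.toString('utf8')); // true
```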

Digging a little deeper, this seems to stem from Node simply deferring to V8's strings, which have the above behaviour; V8 itself never has to deal with an external world full of foreign-encoded data.

Are there node modules, or other approaches, that let me catch utf-8 decode errors, preferably with context about where in the input string or buffer the error was discovered?

Fusee answered 9/6, 2013 at 5:39 Comment(3)
I'm not sure, but have you looked at the encoding module? It may provide a way around your problem. npmjs.org/package/encoding – Chelseychelsie
Doing the decoding by hand (that is, not using node primitives) should be safe; iconv via encoding is probably the way to go where this need exists. – Fusee
@Fusee This is still a problem, after all these years! And iconv doesn't solve it either; it behaves the same way. – Deafanddumb
G
5

From node 8.3 on, you can use util.TextDecoder to solve this cleanly:

const util = require('util')
const td = new util.TextDecoder('utf8', {fatal:true})
td.decode(Buffer.from('foo')) // works!
td.decode(Buffer.from([ 128 ])) // throws TypeError

This will also work in some browsers by using TextDecoder in the global namespace.

Greenfinch answered 30/1, 2019 at 23:34 Comment(5)
Tidy! Thanks. :-) – Fusee
Unfortunately TextDecoder is broken in Electron: github.com/electron/electron/issues/18733 – Heres
Also, the fatal:true option will cause Node to throw an exception if the runtime was compiled without ICU support; see nodejs.org/api/util.html#util_new_textdecoder_encoding_options. This situation is not rare. – Heres
@Heres It looks like explicitly using util.TextDecoder may work OK in Electron until that bug gets sorted, assuming your Node has ICU support. – Greenfinch
github.com/hildjj/ctoaf-textdecoder does the best it can to find an implementation that works in your environment, then polyfills in a bad implementation if nothing is available. – Greenfinch
P
10

I hope you solved the problem in the intervening years; I had a similar one and eventually solved it with this ugly trick:

function isValidUTF8(buf) {
  return Buffer.compare(new Buffer(buf.toString(), 'utf8'), buf) === 0;
}

which converts the buffer back and forth and checks that it stays the same.

The 'utf8' encoding argument can be omitted.

Then we have:

> isValidUTF8(new Buffer('this is valid, 指事字 eè we hope','utf8'))
true
> isValidUTF8(new Buffer([128]))
false
> isValidUTF8(new Buffer('\ufffd'))
true

where the '\ufffd' character is correctly considered to be valid utf8.
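The round-trip trick only answers yes/no, while the question also asked for context about where the error occurs. As a hand-rolled sketch (not a library API; the name `firstInvalidUTF8` is invented here), a byte-level validator following the RFC 3629 rules can report the offset of the first bad sequence:

```javascript
// Returns the byte offset of the first invalid UTF-8 sequence in buf,
// or -1 if the whole buffer is valid.
function firstInvalidUTF8(buf) {
  let i = 0;
  while (i < buf.length) {
    const b = buf[i];
    if (b < 0x80) { i += 1; continue; }        // ASCII byte
    let len, cp;
    if ((b & 0xe0) === 0xc0)      { len = 2; cp = b & 0x1f; }
    else if ((b & 0xf0) === 0xe0) { len = 3; cp = b & 0x0f; }
    else if ((b & 0xf8) === 0xf0) { len = 4; cp = b & 0x07; }
    else return i;                             // stray continuation / bad lead byte
    if (i + len > buf.length) return i;        // truncated sequence
    for (let j = 1; j < len; j++) {
      if ((buf[i + j] & 0xc0) !== 0x80) return i; // not a continuation byte
      cp = (cp << 6) | (buf[i + j] & 0x3f);
    }
    const min = [0, 0, 0x80, 0x800, 0x10000][len];
    if (cp < min)                     return i; // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return i; // UTF-16 surrogate half
    if (cp > 0x10ffff)                return i; // beyond Unicode range
    i += len;
  }
  return -1;
}

console.log(firstInvalidUTF8(Buffer.from('指事字'))); // -1
console.log(firstInvalidUTF8(Buffer.from([0x41, 0x80]))); // 1
```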

UPDATE: now this works in JXcore, too

Pollen answered 28/8, 2015 at 20:28 Comment(2)
Thanks – this solution at the very least covers all situations where different normalization forms don't confuse things. (As I recall, I opted to do the tooling I was working on at the time in Pike, a language with a very solid Unicode pedigree. :-) – Fusee
@Deafanddumb I didn't measure the speed. I expect the time for this operation to be much lower than what's required to get the data from disk or network; the memory usage, on the other hand, could be a problem if the buffer is huge, since isValidUTF8 creates two additional copies (the string and the re-converted buffer) every time it is called. – Pollen
P
0

As Josh C. said above: "npmjs.org/package/encoding"

From the npm website: "encoding is a simple wrapper around node-iconv and iconv-lite to convert strings from one encoding to another."

Download: $ npm install encoding

Example Usage

var encoding = require('encoding');
var result = encoding.convert(new Buffer([ 128 ], 'binary'), "utf8");
console.log(result); // <Buffer 80>

Visit the site: npm - encoding

Pulvinus answered 10/11, 2013 at 2:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.