How do I capture utf-8 decode errors in node.js?
I just discovered that Node (tested: v0.8.23, current git: v0.11.3-pre) ignores any decoding errors in its Buffer handling, silently replacing any non-utf8 characters with '\ufffd' (the Unicode REPLACEMENT CHARACTER) instead of throwing an exception about the non-utf8 input. As a consequence, fs.readFile, process.stdin.setEncoding and friends mask a large class of bad input errors for you.

Example which doesn't fail but really ought to:

> notValidUTF8 = new Buffer([ 128 ], 'binary')
<Buffer 80>
> decodedAsUTF8 = notValidUTF8.toString('utf8') // no exception thrown here!
'�'
> decodedAsUTF8 === '\ufffd'
true

'\ufffd' is a perfectly valid character that can occur in legal utf8 (as the sequence ef bf bd), so it is non-trivial to monkey-patch in error handling based on this showing up in the result.
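To make the ambiguity concrete, a short sketch (using the modern `Buffer.from` API, which postdates the question) showing that a genuinely invalid byte and a legitimately encoded U+FFFD decode to the very same string:

```javascript
// A buffer containing a *legitimate* U+FFFD (bytes ef bf bd) and a buffer
// containing an invalid byte decode to the same string, so scanning the
// output for '\ufffd' cannot tell a real decode error from valid input.
const legit = Buffer.from([0xef, 0xbf, 0xbd]); // valid UTF-8 encoding of U+FFFD
const broken = Buffer.from([0x80]);            // not valid UTF-8
console.log(legit.toString('utf8') === broken.toString('utf8')); // true
```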

Digging a little deeper, this seems to stem from Node simply deferring to V8's strings, which have the above behaviour; V8 itself never has to deal with an external world full of foreign-encoded data.

Are there node modules, or other approaches, that let me catch utf-8 decode errors, preferably with context about where in the input string or buffer the error was discovered?

Fusee answered 9/6, 2013 at 5:39 Comment(3)
I'm not sure, but have you looked at the encoding module? It may provide a way around your problem. npmjs.org/package/encoding – Chelseychelsie
Doing the decoding by hand (that is, not using node primitives) should be safe; iconv via encoding is probably the way to go where this need exists. – Fusee
@Fusee This is still a problem, after all these years! And iconv doesn't solve it either; it behaves the same way. – Deafanddumb
G
5

From node 8.3 on, you can use util.TextDecoder to solve this cleanly:

const util = require('util')
const td = new util.TextDecoder('utf8', {fatal:true})
td.decode(Buffer.from('foo')) // works!
td.decode(Buffer.from([ 128 ])) // throws TypeError

This will also work in some browsers by using TextDecoder in the global namespace.

Greenfinch answered 30/1, 2019 at 23:34 Comment(5)
Tidy! Thanks. :-) – Fusee
Unfortunately TextDecoder is broken in Electron: github.com/electron/electron/issues/18733 – Heres
Also, the fatal:true option will cause Node to throw an exception if the runtime was compiled without ICU support; see nodejs.org/api/util.html#util_new_textdecoder_encoding_options. This situation is not rare. – Heres
@Heres It looks like explicitly using util.TextDecoder may work OK in Electron until that bug gets sorted, assuming your Node has ICU support. – Greenfinch
github.com/hildjj/ctoaf-textdecoder does the best it can to find an implementation that works in your environment, then polyfills in a bad implementation if nothing is available. – Greenfinch
P
10

I hope you solved the problem in the intervening years; I had a similar one and eventually solved it with this ugly trick:

function isValidUTF8(buf) {
  return Buffer.compare(new Buffer(buf.toString(), 'utf8'), buf) === 0;
}

which converts the buffer back and forth and checks that it stays the same.

The 'utf8' encoding argument can be omitted.

Then we have:

> isValidUTF8(new Buffer('this is valid, 指事字 eè we hope','utf8'))
true
> isValidUTF8(new Buffer([128]))
false
> isValidUTF8(new Buffer('\ufffd'))
true

where the '\ufffd' character is correctly considered to be valid utf8.
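The round-trip trick only answers yes/no, while the question also asked for context about where the error occurs. As a hand-rolled sketch (not a library API; the name `firstInvalidUTF8` is invented here), a byte-level validator following the RFC 3629 rules can report the offset of the first bad sequence:

```javascript
// Returns the byte offset of the first invalid UTF-8 sequence in buf,
// or -1 if the whole buffer is valid.
function firstInvalidUTF8(buf) {
  let i = 0;
  while (i < buf.length) {
    const b = buf[i];
    if (b < 0x80) { i += 1; continue; }        // ASCII byte
    let len, cp;
    if ((b & 0xe0) === 0xc0)      { len = 2; cp = b & 0x1f; }
    else if ((b & 0xf0) === 0xe0) { len = 3; cp = b & 0x0f; }
    else if ((b & 0xf8) === 0xf0) { len = 4; cp = b & 0x07; }
    else return i;                             // stray continuation / bad lead byte
    if (i + len > buf.length) return i;        // truncated sequence
    for (let j = 1; j < len; j++) {
      if ((buf[i + j] & 0xc0) !== 0x80) return i; // not a continuation byte
      cp = (cp << 6) | (buf[i + j] & 0x3f);
    }
    const min = [0, 0, 0x80, 0x800, 0x10000][len];
    if (cp < min)                     return i; // overlong encoding
    if (cp >= 0xd800 && cp <= 0xdfff) return i; // UTF-16 surrogate half
    if (cp > 0x10ffff)                return i; // beyond Unicode range
    i += len;
  }
  return -1;
}

console.log(firstInvalidUTF8(Buffer.from('指事字'))); // -1
console.log(firstInvalidUTF8(Buffer.from([0x41, 0x80]))); // 1
```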

UPDATE: now this works in JXcore, too

Pollen answered 28/8, 2015 at 20:28 Comment(2)
Thanks – this solution at the very least covers all situations where different normalization forms don't confuse things. (As I recall, I opted to do the tooling I was working on at the time in Pike, a language with a very solid Unicode pedigree. :-) – Fusee
@Deafanddumb I didn't measure the speed. I expect the time for this operation to be much lower than what's required to get the data from disk or network; the memory usage, on the other hand, could be a problem if the buffer is huge, since isValidUTF8 creates two additional copies (the string and the re-converted buffer) every time it is called. – Pollen
P
0

As Josh C. said above: "npmjs.org/package/encoding"

From the npm website: "encoding is a simple wrapper around node-iconv and iconv-lite to convert strings from one encoding to another."

Download: $ npm install encoding

Example Usage

var encoding = require('encoding');
var result = encoding.convert(new Buffer([ 128 ], 'binary'), "utf8");
console.log(result); // <Buffer 80>

Visit the site: npm - encoding

Pulvinus answered 10/11, 2013 at 2:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.