Conversion between UTF-8 ArrayBuffer and String
P

8

97

I have an ArrayBuffer which contains a string encoded using UTF-8 and I can't find a standard way of converting such ArrayBuffer into a JS String (which I understand is encoded using UTF-16).

I've seen this code in numerous places, but I fail to see how it would work with any UTF-8 code points that are longer than 1 byte.

return String.fromCharCode.apply(null, new Uint8Array(data));

Similarly, I can't find a standard way of converting from a String to a UTF-8 encoded ArrayBuffer.

Photosynthesis answered 19/6, 2013 at 13:3 Comment(9)
@LightStyle Thanks, completely missed that spelling mistake! :P - Photosynthesis
var uintArray = new Uint8Array("string".split('').map(function(char) {return char.charCodeAt(0);})); - Loanloanda
If that is what you need, I can explain it in an answer; otherwise I'll leave it as a comment ;) - Loanloanda
Will that definitely work on UTF-8 code points that are longer than 1 byte? - Photosynthesis
I don't know, but it should. Can't you try? - Loanloanda
I tried it with new Uint8Array("h€l".split('').map(function(char) {return char.charCodeAt(0);})); and it returned an array with 3 bytes. However, I believe it should be 5 bytes, because according to fileformat.info/info/unicode/char/20ac/index.htm the UTF-8 encoding of € is 0xE2 0x82 0xAC. - Photosynthesis
The one-liner you posted will decode bytes in the range 0x00–0xFF to their corresponding Unicode code points U+0000–U+00FF. In other words, it can't represent anywhere near the whole Unicode range. However, it just so happens that Unicode code points U+0000–U+00FF correspond exactly to ISO 8859-1 (Latin 1), so what you have written is in effect an ISO 8859-1 decoder. LightStyle's one-liner is the encoder that corresponds to the decoder in the question. In other words, it is an ISO 8859-1 encoder. - Leeward
@TomLeese You fixed the spelling mistake and now I have no idea what it was :( - Conjoined
Up-to-date answer here: stackoverflow.com/questions/6965107/… - Cloakanddagger
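
To make the Latin-1 point from the comments concrete, here is a minimal sketch that compares the one-liner from the question with a real UTF-8 decode of the same bytes. It assumes an environment that provides TextDecoder (current browsers and Node.js); the variable names are illustrative only.

```javascript
// UTF-8 bytes for "h€l": 0x68, then E2 82 AC for U+20AC, then 0x6C.
const bytes = new Uint8Array([0x68, 0xE2, 0x82, 0xAC, 0x6C]);

// The one-liner from the question maps every byte to one UTF-16 code unit,
// which is ISO 8859-1 (Latin-1) decoding, not UTF-8 decoding.
const latin1 = String.fromCharCode.apply(null, bytes);

// A UTF-8 decoder combines the three bytes E2 82 AC into one code point.
const utf8 = new TextDecoder("utf-8").decode(bytes);

console.log(latin1.length);       // 5
console.log(utf8.length);         // 3
console.log(utf8 === "h\u20ACl"); // true
```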
W
46
function stringToUint(string) {
    var string = btoa(unescape(encodeURIComponent(string))),
        charList = string.split(''),
        uintArray = [];
    for (var i = 0; i < charList.length; i++) {
        uintArray.push(charList[i].charCodeAt(0));
    }
    return new Uint8Array(uintArray);
}

function uintToString(uintArray) {
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(atob(encodedString)));
    return decodedString;
}

With some help from the internet, I wrote these little functions; they should solve your problem. Here is the working JSFiddle.

EDIT:

Since the source of the Uint8Array is external and you can't use atob, you just need to remove it (working fiddle):

function uintToString(uintArray) {
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(encodedString));
    return decodedString;
}

Warning: escape and unescape are deprecated and should not be used in new code. See this.

Welcome answered 19/6, 2013 at 13:42 Comment(11)
atob/btoa do base64 encoding/decoding; if you pass an honest UTF-8 byte array, it won't work: jsfiddle.net/Z9pQE/1 - Paletot
It is meant to work only with a Uint8Array of an encoded string; otherwise it won't work because of the btoa and atob conversion. - Loanloanda
I probably should've specified, but the UTF-8 string in the ArrayBuffer comes from a separate program written in a different programming language which produces pure UTF-8 strings, so as Esailija said, I can't use this as it does base64 encoding. - Photosynthesis
Wait. You can easily use this if the source is external, just don't use the atob function. I'm going to update this with a new fiddle, just 1 minute. - Loanloanda
Done. The same is true for the stringToUint function: just remove the btoa call and you're done :) - Loanloanda
You're welcome! Anyway, @Paletot your solution is great, worth +1! :D - Loanloanda
You saved my day! Just one addition: if you use it with huge arrays, you can easily get: [Error] RangeError: Maximum call stack size exceeded. To fix that I use .slice() and apply it in chunks. - Dineen
Glad to help! Feel free to edit the answer and add your solution :) - Loanloanda
Why the btoa() call in stringToUint()? To me that's completely wrong, and reducing that line to var string = unescape(encodeURIComponent(string)); works better for me. - Cajole
Just something that should be noted: if your array is sufficiently large, this solution will cause a stack overflow on the call to String.fromCharCode.apply. For some solutions, a loop may be better. - Mcqueen
This answer is outdated, go here: stackoverflow.com/questions/6965107/… - Cloakanddagger
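
The two issues raised in the comments (the stray base64 step and the stack overflow on large arrays) can be sketched away as follows. This is only an illustration, not the answerer's code: the helper names and the chunk size of 0x8000 are assumptions, and it still relies on the deprecated escape/unescape functions, so TextEncoder/TextDecoder are preferable where available.

```javascript
// Hypothetical helper: string -> UTF-8 bytes, without the btoa() step.
function stringToUintPlain(string) {
    // encodeURIComponent yields UTF-8 percent-escapes; unescape turns each
    // escape into a single character whose code is the byte value.
    var binary = unescape(encodeURIComponent(string));
    var out = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        out[i] = binary.charCodeAt(i);
    }
    return out;
}

// Hypothetical helper: UTF-8 bytes (a Uint8Array) -> string, in chunks.
// The chunk size is an assumption, chosen to stay below engine argument limits.
function uintToStringChunked(uintArray, chunkSize) {
    chunkSize = chunkSize || 0x8000;
    var binary = "";
    for (var i = 0; i < uintArray.length; i += chunkSize) {
        // subarray() creates a view on the same buffer, not a copy.
        binary += String.fromCharCode.apply(null, uintArray.subarray(i, i + chunkSize));
    }
    // Decode only after all chunks are joined, so a multi-byte sequence that
    // straddles a chunk boundary is still decoded correctly.
    return decodeURIComponent(escape(binary));
}
```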
S
109

Using TextEncoder and TextDecoder

var uint8array = new TextEncoder("utf-8").encode("Plain Text");
var string = new TextDecoder().decode(uint8array);
console.log(uint8array, string);
Sayette answered 16/12, 2016 at 8:47 Comment(7)
Support for this feature is sorely lacking in IE and Edge. - Matrix
And for some reason there is only a polyfill for TextEncoder; I'm assuming TextDecoder simply wouldn't work in IE right now. - Patronizing
Good answer, but using "Plain Text" is misleading: we aren't doing any cryptography here, encode != encrypt. - Eelpout
If you need IE support, you can use the FastestSmallestTextEncoderDecoder polyfill, recommended by the MDN website. - Agan
Note that the TextEncoder constructor doesn't accept any argument (it's always UTF-8, no matter what you pass in). The decoder, however, does accept an encoding argument (both the documentation and the actual behavior agree on this). - Vedi
@JosephGarrone "plain text" isn't a term that is restricted to cryptography... - Corenecoreopsis
For anyone coming across this question in 2021, every major browser supports TextEncoder/Decoder now: caniuse.com/textencoder - Spillway
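
Since the question is about an ArrayBuffer rather than a Uint8Array, here is a small sketch of the same approach with that type on both ends. The function names are hypothetical; the fatal option is a standard TextDecoder option that makes decode() throw on malformed input instead of silently inserting U+FFFD.

```javascript
const encoder = new TextEncoder(); // always UTF-8
const decoder = new TextDecoder("utf-8", { fatal: true });

// Hypothetical helper: JS string -> ArrayBuffer holding UTF-8 bytes.
function stringToArrayBuffer(str) {
    const bytes = encoder.encode(str);
    // Copy out exactly the bytes of the view, in case it does not span
    // its whole underlying buffer.
    return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

// Hypothetical helper: ArrayBuffer with UTF-8 bytes -> JS string.
// decode() accepts an ArrayBuffer directly, so no Uint8Array wrapper is needed.
function arrayBufferToString(buffer) {
    return decoder.decode(buffer);
}

const buffer = stringToArrayBuffer("h\u20ACl");
console.log(buffer.byteLength);           // 5
console.log(arrayBufferToString(buffer)); // h€l
```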
P
29

This should work:

// http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt

/* utf.js - UTF-8 <=> UTF-16 convertion
 *
 * Copyright (C) 1999 Masanao Izumo <[email protected]>
 * Version: 1.0
 * LastModified: Dec 25 1999
 * This library is free.  You can redistribute it and/or modify it.
 */

function Utf8ArrayToStr(array) {
  var out, i, len, c;
  var char2, char3;

  out = "";
  len = array.length;
  i = 0;
  while (i < len) {
    c = array[i++];
    switch (c >> 4)
    { 
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += String.fromCharCode(c);
        break;
      case 12: case 13:
        // 110x xxxx   10xx xxxx
        char2 = array[i++];
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx  10xx xxxx  10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        out += String.fromCharCode(((c & 0x0F) << 12) |
                                   ((char2 & 0x3F) << 6) |
                                   ((char3 & 0x3F) << 0));
        break;
    }
  }    
  return out;
}

It's somewhat cleaner than the other solutions because it doesn't use any hacks or depend on browser JS functions, so it also works in other JS environments.

Check out the JSFiddle demo.

Also see the related questions: here, here

Pommel answered 13/3, 2014 at 8:38 Comment(2)
What about when going from string to utf-8 buffer? - Grummet
This is the least readable code I've ever seen to implement char-code to string conversion. I appreciate and admire the effort put into it, but there are hundreds of more maintainable ways to achieve that. - Stefa
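
For the opposite direction asked about in the comments, a hand-written encoder in the same spirit might look like the sketch below. It is not part of the utf.js library quoted above; the function name is hypothetical. Unlike the decoder above, which handles only one- to three-byte sequences, this sketch also emits four-byte sequences for code points above U+FFFF and replaces lone surrogates with U+FFFD, which is what TextEncoder does.

```javascript
// Hypothetical helper: JS (UTF-16) string -> Uint8Array of UTF-8 bytes.
function strToUtf8Array(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var cp = str.codePointAt(i);
        if (cp > 0xFFFF) {
            i++; // a surrogate pair was consumed, skip its second half
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // lone surrogate: substitute the replacement character
        }
        if (cp < 0x80) {
            // 0xxxxxxx
            out.push(cp);
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            out.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            out.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                     0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
        }
    }
    return new Uint8Array(out);
}

console.log(strToUtf8Array("h\u20ACl")); // bytes 0x68 0xE2 0x82 0xAC 0x6C
```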
K
23

There's a polyfill for Encoding over on Github: text-encoding. It's easy for Node or the browser, and the Readme advises the following:

var uint8array = TextEncoder(encoding).encode(string);
var string = TextDecoder(encoding).decode(uint8array);

If I recall, 'utf-8' is the encoding you need, and of course you'll need to wrap your buffer:

var uint8array = new Uint8Array(utf8buffer);

Hope it works as well for you as it has for me.

Kc answered 13/5, 2014 at 22:5 Comment(3)
For anyone lazy like me: npm install text-encoding, var textEncoding = require('text-encoding'); var TextDecoder = textEncoding.TextDecoder; No thanks. - Goldwin
@KarthikHande That's what the polyfill is for. It's not supported by all browsers, so you also supply a pure JS implementation as an alternative. - Matrix
Beware: the library is HUGE. - Chor
P
13

If you are doing this in the browser, there are no character encoding libraries built in, but you can get by with:

function pad(n) {
    return n.length < 2 ? "0" + n : n;
}

var array = new Uint8Array(data);
var str = "";
for( var i = 0, len = array.length; i < len; ++i ) {
    str += ( "%" + pad(array[i].toString(16)))
}

str = decodeURIComponent(str);

Here's a demo that decodes a 3-byte UTF-8 sequence: http://jsfiddle.net/Z9pQE/

Paletot answered 19/6, 2013 at 13:39 Comment(1)
You're the best person in the world. - Halsy
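
The same percent-encoding trick can be run in reverse to go from a string to UTF-8 bytes. The sketch below is an illustration under that assumption, with a hypothetical function name. Note that encodeURIComponent throws a URIError if the string contains a lone surrogate.

```javascript
// Hypothetical helper: JS string -> Uint8Array of UTF-8 bytes.
function stringToUtf8Bytes(str) {
    // Every non-ASCII character (and some ASCII ones, including "%" itself)
    // comes back as %XX escapes of its UTF-8 bytes.
    var encoded = encodeURIComponent(str);
    var bytes = [];
    for (var i = 0; i < encoded.length; i++) {
        if (encoded.charAt(i) === "%") {
            // The next two characters are the byte value in hex.
            bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
            i += 2;
        } else {
            // Unescaped characters are plain ASCII, one byte each.
            bytes.push(encoded.charCodeAt(i));
        }
    }
    return new Uint8Array(bytes);
}

console.log(stringToUtf8Bytes("h\u20ACl")); // bytes 0x68 0xE2 0x82 0xAC 0x6C
```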
G
3

The readAsArrayBuffer and readAsText methods of a FileReader object asynchronously convert a Blob object to an ArrayBuffer or to a DOMString.

A Blob can be created from raw text or from a byte array, for example:

let blob = new Blob([text], { type: "text/plain" });

let reader = new FileReader();
reader.onload = event =>
{
    let buffer = event.target.result;
};
reader.readAsArrayBuffer(blob);

I think it's better to wrap this in a promise-like helper with a done callback:

function textToByteArray(text)
{
    let blob = new Blob([text], { type: "text/plain" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result));
    };
    reader.readAsArrayBuffer(blob);

    return { done: function(callback) { done = callback; } }
}

function byteArrayToText(bytes, encoding)
{
    let blob = new Blob([bytes], { type: "application/octet-stream" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(event.target.result);
    };

    if(encoding) { reader.readAsText(blob, encoding); } else { reader.readAsText(blob); }

    return { done: function(callback) { done = callback; } }
}

let text = "\uD83D\uDCA9 = \u2661";
textToByteArray(text).done(bytes =>
{
    console.log(bytes);
    byteArrayToText(bytes, 'UTF-8').done(text => 
    {
        console.log(text); // 💩 = ♡
    });
});
Gerontology answered 28/10, 2017 at 13:12 Comment(0)
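
The helpers above return an object with a done callback rather than a real Promise. In environments where Blob provides the promise-returning arrayBuffer() and text() methods (current browsers and recent Node.js versions), a sketch with real Promises needs no FileReader at all. The function names are hypothetical, and note that Blob.text() always decodes as UTF-8, so there is no encoding parameter here.

```javascript
// Hypothetical helper: string -> Promise<Uint8Array> of UTF-8 bytes.
function textToBytesAsync(text) {
    // Strings placed in a Blob are serialized as UTF-8.
    return new Blob([text]).arrayBuffer().then(function (buffer) {
        return new Uint8Array(buffer);
    });
}

// Hypothetical helper: UTF-8 bytes -> Promise<string>.
function bytesToTextAsync(bytes) {
    return new Blob([bytes]).text();
}

textToBytesAsync("\uD83D\uDCA9 = \u2661")
    .then(function (bytes) { return bytesToTextAsync(bytes); })
    .then(function (text) { console.log(text); }); // 💩 = ♡
```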
A
3

If you don't want to use any external polyfill library, you can use this function provided by the Mozilla Developer Network website:

function utf8ArrayToString(aBytes) {
    var sView = "";
    
    for (var nPart, nLen = aBytes.length, nIdx = 0; nIdx < nLen; nIdx++) {
        nPart = aBytes[nIdx];
        
        sView += String.fromCharCode(
            nPart > 251 && nPart < 254 && nIdx + 5 < nLen ? /* six bytes */
                /* (nPart - 252 << 30) may be not so safe in ECMAScript! So...: */
                (nPart - 252) * 1073741824 + (aBytes[++nIdx] - 128 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 247 && nPart < 252 && nIdx + 4 < nLen ? /* five bytes */
                (nPart - 248 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 239 && nPart < 248 && nIdx + 3 < nLen ? /* four bytes */
                (nPart - 240 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 223 && nPart < 240 && nIdx + 2 < nLen ? /* three bytes */
                (nPart - 224 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 191 && nPart < 224 && nIdx + 1 < nLen ? /* two bytes */
                (nPart - 192 << 6) + aBytes[++nIdx] - 128
            : /* nPart < 127 ? */ /* one byte */
                nPart
        );
    }
    
    return sView;
}

let str = utf8ArrayToString([50,72,226,130,130,32,43,32,79,226,130,130,32,226,135,140,32,50,72,226,130,130,79]);

// Must show 2H₂ + O₂ ⇌ 2H₂O
console.log(str);
Agan answered 5/12, 2019 at 3:57 Comment(1)
See the up-to-date answer: stackoverflow.com/questions/6965107/… - Cloakanddagger
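
One caveat with the function as shown: it passes every decoded value to String.fromCharCode, which keeps only the low 16 bits of its argument, so the four-byte branch cannot produce a correct character above U+FFFF. (The five- and six-byte branches correspond to forms that modern UTF-8 no longer allows.) The small sketch below shows the difference; using String.fromCodePoint is a suggested substitution, not part of the quoted code.

```javascript
// U+1F4A9 is encoded in UTF-8 as the four bytes F0 9F 92 A9.
var codePoint = 0x1F4A9;

// fromCharCode truncates to 16 bits: 0x1F4A9 becomes 0xF4A9.
var truncated = String.fromCharCode(codePoint);

// fromCodePoint emits the proper surrogate pair D83D DCA9.
var correct = String.fromCodePoint(codePoint);

console.log(truncated === "\uF4A9");      // true
console.log(correct === "\uD83D\uDCA9"); // true
```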
P
1

The main difficulty when converting a byte array into a string is decoding the UTF-8 encoding of the Unicode characters. This code will help you:

var getString = function (strBytes) {

    var MAX_SIZE = 0x4000;
    var codeUnits = [];
    var highSurrogate;
    var lowSurrogate;
    var index = -1;

    var result = '';

    while (++index < strBytes.length) {
        var codePoint = Number(strBytes[index]);

        if (codePoint === (codePoint & 0x7F)) {

        } else if (0xF0 === (codePoint & 0xF0)) {
            codePoint ^= 0xF0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xE0 === (codePoint & 0xE0)) {
            codePoint ^= 0xE0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xC0 === (codePoint & 0xC0)) {
            codePoint ^= 0xC0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        }

        if (!isFinite(codePoint) || codePoint < 0 || codePoint > 0x10FFFF || Math.floor(codePoint) != codePoint)
            throw RangeError('Invalid code point: ' + codePoint);

        if (codePoint <= 0xFFFF)
            codeUnits.push(codePoint);
        else {
            codePoint -= 0x10000;
            highSurrogate = (codePoint >> 10) | 0xD800;
            lowSurrogate = (codePoint % 0x400) | 0xDC00;
            codeUnits.push(highSurrogate, lowSurrogate);
        }
        if (index + 1 == strBytes.length || codeUnits.length > MAX_SIZE) {
            result += String.fromCharCode.apply(null, codeUnits);
            codeUnits.length = 0;
        }
    }

    return result;
}

All the best !

Pockmark answered 18/6, 2017 at 12:23 Comment(5)
That's not complete. For example, German umlauts are missing! - Eba
By the way, I noticed that the if statements were in the wrong order. That may be why your string was not processed. I have corrected it in my own code, but forgot to correct it in this post. - Pockmark
ö = RangeError: Invalid code point: 1581184, ü = RangeError: Invalid code point: 3678336 - Eba
I have changed the code above; please try it once more. There was a problem with the ordering of the "else if" statements. Now it should work for your case too. The code was tested with more than 30 languages, including Japanese, Korean, and Arabic. - Pockmark
For example, here are words I have transferred as bytes and restored to a string in JavaScript: Hälfte, Über - Pockmark
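
The ordering problem discussed in the comments arises because a loose test such as (b & 0xC0) === 0xC0 also matches three- and four-byte lead bytes, so the widest pattern has to be checked first. A sketch that avoids the dependency on ordering, by comparing against the exact bit pattern of each lead byte, is given below together with the surrogate-pair arithmetic used in the answer. Both function names are hypothetical.

```javascript
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with leadByte, or 0 for a continuation byte or an invalid lead byte.
function sequenceLength(leadByte) {
    if ((leadByte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if ((leadByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if ((leadByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if ((leadByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: split a code point above U+FFFF into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)];
}

console.log(sequenceLength(0xC3)); // 2 (lead byte of "ö", C3 B6)
console.log(sequenceLength(0xE2)); // 3 (lead byte of "€", E2 82 AC)
console.log(toSurrogatePair(0x1F4A9).map(function (u) { return u.toString(16); })); // [ 'd83d', 'dca9' ]
```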

© 2022 - 2024 — McMap. All rights reserved.