How to convert UTF8 string to byte array?
E

10

67

The .charCodeAt function returns the Unicode code of a character, but I would like to get the byte array instead. I know that if the char code is over 127, the character is stored in two or more bytes.

var arr=[];
for(var i=0; i<str.length; i++) {
    arr.push(str.charCodeAt(i))
}
Epigram answered 10/9, 2013 at 21:56 Comment(0)
L
81

The logic of encoding Unicode in UTF-8 is basically:

  • Up to 4 bytes per character can be used. The fewest number of bytes possible is used.
  • Characters up to U+007F are encoded with a single byte.
  • For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The rest of the bits of the first byte can be used to encode bits of the character.
  • The continuation bytes begin with 10, and the other 6 bits encode bits of the character.
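
As a rough illustration of the rules above, here is a minimal sketch that encodes a single code point. The name encodeCodePoint is hypothetical (it is not part of any library), and the sketch assumes the input is a valid code point in the range 0 to 0x10FFFF:

```javascript
// Hypothetical helper: encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes.
function encodeCodePoint(cp) {
    if (cp < 0x80) {
        return [cp];                        // 0xxxxxxx
    }
    if (cp < 0x800) {
        return [0xc0 | (cp >> 6),           // 110xxxxx
                0x80 | (cp & 0x3f)];        // 10xxxxxx
    }
    if (cp < 0x10000) {
        return [0xe0 | (cp >> 12),          // 1110xxxx
                0x80 | ((cp >> 6) & 0x3f),  // 10xxxxxx
                0x80 | (cp & 0x3f)];        // 10xxxxxx
    }
    return [0xf0 | (cp >> 18),              // 11110xxx
            0x80 | ((cp >> 12) & 0x3f),     // 10xxxxxx
            0x80 | ((cp >> 6) & 0x3f),      // 10xxxxxx
            0x80 | (cp & 0x3f)];            // 10xxxxxx
}

console.log(encodeCodePoint(0x41));    // [65]
console.log(encodeCodePoint(0xe9));    // [195, 169]
console.log(encodeCodePoint(0x20ac));  // [226, 130, 172]
console.log(encodeCodePoint(0x1f4a9)); // [240, 159, 146, 169]
```

The number of leading 1 bits in the first byte (none, two, three or four) matches the length of each sequence.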

Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6), 
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18), 
                      0x80 | ((charcode>>12) & 0x3f), 
                      0x80 | ((charcode>>6) & 0x3f), 
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
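
The surrogate-pair arithmetic in the last branch can be checked in isolation. This is only a sketch: combineSurrogates is a hypothetical helper name, and it assumes it is given a well-formed high/low pair.

```javascript
// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// Each surrogate carries 10 payload bits; 0x10000 is added back at the end.
function combineSurrogates(high, low) {
    return 0x10000 + (((high & 0x3ff) << 10) | (low & 0x3ff));
}

var s = "\u{1f4a9}";
var high = s.charCodeAt(0); // 0xD83D
var low = s.charCodeAt(1);  // 0xDCA9

console.log(combineSurrogates(high, low).toString(16));         // "1f4a9"
console.log(combineSurrogates(high, low) === s.codePointAt(0)); // true
```

A lone surrogate (for example a high surrogate at the very end of the string) is not a well-formed pair, so it is worth validating the input first if that can occur.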
Lanilaniard answered 10/9, 2013 at 22:43 Comment(5)
The result is not the same as unescape(encodeURIComponent()). https://mcmap.net/q/21088/-how-to-convert-utf8-string-to-byte-array - Epigram
@donkaka It should match when compared to arr after the for loop, though. jsfiddle.net/3Uz8n - Cavil
Looks similar to onicos.com/staff/iz/amuse/javascript/expert/utf.txt which worked for me on a string containing obscure 4-byte characters in the CJK Unified Extension B. - Durarte
This is about 89% faster than the leading answer. Nice work. - Putdown
A similar function inside the Google Closure library: stringToUtf8ByteArray(). The fact that strings are UTF-16 in JavaScript's memory has been an eye-opener to me 🤔😳 - Philosophize
C
47

JavaScript Strings are stored in UTF-16. To get UTF-8, you'll have to convert the String yourself.

One way is to mix encodeURIComponent(), which will output UTF-8 bytes URL-encoded, with unescape, as mentioned on ecmanaut.

var utf8 = unescape(encodeURIComponent(str));

var arr = [];
for (var i = 0; i < utf8.length; i++) {
    arr.push(utf8.charCodeAt(i));
}
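
The reason this works is that encodeURIComponent() writes every UTF-8 byte of a non-ASCII character as a %XX escape. As a sketch of the same idea without unescape (which is marked as deprecated), the escapes can be parsed directly. The function name below is hypothetical:

```javascript
// Hypothetical helper: read the UTF-8 bytes out of the percent-encoded form.
// Note: encodeURIComponent() throws a URIError for lone surrogates.
function utf8BytesFromPercentEncoding(str) {
    var encoded = encodeURIComponent(str);
    var bytes = [];
    for (var i = 0; i < encoded.length; i++) {
        if (encoded[i] === "%") {
            // "%E2" -> 0xE2: the two hex digits are one UTF-8 byte
            bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
            i += 2;
        } else {
            // characters left unescaped are plain ASCII, i.e. one byte each
            bytes.push(encoded.charCodeAt(i));
        }
    }
    return bytes;
}

console.log(encodeURIComponent("€"));           // "%E2%82%AC"
console.log(utf8BytesFromPercentEncoding("€")); // [226, 130, 172]
```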
Cavil answered 10/9, 2013 at 22:6 Comment(5)
Thanks, it works. But I would like to understand how to code this Unicode to UTF-8 byte conversion. Could you please link me to an article about it? I haven't found any. - Epigram
@donkaka I linked to one in my post. ecmanaut.blogspot.com/2006/07/…. Are you wanting to convert it manually, code-by-code? - Cavil
Yes. encodeURIComponent works well, but I would like to understand how the UTF-8 bytes are generated. - Epigram
Wikipedia actually has a good summary of UTF-8 conversion. en.wikipedia.org/wiki/UTF-8#Description The examples demonstrate how the bits of the original code point are spread and what prefixes are applied to aid decoding later. Coding it gets complicated by UTF-16 surrogate pairs, but it is based on bitwise shifting and masking with AND or OR. - Cavil
Here are some more examples, if you want to convert between UTF-8 text and hex, binary, or base64: jsfiddle.net/47zwb41o - Aceto
A
46

The Encoding API lets you both encode and decode UTF-8 easily (using typed arrays):

var encoded = new TextEncoder().encode("Γεια σου κόσμε");
var decoded = new TextDecoder("utf-8").decode(encoded);
    
console.log(encoded, decoded);

Browser support isn't too bad, and there's a polyfill that should work in IE11 and older versions of Edge.
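
Since the question asks for a plain array, here is a small sketch that wraps the typed array returned by encode(). The wrapper name stringToUtf8Array is hypothetical, and the sketch assumes TextEncoder is available in the environment:

```javascript
// Hypothetical wrapper: UTF-8 bytes of a string as a plain Array of numbers.
function stringToUtf8Array(str) {
    // encode() returns a Uint8Array; Array.from() copies it into a regular Array
    return Array.from(new TextEncoder().encode(str));
}

console.log(stringToUtf8Array("€"));         // [226, 130, 172]
console.log(stringToUtf8Array("\u{1f4a9}")); // [240, 159, 146, 169]
```

If a Uint8Array is acceptable, the Array.from() copy can be dropped.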

While TextEncoder can only encode to UTF-8, TextDecoder supports other encodings. I used it to decode Japanese text (Shift-JIS) in this way:

// Shift-JIS encoded text; must be a byte array due to values 129 and 130.
var arr = [130, 108, 130, 102, 130, 80, 129,  64, 130, 102, 130,  96, 130, 108, 130, 100,
           129,  64, 130,  99, 130, 96, 130, 115, 130,  96, 129, 124, 130,  79, 130, 80];
// Convert to byte array
var data = new Uint8Array(arr);
// Decode with TextDecoder
var decoded = new TextDecoder("shift-jis").decode(data.buffer);
console.log(decoded);
Ancon answered 15/9, 2017 at 13:42 Comment(5)
.decode() doesn't work on a string though, so it's no use if you're trying to decode a string of bytes that happens to be in UTF-8 format (which can happen in some environments). - Whiffen
If you have a string of hex bytes like "DEADBEEF", you cannot use it directly. You need to translate it to a TypedArray for it to be decoded. Can be done in 4 lines of code: paste2.org/5KHPxdVO - Ancon
In my case I actually had a JavaScript (UTF-16) string that had UTF-8 character codes. Actually it was worse than that, as 0x80 was represented as something else again (the Unicode for the Euro symbol), etc. Still trying to work out a better solution; I should be able to read the data into an Array instead. But unfortunately TextDecoder is an issue for IE/Edge. - Whiffen
@DylanNicholson 2022: what is IE? - Entourage
Internet Explorer? That comment is almost 4 years old now so most likely whatever issues I had then are no longer relevant. - Whiffen
G
11

The Google Closure library has functions to convert between strings and UTF-8 byte arrays. If you don't want to use the whole library, you can copy the functions from here. For completeness, the code to convert a string to a UTF-8 byte array is:

goog.crypt.stringToUtf8ByteArray = function(str) {
  // TODO(user): Use native implementations if/when available
  var out = [], p = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 128) {
      out[p++] = c;
    } else if (c < 2048) {
      out[p++] = (c >> 6) | 192;
      out[p++] = (c & 63) | 128;
    } else if (
        ((c & 0xFC00) == 0xD800) && (i + 1) < str.length &&
        ((str.charCodeAt(i + 1) & 0xFC00) == 0xDC00)) {
      // Surrogate Pair
      c = 0x10000 + ((c & 0x03FF) << 10) + (str.charCodeAt(++i) & 0x03FF);
      out[p++] = (c >> 18) | 240;
      out[p++] = ((c >> 12) & 63) | 128;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    } else {
      out[p++] = (c >> 12) | 224;
      out[p++] = ((c >> 6) & 63) | 128;
      out[p++] = (c & 63) | 128;
    }
  }
  return out;
};
Gerstein answered 30/1, 2015 at 1:5 Comment(2)
Google moved Closure to GitHub. Updated the link (and also updated the code snippet as the function implementation had changed too). - Gerstein
Here is the updated link: stringToUtf8ByteArray() - Philosophize
P
7

Assuming the question is about a DOMString as input, and the goal is to get an array that, when interpreted as a string (e.g. written to a file on disk), would be UTF-8 encoded:

Now that nearly all modern browsers support Typed Arrays, it would be a shame if this approach were not listed:

  • According to the W3C, software supporting the File API should accept DOMStrings in their Blob constructor (see also: String encoding when constructing a Blob)
  • Blobs can be converted to an ArrayBuffer using the .readAsArrayBuffer() function of a File Reader
  • Using a DataView or constructing a Typed Array with the buffer read by the File Reader, one can access every single byte of the ArrayBuffer

Example:

// Create a Blob containing a Euro sign (U+20AC)
var b = new Blob(['€']);
var fr = new FileReader();

fr.onload = function() {
    var ua = new Uint8Array(fr.result);
    // This will log "3|226|130|172"
    //                  E2  82  AC
    // In UTF-16, it would be only 2 bytes long
    console.log(
        fr.result.byteLength + '|' + 
        ua[0]  + '|' + 
        ua[1] + '|' + 
        ua[2] + ''
    );
};
fr.readAsArrayBuffer(b);

Play with that on JSFiddle. I haven't benchmarked this yet but I can imagine this being efficient for large DOMStrings as input.
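
In environments that provide Blob.prototype.arrayBuffer() (an assumption; it is newer than the FileReader approach above), the same idea can be sketched with a Promise instead of an onload callback. The function name is hypothetical:

```javascript
// Hypothetical helper: assumes Blob.prototype.arrayBuffer() is available.
// The Blob constructor encodes string parts as UTF-8.
function stringToBytesAsync(str) {
    return new Blob([str]).arrayBuffer().then(function (buffer) {
        return new Uint8Array(buffer);
    });
}

stringToBytesAsync("€").then(function (bytes) {
    console.log(Array.from(bytes)); // [226, 130, 172]
});
```

Like the FileReader version, this is asynchronous, so the bytes are only available inside the callback.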

Portentous answered 23/12, 2014 at 20:15 Comment(1)
This is great. No need for insane bit-twiddling in JS, just pass it straight into the Blob constructor. Thanks! - Chrysoprase
G
2

You can get the raw bytes of a string by using a FileReader.

Save the string in a Blob and call readAsArrayBuffer(). The onload event then delivers an ArrayBuffer, which can be converted into a Uint8Array. Unfortunately this call is asynchronous.

This little function will help you:

function stringToBytes(str)
{
    let reader = new FileReader();
    let done = () => {};

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result), str);
    };
    reader.readAsArrayBuffer(new Blob([str], { type: "application/octet-stream" }));

    return { done: callback => { done = callback; } };
}

Call it like this:

stringToBytes("\u{1f4a9}").done(bytes =>
{
    console.log(bytes);
});

output: [240, 159, 146, 169]

explanation:

JavaScript uses UTF-16 and surrogate pairs to store Unicode characters in memory. To save Unicode characters in a raw binary byte stream, an encoding is necessary. In most cases, UTF-8 is used for this. Without an encoding you can't save Unicode characters, just ASCII up to 0x7f.

The Blob constructor encodes strings as UTF-8, and FileReader.readAsArrayBuffer() returns those bytes unchanged.

Gazo answered 26/1, 2018 at 13:52 Comment(0)
W
1

As there is no pure byte type in JavaScript we can represent a byte array as an array of numbers, where each number represents a byte and thus will have an integer value between 0 and 255 inclusive.

Here is a simple function that converts a JavaScript string into an Array of numbers containing the UTF-8 encoding of the string:

function toUtf8(str) {
    var value = [];
    var destIndex = 0;
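    for (var index = 0; index < str.length; index++) {
        // codePointAt() combines a surrogate pair into one code point;
        // charCodeAt() would return the two halves separately.
        var code = str.codePointAt(index);
        if (code > 0xFFFF) index++; // skip the low surrogate just consumed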
        if (code <= 0x7F) {
            value[destIndex++] = code;
        } else if (code <= 0x7FF) {
            value[destIndex++] = ((code >> 6 ) & 0x1F) | 0xC0;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0xFFFF) {
            value[destIndex++] = ((code >> 12) & 0x0F) | 0xE0;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0x1FFFFF) {
            value[destIndex++] = ((code >> 18) & 0x07) | 0xF0;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
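        } else if (code <= 0x03FFFFFF) {
            // 5- and 6-byte forms are not valid UTF-8 (RFC 3629 stops at U+10FFFF);
            // these branches are unreachable and kept only for illustration.
            value[destIndex++] = ((code >> 24) & 0x03) | 0xF8;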
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else if (code <= 0x7FFFFFFF) {
            value[destIndex++] = ((code >> 30) & 0x01) | 0xFC;
            value[destIndex++] = ((code >> 24) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
        } else {
            throw new Error("Unsupported Unicode character \"" 
                + str.charAt(index) + "\" with code " + code + " (binary: " 
                + toBinary(code) + ") at index " + index
                + ". Cannot represent it as UTF-8 byte sequence.");
        }
    }
    return value;
}

function toBinary(byteValue) {
    if (byteValue < 0) {
        byteValue = byteValue & 0x00FF;
    }
    var str = byteValue.toString(2);
    var len = str.length;
    var prefix = "";
    for (var i = len; i < 8; i++) {
        prefix += "0";
    }
    return prefix + str;
}
Weide answered 27/3, 2020 at 19:40 Comment(0)
L
0

I was using Joni's solution and it worked fine, but this one is much shorter.

This was inspired by the atobUTF16() function of Solution #3 of Mozilla's Base64 Unicode discussion

function convertStringToUTF8ByteArray(str) {
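    // Note: each UTF-16 code unit is truncated to 8 bits, so the result
    // is only valid UTF-8 for ASCII input (code units 0-127).
    let binaryArray = new Uint8Array(str.length)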
    Array.prototype.forEach.call(binaryArray, function (el, idx, arr) { arr[idx] = str.charCodeAt(idx) })
    return binaryArray
}
Lucent answered 31/5, 2019 at 22:59 Comment(1)
This will not work for non-ASCII characters. This is because JavaScript strings are UTF-16. charCodeAt will return a number between 0 and 65535, and a given Uint8Array index can only store 0 to 255. - Blond
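
To make the truncation concrete, here is a short sketch (variable names are illustrative only) comparing what a Uint8Array stores for the Euro sign with its actual UTF-8 bytes. It assumes TextEncoder is available for the comparison:

```javascript
var code = "€".charCodeAt(0);      // 8364 (0x20AC), a single UTF-16 code unit
var truncated = new Uint8Array(1);
truncated[0] = code;               // typed arrays store the value modulo 256

console.log(truncated[0]);         // 172 (0xAC), the high byte 0x20 is lost
console.log(Array.from(new TextEncoder().encode("€"))); // [226, 130, 172]
```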
H
0

In my tests (and to the best of my understanding) this gives the same results as the unescape(encodeURIComponent(instr)) method, but without using escape / unescape

    function utf8_toBinary(instr) {
        //this is the same as unescape(encodeURIComponent(instr))
        const binAry = (new TextEncoder().encode(instr));
        let safeStr = String.fromCharCode(...binAry);
        return btoa(safeStr);
    }

    function binary_toUtf8(binstr) {
        let safeStr = atob(binstr);
        let arr = new Uint8Array(safeStr.length);
        for (let i = 0; i < safeStr.length; i++) {
            arr[i] = safeStr.charCodeAt(i);
        }
        return new TextDecoder().decode(arr);
    }
Helmsman answered 28/4, 2023 at 20:31 Comment(0)
D
-1
function convertByte()
{
    var c=document.getElementById("str").value;
    var arr = [];
    // Note: charCodeAt() returns UTF-16 code units, not UTF-8 bytes.
    for(var ind=0;ind<c.length;ind++)
    {
        arr[ind]=c.charCodeAt(ind);
    }    
    document.getElementById("result").innerHTML="The converted value is "+arr.join(",");    
}
Damnable answered 27/6, 2020 at 18:35 Comment(1)
Welcome to Stack Overflow. Code-only answers can generally be improved by explaining how and why they work and, in the case of adding an answer to an older question with existing answers and an accepted answer, by pointing out what new aspect of the question this answer addresses. - Khajeh
