How to convert large UTF-8 strings into ASCII?
S

11

6

I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.

How can I do this? I need the source code (using loops) or the JavaScript code. (should not be dependent on any platform/framework/library)

Edit: I understand that the ASCII representation will not look correct and will be larger (in terms of bytes) than its UTF-8 counterpart, since it's an encoded form of the UTF-8 original.

Superphosphate answered 7/5, 2009 at 12:17 Comment(7)
I'm getting confused by your edits. It's starting to sound like what you actually want to do is URL encoding. Is that right?Lexicography
I'm going to guess you downvoted me because of my spoonfeeding comment ... but it's obvious that you don't know what you're asking for, so do yourself a favor and read this: joelonsoftware.com/articles/Unicode.htmlChantel
I didn't downvote you. And I don't care about the binary format of UTF-8.Superphosphate
If I didn't know what I was asking for, I wouldn't even have gotten a few correct answers. (such as Escaping/Base64)Superphosphate
You should consider going with David's answer - encodeURI()/decodeURI() are better suited to solve your problem than quote()/eval()Campfire
Jeremy, take a look at what people are commenting and update your question, currently the title and description are very wrong. Otherwise you will continue to get downvotes from others.Denna
Why the downvotes? Yes, there is some confusion in the terms here, but this site is supposed to be newbie-friendly.Faradize
K
12

You could use an ASCII-only version of Douglas Crockford's json2.js quote function. Which would look like this:

    var escapable = /[\\\"\x00-\x1f\x7f-\uffff]/g,
        meta = {    // table of character substitutions
            '\b': '\\b',
            '\t': '\\t',
            '\n': '\\n',
            '\f': '\\f',
            '\r': '\\r',
            '"' : '\\"',
            '\\': '\\\\'
        };

    function quote(string) {

        // If the string contains no control characters, no quote characters, no
        // backslash characters, and no non-ASCII characters, then we can safely
        // slap some quotes around it. Otherwise we must also replace the
        // offending characters with safe escape sequences.

        escapable.lastIndex = 0;
        return escapable.test(string) ?
            '"' + string.replace(escapable, function (a) {
                var c = meta[a];
                return typeof c === 'string' ? c :
                    '\\u' + ('0000' + a.charCodeAt(0).toString(16)).slice(-4);
            }) + '"' :
            '"' + string + '"';
    }

This will produce a valid, ASCII-only, JavaScript-quoted version of the input string.

e.g. quote("Doppelgänger!") will be "Doppelg\u00e4nger!"

To reverse the encoding, parse the result with JSON.parse (eval also works, but it executes arbitrary code, so only use it on trusted input):

    var encoded = quote("Doppelgänger!");
    var back = JSON.parse(encoded); // or: eval(encoded)
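
For a self-contained variant, the same quoting can probably be sketched on top of the built-in JSON.stringify, which already escapes quotes, backslashes and control characters, so only the remaining non-ASCII code units need handling. The names asciiQuote and asciiUnquote are made up for this illustration, and the sketch assumes an environment that provides JSON (any modern browser or Node.js):

```javascript
// Hypothetical helper: JSON-quote a string, then escape everything
// from U+007F upwards as \uXXXX so the result is printable ASCII only.
function asciiQuote(text) {
    return JSON.stringify(text).replace(/[\u007f-\uffff]/g, function (ch) {
        return '\\u' + ('0000' + ch.charCodeAt(0).toString(16)).slice(-4);
    });
}

// Hypothetical helper: the output is valid JSON, so JSON.parse reverses it.
function asciiUnquote(quoted) {
    return JSON.parse(quoted);
}

var encoded = asciiQuote("Doppelgänger!");
var back = asciiUnquote(encoded);
console.log(encoded, back === "Doppelgänger!");
```

Characters outside the BMP come out as two \uXXXX escapes (one per surrogate), which JSON.parse joins back together.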
Krone answered 7/5, 2009 at 13:2 Comment(5)
Why not use something other than eval() ? Like say, html entities?Bumboat
Mostly because you don't need to implement anything for reversion and it will be pretty fast. You could just as well use a regex-based unquote method very much like the quote function.Krone
.. or you could secure the eval based unquote with regex validation like json2.js does for complete JSON.Krone
Note that strictly speaking this is not "conversion to ASCII". You're actually implementing your own encoding scheme on top of ASCII. This may be perfectly ok for the requirements (and it seems to be for you), but it's not just a simple "conversion to ASCII".Gulf
instead of eval(encoded) you can use JSON.parse(encoded) (which is similar under the covers, but safer)Fidelafidelas
A
6

Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.

UTF-8 can represent any Unicode character; ASCII can represent only 128 of them.
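
To illustrate the point, here is a rough check for whether a string is already pure ASCII, in which case its UTF-8 bytes are the same as its ASCII bytes. The name isAscii is made up for this sketch, and the byte-length lines assume TextEncoder is available (modern browsers and Node.js):

```javascript
// Hypothetical helper: true if every UTF-16 code unit is in the ASCII range.
function isAscii(text) {
    for (var i = 0; i < text.length; i++) {
        if (text.charCodeAt(i) > 127) {
            return false;
        }
    }
    return true;
}

console.log(isAscii("Doppelganger"));  // no accented characters
console.log(isAscii("Doppelgänger"));  // contains U+00E4

// An ASCII character takes one byte in UTF-8; "ä" takes two.
console.log(new TextEncoder().encode("a").length);
console.log(new TextEncoder().encode("ä").length);
```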

Antebellum answered 7/5, 2009 at 12:20 Comment(3)
"ASCII cannot" - Of course it can! look at the accepted answer above.Superphosphate
@Jeremy: Then state your question less sneakily! "UTF-8 to ASCII conversion" sounds like a character encoding conversion problem, while what you really want is a way to represent Unicode (that's not the same as UTF-8) characters using the ASCII charset and a known character escaping syntax.Valladares
@Pat That's one of the most common misconceptions about UTF-8. UTF-8 and UTF-16 actually have variable bit lengths and either one can represent any unicode character. en.wikipedia.org/wiki/UTF-8Antebellum
P
6

As others have said, you can't convert UTF-8 text/plain into ASCII text/plain without dropping data.

You could convert UTF-8 text/plain into ASCII someother/format. For instance, HTML lets any character in UTF-8 be represented in an ASCII data file using character references.

If we continue with that example, in JavaScript, charCodeAt could help with converting a string to a representation of it using HTML character references.

Another approach is taken by URLs, and implemented in JS as encodeURIComponent.
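
A minimal sketch of both ideas, assuming decimal numeric character references are acceptable. The names toCharRefs and fromCharRefs are made up here; the sketch uses codePointAt rather than charCodeAt so that a character outside the BMP becomes a single reference, and it also escapes "&" so that decoding stays unambiguous:

```javascript
// Hypothetical helper: replace control chars, "&" and everything
// outside printable ASCII with decimal references such as &#228;
function toCharRefs(text) {
    var out = '';
    for (var ch of text) {               // for...of iterates by code point
        var cp = ch.codePointAt(0);
        if (cp < 0x20 || cp > 0x7e || ch === '&') {
            out += '&#' + cp + ';';
        } else {
            out += ch;
        }
    }
    return out;
}

// Hypothetical helper: turn decimal references back into characters.
function fromCharRefs(encoded) {
    return encoded.replace(/&#(\d+);/g, function (match, digits) {
        return String.fromCodePoint(Number(digits));
    });
}

console.log(toCharRefs("Doppelgänger & co"));

// The URL approach needs no custom code at all:
var percentEncoded = encodeURIComponent("Doppelgänger");
console.log(percentEncoded, decodeURIComponent(percentEncoded));
```

encodeURIComponent percent-encodes the UTF-8 bytes of each character, so "ä" costs six ASCII characters either way.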

Phyle answered 7/5, 2009 at 12:31 Comment(0)
N
4

It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.

Probably you want to use Punycode - this is already a standard Unicode encoding that encodes all Unicode characters into ASCII. For JavaScript code check this question

Please edit your question title and description in order to prevent others from down-voting it - do not use the term conversion, use encoding.

Nomarchy answered 23/12, 2009 at 13:38 Comment(0)
P
2

If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.

One way is to use base-64 encoding (example in C#):

string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);

If you want the string encoded as ASCII data:

// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
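
Since the question asks for JavaScript, roughly the same thing could be sketched as follows. This assumes TextEncoder, TextDecoder, btoa and atob are available (modern browsers and recent Node.js versions), and the function names utf8ToBase64 and base64ToUtf8 are made up for the example:

```javascript
// Hypothetical helper: string -> UTF-8 bytes -> base-64 (ASCII only).
function utf8ToBase64(text) {
    var bytes = new TextEncoder().encode(text);
    var binary = '';
    for (var i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]); // one char per byte
    }
    return btoa(binary);
}

// Hypothetical helper: base-64 -> UTF-8 bytes -> string.
function base64ToUtf8(base64) {
    var binary = atob(base64);
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return new TextDecoder().decode(bytes);
}

var base64 = utf8ToBase64("asdf");
console.log(base64, base64ToUtf8(base64));
```

Base-64 output is about a third larger than the UTF-8 bytes, whichever characters the text contains.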
Peccavi answered 7/5, 2009 at 12:43 Comment(1)
Great idea, though I wanted JS. Thanks.Superphosphate
G
1

Your requirement is pretty strange.

Converting UTF-8 into ASCII would lose all information about Unicode code points > 127 (i.e. everything that's not in ASCII).

You could, however try to encode your Unicode data (no matter what source encoding) in an ASCII-compatible encoding, such as UTF-7. This would mean that the data that is produced could legally be interpreted as ASCII, but it is really UTF-7.

Gulf answered 7/5, 2009 at 13:11 Comment(3)
"loose all information" - It can be lossless! look at the accepted answer above.Superphosphate
Good idea about the UTF-7 though.Superphosphate
@Jeremy: it can be lossless, but then you're no longer just "converting to ASCII", you're then converting to some encoding scheme implemented on top of the ASCII character set ...Gulf
B
1

Do you want to strip all non-ASCII chars (or replace them with '?', etc.), or to store Unicode code points in a non-Unicode system?

The first can be done in a loop that checks for values > 127 and replaces them.

If you don't want to use "any platform/framework/library", then you will need to write your own encoder. Otherwise I'd just use jQuery's .html();
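
The first option (the lossy one) might look something like this. The name stripNonAscii and the default '?' placeholder are just choices made for this sketch:

```javascript
// Hypothetical helper: replace every non-ASCII character with a placeholder.
// This is NOT reversible; the original characters are lost.
function stripNonAscii(text, placeholder) {
    if (placeholder === undefined) {
        placeholder = '?';
    }
    var out = '';
    for (var ch of text) {               // iterate by code point, so a
        if (ch.codePointAt(0) > 127) {   // surrogate pair yields one placeholder
            out += placeholder;
        } else {
            out += ch;
        }
    }
    return out;
}

console.log(stripNonAscii("Doppelgänger"));
console.log(stripNonAscii("Doppelgänger", ""));
```

Passing an empty string as the placeholder drops the characters instead of replacing them.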

Bumboat answered 7/5, 2009 at 13:14 Comment(0)
D
1
function utf8ToAscii(str) {
    /**
     * ASCII contains 128 characters (code points 0 to 127).
     *
     * In JavaScript, strings are sequences of UTF-16 code units, so a single
     * charCode cannot exceed 2^16 - 1; larger values wrap around. E.g.:
     * `String.fromCharCode(0) === String.fromCharCode(2**16)`
     *
     * @see https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary
     */
    const reg = /[\x7f-\uffff]/g; // charCode: [127, 65535]
    const replacer = (s) => {
        const charCode = s.charCodeAt(0);
        const unicode = charCode.toString(16).padStart(4, '0');
        return `\\u${unicode}`;
    };

    return str.replace(reg, replacer);
}
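
Note that the function above leaves a literal backslash alone, so an input that already contains text like \u00e4 can't be told apart from an escaped character afterwards. A possible reversible variant, with made-up names escapeNonAscii and unescapeNonAscii, also escapes the backslash itself:

```javascript
// Hypothetical helper: escape the backslash and every code unit >= 0x7f.
function escapeNonAscii(str) {
    return str.replace(/[\\\x7f-\uffff]/g, (ch) =>
        '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
}

// Hypothetical helper: every backslash in the escaped text starts a
// \uXXXX sequence, so a single regex pass is enough to reverse it.
function unescapeNonAscii(str) {
    return str.replace(/\\u([0-9a-f]{4})/g, (match, hex) =>
        String.fromCharCode(parseInt(hex, 16)));
}

const escaped = escapeNonAscii('Doppelgänger!');
console.log(escaped, unescapeNonAscii(escaped) === 'Doppelgänger!');
```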

Better way

See also Uint8Array to string in Javascript. You can use TextEncoder and Uint8Array:

function utf8ToAscii(str) {
    const enc = new TextEncoder(); // TextEncoder always encodes to UTF-8
    const u8s = enc.encode(str);

    // One character per UTF-8 byte. Bytes above 0x7f become U+0080..U+00FF,
    // so the result is a Latin-1 "binary string", not strict 7-bit ASCII.
    return Array.from(u8s).map(v => String.fromCharCode(v)).join('');
}
// For ascii to string
// new TextDecoder().decode(new Uint8Array(str.split('').map(v=>v.charCodeAt(0))))
Dose answered 17/1, 2022 at 21:55 Comment(0)
D
0

Here is a function to convert UTF-8 accented characters (àéèî, etc.) to extended-ASCII codes. If there is an accented character in the string, it is converted to an escape such as %130 (for é). On the other side, I parse the string, so I know where there is an accent and which ASCII char it is.

I used it in a javascript software to send data to a microcontroller that works in ASCII.

var convertUtf8ToAscii = function (str) {
    var asciiStr = "";
    // Reference table: Unicode code point -> extended-ASCII code
    // (the values appear to match code page 437)
    var refTable = {
        199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
        239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
    };
    for (var i = 0; i < str.length; i++) {
        var ascii = refTable[str.charCodeAt(i)];
        if (ascii !== undefined)
            asciiStr += "%" + ascii; // accented char: emit "%" plus its code
        else
            asciiStr += str[i];      // anything else passes through unchanged
    }
    return asciiStr;
};
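
The parsing on the other side isn't shown above; a rough guess at what it could look like is below. The name decodeAccents is made up, and the sketch assumes the original text never contains a literal "%" followed by three digits that happen to be in the table, since that would be misread as an accent:

```javascript
// Same table as the encoder above: Unicode code point -> extended-ASCII code.
var refTable = {
    199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
    239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};

// Build the inverse table: extended-ASCII code -> Unicode code point.
var inverseTable = {};
Object.keys(refTable).forEach(function (unicode) {
    inverseTable[refTable[unicode]] = Number(unicode);
});

// Hypothetical decoder: turn "%130" style escapes back into characters.
// Escapes that are not in the table are left untouched.
function decodeAccents(encoded) {
    return encoded.replace(/%(\d{3})/g, function (match, code) {
        return code in inverseTable ? String.fromCharCode(inverseTable[code]) : match;
    });
}

console.log(decodeAccents("%130t%130"));
```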
Diwan answered 23/4, 2015 at 13:35 Comment(0)
D
0

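If you are using Node.js, you can use the TextDecoder class to decode a buffer of bytes into a string:

// `buffer` is assumed to be a Buffer or Uint8Array holding the encoded bytes.
// Note: in the WHATWG Encoding Standard, the 'ascii' label is an alias for
// windows-1252, so bytes above 0x7f are decoded rather than rejected.
const decoder = new TextDecoder('ascii');
let text = decoder.decode(buffer);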
Delozier answered 13/8, 2024 at 18:59 Comment(0)
C
-1

An implementation of the quote() function might do what you want. My version can be found here

You can use eval() to reverse the encoding:

var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);
Campfire answered 7/5, 2009 at 13:10 Comment(2)
@Jeremy: not really - same thing, different implementation; if I'd seen fforw's answer before posting my own, I wouldn't have bothered; my version has a few more options (choice between single or double quotes, optionally doesn't escape non-ascii characters), but most likely it will be slowerCampfire
Dead link -----Dinner

© 2022 - 2025 — McMap. All rights reserved.