How do I translate 8bit characters into 7bit characters? (i.e. Ü to U)
Y

15

26

I'm looking for pseudocode, or sample code, to convert high-bit ASCII characters (such as Ü, which is extended ASCII 154) into their plain equivalents (U, which is ASCII 85).

My initial guess is that, since only about 25 extended characters resemble 7-bit ASCII characters, a translation array would have to be used.

Let me know if you can think of anything else.

Yuyuan answered 26/9, 2008 at 16:5 Comment(1)
See sinelaw's answer below for a really great solution in .NET. - Melbamelborn
W
5

As unexist suggested, the "iconv" function exists to handle all of these awkward conversions for you. It is available in almost every programming language and has a special option that approximates characters missing from the target set.

Use iconv to simply convert your input UTF-8 string to 7-bit ASCII.

Otherwise you will always end up hitting corner cases: 8-bit input that uses a different code page with a different set of characters (so your conversion table does not work at all), one last accented character you forgot to map (you mapped all the grave and acute accents but forgot the Czech caron or the Nordic '°'), and so on.

Of course, if you want to apply the solution to a small, specific problem (such as making file-system-friendly filenames for your music collection), then look-up arrays are the way to go: either an array that maps each code number above 128 to an approximation below 128, as proposed by JeeBee, or the source/target pairs proposed by vIceBerg, depending on which substitution functions your language of choice already provides. Such a table is quick to hack together and quick to check for missing elements.

Whoopee answered 26/9, 2008 at 16:41 Comment(0)
M
41

For .NET users the article in CodeProject (thanks to GvS's tip) does indeed answer the question more correctly than any other I've seen so far.

However the code in that article (in solution #1) is cumbersome. Here's a compact version:

// Based on http://www.codeproject.com/Articles/13503/Stripping-Accents-from-Latin-Characters-A-Foray-in
// Requires: using System.Linq; using System.Text;
private static string LatinToAscii(string inString)
{
    // Decompose each character (e.g. Ü becomes U plus a combining diaeresis),
    // then keep only the characters in the 7-bit ASCII range.
    return new string(inString.Normalize(NormalizationForm.FormKD)
                              .Where(x => x < 128)
                              .ToArray());
}

To expand a bit on the answer, this method uses String.Normalize which:

Returns a new string whose textual value is the same as this string, but whose binary representation is in the specified Unicode normalization form.

Specifically in this case we use the NormalizationForm FormKD, described in those same MSDN docs as such:

FormKD - Indicates that a Unicode string is normalized using full compatibility decomposition.

For more information about unicode normalization forms, see Unicode Annex #15.
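
For readers outside .NET, the same normalize-then-filter idea can be sketched in Python with the standard `unicodedata` module. This is an illustrative port, not the code from the article; the function name `latin_to_ascii` is made up for the example.

```python
import unicodedata


def latin_to_ascii(text: str) -> str:
    """Decompose with NFKD, then keep only code points below 128."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


print(latin_to_ascii("Übergröße, café"))  # Ubergroe, cafe
```

Note that characters with no decomposition, such as ß or ø, are dropped silently rather than approximated, so this approach may still need a small table for those.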

Melisamelisande answered 5/4, 2012 at 22:29 Comment(0)
H
17

Most languages have a standard way to replace accented characters with standard ASCII, but it depends on the language, and it often involves replacing a single accented character with two ASCII ones. e.g. in German ü becomes ue. So if you want to handle natural languages properly it's a lot more complicated than you think it is.
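
A rough sketch of what a language-specific replacement might look like for German, in Python. The table `GERMAN` and the function name are hypothetical, the mapping covers only the umlauts and ß, and treating all-uppercase words specially is an assumption of this sketch rather than a complete rule.

```python
# Hypothetical German-only table: one accented character may become two letters.
GERMAN = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def transliterate_german(text: str) -> str:
    """Replace umlauts and ß word by word, keeping all-caps words all-caps."""
    result = []
    for word in text.split(" "):
        replaced = "".join(GERMAN.get(ch, ch) for ch in word)
        # In an all-uppercase word, "Ue" should read "UE".
        result.append(replaced.upper() if word.isupper() else replaced)
    return " ".join(result)


print(transliterate_german("Über GRÜN Größe"))
```

Other languages would need their own tables, which is why a single universal mapping tends to be wrong for someone.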

Hoopoe answered 26/9, 2008 at 16:33 Comment(0)
D
11

Is converting Ü to U really what you would like to do? I don't know about other languages but in German Ü would become Ue, ö would become oe, etc.

Deliberative answered 26/9, 2008 at 16:43 Comment(2)
not even that simple, Ü would become UE if used in an all-uppercase word - Snafu
There are also certain scenarios where a 7-bit character set must be used, such as SMTP Content-Transfer-Encoding - en.wikipedia.org/wiki/MIME#Content-Transfer-Encoding. As a side note, if you're viewing this post because of SMTP issues, look into the UUEncoding features of your SMTP client/library. - Notability
B
6

I think you just can't.

I usually do something like this:

AccentString = 'ÀÂÄÉÈÊ[and all the other]'
ConvertString = 'AAAEEE[and all the other]'

Look up each character in AccentString and replace it with the character at the same index in ConvertString.
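
In Python, this paired-string approach maps directly onto `str.maketrans` and `str.translate`. The sketch below is only an illustration: the two strings are a small sample chosen for the example, not a complete list, and the names are made up.

```python
# Two parallel strings: the character at index i in ACCENTED maps to PLAIN[i].
ACCENTED = "ÀÂÄÉÈÊÜàâäéèêü"
PLAIN = "AAAEEEUaaaeeeu"
TABLE = str.maketrans(ACCENTED, PLAIN)


def strip_accents(text: str) -> str:
    """Replace every character found in ACCENTED; leave everything else alone."""
    return text.translate(TABLE)


print(strip_accents("Élève Ü"))
```

Characters missing from the table pass through unchanged, which makes omissions easy to spot in the output.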

HTH

Boxberry answered 26/9, 2008 at 16:8 Comment(0)
B
6

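Code page 1251 (Windows Cyrillic) is a single-byte encoding that contains no accented Latin letters. When .NET encodes text to it, the default best-fit fallback typically maps an accented Latin character to its unaccented base letter, so decoding the resulting bytes as ASCII leaves only basic characters. The result depends on the runtime's best-fit behavior, so check the output on your platform.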

public string RemoveDiacritics(string text)
{

  return System.Text.Encoding.ASCII.GetString(System.Text.Encoding.GetEncoding(1251).GetBytes(text));

}

From : http://www.clt-services.com/blog/post/Enlever-les-accents-dans-une-chaine-(proprement).aspx

Banneret answered 29/9, 2008 at 9:51 Comment(0)
B
1

I think you have nailed it: a 128-byte array, indexed by char&127, containing the matching 7-bit character for each 8-bit character.
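
A minimal sketch of such a 128-entry table in Python, assuming the input bytes are ISO 8859-1 (an assumption; the question does not fix the encoding). Only a handful of entries are filled in for illustration, and unmapped bytes fall back to "?".

```python
# 128 slots for byte values 128..255, indexed by (byte & 127).
# Assumes ISO 8859-1 input; every unmapped slot holds "?".
UPPER_HALF = bytearray(b"?" * 128)
for source, target in {
    0xC4: "A", 0xD6: "O", 0xDC: "U",  # Ä Ö Ü
    0xE4: "a", 0xE9: "e", 0xF6: "o", 0xFC: "u",  # ä é ö ü
}.items():
    UPPER_HALF[source & 127] = ord(target)


def to_7bit(data: bytes) -> bytes:
    """Pass 7-bit bytes through; look up 8-bit bytes in the table."""
    return bytes(b if b < 128 else UPPER_HALF[b & 127] for b in data)


print(to_7bit("Köln, Überlingen".encode("latin-1")))
```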

Baking answered 26/9, 2008 at 16:8 Comment(0)
E
1

Hm, why not just change the encoding of the string with iconv?

Ephedrine answered 26/9, 2008 at 16:15 Comment(0)
B
1

It really depends on the nature of your source strings. If you know the string's encoding, and you know that it's an 8-bit encoding — for example, ISO Latin 1 or similar — then a simple static array is sufficient:

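/* Index by ISO 8859-1 byte value: a constant such as 'é' may be negative
   (signed char) or multi-byte (UTF-8 source file), so it is unsafe as an index. */
static const char xlate[256] = { ..., [0xE9] = 'e' /* é */, ..., [0xDC] = 'U' /* Ü */, ... };
...
new_c = xlate[(unsigned char)old_c];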

On the other hand, if you have a different encoding, or if you're using UTF-8 encoded strings, you will probably find the functions in the ICU library very helpful.

Bonbon answered 26/9, 2008 at 16:24 Comment(0)
M
1

The upper 128 characters do not have standard meanings. They can take different interpretations (code pages) depending on the user's language.

For example, see Portuguese versus French Canadian

Unless you know the code page, your "translation" will be wrong sometimes.

If you are going to assume a certain code page (e.g. the original IBM code page) then a translation array will work, but for true international users, it will be wrong a lot.
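
To make the code page dependence concrete, here is a small Python sketch that decodes the same byte value under several of the codecs shipped with Python. The helper name `decode_with` is made up for the example.

```python
def decode_with(value: int, codecs: list[str]) -> dict[str, str]:
    """Decode one byte value under each named codec."""
    raw = bytes([value])
    return {name: raw.decode(name) for name in codecs}


# 0xDC is Ü in ISO 8859-1 but a different character in other code pages.
for name, ch in decode_with(0xDC, ["latin-1", "cp437", "cp1251"]).items():
    print(f"{name:8} {ch!r}")

# 154, the value from the question, is Ü under the original IBM PC code page.
print(decode_with(154, ["cp437"]))

# cp860 (Portuguese) and cp863 (French Canadian) can be compared the same way.
print(decode_with(0x84, ["cp860", "cp863"]))
```

A translation array built for one of these code pages would silently produce wrong letters for text written in another.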

This is one reason why unicode is favored over the older system of code pages.

Strictly speaking, ASCII is only 7 bits.

Midi answered 26/9, 2008 at 16:36 Comment(0)
G
1

There is an article on CodeProject that looks good.

The conversion using code page 1251 also caught my interest (see the other answer).

I don't like conversion tables: the number of characters in Unicode is so large that you can easily miss one.

Groundage answered 8/10, 2008 at 16:3 Comment(0)
R
0

I think you already hit the nail on the head. Given your limited domain, a conversion array or hash is your best bet. There is no sense in creating anything complex to try to do it automagically.

Rush answered 26/9, 2008 at 16:7 Comment(0)
W
0

A lookup array is probably the simplest and fastest way to accomplish this. It is also one way to convert, say, ASCII to EBCDIC.
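
As an illustration of the lookup-array idea applied to EBCDIC, the Python sketch below builds a 128-entry table once from Python's built-in `cp037` codec (one EBCDIC variant, chosen here as an assumption) and then converts by indexing. The names are made up for the example.

```python
# Entry i holds the cp037 (EBCDIC) byte for ASCII code i.
ASCII_TO_EBCDIC = bytes(range(128)).decode("ascii").encode("cp037")


def ascii_to_ebcdic(data: bytes) -> bytes:
    """Convert 7-bit ASCII bytes by table lookup (bytes >= 128 raise IndexError)."""
    return bytes(ASCII_TO_EBCDIC[b] for b in data)


print(ascii_to_ebcdic(b"Hello").hex())
```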

White answered 26/9, 2008 at 16:13 Comment(0)
R
0

I use this function to escape accented characters in a variable before passing it to a SOAP function from VB6:

Function FixAccents(ByVal Valor As String) As String

    Dim x As Long
    Valor = Replace(Valor, Chr$(38), "&#" & 38 & ";")

    For x = 127 To 255
        Valor = Replace(Valor, Chr$(x), "&#" & x & ";")
    Next

    FixAccents = Valor

End Function

And inside the SOAP function I do this (for the variable FileName):

FileName = HttpContext.Current.Server.HtmlDecode(FileName)
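
The same escape-then-decode round trip can be sketched in Python using only the standard library. This is an analogue rather than a port: it works on Unicode code points, whereas `Chr$` in VB6 uses the system ANSI code page, and the function names are made up for the example.

```python
import html


def to_char_refs(text: str) -> str:
    """Escape '&' first, then turn every non-ASCII character into &#N;."""
    escaped = text.replace("&", "&#38;")
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_char_refs(text: str) -> str:
    """Reverse the escaping on the receiving side."""
    return html.unescape(text)


wire = to_char_refs("Ü & é.txt")
print(wire)
print(from_char_refs(wire))
```

The escaped string is pure 7-bit ASCII, so it can travel through channels that do not preserve 8-bit characters.
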
Repugn answered 7/6, 2009 at 17:7 Comment(0)
M
-1

Try the uni2ascii program.

Mn answered 9/3, 2010 at 5:32 Comment(0)
