Conversion from iso-8859-15 (Latin9) to UTF-8?

I need to convert some strings encoded in the Latin-9 character set to UTF-8. I cannot use iconv because it is not included on my embedded system. Is there any code available for this?

Partheniaparthenocarpy answered 29/6, 2012 at 7:49 Comment(2)
Well the best way to know if it'll work is to read the requirements of iconv and the features of your embedded system and cross check. You even fail to mention what embedded system and what compiler so we can't really tell you much.Acrolein
Btw. since this question is one of the first search results: if you are working in a JavaScript environment, I would highly recommend the iso-8859-15 package (npmjs.com/package/iso-8859-15) written by Mathias Bynens.Maineetloire

Code points 1 to 127 are the same in both Latin-9 (ISO-8859-15) and UTF-8.

Code point 164 in Latin-9 is U+20AC, \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, \xc5\xb8 = 197 184 in UTF-8.

Code points 128 .. 191 (except for those listed above) in Latin-9 all map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.

Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.

This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.
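The byte counts above are enough to compute the size of the converted string before allocating anything. Here is a minimal sketch of that scan; the function names latin9_char_utf8_len() and latin9_utf8_length() are made up for this illustration.

```c
#include <stddef.h>

/* Number of UTF-8 bytes needed for one Latin-9 byte:
 * 1 for 0..127, 3 for the euro sign (164), 2 for everything else. */
static size_t latin9_char_utf8_len(unsigned char c)
{
    if (c < 128)
        return 1;
    if (c == 164)
        return 3;
    return 2;
}

/* Length in bytes (excluding the terminating NUL) that a
 * NUL-terminated Latin-9 string occupies once converted to UTF-8.
 * A NULL pointer is treated like an empty string. */
size_t latin9_utf8_length(const char *string)
{
    const unsigned char *s = (const unsigned char *)string;
    size_t n = 0;

    if (!s)
        return 0;
    while (*s)
        n += latin9_char_utf8_len(*s++);
    return n;
}
```

A caller would typically allocate latin9_utf8_length(input) + 1 bytes to leave room for the terminating NUL.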

If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to -- you're working on an embedded system, after all -- you can then do the conversion in-place, by working backwards from the end towards the start.
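The backwards in-place conversion could look roughly like the sketch below. The function name latin9_to_utf8_inplace() and its capacity parameter are hypothetical; the sketch assumes the caller owns a writable buffer with enough spare room for the longer UTF-8 result.

```c
#include <stddef.h>

/* Hypothetical helper: convert the NUL-terminated Latin-9 string in buf
 * to UTF-8 in place. capacity is the total size of buf in bytes.
 * Returns the UTF-8 length, or (size_t)-1 if buf is NULL or too small
 * (buf is then left unmodified). */
size_t latin9_to_utf8_inplace(char *buf, size_t capacity)
{
    unsigned char *const b = (unsigned char *)buf;
    size_t len, out = 0, i, o;

    if (!b)
        return (size_t)-1;

    /* Pass 1: measure the input and the converted output. */
    for (len = 0; b[len]; len++)
        out += (b[len] < 128) ? 1 : (b[len] == 164) ? 3 : 2;

    if (out + 1 > capacity)
        return (size_t)-1;

    /* Pass 2: walk backwards. The write position never drops below
     * the read position, so no unread input byte is overwritten. */
    i = len;
    o = out;
    b[o] = 0;
    while (i > 0) {
        const unsigned char c = b[--i];

        if (c < 128) {
            b[--o] = c;
        } else if (c >= 192) {
            b[--o] = (unsigned char)(c - 64);
            b[--o] = 195;
        } else switch (c) {
            case 164: b[--o] = 172; b[--o] = 130; b[--o] = 226; break;
            case 166: b[--o] = 160; b[--o] = 197; break;
            case 168: b[--o] = 161; b[--o] = 197; break;
            case 180: b[--o] = 189; b[--o] = 197; break;
            case 184: b[--o] = 190; b[--o] = 197; break;
            case 188: b[--o] = 146; b[--o] = 197; break;
            case 189: b[--o] = 147; b[--o] = 197; break;
            case 190: b[--o] = 184; b[--o] = 197; break;
            default:  b[--o] = c;   b[--o] = 194; break;
        }
    }
    return out;
}
```

Working from the end is what makes a single buffer sufficient: every Latin-9 byte expands to at least one UTF-8 byte, so the output for any prefix never ends before the input for that prefix does.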

Edit:

Here are two functions you can use for the conversion either way. These return a dynamically allocated copy you need to free() after use. They only return NULL when an error occurs (out of memory, errno == ENOMEM). If given a NULL or empty string to convert from, the functions return an empty dynamically allocated string.

In other words, you should always call free() on the pointer returned by either function when you are done with it. (free(NULL) is allowed and does nothing.)

The latin9_to_utf8() function has been verified to produce exactly the same output as iconv when the input contains no zero bytes. The function uses standard C strings, i.e. a zero byte marks the end of the string.

The utf8_to_latin9() function has been verified to produce exactly the same output as iconv when the input contains only Unicode code points that also exist in ISO-8859-15, and no zero bytes. When given arbitrary UTF-8 strings, the function also maps the eight Latin-1 code points that Latin-9 replaced to the Latin-9 characters at the same byte positions, e.g. currency sign to euro; iconv either ignores those code points or treats them as errors.

This utf8_to_latin9() behaviour means that the functions are suitable for both Latin-1 -> UTF-8 -> Latin-1 and Latin-9 -> UTF-8 -> Latin-9 round trips.

#include <stdlib.h>     /* for malloc(), realloc() and free() */
#include <string.h>     /* for memset() */
#include <errno.h>      /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-15 to UTF-8.
*/
char *latin9_to_utf8(const char *const string)
{
    char   *result;
    size_t  n = 0;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s)
            if (*s < 128) {
                s++;
                n += 1;
            } else
            if (*s == 164) {
                s++;
                n += 3;
            } else {
                s++;
                n += 2;
            }
    }

    /* Allocate n+1 (to n+7) bytes for the converted string. */
    result = malloc((n | 7) + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    /* Clear the tail of the string, setting the trailing NUL. */
    memset(result + (n | 7) - 7, 0, 8);

    if (n) {
        const unsigned char  *s = (const unsigned char *)string;
        unsigned char        *d = (unsigned char *)result;

        while (*s)
            if (*s < 128) {
                *(d++) = *(s++);
            } else
            if (*s < 192) switch (*s) {
                case 164: *(d++) = 226; *(d++) = 130; *(d++) = 172; s++; break;
                case 166: *(d++) = 197; *(d++) = 160; s++; break;
                case 168: *(d++) = 197; *(d++) = 161; s++; break;
                case 180: *(d++) = 197; *(d++) = 189; s++; break;
                case 184: *(d++) = 197; *(d++) = 190; s++; break;
                case 188: *(d++) = 197; *(d++) = 146; s++; break;
                case 189: *(d++) = 197; *(d++) = 147; s++; break;
                case 190: *(d++) = 197; *(d++) = 184; s++; break;
                default:  *(d++) = 194; *(d++) = *(s++); break;
            } else {
                *(d++) = 195;
                *(d++) = *(s++) - 64;
            }
    }

    /* Done. Remember to free() the resulting string when no longer needed. */
    return result;
}

/* Create a dynamically allocated copy of string,
 * changing the encoding from UTF-8 to ISO-8859-15.
 * Unsupported code points are ignored.
*/
char *utf8_to_latin9(const char *const string)
{
    size_t         size = 0;
    size_t         used = 0;
    unsigned char *result = NULL;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s) {

            if (used >= size) {
                void *const old = result;

                size = (used | 255) + 257;
                result = realloc(result, size);
                if (!result) {
                    if (old)
                        free(old);
                    errno = ENOMEM;
                    return NULL;
                }
            }

            if (*s < 128) {
                result[used++] = *(s++);
                continue;

            } else
            if (s[0] == 226 && s[1] == 130 && s[2] == 172) {
                result[used++] = 164;
                s += 3;
                continue;

            } else
            if (s[0] == 194 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1];
                s += 2;
                continue;

            } else
            if (s[0] == 195 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1] + 64;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 160) {
                result[used++] = 166;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 161) {
                result[used++] = 168;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 189) {
                result[used++] = 180;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 190) {
                result[used++] = 184;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 146) {
                result[used++] = 188;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 147) {
                result[used++] = 189;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 184) {
                result[used++] = 190;
                s += 2;
                continue;

            }

            if (s[0] >= 192 && s[0] < 224 &&
                s[1] >= 128 && s[1] < 192) {
                s += 2;
                continue;
            } else
            if (s[0] >= 224 && s[0] < 240 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192) {
                s += 3;
                continue;
            } else
            if (s[0] >= 240 && s[0] < 248 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192) {
                s += 4;
                continue;
            } else
            if (s[0] >= 248 && s[0] < 252 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192) {
                s += 5;
                continue;
            } else
            if (s[0] >= 252 && s[0] < 254 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192 &&
                s[5] >= 128 && s[5] < 192) {
                s += 6;
                continue;
            }

            s++;
        }
    }

    {
        void *const old = result;

        size = (used | 7) + 1;

        result = realloc(result, size);
        if (!result) {
            if (old)
                free(old);
            errno = ENOMEM;
            return NULL;
        }

        memset(result + used, 0, size - used);
    }

    return (char *)result;
}

While iconv() is the correct solution for character set conversions in general, the two functions above can be useful in an embedded or otherwise constrained environment.

Dispassionate answered 29/6, 2012 at 11:7 Comment(7)
Thanks a lot for these two really useful functions!! BTW, do you really mean they can be used for ISO-8859-1 too? For example, what will happen if you try to convert from 8859-1 to UTF8 characters that are different in 8859-1 and 8859-15, like "¦" (0xA6 in 8859-1) or "¼" (0xBC in 8859-1)?Kegler
@cesss: No, I did not. I wrote that the functions work for both Latin1-UTF8-Latin1 and Latin9-UTF8-Latin9 round-trips. It means that using these functions to convert a Latin1/Latin9 string to UTF8 and back, always gives the original Latin1/Latin9 string. The conversion itself is only correct for Latin9. If you want to change that, you need to edit the code of both functions for Latin1 code points. (I recommend you rename the edited copies to latin1_to_utf8() and utf8_to_latin1(), to avoid confusion.)Dispassionate
Thanks a lot! I'll look at the tables from unicode.org and use them in my version of your functions.Kegler
@cesss: Latin 1 (ISO 8859-1) and Unicode share code points 0-127 and 160-255; code points 128-159 being undefined in Latin 1. In other words, to get the Latin1 versions of the above functions, you only need to remove code.Dispassionate
True, just realised about this! However, there's one thing that I don't understand: why do you malloc (n | 7) + 1 bytes instead of n + 1 bytes? Is it only because of memset alignment requirements, or maybe UTF8 strings have some standard requirement of having at least 8 bytes allocated?Kegler
@cesss: No, it's only a habit of mine. (If you know your string buffers are padded with nuls to an eight-byte boundary, you can do some nifty optimizations, that's all.) You can use n+1 in the malloc, if you replace the memset() with a result[n] = '\0'; in the first function. In the second function, replace size = (used | 7) + 1; with size = used + 1; (and optionally the memset() with result[used] = '\0';).Dispassionate
Thanks for the clarification. At one point I thought the reason could be that UTF-8 parsing can require to access several bytes beyond the current position, but then I considered that if ( expr1 && expr2 && expr3 && ... && exprN ) stops evaluating expressions as soon as one of them fails, so it's safe for exprN to access a char past the string end if a previous expression detected the string terminator. So, thanks for clarifying it's a matter of personal habit :)Kegler
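As a rough illustration of the Latin-1 simplification discussed in these comments, the sketch below drops the eight Latin-9 special cases and uses the plain n + 1 allocation. It borrows the latin1_to_utf8() name suggested above, but it is an independent sketch and not the answer author's code. It assumes bytes 128..159 should be passed through as U+0080..U+009F.

```c
#include <stdlib.h>     /* for malloc() */
#include <errno.h>      /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-1 to UTF-8.
 * Returns NULL only if memory allocation fails. */
char *latin1_to_utf8(const char *const string)
{
    const unsigned char *s = (const unsigned char *)string;
    unsigned char *result, *d;
    size_t n = 0;

    /* Bytes below 128 need one UTF-8 byte, all others need two. */
    if (s)
        for (; *s; s++)
            n += (*s < 128) ? 1 : 2;

    result = malloc(n + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    d = result;
    if (string)
        for (s = (const unsigned char *)string; *s; s++) {
            if (*s < 128) {
                *d++ = *s;
            } else if (*s < 192) {
                *d++ = 194;             /* U+0080..U+00BF */
                *d++ = *s;
            } else {
                *d++ = 195;             /* U+00C0..U+00FF */
                *d++ = (unsigned char)(*s - 64);
            }
        }
    *d = '\0';

    return (char *)result;
}
```

As with the functions in the answer, the caller is expected to free() the returned string.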

It should be relatively easy to create a conversion table that maps the Latin-9 codes 128-255 to UTF-8 byte sequences. You can even use iconv to generate it. Alternatively, you can create a file containing the Latin-9 codes 128-255 and convert it to UTF-8 using an appropriate text editor. Then you can use this data to build the conversion table.
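A table-driven converter along these lines might look like the sketch below. Instead of pasting in a table generated by iconv or an editor, it fills the table at run time from the eight positions where Latin-9 differs from Latin-1, then applies the generic UTF-8 encoding rules. All names here (latin9_high, latin9_init_table(), latin9_byte_to_utf8()) are invented for this illustration.

```c
#include <stddef.h>

/* Unicode code points for Latin-9 bytes 128..255; filled on first use. */
static unsigned short latin9_high[128];
static int latin9_high_ready = 0;

static void latin9_init_table(void)
{
    /* The eight positions where Latin-9 differs from Latin-1. */
    static const struct { unsigned char byte; unsigned short cp; } diff[] = {
        {164, 0x20AC}, {166, 0x0160}, {168, 0x0161}, {180, 0x017D},
        {184, 0x017E}, {188, 0x0152}, {189, 0x0153}, {190, 0x0178},
    };
    size_t i;

    /* Start from the identity mapping, then patch the exceptions. */
    for (i = 0; i < 128; i++)
        latin9_high[i] = (unsigned short)(128 + i);
    for (i = 0; i < sizeof diff / sizeof diff[0]; i++)
        latin9_high[diff[i].byte - 128] = diff[i].cp;
    latin9_high_ready = 1;
}

/* Encode one Latin-9 byte as UTF-8 into out (room for 3 bytes needed).
 * Returns the number of bytes written (1, 2 or 3). */
size_t latin9_byte_to_utf8(unsigned char c, unsigned char *out)
{
    unsigned int cp;

    if (c < 128) {
        out[0] = c;
        return 1;
    }
    if (!latin9_high_ready)
        latin9_init_table();
    cp = latin9_high[c - 128];

    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

On a system where RAM is scarce, the same data could instead be stored as a constant table in read-only memory, at the cost of writing out all 128 entries by hand or with a generator.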

Demirelief answered 29/6, 2012 at 8:18 Comment(0)
