C++ ShiftJIS to UTF8 conversion

I need to convert double-byte characters, in my case Shift-JIS, into something easier to handle, preferably with standard C++.

The following question ended up without a workaround: Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized

So does anyone have a suggestion or a reference on how to handle this conversion with standard C++?

Volotta answered 16/10, 2015 at 7:49 Comment(5)
"Better to handle" for what exactly? Only one direction? (ShiftJIS => something else, but not something else => ShiftJIS)Harness
Sry, for displaying in UTF-8 for example. In only one direction. That would be nice to know.Volotta
@gabriel - for what platform/OS? It will be difficult w/o ICU on any platform.Groovy
Basically I just wanted a mini function like the one in the accepted answer that works, but if I can't get that until the bounty ends I'll prolly just use ICUCapon
Sorry all, I don't really monitor such old questions and didn't notice at all that people still want to use this. Couldn't find the original generator anymore, but will edit a new one in now...Harness

Normally I would recommend using the ICU library, but for this alone, using it is way too much overhead.

First, a conversion function that takes a std::string with ShiftJIS data and returns a std::string with UTF-8 (note 2019: no idea anymore if it works :))

It uses a uint8_t array of 25088 elements (25088 bytes), which appears as convTable in the code. The function does not fill this array; you have to load it first, e.g. from a file. The second code part below is a program that can generate the file.
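
Loading the table could look roughly like this. It is only a sketch: the helper name loadConvTable and the constant kConvTableSize are made up for illustration, and it returns a std::vector instead of a plain array, so convTable would have to be declared accordingly.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// 2 bytes per entry, 256 single-byte entries plus 3 groups of 256*16 entries.
constexpr std::size_t kConvTableSize = 2 * (256 + 3 * 256 * 16);
static_assert(kConvTableSize == 25088, "size stated in the answer");

// Hypothetical helper (not part of the original answer): reads the whole
// file in binary mode and rejects anything that is not exactly 25088 bytes.
std::vector<std::uint8_t> loadConvTable(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<std::uint8_t> table((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    if (table.size() != kConvTableSize)
        throw std::runtime_error("unexpected table size in " + path);
    return table;
}
```

Opening the file in binary mode matters on Windows, where text mode would translate some byte values and corrupt the table.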

The conversion function doesn't check if the input is valid ShiftJIS data.

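#include <cstddef>
#include <cstdint>
#include <string>

// The 25088-byte table described above; it must be defined and filled elsewhere
// before sj2utf8 is called.
extern uint8_t convTable[25088];

std::string sj2utf8(const std::string &input)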
{
    std::string output(3 * input.length(), ' '); //ShiftJis won't give 4byte UTF8, so max. 3 byte per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100; //these are two-byte shiftjis
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0; //this is one byte shiftjis

        //determining real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        //unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        //converting to UTF8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); //remove the unnecessary bytes
    return output;
}
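
Since the function above doesn't validate its input, a structural pre-check can be run first. The following is a minimal sketch under assumptions: the function names are made up, the byte ranges are the usual Shift-JIS ones (single bytes 0x00..0x7F and 0xA1..0xDF, lead bytes 0x81..0x9F and 0xE0..0xEF, trail bytes 0x40..0x7E and 0x80..0xFC), and it only checks the byte structure, not whether each code is actually assigned in the mapping file.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// ASCII range and half-width katakana are single-byte characters.
inline bool isSingleByte(std::uint8_t b)
{
    return b <= 0x7F || (b >= 0xA1 && b <= 0xDF);
}

// Lead bytes covered by the three table groups 0x8???, 0x9??? and 0xE???.
inline bool isLeadByte(std::uint8_t b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

inline bool isTrailByte(std::uint8_t b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Hypothetical pre-check: true if the byte sequence is structurally
// plausible Shift-JIS (every lead byte is followed by a valid trail byte).
bool looksLikeShiftJis(const std::string &input)
{
    for (std::size_t i = 0; i < input.length(); ++i)
    {
        const std::uint8_t b = static_cast<std::uint8_t>(input[i]);
        if (isSingleByte(b)) continue;
        if (!isLeadByte(b)) return false;            // 0x80, 0xA0, 0xF0..0xFF
        if (i + 1 >= input.length()) return false;   // lead byte at end of input
        if (!isTrailByte(static_cast<std::uint8_t>(input[i + 1]))) return false;
        ++i;                                         // skip the trail byte
    }
    return true;
}
```

A caller could then decide to reject the input or fall back to something else before handing it to sj2utf8.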

About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So... either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:

First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . I can't paste this here because of the length, so we have to hope at least unicode.org stays online.

Then run this program with the text file above piped/redirected to stdin, and redirect the binary output to a new file. (Needs a binary-safe shell, no idea if it works on Windows.)

#include <iostream>
#include <string>
#include <cstdint>
#include <cstdio>

using namespace std;

// pipe SHIFTJIS.txt in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; //same bigendian array as in converting function
    mapping = new uint8_t[2*(256 + 3*256*16)];

    //initializing with space for invalid value, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) //pipe the file SHIFTJIS to stdin
    {
        if(s.substr(0, 2) != "0x") continue; //comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
        {
            puts("Error hex reading");
            continue;
        }

        size_t offset; //array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            puts("Error input values");
            continue;
        }

        offset = 2 * (offset + (shiftJisValue & 0xfff));
        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            puts("Error mapping not 1:1");
            continue;
        }

        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}

Notes:
The table holds two-byte big-endian raw Unicode values (more than two bytes are not necessary here).
The first 256 entries (512 bytes) are for the single-byte ShiftJIS chars, with value 0x20 for invalid ones.
Then 3 * 256*16 entries follow for the groups 0x8???, 0x9??? and 0xE???.
Total: 2 * (256 + 3 * 256*16) = 25088 bytes
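
The layout in the notes can be written down as a small offset calculation. This is a sketch that mirrors the arithmetic of the two programs above; the names kSingleByteEntries, kGroupEntries, kTableBytes and tableByteOffset are invented for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSingleByteEntries = 256;
constexpr std::size_t kGroupEntries = 256 * 16;  // 0x000..0xFFF inside one group
constexpr std::size_t kTableBytes = 2 * (kSingleByteEntries + 3 * kGroupEntries);
static_assert(kTableBytes == 25088, "matches the size given in the notes");

// Hypothetical helper: byte offset of the big-endian entry for a ShiftJIS code.
// Assumes the code is single-byte or lies in one of the groups 0x8???, 0x9???,
// 0xE???; anything else is not checked here.
constexpr std::size_t tableByteOffset(std::uint16_t sjis)
{
    return 2 * ((sjis >> 8) == 0     ? static_cast<std::size_t>(sjis)
              : (sjis >> 12) == 0x8  ? kSingleByteEntries + (sjis & 0xFFF)
              : (sjis >> 12) == 0x9  ? kSingleByteEntries + kGroupEntries + (sjis & 0xFFF)
                                     : kSingleByteEntries + 2 * kGroupEntries + (sjis & 0xFFF));
}
```

For example, the code 0x8341 lands at entry 256 + 0x341 = 1089, i.e. at byte offset 2178, which is the same position sj2utf8 computes as (0x100 + 0x300 + 0x41) << 1.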

Harness answered 16/10, 2015 at 12:48 Comment(24)
Thank you Sir for sharing your code! I didn't manage to get it to work yet. I passed in a std::string which should be a decent japanese text and the output is always 0. Any suggestions?Volotta
@Sascha Had an error, but it should've affected only ASCII text, not Japanese symbols. Updated the code in the answer, it should work as it is. Are you sure you loaded the array properly? Where does the input come from, how do you check/print the output?Harness
I read the entire file with ifstream so I have a vector with 25088 values looking similar to what I saw in the binary editor. I simply pass a string with values like "\x83\x41" or sth similar and the output sj2utf8 returns is always 0.Volotta
btw the value from convTable is right. It doesn't get added to the output. I try to understand what is going wrong but I don't get all steps in your method by 100% yet.Volotta
@Sascha Currently I have no idea what the problem could be; for me it works without any problems. ... What part of the code you don't understand? ... Does it work for some pure ASCII string like "hello" for you?Harness
Yes, "hello" works. In my case, the decimal value you get from convTable is something like 12450, so the following conditions don't have an effect (highest value possible is 2048). That's the area I don't get the purpose of. But this might be the fault of my inexperienceVolotta
@Sascha :o I don't know what happened here, but the last of the three parts should only be else, not else if... Sorry ... correcting the post now...Harness
Yay I also thought it's supposed to be else. But isn't it necessary to return a wide string to get a decent displayed unicode character? I'm sorry but I still don't get the way you pass the characters in because 'value' would already represent the right charVolotta
@Sascha Did you read "wide string" on a Microsoft site? Keep in mind MS writing about Unicode is very inaccurate. To put it very very short: Within Unicode, there are several encodings (=how text is represented in byte values): UTF32BE, UTF32LE, UTF16 BE/LE, UTF8 and some unimportant others. Bytes in each of those are convertible to all others, but the encodings differ in complexity (=processing speed) and memory consumption. E.g. UTF32LE is the easiest/fastest one, but can use up to 4 times the memory of UTF8 for the same text. ... When Microsoft talks about wide strings, they mean UTF16LE.Harness
Part 2/2: Currently my function converts ShiftJIS to UTF8 data in a std::string (while certain variable types are often used for certain encodings, they are independent of each other.). If necessary, it's pretty easy to change it to eg. UTF16LE or anything. ... What you need for displaying the text depends on your GUI framework. Some use UTF8, others UTF16LE (BE and UTF32 are rarely used in practice)... what GUI framework are you using (and what type of variable it expects in the code)?Harness
I haven't decided on a specific GUI framework yet, to stay independent. Therefore I'm working with standard C++. Probably I'm going to use Qt, which can also convert character sets. I mentioned wide string because that was the only option until now which was displayed in Visual Studio while debugging. I'm still irritated because of the returned output when I pass in a string like "\x83\x41" and get "\xe3\x82\xa2" as output while I expect a hex value with x30A2.. is this a decent output to work with later?Volotta
a) Why are you irritated / why do you expect 0x30A2? Either you're too fixated on UTF16 or you think ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/… has the real values? ... In reality, the file has UTF32BE values, and because they're all smaller than 0x10000, they are valid UTF16BE values too (except one uses 4 bytes and one 2). For >=0x10000, UTF32 and UTF16 differ in values too. ... For UTF8, the only values equal to UTF32 are 0-0x7f (just 1 byte instead of 4), everything else is different.Harness
b) Don't trust QT with character conversions. ShiftJIS->Unicode is pretty easy, but in general, charset/encoding stuff is a major pain. QT doesn't have some important features ... for real stuff, see the mentioned ICU lib (only charset stuff, and the download has ~25 MB zipped, for a good reason)Harness
c) As long as you don't have a GUI, you're using the console, so a note to the Windows console: It can't do UTF8. There are many sites in the internet, each describing one of 4-5 workarounds, but all of them have serious hidden problems (possible data loss included). Don't try it, else your program isn't bug-free anymore. (There are bug reports and discussions on MS sites, but the reaction is "the problem affects the whole OS; too much work to fix"...)Harness
Yeah I already knew about the console thing. I was just focusing on debug output but I think I got it now. Thank you, I appreciate your help. Do you have any reference to enlarge on this topic?Volotta
@deviantfan, the file link "filedropper.com/shiftjis" is not working. Could you please provide a valid link. Thanks. Appreciate your help.Montymonument
I'll echo what Rak said - could you please fix the dead link? I'd love to be able to use this.Washery
I echo the exact same feeling - gonna put a 50 rep bounty on thisCapon
Sorry all, I don't really monitor such old questions and didn't notice at all that people still want to use this. Couldn't find the original generator anymore, but will edit a new one in now...Harness
Thanks for fixing the answer ! I'm checking if this works, will confirm the bounty if it does (upvoted anyway so if I forget you'll get it in 2 days)Capon
It segfaults on ASCII strings (indexInput is not incremented after encountering a single-byte character)Capon
@GabrielRavier So my small improvements were too much for me to handle without testing :/ SorryHarness
Now it works ! Thanks for coming back to this super old question.Capon
This is perfect! Thank you so much! :DWashery
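
The "\x83\x41" versus 0x30A2 confusion in the comments above can be checked by hand: 0x30A2 is the Unicode code point, and "\xe3\x82\xa2" is that same code point encoded as UTF-8. A small sketch (the function names toUtf8 and toUtf16le are made up here; the UTF-8 branch uses the same bit layout as sj2utf8 and only handles code points below 0x10000, ignoring surrogates):

```cpp
#include <cstdint>
#include <string>

// Encodes a code point below 0x10000 as UTF-8 (1 to 3 bytes).
std::string toUtf8(std::uint16_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes the same code point as UTF-16LE: low byte first, then high byte.
std::string toUtf16le(std::uint16_t cp)
{
    std::string out;
    out += static_cast<char>(cp & 0xFF);
    out += static_cast<char>(cp >> 8);
    return out;
}
```

With these, toUtf8(0x30A2) yields the three bytes E3 82 A2 and toUtf16le(0x30A2) yields A2 30, so both outputs describe the same character, just in different encodings.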

For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here: https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h

Also, here's a very simple function to convert basic Shift-JIS chars to ASCII:

#include <cstdint>
#include <cstring>

const char SJIS_REPLACEMENT_TABLE[] = 
    " ,.,..:;?!\"*'`*^"
    "-_????????*---/\\"
    "~||--''\"\"()()[]{"
    "}<><>[][][]+-+X?"
    "-==<><>????*'\"CY"
    "$c&%#&*@S*******"
    "*******T><^_'='";

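//Convert Shift-JIS characters to ASCII equivalent (in place).
//Note: the loop steps two bytes at a time, so the input is assumed to consist of two-byte characters only.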
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);
    
    for (i = 0; i < len; i += 2)
    {
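        // Cast to uint8_t first: plain char may be signed, and bytes >= 0x80 would sign-extend
        ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1];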

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }

        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }

        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }

        if (ch == 0x0000)
        {
            //End of the string
            bData[j] = 0;
            return;
        }

        // Character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
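
For input that mixes single-byte and two-byte characters, a variant along these lines might be easier to use. It is only a sketch, not the code from the linked repository: the name fullwidthToAscii is made up, it returns a new std::string instead of converting in place, and it covers only the full-width digits and Latin letters (no replacement table).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical variant: converts full-width digits and Latin letters to ASCII,
// copies single-byte characters and unknown two-byte characters unchanged.
std::string fullwidthToAscii(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const std::uint8_t lead = static_cast<std::uint8_t>(in[i]);
        const bool isLead = (lead >= 0x81 && lead <= 0x9F) || (lead >= 0xE0 && lead <= 0xEF);
        if (!isLead || i + 1 >= in.size())
        {
            out += in[i];  // single-byte character (or truncated input): copy as-is
            continue;
        }
        const std::uint8_t trail = static_cast<std::uint8_t>(in[++i]);
        const std::uint16_t ch = static_cast<std::uint16_t>((lead << 8) | trail);

        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
            out += static_cast<char>(trail - 0x1F);  // 'A'..'Z' and '0'..'9'
        else if (ch >= 0x8281 && ch <= 0x829A)
            out += static_cast<char>(trail - 0x20);  // 'a'..'z'
        else
        {
            out += static_cast<char>(lead);          // not handled: keep both bytes
            out += static_cast<char>(trail);
        }
    }
    return out;
}
```

The offsets 0x1F and 0x20 are the same ones used in sjis2ascii above; only the iteration and the unsigned byte handling differ.
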
Champlin answered 15/9, 2020 at 15:2 Comment(0)
