Python encoded utf-8 string \xc4\x91 in Java

How do I get a proper Java string from the Python-created string 'Oslobo\xc4\x91enja'? How do I decode it? I think I've tried everything and looked everywhere; I've been stuck on this problem for 2 days. Please help!

Here is the Python web service method that returns the JSON, which the Java client parses with Google Gson.

def list_of_suggestions(entry):
   """Returns list of suggestions from auto-complete search"""
   input = entry.encode('utf-8')
   json_result = { 'suggestions': [] }
   resp = urllib2.urlopen('https://maps.googleapis.com/maps/api/place/autocomplete/json?input=' + urllib2.quote(input) + '&location=45.268605,19.852924&radius=3000&components=country:rs&sensor=false&key=blahblahblahblah')
   # make json object from response
   json_resp = json.loads(resp.read())

   if json_resp['status'] == u'OK':
     for pred in json_resp['predictions']:
        if pred['description'].find('Novi Sad') != -1 or pred['description'].find(u'Нови Сад') != -1:
           obj = {}
           obj['name'] = pred['description'].encode('utf-8').encode('string-escape')
           obj['reference'] = pred['reference'].encode('utf-8').encode('string-escape')
           json_result['suggestions'].append(obj)

   return str(json_result)

Here is the solution on the Java client:

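private String python2JavaStr(String pythonStr) throws UnsupportedEncodingException {
    // Work on explicit UTF-8 bytes instead of the platform default charset.
    byte[] bytes = pythonStr.getBytes("UTF-8");
    ByteBuffer decodedBytes = ByteBuffer.allocate(bytes.length);
    for (int i = 0; i < bytes.length; i++) {
        // Only treat "\x" as an escape when two hex digits can follow it.
        if (bytes[i] == '\\' && i + 3 < bytes.length && bytes[i + 1] == 'x') {
            // \xc4 => c4 => 196
            String hex = new String(bytes, i + 2, 2, "US-ASCII");
            decodedBytes.put((byte) Integer.parseInt(hex, 16));
            i += 3;
        } else {
            decodedBytes.put(bytes[i]);
        }
    }
    // Decode only the bytes actually written; array() alone includes trailing zeros.
    return new String(decodedBytes.array(), 0, decodedBytes.position(), "UTF-8");
}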
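For comparison, the same unescaping idea can be sketched in Python. This is only an illustration, assuming Python 3 (the service code above is Python 2) and assuming the input contains literal backslash-x sequences as text; the helper name unescape_hex_bytes is hypothetical and not part of the original code.

```python
import re


def unescape_hex_bytes(text: str) -> str:
    """Turn literal \\xNN sequences into raw bytes, then decode as UTF-8."""
    raw = re.sub(
        rb"\\x([0-9a-fA-F]{2})",                 # a backslash, 'x', two hex digits
        lambda m: bytes([int(m.group(1), 16)]),  # replace with the single byte
        text.encode("utf-8"),
    )
    return raw.decode("utf-8")


# The raw string keeps the backslashes as literal characters, as in the question.
result = unescape_hex_bytes(r"Oslobo\xc4\x91enja")
assert result == "Oslobo\u0111enja"  # U+0111 is the letter đ
```

Like the Java version, this is a workaround: it undoes an escaping step that the service did not need to apply in the first place.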
Gruel answered 3/9, 2013 at 13:56 Comment(22)
You have UTF-8 data displayed as a Python string literal; decoding that to Unicode gives Oslobođenja. Presumably Java can handle UTF-8 data? - Venosity
Maybe have a look at this question: https://mcmap.net/q/525655/-string-decode-utf-8 - Bertelli
@MartijnPieters I get the Python string u'Oslobo\u0111enja', which Java can't handle. Just can't. Java can handle 'Oslobo\xc4\x91enja'. Now I just need to decode it into, as you guessed: Oslobođenja. - Cantara
@Bertelli I've looked, and that is just like x + 2 - 2. Nothing changed... Thanks anyway. - Cantara
@Ognjen: Who said anything about decoding the value in Python? You have the Unicode value, but it's the UTF-8 value that Java can handle. - Venosity
@MartijnPieters Yes, sorry, I forgot to mention. I've tried to read u'Oslobo\u0111enja' in JSON from Java, and Java can't do that. So I did str.encode('utf-8') in Python and got 'Oslobo\xc4\x91enja', which Java can read from JSON. But I don't know how to get from that to: Oslobođenja - Cantara
@Ognjen: Stick to the json module to produce valid JSON. u'Oslobo\u0111enja' is not JSON, that's a Python string literal. "Oslobo\u0111enja" is. - Venosity
@MartijnPieters Thanks man, but do you have any idea how to do that in Python? In Python, when parsing JSON text I get a JSON object, and every key and every (string) value is unicode. How do I get "Oslobo\u0111enja" from u'Oslobo\u0111enja'? - Cantara
@Ognjen: What are you trying to do? If you are loading JSON in Python, then u'Oslobo\u0111enja' is exactly what you want. That's a valid Unicode value right there. I assumed you were producing JSON for some Java code to read and were struggling with the Java side. - Venosity
@MartijnPieters OK, brief explanation. In the Python web service, I first download some JSON from one of Google's services, then I make a JSON object from it (json.loads(resp)), and then I make a shortened JSON from it for my app. The Java client then reads the shortened JSON with Google Gson. Everything works, but Java doesn't decode that "Oslobo\xc4\x91enja" string the right way. Just that. - Cantara
@MartijnPieters Or, let me be very clear: how do I get the letter 'đ' from \xc4\x91? How can I get a Unicode character from \xc4\x91 in Java? - Cantara
@Ognjen: Can you update your question to show the code for that? Either pass Unicode values to json.dumps() to produce valid JSON for Java to handle, or tell json.dumps() with the encoding parameter how to decode byte strings. - Venosity
@Ognjen: 'Oslobo\xc4\x91enja' is a Python literal string notation for UTF-8 bytes. I had Python interpret that string back to a byte string value, then decoded from UTF-8. - Venosity
@MartijnPieters OK, I will update the question in a minute... - Cantara
@MartijnPieters Thanks, I've found a solution following your and Keith's suggestions. - Cantara
@Ognjen: I only just now saw your code. You are returning a Python object string representation, not a JSON response. - Venosity
@MartijnPieters No, I'm returning a hash, a dictionary as a string. - Cantara
@Ognjen: Exactly. How is Java supposed to interpret Python dictionary objects? Use actual JSON instead. - Venosity
@MartijnPieters Hmm, I tried print json_resp (which is a JSON object) and I got every key/value pair in unicode. That way Google Gson surely wouldn't parse it. Check the question; I've updated it with the solution. Thanks for your effort and time, Martin, I really appreciate it! Thanks, you helped me! - Cantara
Don't put 'Solved' in the question title; instead, you can mark any one answer as accepted. - Venosity
@Ognjen: JSON uses \u.... escape codes as well, and those any JSON parser can handle. It's the u'..' or u"..." strings that don't work for such parsers. - Venosity
@MartijnPieters Now everything is clear. Thanks, you are the boss! - Cantara

You are returning the string version (the repr) of the Python data structure, which is not JSON.

Return an actual JSON response instead; leave the values as Unicode:

if json_resp['status'] == u'OK':
    for pred in json_resp['predictions']:
        desc = pred['description']
        if u'Novi Sad' in desc or u'Нови Сад' in desc:
            obj = {
                'name': desc,
                'reference': pred['reference']
            }
            json_result['suggestions'].append(obj)

return json.dumps(json_result)

Now Java does not have to interpret Python escape codes, and can parse valid JSON instead.
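The difference between the two return values can be seen in a small sketch. It assumes Python 3 (the code above is Python 2), and the sample dictionary and the reference value ref-123 are made up for illustration.

```python
import json

# Hypothetical sample data standing in for the filtered predictions.
json_result = {"suggestions": [{"name": "Oslobo\u0111enja", "reference": "ref-123"}]}

as_repr = str(json_result)         # Python dict notation with single quotes: not JSON
as_json = json.dumps(json_result)  # valid JSON; non-ASCII is written as \uXXXX by default

print(as_json)

# The repr is rejected by a JSON parser because of the single quotes.
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False

# The real JSON round-trips back to the original Unicode text.
assert json.loads(as_json) == json_result
assert not repr_is_json
```

The \u0111 escape in the output is standard JSON, so a JSON parser on the Java side should turn it back into the character without any custom decoding.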

Venosity answered 3/9, 2013 at 15:12 Comment(1)
As you English-speaking people would say: works like a charm! :) Thanks, this is a far more elegant solution. I'm still learning Python. - Cantara

Python 2 displays each non-ASCII byte of a byte string as a \xVV escape, where VV is the hex value of the byte; here the bytes are the UTF-8 encoding of the character. This is very different from Java's Unicode escapes, which use a single \uVVVV per UTF-16 code unit, where VVVV is the hex value of that code unit.

Consider:

\xc4\x91

In decimal, those hex values are:

196 145

then (in Java):

byte[] bytes = { (byte) 196, (byte) 145 };
System.out.println("result: " + new String(bytes, "UTF-8"));

prints:

result: đ
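To see why those two bytes become one character, the UTF-8 bit layout can be worked through by hand. The sketch below assumes Python 3 and is only an illustration of the two-byte case (110xxxxx 10xxxxxx), not a general decoder.

```python
raw = bytes([196, 145])  # the same two bytes as \xc4\x91
assert raw == b"\xc4\x91"

# A two-byte UTF-8 sequence carries 5 payload bits in the first byte
# and 6 payload bits in the second byte.
high = raw[0] & 0x1F   # 0xC4 = 1100 0100 -> payload 0 0100
low = raw[1] & 0x3F    # 0x91 = 1001 0001 -> payload 01 0001
code_point = (high << 6) | low

print(hex(code_point))  # 0x111

# The built-in decoder arrives at the same code point, U+0111.
decoded = raw.decode("utf-8")
assert ord(decoded) == code_point
```

U+0111 is the letter đ, which is also what the JSON escape \u0111 denotes.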
Optional answered 3/9, 2013 at 14:47 Comment(1)
Thank you 10000 times! Buy a beer on me and send me the bill :) Thanks once more! - Cantara
