How to log Protobuf string in nested objects in a human-readable way?

Asked 18/7, 2020 at 5:59 Answered 18/7, 2020 at 22:46

Solved serialization protocol-buffers protobuf-java

Given a proto file:

syntax = "proto3";
package hello;

message TopGreeting {
    NestedGreeting greeting = 1;
}

message NestedGreeting {
    Greeting greeting = 1;
}

message Greeting {
    string message = 1;
}

and the code:

public class Main {
    public static void main(String[] args) {
        System.out.printf("From top: %s%n", newGreeting("오늘은 무슨 요일입니까?"));
        System.out.printf("Directly: %s%n", "오늘은 무슨 요일입니까?");
        System.out.printf("ByteString: %s", newGreeting("오늘은 무슨 요일입니까?").toByteString().toStringUtf8());
    }

    private static Hello.TopGreeting newGreeting(String message) {
        Hello.Greeting greeting = Hello.Greeting.newBuilder()
                .setMessage(message)
                .build();
        Hello.NestedGreeting nestedGreeting = Hello.NestedGreeting.newBuilder()
                .setGreeting(greeting)
                .build();
        return Hello.TopGreeting.newBuilder()
                .setGreeting(nestedGreeting)
                .build();
    }
}

Output

From top: greeting {
  greeting {
    message: "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
  }
}

Directly: 오늘은 무슨 요일입니까?

ByteString: 
%
#
!오늘은 무슨 요일입니까?

How do I print the message in a human-readable way? As you can see, converting to a ByteString prints the UTF-8 characters correctly, but it also prints stray characters such as % and #.

Phocine answered 18/7, 2020 at 5:59 Comment(3)

Is it possible that the source code or those string literals are in UTF16 or something other than UTF8? The thing that's got my attention is that it has output things like "\354\230\244", but then the spaces are intact. Some of those numbers are >255, hence my wondering if it's trying to output 16 bit values. If it were dumping UTF8 as byte values, I'd expect them to be <255. – Springfield 18/7, 2020 at 9:37

Hello again, I found in this answer https://mcmap.net/q/597437/-from-compilation-to-runtime-how-does-java-string-encoding-really-work that Java strings are UTF16, which may have something to do with how the strings are appearing in the debug output. If the GPB class were expecting its buffer to contain UTF8 encoded text, but it actually contained UTF16 encoded text, then it would print out strangely; the two encodings are not compatible. I'm wondering if you can use something like this answer https://mcmap.net/q/24465/-encode-string-to-utf-8 to convert your string literal to UTF8 before initialising a newGreeting? – Springfield 18/7, 2020 at 13:4

@Springfield see my answer. Almost always, the truth is in the source code. – Phocine 18/7, 2020 at 22:47

Answering my own question: I solved this issue by digging through the Protobuf source code.

System.out.println(TextFormat.printer().escapingNonAscii(false).printToString(greeting));

Output:

greeting {
  greeting {
    message: "오늘은 무슨 요일입니까?"
  }
}

toString() uses the same TextFormat printer but with escapingNonAscii(true), which is the default when the option is omitted.

Also see this answer for how to convert octal escape sequences back to UTF-8 characters in case you only have the logs and cannot change the code that produced them.
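If all you have is a log line, the escaped value can be turned back into text with plain Java. The following is only a minimal sketch, not part of Protobuf: the class name OctalEscapes and the method decode are made up for illustration, and it assumes the log contains only octal escapes plus a few simple ones (\n, \r, \t, \\, \", \'). Each \ooo group is one raw byte, and the collected bytes are then decoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not a Protobuf API.
final class OctalEscapes {

    /** Turns backslash escapes from a logged text-format string back into UTF-8 text. */
    static String decode(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '\\' && i + 1 < escaped.length()) {
                char next = escaped.charAt(i + 1);
                if (next >= '0' && next <= '7') {
                    // Up to three octal digits encode a single raw byte.
                    int value = 0;
                    int j = i + 1;
                    while (j < escaped.length() && j < i + 4
                            && escaped.charAt(j) >= '0' && escaped.charAt(j) <= '7') {
                        value = value * 8 + (escaped.charAt(j) - '0');
                        j++;
                    }
                    bytes.write(value & 0xFF);
                    i = j;
                    continue;
                }
                // A few simple escapes; both strings are index-aligned.
                int simple = "nrt\\\"'".indexOf(next);
                if (simple >= 0) {
                    bytes.write("\n\r\t\\\"'".charAt(simple));
                    i += 2;
                    continue;
                }
            }
            // Everything else is copied through as its own UTF-8 bytes.
            int codePoint = escaped.codePointAt(i);
            byte[] raw = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
            bytes.write(raw, 0, raw.length);
            i += Character.charCount(codePoint);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The escaped value copied from the toString() output in the question.
        String logged = "\\354\\230\\244\\353\\212\\230\\354\\235\\200 "
                + "\\353\\254\\264\\354\\212\\250 "
                + "\\354\\232\\224\\354\\235\\274\\354\\236\\205\\353\\213\\210\\352\\271\\214?";
        System.out.println(decode(logged));
    }
}
```

On a console that can display UTF-8, this should print the original Korean sentence from the question.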

Phocine answered 18/7, 2020 at 22:46 Comment(1)

Good find :-) I should have spotted the octal... I've found a ref here developers.google.com/protocol-buffers/docs/reference/java/com/… The default toString() certainly seems to be pretty rubbish. The string encoding in the object is supposed to be UTF8, so one would think that it would at least try not to print it as 7-bit ASCII. I presume then it's relying on the stdout understanding UTF8 - which clearly yours does - but it's not guaranteed universally. I'm wondering if there's different behaviour on Windows and Linux. – Springfield 19/7, 2020 at 8:32

-1

The protobuf binary format isn't human readable and you shouldn't attempt to make it so. There is a JSON variant if you need, but frankly it would be better to log the interpreted data, not the payloads.
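To illustrate why the serialized payload looks like garbage when read as text, here is a rough sketch that rebuilds the bytes of the question's TopGreeting by hand, without the Protobuf library. The class and method names (WireFormatSketch, lengthDelimitedField1, encodeTopGreeting) are hypothetical, and the sketch assumes every payload is shorter than 128 bytes so that each length fits in one byte. Each nesting level adds a tag byte 0x0A (field 1, length-delimited), which prints as a newline, followed by a length byte. For this message the lengths are 37, 35 and 33, which happen to be the ASCII codes of %, # and !.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of the wire format; not how you should serialize in real code.
final class WireFormatSketch {

    /** Wraps a payload as field number 1 with wire type 2 (length-delimited). */
    static byte[] lengthDelimitedField1(byte[] payload) {
        if (payload.length > 127) {
            throw new IllegalArgumentException("sketch only handles one-byte lengths");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((1 << 3) | 2);    // tag byte 0x0A, which is also '\n'
        out.write(payload.length);  // length as a single-byte varint
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    /** Mirrors TopGreeting { NestedGreeting { Greeting { message } } } from the question. */
    static byte[] encodeTopGreeting(String message) {
        byte[] text = message.getBytes(StandardCharsets.UTF_8);
        byte[] greeting = lengthDelimitedField1(text);
        byte[] nested = lengthDelimitedField1(greeting);
        return lengthDelimitedField1(nested);
    }

    public static void main(String[] args) {
        // The sentence from the question, written with Unicode escapes (33 bytes in UTF-8).
        String message = "\uC624\uB298\uC740 \uBB34\uC2A8 \uC694\uC77C\uC785\uB2C8\uAE4C?";
        byte[] top = encodeTopGreeting(message);

        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 6; i++) {
            hex.append(String.format("%02x ", top[i]));
        }
        // prints: 0a 25 0a 23 0a 21
        System.out.println(hex.toString().trim());
    }
}
```

So the % and # in the ByteString output are length prefixes of the nested messages, not text. With a longer message they would turn into different characters.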

Komi answered 18/7, 2020 at 6:54 Comment(6)

I disagree. Almost always, one part of a response doesn’t stand on its own, and its interpretation depends on the other parts. Seeing the whole message is crucial for debugging, and works as expected with the ASCII charset. What boggles my mind is that Google went out of their way to obscure what’s printed. – Phocine 18/7, 2020 at 7:32

@AbhijitSarkar, you have misunderstood the purpose of GPB. Google designed it as a binary serialiser specifically to save storage space. Text serialisations, which can be clumsily read as plain text, take up a lot more room and take longer to send via a network connection. – Springfield 18/7, 2020 at 9:14

@Springfield I think you misunderstood my point. No one is stopping Google from doing what's best on the wire; I'm talking about printing messages for debugging, not transmitting them anywhere. Debugging, still, is done by programmers, who are usually human. – Phocine 18/7, 2020 at 9:16

@AbhijitSarkar ah I see, I'm sorry. Hang on a mo and I'll do some digging. My first instinct is that GPB has specific ways of representing strings, and that it can get lost if a different character encoding gets put into it. – Springfield 18/7, 2020 at 9:19

@Abhijit so log the contents, not the serialization payload, which is what you seem to be doing right now. The serialization payload is not intended to be readable. – Komi 19/7, 2020 at 0:45

@MarcGravell There is no ambiguity in my question or sample code. – Phocine 19/7, 2020 at 2:8
