How to cryptographically hash a JSON object?

The following question is more complex than it may first seem.

Assume that I've got an arbitrary JSON object, one that may contain any amount of data including other nested JSON objects. What I want is a cryptographic hash/digest of the JSON data, without regard to the actual JSON formatting itself (e.g., ignoring differences in newlines and spacing between the JSON tokens).

The last part is a requirement, as the JSON will be generated/read by a variety of (de)serializers on a number of different platforms. I know of at least one JSON library for Java that completely removes formatting when reading data during deserialization. As such, it would break the hash.

The arbitrary data clause above also complicates things, as it prevents me from taking known fields in a given order and concatenating them prior to hashing (think roughly how Java's non-cryptographic hashCode() method works).

Lastly, hashing the entire JSON String as a chunk of bytes (prior to deserialization) is not desirable either, since there are fields in the JSON that should be ignored when computing the hash.

I'm not sure there is a good solution to this problem, but I welcome any approaches or thoughts =)

Torrent answered 12/1, 2011 at 15:25 Comment(6)
Did you have a look at the XML DSig? They have the same problem and have a quite complex "canonicalization" spec.Beatriz
I can't help but notice how apt your name is to the question.Beale
This is being standardized. See the JSON Web Signature (JWS) draft RFC. tools.ietf.org/html/draft-ietf-jose-json-web-signature-17Tirol
that RFC only specifies a JSON format to store payload+signature+some headers, no JSON canonicalization is mentionedTraitorous
@RomanPlášil there are existing implementations in Go / Node.js / Python that you can use and that do the canonicalization for you.Berga
You can also have a look at this RFC draft: dpaste-bkero.paas.allizom.org/MtkA/rawBerga

The problem is a common one when computing hashes for any data format where flexibility is allowed. To solve this, you need to canonicalize the representation.

For example, the OAuth1.0a protocol, which is used by Twitter and other services for authentication, requires a secure hash of the request message. To compute the hash, OAuth1.0a says you need to first alphabetize the fields, separate them by newlines, remove the field names (which are well known), and use blank lines for empty values. The signature or hash is computed on the result of that canonicalization.

XML DSIG works the same way - you need to canonicalize the XML before signing it. There is a W3C standard covering this, because it's such a fundamental requirement for signing. Some people call it c14n.

I don't know of a canonicalization standard for JSON. It's worth researching.

If there isn't one, you can certainly establish a convention for your particular application usage. A reasonable start (sketched in code after the list) might be:

  • lexicographically sort the properties by name
  • use double quotes on all names
  • use double quotes on all string values
  • either no space or a single space between a name and its colon, and between the colon and the value (pick one)
  • no spaces between values and the following comma
  • all other whitespace collapsed to either a single space or nothing (pick one)
  • exclude any properties you don't want to sign (for example, the property that holds the signature itself)
  • sign the result with your chosen algorithm
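
A minimal sketch of that convention in Python, using only the standard library (the excluded "signature" property name and the choice of SHA-256 are assumptions, not part of any standard):

import hashlib
import json

def canonical_hash(obj, exclude=("signature",)):
    def strip(value):
        # Drop excluded properties at every nesting level.
        if isinstance(value, dict):
            return {k: strip(v) for k, v in value.items() if k not in exclude}
        if isinstance(value, list):
            return [strip(v) for v in value]
        return value

    # sort_keys gives the lexicographic ordering; separators removes all
    # whitespace; ensure_ascii escapes non-ASCII characters consistently.
    canonical = json.dumps(strip(obj), sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two orderings of the same data hash identically:
assert canonical_hash({"b": 1, "a": 2}) == canonical_hash({"a": 2, "b": 1})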

You may also want to think about how to pass that signature in the JSON object - possibly establish a well-known property name, like "nichols-hmac" or something, that gets the base64 encoded version of the hash. This property would have to be explicitly excluded by the hashing algorithm. Then, any receiver of the JSON would be able to check the hash.

The canonicalized representation does not need to be the representation you pass around in the application. It only needs to be easily produced given an arbitrary JSON object.

Suckle answered 12/1, 2011 at 15:38 Comment(2)
Canonicalisation must also take into account the representation of characters: "A" vs "\u0041", "é" vs "\u00e9" vs "\u00E9". Same issue for numbers: 1 vs 0.1e1.Hurtful
Canonicalization must also take numbers into account. ECMAScript defines JSON.stringify, which says to format a number without an exponent if it is in the range [1e-6, 1e21); otherwise format it with one digit before the decimal point.Moynihan

Instead of inventing your own JSON normalization/canonicalization, you may want to use bencode. Semantically it covers nearly the same ground as JSON (a composition of numbers, strings, lists and dicts), but with the unambiguous encoding that is necessary for cryptographic hashing.

bencode is used as the torrent file format; every BitTorrent client contains an implementation.
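
For illustration, a minimal sketch of bencode as a pre-hash normalization step (encoding JSON strings as UTF-8 byte strings is an assumption; see the comment thread below for more caveats):

import hashlib

def bencode(value):
    # bencode has only integers (no floats), byte strings (no Unicode),
    # and no booleans or null, so those cases need extra conventions.
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type; pick a convention")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        data = value.encode("utf-8")
        return b"%d:%s" % (len(data), data)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        out = b"d"
        for key in sorted(value):  # bencode requires sorted dict keys
            out += bencode(key) + bencode(value[key])
        return out + b"e"
    raise TypeError("cannot bencode %r" % type(value))

digest = hashlib.sha256(bencode({"Name1": "Value1", "Name2": "Value2"})).hexdigest()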

Exuviate answered 12/1, 2011 at 15:54 Comment(5)
JSON is greatly preferred because nearly every language has libraries available to do object (de)serialization.Torrent
I meant using bencode only as a normalization step before the hashing. Outside of your hashing routine everything stays JSON.Exuviate
bencode is great and super easy to implement. Canonical JSON won't parse with a standard JSON parser either. Neither needs to be parsed for this application which only requires a hash function input.Shiri
+1 for this answer - I work on an OSS project called Learning Registry which is a distributed JSON database. We have to sign every JSON document before it goes into the database. To accomplish this we (among other things) convert JSON to Bencode before signing b/c Bencode is a reliable semantic representation, whereas JSON isn't (in our experience).Estuarine
bencoding encodes only byte strings while JSON encodes Unicode strings. So you have to design a JSON-string canonicalization on top of bencode. And bencode doesn't encode float values that JSON has.Hurtful

This is the same issue as causes problems with S/MIME signatures and XML signatures. That is, there are multiple equivalent representations of the data to be signed.

For example in JSON:

{  "Name1": "Value1", "Name2": "Value2" }

vs.

{
    "Name1": "Value\u0031",
    "Name2": "Value\u0032"
}

Or depending on your application, this may even be equivalent:

{
    "Name1": "Value\u0031",
    "Name2": "Value\u0032",
    "Optional": null
}

Canonicalization could solve that problem, but it's a problem you don't need at all.

The easy solution if you have control over the specification is to wrap the object in some sort of container to protect it from being transformed into an "equivalent" but different representation.

I.e. avoid the problem by not signing the "logical" object but signing a particular serialized representation of it instead.

For example, JSON Objects -> UTF-8 Text -> Bytes. Sign the bytes as bytes, then transmit them as bytes e.g. by base64 encoding. Since you are signing the bytes, differences like whitespace are part of what is signed.

Instead of trying to do this:

{  
   "JSONContent": {  "Name1": "Value1", "Name2": "Value2" },
   "Signature": "asdflkajsdrliuejadceaageaetge="
}

Just do this:

{
   "Base64JSONContent": "eyAgIk5hbWUxIjogIlZhbHVlMSIsICJOYW1lMiI6ICJWYWx1ZTIiIH0s",
   "Signature": "asdflkajsdrliuejadceaageaetge="
}

I.e. don't sign the JSON, sign the bytes of the encoded JSON.

Yes, it means the signature is no longer transparent.
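
A minimal sketch of this wrap-and-sign approach in Python (HMAC-SHA256 and the shared secret are assumptions; the envelope field names follow the example above):

import base64
import hashlib
import hmac
import json

SECRET = b"shared-secret"  # hypothetical key agreed between both endpoints

def wrap(obj):
    # Serialize exactly once; these bytes, whitespace and all, are what
    # gets signed and transmitted, so no canonicalization is needed.
    raw = json.dumps(obj).encode("utf-8")
    mac = hmac.new(SECRET, raw, hashlib.sha256).digest()
    return {
        "Base64JSONContent": base64.b64encode(raw).decode("ascii"),
        "Signature": base64.b64encode(mac).decode("ascii"),
    }

def unwrap(envelope):
    raw = base64.b64decode(envelope["Base64JSONContent"])
    sig = base64.b64decode(envelope["Signature"])
    if not hmac.compare_digest(hmac.new(SECRET, raw, hashlib.sha256).digest(), sig):
        raise ValueError("signature mismatch")
    return json.loads(raw)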

Sublingual answered 6/12, 2016 at 14:17 Comment(2)
Pro: This loosens the coupling for the properties, as indicated by your "Optional" object. Minor con: Standard API tools don't understand this packaging. Then again, producing hashes for those isn't trivial.Southworth
It's been years since I've focused on this problem, but if I had to implement hashing today this is the approach I would take.Torrent

JSON-LD can do normalization.

You will have to define your context.
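
A sketch of what that might look like, assuming the third-party pyld library: its URDNA2015 normalization produces a deterministic N-Quads string that is safe to hash.

import hashlib
from pyld import jsonld  # assumes the PyLD package is installed

doc = {
    "@context": {"name": "http://schema.org/name"},  # the context you define
    "name": "Manu Sporny",
}
normalized = jsonld.normalize(
    doc, {"algorithm": "URDNA2015", "format": "application/n-quads"})
digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()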

Comptom answered 31/1, 2015 at 8:28 Comment(0)

RFC 7638: JSON Web Key (JWK) Thumbprint defines a form of canonicalization. Although RFC 7638 expects a limited set of members, the same calculation can be applied to any members.

https://www.rfc-editor.org/rfc/rfc7638#section-3
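
A sketch of the RFC 7638 calculation for an RSA key (other key types require different members; for RSA they are e, kty and n, which happen to already be in lexicographic order):

import base64
import hashlib
import json

def jwk_thumbprint(jwk):
    # Keep only the required members, serialize with sorted keys and no
    # whitespace, hash with SHA-256, then base64url-encode without padding.
    required = {k: jwk[k] for k in ("e", "kty", "n")}
    canonical = json.dumps(required, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")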

Posh answered 22/12, 2018 at 2:24 Comment(0)

What would be ideal is if JavaScript itself defined a formal hashing process for JavaScript Objects.

Yet we do have RFC 8785, the JSON Canonicalization Scheme (JCS), which hopefully will be implemented in most JSON libraries and, in particular, added to the popular JavaScript JSON object. With canonicalization done, it is just a matter of applying your preferred hashing algorithm.

If JCS is available in browsers and other tools and libraries, it becomes reasonable to expect most JSON on the wire to be in this common canonicalized form. Consistent application and verification of standards like this can go some way toward pushing back against trivial security threats.
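
Until a JCS library is wired in, a rough approximation can be sketched with the standard json module. This is not full RFC 8785: JCS also mandates ECMAScript number formatting and UTF-16 ordering of keys, which json.dumps does not reproduce.

import hashlib
import json

def approximate_jcs_hash(obj):
    # Close to JCS only for data limited to objects, arrays, strings,
    # integers, booleans and nulls; floats need real RFC 8785 formatting.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()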

Subarid answered 4/11, 2022 at 5:31 Comment(0)

I would process all fields in a given order (alphabetically, for example). Why does arbitrary data make a difference? You can just iterate over the properties (à la reflection).

Alternatively, I would look into converting the raw JSON string into some well-defined canonical form (removing all superfluous formatting) and hashing that.

Chastain answered 12/1, 2011 at 15:37 Comment(0)

We encountered a similar issue with hashing JSON-encoded payloads. In our case we use the following methodology (sketched in code after the lists below):

  1. Convert the data into a JSON object.
  2. Base64-encode the JSON payload.
  3. Message-digest (HMAC) the generated base64 payload.
  4. Transmit the base64 payload.

Advantages of using this solution:

  1. Base64 will produce the same output for a given payload.
  2. Since the signature is derived directly from the base64-encoded payload, and since that same base64 payload is exchanged between the endpoints, we can be certain that the signature and payload stay in sync.
  3. This solution solves problems that arise from differences in the encoding of special characters.

Disadvantages

  1. The encoding/decoding of the payload may add overhead.
  2. Base64-encoded data is typically about a third larger than the original payload.
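
A minimal sketch of those four steps (the key and the hash algorithm are assumptions); note that, per the comment thread, the HMAC here is computed over the base64 string itself:

import base64
import hashlib
import hmac
import json

KEY = b"shared-secret"  # assumed pre-shared HMAC key

def sign_payload(data):
    payload = base64.b64encode(json.dumps(data).encode("utf-8"))  # steps 1-2
    signature = hmac.new(KEY, payload, hashlib.sha256).hexdigest()  # step 3
    return payload.decode("ascii"), signature  # step 4: transmit both
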
Aa answered 5/4, 2018 at 2:8 Comment(8)
Base64-encoded data is usually about 30% larger than the original payload.Aa
Wouldn't you need to sort object keys first (e.g. alphabetically) because JSON doesn't guarantee the order of object keys so identical data could have different key orders, giving a different hash?Branching
Sorting of original string is not needed since you will be hashing the b64 payload and not the original JSON string itself.Aa
@DeezzleLuBimkii What do you mean by "payload"? Specifically, if I have the object {"a":1,"b":2}, then when you say payload do you mean just the values 1 and 2, or do you mean the whole serialized string '{"a":1,"b":2}'? If the latter, then it absolutely does matter for the keys to be ordered. A change in ordering will change the base64 encoding. And if you mean the former, then this isn't quite solving the whole problem.Aplanatic
the payload could be the serialized data itself. Converting your serialized data to base64 would preserve the structure. Then you can hash/sign the b64-encoded string to get a signature. At the receiving end, you can verify the signature of the b64 payload by applying the same hash/signing algo used earlier. If the signature matches, then the b64 payload is valid and you can decode the payload. If the sig does not match then you can assume that the b64 has been tampered with.Aa
@DeezzleLuBimkii you're being very hand-wavy on "convert the serialized data to base64" -- the point is that you then have to serialize the data to a JSON string before you can check it, and so if the order of the fields changed -- even though it's actually the same data -- the check would fail. Thus order matters if you're checking that the data is the same after potentially deserializing and reserializing itBalkh
correct @Balkh . if the order of the variables in serialized object changes , then the calculated hash on the object would be different too .Aa
Similar to @Balkh the step 'convert data into JSON object' is fraught with problems as the layout of the JSON could change; that is the part that would need to be standardized and steps taken to verify it hasn't undergone some change for some JSON producer.Subarid

I am not sure why no library has been mentioned here yet, but you could just use something like https://www.npmjs.com/package/@tufjs/canonical-json as a first step, and afterwards apply any hash algorithm of your choice.

Flynt answered 9/5, 2023 at 11:16 Comment(0)
