What valid JSON files are not valid YAML 1.1 files?
Y

2

14

YAML 1.2 is (with one minor caveat regarding duplicate keys) a superset of JSON, so any valid JSON file is also a valid YAML file. However, the YAML 1.1 specification (which has the most library support) doesn't mention JSON. Most valid JSON files are valid YAML 1.1 files, but I found at least one exception by experimenting with PyYAML and Python's standard JSON library:

  • a double-precision floating-point overflow such as 12345e999 is interpreted as a string by PyYAML and as IEEE infinity by Python's JSON library.

Does anyone have a complete list of differences, determined more robustly than by testing edge cases in a particular implementation? (That is, from a comparison of the specifications?) For example, I want to generate JSON strings that will be interpreted the same way by a JSON parser and a YAML 1.1 parser: what constraints must I place on my strings?

Yarn answered 5/2, 2014 at 18:4 Comment(2)
I said, "Does anyone have a list...". I'm not asking people to do new work, I'm asking if anyone else has encountered this problem before so that we can share results. - Yarn
I don't think the 12345e999 example shows that the file wasn't valid JSON or YAML. 1) It was after all interpreted without error by both implementations (which, of course, might be buggy); and 2) AFAIK neither the YAML nor the JSON spec strictly defines the range of floating-point values that have to be supported by an implementation, so implementation-specific behaviour is fair game. - Ravishing
A
15

See here (specifically footnote 25). It says:

The incompatibilities were as follows: JSON allows extended character sets like UTF-32 and had incompatible unicode character escape syntax relative to YAML; YAML required a space after separators like comma, equals, and colon while JSON does not. Some non-standard implementations of JSON extend the grammar to include Javascript's /*...*/ comments. Handling such edge cases may require light pre-processing of the JSON before parsing as in-line YAML

See also https://metacpan.org/pod/JSON::XS#JSON-and-YAML

Related
What is the difference between YAML and JSON? When to prefer one over the other

Arillode answered 5/2, 2014 at 18:16 Comment(1)
As of YAML 1.2, YAML is a strict superset of JSON. yaml.org/spec/1.2.2/#12-yaml-history - Godred
G
13

As you noticed, what the specifications say is one thing, and what commonly available parsers (both YAML and JSON) actually process is another. You should therefore take several aspects into account and stick to the least common denominator in order to be able to load your JSON with a YAML parser.

On the JSON side there are multiple standards and best practices. Originally a JSON text had to have an object or array at the topmost level. This is still so according to the fail1.json file available on the json.org site:

"A JSON payload should be an object or array, not a string."

According to RFC 7159 any value can be at the top level (apart from using a string, this leads to rather boring JSON files):

A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array. Implementations that generate only objects or arrays where a JSON text is called for will be interoperable in the sense that all implementations will accept these as conforming JSON texts.

Because of the problems with JSON hijacking (by redefining the array handling in older browsers) there have been implementations that only accept an object at the top level (i.e. the first character of the file has to be {).
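To see which convention a given JSON text follows, the top-level type can be inspected after parsing. The following is a minimal sketch using Python's standard json module, which accepts any value at the top level; the helper name is_object_text is made up for illustration:

```python
import json


def is_object_text(text):
    """Return True if `text` is JSON whose top-level value is an object."""
    return isinstance(json.loads(text), dict)


print(is_object_text('{"a": 1}'))   # top-level object
print(is_object_text('[1, 2]'))     # top-level array
print(is_object_text('"hello"'))    # top-level string, allowed by RFC 7159
```

Restricting output to top-level objects keeps the text acceptable to the stricter implementations mentioned above.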

On the YAML side there are fewer competing standards than with JSON, but things get muddled by the persistent usage of YAML 1.1, and it does not help that if you google for "yaml current spec" the first hit is yaml.org/spec/current.html, which is actually an old working draft for YAML 1.1.

Apart from the UTF-32 support the other answer mentioned, which is largely a non-issue in a world using UTF-8 almost exclusively, there are a few things to take into account, especially if you want PyYAML to be able to parse your JSON (PyYAML still implements most of YAML 1.1 only, close to eight years after the YAML 1.2 spec release):

  • numbers in JSON don't need a dot in the mantissa, even if such a number has an exponent, but the Floating-Point Language-Independent Type for YAML™ Version 1.1 does require that dot:

    |[-]?0\.([0-9]*[1-9])?e[-+](0|[1-9][0-9]+) (scientific)
           ^--- no ? or * associated with this dot
    

    In the YAML 1.2 spec this regex has changed to:

    -? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?
    

    which allows the dot to be omitted even if there is an exponent (with e, not E).

    This is why your 12345e999 is handled differently by JSON (overflow) and PyYAML (string). In YAML 1.1 it can only be interpreted as a string, so it doesn't need quotes and can be a plain scalar.

  • In YAML 1.1 there are escape sequences, but these are not a superset of what JSON supports. The forward slash (/) can be escaped in JSON, but not in YAML 1.1 (it can in YAML 1.2, rule 53).

  • In JSON as well as in YAML 1.1 you can use \uNNNN to indicate a 16-bit Unicode code point. Although the YAML 1.1 spec (and YAML 1.2) mentions surrogate pairs in conjunction with using UTF-16, nothing is mentioned about such pairs as escape sequences ("\uD834\uDD1E"). This string sequence is explicitly mentioned in RFC 7159 as representing the G clef character (U+1D11E). I don't know of any YAML parser that supports this; PyYAML throws a:

    yaml.reader.ReaderError: unacceptable character #xd834: special characters are not allowed
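For comparison, Python's standard json module accepts all three constructs from the list above. A minimal sketch using only the standard library (the variable names are illustrative):

```python
import json
import math

# A number with an exponent but no dot: overflows to IEEE infinity.
overflow = json.loads('12345e999')

# An escaped forward slash decodes to a plain slash.
slash = json.loads(r'"\/"')

# An escaped surrogate pair decodes to a single code point, U+1D11E.
clef = json.loads(r'"\uD834\uDD1E"')

print(math.isinf(overflow), slash, len(clef), hex(ord(clef)))
```

Each of these inputs is therefore a case where a JSON parser and a YAML 1.1 parser can disagree.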

So as long as you write your JSON

  • as UTF-8
  • with the top-level being an object
  • scientific numbers always with a dot
  • no \/ escape sequence
  • no \uNNNN characters between \uD7FF and \uE000 (exclusive), nor \uFFFE, nor \uFFFF

you should be fine for both JSON and YAML (1.1) parsers.
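The checklist above can be turned into a rough pre-flight check. The following is a sketch under stated assumptions, not a complete validator: the function name yaml11_portability_problems is hypothetical, the input is assumed to be an already decoded str (so the UTF-8 requirement is not checked), and it relies on the standard json module passing every number that has a fraction or an exponent to the parse_float hook as its original text.

```python
import json
import re

# Matches one JSON escape sequence at a time, so an escaped backslash
# followed by a plain slash is not mistaken for an escaped slash.
_ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|.)')


def yaml11_portability_problems(text):
    """Return a list of reasons why `text` may load differently under YAML 1.1."""
    problems = []

    def check_float(literal):
        # Called by json for every number with a fraction or an exponent.
        if '.' not in literal:
            problems.append(f'number without a dot: {literal}')
        return float(literal)

    value = json.loads(text, parse_float=check_float)
    if not isinstance(value, dict):
        problems.append('top level is not an object')

    # In valid JSON, backslashes only occur inside strings, so the raw
    # text can be scanned for escape sequences directly.
    for match in _ESCAPE.finditer(text):
        escape = match.group(1)
        if escape == '/':
            problems.append('escaped forward slash')
        elif len(escape) == 5:  # 'u' followed by four hex digits
            code = int(escape[1:], 16)
            if 0xD800 <= code <= 0xDFFF or code in (0xFFFE, 0xFFFF):
                problems.append(f'problematic escape: \\{escape}')
    return problems


print(yaml11_portability_problems('{"a": 12345e999, "b": "\\/"}'))
```

An empty list only means that none of the listed constraints was violated; it does not prove that every YAML 1.1 parser will load the text identically.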


¹ In ruamel.yaml, a YAML 1.2 parser of which I am the author, the \/ escape and scientific numbers without a dot are handled correctly: your 12345e999 loads as type float and prints as inf.

Glyph answered 18/6, 2017 at 16:59 Comment(0)
