Why does PyYAML use generators to construct objects?

I've been reading the PyYAML source code to try to understand how to define a proper constructor function that I can add with add_constructor. I have a pretty good understanding of how that code works now, but I still don't understand why the default YAML constructors in the SafeConstructor are generators. For example, the method construct_yaml_map of SafeConstructor:

def construct_yaml_map(self, node):
    data = {}
    yield data
    value = self.construct_mapping(node)
    data.update(value)

I understand how the generator is used in BaseConstructor.construct_object, as follows, to stub out an object and to defer populating it with data from the node when deep=False is passed to construct_mapping:

    if isinstance(data, types.GeneratorType):
        generator = data
        data = generator.next()
        if self.deep_construct:
            for dummy in generator:
                pass
        else:
            self.state_generators.append(generator)

And I understand how the data is generated in BaseConstructor.construct_document in the case where deep=False for construct_mapping.

def construct_document(self, node):
    data = self.construct_object(node)
    while self.state_generators:
        state_generators = self.state_generators
        self.state_generators = []
        for generator in state_generators:
            for dummy in generator:
                pass

What I don't understand is the benefit of stubbing out the data objects and working down through the objects by iterating over the generators in construct_document. Does this have to be done to support something in the YAML spec, or does it provide a performance benefit?

This answer on another question was somewhat helpful, but I don't understand why that answer does this:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    yield instance
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)

instead of this:

def foo_constructor(loader, node):
    state = loader.construct_mapping(node, deep=True)
    return Foo(**state)

I've tested that the latter form works for the examples posted on that other answer, but perhaps I am missing some edge case.

I am using version 3.10 of PyYAML, but it looks like the code in question is the same in the latest version (3.12) of PyYAML.

Nicolas answered 27/1, 2017 at 18:35 Comment(0)

In YAML you can have anchors and aliases. With that you can make self-referential structures, directly or indirectly.

If YAML did not allow self-reference, you could simply construct all the children first and then create the parent structure in one go. But with self-references, the child you need in order to fill out the structure you are creating might not exist yet, because it is that very structure. The generator gives you a two-step process (two steps, because there is only one yield before the end of the method): you first create the object partially, and then fill it out, possibly with a reference to itself, because by then the object exists (i.e. its place in memory is defined).
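
The two-step idea can be illustrated without any YAML library. The following is only a minimal sketch in plain Python; the function name build_self_referential_mapping is made up for this illustration and is not part of PyYAML or ruamel.yaml:

```python
def build_self_referential_mapping():
    # Step 1: create the empty container, so that it has an identity
    # that other code (or the container itself) can refer to.
    data = {}
    yield data
    # Step 2: fill it in. A reference to `data` can now be used as a value.
    data["self"] = data
    data["items"] = [1, 2]


gen = build_self_referential_mapping()
stub = next(gen)           # the partially constructed (still empty) object
was_empty = (len(stub) == 0)
for _ in gen:              # run the generator to completion (step 2)
    pass

print(was_empty)               # the stub had no content after step 1
print(stub["self"] is stub)    # the finished object contains itself
```

A constructor that builds the whole object and returns it in one go has no point at which such a stub could be handed out, which is why it cannot take part in a cycle.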

The benefit is not speed; the generators exist purely to make self-reference possible.

A slightly simplified version of the example from the answer you refer to loads without error:

import sys
import ruamel.yaml as yaml


class Foo(object):
    def __init__(self, s, l=None, d=None):
        self.s = s
        self.l1, self.l2 = l
        self.d = d


def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    yield instance
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)

yaml.add_constructor(u'!Foo', foo_constructor)

x = yaml.load('''
&fooref
!Foo
s: *fooref
l: [1, 2]
d: {try: this}
''', Loader=yaml.Loader)

yaml.dump(x, sys.stdout)

but if you change foo_constructor() to:

def foo_constructor(loader, node):
    instance = Foo.__new__(Foo)
    state = loader.construct_mapping(node, deep=True)
    instance.__init__(**state)
    return instance

(yield removed, a final return added), you get a ConstructorError with the message:

found unconstructable recursive node 
  in "<unicode string>", line 2, column 1:
    &fooref

PyYAML should give a similar message. Inspect the traceback on that error and you can see where ruamel.yaml/PyYAML tries to resolve the alias in the source code.
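
To see why the generator form succeeds where the returning form fails, here is a much-reduced toy model of the alias handling. It is a sketch under assumptions, not the library's code: ToyLoader, ToyConstructorError and both constructor functions are hypothetical names, nodes are modelled as plain dicts, and the deep=True shortcut of the real construct_object is left out.

```python
import types


class ToyConstructorError(Exception):
    """Hypothetical stand-in for the library's ConstructorError."""


class ToyLoader:
    """Toy model: nodes are plain dicts, an alias is the same dict object."""

    def __init__(self):
        self.constructed = {}     # id(node) -> finished or stubbed object
        self.in_progress = set()  # ids of nodes whose constructor is running
        self.pending = []         # generators waiting for their second step

    def construct(self, node, constructor):
        key = id(node)
        if key in self.constructed:
            return self.constructed[key]      # alias to an existing object
        if key in self.in_progress:
            # The node refers to itself, but no object exists for it yet.
            raise ToyConstructorError("found unconstructable recursive node")
        self.in_progress.add(key)
        try:
            data = constructor(self, node)
            if isinstance(data, types.GeneratorType):
                generator = data
                data = next(generator)          # step 1: the empty stub
                self.pending.append(generator)  # step 2 runs later
            self.constructed[key] = data
        finally:
            self.in_progress.discard(key)
        return data

    def load(self, node, constructor):
        data = self.construct(node, constructor)
        # Drain the deferred generators; they may queue further ones.
        while self.pending:
            pending, self.pending = self.pending, []
            for generator in pending:
                for _ in generator:
                    pass
        return data


def generator_constructor(loader, node):
    data = {}
    yield data  # the stub is registered before the children are visited
    for key, child in node.items():
        data[key] = loader.construct(child, generator_constructor)


def returning_constructor(loader, node):
    # Visits the children before any object for `node` exists.
    return {key: loader.construct(child, returning_constructor)
            for key, child in node.items()}


# A node whose only value is an alias to the node itself.
node = {}
node["s"] = node

result = ToyLoader().load(node, generator_constructor)
print(result["s"] is result)

try:
    ToyLoader().load(node, returning_constructor)
    error_message = None
except ToyConstructorError as exc:
    error_message = str(exc)
print(error_message)
```

In the generator case the stub is already registered when the alias is looked up, so the lookup returns the stub. In the returning case the lookup happens while the constructor for that same node is still running, and the only option left is to raise.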

Album answered 27/1, 2017 at 18:48 Comment(9)
Thank you, I thought it might have something to do with aliases and anchors. Why is it that when I modify foo_constructor from your answer as described in my question, I seem to see the correct output? That answer has self-references in its examples. Can you include in your answer an example YAML document that would have problems if I edited foo_constructor to not be a generator as shown in my question? - Nicolas
@Nicolas I updated my answer with the code for ruamel.yaml. PyYAML should behave the same in this respect. Because it does not keep track of comments, its code for BaseConstructor.construct_mapping() might actually be easier to follow than that of ruamel.yaml. - Album
BTW, welcome to Stack Overflow, and please post more such excellent questions. - Album
Thank you. Your example is very helpful. I now see the difference between the example you gave, in which an object refers to itself, and the example I was trying. I was trying to understand self-reference using the third example you give in your other answer, but that one is not truly a self-reference like this one. Walking through those two examples in the debugger helped me understand it. And thanks for welcoming me! FYI, I did verify that your example works the same with PyYAML as well. - Nicolas
As a little more background, the reason I started this investigation was a desire to preserve order in YAML mappings. This answer on another post does not correctly handle self-reference in my testing. I was confused by the differences between that answer and this much more involved answer, and I wanted to find a functional difference, which I now have, so thanks again. - Nicolas
@ryan Did you try that with my answer? ruamel.yaml, when using the round-trip loader, preserves the key order in the mapping (by automatically storing things in an ordereddict). - Album
In addition, the second answer mentioned in my previous comment does not seem to handle tag:yaml.org,2002:omap correctly. I think it should just omit the line that adds the constructor for tag:yaml.org,2002:omap, but I don't have sufficient reputation to leave a comment to that effect (which spurred me to create an account). - Nicolas
Yes, I have tried out ruamel.yaml. Unfortunately, we have been using the solution from Eric's answer for quite some time in our code base, which relies heavily on YAML, and I'm not sure we can incur the risk of switching to ruamel.yaml at this time. I am also not sure we want to preserve comments. Is there a way to preserve only ordering with ruamel.yaml (and not comments)? Another impediment to using ruamel.yaml is that we try to use the CSafeLoader (for performance benefits), and that doesn't seem compatible with round-trip loading. - Nicolas
@Nicolas You should be able to combine the CSafeLoader with the RoundTripRepresenter, maybe with a little tweaking. The CSafeLoader drops the comments anyway. But if the risk needs to stay low, I would just overload the mapping constructor for PyYAML with one that preserves order. - Album
