Getting duplicate keys in YAML using Python

We need to parse YAML files that contain duplicate keys, and every occurrence has to be preserved; skipping duplicates is not enough. I know this is against the YAML spec and I would rather not have to do it, but a third-party tool we use allows it and we need to deal with it.

File example:

build:
  step: 'step1'

build:
  step: 'step2'

After parsing we should have a similar data structure to this:

yaml.load('file.yml')
# [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

dict can no longer be used to represent the parsed contents.
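
To make the requirement concrete, here is a small sketch in plain Python (no YAML library involved) of why a dict loses information and how a list of pairs can still be queried. The helper name `values_for` is made up for this illustration.

```python
# Parsed contents as (key, value) pairs, mirroring the desired output above.
pairs = [
    ("build", [("step", "step1")]),
    ("build", [("step", "step2")]),
]

# A dict keeps only the last value stored under a repeated key.
as_dict = dict(pairs)
print(as_dict)  # {'build': [('step', 'step2')]}


def values_for(pairs, key):
    """Return every value stored under `key`, in document order (hypothetical helper)."""
    return [value for k, value in pairs if k == key]


print(values_for(pairs, "build"))
# [[('step', 'step1')], [('step', 'step2')]]
```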

I am looking for a solution in Python and didn't find a library that supports this. Have I missed anything?

Alternatively, I am happy to write my own thing, but would like to keep it as simple as possible. ruamel.yaml looks like the most advanced YAML parser in Python and seems moderately extensible. Can it be extended to support duplicate keys?

Chanticleer answered 4/7, 2017 at 11:4 Comment(3)
I need to have the yaml with duplicate keys parsed, not just recognise that there are duplicate keys. Unless I am missing something, the links you provided won't do that?Chanticleer
Can you tell us what 3rd party tool generates such YAML? (YUNK?)Hypocoristic
@Hypocoristic the tool we use is Drone CI. It doesn't generate such files, but merely accepts them as valid input. It basically ignores the key names and only cares about the content and order. We are building some analysis tooling over the files we feed to Drone CI, and thus we need to be able to parse them.Chanticleer

PyYAML will just silently overwrite the first entry; ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.

If you don't want to create a full Constructor for all types, overwriting the mapping constructor in SafeConstructor should do the job:

import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""


def construct_yaml_map(self, node):
    # collect every (key, value) pair in order, so duplicate keys are preserved
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))


SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

which gives:

[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

However, it doesn't seem necessary to turn step: 'step1' into a list. The following only creates the list if there are duplicate keys (it could be optimised, if necessary, by caching the result of self.construct_object(key_node, deep=True)):

def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))

which gives:

[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]
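
Both versions `yield` the container before filling it. As explained in the comments on this answer, that two-step approach is needed for self-referential structures (anchors and aliases). The following is a toy illustration in plain Python, not ruamel.yaml's actual implementation; `ToyNode`, `ToyConstructor`, and their methods are invented names.

```python
class ToyNode:
    """Stand-in for a parsed sequence node; its children may include the node itself."""

    def __init__(self):
        self.children = []


class ToyConstructor:
    def __init__(self):
        self.constructed = {}  # node -> object already created for it

    def construct_list(self, node):
        data = []
        yield data  # step 1: hand out the (still empty) container
        # step 2: fill it; a reference back to `node` now finds `data` in the cache
        for child in node.children:
            data.append(self.construct_object(child))

    def construct_object(self, node):
        if not isinstance(node, ToyNode):
            return node  # scalars pass through unchanged
        if node in self.constructed:
            return self.constructed[node]
        generator = self.construct_list(node)
        data = next(generator)          # run up to the yield
        self.constructed[node] = data   # register before the children are built
        for _ in generator:             # run the rest of the generator
            pass
        return data


node = ToyNode()
node.children = ["step1", node]  # comparable to `&a [step1, *a]` in YAML
result = ToyConstructor().construct_object(node)
print(result[0], result[1] is result)  # step1 True
```

If `construct_list` returned a fully populated list instead, the inner reference to `node` would be reached before any object had been registered for it.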

Some points:

  • Probably needless to say, this will not work with YAML merge keys (<<: *xyz)
  • If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
  • If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):

    yaml_out = YAML(typ='safe')
    yaml_out.dump(data, sys.stdout)
    

    which gives (with the first construct_yaml_map):

    - - build
      - - [step, step1]
    - - build
      - - [step, step2]
    
  • What works in neither PyYAML nor ruamel.yaml is yaml.load('file.yml'). If you don't want to open() the file yourself, you can do:

    from pathlib import Path  # or: from ruamel.std.pathlib import Path
    yaml = YAML(typ='safe')
    yaml.load(Path('file.yml'))
    

¹ Disclaimer: I am the author of that package.

Hypocoristic answered 4/7, 2017 at 12:58 Comment(9)
Mind blown! That's much more elegant than I thought it'd be. Thanks a lot for the code and good explanation! Thankfully the limitations are fine to me.Chanticleer
One question: in construct_yaml_map is there an advantage of yielding the data array instead of just returning it when it's populated?Chanticleer
@Chanticleer Yes, the yield is an essential part of the two-step generation that is necessary for self-referential structures (i.e. those using anchors and aliases)Hypocoristic
Makes sense, cheers. FYI I ended up using multidict to represent the file.Chanticleer
I am trying to convert this to a round-trip function that can handle both duplicate and non-duplicate keys (i.e., make all second-level nodes lists), but I cannot make it work. Could you help me there? Or just tell me how to get "node", and I can then reverse-engineer it. Thanks.Christychristye
@Christychristye Please post a new question with the code (and input file) that you have, even if it is not working. Tag it ruamel.yaml and I'll get notified that there is a new question.Hypocoristic
@Hypocoristic Do you have any idea how to limit this constructor to specific keys, e.g. "build"? I'd like to use this approach in my application, but don't want to affect the remaining data structure. Within construct_yaml_map() only the value seems to be available, not its key.Legitimacy
@Legitimacy You can subclass the Constructor with just the method for representing mappings changed, and include the code for checking on keys. But if it depends on context, that is going to be difficult, and you are better off recursively traversing the data structure before dumping, as that gives you more control over keeping track of the context you need to decide which keys are allowed.Hypocoristic
@Hypocoristic Ah ok. I think I'll stick with a more general approach. This is rather easy to implement and only affects subtrees of the data structure with duplicate keys.Legitimacy
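
The suggestion to post-process the loaded data, keeping duplicates only for specific keys, can be sketched in plain Python. The sketch assumes the data was loaded with the first construct_yaml_map from the answer above, so every mapping arrives as a list of `(key, value)` tuples; the function name `collapse` and the `keep_duplicates` parameter are made up for this example.

```python
def collapse(data, keep_duplicates=("build",)):
    """Turn lists of (key, value) tuples back into dicts.

    Keys named in `keep_duplicates` collect all their values in a list;
    for any other key, a later value overwrites an earlier one.
    """
    # Assumption: only mappings produce tuples. An empty list is left alone,
    # because an empty mapping and an empty sequence look the same here.
    is_mapping = (
        isinstance(data, list)
        and len(data) > 0
        and all(isinstance(item, tuple) for item in data)
    )
    if is_mapping:
        result = {}
        for key, value in data:
            value = collapse(value, keep_duplicates)
            if key in keep_duplicates:
                result.setdefault(key, []).append(value)
            else:
                result[key] = value
        return result
    if isinstance(data, list):  # a genuine YAML sequence
        return [collapse(item, keep_duplicates) for item in data]
    return data  # scalar


loaded = [
    ("build", [("step", "step1")]),
    ("build", [("step", "step2")]),
    ("notify", [("channel", "a"), ("channel", "b")]),
]
print(collapse(loaded))
# {'build': [{'step': 'step1'}, {'step': 'step2'}], 'notify': {'channel': 'b'}}
```

Matching on the key name alone is the simplest form of "context"; a path argument could be threaded through the recursion if the decision has to depend on where in the document the key occurs.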

You can override how PyYAML loads keys. For example, you could use a defaultdict with a list of values for each key:

from collections import defaultdict
import yaml


def parse_preserving_duplicates(src):
    # We deliberately define a fresh class inside the function,
    # because add_constructor is a class method and we don't want to
    # mutate pyyaml classes.
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk the mapping, recording any duplicate keys.

        """
        mapping = defaultdict(list)
        for key_node, value_node in node.value:
            key = loader.construct_object(key_node, deep=deep)
            value = loader.construct_object(value_node, deep=deep)

            mapping[key].append(value)

        return mapping

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)
Pipistrelle answered 26/1, 2018 at 11:41 Comment(1)
For some reason this just returns the src string.Serrell

If you can modify the input data very slightly, you should be able to do this by converting the single YAML-like file into multiple YAML documents. YAML documents can be in the same file if they're separated by --- on a line by itself, and you handily appear to have entries separated by two newlines next to each other:

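import yaml

with open('file.yml', 'r') as f:
    data = f.read()

# Turn each blank line between entries into a YAML document separator.
data = data.replace('\n\n', '\n---\n')

# safe_load_all avoids having to pass an explicit Loader (newer PyYAML versions require one for load_all).
for document in yaml.safe_load_all(data):
    print(document)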

Output:

{'build': {'step': 'step1'}}
{'build': {'step': 'step2'}}
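
The plain `replace` only matches exactly one empty line between entries. A slightly more tolerant variant of the same preprocessing step, sketched with the standard `re` module only, could look like this; the function name `split_into_documents` is made up here. The limitation is unchanged: it only helps when the duplicates are at the top level, and a blank line inside an entry would split that entry as well.

```python
import re


def split_into_documents(text):
    """Replace each run of blank lines with a `---` document separator."""
    # Strip leading/trailing newlines so no empty document is produced.
    body = text.strip("\n")
    # One or more blank (or whitespace-only) lines become a single separator.
    return re.sub(r"\n(?:[ \t]*\n)+", "\n---\n", body) + "\n"


source = "build:\n  step: 'step1'\n\n\nbuild:\n  step: 'step2'\n"
print(split_into_documents(source))
```

The resulting string can then be passed to a multi-document loader as in the answer.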
Mata answered 4/7, 2017 at 12:20 Comment(4)
This approach will only work if the duplicate keys are all in a mapping that is in the top-level. Why the comment # should really use os.path.sep, you are not doing anything with filenames?Hypocoristic
Fair point, I was basing it on the example given. And os.path.sep I blame on lack of caffeine ;)Mata
As a quick fix it seems ok, just have to be aware of the limitations. I used to go for coffee to a place a few doors down from Heffers bookshop when I was around (back in the 80's), can't remember its name though.Hypocoristic
Good tip, but won't work in my case unfortunately as I have duplicates in the subsections. I should have made it clear in the example, sorry!Chanticleer

Here is an alternative implementation based on Anthon's answer and ruamel.yaml. It is rather generic and uses lists for duplicates, while other entries are left unchanged.

from collections import Counter
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = '''
a: 1
b: 2
b: 2
'''

def construct_yaml_map(self, node):
    data = {}
    yield data
    # distinct loop variable names avoid shadowing the `node` argument
    keys = [self.construct_object(key_node, deep=True) for key_node, _ in node.value]
    vals = [self.construct_object(value_node, deep=True) for _, value_node in node.value]
    key_count = Counter(keys)
    for key, val in zip(keys, vals):
        if key_count[key] > 1:
            if key not in data:
                data[key] = []
            data[key].append(val)
        else:
            data[key] = val

SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

Output:

{'a': 1, 'b': [2, 2]}

The same is possible with the pyyaml package (inspired by Wilfred Hughes' answer):

from collections import Counter
import yaml

yaml_str = '''
a: 1
b: 2
b: 2
'''

def parse_preserving_duplicates(src):
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        # distinct loop variable names avoid shadowing the `node` argument
        keys = [loader.construct_object(key_node, deep=deep) for key_node, _ in node.value]
        vals = [loader.construct_object(value_node, deep=deep) for _, value_node in node.value]
        key_count = Counter(keys)
        data = {}
        for key, val in zip(keys, vals):
            if key_count[key] > 1:
                if key not in data:
                    data[key] = []
                data[key].append(val)
            else:
                data[key] = val
        return data

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)

print(parse_preserving_duplicates(yaml_str))

Output:

{'a': 1, 'b': [2, 2]}
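
One thing to keep in mind with both variants: a key that occurs twice with the value 2 and a key that occurs once with the value [2, 2] produce the same result. If that distinction matters, one option is to collect duplicates in a dedicated list subclass. The sketch below shows only the grouping step on already-constructed key/value pairs, without any YAML library; `DuplicateValues` and `group_pairs` are hypothetical names.

```python
from collections import Counter


class DuplicateValues(list):
    """Marker type for values that came from a repeated key (hypothetical name)."""


def group_pairs(pairs):
    """Group (key, value) pairs, wrapping values of repeated keys in DuplicateValues."""
    key_count = Counter(key for key, _ in pairs)
    data = {}
    for key, val in pairs:
        if key_count[key] > 1:
            data.setdefault(key, DuplicateValues()).append(val)
        else:
            data[key] = val
    return data


repeated = group_pairs([("a", 1), ("b", 2), ("b", 2)])
single = group_pairs([("a", 1), ("b", [2, 2])])

print(repeated == single)                          # True: the contents compare equal
print(isinstance(repeated["b"], DuplicateValues))  # True
print(isinstance(single["b"], DuplicateValues))    # False
```

In either loader above, the equivalent change would be to create `DuplicateValues()` where the code currently creates `[]` for a repeated key.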
Legitimacy answered 5/4, 2022 at 11:35 Comment(0)
