Parsing YAML, return with line number

4

33

I'm making a document generator from YAML data, and the output needs to record which line of the YAML file each item was generated from. What is the best way to do this? For example, if the YAML file looks like this:

- key1: item 1
  key2: item 2
- key1: another item 1
  key2: another item 2

I want something like this:

[
     {'__line__': 1, 'key1': 'item 1', 'key2': 'item 2'},
     {'__line__': 3, 'key1': 'another item 1', 'key2': 'another item 2'},
]

I'm currently using PyYAML, but any other library is OK if I can use it from Python.

Jorge answered 10/11, 2012 at 3:51 Comment(1)
For further inspiration, here's my code for this. It contains more information than requested above, as it reports the location using start_mark and end_mark on each dict/list/unicode (using dict_node, list_node and unicode_node subclasses, respectively). gist.github.com/dagss/5008118 - Diagenesis
23

Here's an improved version of puzzlet's answer:

import yaml
from yaml.loader import SafeLoader

class SafeLineLoader(SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = super(SafeLineLoader, self).construct_mapping(node, deep=deep)
        # Add 1 so line numbering starts at 1
        mapping['__line__'] = node.start_mark.line + 1
        return mapping

You can use it like this:

data = yaml.load(whatever, Loader=SafeLineLoader)
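As a minimal end-to-end sketch, here is the same loader applied to the YAML from the question. The variable names document and data are only illustrative, and super() is written in its Python 3 form.

```python
import yaml
from yaml.loader import SafeLoader


class SafeLineLoader(SafeLoader):
    def construct_mapping(self, node, deep=False):
        mapping = super().construct_mapping(node, deep=deep)
        # start_mark.line is 0-based, so add 1 for human-friendly numbering
        mapping['__line__'] = node.start_mark.line + 1
        return mapping


# The example document from the question
document = """\
- key1: item 1
  key2: item 2
- key1: another item 1
  key2: another item 2
"""

data = yaml.load(document, Loader=SafeLineLoader)
for entry in data:
    print(entry['__line__'], entry['key1'])
```

Every mapping gets a __line__ key, including nested ones, so the key name has to be one that cannot clash with real keys in your data.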
Fokine answered 6/12, 2018 at 8:11 Comment(8)
You fail to mention that, on uncontrolled YAML input, this can get your disk wiped (or worse). With the OP's example input there is no need to play it unsafe and you could just subclass SafeLoader instead of Loader. I also fail to see how this would address getting the line number of the sequence in the OP's (or any other) YAML document. - Semidiurnal
@Semidiurnal In the current version of PyYAML, Loader is the same as SafeLoader. - Fokine
@Semidiurnal This is an enhancement of puzzlet's answer with the same behavior. It adds a key __line__ to each mapping in the YAML structure, with the starting line of that mapping node as its value. - Fokine
Which version of PyYAML are you using? PyYAML 4.0 finally had that security hole fixed, but that version was retracted half a year ago. In the latest version of PyYAML on PyPI (3.13), Loader uses the unsafe Constructor and SafeLoader uses SafeConstructor (loader.py, lines 38 and 28 respectively). - Semidiurnal
@Semidiurnal Hm, I was looking at the latest code on their develop branch. I'll change to SafeLoader just to be clear. - Fokine
I get the line index starting from 1 in pyyaml 5.3.1. - Freedwoman
Does anyone know a good, robust way to modify this to only attach __line__ to the top-most level of the YAML? The best I could come up with is if node.start_mark.column == 2: mapping["__line__"] = node.start_mark.line + 1, but it feels hacky to do it based on the column number, and it could fail if the YAML is formatted differently. - Plasticizer
@V.Rubinetti Since the constructor performs a depth-first traversal, you could add an attribute to track the current depth and override construct_object() to increment/decrement it appropriately. You'd need some extra logic to handle anchors correctly, if needed for your use case. - Fokine
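A rough sketch of the depth-tracking idea from the last comment, under a few assumptions. The names TopLevelLineLoader, MAX_LINE_DEPTH and _depth are made up for illustration and are not part of PyYAML. PyYAML normally defers building the contents of mappings and sequences, so this sketch forces deep construction to keep the counter in step with the nesting. As a side effect, documents with self-referencing anchors would fail to load.

```python
import yaml
from yaml.loader import SafeLoader


class TopLevelLineLoader(SafeLoader):
    # Hypothetical attributes, not part of PyYAML.
    # Depth 1 is the document root, depth 2 its direct children.
    MAX_LINE_DEPTH = 2
    _depth = 0

    def construct_object(self, node, deep=False):
        self._depth += 1
        try:
            # Force deep construction so children are built inside this
            # call rather than deferred until after it has returned.
            return super().construct_object(node, deep=True)
        finally:
            self._depth -= 1

    def construct_mapping(self, node, deep=False):
        mapping = super().construct_mapping(node, deep=deep)
        # Only annotate mappings at or above the chosen depth
        if self._depth <= self.MAX_LINE_DEPTH:
            mapping['__line__'] = node.start_mark.line + 1
        return mapping


document = """\
- key1: item 1
  nested:
    inner: value
- key1: another item 1
"""

data = yaml.load(document, Loader=TopLevelLineLoader)
print(data)
```

With MAX_LINE_DEPTH = 2 the mappings directly inside the top-level list are annotated and the nested mapping is left alone. For a document whose root is itself a mapping, 1 would be the value to use.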
14

I got it working by adding hooks to Composer.compose_node and Constructor.construct_mapping:

import yaml
from yaml.composer import Composer
from yaml.constructor import Constructor

def main():
    # Read the file inside a with block so the handle is closed promptly
    with open('data.yml') as f:
        loader = yaml.Loader(f.read())
    def compose_node(parent, index):
        # the line number where the previous token has ended (plus empty lines)
        line = loader.line
        node = Composer.compose_node(loader, parent, index)
        node.__line__ = line + 1
        return node
    def construct_mapping(node, deep=False):
        mapping = Constructor.construct_mapping(loader, node, deep=deep)
        mapping['__line__'] = node.__line__
        return mapping
    # Replace the bound methods on this one loader instance
    loader.compose_node = compose_node
    loader.construct_mapping = construct_mapping
    data = loader.get_single_data()
    print(data)

if __name__ == '__main__':
    main()
Jorge answered 10/11, 2012 at 3:51 Comment(2)
Thanks - this worked perfectly and is very useful when it comes to error reporting. - Mcdougal
Since Sep 2006, the recommended extension for YAML files has been .yaml. - Semidiurnal
8

If you are using ruamel.yaml >= 0.9 (of which I am the author) and use the RoundTripLoader, you can access the lc property on collection items to get the line and column where they start in the source YAML:

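import textwrap

import ruamel.yaml

# YAML() (ruamel.yaml >= 0.15) defaults to round-trip loading; on older
# versions use ruamel.yaml.load(..., Loader=ruamel.yaml.RoundTripLoader)
yaml = ruamel.yaml.YAML()
data = yaml.load(textwrap.dedent("""\
    # testing line and column based on SO
    # http://stackoverflow.com/questions/13319067/
    - key1: item 1
      key2: item 2
    - key3: another item 1
      key4: another item 2
    """))
assert data[0].lc.line == 2
assert data[0].lc.col == 2
assert data[1].lc.line == 4
assert data[1].lc.col == 2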

(line and column start counting at 0).

This answer shows how to add the lc attribute to string types during loading.

Semidiurnal answered 18/4, 2015 at 17:52 Comment(3)
Couldn't find a way to make this work if the list is inside an ordered map, as in key1: !!omap\n - key4: item2\n - key3: item3, where it's not possible to access the line numbers of key4 and key3. - Descender
@Descender An ordered map doesn't get loaded into a CommentedMap structure by default and therefore doesn't have the lc attribute. You would have to register the !omap loading as a subclass of CommentedMap. That is doable, but more than I can answer in a comment. You should post a new question if you cannot figure out how to do that. - Semidiurnal
Indeed I cannot figure this out. I've only found a "dirty" workaround to get the line numbers. Question asked here. - Descender
7

The following code is based on the previous answers. If you also need the line numbers of leaf attributes, it may help:

from yaml.composer import Composer
from yaml.constructor import Constructor
from yaml.nodes import ScalarNode
from yaml.resolver import BaseResolver
from yaml.loader import Loader


class LineLoader(Loader):
    def __init__(self, stream):
        super(LineLoader, self).__init__(stream)

    def compose_node(self, parent, index):
        # the line number where the previous token has ended (plus empty lines)
        line = self.line
        node = Composer.compose_node(self, parent, index)
        node.__line__ = line + 1
        return node

    def construct_mapping(self, node, deep=False):
        node_pair_lst = node.value
        node_pair_lst_for_appending = []

        for key_node, value_node in node_pair_lst:
            shadow_key_node = ScalarNode(tag=BaseResolver.DEFAULT_SCALAR_TAG, value='__line__' + key_node.value)
            shadow_value_node = ScalarNode(tag=BaseResolver.DEFAULT_SCALAR_TAG, value=key_node.__line__)
            node_pair_lst_for_appending.append((shadow_key_node, shadow_value_node))

        node.value = node_pair_lst + node_pair_lst_for_appending
        mapping = Constructor.construct_mapping(self, node, deep=deep)
        return mapping


if __name__ == '__main__':
    stream = """             # The first line
    key1:                    # This is the second line
      key1_1: item1
      key1_2: item1_2
      key1_3:
        - item1_3_1
        - item1_3_2
    key2: item 2
    key3: another item 1
    """
    loader = LineLoader(stream)
    data = loader.get_single_data()

    from pprint import pprint

    pprint(data)

The output is as follows. Each key gets a sibling entry at the same level whose name is the key prefixed with "__line__" (for example "__line__key1").

PS: I have not yet found a way to show the line for list items.

{'__line__key1': 2,
 '__line__key2': 8,
 '__line__key3': 9,
 'key1': {'__line__key1_1': 3,
          '__line__key1_2': 4,
          '__line__key1_3': 5,
          'key1_1': 'item1',
          'key1_2': 'item1_2',
          'key1_3': ['item1_3_1', 'item1_3_2']},
 'key2': 'item 2',
 'key3': 'another item 1'}
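For the list items, one possible approach (a different technique from the loader above, and only a sketch) is to walk the node tree returned by yaml.compose and read each item's start_mark. The helper name item_lines and the path tuples it yields are illustrative, and it assumes all mapping keys are plain scalars.

```python
import yaml
from yaml.nodes import MappingNode, SequenceNode


def item_lines(node, path=()):
    """Yield (path, 1-based line) for every sequence item below node."""
    if isinstance(node, SequenceNode):
        for index, child in enumerate(node.value):
            child_path = path + (index,)
            yield child_path, child.start_mark.line + 1
            # Recurse so items of nested lists are reported as well
            yield from item_lines(child, child_path)
    elif isinstance(node, MappingNode):
        for key_node, value_node in node.value:
            # Assumes scalar keys, so key_node.value is a string
            yield from item_lines(value_node, path + (key_node.value,))


document = """\
key1:
  key1_3:
    - item1_3_1
    - item1_3_2
key2: item 2
"""

root = yaml.compose(document, Loader=yaml.SafeLoader)
lines = dict(item_lines(root))
print(lines)
```

Because this works on nodes rather than on constructed Python objects, the line numbers are kept in a separate dictionary keyed by path instead of being mixed into the loaded data.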
Tenia answered 12/1, 2022 at 9:28 Comment(2)
Thank you for this. Does the self.compose_node = self.compose_node in __init__ actually do anything here? When I test the snippet, it appears to work the same even without the __init__ function. - Museum
Thanks, I've already removed it. - Tenia
