How can I update a .yml file, ignoring preexisting Jinja syntax, using Python?

I have some preprocessing to do with some existing .yml files - however, some of them have Jinja template syntax embedded in them:

A:
 B:
 - ip: 1.2.3.4
 - myArray:
   - {{ jinja.variable }}
   - val1
   - val2

I'd want to read in this file, and add val3 under myArray as such:

A:
 B:
 - ip: 1.2.3.4
 - myArray:
   - {{ jinja.variable }}
   - val1
   - val2
   - val 3

I tried manually writing out the jinja templates, but they got written with single quotes around them: '{{ jinja.variable }}'

What's the recommended way to read and modify such .yml files despite the preexisting Jinja syntax? I'd like to add information to these files while keeping everything else the same.

I tried the above using PyYAML on Python 2.7+

Cheremkhovo answered 7/6, 2017 at 20:36 Comment(2)
Can you provide your code, please? - Autonomic
@IlyaV.Schurov - from here: #1774305 - Cheremkhovo

The solution in this answer has been incorporated into ruamel.yaml using a plugin mechanism. At the bottom of this post there are quick-and-dirty instructions on how to use that.

There are three aspects in updating a YAML file that contains jinja2 "code":

  • making the jinja2 code acceptable to the YAML parser
  • making sure the change can be reversed (i.e. the changes should be unique, so only they get reversed)
  • preserving the layout of the YAML file so that the updated file processed by jinja2 still produces a valid YAML file, that again can be loaded.

Let's start by making your example somewhat more realistic by adding a jinja2 variable definition and for-loop and adding some comments (input.yaml):

# trying to update
{% set xyz = "123" }

A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja.variable }}
    - val1
    - val2         # add a value after this one
    {% for d in data %}
    - phone: {{ d.phone }}
      name: {{ d.name }}
    {% endfor %}
    - {{ xyz }}
# #% or ##% should not be in the file and neither <{ or <<{

The lines starting with {% contain no YAML, so we'll make those into comments (assuming that comments are preserved on round-trip, see below). Since YAML scalars cannot start with { without being quoted, we'll change the {{ to <{. This is done in the following code by calling sanitize() (which also stores the patterns used); the reverse is done in sanitize.revert() (using the stored patterns).

The preservation of your YAML code (block-style etc.) is best done using ruamel.yaml (disclaimer: I am the author of that package). That way you don't have to worry about flow-style elements in the input getting mangled into block style, as happens with the rather crude default_flow_style=False that the other answers use. ruamel.yaml also preserves comments, both the ones that were originally in the file and those temporarily inserted to "comment out" jinja2 constructs starting with {%.

The resulting code:

import sys
from ruamel.yaml import YAML

yaml = YAML()

class Sanitize:
    """analyse, change and revert YAML/jinja2 mixture to/from valid YAML"""
    def __init__(self):
        self.accacc = None
        self.accper = None

    def __call__(self, s):
        # find the shortest '<{', '<<{', ... pattern that is not yet in the input
        for n in range(1, 10):
            pat = '<' * n + '{'
            if pat not in s:
                self.accacc = pat
                break
        else:
            raise NotImplementedError('could not find substitute pattern '+pat)
        # same for '#%', '##%', ..., which turns jinja2 block lines into comments
        for n in range(1, 10):
            pat = '#' * n + '%'
            if pat not in s:
                self.accper = pat
                break
        else:
            raise NotImplementedError('could not find substitute pattern '+pat)
        return s.replace('{{', self.accacc).replace('{%', self.accper)

    def revert(self, s):
        return s.replace(self.accacc, '{{').replace(self.accper, '{%')


def update_one(file_name, out_file_name=None):

    sanitize = Sanitize()

    with open(file_name) as fp:
        data = yaml.load(sanitize(fp.read()))
    myArray = data['A']['B'][1]['myArray']
    pos = myArray.index('val2')
    myArray.insert(pos+1, 'val 3')
    if out_file_name is None:
        yaml.dump(data, sys.stdout, transform=sanitize.revert)
    else:
        with open(out_file_name, 'w') as fp:
            yaml.dump(data, fp, transform=sanitize.revert)

update_one('input.yaml')

which prints (specify a second parameter to update_one() to write to a file) using Python 2.7:

# trying to update
{% set xyz = "123" }

A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja.variable }}
    - val1
    - val2         # add a value after this one
    - val 3
    {% for d in data %}
    - phone: {{ d.phone }}
      name: {{ d.name }}
    {% endfor %}
    - {{ xyz }}
# #% or ##% should not be in the file and neither <{ or <<{

If neither #% nor <{ occurs in any of the original inputs, then sanitizing and reverting can be done with simple one-line functions (see the earlier versions of this post), and you don't need the class Sanitize.
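As a rough sketch of what such one-line functions could look like (the names sanitize and revert are chosen here for illustration, and the fixed replacement patterns are an assumption that only holds when the input contains neither of them):

```python
def sanitize(s):
    """Make the jinja2 delimiters acceptable to a YAML parser."""
    # '{{' -> '<{' lets the scalar load as a plain string;
    # '{%' -> '#%' turns a jinja2 block line into a YAML comment.
    return s.replace('{{', '<{').replace('{%', '#%')


def revert(s):
    """Undo sanitize(); only safe if '<{' and '#%' were absent originally."""
    return s.replace('<{', '{{').replace('#%', '{%')


text = (
    "- {{ jinja.variable }}\n"
    "{% for d in data %}\n"
    "- name: {{ d.name }}\n"
    "{% endfor %}\n"
)

sanitized = sanitize(text)
print(sanitized)
print(revert(sanitized) == text)
```

The round trip only restores the original text because the two substitute patterns do not occur in it; that is exactly the case the Sanitize class guards against by searching for an unused pattern.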

Your example is indented with one position (key B) as well as two positions (the sequence elements); ruamel.yaml doesn't have that fine control over output indentation (and I don't know of any YAML parser that does). The indent (defaulting to 2) is applied to both YAML mappings and sequence elements (measured to the beginning of the element, not to the dash). This has no influence on re-reading the YAML, and it also happened to the output of the other two answers (without them pointing out this change).

Also note that YAML().load() is safe (i.e. it doesn't load arbitrary, potentially malicious objects), whereas the yaml.load() used in the other answers is definitely unsafe: it says so in the documentation and is even mentioned in the Wikipedia article on YAML. If you use yaml.load(), you would have to check each and every input file to make sure there are no tagged objects that could cause your disc to be wiped (or worse).

If you need to update your files repeatedly and have control over the jinja2 templating, it might be better to change the patterns for jinja2 once and not revert them, and then specify appropriate block_start_string and variable_start_string (and possibly block_end_string and variable_end_string) on the jinja2.Environment that renders the templates (the one that gets the jinja2.FileSystemLoader as its loader).
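A minimal sketch of that idea, assuming the templates have been permanently rewritten to use '<{' and '#%' as opening delimiters (these delimiter choices, the inline template and the variable names are illustrative assumptions; a real setup would typically also pass a loader such as jinja2.FileSystemLoader):

```python
import jinja2

# The opening delimiters mirror the substitutions discussed above;
# the closing delimiters ('}}' and '%}') stay at jinja2's defaults.
env = jinja2.Environment(
    block_start_string='#%',
    variable_start_string='<{',
    trim_blocks=True,  # drop the newline that follows a block tag
)

# A template that is valid YAML as-is: block lines are YAML comments
# and the variables are plain scalars.
template_text = (
    "myArray:\n"
    "- <{ first }}\n"
    "#% for item in items %}\n"
    "- <{ item }}\n"
    "#% endfor %}\n"
)

rendered = env.from_string(template_text).render(
    first='val0', items=['val1', 'val2'])
print(rendered)
```

With this arrangement the template files can be loaded, updated and dumped as YAML directly, without a sanitize/revert step around every update.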


If the above seems too complicated, then in a virtualenv do:

pip install ruamel.yaml ruamel.yaml.jinja2

Assuming you have the input.yaml from before, you can run:

import os
from ruamel.yaml import YAML


yaml = YAML(typ='jinja2')

with open('input.yaml') as fp:
    data = yaml.load(fp)

myArray = data['A']['B'][1]['myArray']
pos = myArray.index('val2')
myArray.insert(pos+1, 'val 3')

with open('output.yaml', 'w') as fp:
    yaml.dump(data, fp)

os.system('diff -u input.yaml output.yaml')

to get the diff output:

--- input.yaml  2017-06-14 23:10:46.144710495 +0200
+++ output.yaml 2017-06-14 23:11:21.627742055 +0200
@@ -8,6 +8,7 @@
     - {{ jinja.variable }}
     - val1
     - val2         # add a value after this one
+    - val 3
     {% for d in data %}
     - phone: {{ d.phone }}
       name: {{ d.name }}

ruamel.yaml 0.15.7 implements a new plug-in mechanism and ruamel.yaml.jinja2 is a plug-in that rewraps the code in this answer transparently for the user. Currently the information for reversion is attached to the YAML() instance, so make sure you do yaml = YAML(typ='jinja2') for each file you process (that information could be attached to the top-level data instance, just like the YAML comments are).

Geniculate answered 13/6, 2017 at 8:8 Comment(7)
The above requires ruamel.yaml>=0.15.1. You can also use older versions of ruamel.yaml to do the above, then you would need to use round_trip_load()/round_trip_dump(), at the cost of adding some lines of code. - Geniculate
+1 for recommending changing block_start_string etc - I was thinking about adding that as a separate answer, as even though it's outside the scope of the question, it's a much better long term solution if these updates are frequent. - Annecy
@Annecy I have given a complete example of that elsewhere. But it requires having tight control over the jinja2 rendering. If, for the OP, that is hidden in the bowels of some web framework, that might not be so easy. - Geniculate
@Geniculate - this is really good - however, I don't see val3 in the output. Did the code work or was accidentally left out in a copy/paste? Also, your code is dependent on searching for val2 however, that may not be the case. I can never know what values are in myArray... - Cheremkhovo
@Cheremkhovo My bad, as you can see from the edit history I originally had a different comment in the input than what I was using later on. When I realised that, I copied the new input into the answer... but at the wrong place (they do look very similar ;-) ) and a few edits later I updated the input for real now, never realising the output had the val3 deleted. I reran the program: the val3 really gets added at the right place. - Geniculate
@Cheremkhovo I realised that this kind of pre- and post-processing would be an ideal candidate for a plug-in mechanism for ruamel.yaml. So I implemented that and the plug-in. I updated the answer with how to install and use it. - Geniculate
@Geniculate I have used your example but it is producing this output: --- !!python/object/apply:ruamel.yaml.comments.CommentedMap dictitems: about: !!python/object/apply:ruamel.yaml.comments.CommentedMap dictitems: {home: 'https://github.com/soedinglab/xxmotif', license: GPLv3, license_file: LICENSE, summary: 'eXhaustive, weight matriX-based motif discovery in nucleotide sequences'} - Duque

In their current format, your .yml files are jinja templates which will not be valid yaml until they have been rendered. This is because the jinja placeholder syntax conflicts with yaml syntax, as braces ({ and }) can be used to represent mappings in yaml.

>>> yaml.load('foo: {{ bar }}')
Traceback (most recent call last):
...
yaml.constructor.ConstructorError: while constructing a mapping
  in "<string>", line 1, column 6:
    foo: {{ bar }}
     ^
found unacceptable key (unhashable type: 'dict')
  in "<string>", line 1, column 7:
    foo: {{ bar }}

One way to work around this is to replace the jinja placeholders with something else, process the file as yaml, then reinstate the placeholders.

$ cat test.yml
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja_variable }}
    - val1
    - val2

Open the file as a text file

>>> with open('test.yml') as f:
...     text = f.read()
... 
>>> print text
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja_variable }}
    - val1
    - val2

The regular expression r'{{\s*(?P<jinja>[a-zA-Z_][a-zA-Z0-9_]*)\s*}}' will match any jinja placeholders in the text; the named group jinja in the expression captures the variable name. The regular expression is the same as that used by Jinja2 to match variable names.

The re.sub function can reference named groups in its replacement string using the \g syntax. We can use this feature to replace the jinja syntax with something that does not conflict with yaml syntax, and does not already appear in the files that you are processing. For example replace {{ ... }} with << ... >>.

>>> import re
>>> yml_text = re.sub(r'{{\s*(?P<jinja>[a-zA-Z_][a-zA-Z0-9_]*)\s*}}', r'<<\g<jinja>>>', text)
>>> print yml_text
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - <<jinja_variable>>
    - val1
    - val2
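One thing worth checking before relying on this pattern is which placeholders it actually covers. The following sketch (the sample strings are made up for illustration) applies the same substitution to a few forms; placeholders that are more than a bare name, such as the dotted {{ jinja.variable }} from the question or an expression with a filter, are left untouched and would need a broader pattern:

```python
import re

# Same pattern as above: a bare variable name between '{{' and '}}'.
PLACEHOLDER = re.compile(r'{{\s*(?P<jinja>[a-zA-Z_][a-zA-Z0-9_]*)\s*}}')

samples = [
    '{{ jinja_variable }}',   # bare name with spaces
    '{{jinja_variable}}',     # bare name without spaces
    '{{ jinja.variable }}',   # attribute access: not a bare name
    '{{ name | upper }}',     # filter expression: not a bare name
]

# Map each sample to the result of the substitution.
results = {s: PLACEHOLDER.sub(r'<<\g<jinja>>>', s) for s in samples}

for original, replaced in results.items():
    status = 'replaced' if replaced != original else 'unchanged'
    print('{:<24} -> {:<24} ({})'.format(original, replaced, status))
```

Anything reported as unchanged would still reach the yaml parser with its braces intact and fail as shown in the traceback earlier.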

Now load the text as yaml:

>>> yml = yaml.load(yml_text)
>>> yml
{'A': {'B': [{'ip': '1.2.3.4'}, {'myArray': ['<<jinja_variable>>', 'val1', 'val2']}]}}

Add the new value:

>>> yml['A']['B'][1]['myArray'].append('val3')
>>> yml
{'A': {'B': [{'ip': '1.2.3.4'}, {'myArray': ['<<jinja_variable>>', 'val1', 'val2', 'val3']}]}}

Serialise back to a yaml string:

>>> new_text = yaml.dump(yml, default_flow_style=False)
>>> print new_text
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - <<jinja_variable>>
    - val1
    - val2
    - val3

Now reinstate the jinja syntax.

>>> new_yml = re.sub(r'<<(?P<placeholder>[a-zA-Z_][a-zA-Z0-9_]*)>>', r'{{ \g<placeholder> }}', new_text)
>>> print new_yml
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja_variable }}
    - val1
    - val2
    - val3

And write the yaml to disk.

>>> with open('test.yml', 'w') as f:
...     f.write(new_yml)
... 

$ cat test.yml
A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ jinja_variable }}
    - val1
    - val2
    - val3
Annecy answered 10/6, 2017 at 7:33 Comment(0)

One way to do this is to use the jinja2 parser itself to parse the template and output an alternate format.

Jinja2 Code:

This code inherits from the Jinja2 Parser, Lexer and Environment classes to parse inside variable blocks (usually {{ }}). Instead of evaluating the variables, this code changes the text to something that yaml can understand. The exact same code can be used to reverse the process with an exchange of the delimiters. By default it translates to the delimiters suggested by snakecharmerb.

import jinja2
import yaml

class MyParser(jinja2.parser.Parser):

    def parse_tuple(self, *args, **kwargs):

        super(MyParser, self).parse_tuple(*args, **kwargs)

        if not isinstance(self.environment._jinja_vars, list):
            node_text = self.environment._jinja_vars
            self.environment._jinja_vars = None
            return jinja2.nodes.Const(
                self.environment.new_variable_start_string +
                node_text +
                self.environment.new_variable_end_string)

class MyLexer(jinja2.lexer.Lexer):

    def __init__(self, *args, **kwargs):
        super(MyLexer, self).__init__(*args, **kwargs)
        self.environment = None

    def tokenize(self, source, name=None, filename=None, state=None):
        stream = self.tokeniter(source, name, filename, state)

        def my_stream(environment):
            for t in stream:
                if environment._jinja_vars is None:
                    if t[1] == 'variable_begin':
                        self.environment._jinja_vars = []
                elif t[1] == 'variable_end':
                    node_text = ''.join(
                        [x[2] for x in self.environment._jinja_vars])
                    self.environment._jinja_vars = node_text
                else:
                    environment._jinja_vars.append(t)
                yield t

        return jinja2.lexer.TokenStream(self.wrap(
            my_stream(self.environment), name, filename), name, filename)

jinja2.lexer.Lexer = MyLexer


class MyEnvironment(jinja2.Environment):

    def __init__(self,
                 new_variable_start_string='<<',
                 new_variable_end_string='>>',
                 reverse=False,
                 *args,
                 **kwargs):
        if kwargs.get('loader') is None:
            kwargs['loader'] = jinja2.BaseLoader()

        super(MyEnvironment, self).__init__(*args, **kwargs)
        self._jinja_vars = None
        if reverse:
            self.new_variable_start_string = self.variable_start_string
            self.new_variable_end_string = self.variable_end_string
            self.variable_start_string = new_variable_start_string
            self.variable_end_string = new_variable_end_string
        else:
            self.new_variable_start_string = new_variable_start_string
            self.new_variable_end_string = new_variable_end_string
        self.lexer.environment = self

    def _parse(self, source, name, filename):
        return MyParser(self, source, name,
                        jinja2._compat.encode_filename(filename)).parse()

How/Why?

The jinja2 parser scans the template file looking for delimiters. When finding delimiters, it then switches to parse the appropriate material between the delimiters. The changes in the code here insert themselves into the lexer and parser to capture the text captured during the template compilation, and then when finding the termination delimiter, concats the parsed tokens into a string and inserts it as a jinja2.nodes.Const parse node, in place of the compiled jinja code, so that when the template is rendered the string is inserted instead of a variable expansion.

The MyEnvironment() code is used to hook in the custom parser and lexer extensions. While at it, I added some parameter processing.

The primary advantage of this approach is that it should be fairly robust to parsing whatever jinja will parse.

User Code:

def dict_from_yaml_template(template_string):
    env = MyEnvironment()
    template = env.from_string(template_string)
    return yaml.load(template.render())

def yaml_template_from_dict(template_yaml, **kwargs):
    env = MyEnvironment(reverse=True)
    template = env.from_string(yaml.dump(template_yaml, **kwargs))
    return template.render()

Test Code:

with open('data.yml') as f:
    data = dict_from_yaml_template(f.read())
data['A']['B'][1]['myArray'].append('val 3')
data['A']['B'][1]['myArray'].append('<< jinja.variable2 >>')
new_yaml = yaml_template_from_dict(data, default_flow_style=False)
print(new_yaml)

data.yml

A:
 B:
 - ip: 1.2.3.4
 - myArray:
   - {{ x['}}'] }}
   - {{ [(1, 2, (3, 4))] }}
   - {{ jinja.variable }}
   - val1
   - val2

Results:

A:
  B:
  - ip: 1.2.3.4
  - myArray:
    - {{ x['}}'] }}
    - {{ [(1, 2, (3, 4))] }}
    - {{ jinja.variable }}
    - val1
    - val2
    - val 3
    - {{ jinja.variable2 }}
Explore answered 12/6, 2017 at 7:9 Comment(2)
Could you add some explanation around the Parser/Lexer/Environment classes? i.e., what/why? - Cheremkhovo
@PhD, tried to add some how/why. Let me know if you have any questions. Cheers. - Explore