Best way to use ruamel.yaml to dump YAML to string (NOT to stream)

In the past, I did something like some_fancy_printing_logging_func(yaml.dump(...), ...), using the backward-compatible part of ruamel.yaml, but I want to convert my code to use the latest API so that I can take advantage of some of the new formatting settings.

However, I hate that I have to specify a stream to ruamel.yaml.YAML.dump() ... I don't want it to write directly to a stream; I just want it to return the output to the caller. What am I missing?

PS: I know I can do something like the following, though of course I'm trying to avoid it.

f = io.StringIO()
yml.dump(myobj, f)
f.seek(0)
my_logging_func(f.read())
Burin answered 3/12, 2017 at 3:5 Comment(0)

I am not sure you are really missing anything. If you are, it might be that if you're working with streams you should, preferably, continue to work with streams. That is, however, something many users of ruamel.yaml and PyYAML seem to miss, and therefore they do:

print(dump(data))

instead of

dump(data, sys.stdout)

The former might be fine for non-realistic data used in the (PyYAML) documentation, but it leads to bad habits for real data.

The best solution is to make your my_logging_func() stream oriented. This can e.g. be done as follows:

import sys
import ruamel.yaml

data = dict(user='rsaw', question=47614862)

class MyLogger:
    def write(self, s):
        sys.stdout.write(s.decode('utf-8'))

my_logging_func = MyLogger()
yml = ruamel.yaml.YAML()
yml.dump(data, my_logging_func)

which gives:

user: rsaw
question: 47614862

but note that MyLogger.write() gets called multiple times (in this case eight times), and if you need to work on a line at a time, you have to do line buffering.
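
A minimal sketch of such line buffering, assuming the logger is given a callback that handles one line at a time. The names LineBufferedLogger, emit_line and close() are hypothetical and not part of ruamel.yaml. The loop only simulates the repeated write() calls; in real use you would pass the object to yml.dump() as above:

```python
class LineBufferedLogger:
    """File-like object that collects write() chunks and emits whole lines."""

    def __init__(self, emit_line):
        # emit_line is a hypothetical callback, e.g. print or logging.info
        self._emit_line = emit_line
        self._pending = ""

    def write(self, chunk):
        # Chunks may arrive as bytes (as in MyLogger above) or as str.
        if isinstance(chunk, bytes):
            chunk = chunk.decode("utf-8")
        self._pending += chunk
        # Everything before the last newline is complete; keep the rest.
        *complete, self._pending = self._pending.split("\n")
        for line in complete:
            self._emit_line(line)

    def close(self):
        # Emit a trailing line that was not terminated by a newline.
        if self._pending:
            self._emit_line(self._pending)
            self._pending = ""


lines = []
logger = LineBufferedLogger(lines.append)
# Simulated chunks; a real run would be: yml.dump(data, logger)
for chunk in (b"user", b":", b" rsaw", b"\n", b"question: 476", b"14862\n"):
    logger.write(chunk)
logger.close()
print(lines)  # ['user: rsaw', 'question: 47614862']
```

The final method is called close() rather than flush() on purpose, so that a library flushing the stream mid-document cannot trigger the emission of a partial line.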

If you really need to process your YAML as bytes or str, you can install the appropriate plugin (ruamel.yaml.bytes or ruamel.yaml.string, respectively) and do:

yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data  = dict(abc=42, help=['on', 'its', 'way'])
print('retval', yaml.dump_to_string(data))

You can then process the result of yaml.dump_to_string(data), or of its equivalent yaml.dumps(data), as you see fit. Replacing string with bytes in the above skips decoding the UTF-8 stream back to str and keeps the result as bytes.

Recognizance answered 3/12, 2017 at 10:9 Comment(2)
You're my hero Anthon. Thank you for that detailed explanation and thank you for all your seriously conscientious dedicated work on ruamel.yaml. - Burin
One problem with ruamel.yaml.string is that static type checkers like Mypy or Pyright will complain about YAML not having a method named dump_to_string or dumps. I think methods added by plugins are difficult/impossible to represent in static types. - Marella

I needed this functionality so frequently that I put this answer (a small wrapper around ruamel.yaml) into a pip module.

TLDR

pip install ez_yaml

import ez_yaml

ez_yaml.to_string(obj=your_object    , options={})

ez_yaml.to_object(file_path=your_path, options={})
ez_yaml.to_object(string=your_string , options={})

ez_yaml.to_file(your_object, file_path=your_path)

Hacky / Copy-Paste Solution to Original Question

def object_to_yaml_str(obj, options=None):
    # 
    # setup yaml part (customize this, probably move it outside this def)
    # 
    import ruamel.yaml
    yaml = ruamel.yaml.YAML()
    yaml.version = (1, 2)
    yaml.indent(mapping=3, sequence=2, offset=0)
    yaml.allow_duplicate_keys = True
    # show null
    def my_represent_none(self, data):
        return self.represent_scalar(u'tag:yaml.org,2002:null', u'null')
    yaml.representer.add_representer(type(None), my_represent_none)
    
    # 
    # the to-string part
    # 
    if options is None: options = {}
    from io import StringIO
    string_stream = StringIO()
    yaml.dump(obj, string_stream, **options)
    output_str = string_stream.getvalue()
    string_stream.close()
    return output_str

Original Answer (if you want to customize the config/options more)

import ruamel.yaml
from io import StringIO
from pathlib import Path

# setup loader (basically options)
yaml = ruamel.yaml.YAML()
yaml.version = (1, 2)
yaml.indent(mapping=3, sequence=2, offset=0)
yaml.allow_duplicate_keys = True
yaml.explicit_start = False
# show null
def my_represent_none(self, data):
    return self.represent_scalar(u'tag:yaml.org,2002:null', u'null')
yaml.representer.add_representer(type(None), my_represent_none)

# o->s
def object_to_yaml_str(obj, options=None):
    if options is None: options = {}
    string_stream = StringIO()
    yaml.dump(obj, string_stream, **options)
    output_str = string_stream.getvalue()
    string_stream.close()
    return output_str

# s->o
def yaml_string_to_object(string, options=None):
    if options is None: options = {}
    return yaml.load(string, **options)

# f->o
def yaml_file_to_object(file_path, options=None):
    if options is None: options = {}
    as_path_object = Path(file_path)
    return yaml.load(as_path_object, **options)

# o->f
def object_to_yaml_file(obj, file_path, options=None):
    if options is None: options = {}
    as_path_object = Path(file_path)
    with as_path_object.open('w') as output_file:
        return yaml.dump(obj, output_file, **options)

# 
# string examples
# 
yaml_string = object_to_yaml_str({ (1,2): "hi" })
print("yaml string:", yaml_string)
obj = yaml_string_to_object(yaml_string)
print("obj from string:", obj)

# 
# file examples
# 
obj = yaml_file_to_object("./thingy.yaml")
print("obj from file:", obj)
object_to_yaml_file(obj, file_path="./thingy2.yaml")
print("saved that to a file")

Rant

I appreciate Mike Night solving the original "I just want it to return the output to the caller" and calling out that Anthon's post fails to answer the question. I will go further: Anthon, your module is great; round-tripping is impressive, and yours is one of the few implementations ever made. But (and this happens often on Stack Overflow) it is not the job of the author to make other people's code runtime-efficient. Explicit tradeoffs are great, and an author should help people understand the consequences of their choices. Adding a warning, including "slow" in the name, etc. can be very helpful. However, the method in the ruamel.yaml documentation, creating an entire inherited class, is not "explicit". It is encumbering and obfuscating, making it difficult to use and time-consuming for others to understand what that additional code is and why it exists.

As for performance, the runtime of my program, without YAML, is 2 weeks. A 500,000-line YAML file is read in seconds. Both the 2 weeks and the few seconds are irrelevant to the project because they are CPU time and the project is billed purely by dev-hours. Many users rightfully care about dev time more than runtime; we are using Python, after all.

Even assuming runtime is critical, the YAML code was already a string object because of other operations being performed on it. Forcing it into a stream is actually causing more overhead. Removing the need for the string form of the YAML would involve rewriting several major libraries and potentially months of effort, making streams a highly impractical choice in this situation.

Even assuming stream input is possible and billing is by CPU time, optimizing the one-time read of a 500,000-line YAML file would be a ≤0.001% runtime improvement. The extra hour I spent figuring out the answer to this question, and the time spent by others trying to understand the point of my boilerplate code, could have instead been spent on one of the C functions that is being called 100 times a second for two weeks. Even when we do care about CPU time, the optimized method can still fail to be the best choice.

A Stack Overflow post that ignores the question while also suggesting users sink potentially large amounts of time into rewriting their applications is not an answer. Respect others by assuming they generally know what they are doing and are aware of the alternatives. Then offers of potentially more efficient methods will be met with appreciation rather than rejection.

[end rant]

Natatorial answered 30/7, 2020 at 19:18 Comment(6)
This is a useful example but note that your response suffers a common Gotcha, mutable defaults. You should update that. - Incorporation
Oh you're right @JustinWinokur 👍 thanks. Example has been updated. I should've known better too, as I've run into mutable defaults before. (However, I did not know about Late Binding Closures, so thank you for the link) - Natatorial
@Greg I think you have good intentions, but editing my answer to say things like "I created and published to PyPI" feels like impersonation; putting words and phrases in my mouth that I never said, and would have worded and ordered differently. Essential information (like the module only being a ruamel.yaml wrapper, not an alternative) was removed. Next time please just ask for the desired changes, I'm happy to update it. Making it look like I said something I didn't just makes me want to leave the platform entirely. - Natatorial
(@Greg and sorry this is a public comment, I don't know and don't clearly see another way to get in contact) - Natatorial
@JeffHykin: I am sorry, you are right that I did cross a line between editing and rewriting. I hope that it will make you feel better if I say that I did it because I admire your work on this problem very much and I thought that it's worth my small effort to make it even better. - Choong
Upvoted for the rant. - Flyspeck

There is always a case where something unexpected is required (even if that contradicts best practices under usual circumstances). Here is an example:

In this case, I need yaml as a string. No, using files instead of string does not cut it here because I will create this input_yaml multiple times as I need to do this pypandoc conversion multiple times. Creating individual files would have been much more messy!

input_yaml = """
---
bibliography: testing.bib
citation-style: ieee-with-url.csl
nocite: |

 @*
...
"""

output = pypandoc.convert_text(input_yaml, to='markdown_strict', format='md', filters=filters)

Just because of this, I had to go back to PyYAML. It allows me to

yaml_args = {'bibliography':'testing.bib', 'citation-style':'ieee-with-url.csl'}

test = yaml.dump(yaml_args, default_flow_style=False)
test = "---\n"+ test + "nocite: | \n\n @* \n...\n"
output = pypandoc.convert_text(test, to='markdown_strict', format='md', filters=filters)

Clumsy but best I could find under the circumstances.

Hostile answered 8/7, 2019 at 8:59 Comment(3)
I needed it to format/parse some command line arguments. Switched to json as a result. - Acro
Also when writing tests, it's just much easier to compare string outputs than write to files .... - Commissariat
Yes, this. In my case I want to dump it to yaml to go into jinja2. There may be a way to do this with streams, but that's more advanced than the level I'm currently working at. - Reachmedown
