How do I achieve sprintf-style formatting for bytes objects in Python 3?

I want to do sprintf on Python 3 but with raw bytes objects, without having to do any manual conversions for the %s to work. So, take a bytes object as a 'template', plus any number of objects of any type, and return a rendered bytes object. This is how Python 2's sprintf-style % operator has always worked.

b'test %s %s %s' % (5, b'blah','strblah') # python3 ==> error
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: %b requires bytes, or an object that implements __bytes__, not 'int'

def to_bytes(arg):
    if hasattr(arg,'encode'): return arg.encode()
    if hasattr(arg,'decode'): return arg
    return repr(arg).encode()

def render_bytes_template(btemplate : bytes, *args):
    return btemplate % tuple(map(to_bytes,args))

render_bytes_template(b'this is how we have to write raw strings with unknown-typed arguments? %s %s %s',5,b'blah','strblah')

# output: b'this is how we have to write raw strings with unknown-typed arguments? 5 blah strblah'

But in python 2, it's just built in:

'example that just works %s %s %s' % (5,b'blah',u'strblah')
# output: u'example that just works 5 blah strblah'

Is there a way to do this in Python 3 while still matching the performance of Python 2? Please tell me I'm missing something. The fallback here is to implement it in Cython (or are there Python 3 libraries that help with this?), but I still don't see why it was removed from the standard library, other than to avoid the implicit encoding of the string object. Can't we just add a bytes method like format_any()?

By the way, it's not as simple as this cop-out:

def render_bytes_template(btemplate : bytes, *args):
    return (btemplate.decode() % args).encode()

Not only do I not want to do any unnecessary encode/decoding, but the bytes args are repr'd instead of being injected raw.
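
To illustrate the second point, here is a minimal sketch (Python 3, default interpreter flags) of what the decode/encode round-trip produces. The variable name rendered is only for illustration:

```python
def render_bytes_template(btemplate: bytes, *args):
    # Round-trip through str: decode the template, format, then re-encode.
    return (btemplate.decode() % args).encode()

rendered = render_bytes_template(b'test %s %s %s', 5, b'blah', 'strblah')
# The bytes argument is inserted as its repr, b'blah', not as raw bytes.
print(rendered)  # b"test 5 b'blah' strblah"
```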

Binucleate answered 29/7, 2017 at 4:0 Comment(2)
Note that Python 3 now protects you from bugs that were hidden under the waterline in Python 2. Try 'unicode: %s' % (u'Ünîcódæ',) on for size for example.Yoga
@Martijn And for every time that Python 3 saves us from that, we have to fix 10 bugs like 'unicode: %s' % u'Ünîcódæ'.encode(). Good work PSF.Proceeds

I want to do sprintf on python3 but with raw bytes objects, without having to do any manual conversions for the %s to work.

For this to work, all the formatting arguments also need to already be bytes.

This changed from Py2, which allowed even unicode strings to be formatted into a byte string. The Py2 behaviour is prone to errors as soon as a unicode string containing non-ASCII characters is introduced.

E.g., on Python 2:

In [1]: '%s' % (u'é',)
Out[1]: u'\xe9'

Technically that is correct, but not what the developer intended. It also takes no account of any encoding used.

In Python 3 OTOH:

In [2]: '%s' % ('é',)
Out[2]: 'é'
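
As a rough illustration of why the encoding cannot be left implicit, the same one-character string maps to different bytes depending on the codec. This is only a sketch; the variable names are made up for the example:

```python
text = 'é'

# The same character yields different byte sequences under different encodings.
utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

print(utf8_bytes)    # b'\xc3\xa9'
print(latin1_bytes)  # b'\xe9'
```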

For formatting byte strings, use byte string arguments (Py3.5+ only):

b'%s %s' % (b'blah', 'strblah'.encode('utf-8'))

Other types like integers need to be converted to byte strings as well.
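
A minimal sketch of the options for non-bytes values, assuming Python 3.5 or later (where bytes support %-formatting). Only %s (an alias of %b) insists on bytes; the numeric codes accept numbers directly, and %a inserts the ASCII repr of any object. The variable names are illustrative only:

```python
n = 5

# Explicit conversion, so the value can be used with %s:
via_encode = b'count: %s' % str(n).encode('ascii')

# Numeric conversion codes accept numbers without manual conversion:
via_d = b'count: %d' % n
via_x = b'hex: %x' % 255
via_f = b'ratio: %.2f' % 3.14159

# %a inserts repr(obj) encoded as ASCII (note the quotes around a str):
via_a = b'name: %a' % 'strblah'

print(via_encode)  # b'count: 5'
print(via_d)       # b'count: 5'
print(via_x)       # b'hex: ff'
print(via_f)       # b'ratio: 3.14'
print(via_a)       # b"name: 'strblah'"
```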

Drucie answered 7/8, 2017 at 13:35 Comment(2)
Thanks for reinforcing my observations in the question. There are some discrepancies, however. First off, print() can take a bytes object, an int object, as well as a unicode string, so one could argue that it's not explicit at all. In addition, regular unicode strings allow %s to work on anything with a repr, which is also not explicit. So they only went halfway here. It does nothing but add confusion and reduce features, but that is just my opinion and obviously things won't change. I will get started on a workaround that tries to keep the performance of Python 2, or simply remote-call Python 2.Binucleate
To state the obvious, print is for printing. Printing an encoded unicode string and the unicode string itself results in different output. That is explicit. Technically, in both cases the object's __repr__ or __str__ is used for printing purposes. The 'regular unicode strings' work with any other unicode string, which in Py3 is the default. So repr strings are unicode, as are __str__ results and anything not explicitly set as a byte string. This was a decision by the Python core dev team and we will have to get used to it.Drucie

Would something like this work for you? You just need to make sure that whenever you create a bytes template you wrap it in the new bytes-like class B, which overloads the % and %= operators:

class B(bytes):
    def __init__(self, template):
        self._template = template

    @staticmethod
    def to_bytes(arg):
        if hasattr(arg,'encode'): return arg.encode()
        if hasattr(arg,'decode'): return arg
        return repr(arg).encode()

    def __mod__(self, other):
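        # A str or bytes value is a single argument; other iterables are argument sequences.
        if hasattr(other, '__iter__') and not isinstance(other, (str, bytes, bytearray)):
            ret = self._template % tuple(map(self.to_bytes, other))
        else:
            ret = self._template % self.to_bytes(other)
        return ret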

    def __imod__(self, other):
        return self.__mod__(other)

a = B(b'this %s good')
b = B(b'this %s %s good string')
print(a % 'is')
print(b % ('is', 'a'))

a = B(b'this %s good')
a %= 'is'
b = B(b'this %s %s good string')
b %= ('is', 'a')
print(a)
print(b)

This outputs:

b'this is good'
b'this is a good string'
b'this is good'
b'this is a good string'
Zambrano answered 2/8, 2017 at 1:24 Comment(4)
Honestly I don't know if my question is more of a gripe or an honest question about design getting in the way of performance. Thanks for your contribution. If no one answers in a week I'll give you the reward.Binucleate
I think it's a fair question, I'm not sure what the performance costs are though compared with .format or f-strings.Zambrano
.format and f-strings require a decode(), so it's going to be worse. I've read in other posts online that working with unicode is roughly half the speed of working with bytes in general. So not terrible, but for a lot of workloads it hurts when all you want to do is compose bytes from other bytes. And yes, the answer is to wrangle all the inputs before the composition, which is a major overhaul. Using six or some other helper is not going to solve any performance degradation. I get that the desire is to be explicit, but note that print() accepts both bytes and unicode (so not quite).Binucleate
This breaks on unicode strings. It works for unicode strings that do not actually contain unicode characters, per the above examples, but not generally.Drucie
