Possible to make custom string literal prefixes in Python?

Asked 13/5, 2016 at 7:17 Answered 8/7, 2020 at 4:42

Solved python string prefix string-literals

Let's say I have a custom class derived from str that implements/overrides some methods:

class mystr(str):
    # just an example for a custom method:
    def something(self):
        return "anything"

Now currently I have to manually create instances of mystr by passing it a string in the constructor:

ms1 = mystr("my string")

s = "another string"
ms2 = mystr(s)

This is not too bad, but it lead to the idea that it would be cool to use a custom string prefix similar to b'bytes string' or r'raw string' or u'unicode string'.

Is it somehow possible in Python to create/register such a custom string literal prefix like m so that a literal m'my string' results in a new instance of mystr?
Or are those prefixes hard-coded into the Python interpreter?

Locker answered 13/5, 2016 at 7:17 Comment(2)

Unfortunately, no, not without modifying the interpreter and compiling your own. Interestingly, this is something ECMAScript can do in the future. – Layette 13/5, 2016 at 7:34

(related, custom suffix for numbers python - How does the j suffix for complex numbers work? Can I make my own suffix? - Stack Overflow) – Phenol 4/10, 2021 at 8:53

Those prefixes are hardcoded in the interpreter, you can't register more prefixes.

What you could do however, is preprocess your Python files, by using a custom source codec. This is a rather neat hack, one that requires you to register a custom codec, and to understand and apply source code transformations.

Python allows you to specify the encoding of source code with a special comment at the top:

# coding: utf-8

would tell Python that the source code encoded with UTF-8, and will decode the file accordingly before parsing. Python looks up the codec for this in the codecs module registry. And you can register your own codecs.

The pyxl project uses this trick to parse out HTML syntax from Python files to replace them with actual Python syntax to build that HTML, all in a 'decoding' step. See the codec package in that project, where the register module registers a custom codec search function that transforms source code before Python actually parses and compiles it. A custom .pth file is installed into your site-packages directory to load this registration step at Python startup time. Another project that does the same thing to parse out Ruby-style string formatting is interpy.

All you have to do then, is build such a codec too that'll parse a Python source file (tokenizes it, perhaps with the tokenize module) and replaces string literals with your custom prefix with mystr(<string literal>) calls. Any file you want parsed you mark with # coding: yourcustomcodec.

I'll leave that part as an exercise for the reader. Good luck!

Note that the result of this transformation is then compiled into bytecode, which is cached; your transformation only has to run once per source code revision, all other imports of a module using your codec will load the cached bytecode.

Absentee answered 13/5, 2016 at 7:41 Comment(4)

Wow, I'm not sure that is a good diea - but the hacking factor is great – Theism 13/5, 2016 at 7:43

Hey, that's a cool approach. Definitely nothing one would want in production code, but interesting for sure. I just fear that simply writing mystr("my string") is a lot easier... ;) Will stick to that. Thanks! – Locker 13/5, 2016 at 7:58

@ByteCommander: well, I'm pretty sure Dropbox uses the pyxl codec in production. Registering a codec is pretty light-weight, and the decoding-tokenizing-transform step only has to take place at compilation time. The result is compiled to bytecode and cached. This step is not repeated until the source code changes, invalidating the bytecode cache. – Absentee 13/5, 2016 at 8:9

Yes, sure. I mean it's just a bit complicated in my simple case where I would only like a custom literal that could simply be replaced with the constructor call. It's one of those features one might find really useful, but for others it's just obfuscating the code. That would depend on how much the "codec" changes. Too much makes it confusing and too less makes it unnecessary. – Locker 13/5, 2016 at 8:24

One could use operator overloading to implicitly convert str into a custom class

class MyString(str):
    def __or__( self, a ):
        return MyString(self + a)

m = MyString('')
print( m, type(m) )
#('', <class 'MyString'>)
print m|'a', type(m|'a')
#('a', <class 'MyString'>)

This avoids the use of parenthesis effectively emulating a string prefix with one extra character ─ which I chose to be | but it could also be & or other binary comparison operator.

Basic answered 8/7, 2020 at 4:42 Comment(0)

While the workarounds mentioned above are great, they could be dangerous. Hacking your python is really not a good idea. while you can't really make a prefix otherwise, you could do the following:

class MyString(str):
    def something(self):
        return MyString("anything")

m = MyString

# The you can do:
m("hi")
# Rather than:
# m"hi"

That's probably the best safe solution you can find. Two parentheses aren't really that much to type, and it can be less confusing to readers of your code.

Krug answered 28/2, 2020 at 13:50 Comment(0)

Recommended topics

Hot tags