How do I get the raw representation of a string in Python?

I am making a class that relies heavily on regular expressions.

Let's say my class looks like this:

class Example:
    def __init__(self, regex):
        self.regex = regex

    def __repr__(self):
        return 'Example({})'.format(repr(self.regex.pattern))

And let's say I use it like this:

import re

example = Example(re.compile(r'\d+'))

If I call repr(example), I get 'Example('\\\\d+')', but I want 'Example(r'\\d+')'. (The extra backslashes are there because these are themselves reprs; when printed, the strings look right.) I suppose I could return "r'{}'".format(regex.pattern), but that doesn't sit well with me. In the unlikely event that the Python Software Foundation someday changes the syntax for raw string literals, my code won't reflect that. That's hypothetical, though. My main concern is whether this always works; I can't think of an edge case off the top of my head. Is there a more formal way of doing this?

EDIT: Nothing seems to appear in the Format Specification Mini-Language, the printf-style String Formatting guide, or the string module.

Theseus answered 8/12, 2012 at 14:48 Comment(0)

The problem with a raw string representation is that you cannot represent everything in a portable manner (i.e. without using control characters). For example, if your string contained a line break, you would have to literally break the string literal across lines, because a line break cannot be written as an escape sequence in a raw string.
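To make the line-break limitation concrete, here is a minimal sketch. The variable names are only illustrative, and ast.literal_eval is used here simply as a safe way to parse a literal:

```python
import ast

# Build the source text of a raw literal that contains a real line feed.
source = "r'foo" + "\n" + "bar'"

try:
    ast.literal_eval(source)
    parsed = True
except SyntaxError:
    # A single-quoted literal may not span lines, raw or not.
    parsed = False

print(parsed)  # False

# repr() escapes the line feed instead, so its output parses back.
text = "foo\nbar"
print(repr(text))                            # 'foo\nbar'
print(ast.literal_eval(repr(text)) == text)  # True
```

A triple-quoted raw literal could hold the line feed, but only by containing the actual control character in the source.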

That said, the way to get a raw string representation is what you already gave:

"r'{}'".format(regex.pattern)

Raw strings are defined so that no escape rules apply, except that the literal ends at the quotation character it started with and that you can escape that quotation character with a backslash (the backslash stays in the string). Thus, for example, you cannot write a string consisting of a single backslash, "\\", as a raw string literal (r"\" is a SyntaxError, and r"\\" yields "\\\\", i.e. two backslashes).
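These rules can be checked with a short sketch. The helper parses() is a hypothetical name introduced for illustration, and each literal is passed as source text to ast.literal_eval:

```python
import ast

def parses(source):
    """Return True if *source* is a valid Python literal (illustrative helper)."""
    try:
        ast.literal_eval(source)
        return True
    except SyntaxError:
        return False

# A lone trailing backslash escapes the closing quote, so the literal never ends.
print(parses('r"\\"'))               # False  (source text: r"\")

# An even number of trailing backslashes is fine, but both are kept.
two = ast.literal_eval('r"\\\\"')    # source text: r"\\"
print(len(two))                      # 2

# A backslash lets a quote appear inside, but the backslash stays in the value.
quoted = ast.literal_eval('r"\\""')  # source text: r"\""
print(len(quoted))                   # 2
```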

If you really want to do this, you should use a wrapper like:

def rawstr(s):
    """
    Return the raw string literal (r'...') for the string *s* if one exists.
    If *s* cannot be written as a raw string literal (for example, because
    it contains control characters), the default repr() result is returned.
    """
    # Control characters (such as a line feed) would have to appear literally.
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)

    # An odd number of trailing backslashes would escape the closing quote.
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)

    # Pick a quote character that does not occur in s; give up if both do.
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'

    return pattern.format(s)

Tests:

>>> test1 = "\\"
>>> test2 = "foobar \n"
>>> test3 = r"a \valid rawstring"
>>> test4 = "foo \\\\\\"
>>> test5 = r"foo \\"
>>> test6 = r"'"
>>> test7 = r'"'
>>> print(rawstr(test1))
'\\'
>>> print(rawstr(test2))
'foobar \n'
>>> print(rawstr(test3))
r'a \valid rawstring'
>>> print(rawstr(test4))
'foo \\\\\\'
>>> print(rawstr(test5))
r'foo \\'
>>> print(rawstr(test6))
r"'"
>>> print(rawstr(test7))
r'"'
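Since the main concern is whether this always works, a round-trip check is a reasonable sanity test: parse each result back and compare it with the original string. The following is a sketch under that assumption. The sample list is illustrative, not exhaustive, and rawstr is repeated so the snippet runs on its own:

```python
import ast

def rawstr(s):
    # Same logic as the wrapper above, repeated so the snippet is self-contained.
    if any(0 <= ord(ch) < 32 for ch in s):
        return repr(s)
    if (len(s) - len(s.rstrip("\\"))) % 2 == 1:
        return repr(s)
    pattern = "r'{0}'"
    if '"' in s:
        if "'" in s:
            return repr(s)
    elif "'" in s:
        pattern = 'r"{0}"'
    return pattern.format(s)

samples = ["\\", "foobar \n", r"a \valid rawstring", "foo \\\\\\",
           r"foo \\", "'", '"', r"\d+", "both ' and \""]

for s in samples:
    literal = rawstr(s)
    # literal_eval parses the literal without executing arbitrary code.
    assert ast.literal_eval(literal) == s, literal

print("all samples round-trip")
```

In the question's class, __repr__ could then return 'Example({})'.format(rawstr(self.regex.pattern)).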
Prattle answered 8/12, 2012 at 15:2 Comment(15)
+1 Though the implementation is flawed (assumes ASCII, does not catch all instances of an odd number of backslashes at the end of the string) and the rest is ugly (how about if any(<condition involving c> for c in s)?).Glover
Good point, I didn't think about the general problem of an odd number of backslashes. I'll try to extend that.Leodora
Just got done playing around with your code. This is impressive! I didn't even think about the control characters. I see that your function falls back to the normal string representation in the event of a control character. By the way, filter returns an iterator, so there's no need to call iter. :) Thank you.Theseus
@TylerCrompton Thanks for thanking! As for filter: that depends on the Python version. In Python 2, it'll be a list.Leodora
@delnan Oh, I didn't even think about any. Thanks for the suggestion. I cannot fix the other condition without using itertools, though. With itertools, I'd do a sum(map(lambda x: 1, takewhile(lambda x: x == "\\", reversed(s)))) off the top of my head.Leodora
@JonasWielicki, that's probably the best way. A similar, more readable way: len(tuple(takewhile(lambda x: x == '\\', reversed(s)))).Theseus
I thought about using a list and taking the length too, but I preferred to go without construction of a list, at least in Py3. OT: tuples are actually more expensive to construct (did some benchmarking in an often-called function inside some GUI framework once)Leodora
Interesting. I had assumed it was the other way around since they are immutable. Anyway, I don't think either are proper Python and should be broken up across lines into an equivalent suite.Theseus
That's exactly the surprise I had. Going through the relevant commit logs, it might have been negligible though, even in that routine (only about a 2% speedup). Maybe because they have to set up the hashing infrastructure?Leodora
One issue with this: it won't work if the string contains the ' character.Timothee
Why do you exclude characters in the 0-32 interval? I think all of those are valid in a raw string, and I know tabs and line feeds are definitely okay in a raw string.Lalo
Aside from that, this function also has problems with raw strings that contain both apostrophes and double quotes, which can happen when backslashes are used.Lalo
This complex answer indirectly taught me a much simpler lesson. When I'm using Python interactively or a debugger and I want to look at a string variable, I don't just enter its name any more. Instead I: print(string_var1)Bygone
@Bygone That may conceal things, try printing string_var1 = "foo\rbar" for example. Will often not matter, but it may in some cases (which is why stuff like repr() exists)Leodora
Thanks @JonasSchäfer you're right: for tricky strings you want to use both string_var1 and print(string_var1) in a debugger. For merely counting backslashes though, print(string_var1) is enough :-)Bygone
