What is the rationale for parentheses in C++11's raw string literals R"(...)"?

O

3

87

There is a very convenient feature introduced in C++11 called raw string literals, which are strings in which escape sequences are not processed. Instead of writing this:

  regex mask("\\t[0-9]+\\.[0-9]+\\t\\\\SUB");

You can simply write this:

  regex mask(R"(\t[0-9]+\.[0-9]+\t\\SUB)");

Much more readable. However, note the extra parentheses one has to place around the string to define a raw string literal.

My question is, why do we even need these? To me it looks quite ugly and illogical. Here are the cons as I see them:

Extra verbosity, while the whole feature is used to make literals more compact
Hard to distinguish between the body of the literal and the defining symbols

Here is what I mean by hard to distinguish:

"good old usual string literal"
 ^-    body inside quotes   -^

R"(new strange raw string literal)"
   ^- body inside parentheses  -^

And here is the pro:

More flexibility, more characters available in raw strings, especially when used with the delimiter: R"delim( can use "()" here )delim"

But hey, if you need more flexibility, you have good old escapable string literals. Why did the standard committee decide to pollute the content of every raw string literal with these absolutely unnecessary parentheses? What was the rationale behind that? What are the pros I didn't mention?

UPD The answer by Kerrek is great, but unfortunately it does not answer the question, since I had already described that I understand how the syntax works and what benefits it gives. Five years have passed since I asked this question, and there is still no answer. And I am still frustrated by this decision. One could say that this is a matter of taste, but I would disagree. How many spaces you use, how you name your variables, whether it is SomeFunction() or some_function(): these are matters of taste. And I can really easily switch from one style to another.

But this?.. It still feels awkward and clumsy after so many years. No, this is not about taste. This is about wanting to cover all possible cases no matter what. We are doomed to write these ugly parens every time we need to write a Windows-specific path, or a regular expression, or a multi-line string literal. And for what?.. For those rare cases when we actually need to put " in a string? I wish I had been at the committee meeting where they decided to do it this way. I would have been strongly against this really bad decision. I wish. Now we are doomed.

Thank you for reading this far. Now I feel a little better.

UPD2 Here are my alternative proposals, both of which I think would be MUCH better than the existing syntax.

Proposal 1. Inspired by Python. Cannot support string literals with triple quotes: R"""Here is a string literal with any content, except for triple quotes, which you don't actually use that often."""

Proposal 2. Inspired by common sense. Supports all possible string literals, just like the current one: R"delim"content of string"delim". With empty delimiter: R""Looks better, doesn't it?"". Empty raw string: R"""". Raw string with double quotes: R"#"Here are double quotes: "", thanks"#".

Any problems with these proposals?

Outfielder answered 29/9, 2013 at 8:34 Comment(7)

R";-](R"(this is a basic raw string literal as text inside a more complex one)");-]" – Northampton 3/10, 2013 at 20:3

The syntax is indeed quite ugly imo, but I can't really think of an alternative that can also remain backwards compatible and keep all the features. – Conciseness 12/11, 2018 at 22:18

@ChilliDoughnuts, see the updated question. – Outfielder 13/11, 2018 at 10:39

I do like the first proposal; maybe it would work as an alternative. I guess an advantage of the ugliness of R"(...)" is that it serves as a warning sign for me that "this is a string literal, be careful", since it stands out so much. – Conciseness 13/11, 2018 at 18:36

@Mikhail: "For those rare cases when we actually need to put " in a string?" The fact that you believe that cases where you need " in a raw string are "rare" is probably part of the problem. It's not that there is "no answer". There is an answer; you just don't agree with it. If your definition of what constitutes an "answer" is "something that convinces me to change my mind on this", then your question is too opinionated. The justification has been provided; your agreement with it is not required. – Pregnant 16/12, 2018 at 20:10

You should not update a historical highly upvoted question to include a new question ... instead post a new question. (Which will probably be closed as opinion-based anyway, since your only objection seems to be "I find this unaesthetic") – Discomfort 16/12, 2018 at 20:11

@Discomfort This question did not have an accepted answer anyway. And I updated it due to a logical claim in the comments. – Outfielder 17/12, 2018 at 17:51

I

11

As the other answer explains, there must be something in addition to the quotation mark to avoid parsing ambiguity in cases where " or )", or in fact any closing sequence, appears in the string itself.

As for the syntax choice, well, I agree the syntax choice is suboptimal, but it is OK in general (you could think of it: "things could be worse", lol). I think it is a good compromise between usage simplicity and parsing simplicity.

Proposal 1. Inspired by Python. Cannot support string literals with triple quotes:
R"""any content, except for triple quotes, which you don't actually use that often."""

There is indeed a problem with this - "quotes, which you don't actually use that often". Firstly, the very idea of raw strings is to represent raw strings, i.e. exactly as they would appear in a text file, without any modifications to the string, regardless of the string contents. Secondly, the syntax should be general, i.e. without adding variations like "almost raw string", etc.

How would you write one quote with this syntax? Two quotes? Note - those are very common cases, especially when your code is dealing with strings and parsing.

Proposal 2.
R"delim"content of string"delim".
R""Looks better, doesn't it?"".
R"#"Here are double quotes: "", thanks"#".

Well, this one might be a better candidate. One thing though: a common case (and I believe it was a motivating case for the accepted syntax) is that the double-quote character itself is very common, and raw strings should come in handy for these cases.

So, let's see, normal string syntax:

s1 = "\"";
s2 = "\"quoted string\"";

Your syntax e.g. with "x" as delim:

s1 = R"x"""x";
s2 = R"x""quoted string""x";

Accepted syntax:

s1 = R"(")";
s2 = R"("quoted string")";
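As a quick sanity check that the accepted raw forms denote the same strings as the escaped forms, here is a minimal sketch. The function names are hypothetical and exist only for this illustration; only the four literals come from the examples above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper names, used only to label the four literals.

// Ordinary literals: each embedded quote needs a backslash.
std::string escaped_s1() { return "\""; }
std::string escaped_s2() { return "\"quoted string\""; }

// Raw literals with the empty delimiter: quotes are written as-is
// between R"( and )".
std::string raw_s1() { return R"(")"; }
std::string raw_s2() { return R"("quoted string")"; }

// Call this from main() to run the checks.
void check_quotes() {
    assert(raw_s1() == escaped_s1());
    assert(raw_s2() == escaped_s2());
}
```

The proposal-2 spellings cannot be checked this way, since they are not valid C++.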

Yes, I agree that the brackets introduce an annoying visual effect. I suspect the authors of the syntax reasoned that the additional "delim" will rarely be needed, since )" does not appear very often inside a string. But OTOH, trailing/leading/isolated quotes are quite often, so e.g. your proposed syntax (#2) would require some delim more often, which in turn would mean changing it more often from R""..."" to R"delim"..."delim". Hope you get the idea.

Could the syntax be better? I personally would prefer an even simpler variant of syntax:

Rdelim"string contents"delim;

With the above examples:

s1 = Rx"""x; 
s2 = Rx""quoted string""x;

However, to work correctly (if it's possible at all in the current grammar), this variant would require limiting the character set for the delim part, say to letters/digits only (because of existing operators), and maybe some further restrictions on the initial character to avoid clashes with possible future grammar.
So I believe a better choice could have been made, although nothing significantly better can be done in this case.

Illeetvilaine answered 16/12, 2018 at 15:37 Comment(4)

Thanks for the elaborated answer! This is actually much closer to what I would like to see. "OTOH, trailing/leading/isolated quotes are quite often" - well, I don't have such a feeling. But this is just my feeling. Maybe if you analyze a huge public set of code bases, you'll find out this is actually the case. But again, to me it feels differently. – Outfielder 17/12, 2018 at 17:57

Good example with a "quoted string". But hey, are you trying to say raw string literals should look as good as possible in all cases? I'd want to optimize them only for cases where non-raw string literals are not good enough. And for both of your examples I would actually prefer to have a non-raw string literal. That's why I don't care that much how it would look for a raw string literal. But I see your point. Thanks. – Outfielder 17/12, 2018 at 18:0

@Outfielder "for cases where non-raw string literals are not good enough". Any literals where I may need some kind of escaping are not good for many tasks (e.g. placing strings with DSL contents, e.g. JSON, Regex, etc.) So I just say that this kind of literals IMO must be true raw strings, and not something half-baked thus the existing syntax fits my expectation of correct technical solution. – Illeetvilaine 17/12, 2018 at 23:35

Yes, one must watch out for the delimiter, but that is at least more visible than escape sequences. If a string terminates at wrong place at parsing stage - most probably you see some error, but in case of incorrectly escaped sequences, there are more cases for hard-to-spot errors and it is more typing annoyance. – Illeetvilaine 17/12, 2018 at 23:39

A

112

The purpose of the parentheses is to allow you to specify a custom delimiter:

R"foo(Hello World)foo"   // the string "Hello World"

In your example, and in typical use, the delimiter is simply empty, so the raw string is enclosed by the sequences R"( and )".
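A minimal sketch of the two spellings side by side, assuming nothing beyond the standard library. The function names are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper names, used only for this illustration.

// Empty delimiter: the content sits between R"( and )".
std::string with_empty_delimiter() { return R"(Hello World)"; }

// Custom delimiter "foo": the content sits between R"foo( and )foo".
std::string with_custom_delimiter() { return R"foo(Hello World)foo"; }

// Call this from main() to run the checks.
void check_delimiters() {
    // The delimiter is not part of the resulting string.
    assert(with_empty_delimiter() == "Hello World");
    assert(with_custom_delimiter() == with_empty_delimiter());
}
```

The delimiter only affects where the literal ends; it never shows up in the value.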

Allowing for arbitrary delimiters is a design decision that reflects the desire to provide a complete solution without weird limitations or edge cases. You can pick any sequence of characters that does not occur in your string as the delimiter.

Without this, you would be in trouble if the string itself contained something like " (if you had just wanted R"..." as your raw string syntax) or )" (if the delimiter is empty). Both of those are perfectly common and frequent character sequences, especially in regular expressions, so it would be incredibly annoying if the decision whether or not you use a raw string depended on the specific content of your string.

Remember that inside the raw string there's no other escape mechanism, so the best you could do otherwise was to concatenate pieces of string literal, which would be very impractical. By allowing a custom delimiter, all you need to do is pick an unusual character sequence once, and maybe modify it in very rare cases when you make a future edit.
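To make the trade-off concrete, here is a minimal sketch using a made-up regular expression whose text contains the sequence )" and therefore cannot be written with the empty delimiter. The pattern, the delimiter name re, and the function names are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <string>

// Target text (13 characters):  \("([^"]*)"\)
// It contains )" right after the *, so R"(...)" would end there.

// Ordinary literal: every quote and backslash has to be escaped.
std::string escaped_form() { return "\\(\"([^\"]*)\"\\)"; }

// Custom delimiter "re": the literal ends only at )re" so the
// embedded )" is harmless.
std::string delimited_form() { return R"re(\("([^"]*)"\))re"; }

// Without a custom delimiter, the best available workaround is to
// split the text around the offending quote and rely on adjacent
// string literals being concatenated by the compiler.
std::string concatenated_form() {
    return R"(\("([^"]*))" "\"" R"(\))";
}

// Call this from main() to run the checks.
void check_forms() {
    assert(delimited_form() == escaped_form());
    assert(concatenated_form() == escaped_form());
}
```

The concatenated version is arguably harder to read than either alternative, which is the impracticality referred to above.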

But to stress once again, even the empty delimiter is already useful, since the R"(...)" syntax allows you to place naked quotation marks in your string. That by itself is quite a gain.
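As a small illustration of that gain, here is a sketch with a made-up JSON fragment that spans several lines and contains unescaped quotation marks. The fragment and the function names are assumptions, not taken from the question.

```cpp
#include <cassert>
#include <string>

// Hypothetical JSON fragment written as a raw literal: the quotes and
// the line breaks are part of the string exactly as typed.
std::string raw_json() {
    return R"({
  "name": "raw",
  "tags": ["a", "b"]
})";
}

// The same text as an ordinary literal, with \" and \n spelled out.
std::string escaped_json() {
    return "{\n  \"name\": \"raw\",\n  \"tags\": [\"a\", \"b\"]\n}";
}

// Call this from main() to run the check.
void check_json() {
    assert(raw_json() == escaped_json());
}
```

Note that indentation inside the raw literal is part of the string, which is why the inner lines are not aligned with the surrounding code.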

Attendance answered 29/9, 2013 at 10:25 Comment(12)

And naked newlines and tabs and whitespace! – Educate 20/6, 2015 at 10:59

The point is that a " would end the string if the format was just R"...". That is why the format is R"(...)". E.g. R"(he said, "hello")" – Albertoalberts 17/2, 2016 at 10:21

@SuperflyJon: Yes, sure, that's just a special case of using an empty custom delimiter, right? – Attendance 17/2, 2016 at 10:23

Sure, just highlighting that the () are not there to allow backslashes and white space. The delimiter is only needed if you have a string with )" in it. E.g. R"("(eg)")" would have to use a delimiter: R"delim("(eg)")delim". I kind of agree that the syntax is a bit unwieldy; in this example, "\"(eg)\"" is more readable to me. – Albertoalberts 18/2, 2016 at 11:1

"You can pick any sequence of characters that does not occur in your string as the delimiter." Can you elaborate? My interpretation of your statement is that using the delimiter within the sequence is ill-formed, but the standard does not indicate this, and compilers seem happy to accept R"foo(Hello World foo)foo" – Fearsome 25/8, 2016 at 12:20

@AndyG: I meant it in the sense that )foo does not appear in your string, including the parenthesis. The d-char-sequence itself may indeed appear arbitrarily. – Attendance 25/8, 2016 at 13:2

@KerrekSB: Ok that makes more sense. Thanks for the follow up. – Fearsome 25/8, 2016 at 13:5

I think that @SuperflyJon's comment about the )" inside the string is important for understanding the need for additional delimiter characters and should be added to the answer – Sofer 1/8, 2017 at 9:30

Many years have passed, but I still cannot get used to this syntax. This is just ugly. – Outfielder 2/2, 2018 at 12:58

@Mikhail: You're not required to use raw string literals for every string. It's a judgement call; use it when it improves matters. The typical use case would have either a long or a complex string, so that you concentrate on the body and basically ignore the delimiters when reading. – Attendance 2/2, 2018 at 20:14

@KerrekSB more precisely, )foo can also appear inside the string, but )foo" cannot. R"foo(Hello World )foo)foo" is equivalent to "Hello World )foo". – Khiva 11/8, 2018 at 10:55

And why are Unicode characters not allowed in delimiters? – Embosser 25/9, 2021 at 14:17

S

4

The question asks about the rationale for a language decision, so it's useful to review the documents that were published by committee members working on the feature prior to its standardization. The information below is from reviewing the list of proposals on the Compiler support for C++11 page at cppreference.com and following the history of N2442 backward.

N2053 (2006-09-06)

The first proposal that eventually became C++11 raw string literals was N2053 by Beman Dawes in 2006. This proposal offers two motivating examples, one a monster regular expression and another a short HTML fragment. Both examples contain literal " characters, so clearly the designers felt that support for double quotes in string literals was important (whereas the question describes them as "rare").

N2053 proposed that raw strings would typically look like:

R""Hello, world!""

Note that this is similar to "Proposal 2" in the question, evidencing that the committee considered but ultimately rejected it.

N2053 allowed the inner " to be any character for which std::ispunct is true, so it would also allow, for example:

R"$Hello, world!  Embedded double-quotes like this "" are ok here.$"

N2146 (2007-01-09)

The next iteration was N2146, also by Beman Dawes. In N2146, raw strings typically look like:

R"[Hello, world!]"

It also allows custom delimiter strings between the quote and bracket:

R"DELIM[Hello, world!]DELIM"

The rationale given for changing from quotes to square brackets is that "common use cases will employ the easily recognizable R"[...]"."

The question contends that it is "hard to distinguish" the delimiter from the string. Beman Dawes apparently felt the opposite, at least when the syntax used brackets rather than parentheses.

The rationale given for allowing custom delimiters is the obvious one, namely to "reduce the risk of the raw literal string containing the same sequence as the delimiter".

N2295 (2007-06-23), N2384 (2007-08-03), N2442 (2007-10-05)

The next three iterations were N2295, N2384, and N2442, each by Lawrence Crowl and Beman Dawes.

These iterations made no changes to the delimiter syntax; in all three, it remained:

R"DELIM[Hello, world!]DELIM"

N2295 did, however, drop the motivating examples and design rationale (brief though it was), even while stating, "The motivation, discussion, and other details from the original proposals remains unchanged." Hmph.

Standardization in C++11

There doesn't appear to be any more publicly available discussion of the feature until it appears in the C++11 standard, section 2.14.5, with the now-familiar syntax consisting of double quotes, round parentheses, and optional custom delimiter strings:

R"DELIM(Hello, world!)DELIM"

I speculate that square brackets were changed to (round) parentheses because only the latter are invariant code points in ISO 646 (the international standard corresponding to ASCII). Consequently, with brackets, some users utilizing non-US character encodings would have had to resort to using trigraphs in order to use raw strings.

Comparison to Python triple quotes

The question suggests two alternatives, the first being Python-like triple quotes:

R"""Hello, world!"""

I'll first note that N2053 takes explicit inspiration from Python, so its author obviously considered it but chose to go a different route.

In N2053, raw string literals were always delimited by a two-character sequence, with two double-quotes evidently being considered sufficient in most cases. Based on the subsequent evolution of the feature, I speculate that the committee members ultimately favored R"(...)" over R"""...""" on the basis of being less verbose in typical usage.

Comparison to custom delimiters between a double quote pair

The question's second proposed alternative is delimiters in quotes:

R"DELIM"Hello, world!"DELIM"

This is pretty close to what N2146 had:

R"DELIM[Hello, world!]DELIM"

As already noted, the authors evidently felt that brackets were easier to visually recognize than double quotes in this role. I presume that they felt the same way about round parentheses.

Summary

In short, the alternatives offered, or close variations, were considered but ultimately rejected. The history provides some explicit indications of why, with some gaps that unfortunately can only be filled by conjecture based on what is publicly available.

Stylographic answered 10/8, 2023 at 2:10 Comment(0)