Is string::c_str() no longer null terminated in C++11? [duplicate]
U

4

72

In C++11 basic_string::c_str is defined to be exactly the same as basic_string::data, which is in turn defined to be exactly the same as *(begin() + n) and *(&*begin() + n) (when 0 <= n < size()).

I cannot find anything that requires the string to always have a null character at its end.

Does this mean that c_str() is no longer guaranteed to produce a null-terminated string?

Unifoliolate answered 26/9, 2011 at 10:56 Comment(4)
surely such a drastic change would break lots of old code...Russom
@Nim: I agree completely, but I was wondering where in the standard this requirement is stated.Unifoliolate
If c_str didn't return a null-terminated string, it would be the most misnamed function ever.Lutist
You missed an = in 0 <= n <= size() ... everything is fine when you include it, as the Standard doesEngraft
C
81

Strings are now required to use null-terminated buffers internally. Look at the definition of operator[] (21.4.5):

Requires: pos <= size().

Returns: *(begin() + pos) if pos < size(), otherwise a reference to an object of type T with value charT(); the referenced value shall not be modified.

Looking back at c_str (21.4.7.1/1), we see that it is defined in terms of operator[]:

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].

And both c_str and data are required to be O(1), so the implementation is effectively forced to use null-terminated buffers.

Additionally, as David Rodríguez - dribeas points out in the comments, the return value requirement means that &operator[](0) can be used as a synonym for c_str(), so the terminating null character must lie in the same buffer (since *(p + size()) must be equal to charT()). It also means that even if the terminator is initialised lazily, the buffer cannot be observed in the intermediate state.
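
The two quoted clauses can be checked directly against an implementation. The following is only a minimal sketch, assuming a C++11 (or later) standard library; the helper names c_str_matches_subscript and demo_c_str_contract are hypothetical and exist purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: checks the C++11 accessor wording quoted above,
// i.e. that c_str() + i == &operator[](i) for each i in [0, size()].
bool c_str_matches_subscript(const std::string& s) {
    const char* p = s.c_str();
    // The range is closed, so index size() is deliberately included.
    for (std::size_t i = 0; i <= s.size(); ++i) {
        if (p + i != &s[i]) return false;
    }
    // The const operator[](size()) refers to an object with value char().
    return s[s.size()] == char();
}

// Hypothetical usage: exercises a non-empty and an empty string.
void demo_c_str_contract() {
    assert(c_str_matches_subscript(std::string("hello")));
    assert(c_str_matches_subscript(std::string()));
}
```

A passing check on one implementation does not prove conformance in general; it only illustrates what the wording requires.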

Centrosphere answered 26/9, 2011 at 11:9 Comment(22)
That doesn't say anything about the string being null-terminated.Beverly
While that does not say that the string must be null terminated, it can be inferred from the string requirements. Both c_str and data must be an O(1) operation, which means that they cannot create a copy on the fly. Additionally, the requirement of matching operator[] output means that either it is already nul terminated, or the call to data/c_str must add the nul terminator prior to returning the pointer. Additionally, the string must have space for that terminator before the call to maintain the O(1) requirement. Technically the string need not be nul terminated, but data() doesNoctiluca
@R.MartinhoFernandes: there is no requirement if pos > size, because that would violate the precondition.Unifoliolate
This is the correct answer. I was unclear in my question. c_str() actually returns something slightly different from what I stated.Unifoliolate
Since c_str and data are both required to be constant time, IMO this pretty much forces the implementation to use null-terminated buffers.Centrosphere
Also, the last quote: Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()]. Means that &operator[](size()) == &operator[](size()-1) + 1 --i.e. if operator[](size()) returned a reference to a \0 outside of the string, this requirement could never be met.Noctiluca
@jalf: That doesn't say anything about the string being null-terminated. Yes, it does. 21.4.7.1 says that the pointer returned by c_str() must point to a buffer of length size()+1. 21.4.5 says that the last element of this buffer must have a value of charT() -- in other words, the null character.Barbarian
@David and others: The (first) snippet Mikhail posted says nothing about nulls, and nothing about the buffer itself being null-terminated. My point is simply that he said that strings are required to use null-terminated buffers internally, and then posted a quote from the standard talking about something completely different. Even with the second snippet, it doesn't say anything about the buffer itself being null-terminated.Beverly
Given that the OP's question is basically "where is the requirement for strings to be null-terminated", I would expect an answer to point to the part of the standard which at least mentions a null. Where in this answer can I see that the result of operator[] (whose output you've noted that c_str is required to match) must return a null at the end of the string? This answer only gives us half of the inference chain. It tells us that c_str is required to return the same thing as something else, which isn't defined in the answer.Beverly
@jalf: I don't know what the post looked like when you made that first comment. The post as it stands certainly does answer the question. The standard most certainly does say, in standardese, that c_str() must return a pointer to a null-terminated buffer. A non-binding explanatory note that this is the case would have been helpful. Then again, lots of other non-binding explanatory notes elsewhere would also be helpful to those of us who don't speak standardese as a primary language.Barbarian
@DavidHammen again, where, in this post, can I see that the buffer is required to be null-terminated? That's pretty essential information when the answer given is "because it returns the same thing as the buffer". That's not just an explanatory note, it's the entire premise for the answer being correct.Beverly
@jalf: "This answer only gives us half of the inference chain." It gives two thirds of the full chain. The one thing that is missing is that the value assigned by default initialization charT() is the null character. This is clearly the case when charT is char. The standard is a bit vague (more than a bit vague) on the meaning of wchar_t.Barbarian
See also open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2647.html : "This change effectively requires null-terminated buffers."Centrosphere
Even the requirement for O(1) does not rule out the possibility of the charT() terminator being lazily initialised when c_str() is called. The string knows its length, and can make sure that it always has some spare space in which to place the terminator. This means that the buffer does not necessarily always have to be null terminated.Unifoliolate
@Unifoliolate Yes, theoretically, but you can't observe the string in the intermediate state.Centrosphere
@MikhailGlushenkov - It could be observed by reading off the end of the buffer using *(&front() + size). I'm pretty sure that would invoke undefined behaviour though.Unifoliolate
@Unifoliolate 21.4.5 says that front() is equivalent to operator[](0), so your example still returns null (since &operator[](0) is equivalent to c_str()). If you use begin() instead of front(), you'll be effectively dereferencing end(), which is undefined.Centrosphere
This argument hinges on O(1) time, which doesn't mean that c_str() and data() can't do actual processing. Indeed, one could envision a buffering technique where a fixed number of characters at the end are stored in a separate buffer and copied when c_str() is called (since copying a fixed amount is technically still O(1)). Also, perhaps the string is using some kind of relocatable OS memory, and calling c_str() needs to fixate the memory (to prevent moving). So while it must somehow be null terminated internally, I don't agree that the address of front must be a synonym for c_str.Suzerainty
@edA-qa mort-ora-y See 21.4.1/5 - "The char-like objects in a basic_string object shall be stored contiguously."Centrosphere
@edA-qa mort-ora-y On the question of &front() being equivalent to c_str() - 21.4.5/9 defines front() as being equivalent to operator[](0), and 21.4.7.1/1 says that c_str() is the same as &operator[](0). See my reply to Mankarse.Centrosphere
@MikhailGlushenkov, 21.4.7.1 does indeed say the pointer values are equivalent, not just the contents. Thank you.Suzerainty
If the null terminator is guaranteed to be there, it would be nice if it were also defined behaviour to read it. Then dereferencing s.end() could be defined to give a null character.Cache
B
23

Well, in fact it is true that the new standard stipulates that .data() and .c_str() are now synonyms. However, it doesn't say that .c_str() is no longer zero-terminated :)

It just means that you can now rely on .data() being zero-terminated as well.
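
As a small illustration of that point, here is a sketch that assumes compilation as C++11 or later (before C++11, reading data()[size()] was not guaranteed to be valid). The helper name data_is_c_str is hypothetical.

```cpp
#include <string>

// Hypothetical helper: true if data() and c_str() expose the same buffer
// and that buffer has a null character at index size().
bool data_is_c_str(const std::string& s) {
    return s.data() == s.c_str() && s.data()[s.size()] == '\0';
}
```

Under the C++11 wording, both accessors return a pointer p with p + i == &operator[](i), so for i == 0 they must yield the same address.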

Paper N2668 defines c_str() and data() members of std::basic_string as follows:

 const charT* c_str() const; 
 const charT* data() const; 

Returns: A pointer to the initial element of an array of length size() + 1 whose first size() elements equal the corresponding elements of the string controlled by *this and whose last element is a null character specified by charT().

Requires: The program shall not alter any of the values stored in the character array.

Note that this does NOT mean that any valid std::string can be treated as a C-string because std::string can contain embedded nulls, which will prematurely end the C-string when used directly as a const char*.
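
The embedded-null caveat is easy to reproduce. The following is a minimal sketch; the helper names make_with_embedded_null and c_length are hypothetical, and the string contents are arbitrary.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: builds a 5-character string with a null in the middle.
// The explicit length is needed; otherwise the constructor stops at the '\0'.
std::string make_with_embedded_null() {
    return std::string("ab\0cd", 5);
}

// Hypothetical helper: the length a C API would see through c_str().
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}
```

Here make_with_embedded_null().size() is 5, while c_length() of the same string is 2, because strlen stops at the first null character even though the buffer also ends with a terminating null after "cd".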

Addendum:

I don't have access to the actual published final spec of C++11 but it appears that indeed the wording was dropped somewhere in the revision history of the spec: e.g. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2011/n3242.pdf

§ 21.4.7 basic_string string operations [string.ops]

§ 21.4.7.1 basic_string accessors [string.accessors]

     const charT* c_str() const noexcept;
     const charT* data() const noexcept;
  1. Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
  2. Complexity: constant time.
  3. Requires: The program shall not alter any of the values stored in the character array.
Bengal answered 26/9, 2011 at 11:5 Comment(5)
@R.MartinhoFernandes: my edit and your comment must have crossed posts?Bengal
Yeah, sorry about that. Regarding your edit I'd like to note that the FDIS wording is very different from this and the requirement for null-termination is not this obvious, but it's ninja'ed in :)Divergent
dug up some more revisions. Now, who buys me that copy of the spec ;)Bengal
Please escape the Square brackets that appear as part of Operator[](i) in your post, since they are currently interpreted as a link, which makes the text impossible to understand.Campy
@Kevin: sry about that, fixedBengal
C
10

The "history" was that a long time ago when everyone worked in single threads, or at least the threads were workers with their own data, they designed a string class for C++ which made string handling easier than it had been before, and they overloaded operator+ to concatenate strings.

The issue was that users would do something like:

s = s1 + s2 + s3 + s4;

and each concatenation would create a temporary, which had to be a fully constructed string.

Therefore someone had the brainwave of "lazy evaluation" such that internally you could store some kind of "rope" with all the strings until someone wanted to read it as a C-string at which point you would change the internal representation to a contiguous buffer.

This solved the problem above but caused a load of other headaches, particularly in the multi-threaded world, where one expected .c_str() to be a read-only operation that changes nothing and therefore needs no locking. Premature internal locking in the class implementation, just in case someone was using it from multiple threads (when there wasn't even a threading standard), was also not a good idea. In fact, doing any of this was more costly than simply copying the buffer each time. "Copy on write" was abandoned for string implementations for the same reason.

Thus making .c_str() a truly immutable operation turned out to be the most sensible thing to do, but could one "rely" on it in a standard that is now thread-aware? The new standard therefore decided to state clearly that you can, and thus the internal representation needs to hold the null terminator.

Cache answered 24/10, 2012 at 11:15 Comment(1)
The old string also had the strange property that the first non const begin() would invalidate iterators!Kilo
I
2

Well spotted. This is certainly a defect in the recently adopted standard; I'm sure that there was no intent to break all of the code currently using c_str. I would suggest a defect report, or at least asking the question in comp.std.c++ (which will usually end up before the committee if it concerns a defect).

Incubate answered 26/9, 2011 at 11:5 Comment(2)
no need... groups.google.com/group/comp.std.c++/browse_thread/thread/…Bengal
Well, there are bits in the FDIS that are arguably shaky. 21.4.2/2 says that .data() for an empty string isn't actually null-terminated (.data()+1 is not valid, but should be a pointer one beyond the \0)Bangweulu
