Are end+1 iterators for std::string allowed?
Asked Answered
B

4

25

Is it valid to create an iterator to end(str)+1 for std::string?
And if it isn't, why isn't it?

This question is restricted to C++11 and later, because while pre-C++11 the data was already stored in a continuous block in any but rare POC toy-implementations, the data didn't have to be stored that way.
And I think that might make all the difference.

The significant difference between std::string and any other standard container I speculate on is that it always contains one element more than its size, the zero-terminator, to fulfill the requirements of .c_str().

21.4.7.1 basic_string accessors [string.accessors]

const charT* c_str() const noexcept;
const charT* data() const noexcept;

1 Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].
2 Complexity: Constant time.
3 Requires: The program shall not alter any of the values stored in the character array.

Still, even though it should imho guarantee that said expression is valid, for consistency and interoperability with zero-terminated strings if nothing else, the only paragraph I found casts doubt on that:

21.4.1 basic_string general requirements [string.require]

4 The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().

(All quotes are from C++14 final draft (n3936).)

Related: Legal to overwrite std::string's null terminator?

Boomkin answered 11/11, 2015 at 18:10 Comment(29)
I'm pretty sure that If you increment the end iterator, the result is undefined behavior.Cornwell
I don't get it. The iterator obtained by end does point to the \0 character. end + 1 would be an attempt to go beyond the terminating \0, wouldn't it?Methodology
@ChristianHackl in compare for example with std::vector, where end iterator may point to not existing memory, std::string::end point to existing, so as I understand the question is always std::string::end + 1 return valid iterator that may be use for range from begin to zero byteSheldon
Don't try to break the abstraction. Don't assume there is a 0 on the end unless you call c_str().Snip
Perhaps the gist of the question is whether std::string effectively allows *end(str). In practice, assert(*end(str) == '\0') should never fail, but is it correct C++11?Methodology
Probably unimportant but isn't the final draft N4140?Erigeron
@user1034749: Why should end + 1 point to '\0'?Methodology
I don't think that an implementation would be prohibited – at least not from the quotes in your post – to implement operator++ of std::basic_string::iterator to assert on the current position being less than s.end(). But I'm pretty sure it is well defined to say s.c_str() + s.length() + 1 (or &(*s.begin())) because those are guaranteed to be raw pointers.Erigeron
And the reason to not break the abstraction is that things that are almost like strings but aren't, like string_ref, will break or get very inefficient. If you have a string_ref it can be a substring without needing to make a copy. Unless you go around demanding it stick a 0 on the end.Snip
@ChristianHackl No, end + 1 not point to zero, end point to zero, so to iterate thourhg all chars you can use begin and end and to iterate for all chars + zero byte you can or may be can not use begin end+1Sheldon
@ZanLynx: Well, a string_ref isn't a string. And for good reasons it doesn't have the additional guarantees the latter provides.Boomkin
@Boomkin point is that you or someone will want to optimize your code later and abusing a container makes replacing it without bugs much harder.Snip
@user1034749: If you agree that end points to zero, then end + 1 must be undefined behaviour, because what else do expect in the dark hole beyond \0?Methodology
All things considered, I don't see the big difference to "normal" container classes. Just like in std::vector or std::list, end does not point to an element but to "one past the last element". It just happens to be that std::string leaks a bit of its abstraction such that you can peek at "one past the last element" and conclude that it is a \0 character.Methodology
@ChristianHackl all other containers have the same end point to the dark whole as end+1 of std::string that is the point of question. yes end+1 deference is bad thing, but what about usage it as normal end in algorithmsSheldon
Perhaps the misunderstanding is about what constitutes an element. Is the \0 part of the elements or not? c_str() (or data()) seem to suggest it, but all member functions that deal with the abstraction that std::string represents, like size() or end(), clearly don't.Methodology
@ZanLynx: "Don't assume there is a 0 on the end unless you call c_str()." - c_str() is declared as const, calling it doesn't make a difference.Adnate
@ChristianHackl don't forget about length(). I find keeping the differences difficult and usually just do a test run to make sure I do or don't need the +1. Even testing it as many times as I have I still can't tell you which includes the \0 and which don't.Trapezius
@Trapezius length and size are the same: en.cppreference.com/w/cpp/string/basic_string/sizeWenzel
@KarolyHorvath: C++ introduced the mutable keyword just to confuse people who think const means const.Snip
@ZanLynx Well, it must also be constant-time, and thus the element must exist.Boomkin
@Deduplicator: I am not arguing that std::string doesn't have a zero at the end. I am saying that trying to break the container abstraction will result in bugs when the code changes to use a similar but not identical container. Because the programmer doing the change is going to assume that the code is not doing anything STUPID.Snip
@ZanLynx: A std::string isn't just a container, it's a string, with string-specific additions (members, and use of traits for customization). Yes, if there's no advantage to writing code customized to use a string, it's a good idea to stay more generic. Doesn't mean if there is some reason it must be a string, taking advantage of that is stupid.Boomkin
@Deduplicator: It IS stupid because there are dozens of containers that do a string look-alike interface. These things implement the string contract and if you break it by dereferencing end() the code will be super surprised when it gets 'e' instead of '\0'.Snip
@ZanLynx: No, there are dozens of containers which implement a part of the string-interface. Beginning with std::initializer_list, going over std::vector and not ending with std::string_view. As I said, one should always code to the most common interface one can, but if one knows the interface is bigger, and using more has any advantage, there's nothing wrong with that. Sometimes one even specializes templates to take advantage of such opportunities, and that's as it should be.Boomkin
Who would have known such a simple question would spark such a heated discussion? I guess it shows that nobody really knows (for now).Haemophiliac
@Boomkin It must exist, but it need not be a '\0'. It could even be on a different memory guarded page that traps access.Buckthorn
@Yakk: the other memory page that traps access thing could be said about any array/vector though, no?Haemophiliac
@MatthieuM. I'm saying that, prior to calling .data() or .c_str(), the std::basic_string is free to leave the trailing null uninitialized or trap and cause your program to format your hard drive it it is written (undefined behavior). It cannot reallocate to make room for it, as .data() and .c_str() are constant-time, but setting the value of the null terminator and/or changing the trap information on a page is a constant-time operation. In theory, the trap could extend to even forming the pointer to the null terminator (on some architecture with pointer trap values) AFAICT.Buckthorn
H
13

TL;DR: s.end() + 1 is undefined behavior.


std::string is a strange beast, mainly for historical reasons:

  1. It attempts to bring C compatibility, where it is known that an additional \0 character exists beyond the length reported by strlen.
  2. It was designed with an index-based interface.
  3. As an after thought, when merged in the Standard library with the rest of the STL code, an iterator-based interface was added.

This led std::string, in C++03, to number 103 member functions, and since then a few were added.

Therefore, discrepancies between the different methods should be expected.


Already in the index-based interface discrepancies appear:

§21.4.5 [string.access]

const_reference operator[](size_type pos) const;
reference operator[](size_type pos);

1/ Requires: pos <= size()

const_reference at(size_type pos) const; reference at(size_type pos);

5/ Throws: out_of_range if pos >= size()

Yes, you read this right, s[s.size()] returns a reference to a NUL character while s.at(s.size()) throws an out_of_range exception. If anyone tells you to replace all uses of operator[] by at because they are safer, beware the string trap...


So, what about iterators?

§21.4.3 [string.iterators]

iterator end() noexcept;
const_iterator end() const noexcept;
const_iterator cend() const noexcept;

2/ Returns: An iterator which is the past-the-end value.

Wonderfully bland.

So we have to refer to other paragraphs. A pointer is offered by

§21.4 [basic.string]

3/ The iterators supported by basic_string are random access iterators (24.2.7).

while §17.6 [requirements] seems devoid of anything related. Thus, strings iterators are just plain old iterators (you can probably sense where this is going... but since we came this far let's go all the way).

This leads us to:

24.2.1 [iterator.requirements.general]

5/ Just as a regular pointer to an array guarantees that there is a pointer value pointing past the last element of the array, so for any iterator type there is an iterator value that points past the last element of a corresponding sequence. These values are called past-the-end values. Values of an iterator i for which the expression *i is defined are called dereferenceable. The library never assumes that past-the-end values are dereferenceable. [...]

So, *s.end() is ill-formed.

24.2.3 [input.iterators]

2/ Table 107 -- Input iterator requirements (in addition to Iterator)

List for pre-condition to ++r and r++ that r be dereferencable.

Neither the Forward iterators, Bidirectional iterators nor Random iterator lift this restriction (and all indicate they inherit the restrictions of their predecessor).

Also, for completeness, in 24.2.7 [random.access.iterators], Table 111 -- Random access iterator requirements (in addition to bidirectional iterator) lists the following operational semantics:

  • r += n is equivalent to [inc|dec]rememting r n times
  • a + n and n + a are equivalent to copying a and then applying += n to the copy

and similarly for -= n and - n.

Thus s.end() + 1 is undefined behavior.

Haemophiliac answered 12/11, 2015 at 15:36 Comment(5)
*s.end() isn't ill-formed: It is valid syntactically, does not violate any 'diagnosable semantic rules' and doesn't violate the ODR. Similarly, *s.end() + 1 is not ill-formed. Both of these expression are in fact well-formed.Drown
So where does it say that past the end iterators are not dereferenceable? Just because the library never assumes they are dereferenceable does not make it so.Wenzel
@NathanOliver: I think I see what you mean; I had originally read it as "the library never guarantees that you can dereference a past-the-end value" whereas I see now it can be read as "whenever you submit a [begin, end) range to a library-supplied method/function, it will never dereference end" which is completely different.Haemophiliac
So then this answer is incorrect. You assume *s.end() is ill formed but nothing you presented in this answer actually says it is or isn't. You have to prove that s.end() is not dereferenceable. I had this same answer and deleted it as I could not find that.Wenzel
@NathanOliver: It may or may not be, depending on the reading of 24.2.1/5. I do not have the time right now to go and re-check the Standard for further indication of how it should be read though.Haemophiliac
D
5

Returns: A pointer p such that p + i == &operator[](i) for each i in [0,size()].

std::string::operator[](size_type i) is specified to return "a reference to an object of type charT with value charT() when i == size(), so we know that that pointer points to an object.

5.7 states that "For the purposes of [operators + and -], a pointer to a nonarray object behaves the same as a pointer to the first element of an array of length one with the type of the object as its element type."

So we have a non-array object and the spec guarantees that a pointer one past it will be representable. So we know std::addressof(*end(str)) + 1 has to be representable.

However that's not a guarantee on std::string::iterator, and there is no such guarantee anywhere in the spec, which makes it undefined behavior.

(Note that this is not the same as 'ill-formed'. *end(str) + 1 is in fact well-formed.)

Iterators can and do implement checking logic that produce various errors when you do things like increment the end() iterator. This is in fact what Visual Studios debug iterators do with end(str) + 1.

#define _ITERATOR_DEBUG_LEVEL 2
#include <string>
#include <iterator>

int main() {
  std::string s = "ssssssss";
  auto x = std::end(s) + 1; // produces debug dialog, aborts program if skipped
}

And if it isn't, why isn't it?

for consistency and interoperability with zero-terminated strings if nothing else

C++ specifies some specific things for compatibility with C, but such backwards compatibility is limited to supporting things that can actually be written in C. C++ doesn't necessarily try to take C's semantics and make new constructs behave in some analogous way. Should std::vector decay to an iterator just to be consistent with C's array decay behavior?

I'd say end(std) + 1 is left as undefined behavior because there's no value in trying to constrain std::string iterators this way. There's no legacy C code that does this that C++ needs to be compatible with and new code should be prevented from doing it.

New code should be prevented from relying on it... why? [...] What does not allowing it buy you in theory, and how does that look in practice?

Not allowing it means implementations don't have to support the added complexity, complexity which provides zero demonstrated value.

In fact it seems to me that supporting end(str) + 1 has negative value since code that tries to use it will essentially be creating the same problem as C code which can't figure out when to account for the null terminator or not. C has enough off by one buffer size errors for both languages.

Drown answered 12/11, 2015 at 5:45 Comment(3)
The Visual Studio example is striking, however it is also NOT authoritative. Is this behavior condoned by the Standard or did Dirkumware or STL err?Haemophiliac
There's no existing code relying on it... likely. At least I don't know of any. New code should be prevented from relying on it... why? That part of the conclusion needs some elaboration. What does not allowing it buy you in theory, and how does that look in practice?Boomkin
@MatthieuM. My transition from analysis of the spec to demonstrating implementation behavior was a bit incomplete. The intent was to first show that it's undefined behavior in the spec (simply due to the fact that nowhere in the spec is its behavior defined or required), and then show that implementations do in fact take advantage of that. I've added the missing phrase which I had taken as implied.Drown
B
3

A std::basic_string<???> is a container over its elements. Its elements do not include the trailing null that is implicitly added (it can include embedded nulls).

This makes lots of sense -- "for each character in this string" probably shouldn't return the trailing '\0', as that is really an implementation detail for compatibility with C style APIs.

The iterator rules for containers were based off of containers that don't shove an extra element at the end. Modifying them for std::basic_string<???> without motivation is questionable; one should only break a working pattern if there is a payoff.

There is every reason to think that pointers to .data() and .data() + .size() + 1 are allowed (I could imagine a twisted interpretation of the standard that would make it not allowed). So if you really need read-only iterators into the contents of a std::string, you can use pointer-to-const-elements (which are, after all, a kind of iterator).

If you want editable ones, then no, there is no way to get a valid iterator to one-past-the-end. Neither can you get a non-const reference to the trailing null legally. In fact, such access is clearly a bad idea; if you change the value of that element, you break the std::basic_string's invariant null-termination.

For there to be an iterator to one-past-the-end, the const and non-const iterators to the container would have to have a different valid range, or a non-const iterator to the last element that can be dereferenced but not written to must exist.

I shudder at making such standard wording watertight.

std::basic_string is already a mess. Making it even stranger would lead to standard bugs and would have a non-trivial cost. The benefit is really low; in the few cases where you want access to said trailing null in an iterator range, you can use .data() and use the resulting pointers as iterators.

Buckthorn answered 12/11, 2015 at 15:19 Comment(10)
How could the DS9K (imported from comp.lang.c) disallow creating a pointer past the NUL-terminator? Doesn't the rule that you must be able to treat an element as an array of one, and form a pointer just past, guarantee that? Now the problem of allowing a non-const pointer to a const element just for std::string's terminator, that looks like a deal-breaker for modifiable iterators, and because consistency is really a good idea, also for constant ones, yes.Boomkin
@Boomkin It is really easy to show that the NUL-terminator is contiguous with the rest of the data. And yes, there is no way to implement an element in C++ such that a pointer one past its end is not valid; but the insane convoluted argument I'm imagining (but not making) is based on the fact that the C++ standard does not require that the C++ standard library be implemented in C++. It specifies certain behavior. We have the guarantee that int x; (&x)+1; is ok, but all we have for the last element is a pointer such that *p == '\0' and a reference to same.Buckthorn
@Boomkin I don't think implementing based off such an argument (if it is valid) is reasonable, and I'd consider any use of such an argument in a compiler evidence that the standard is defective, but I'm saying that such an argument exists, and might even be a valid one. I could be wrong; I'm not that interested in an argument that, even if correct, points out a standard defect with no real-world impact (because nobody is going to exploit it, no, really).Buckthorn
Thanks, that one made me laugh, because the argument is so ludicrous but strictly might actually work according to the letter of the standard. Yes, I concurr that any implementer actually going that way is beyond reason, or just trying to screw with users minds...Boomkin
I believe it would be legal for a string implementation to internally hold strings using non-consecutive allocations until such time as someone requests a pointer to the backing store, would it not? Such a thing could improve the performance of repeated concatenations to a string, since concatenation of each new string would just require allocating enough space for a copy of the new content and a pointer to the earlier string [consolidating the chains when they get too big, or when code asks for a pointer to the string content]. In such a system, the trailing zero might not be stored at all...Viaticum
...unless or until code asks for the address of the string data, and there would be no particular reason to expect an iterator to be able to point past something that doesn't even exist.Viaticum
@Viaticum There are constant-time requirements on the .c_str() and .data() members, which means they cannot transcribe the string from some other store to single buffer. Amusingly, the string could have a pre-allocated buffer of virtual pages of the right size (but "no content") that, when accessed, trap the access and do the transcription from another buffer, at least with POD character types. I think. Basically there are serious restrictions on what you can do with string implementations due to O-notation restrictions.Buckthorn
@Yakk: Could one overcome the O-notation restrictions by saying that any strings less than 500 quadrillion characters will use "copy-on-fright" while those larger than that will pre-allocate? After all, O(500000000000000000) and O(1) are equivalent. Or is there more to it than that? While I could see some usefulness to a requirement that the total amount of time required to perform some combination of operations must be O(something), I'm less clear on the value of forbidding the deferral of certain operations in cases where the total time of all operations performed on an object to date...Viaticum
...would meet requirements. If the time required to construct a string of length n and perform m access to it is O(n)+O(m), what practical harm would come from allowing the first access to take time O(n) if the total time constraints were met?Viaticum
@Yakk: Personally, I'm something of a fan of a form of "copy on demand" semantics where all data is either sharable or mutable, but never both. For it to really work well one must predict whether it will be necessary to make more than one logical "copy" of something before it's modified [guessing wrong will result in an otherwise-unnecessary copy operation] but it avoids a lot of the threading issues and other complexity of trying to use "copy on write" with mutable data.Viaticum
A
1

I can't find a definitive answer, but indirect evidence points at end()+1 being undefined.

[string.insert]/15

constexpr iterator insert(const_iterator p, charT c);
Preconditions: p is a valid iterator on *this.

It would be unreasonable to expect this to work with end()+1 as the iterator, and it indeed causes a crash on both libstdc++ and libc++.

This means end()+1 is not a valid iterator, meaning end() is not incrementable.

Allopathy answered 11/3, 2021 at 21:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.