Legality of COW std::string implementation in C++11
Asked Answered
N

7

131

It had been my understanding that copy-on-write is not a viable way to implement a conforming std::string in C++11, but when it came up in discussion recently I found myself unable to directly support that statement.

Am I correct that C++11 does not admit COW based implementations of std::string?

If so, is this restriction explicitly stated somewhere in the new standard (where)?

Or is this restriction implied, in the sense that it is the combined effect of the new requirements on std::string that precludes a COW based implementation of std::string. In this case, I'd be interested in a chapter and verse style derivation of 'C++11 effectively prohibits COW based std::string implementations'.

Nightmare answered 30/8, 2012 at 14:54 Comment(1)
The GCC bug for their COW string is gcc.gnu.org/bugzilla/show_bug.cgi?id=21334#c45 . One of the bugs tracking a new C++11 compilant implementation of std::string in libstdc++ is gcc.gnu.org/bugzilla/show_bug.cgi?id=53221Ford
T
130

It's not allowed, because as per the standard 21.4.1 p6, invalidation of iterators/references is only allowed for

— as an argument to any standard library function taking a reference to non-const basic_string as an argument.

— Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.

Technicality answered 30/8, 2012 at 15:6 Comment(13)
Some rationale: N2534Sheepwalk
@Cheersandhth.-Alf: The logic can be seen in the following if COW were allowed: std::string a("something"); char& c1 = a[0]; std::string b(a); char& c2 = a[1]; c1 is a reference to a. You then "copy" a. Then, when you attempt to take the reference the second time, it has to make a copy to get a non-const reference since there are two strings that point to the same buffer. This would have to invalidate the first reference taken, and is against the section quoted above.Technicality
@DaveS: In the case you sketch there is no buffer sharing, since there are (possible) external references to the original string's buffer. Thus the conlusion about making a copy later is wrong. Just to top it off the assumption that that would invalidate references to the original string's buffer is also wrong, but very irrelevant since the late copying doesn't happen here. More generally it's logically impossible to construct an example that does what you want, because avoiding that case is what COW is designed to do. That's why every attempt ends up doing umpteen fallacies.Nicholas
@Cheersandhth.-Alf, according to this, at least GCC's COW implementation does do exactly what DaveS is saying. So at least that style of COW is prohibited by the standard.Nuthatch
@Alf: This answer argues that non-const operator[] (1) must make a copy and that (2) it is illegal for it to do so. Which of those two points do you disagree with? Looking at your first comment, it seems that an implementation could share the string, at least under this requirement, up until the time it is accessed, but that both read and write accesses would need to unshare it. Is that your reasoning?Overcasting
@BenVoigt: quoting myself, "At the time of a COW copying there are no references or iterators that can be invalidated". The assertion that copying necessarily invalidates references is just wrong. No example can be constructed, because it is not true. And I am pretty sure you understood that already.Nicholas
For clarity, by "COW copying" are you talking about incrementing the sharing count, or unsharing?Overcasting
@BenVoigt: I added a proper answer. Contrary to what I expected (but as I held open in my first comment here in this list) it is indeed practically impossible to provide a conforming ref-counted implementation for C++11 and later, due to new non-throw requirements.Nicholas
@Cheersandhth.-Alf, what is wrong with my reasoning at gcc.gnu.org/bugzilla/show_bug.cgi?id=21334#c47 ?Wabble
@JonathanWakely: I see nothing wrong with it. It's a fine bug report. Your reasoning (or rather observations) there appear to 100% valid. On the other hand, the view that that bug means something about the formal, that someone else has claimed, is nonsense.Nicholas
Let us continue this discussion in chat.Wabble
@Cheersandhth.-Alf, I have added my own answer, which simply expands on the correct answer by Dave S above. Maybe if you now re-read my comments above you will see that I was asking you to provide some justification for your own seemingly extraordinary claims such as "On the other hand, the view that that bug means something about the formal, that someone else has claimed, is nonsense." (that was my own view that you labelled as nonsense, yet I'm the one accused of ad hominem attacks and getting personal when I use the same word! How offensive!)Wabble
@DaveS Is it really invalidated? Because c1 is still alive through b. It is true that &c1 == &a[0] is no longer true, but c1 is a reference to an object (the storage region of the character itself), not to a name (a[0]), and the object is still alive so the reference is well-defined.Abscission
W
58

The answers by Dave S and gbjbaanb are correct. (And Luc Danton's is correct too, although it's more a side-effect of forbidding COW strings rather than the original rule that forbids it.)

But to clear up some confusion, I'm going to add some further exposition. Various comments link to a comment of mine on the GCC bugzilla which gives the following example:

std::string s("str");
const char* p = s.data();
{
    std::string s2(s);
    (void) s[0];
}
std::cout << *p << '\n';  // p is dangling

The point of that example is to demonstrate why GCC's reference counted (COW) string is not valid in C++11. The C++11 standard requires this code to work correctly. Nothing in the code permits the p to be invalidated in C++11.

Using GCC's old reference-counted std::string implementation, that code has undefined behaviour, because p is invalidated, becoming a dangling pointer. (What happens is that when s2 is constructed it shares the data with s, but obtaining a non-const reference via s[0] requires the data to be unshared, so s does a "copy on write" because the reference s[0] could potentially be used to write into s, then s2 goes out of scope, destroying the array pointed to by p).

The C++03 standard explicitly permits that behaviour in 21.3 [lib.basic.string] p5 where it says that subsequent to a call to data() the first call to operator[]() may invalidate pointers, references and iterators. So GCC's COW string was a valid C++03 implementation.

The C++11 standard no longer permits that behaviour, because no call to operator[]() may invalidate pointers, references or iterators, irrespective of whether they follow a call to data().

So the example above must work in C++11, but does not work with libstdc++'s kind of COW string, therefore that kind of COW string is not permitted in C++11.

Wabble answered 22/3, 2015 at 20:55 Comment(5)
Let us continue this discussion in chat.Wabble
Good observation, those are new C++11 requirements, and they can't be met by a COW implementation. In addition to noexcept can't be met. Re “OK to be wrong”, will you delete this answer? When I get the time (sorry, I've used far too much already) I'll update my answer with both your observations: that (1) C++03 made a special exemption for first time calls, where e.g. s[i] can invalidate a pointer obtained from s.data(), even though in general client code can't know whether a call is first time or not, and (2) new O(1) complexity requirements on string operations, where C++03 had none.Nicholas
@JonathanWakely: I can imagine some cases where a string which shares backing stores until they're accessed could be useful. For example, concatenating two strings with sharable backing stores could yield a new string object that holds references to both backing stores; concatenating that to something else could build up a tree. If intermediate strings are not used for any purpose except as fodder for future concatenations, such an approach could seem useful. It wouldn't quite be "copy-on-write" in the usual sense, but the idea of sharing data as long as practical would remain.Noach
@supercat, that would add a lot of overhead but would pessimize all uses that don't involve concatenating strings. Also see the point I made at stackoverflow.com/questions/12199710/…Wabble
−1 "The answers by Dave S and gbjbaanb are correct." is a falsehood. In particular, Dave's assertion is irrational, with language that has the look & feel of logic, but isn't. The conclusion at the end in this answer, that the buggy COW in g++'s libstdc++ isn't permitted, is OK.Nicholas
V
21

It is, CoW is an acceptable mechanism for making faster strings... but...

it makes multithreading code slower (all that locking to check if you're the only one writing kills performance when using a lot of strings). This was the main reason CoW was killed off years ago.

The other reasons are that the [] operator will return you the string data, without any protection for you to overwrite a string someone else expects to be unchanging. The same applies to c_str() and data().

Quick google says that the multithreading is basically the reason it was effectively disallowed (not explicitly).

The proposal says :

Proposal

We propose to make all iterator and element access operations safely concurrently executable.

We are increasing the stability of operations even in sequential code.

This change effectively disallows copy-on-write implementations.

followed by

The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.

Ropes are part of STLPort and SGIs STL.

Violet answered 30/8, 2012 at 15:1 Comment(2)
The operator[] issue isn't really a problem. The const variant does offer protection, and the non-const variant always has the option of doing the CoW at that time (or being really crazy and setting up a page fault to trigger it).Literatim
it's just silly that a std::cow_string class wasn't included, with lock_buffer(), etc. there are lots of times i know threading isn't an issue. more often than not, actually.Alatea
N
7

Is COW basic_string prohibited in C++11 and later?

Regarding

Am I correct that C++11 does not admit COW based implementations of std::string?

Yes.

Regarding

If so, is this restriction explicitly stated somewhere in the new standard (where)?

Almost directly, by requirements of constant complexity for a number of operations that would require O(n) physical copying of the string data in a COW implementation.

For example, for the member functions

auto operator[](size_type pos) const -> const_reference;
auto operator[](size_type pos) -> reference;

… which in a COW implementation would ¹both trigger string data copying to un-share the string value, the C++11 standard requires

C++11 §21.4.5/4:

Complexity: constant time.

… which rules out such data copying, and hence, COW.

C++03 supported COW implementations by not having these constant complexity requirements, and by, under certain restrictive conditions, allowing calls to operator[](), at(), begin(), rbegin(), end(), or rend() to invalidate references, pointers and iterators referring to the string items, i.e. to possibly incur a COW data copying. This support was removed in C++11.


Is COW also prohibited via the C++11 invalidation rules?

In another answer which at the time of writing is selected as solution, and which is heavily upvoted and therefore apparently believed, it's asserted that

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the [quoted] paragraph above [C++11 §21.4.1/6]. Hence, it's no longer legal to have a COW string in C++11.

That assertion is incorrect and misleading in two main ways:

  • It incorrectly indicates that only the non-const item accessors need to trigger a COW data copying.
    But also the const item accessors need to trigger data copying, because they allow client code to form references or pointers that (in C++11) it's not permitted to invalidate later via the operations that can trigger COW data copying.
  • It incorrectly assumes that COW data copying can cause reference invalidation.
    But in a correct implementation COW data copying, un-sharing the string value, is done at a point before there are any references that can be invalidated.

To see how a correct C++11 COW implementation of basic_string would work, when the O(1) requirements that make this invalid are ignored, think of an implementation where a string can switch between ownership policies. A string instance starts out with policy Sharable. With this policy active there can be no external item references. The instance can transition to Unique policy, and it must do so when an item reference is potentially created such as with a call to .c_str() (at least if that produces a pointer to the internal buffer). In the general case of multiple instances sharing ownership of the value, this entails copying the string data. After that transition to Unique policy the instance can only transition back to Sharable by an operation that invalidates all references, such as assignment.

So, while that answer's conclusion, that COW strings are ruled out, is correct, the reasoning offered is incorrect and strongly misleading.

I suspect the cause of this misunderstanding is a non-normative note in C++11's annex C:

C++11 §C.2.11 [diff.cpp03.strings], about §21.3:

Change: basic_string requirements no longer allow reference-counted strings
Rationale: Invalidation is subtly different with reference-counted strings. This change regularizes behavor (sic) for this International Standard.
Effect on original feature: Valid C ++ 2003 code may execute differently in this International Standard

Here the rationale explains the primary why one decided to remove the C++03 special COW support. This rationale, the why, is not how the standard effectively disallows COW implementation. The standard disallows COW via the O(1) requirements.

In short, the C++11 invalidation rules don't rule out a COW implementation of std::basic_string. But they do rule out a reasonably efficient unrestricted C++03-style COW implementation like the one in at least one of g++'s standard library implementations. The special C++03 COW support allowed practical efficiency, in particular using const item accessors, at the cost of subtle, complex rules for invalidation:

C++03 §21.3/5 which includes “first call” COW support:

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:
— As an argument to non-member functions swap() (21.3.7.8), operator>>() (21.3.7.9), and getline() (21.3.7.9).
— As an argument to basic_string::swap().
— Calling data() and c_str() member functions.
— Calling non-const member functions, except operator[](), at(), begin(), rbegin(), end(), and rend().
— Subsequent to any of the above uses except the forms of insert() and erase() which return iterators, the first call to non-const member functions operator[](), at(), begin(), rbegin(), end(), or rend().

These rules are so complex and subtle that I doubt many programmers, if any, could give a precise summary. I could not.


What if O(1) requirements are disregarded?

If the C++11 constant time requirements on e.g. operator[] are disregarded, then COW for basic_string could be technically feasible, but difficult to implement.

Operations which could access the contents of a string without incurring COW data copying include:

  • Concatenation via +.
  • Output via <<.
  • Using a basic_string as argument to standard library functions.

The latter because the standard library is permitted to rely on implementation specific knowledge and constructs.

Additionally an implementation could offer various non-standard functions for accessing string contents without triggering COW data copying.

A main complicating factor is that in C++11 basic_string item access must trigger data copying (un-sharing the string data) but is required to not throw, e.g. C++11 §21.4.5/3 “Throws: Nothing.”. And so it can't use ordinary dynamic allocation to create a new buffer for COW data copying. One way around this is to use a special heap where memory can be reserved without being actually allocated, and then reserve the requisite amount for each logical reference to a string value. Reserving and un-reserving in such a heap can be constant time, O(1), and allocating the amount that one has already reserved, can be noexcept. In order to comply with the standard's requirements, with this approach it seems there would need to be one such special reservation-based heap per distinct allocator.


Notes:
¹ The const item accessor triggers a COW data copying because it allows the client code to obtain a reference or pointer to the data, which it's not permitted to invalidate by a later data copying triggered by e.g. the non-const item accessor.

Nicholas answered 29/7, 2016 at 5:53 Comment(16)
"But in a correct implementation COW data copying, un-sharing the string value, is done at a point before there are any references that can be invalidated." The example in my answer contradicts this, showing how taking a reference before any sharing takes place gives you a reference that can later become invalid. You can't unshare a string that isn't shared yet. If you're going to downvote other answers because you disagree with the details you need to get your specifics right.Wabble
"Your example is a good example of an incorrect-for- C++11 implementation. Possibly it was correct for C++03." Yes that's the point of the example. It shows a COW string that was legal in C++03 because it doesn't break the old iterator invalidation rules and is not legal in C++11 because it does break the new iterator invalidation rules. And it also contradicts the statement I quoted in the comment above.Wabble
@JonathanWakely: No, your example of an incorrect implementation doesn't contradict anything. This is the same as with an incorrect attempt to open a door, e.g. sniffing on the hinges. The fact that that incorrect approach doesn't open door, has no bearing at all on whether the door can be opened: it's just stupid.Nicholas
@Jonathan: In case you fail to see how a correct implementation (modulo ignoring the O(1) requirements, i.e. correct apart from that) would work, when s.c_str() is called s becomes unshared, if it was shared/sharable.Nicholas
What if s isn't shared yet when you call s.c_str()? How do you deal with later invalidation of the pointer returned by c_str()? The string needs to remember if any reference was ever taken, even while it was uniquely owned, and then never allow sharing. That's impractical in the real world.Wabble
@Jonathan: A COW string is initially shared. This means that it acts as if it has a separate value with a reference count, like a shared_ptr.Nicholas
No, it's initially uniquely owned. That's not shared. It becomes shared when the reference count goes above one. If you're going to redefine what COW means, distinct from what the question is about, then do so somewhere else.Wabble
Assuming an incorrect implementation like you do with "No, it's initially uniquely owned", leads nowhere: it's circular logic. Or worse. My sniffing doesn't open the door, it can't be opened. What, use the handle? No, proper door opening, as I define it, is done exclusively with hinge-sniffing.Nicholas
@JonathanWakely: Again assuming that you mean what you write, and just don't entirely grok things, think of an implementation where a string can switch between ownership policies. A COW string starts out with policy Sharable. It can transition to Unique policy, and it must do so when an item reference is potentially created, such as with a call to .c_str() (at least if that produces a pointer to the internal buffer). This generally entails copying the string data. After that it can only transition Sharable by an operation that invalidates all references, e.g. assignment. Is that clear?Nicholas
If you'd said sharable not initially shared I would not have argued. Saying something is initially shared is just confusing. Shared with itself? That's not what the word means. But I repeat: your attempt to argue that the C++11 iterator invalidation rules don't forbid some hypothetical COW string that was never used in practice (and would have unacceptable performance), when they most certainly do forbid the kind of COW string that was used in practice, is somewhat academic and pointless.Wabble
Your proposed COW string is interesting, but I'm not sure how useful it would be. The point of a COW string is to only copy the string data in the event that the two strings are written to. Your suggested implementation requires copying when any user-defined read operation happens. Even if the compiler knows its only a read, it must still copy. Furthermore, copying a Unique string will result in a copy of its string data (presumably to a Sharable state), which again makes COW rather pointless. So without the complexity guarantees, you could write... a really crappy COW string.Sherris
So while you are technically correct that the complexity guarantees are what prevents you from writing any form of COW, it really is [basic.string]/5 that prevents you from writing any genuinely useful form of COW string.Sherris
"requires copying when any user-defined read operation happens", no, that is incorrect.Nicholas
"it really is [basic.string]/5 that prevents you from writing any genuinely useful form of COW string", happily the question wasn't about subjective opinion of usefulness :-), but about what's absolutely permitted or not.Nicholas
The OP said "It had been my understanding that copy-on-write is not a viable way to implement a conforming std::string in C++11". Unacceptable performance would not be a viable way to implement a conforming std::string. My users would not accept it, and they wouldn't accept it if I told them their subjective opinion on its usefulness was not relevant.Wabble
@JonathanWakely: (1) Your quote is not the question. Here is the question: “Am I correct that C++11 does not admit COW based implementations of std::string? If so, is this restriction explicitly stated somewhere in the new standard (where)?” (2) Your opinion that a COW std::string, when disregarding the O(1) requirements, would be inefficient, is your opinion. I don't know what the performance could be, but I think that that assertion is put forward more for the feel of it, for the vibes that it conveys, than for any relevance to this answer.Nicholas
M
5

From 21.4.2 basic_string constructors and assignment operators [string.cons]

basic_string(const basic_string<charT,traits,Allocator>& str);

[...]

2 Effects: Constructs an object of class basic_string as indicated in Table 64. [...]

Table 64 helpfully documents that after construction of an object via this (copy) constructor, this->data() has as value:

points at the first element of an allocated copy of the array whose first element is pointed at by str.data()

There are similar requirements for other similar constructors.

Munger answered 30/8, 2012 at 15:2 Comment(0)
M
2

Since it is now guaranteed that strings are stored contiguously and you are now allowed to take a pointer to the internal storage of a string, (i.e. &str[0] works like it would for an array), it's not possible to make a useful COW implementation. You would have to make a copy for way too many things. Even just using operator[] or begin() on a non-const string would require a copy.

Mickiemickle answered 30/8, 2012 at 15:0 Comment(6)
I think strings in C++11 are guaranteed to be stored contiguously.Pompey
In the past you had to do the copies in all those situations and it was not a problem...Petey
@Pompey yes, but they weren't previouslyMickiemickle
Although C++11 does guarantee strings are contiguous, that is orthogonal to banning COW strings. GCC's COW string is contiguous, so clearly your claim that "it's not possible to make a useful COW implementation" is bogus.Wabble
@JonathanWakely: If one concatenates two strings and ask the resulting string for its backing store, it must report a contiguous sequence of characters. Are there any requirements about the storage of strings whose backing store has never been exposed?Noach
@supercat, asking for the backing store (e.g. by calling c_str()) must be O(1) and cannot throw, and must not introduce data races, so it's very difficult to meet those requirements if you lazily concatenate. In practice the only reasonable option is to always store contiguous data.Wabble
A
-1

I was always wondering about immutable cows: once cow is created I could be changed only through assignment from another cow, hence it will be compliant with the standard.

I had time to try it today for a simple comparison test: a map of size N keyed by string/cow with every node holding a set of all strings in the map (we have NxN number of objects).

With strings sized ~300 bytes and N=2000 cows are slightly faster and use almost order of magnitude less memory. See below, sizes are in kbs, run b is with cows.

~/icow$ ./tst 2000
preparation a
run
done a: time-delta=6 mem-delta=1563276
preparation b
run
done a: time-delta=3 mem-delta=186384
Aurist answered 16/10, 2019 at 1:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.