Is COW basic_string prohibited in C++11 and later?
Regarding
” Am I correct that C++11 does not admit COW based implementations of std::string?
Yes.
Regarding
” If so, is this restriction explicitly stated somewhere in the new standard (where)?
Almost directly, by requirements of constant complexity for a number of operations that would require O(n) physical copying of the string data in a COW implementation.
For example, for the member functions
auto operator[](size_type pos) const -> const_reference;
auto operator[](size_type pos) -> reference;
… which in a COW implementation would ¹both trigger string data copying to un-share the string value, the C++11 standard requires
C++11 §21.4.5/4:
” Complexity: constant time.
… which rules out such data copying, and hence, COW.
C++03 supported COW implementations by not having these constant complexity requirements, and by, under certain restrictive conditions, allowing calls to operator[](), at(), begin(), rbegin(), end(), or rend() to invalidate references, pointers and iterators referring to the string items, i.e. to possibly incur a COW data copying. This support was removed in C++11.
Is COW also prohibited via the C++11 invalidation rules?
In another answer which at the time of writing is selected as solution, and which is heavily upvoted and therefore apparently believed, it's asserted that
” For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the [quoted] paragraph above [C++11 §21.4.1/6]. Hence, it's no longer legal to have a COW string in C++11.
That assertion is incorrect and misleading in two main ways:
- It incorrectly indicates that only the non-const item accessors need to trigger a COW data copying.
But the const item accessors also need to trigger data copying, because they allow client code to form references or pointers that (in C++11) must not be invalidated later by the operations that can trigger COW data copying.
- It incorrectly assumes that COW data copying can cause reference invalidation.
But in a correct implementation COW data copying, un-sharing the string value, is done at a point before there are any references that can be invalidated.
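The first point can be checked against any conforming C++11 library. The following is a minimal sketch, assuming a non-COW std::string; the function name is hypothetical. A reference obtained through the const accessor must still designate the string's first item after the string has been copied and then modified through the non-const accessor. With a C++03-style COW implementation whose const accessor doesn't un-share, the non-const access would move the data of a to a new buffer and leave the reference pointing into the old one, so the function would return false.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns true if a reference obtained via the const
// operator[] still designates a's first item after a copy of a has been
// made and a has been modified via the non-const operator[].
bool const_reference_survives_modification()
{
    std::string a = "a string long enough to avoid any small buffer";
    const std::string& ca = a;

    const char& r = ca[0];   // Reference formed via the const accessor.
    std::string b = a;       // A COW implementation would share the buffer here.
    a[0] = 'X';              // A COW implementation would have to un-share here.

    // C++11 requires r to remain valid, so it must observe the change,
    // while the copy b keeps the original value.
    return &r == &a[0] && r == 'X' && b[0] == 'a';
}
```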
To see how a correct C++11 COW implementation of basic_string would work, when the O(1) requirements that make this invalid are ignored, think of an implementation where a string can switch between ownership policies. A string instance starts out with policy Sharable. With this policy active there can be no external item references. The instance can transition to Unique policy, and it must do so when an item reference is potentially created such as with a call to .c_str() (at least if that produces a pointer to the internal buffer). In the general case of multiple instances sharing ownership of the value, this entails copying the string data. After that transition to Unique policy the instance can only transition back to Sharable by an operation that invalidates all references, such as assignment.
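The policy switch described above can be sketched roughly as follows. This is a toy illustration only, not how any real library implements basic_string: the class CowString and all its member names are hypothetical, it is not thread safe, it only offers item access by index, and it deliberately ignores the C++11 O(1) and "Throws: Nothing" requirements.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical toy class illustrating the Sharable/Unique policy switch.
class CowString
{
    mutable std::shared_ptr<std::string> data_;  // Possibly shared buffer.
    mutable bool sharable_ = true;               // False means Unique policy.

    // Transition to Unique: copies the data, O(n), if the buffer is shared.
    void make_unique() const
    {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
        sharable_ = false;
    }

    // A Sharable source can be shared; a Unique one must be copied, because
    // there may be external references into its buffer.
    static std::shared_ptr<std::string> acquire(const CowString& other)
    {
        return other.sharable_ ? other.data_
                               : std::make_shared<std::string>(*other.data_);
    }

public:
    explicit CowString(const char* s)
        : data_(std::make_shared<std::string>(s)) {}

    CowString(const CowString& other) : data_(acquire(other)) {}

    // Assignment invalidates all references into *this, so the instance
    // can transition back to the Sharable policy.
    CowString& operator=(const CowString& other)
    {
        if (this != &other) {
            data_ = acquire(other);
            sharable_ = true;
        }
        return *this;
    }

    // Both item accessors hand out a reference, so both must un-share first.
    const char& operator[](std::size_t i) const { make_unique(); return (*data_)[i]; }
    char& operator[](std::size_t i) { make_unique(); return (*data_)[i]; }

    bool is_sharable() const { return sharable_; }
    bool shares_buffer_with(const CowString& other) const
    {
        return data_ == other.data_;
    }
};
```

In this sketch the un-sharing always happens before the reference is handed out, so a later modification through the same instance never moves the data that an existing reference designates.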
So, while that answer's conclusion, that COW strings are ruled out, is correct, the reasoning offered is incorrect and strongly misleading.
I suspect the cause of this misunderstanding is a non-normative note in C++11's annex C:
C++11 §C.2.11 [diff.cpp03.strings], about §21.3:
Change: basic_string requirements no longer allow reference-counted strings
Rationale: Invalidation is subtly different with reference-counted strings. This change regularizes behavor (sic) for this International Standard.
Effect on original feature: Valid C++ 2003 code may execute differently in this International Standard
Here the rationale explains the primary reason why the special C++03 COW support was removed. This rationale, the why, is not how the standard effectively disallows COW implementations. The standard disallows COW via the O(1) requirements.
In short, the C++11 invalidation rules don't rule out a COW implementation of std::basic_string. But they do rule out a reasonably efficient unrestricted C++03-style COW implementation like the one in at least one of g++'s standard library implementations. The special C++03 COW support allowed practical efficiency, in particular using const item accessors, at the cost of subtle, complex rules for invalidation:
C++03 §21.3/5 which includes “first call” COW support:
” References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:
— As an argument to non-member functions swap() (21.3.7.8), operator>>() (21.3.7.9), and getline() (21.3.7.9).
— As an argument to basic_string::swap().
— Calling data() and c_str() member functions.
— Calling non-const member functions, except operator[](), at(), begin(), rbegin(), end(), and rend().
— Subsequent to any of the above uses except the forms of insert() and erase() which return iterators, the first call to non-const member functions operator[](), at(), begin(), rbegin(), end(), or rend().
These rules are so complex and subtle that I doubt many programmers, if any, could give a precise summary. I could not.
What if O(1) requirements are disregarded?
If the C++11 constant time requirements on e.g. operator[] are disregarded, then COW for basic_string could be technically feasible, but difficult to implement.
Operations which could access the contents of a string without incurring COW data copying include:
- Concatenation via +.
- Output via <<.
- Using a basic_string as argument to standard library functions.
The latter because the standard library is permitted to rely on implementation specific knowledge and constructs.
Additionally an implementation could offer various non-standard functions for accessing string contents without triggering COW data copying.
A main complicating factor is that in C++11 basic_string item access must trigger data copying (un-sharing the string data) but is required to not throw, e.g. C++11 §21.4.5/3 “Throws: Nothing.” And so it can't use ordinary dynamic allocation to create a new buffer for COW data copying. One way around this is to use a special heap where memory can be reserved without being actually allocated, and then reserve the requisite amount for each logical reference to a string value. Reserving and un-reserving in such a heap can be constant time, O(1), and allocating the amount that one has already reserved, can be noexcept. In order to comply with the standard's requirements, with this approach it seems there would need to be one such special reservation-based heap per distinct allocator.
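The reservation idea can be illustrated with a very small sketch. Everything here is an assumption made for illustration: the class ReservingArena and its member names are hypothetical, the storage is a fixed-size block obtained up front, memory that has been handed out is never reclaimed, and alignment is ignored since only char storage is handed out. The point is only that reserving can fail (at a place where throwing is permitted, such as copy construction), while allocating an already reserved amount cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy arena: bytes are first reserved (which may fail), and a
// reserved amount can later be allocated without any possibility of failure.
class ReservingArena
{
    std::vector<char> storage_;   // All memory is obtained up front.
    std::size_t reserved_ = 0;    // Bytes promised but not yet handed out.
    std::size_t used_ = 0;        // Bytes already handed out.
    // Invariant: used_ + reserved_ <= storage_.size().

public:
    explicit ReservingArena(std::size_t capacity) : storage_(capacity) {}

    // O(1). Returns false if the request cannot be promised; the caller
    // could then throw std::bad_alloc where throwing is permitted.
    bool reserve(std::size_t n) noexcept
    {
        if (n > storage_.size() - used_ - reserved_) { return false; }
        reserved_ += n;
        return true;
    }

    // O(1). Gives back a reservation that turned out not to be needed.
    void unreserve(std::size_t n) noexcept
    {
        assert(n <= reserved_);
        reserved_ -= n;
    }

    // O(1) and cannot fail: the bytes were promised by an earlier reserve().
    char* allocate_reserved(std::size_t n) noexcept
    {
        assert(n <= reserved_);
        reserved_ -= n;
        char* const p = storage_.data() + used_;
        used_ += n;
        return p;
    }
};
```

In a COW string built on such a heap, each new logical reference to a shared value would reserve the value's size, and the un-sharing in an item accessor would call the non-failing allocation function.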
Notes:
¹ The const item accessor triggers a COW data copying because it allows the client code to obtain a reference or pointer to the data, which it's not permitted to invalidate by a later data copying triggered by e.g. the non-const item accessor.