char *stpcpy(char *dest, const char *src);
returns a pointer to the end of the string, and is part of POSIX.1-2008. Before that, it was a GNU libc extension since 1992. It first appeared in Lattice C AmigaDOS in 1986.
gcc -O3 will in some cases optimize strcpy + strcat to use stpcpy or strlen + inline copying, see below.
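As a minimal sketch of what the end pointer from stpcpy buys you, the following joins three strings without re-scanning the destination. The helper name join3 is hypothetical (not from any library), and it assumes the caller has made the buffer large enough.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX.1-2008 declarations such as stpcpy */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes a + sep + b into out and returns the total length.
   Assumption: the caller guarantees out has room for all three strings plus '\0'. */
size_t join3(char *out, const char *a, const char *sep, const char *b)
{
    assert(out && a && sep && b);
    char *end = stpcpy(out, a);   /* end points at the terminating '\0' */
    end = stpcpy(end, sep);       /* continue from there, no re-scan of out */
    end = stpcpy(end, b);
    return (size_t)(end - out);
}
```

Each call starts writing exactly where the previous one stopped, so every byte of the output is written once and never searched for again.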
C's standard library was designed very early, and it's very easy to argue that the str* functions are not optimally designed. The I/O functions were definitely designed very early, in 1972 before C even had a preprocessor, which is why fopen(3) takes a mode string instead of a flag bitmap like Unix open(2).
I haven't been able to find a list of functions included in Mike Lesk's "portable I/O package", so I don't know whether strcpy in its current form dates all the way back to there or if those functions were added later. (The only real source I've found is Dennis Ritchie's widely-known C History article, which is excellent but not that in depth. I didn't find any documentation or source code for the actual I/O package itself.)
They do appear in their current form in K&R first edition, 1978.
Functions should return the result of the computation they do, if it's potentially useful to the caller, instead of throwing it away: either a pointer to the end of the string, or an integer length. (A pointer would be natural.)
As @R says:
We all wish these functions returned a pointer to the terminating null byte (which would reduce a lot of O(n) operations to O(1))
For example, calling strcat(bigstr, newstr[i]) in a loop to build up a long string from many short (O(1) length) strings has approximately O(n^2) complexity, but strlen/memcpy will only look at each character twice (once in strlen, once in memcpy).
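The strlen/memcpy approach can be sketched like this, keeping a pointer to the current end so the already-built prefix is never re-scanned. The names append_at and concat_all are hypothetical helpers for illustration, and the capacity check is a plain assert rather than real error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends src at end (which must point at the current
   terminating '\0') and returns the new end. limit is one past the last
   byte of the buffer. */
char *append_at(char *end, const char *limit, const char *src)
{
    size_t len = strlen(src);               /* first pass over src */
    assert(len < (size_t)(limit - end));    /* room for src plus '\0' */
    memcpy(end, src, len + 1);              /* second pass, copies the '\0' too */
    return end + len;
}

/* Builds one string from count pieces; total work is linear in the output. */
size_t concat_all(char *buf, size_t cap, const char *const pieces[], size_t count)
{
    char *end = buf;
    assert(cap > 0);
    *end = '\0';
    for (size_t i = 0; i < count; i++)
        end = append_at(end, buf + cap, pieces[i]);
    return (size_t)(end - buf);
}
```

Compared with calling strcat in the loop, the only state carried between iterations is the end pointer, which is exactly the value strcat computes internally and then discards.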
Using only the ANSI C standard library, there's no efficient way to look at every character only once. You could manually write a byte-at-a-time loop, but for strings longer than a few bytes, that's worse than looking at each character twice with current compilers (which won't auto-vectorize a search loop) on modern HW, given efficient libc-provided SIMD strlen and memcpy. You could use length = sprintf(bigstr, "%s", newstr[i]); bigstr+=length;, but sprintf() has to parse its format string and is not fast.
There isn't even a version of strcmp or memcmp that returns the position of the difference. If that's what you want, you have the same problem as in the question "Why is string comparison so fast in python?": there is an optimized library function that runs faster than anything you can do with a compiled loop (unless you have hand-optimized asm for every target platform you care about). You can use it to get close to the differing byte, then fall back to a regular loop for the last stretch.
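One way to sketch that "library function first, plain loop last" idea is to let memcmp skip identical blocks and only scan byte-by-byte inside the block that differs. The function name first_diff and the 64-byte block size are assumptions chosen for illustration, not tuned values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the index of the first differing byte of
   a and b, or n if the first n bytes are equal. */
size_t first_diff(const void *a, const void *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    const unsigned char *pa = a, *pb = b;
    const size_t block = 64;   /* arbitrary block size, an assumption to tune */
    size_t i = 0;

    /* Let the optimized memcmp skip whole blocks that are identical. */
    while (n - i >= block && memcmp(pa + i, pb + i, block) == 0)
        i += block;

    /* Any difference is now within the next block or the tail: plain loop. */
    while (i < n && pa[i] == pb[i])
        i++;
    return i;
}
```

The differing block is read twice (once by memcmp, once by the loop), which is the same "look at some bytes twice" compromise as strlen + memcpy.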
It seems that C's string library was designed without regard to the O(n) cost of any operation, not just finding the end of implicit-length strings, and strcpy's behaviour is definitely not the only example.
They basically treat implicit-length strings as whole opaque objects, always returning pointers to the start, never to the end or to a position inside one after searching or appending.
History guesswork
In early C on a PDP-11, I suspect that strcpy was no more efficient than while(*dst++ = *src++) {} (and was probably implemented that way).
In fact, K&R first edition (page 101) shows that implementation of strcpy and says:
Although this may seem cryptic at first sight, the notational convenience is considerable, and the idiom should be mastered, if for no other reason than that you will see it frequently in C programs.
This implies they fully expected programmers to write their own loops in cases where you wanted the final value of dst or src. And thus maybe they didn't see a need to redesign the standard library API until it was too late to expose more useful APIs for hand-optimized asm library functions.
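Written by hand, the K&R copy loop already has the end position available when it finishes. A minimal sketch, with the hypothetical name copy_return_end, just returns it instead of discarding it:

```c
#include <assert.h>

/* Hypothetical name: the K&R copy loop, but returning the position of the
   terminating '\0' in dst instead of throwing it away. */
char *copy_return_end(char *dst, const char *src)
{
    assert(dst && src);
    while ((*dst++ = *src++) != '\0')
        ;                      /* copy up to and including the terminator */
    return dst - 1;            /* dst was advanced one past the '\0' */
}
```

This is the same contract as stpcpy; the only cost over the plain loop is the final subtraction.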
But does returning the original value of dst make any sense?
strcpy(dst, src) returning dst is analogous to x=y evaluating to x. So it makes strcpy work like a string assignment operator.
As other answers point out, this allows nesting, like foo( strcpy(buf,input) );. Early computers were very memory-constrained. Keeping your source code compact was common practice. Punch cards and slow terminals were probably a factor in this. I don't know historical coding standards or style guides or what was considered too much to put on one line.
Crusty old compilers were also maybe a factor. With modern optimizing compilers, char *tmp = foo(); / bar(tmp); is no slower than bar(foo());, but it is with gcc -O0. I don't know if very early compilers could optimize variables away completely (not reserving stack space for them), but hopefully they could at least keep them in registers in simple cases. That makes gcc -O0 a poor model for ancient compilers, because it spills/reloads everything on purpose for consistent debugging.
Possible compiler-generated-asm motivation
Given the lack of care about efficiency in the general API design of the C string library, this might be unlikely. But perhaps there was a code-size benefit. (On early computers, code-size was more of a hard limit than CPU time).
I don't know much about the quality of early C compilers, but it's a safe bet that they were not awesome at optimizing, even for a nice simple / orthogonal architecture like PDP-11.
It's common to want the string pointer after the function call. At an asm level, you (the compiler) probably have it in a register before the call. Depending on calling convention, you either push it on the stack or you copy it to the register where the calling convention says the first arg goes (i.e. where strcpy is expecting it). Or if you're planning ahead, you already had the pointer in the right register for the calling convention.
But function calls clobber some registers, including all the arg-passing registers. (So when a function gets an arg in a register, it can increment it there instead of copying to a scratch register.)
So as the caller, your code-gen options for keeping something across a function call include:
- store/reload it to local stack memory. (Or just reload it if an up-to-date copy is still in memory).
- save/restore a call-preserved register at the start/end of your whole function, and copy the pointer to one of those registers before the function call.
- the function returns the value in a register for you. (Of course, this only works if the C source is written to use the return value instead of the input variable, e.g. dst = strcpy(dst, src); if you aren't nesting it.)
All calling conventions on all architectures I'm aware of return pointer-sized return values in a register, so having maybe one extra instruction in the library function can save code-size in all callers that want to use that return value.
You probably got better asm from primitive early C compilers by using the return value of strcpy (already in a register) than by making the compiler save the pointer around the call in a call-preserved register or spill it to the stack. This may still be the case.
BTW, on many ISAs, the return-value register is not the first arg-passing register. And unless you use base+index addressing modes, it does cost an extra instruction (and tie up another reg) for strcpy to copy the register for a pointer-increment loop.
PDP-11 toolchains normally used some kind of stack-args calling convention, always pushing args on the stack. I'm not sure how many call-preserved vs. call-clobbered registers were normal, but only 5 or 6 GP regs were available (R7 being the program counter, R6 being the stack pointer, R5 often used as a frame pointer). So it's similar to but even more cramped than 32-bit x86.
char *bar(char *dst, const char *str1, const char *str2)
{
    //return strcat(strcat(strcpy(dst, str1), "separator"), str2);
    // more readable to modern eyes:
    dst = strcpy(dst, str1);
    dst = strcat(dst, "separator");
    // dst = strcat(dst, str2);
    return dst;  // simulates further use of dst
}
# x86 32-bit gcc output, optimized for size (not speed)
# gcc8.1 -Os -fverbose-asm -m32
# input args are on the stack, above the return address
    push    ebp
    mov     ebp, esp                    # Create a stack frame.
    sub     esp, 16                     # This looks like a missed optimization, wasted insn
    push    DWORD PTR [ebp+12]          # str1
    push    DWORD PTR [ebp+8]           # dst
    call    strcpy
    add     esp, 16
    mov     DWORD PTR [ebp+12], OFFSET FLAT:.LC0   # store new args over our incoming args
    mov     DWORD PTR [ebp+8], eax      # EAX = dst.
    leave
    jmp     strcat                      # optimized tailcall of the last strcat
This is significantly more compact than a version which doesn't use dst =, and instead reuses the input arg for the strcat. (See both on the Godbolt compiler explorer.)
The -O3 output is very different: gcc for the version that doesn't use the return value uses stpcpy (returns a pointer to the tail) and then mov-immediate to store the literal string data directly to the right place.
But unfortunately, the dst = strcpy(dst, src) -O3 version still uses regular strcpy, then inlines strcat as strlen + mov-immediate.
To C-string or not to C-string
C implicit-length strings aren't always inherently bad, and have interesting advantages (e.g. a suffix is also a valid string, without having to copy it).
But the C string library is not designed in a way that makes efficient code possible, because char-at-a-time loops typically don't auto-vectorize and the library functions throw away results of work they have to do.
gcc and clang never auto-vectorize loops unless the iteration count is known before the first iteration, e.g. for(int i=0; i<n ;i++). ICC can vectorize search loops, but it's still unlikely to do as well as hand-written asm.
strncpy and so on are basically a disaster. For example, strncpy doesn't copy the terminating '\0' if it reaches the buffer size limit, so you need to manually do arr[n] = 0; before or after. But if the source string is shorter, it pads with 0 bytes out to the specified length, potentially touching a page of memory that never needed to be touched. (This also makes it very inefficient for copying short strings into a large buffer that still has lots of space left.)
It appears to have been designed for writing into the middle of larger strings, not for avoiding buffer overflows.
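Both strncpy quirks described above show up in a small wrapper. This is only a sketch: copy_truncated is a hypothetical helper name, and it assumes silent truncation is acceptable to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst (capacity cap bytes), truncating
   if needed, and always leaves dst terminated. */
void copy_truncated(char *dst, size_t cap, const char *src)
{
    assert(cap > 0);
    strncpy(dst, src, cap - 1);   /* pads with '\0' bytes if src is short ... */
    dst[cap - 1] = '\0';          /* ... but does not terminate if src is long */
}
```

Note that for a short src, strncpy still writes all cap - 1 bytes, which is the zero-padding cost mentioned above.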
A few functions like snprintf are usable and do always nul-terminate. Remembering which does which is hard, and a huge risk if you remember wrong, so you have to check every time in cases where it matters for correctness.
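With C99 snprintf, the return value is the length the full result would have had, so truncation can be detected by comparing it with the buffer size. A minimal sketch, where format_pair is a hypothetical helper and the "key=value" format is just an example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: formats "key=value" into out. Returns true if the
   whole result fit, false if it was truncated. Either way out is terminated,
   because cap is required to be non-zero. */
bool format_pair(char *out, size_t cap, const char *key, const char *value)
{
    assert(cap > 0);
    int needed = snprintf(out, cap, "%s=%s", key, value);
    return needed >= 0 && (size_t)needed < cap;
}
```

This still pays for format-string parsing, so it addresses correctness rather than the speed concerns discussed earlier.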
As Bruce Dawson says: Stop using strncpy already!. Apparently some MSVC extensions like _snprintf are even worse.
strncat also exists in POSIX.1-2001 and, despite the similar name, does not behave like strncpy; it does what you'd hope, a length-limited strcat which always 0-terminates (its size argument limits how many bytes are taken from the source, not the total size of the destination). But like strcat it still returns the original pointer, so it is not useful for efficiently appending strings into a buffer: if you simply call it repeatedly on the same buffer, it has to re-scan the leading part every time to find the current end. The man page mentions "Shlemiel the painter".