What is "namespace cleanliness", and how does glibc achieve it?
Asked Answered
L

3

8

I came across this paragraph from this answer by @zwol recently:

The __libc_ prefix on read is because there are actually three different names for read in the C library: read, __read, and __libc_read. This is a hack to achieve "namespace cleanliness", which you only need to worry about if you ever set out to implement a full-fledged and fully standards compliant C library. The short version is that there are many functions in the C library that need to call read, but some of them cannot use the name read to call it, because a C program is technically allowed to define a function named read itself.

As some of you may know, I am setting out to implement my own full-fledged and fully standards-compliant C library, so I'd like more details on this.

What is "namespace cleanliness", and how does glibc achieve it?

Leolaleoline answered 30/8, 2019 at 20:58 Comment(1)
I would like to write an answer to this question, but it's 5pm on a Friday and I don't know if I'm going to have time to give it the attention it deserves for, um, several weeks. For right now I'm going to point you at section 7.1.3 of C2011 and suggest you think about the implications of "No other identifiers are reserved (for the implementation)" in particular.Infundibulum
A
7

First, note that the identifier read is not reserved by ISO C at all. A strictly conforming ISO C program can have an external variable or function called read. Yet, POSIX has a function called read. So how can we have a POSIX platform with read that at the same time allows the C program? After all fread and fgets probably use read; won't they break?

One way would be to split all the POSIX stuff into separate libraries: the user has to link -lio or whatever to get read and write and other functions (and then have fread and getc use some alternative read function, so they work even without -lio).

The approach in glibc is not to use symbols like read, but instead stay out of the way by using alternative names like __libc_read in a reserved namespace. The availability of read to POSIX programs is achieved by making read a weak alias for __libc_read. Programs which make an external reference to read, but do not define it, will reach the weak symbol read which aliases to __libc_read. Programs which define read will override the weak symbol, and their references to read will all go to that override.

The important part is that this has no effect on __libc_read. Moreover, the library itself, where it needs to use the read function, calls its internal __libc_read name that is unaffected by the program.

So all of this adds up to a kind of cleanliness. It's not a general form of namespace cleanliness feasible in a situation with many components, but it works in a two-party situation where our only requirement is to separate "the system library" and "the user application".

Astatic answered 30/8, 2019 at 21:54 Comment(2)
@JL2210 That may be the current situation, but I seem to recall that it wasn't always in the history of glibc; in any case, it's just a detail. The weak alias read and a single internal function like __read or __libc_read are enough to bring about the namespace separation. Any further split into __read and __libc_read must be for some additional requirements. (Like allowing programs to actually override library functions when they need to or whatever.)Astatic
__libc_read is not just a stub, it does the same thing as __read and read. (JL2210 may have been confused because, in the linked question, someone quoted code from glibc's stub implementation of all three.) I'm going to write a complementary answer explaining why glibc needs both __libc_read and __read.Infundibulum
J
3

OK, first some basics about the C language as specified by the standard. In order that you can write C applications without concern that some of the identifiers you use might clash with external identifiers used in the implementation of the standard library or with macros, declarations, etc. used internally in the standard headers, the language standard splits up possible identifiers into namespaces reserved for the implementation and namespaces reserved for the application. The relevant text is:

7.1.3 Reserved identifiers

Each header declares or defines all identifiers listed in its associated subclause, and optionally declares or defines identifiers listed in its associated future library directions subclause and identifiers which are always reserved either for any use or for use as file scope identifiers.

  • All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
  • All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.
  • Each macro name in any of the following subclauses (including the future library directions) is reserved for use as specified if any of its associated headers is included; unless explicitly stated otherwise (see 7.1.4).
  • All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage.184)
  • Each identifier with file scope listed in any of the following subclauses (including the future library directions) is reserved for use as a macro name and as an identifier with file scope in the same name space if any of its associated headers is included.

No other identifiers are reserved. If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.

Emphasis here is mine. As examples, the identifier read is reserved for the application in all contexts ("no other..."), but the identifier __read is reserved for the implementation in all contexts (bullet point 1).

Now, POSIX defines a lot of interfaces that are not part of the standard C language, and libc implementations might have a good deal more not covered by any standards. That's okay so far, assuming the tooling (linker) handles it correctly. If the application doesn't include <unistd.h> (outside the scope of the language standard), it can safely use the identifier read for any purpose it wants, and nothing breaks even though libc contains an identifier named read.

The problem is that a libc for a unix-like system is also going to want to use the function read to implement parts of the base C language's standard library, like fgetc (and all the other stdio functions built on top of it). This is a problem, because now you can have a strictly conforming C program such as:

#include <stdio.h>
#include <stdlib.h>
void read()
{
    abort();
}
int main()
{
    getchar();
    return 0;
}

and, if libc's stdio implementation is calling read as its backend, it will end up calling the application's function (not to mention, with the wrong signature, which could break/crash for other reasons), producing the wrong behavior for a simple, strictly conforming program.

The solution here is for libc to have an internal function named __read (or whatever other name in the reserved namespace you like) that can be called to implement stdio, and have the public read function call that (or, be a weak alias for it, which is a more efficient and more flexible mechanism to achieve the same thing with traditional unix linker semantics; note that there are some namespace issues more complex than read that can't be solved without weak aliases).

Jollification answered 30/8, 2019 at 21:43 Comment(3)
Ah, shucks. I was hoping to avoid weak aliases. This has my upvote, but I'm going to wait "several weeks" to see if @Infundibulum can answer.Leolaleoline
@JL2210: Well you could avoid them by having completely separate versions of the standard library you link for "nothing but plain C" etc., but that's a pretty extreme measure. Refusing to use them (or other extensions that could act in their place, like global ctors that setup function pointers if they're linked) also precludes any sort of efficient static linking.Jollification
Well, I suppose they're needed then. Guess I'll start working on that.Leolaleoline
I
2

Kaz and R.. have explained why a C library will, in general, need to have two names for functions such as read, that are called by both applications and other functions within the C library. One of those names will be the official, documented name (e.g. read) and one of them will have a prefix that makes it a name reserved for the implementation (e.g. __read).

The GNU C Library has three names for some of its functions: the official name (read) plus two different reserved names (e.g. both __read and __libc_read). This is not because of any requirements made by the C standard; it's a hack to squeeze a little extra performance out of some heavily-used internal code paths.

The compiled code of GNU libc, on disk, is split into several shared objects: libc.so.6, ld.so.1, libpthread.so.0, libm.so.6, libdl.so.2, etc. (exact names may vary depending on the underlying CPU and OS). The functions in each shared object often need to call other functions defined within the same shared object; less often, they need to call functions defined within a different shared object.

Function calls within a single shared object are more efficient if the callee's name is hidden—only usable by callers within that same shared object. This is because globally visible names can be interposed. Suppose that both the main executable and a shared object define the name __read. Which one will be used? The ELF specification says that the definition in the main executable wins, and all calls to that name from anywhere must resolve to that definition. (The ELF specification is language-agnostic and does not make any use of the C standard's distinction between reserved and non-reserved identifiers.)

Interposition is implemented by sending all calls to globally visible symbols through the procedure linkage table, which involves an extra layer of indirection and a runtime-variable final destination. Calls to hidden symbols, on the other hand, can be made directly.

read is defined in libc.so.6. It is called by other functions within libc.so.6; it's also called by functions within other shared objects that are also part of GNU libc; and finally it's called by applications. So, it is given three names:

  • __libc_read, a hidden name used by callers from within libc.so.6. (nm --dynamic /lib/libc.so.6 | grep read will not show this name.)
  • __read, a visible reserved name, used by callers from within libpthread.so.0 and other components of glibc.
  • read, a visible normal name, used by callers from applications.

Sometimes the hidden name has a __libc prefix and the visible implementation name has just two underscores; sometimes it's the other way around. This doesn't mean anything. It's because GNU libc has been under continuous development since the 1990s and its developers have changed their minds about internal conventions several times, but haven't always bothered to fix up all the old-style code to match the new convention (sometimes compatibility requirements mean we can't fix up the old code, even).

Infundibulum answered 10/9, 2019 at 21:25 Comment(4)
This is kind of nuts. Why bother having two internal functions when you can just have one?Leolaleoline
The entire point of this answer was to explain why glibc bothers to have two internal names (they aren't separate functions). Can you explain in some more detail why it's still not clear to you?Infundibulum
I'm a bit confused by "[f]unction calls within a single shared object are more efficient if the callee's name is hidden". Why is this?Leolaleoline
@JL2210 I've added some text about that now.Infundibulum

© 2022 - 2024 — McMap. All rights reserved.