C pointer to array declaration with bitwise and operator

I want to understand the following code:

//...
#define _C 0x20
extern const char *_ctype_;
//...
__only_inline int iscntrl(int _c)
{
    return (_c == -1 ? 0 : ((_ctype_ + 1)[(unsigned char)_c] & _C));
}

It originates from the file ctype.h in the OpenBSD operating system source code. This function checks whether a char is a control character or a printable letter inside the ASCII range. This is my current chain of thought:

  1. iscntrl('a') is called and 'a' is converted to its integer value
  2. first check if _c is -1 then return 0 else...
  3. increment the address the undefined pointer points to by 1
  4. declare this address as a pointer to an array of length (unsigned char)((int)'a')
  5. apply the bitwise and operator to _C (0x20) and the array (???)

Somehow, strangely, it works, and every time 0 is returned the given char _c is not a printable character. Otherwise, when it's printable, the function just returns an integer value that's not of any special interest. My difficulty is with steps 3, 4 (a bit) and 5.

Thank you for any help.

Glochidiate answered 15/11, 2019 at 15:2 Comment(2)
_ctype_ is essentially an array of bitmasks. It's being indexed by the character of interest. So _ctype_['A'] would contain bits corresponding to "alpha" and "uppercase", _ctype_['a'] would contain bits corresponding to "alpha" and "lowercase", _ctype_['1'] would contain a bit corresponding to "digit", etc. It looks like 0x20 is the bit corresponding to "control". But for some reason the _ctype_ array is offset by 1, so the bits for 'a' are really in _ctype_['a'+1]. (That was probably to let it work for EOF even without the extra test.) - Trouveur
The cast to (unsigned char) is to take care of the possibility that characters are signed and negative. - Trouveur

_ctype_ appears to be a restricted internal version of the symbol table, and I'm guessing the + 1 is there because they didn't bother saving index 0 of it since that one isn't printable. Or possibly they are using a 1-indexed table instead of a 0-indexed one, as is customary in C.

The C standard dictates this for all ctype.h functions:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF

Going through the code step by step:

  • int iscntrl(int _c) The int types are really characters, but all ctype.h functions are required to handle EOF, so they must be int.
  • The check against -1 is a check against EOF, since it has the value -1.
  • _ctype_ + 1 is pointer arithmetic to get an address of an array item.
  • [(unsigned char)_c] is simply an array access of that array, where the cast is there to enforce the standard requirement of the parameter being representable as unsigned char. Note that char can actually hold a negative value, so this is defensive programming. The result of the [] array access is a single character from their internal symbol table.
  • The & masking is there to get a certain group of characters from the symbol table. Apparently all characters with bit 5 set (mask 0x20) are control characters. There's no making sense of this without viewing the table.
  • Anything with bit 5 set will return the value masked with 0x20, which is a non-zero value. This satisfies the requirement of the function returning non-zero in case of boolean true.
Gamone answered 15/11, 2019 at 15:24 Comment(2)
It is not correct that the cast satisfies the standard's requirement that the value be representable as unsigned char. The standard requires that the value already be representable as unsigned char, or equal EOF, when the routine is called. The cast only serves as “defensive” programming: correcting the error of a programmer who passes a char (or a signed char) when the onus was on them to pass an unsigned char value when using a ctype.h macro. It should be noted that this cannot correct the error when a char value of −1 is passed in an implementation that uses −1 for EOF. - Kinsella
This also offers an explanation of the + 1. If the macro did not previously contain this defensive adjustment, then it could have been implemented merely as ((_ctype_+1)[_c] & _C), thus having a table indexed with the pre-adjustment values −1 to 255. So the first entry was not skipped and did serve a purpose. When somebody later added the defensive cast, the EOF value of −1 would not work with that cast, so they added the conditional operator to treat it specially. - Kinsella

_ctype_ is a pointer to a global array of 257 bytes. I don't know what _ctype_[0] is used for. _ctype_[1] through _ctype_[256] represent the character categories of characters 0, …, 255 respectively: _ctype_[c + 1] represents the category of the character c. This is the same thing as saying that _ctype_ + 1 points to an array of 256 characters where (_ctype_ + 1)[c] represents the category of the character c.

(_ctype_ + 1)[(unsigned char)_c] is not a declaration. It's an expression using the array subscript operator. It's accessing position (unsigned char)_c of the array that starts at (_ctype_ + 1).

The cast of _c from int to unsigned char is not strictly necessary: ctype functions expect char values already cast to unsigned char (char is signed on OpenBSD), so a correct call is char c; … iscntrl((unsigned char)c). The cast has the advantage of guaranteeing that there is no buffer overflow: if the application calls iscntrl with a value that is outside the range of unsigned char and isn't -1, the function returns a value which may not be meaningful but at least won't cause a crash or a leak of private data that happened to be at an address outside the array bounds. The value is even correct if the function is called as char c; … iscntrl(c), as long as c isn't -1.

The reason for the special case with -1 is that it's EOF. Many standard C functions that operate on a char, for example getchar, represent the character as an int value which is the char value wrapped to a positive range, and use the special value EOF == -1 to indicate that no character could be read. For functions like getchar, EOF indicates the end of the file, hence the name end-of-file. Eric Postpischil suggests that the code was originally just return _ctype_[_c + 1], and that's probably right: _ctype_[0] would be the value for EOF. This simpler implementation leads to a buffer overflow if the function is misused, whereas the current implementation avoids this as discussed above.

If v is the value found in the array, v & _C tests if the bit at 0x20 is set in v. The values in the array are masks of the categories that the character is in: _C is set for control characters, _U is set for uppercase letters, etc.

Nonfiction answered 15/11, 2019 at 15:19 Comment(2)
(_ctype_ + 1)[_c] would use the correct array index as specified by the C standard, because it is the responsibility of the user to pass either EOF or an unsigned char value. The behavior for other values is not defined by the C standard. The cast does not serve to implement behavior required by the C standard. It is a workaround put in to guard against bugs caused by programmers incorrectly passing negative character values. However, it is incomplete or incorrect (and cannot be corrected) because a −1 character value will necessarily be treated as EOF. - Kinsella
This also offers an explanation of the + 1. If the macro did not previously contain this defensive adjustment, then it could have been implemented merely as ((_ctype_+1)[_c] & _C), thus having a table indexed with the pre-adjustment values −1 to 255. So the first entry was not skipped and did serve a purpose. When somebody later added the defensive cast, the EOF value of −1 would not work with that cast, so they added the conditional operator to treat it specially. - Kinsella

I'll start with step 3:

increment the address the undefined pointer points to by 1

The pointer is not undefined. It's just defined in some other compilation unit. That is what the extern part tells the compiler. So when all files are linked together, the linker will resolve the references to it.

So what does it point to?

It points to an array with information about each character. Each character has its own entry. An entry is a bitmap representation of characteristics for the character. For example: If bit 5 is set, it means that the character is a control character. Another example: If bit 0 is set, it means that the character is an uppercase character.

So something like (_ctype_ + 1)['x'] will get the characteristics that apply to 'x'. Then a bitwise and is performed to check if bit 5 is set, i.e. check whether it is a control character.

The reason for adding 1 is probably that the real index 0 is reserved for some special purpose.

Desexualize answered 15/11, 2019 at 15:40 Comment(0)

All information here is based on analyzing the source code (and programming experience).

The declaration

extern const char *_ctype_;

tells the compiler that there is a pointer to const char somewhere named _ctype_.

(4) This pointer is accessed as an array.

(_ctype_ + 1)[(unsigned char)_c]

The cast (unsigned char)_c makes sure the index value is in the range of an unsigned char (0..255).

The pointer arithmetic _ctype_ + 1 effectively shifts the array position by 1 element. I don't know why they implemented the array this way. Using the range _ctype_[1].._ctype_[256] for the character values 0..255 leaves the value _ctype_[0] unused for this function. (The offset of 1 could be implemented in several alternative ways.)

The array access retrieves a value (of type char, to save space) using the character value as array index.

(5) The bitwise AND operation extracts a single bit from the value.

Apparently the value from the array is used as a bit field where the bit 5 (counting from 0 starting at least significant bit, = 0x20) is a flag for "is a control character". So the array contains bit field values describing the properties of the characters.

Tylertylosis answered 15/11, 2019 at 15:19 Comment(1)
I guess they moved the + 1 to the pointer to make it clear that they are accessing elements 1..256 instead of 1..255,0. _ctype_[1 + (unsigned char)_c] would have been equivalent due to the implicit conversion to int. And _ctype_[(_c & 0xff) + 1] would have been even more clear and concise. - Logical

The functions declared in ctype.h accept arguments of type int. Characters passed as arguments are assumed to have been cast to unsigned char beforehand. The character is then used as an index into a table that describes the characteristics of each character.

The check _c == -1 appears to handle the case where _c contains the value of EOF. If it is not EOF, _c is cast to unsigned char and used as an index into the table pointed to by the expression _ctype_ + 1. If the bit specified by the mask 0x20 is set, the character is a control character.

To understand the expression

(_ctype_ + 1)[(unsigned char)_c]

take into account that array subscripting is a postfix operator that is defined as

postfix-expression [ expression ]

You cannot write

_ctype_ + 1[(unsigned char)_c]

because this expression is equivalent to

_ctype_ + ( 1[(unsigned char)_c] )

So the expression _ctype_ + 1 is enclosed in parentheses to get a primary expression.

So in fact you have

pointer[integral_expression]

that yields the array element at the index given by integral_expression. Here pointer is (_ctype_ + 1) (this is where pointer arithmetic is used) and integral_expression, the index, is (unsigned char)_c.

Lavaliere answered 15/11, 2019 at 15:18 Comment(0)

The key here is to understand what the expression (_ctype_ + 1)[(unsigned char)_c] does (its value is then fed to the bitwise AND operation, & 0x20, to get the result).

Short answer: It returns element _c + 1 of the array pointed to by _ctype_.

How?

First, although you seem to think _ctype_ is undefined it actually isn't! The header declares it as an external variable - but it is defined in (almost certainly) one of the run-time libraries that your program is linked with when you build it.

To illustrate how the syntax corresponds to array indexing, try working through (even compiling) the following short program:

#include <stdio.h>
int main() {
    // Code like the following two lines will be defined somewhere in the run-time
    // libraries with which your program is linked, only using _ctype_ in place of _qlist_ ...
    const char list[] = "abcdefghijklmnopqrstuvwxyz";
    const char* _qlist_ = list;
    // These two lines show how expressions like (a)[b] and (a+1)[b] just boil down to
    // a[b] and a[b+1], respectively ...
    char p = (_qlist_)[6];
    char q = (_qlist_ + 1)[6];
    printf("p = %c  q = %c\n", p, q);
    return 0;
}

Feel free to ask for further clarification and/or explanation.

Syracuse answered 15/11, 2019 at 15:21 Comment(0)
