Confusing language in specification of strtol, et al

Asked 14/7, 2011 at 23:17 Answered 15/7, 2011 at 8:18

c standards-compliance language-lawyer strtol

The specification for strtol conceptually divides the input string into "initial whitespace", a "subject sequence", and a "final string", and defines the "subject sequence" as:

the longest initial subsequence of the input string, starting with the first non-white-space character that is of the expected form. The subject sequence shall contain no characters if the input string is empty or consists entirely of white-space characters, or if the first non-white-space character is other than a sign or a permissible letter or digit.

At one time I thought the "longest initial subsequence" business was akin to the way scanf works, where "0x@" would scan as "0x", a failed match, followed by "@" as the next unread character. However, after some discussion, I'm mostly convinced that strtol processes the longest initial subsequence that is of the expected form, not the longest initial string which is the initial subsequence of some possible string of the expected form.

What's still confusing me is this language in the specification:

If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

If we accept what seems to be the correct definition of "subject sequence", there is no such thing as a non-empty subject sequence that does not have the expected form, and instead (to avoid redundancy and confusion) the text should just read:

If the subject sequence is empty, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

Can anyone clarify these issues for me? Perhaps a link to past discussions or any relevant defect reports would be useful.

Proconsul asked 14/7, 2011 at 23:17 Comment(2)

And a simple normative example would have cleared everything up... – Padilla 14/7, 2011 at 23:47

Indeed. I have a feeling on matters like this the committee was actually trying to avoid being explicit for fear of opening up a bikeshed argument about how it should behave... – Proconsul 15/7, 2011 at 0:10

I think the C99 language is quite clear:

The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form.

Given "0x@", "0x@" is not of the expected form; "0x" is not of the expected form; therefore "0" is the longest initial subsequence that is of the expected form.

I agree that this implies that you cannot have a non-empty subject sequence that isn't of the expected form - unless you interpret the following:

In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.

...as allowing a locale to define other possible forms that the subject sequence might have, that are nonetheless not of "the expected form".

The wording in the final paragraph seems to be just "belt-and-braces".

Pipes answered 15/7, 2011 at 1:37 Comment(1)

I'm interested in how you think one could weasel around the requirement with the locale clause.. :-) – Proconsul 15/7, 2011 at 1:57

It might be easier to understand if you started at §7.20.1.4 (The strtol, strtoll, strtoul, and strtoull functions) ¶2 of the C99 standard, instead of ¶4:

¶2 The strtol, strtoll, strtoul, and strtoull functions convert the initial portion of the string pointed to by nptr to long int, long long int, unsigned long int, and unsigned long long int representation, respectively. First, they decompose the input string into three parts: an initial, possibly empty, sequence of white-space characters (as specified by the isspace function), a subject sequence resembling an integer represented in some radix determined by the value of base, and a final string of one or more unrecognized characters, including the terminating null character of the input string. Then, they attempt to convert the subject sequence to an integer, and return the result.

¶3 If the value of base is zero, the expected form of the subject sequence is that of an integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. The letters from a (or A) through z (or Z) are ascribed the values 10 through 35; only letters and digits whose ascribed values are less than that of base are permitted. If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.

¶4 The subject sequence is defined as the longest initial subsequence of the input string, ...

In particular, ¶3 clarifies what a subject sequence is.

Flinders answered 15/7, 2011 at 5:8 Comment(0)

The POSIX spec for strtol seems to be clearer:

These functions shall convert the initial portion of the string pointed to by str to a type long and long long representation, respectively. First, they decompose the input string into three parts:

An initial, possibly empty, sequence of white-space characters (as specified by isspace())

A subject sequence interpreted as an integer represented in some radix determined by the value of base

A final string of one or more unrecognized characters, including the terminating NUL character of the input string.

Then they shall attempt to convert the subject sequence to an integer, and return the result.

But of course, it is not normative and "defers to the ISO C standard".

Shivaree answered 14/7, 2011 at 23:24 Comment(6)

Which revision of POSIX? Maybe they clarified the wording? – Shivaree 14/7, 2011 at 23:59

2008, the same one you linked. – Proconsul 15/7, 2011 at 0:9

Ohhh... Right :-). In which case, I wonder how the ISO spec phrases it... Definitely bizarre wording. I thought the intent was "skip white space, then try to read an integer, and if that fails, do nothing" – Shivaree 15/7, 2011 at 0:11

Well the only real corner case is reading "0x" followed by junk when base is 0 or 16 - should it read as 0 with endptr pointing to the x, or as invalid input? scanf is clearly specified to read it as invalid input, but glibc's scanf reads it as 0, also consuming the x... BTW, strtod has a lot more corner cases since there are a lot more strings that are well-formed if truncated short, ill-formed truncated mid-length, and well-formed again if taken in full - for example, "1.0e+10" is well-formed, and "1.0" is well-formed, but "1.0e+" is not. – Proconsul 15/7, 2011 at 0:39

I believe strtod is to read the latter as "1.0" and leave a pointer to "e+" in endptr, but scanf is supposed to produce a matching failure in this case. – Proconsul 15/7, 2011 at 0:39

@R..: I agree - the scanf() family of functions produces additional failures in cases where it would be necessary to push back more than one character; however, many libc implementations are known to be broken - see #1426230 – Victorie 15/7, 2011 at 7:37

I completely agree with your assessment: By definition, all non-empty subject sequences are of the expected form, so the wording of the standard is dubious.

In the case of the floating-point conversion functions, there's another blunder (C99:TC3 section 7.20.1.3, §3):

[...] The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is not of the expected form.

This implies that the whole input string must be of the expected form, defeating the purpose of the endptr parameter. One could argue that the expected form for the input string is different from the expected form for the subject sequence, but it's still pretty confusing.

You are also correct that the semantics of the strto*() and *scanf() families of functions are different: If both match, they will always agree on the value and consume the same number of characters (and any libc implementation where they do not is broken, including newlib and glibc last time I checked), but *scanf() additionally fails to match cases where it would need to backtrack more than one character, as in your examples "0x@" and "1.0e+".

Victorie answered 15/7, 2011 at 8:18 Comment(0)
