Treatment of backslash character in the bracket expression
F

3

5

Section 3.4, Using Bracket Expressions, of the GNU awk manual reads:

To include one of the characters ‘\’, ‘]’, ‘-’, or ‘^’ in a bracket expression, put a ‘\’ in front of it. For example:
     [d\]]
matches either ‘d’ or ‘]’. Additionally, if you place ‘]’ right after the opening ‘[’, the closing bracket is treated as one of the characters to be matched.

The treatment of ‘\’ in bracket expressions is compatible with other awk implementations and is also mandated by POSIX.

On the other hand, the section Regular Expressions of POSIX awk doesn't list the \] as having a special meaning. Here are a few experiments with GNU awk (version 5.3.1) and GNU grep (version 3.11) that expose conflicting treatment of the \ in a bracket expression:

$ echo d | awk '/[d\]]/'
d
$ echo d | grep -E '[d\]]'
$ echo ']' | awk '/[d\]]/'
]
$ echo ']' | grep -E '[d\]]'

The question is:
Is the GNU awk documentation wrong in claiming that the treatment of \ in a bracket expression in GNU awk is mandated by POSIX, or have I overlooked something?
In other words, does GNU awk violate the POSIX specification?

Foreigner answered 2/11 at 14:38 Comment(0)
H
6

The POSIX reference that allows awk to interpret \ in a bracket expression as an escape character is in the table under Regular Expressions in the POSIX awk spec (emphasis mine and note in particular the last 2 rows of the table):

Regular Expressions

... these escape sequences shall be recognized both inside and outside bracket expressions ...

Escape Sequence | Description | Meaning
\" | <backslash> <quotation-mark> | In the lexical token STRING, <quotation-mark> character. Otherwise undefined.
\/ | <backslash> <slash> | In the lexical token ERE, <slash> character. Otherwise undefined.
\ddd | A <backslash> character followed by the longest sequence of one, two, or three octal-digit characters (01234567). ... | The character whose encoding is represented by the one, two, or three-digit octal integer...
\., \[, \(, \*, \+, \?, \{, \|, \^, \$ | A <backslash> character followed by a character that has a special meaning in EREs ... other than <backslash>. | In the lexical token ERE when not inside a bracket expression, the sequence shall represent itself. Otherwise undefined.
\\ | Two <backslash> characters. | In the lexical token ERE, the sequence shall represent itself...
\c | A <backslash> character followed by any character not described in this table or in the table in XBD 5. File Format Notation ('\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v'). | Undefined

That means that in a POSIX-compliant awk, inside or outside of a bracket expression, \\ is mandated to mean a literal \ and the meaning of \c, where c is any character not listed in the table (e.g. ]), is undefined by POSIX and so gawk can treat it however it likes, hence allowing [d\]] to mean "d or ]", for example.

So no, gawk is not violating the POSIX awk spec (which supersedes the POSIX regexp spec for describing awk behavior) in its treatment of \, as it is treating \ in the way required for [\\] and as allowed (since its meaning is undefined) for [d\]].
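The one case the table does pin down can be checked directly. This is a minimal sketch, assuming a POSIX-style awk is on the PATH; the sample lines and the variable name `out` are made up for illustration.

```sh
# Feed two sample lines to awk; only the line that contains a
# backslash should match the bracket expression [\\].
out=$(printf '%s\n' 'a\b' 'ab' | awk '/[\\]/')
printf '%s\n' "$out"
```

Only the first sample line contains a backslash, so that is the only line expected back. The `[d\]]` case is deliberately left out here, since its result may differ between awk variants.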

Haemorrhage answered 3/11 at 14:12 Comment(18)
Then, the gawk documentation is apparently in error in claiming that [d\]] matches either ‘d’ or ‘]’, isn't it? It isn't mandated by POSIX; it may match, but doesn't have to. That construct cannot be assured to be portable. - Foreigner
POSIX states that [d\]] is undefined behavior so any awk variant (e.g. BSD or GNU) can define it to mean whatever the implementer wants it to mean. The gawk documentation states that [d\]] means d or ] in gawk - that's not an error in the gawk documentation any more than defining what the gensub() function does or the value of $0 in the END section because that is what it means to gawk. - Haemorrhage
Regarding "That construct cannot be assured to be portable." - the gawk documentation isn't claiming it means the same in other awks as it does in gawk, just what it means in gawk. In general if you want your code to be portable you should avoid using any construct in your code whose behavior is undefined by POSIX (and, unfortunately, with the latest spec now also some that IS defined but almost no awk implements such as .*?), regardless of how it's defined in a variant of the tool you're using. - Haemorrhage
@M.NejatAydin I spoke to the gawk provider, Arnold - he's going to add some more explanatory text about this to the gawk manual and I'll open a ticket against POSIX to improve the description of this area in the standard. - Haemorrhage
Also please note that gawk --posix '/[d\>]/' issues a warning whereas gawk --posix '/[d\]]/' doesn't. Both bracket expressions are invalid per POSIX, though \> is a GNU extension. - Foreigner
@M.NejatAydin that warning for \> is because \] means something in a bracket expression while \> does not and so is probably a bug where the author meant to write either [d\\>] (d or \ or >) or [d>] (d or >) or [d]\> (d followed by a word delimiter). So, gawk should not issue a warning for [d\]] as it's a valid, meaningful construct but should issue a warning for [d\>] as it's probably a bug. - Haemorrhage
Regarding "Both bracket expressions are invalid per POSIX" - no, both are undefined per POSIX, that's not the same as invalid. POSIX deliberately leaves a construct undefined when there's no consensus among tool providers over what that construct means so POSIX can define the common functionality while still leaving GNU, BSD, etc. to make that construct mean whatever they want it to mean for their variant of that tool. Every awk interprets [\]] and [\>] to mean something and whatever that something is, it's valid with respect to POSIX since POSIX left it undefined. - Haemorrhage
Regarding "\> is a GNU extension" - \> is a GNU extension outside of a bracket expression, not inside of one. Inside a bracket expression [\>] is no different from [\d] or [\*] - just another literal character that someone mistakenly stuck a \ in front of and so gawk warns you about your mistake with awk: cmd. line:1: warning: regexp escape sequence '\>' is not a known regexp operator. - Haemorrhage
undefined describes the nature of a value or behavior not defined by POSIX.1-2024 which results from use of an invalid program construct or invalid data input. In the context of this discussion, undefined results from use of an invalid program construct, that is, [d\]]. Then, in this context, I cannot see a practical difference between undefined and invalid. - Foreigner
The second sentence of that section you quoted matters and reiterates what I'm saying above - "The value or behavior may vary among implementations that conform to POSIX.1-2024.". So something that's undefined by POSIX can be defined by the tool implementer to do whatever they like and the result is still conformant with POSIX. Something undefined (e.g. BEGIN{FS=""} or END{print $0}) will allow the program to continue at the discretion of the tool implementer while invalid (e.g. awk --posix 'BEGIN{print gensub("foo","bar",1,"xfoox")}') stops the program, usually with an error message. - Haemorrhage
I see POSIX are trying to define a difference between "undefined" and "unspecified". I guess I can see why they'd want to do that but practically, no-one cares whether a given construct was stated in POSIX as explicitly undefined or just not mentioned (unspecified) by POSIX - it's all simply undefined behavior to those of us writing code and tool providers can define it however they like. - Haemorrhage
I don't object that a utility can do whatever it likes when it encounters an invalid program construct. But that construct is still invalid from POSIX's point of view. - Foreigner
As I see it, the undefined results from use of an invalid program construct (or input) whereas the unspecified results from use of a valid program construct (or input). So, use of [d\]] in a POSIX awk program is definitely invalid. - Foreigner
I understand that's what those POSIX definitions for "undefined" and "unspecified" say but they don't define what "valid" and "invalid" actually mean and they both say "The value or behavior may vary among implementations that conform to POSIX.1-2024. An application should not rely on the existence or validity of the value or behavior. An application that relies on any particular value or behavior cannot be assured to be portable across conforming implementations." so both should be treated identically, there is no functional difference between them. The attempted distinction is useless. - Haemorrhage
Think about anything else in life - when is something considered invalid but you can continue to use it? That's what the POSIX definition for "undefined" says - the use of an invalid language construct which can be used and still be conformant with POSIX. That makes no sense. Just think of anything that's stated as "undefined" or is not stated at all ("unspecified") in POSIX as undefined behavior, as that term has been understood across all standards for decades, and then how to use that information will be much clearer. - Haemorrhage
I am (or was) familiar with the C standard. In C, the order of evaluation of subexpressions in a statement such as x = f(y) + g(z); is unspecified by the standard (but the statement is completely valid). That is, it is not specified, by the standard, which function (f or g) will be called first (that may matter and may affect the program's output if these functions have side effects, such as modifying a global variable). On the other hand, a statement such as a[i] = i++; is undefined (i.e., invalid), but the compiler is allowed to compile it without an error or warning message! - Foreigner
It's all just nasal demons in the end :-). - Haemorrhage
In the context of awk programming, think of the statement for (idx in arr). The order in which the indices are iterated is unspecified, and the statement is valid. For example, with awk 'BEGIN { a["z"]; a["y"]; a["x"]; for (i in a) s = s i; print s }', the output will be one of the permutations of the letters x, y, z. It is unspecified which one, but it cannot be anything else. This is very different from undefined behavior, where the output could be anything or there could be no output at all. - Foreigner
C
1

The RE Bracket Expression chapter of POSIX stipulates that

A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: ordinary characters, collating elements, collating symbols, equivalence classes, character classes, or range expressions. The right-square-bracket ( ']' ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial circumflex ( '^' ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as "[.].]" ) or is the ending for a collating symbol, equivalence class, or character class. The special characters '.', '*', '[', and '\' ( period, asterisk, left-square-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.

The character sequences "[.", "[=", and "[:" (left-square-bracket followed by a period, equals-sign, or colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and character class expressions. These symbols shall be followed by a valid expression and the matching terminating sequence ".]", "=]", or ":]", as described in the following items(...)

therefore

if you place ‘]’ right after the opening ‘[’, the closing bracket is treated as one of the characters to be matched

is compliant with the above, but

[d\]]

to my understanding cannot mean "matches either ‘d’ or ‘]’", because the first ] terminates the bracket expression: it is neither the first character nor part of a collating symbol, equivalence class, or character class.
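This reading can be checked with grep -E, which follows the POSIX regex rules rather than the awk escape table. The following is a minimal sketch, assuming a POSIX grep; the sample inputs and the variable name `matches` are made up for illustration. Under these rules `[d\]` is the class {d, \} and the second `]` is an ordinary character, so the pattern needs two characters to match.

```sh
# [d\]] under POSIX ERE rules: the class [d\] followed by a literal ].
# "d]" and "\]" should match; a lone "d" or a lone "]" should not.
matches=$(printf '%s\n' 'd' ']' 'd]' '\]' | grep -E '[d\]]')
printf '%s\n' "$matches"
```

This agrees with the grep experiments in the question, where neither a lone d nor a lone ] produced any output.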

Christi answered 2/11 at 19:22 Comment(0)
E
0

Usually it's just easier to build character classes by placement instead of worrying about how many backslashes you need (and how many get "eaten" by each layer of encapsulation). So if you want a unified ERE that works in grep -E and awk for capturing "d" (just the lowercase letter) and/or "]" (right square bracket), then do

awk '/[]d]/' or awk '$0 ~ "[]d]"'


grep -E '[]d]'
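A minimal sketch comparing the two tools on the same pattern, assuming both awk and grep are on the PATH; the sample inputs and the variable names `awk_out` and `grep_out` are made up for illustration.

```sh
# With ] placed first it is a literal member of the class, so the same
# pattern works in awk and in grep -E without any backslash.
awk_out=$(printf '%s\n' 'd' ']' 'x' | awk '/[]d]/')
grep_out=$(printf '%s\n' 'd' ']' 'x' | grep -E '[]d]')
printf 'awk:\n%s\ngrep:\n%s\n' "$awk_out" "$grep_out"
```

Both tools should keep the d and ] lines and drop the x line.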

Only ^ (only in a matching, i.e. non-negated, char class, and only when it's the left-most item within the class) and \ require special backslash handling in a char class. Other special items like ] and - can avoid backslashing with strategic placement within the class.

To negate ^ (i.e. any locale-valid char EXCEPT caret) —> [^^].

I usually place either - or \\ furthest right within the char class, and ] furthest left. If you need both - and \ := [-...\\]. If you need all 3 of these { ], -, \}, perhaps []...\\-]: keep the - away from the leading ], because ]- followed by another character would be read as the start of a range. By placing \\ as far right as possible, there's zero chance it could be mis-interpreted as an escape sequence for anything other than the backslash itself.
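A minimal sketch of a class holding all three characters, assuming an awk that follows the POSIX escape table; the sample inputs and the variable name `out` are made up for illustration. The hyphen is placed last here rather than directly after the leading ], since ]- followed by another character would be read as a range.

```sh
# Class members: ] (first), a, backslash (written \\), - (last).
# The ^ input line is a control that should not match.
out=$(printf '%s\n' ']' '-' '\' 'a' '^' | awk '/^[]a\\-]$/')
printf '%s\n' "$out"
```

The four lines ], -, \ and a should come back, and the ^ line should be dropped.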

If possible, capture those few characters with char ranges and avoid the backslashing altogether - e.g. to capture all ASCII punctuation symbols by explicitly listing the char ranges, one can do

awk '/[!-/:-@[-\140{-~]/' (gawk / mawk)

awk '/[!-\/:-@[-\140{-~]/' (nawk)


awk '$0 ~ "[!-/:-@[-\140{-~]"'

(I used \140 because I don't like physical backticks anywhere in my code, but it's safe to use \140 instead of \\140 for the string version of it)
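A minimal sketch that sanity-checks the string version of those ranges, assuming an awk whose printf-style %c turns a number into the character with that code and whose ranges follow byte order in the C locale; the variable name `count` is made up for illustration. It counts how many of the ASCII codes 33 to 126 fall inside the class.

```sh
# Count the printable, non-space ASCII characters matched by the ranges.
# The four ranges cover codes 33-47, 58-64, 91-96 and 123-126.
count=$(LC_ALL=C awk 'BEGIN {
    for (i = 33; i <= 126; i++)
        if (sprintf("%c", i) ~ "^[!-/:-@[-\140{-~]$")
            n++
    print n
}')
printf '%s\n' "$count"
```

The four ranges span 15 + 7 + 6 + 4 = 32 codes, which is the number of ASCII punctuation characters, so the expected count is 32.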

The only exception to this rule would be if you need the ] to bookend the upper limit of a char range. In this scenario, pick your poison :

awk '/[[-\]]/' or awk '/[][\\]/'

Earthwork answered 5/11 at 9:18 Comment(0)
