Given a class, such as
[:digit:]
I would like the output to be
0123456789
Note, the method should work for all POSIX character classes. Here is what I have tried
$ printf %s '[:digit:]'
[:digit:]
$ seq 126 | awk '{printf "%c", $0}' | grep -o '[[:digit:]]'
0
1
2
3
4
5
6
7
8
9
$ seq 126 | awk '($0=sprintf("%c", $0)) ~ /[[:digit:]]/'
You don't actually need seq either:
awk 'BEGIN{for (i=1; i<=126; i++) if ((c=sprintf("%c", i)) ~ /[[:digit:]]/) print c}'
– Zared
I'm sure there's a better way but here's a brute force method:
for i in {0..127}; do
    char=$(printf \\$(printf '%03o' "$i"))
    [[ $char =~ [[:alpha:]] ]] && echo "$char"
done
Loop through all the decimal character values, convert them to the corresponding ASCII character and test them against the character class.
The range might be wrong but the check seems to work.
As others have mentioned in the comments, it is also possible to use the == operator instead of =~ in this case, which may be slightly faster.
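A minimal sketch of the == variant, assuming bash and restricted to ASCII. The function name class_members is hypothetical, and LC_ALL=C is pinned here only so that the result does not depend on the caller's locale:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ASCII members of a POSIX class, one per line.
class_members() {
    local LC_ALL=C                 # assumption: pin the locale for repeatable results
    local pat="[[:$1:]]" i char
    for ((i = 1; i < 128; i++)); do
        # Convert the decimal code to a character via its octal escape.
        char=$(printf "\\$(printf '%03o' "$i")")
        # Unquoted right-hand side: == treats $pat as a glob pattern, not a regex.
        if [[ $char == $pat ]]; then
            printf '%s\n' "$char"
        fi
    done
}

class_members digit
```

Starting the loop at 1 rather than 0 avoids the NUL byte, which a shell variable cannot hold.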
== instead of the regex operator =~ – Favor
== is faster? – Ambit
Similar to the other suggestions, you can find all matching Unicode 4.0 single codepoint graphemes in your current locale with:
for((i=0; i < 0x110000; i++)) {
    printf "\U$(printf "%x" $i)\n";
} | grep -a '^[[:alpha:]]$'
Here is a non-exhaustive list of problems with this approach:
Combining characters such as $'E\U0301', which is two code points rendered as one grapheme (this particular sequence canonicalizes to the single codepoint É). This is especially awkward for languages like Malayalam that depend entirely on combination.
It has some issues with the cntrl class, specifically line feeds.
Ruby characters, which I can't seem to render on Stack Overflow. Fortunately, these are generally deprecated in favor of proper markup.
It's slow.
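The line-feed problem comes from pushing raw characters through a line-oriented tool. One possible workaround, sketched here for ASCII only and assuming bash, is to test each character inside the shell and print its code point instead of the character itself. printf -v is used because command substitution strips trailing newlines. The helper name class_codepoints is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the ASCII code points that belong to a POSIX class.
class_codepoints() {
    local LC_ALL=C                 # assumption: pin the locale for repeatable results
    local pat="[[:$1:]]" i char
    for ((i = 1; i < 128; i++)); do
        # printf -v stores the character directly, so a newline survives.
        printf -v char "\\$(printf '%03o' "$i")"
        if [[ $char == $pat ]]; then
            printf 'U+%04X\n' "$i"
        fi
    done
}

class_codepoints cntrl
```

In the C locale this lists U+0001 through U+001F plus U+007F, including U+000A for the line feed.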
A better approach would be to try to interpret your platform's locale definition files, but this is highly platform dependent.
POSIX character classes are internally defined. For grep, you can find them via the re_format man page.
We no longer live in an ASCII-based world. For example, you may assume that [[:digit:]] includes just the characters 0 through 9. However, it could also include the characters ٠ through ٩, or the characters ۰ to ۹ ¹, or even the characters ๐ to ๙. It all depends on what language you use and how you've set up your computer.
Also, we can no longer assume that a character is equivalent to a byte. Characters can now include multibyte sequences. Using octal codes to represent a character and translating it won't work.
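As a small illustration of the byte/character distinction, assuming UTF-8 encoding: the single character é (U+00E9) is stored as two bytes, so a loop over single byte values never produces it. The helper name byte_count is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: number of bytes in a string.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

e_acute=$(printf '\303\251')   # the UTF-8 bytes 0xC3 0xA9, i.e. U+00E9
byte_count "$e_acute"          # 2
byte_count 'e'                 # 1
```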
It depends upon your computer and OS. If you're writing your programs on a TRS80 or a PDP11, there's a good chance you're still using ASCII coding. Thus, you can flip through all 128 (or 256) possible character codes. If you're on a Mac or Linux system, there's a good chance that you're using Unicode code points represented with UTF-8 encoding.
On Windows, you could be using a 256-code-point character set. By default, this is CP1252 in the U.S., but it varies around the world. Then again, Windows is also very good at Unicode and UTF-8. But Windows uses UTF-16 internally for its file system.
The point is that you simply cannot flip through all the characters. You could run your shell script on two different systems and get two completely different results based upon the environment, the computer, and operating system.
¹ Although they look identical, Arabic and Persian numbers involve two different Unicode code points, and thus are different digits.
[:digit:] includes the digits in various languages when considering Unicode. The character encoding is only relevant when considering a particular implementation of a solution. Overall, though, this is more of a comment to request clarification of the original question than an answer. – Anta
Another "same but different" approach, just because the OP asked about POSIX character classes while many responses rely on non-POSIX components.
AFAIK, this method is 100% POSIX, although I may be abusing printf and awk.
I don't know if this suits all use cases or locales, but "POSIXLY" this seems to work for me for 0-127. It also passes shellcheck.
I imagine you could extend the range as needed. It is a little long-winded, but I consider that the cost of readability.
Simply change :alpha: to the class of your choice.
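#!/bin/sh
# Export the locale so the awk and tr child processes see it too.
LC_ALL=C
export LC_ALL
i=0
true > members          # start with an empty output file
while [ "$i" -le 127 ]
do
    # Turn the code into a character, then keep it if it is in the class.
    echo "$i" | awk '{ printf "%c", $0 }' | awk '/[[:alpha:]]/ { print }'
    # echo "$i" | awk '{ printf "%c", $0 }' | grep '[[:alpha:]]'
    i=$(( i + 1 ))
done >> members
# Join the one-per-line members into a single line.
printf "%s\n" "$(tr -d '\n' < members)"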
It has the added advantage that the final output can be munged for other uses.
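The temporary file and the per-character pipelines are not strictly needed. A possible condensed variant, still limited to 0-127 and the C locale, runs the whole loop in one awk process and takes the class name as an argument. The function name class_string is hypothetical, and the sketch assumes an awk that supports POSIX character classes:

```shell
#!/bin/sh
# Hypothetical helper: print the ASCII members of a POSIX class as one string.
class_string() {
    LC_ALL=C awk -v class="$1" 'BEGIN {
        re = "[[:" class ":]]"          # build the bracket expression at run time
        for (i = 1; i <= 127; i++) {
            c = sprintf("%c", i)        # numeric argument: character with that code
            if (c ~ re) out = out c
        }
        print out
    }'
}

class_string digit    # 0123456789
class_string xdigit   # 0123456789ABCDEFabcdef
```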
jot is far more convenient and flexible than seq.
jot -s '' 10 0        # print it numerically
jot -s '' -c 10 48    # print it via ASCII ordinals
0123456789
To print the upper- and lower-case letters in ASCII, do
jot -s '' -c 26 65    # 65 = 9^2 - 4^2
jot -s '' -c 26 97    # 97 = 3^4 + 2^4
ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
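jot is a BSD utility and is not specified by POSIX, so it may be missing on some systems. A rough stand-in for the -c form, sketched with awk (the function name ascii_run is hypothetical):

```shell
#!/bin/sh
# ascii_run COUNT START: print COUNT consecutive characters starting at code START.
ascii_run() {
    awk -v n="$1" -v start="$2" 'BEGIN {
        for (i = 0; i < n; i++) printf "%c", start + i
        print ""
    }'
}

ascii_run 10 48    # 0123456789
ascii_run 26 65    # ABCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_run 26 97    # abcdefghijklmnopqrstuvwxyz
```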
——————————
UPDATE: here's an overview of how gawk matches POSIX character classes to UTF-8 (although I think gawk mismatches around a couple dozen):
R = UTF-16 surrogate range of D800-DFFF, matched via floor( codepoint / 2^11 ) == 3^3
A = alpha | U = upper | L = lower | M = alnuM | D = digit | X = xdigit
G = graph | P = print | T = punct | C = cntrl | S = space | B = blank
48 RAU_M_XGP____
160 RAU_M__GP____
48 RA_LM_XGP____
160 RA_LM__GP____
80 R___MDXGP____
256 R______GPT___
8 R_______P__SB
8 R_________CSB
32 R_________CS_
224 R_________C__
1,024 R____________ # surrogates D[8-F][8-F][0-F]
6 _AU_M_XGP____
1,179 _AU_M__GP____
6 _A_LM_XGP____
1,346 _A_LM__GP____
31 _A__M__GP____ # in alpha and alnum but neither case
1 __U____GP____ # only in upper but neither alpha nor alnum
1 ___L___GP____ # only in lower but neither alpha nor alnum
10 ____MDXGP____ # just the ASCII digits matched
5,187 _______GPT___
252,248 _______GP____
18 ________P__SB
2 ________P__S_
6 ________P____ ***
1 __________CSB # horizontal-tab 0x09 \011 \t
5 __________CS_
190 __________C__
851,827 _____________
*** :: interestingly, from gawk's perspective, these belong to [[:print:]] but not to [[:graph:]]
U+ 10B3A | 68,410 | [ 𐬺 ]
U+ 10B3B | 68,411 | [ 𐬻 ]
U+ 10B3C | 68,412 | [ 𐬼 ]
U+ 10B3D | 68,413 | [ 𐬽 ]
U+ 10B3E | 68,414 | [ 𐬾 ]
U+ 10B3F | 68,415 | [ 𐬿 ]
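The surrogate test in the legend, floor( codepoint / 2^11 ) == 3^3, works because D800-DFFF is exactly the block of 2,048 code points whose value divided by 2048 has integer part 27. A small sketch in shell arithmetic (the function name is_surrogate is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: succeed if the decimal code point is a UTF-16 surrogate.
is_surrogate() {
    # Integer division by 2^11 = 2048; surrogates are the block where the quotient is 27.
    [ $(( $1 / 2048 )) -eq 27 ]
}

is_surrogate 55296 && echo "U+D800 is a surrogate"
is_surrogate 57343 && echo "U+DFFF is a surrogate"
is_surrogate 57344 || echo "U+E000 is not a surrogate"
```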