Print POSIX character class

Given a class, such as

[:digit:]

I would like the output to be

0123456789

Note that the method should work for all POSIX character classes. Here is what I have tried:

$ printf %s '[:digit:]'
[:digit:]

§ Character classes

Tasset answered 23/10, 2014 at 18:21 Comment(1)
T
1
$ seq 126 | awk '{printf "%c", $0}' | grep -o '[[:digit:]]'
0
1
2
3
4
5
6
7
8
9
Tasset answered 23/10, 2014 at 19:14 Comment(1)
You don't need grep when you're using awk - $ seq 126 | awk '($0=sprintf("%c", $0)) ~ /[[:digit:]]/'. You don't actually need seq either - awk 'BEGIN{for (i=1; i<=126; i++) if ((c=sprintf("%c", i)) ~ /[[:digit:]]/) print c}'Zared
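To get the single-line output the question asks for, the awk-only variant from the comment can print each match without a trailing newline. This is a minimal sketch, assuming an awk that supports POSIX character classes; the function name print_class_awk and the awk variable cls are made up for illustration, and only codes 1 to 126 are scanned, as in the answer.

```shell
# print_class_awk CLASS: print the characters with codes 1-126 that match
# [[:CLASS:]], all on one line. Hypothetical helper, not from the answer.
print_class_awk() {
    awk -v cls="$1" 'BEGIN {
        re = "[[:" cls ":]]"            # build the bracket expression dynamically
        for (i = 1; i <= 126; i++) {
            c = sprintf("%c", i)        # numeric argument: character with code i
            if (c ~ re) printf "%s", c  # no newline between matches
        }
        print ""                        # single trailing newline
    }'
}

print_class_awk digit   # prints 0123456789
```

Passing the class name as an argument avoids editing the regex by hand for each class.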
A
3

I'm sure there's a better way but here's a brute force method:

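for i in {1..127}; do
    # Start at 1: a NUL byte (code 0) cannot be stored in a shell variable.
    # Convert the decimal code to a character via its octal escape.
    char=$(printf "\\$(printf '%03o' "$i")")
    # Print the character if it belongs to the class.
    [[ $char =~ [[:alpha:]] ]] && echo "$char"
done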

Loop through all the decimal character values, convert them to the corresponding ASCII character and test them against the character class.

The range might be wrong but the check seems to work.

As others have mentioned in the comments, it is also possible to use the pattern-matching operator == instead of the regex operator =~ here, which may be slightly faster.
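The == variant mentioned in the comments could look roughly like this. It is a sketch that assumes bash (for [[ ]], printf -v and arithmetic for loops); the function name class_members is hypothetical, and the range is still limited to ASCII codes 1 to 127.

```shell
#!/usr/bin/env bash
# class_members CLASS: print the ASCII characters (codes 1-127) in [[:CLASS:]].
# Hypothetical helper; requires bash.
class_members() {
    local cls=$1 pat i char out=''
    pat="[[:${cls}:]]"                 # glob pattern, e.g. [[:alpha:]]
    for ((i = 1; i <= 127; i++)); do
        # printf -v avoids a command substitution, so a newline is kept intact.
        printf -v char "\\$(printf '%03o' "$i")"
        # Unquoted $pat on the right of == is treated as a pattern, not a string.
        if [[ $char == $pat ]]; then
            out+=$char
        fi
    done
    printf '%s\n' "$out"
}

class_members digit   # prints 0123456789
```

Collecting the matches in a variable gives the one-line output from the question instead of one character per line.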

Ambit answered 23/10, 2014 at 18:35 Comment(5)
On my system, this catches 52 characters and misses 101487.Winona
You can use the pattern match operator == instead of the regex operator =~Favor
@glennjackman interesting, I didn't know that. A bit of a special case then?Ambit
Not a special case, but character classes can be used in both patterns and regular expressions.Anta
@Anta thanks, will edit later. I assume the == is faster?Ambit
W
2

Similar to the other suggestions, you can find all matching Unicode 4.0 single codepoint graphemes in your current locale with:

for ((i = 0; i < 0x110000; i++)); do
  printf "\U$(printf '%x' "$i")\n"
done | grep -a '^[[:alpha:]]$'

Here is a non-exhaustive list of problems with this approach:

  • Combining characters such as $'E\U0301', which is two code points rendered as one grapheme (this particular sequence canonicalizes to the single codepoint É). This is especially awkward for languages like Malayalam that depend entirely on combination.

  • It has some issues with the cntrl class, specifically line feeds: the output is newline-delimited, so a line feed can never show up as a match.

  • Ruby characters, which I can't seem to render on Stack Overflow. Fortunately, these are generally deprecated in favor of proper markup.

  • It's slow.

A better approach would be to try to interpret your platform's locale definition files, but this is highly platform dependent.

Winona answered 23/10, 2014 at 20:30 Comment(0)
E
1

POSIX character classes are internally defined. For grep, you can find them via the re_format man page.

We no longer live in an ASCII-based world. For example, you might assume that [[:digit:]] includes just the characters 0 through 9. However, it could also include the characters ٠ through ٩, or the characters ۰ through ۹ (see note 1 below), or even the digits of some other script. It all depends on what language you use and how you've set up your computer.

Also, we can no longer assume that a character is equivalent to a byte. A character can now be a multibyte sequence, so translating a single octal code into a character won't work for such characters.
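The byte-versus-character distinction can be seen with wc. The following is only a sketch: the helper names byte_count and char_count are invented for illustration, and the Arabic-Indic digit nine (U+0669) is written as its two UTF-8 bytes so the script itself stays ASCII.

```shell
# byte_count STRING: number of bytes in STRING (wc -c always counts bytes).
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

# char_count STRING: number of characters according to the current locale.
char_count() {
    printf '%s' "$1" | wc -m | tr -d ' '
}

ascii_nine=9
arabic_nine=$(printf '\331\251')   # bytes D9 A9: UTF-8 encoding of U+0669

byte_count "$ascii_nine"    # prints 1
byte_count "$arabic_nine"   # prints 2
char_count "$arabic_nine"   # locale-dependent, so no expected value is given
```

A loop over single byte values 0 to 255 can therefore never produce this digit as one character in a UTF-8 locale.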

It depends upon your computer and OS. If you're writing your programs on a TRS-80 or a PDP-11, there's a good chance you're still using ASCII. In that case, you can flip through all 128 (or 256) possible code values. If you're on a Mac or Linux system, there's a good chance that you're using Unicode code points encoded as UTF-8.

On Windows, you could be using a 256-character code page. By default, this is CP1252 in the U.S., but it varies around the world. Then again, Windows is also very good at Unicode and UTF-8, although it uses UTF-16 internally for its file system.

The point is that you simply cannot flip through all the characters. You could run your shell script on two different systems and get two completely different results based upon the environment, the computer, and operating system.


1 Although they look identical, Arabic and Persian digits are two different sets of Unicode code points, and thus are different digits.

Evelyne answered 23/10, 2014 at 19:32 Comment(2)
Good point about whether or not [:digit:] includes the digits in various languages when considering Unicode. The character encoding is only relevant when considering a particular implementation of a solution. Overall, though, this is more of a comment to request clarification of the original question than an answer.Anta
I don't understand the point about not being able to flip through the characters because it's different on different systems. Are you assuming that the reason why OP wants a way to dynamically determine characters in a class is because he wants to hard code the list and distribute it?Winona
D
0

Another "same but different" approach, just because the OP asked about POSIX character classes while many responses rely on non-POSIX components.

AFAIK, this method is 100% POSIX, although I may be abusing printf and awk.

I don't know if this suits all use cases or locales, but "POSIXLY" this seems to work for me for the range 0-127. It also passes shellcheck.

I imagine you could extend the range as needed. It is a little long-winded, but I consider that the cost of readability.

Simply change :alpha: to the class of your choice.

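#!/bin/sh

# Export the locale so that the awk child processes actually see it.
LC_ALL=C
export LC_ALL
i=0
true > members    # truncate the output file
while [ "$i" -le 127 ]
do
    # Turn the code into a character, then keep it only if it is in the class.
    echo "$i" | awk '{ printf "%c", $0 }' | awk '/[[:alpha:]]/ { print }'
    # echo "$i" | awk '{ printf "%c", $0 }' | grep '[[:alpha:]]'

    i=$(( i + 1 ))
done >> members

# Join the matches into a single line.
printf "%s\n" "$(tr -d '\n' < members)"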

It has the added advantage that the final output can be munged for other uses.
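If editing the class name by hand is inconvenient, the same idea can be wrapped in a function that takes the class as an argument and needs neither awk nor a temporary file. This is a sketch under assumptions: print_class is a hypothetical name, the shell's case patterns are assumed to support POSIX character classes, and only codes 1 to 127 are scanned.

```shell
#!/bin/sh
# print_class CLASS: print, on one line, the ASCII characters in [[:CLASS:]].
# Hypothetical helper. Newline (code 10) is never reported, because the
# command substitution below strips trailing newlines.
print_class() {
    pat="[[:$1:]]"
    i=1
    out=
    while [ "$i" -le 127 ]; do
        # Decimal code -> three-digit octal escape -> character.
        c=$(printf "\\$(printf '%03o' "$i")")
        case $c in
            $pat) out=$out$c ;;     # unquoted $pat is matched as a pattern
        esac
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}

print_class digit    # prints 0123456789
```

Using case keeps the match inside the shell, which avoids starting two awk processes per character.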

Dopester answered 13/5, 2023 at 15:16 Comment(0)
L
0

jot is far more convenient and flexible than seq.

jot -s ''    10  0   # print it numerically     
jot -s '' -c 10 48   # print it via ASCII ordinals

0123456789

To print the upper- and lower-case letters in ASCII, do

jot -s '' -c 26 65   # 65 = 9^2 - 4^2
jot -s '' -c 26 97   # 97 = 3^4 + 2^4

ABCDEFGHIJKLMNOPQRSTUVWXYZ

abcdefghijklmnopqrstuvwxyz
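jot is a BSD utility and may not be installed on a given Linux system. Where it is missing, a small shell function can produce the same runs of ASCII characters. This is only a sketch; ascii_run is a hypothetical name, and its two arguments mirror the count and start values passed to jot above.

```shell
# ascii_run COUNT START: print COUNT consecutive characters beginning at the
# decimal code START, on one line. Hypothetical stand-in for jot -s '' -c.
ascii_run() {
    n=$1
    code=$2
    out=
    while [ "$n" -gt 0 ]; do
        # Decimal code -> three-digit octal escape -> character.
        out=$out$(printf "\\$(printf '%03o' "$code")")
        code=$(( code + 1 ))
        n=$(( n - 1 ))
    done
    printf '%s\n' "$out"
}

ascii_run 10 48    # prints 0123456789
ascii_run 26 97    # prints abcdefghijklmnopqrstuvwxyz
```

Like jot with fixed ordinals, this hard-codes ASCII positions rather than asking the locale what a class contains.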

——————————

UPDATE: here's an overview of how gawk matches POSIX character classes to UTF-8 (although I think gawk mismatches around a couple of dozen):


R = UTF-16 surrogate range of D800-DFFF, matched via floor( codepoint / 2^11 ) == 3^3

| A = alpha | U = upper | L = lower | M = alnuM | D = digit | X = xdigit

| G = graph | P = print | T = punct | C = cntrl | S = space | B = blank


          48 RAU_M_XGP____
         160 RAU_M__GP____
          48 RA_LM_XGP____
         160 RA_LM__GP____
          80 R___MDXGP____

         256 R______GPT___
           8 R_______P__SB
           8 R_________CSB
          32 R_________CS_
         224 R_________C__
       1,024 R____________ # surrogates D[8-F][8-F][0-F]

           6 _AU_M_XGP____
       1,179 _AU_M__GP____
           6 _A_LM_XGP____
       1,346 _A_LM__GP____

          31 _A__M__GP____ # in alpha and alnum but neither case
           1 __U____GP____ # only in upper but neither alpha nor alnum
           1 ___L___GP____ # only in lower but neither alpha nor alnum

          10 ____MDXGP____ # just the ASCII digits matched

       5,187 _______GPT___
     252,248 _______GP____
          18 ________P__SB
           2 ________P__S_
           6 ________P____ ***

           1 __________CSB # horizontal-tab 0x09 \011 \t
           5 __________CS_ 
         190 __________C__
     851,827 _____________ 

*** :: interestingly, from gawk's perspective, these belong to [[:print:]] but not to [[:graph:]]

 U+ 10B3A |  68,410 | [ 𐬺 ] 
 U+ 10B3B |  68,411 | [ 𐬻 ] 
 U+ 10B3C |  68,412 | [ 𐬼 ] 
 U+ 10B3D |  68,413 | [ 𐬽 ] 
 U+ 10B3E |  68,414 | [ 𐬾 ] 
 U+ 10B3F |  68,415 | [ 𐬿 ] 
Lori answered 13/5, 2023 at 19:57 Comment(0)
