Matching extended ASCII characters in .NET Regex

I'm writing a .NET regular expression that needs to match all ASCII and extended ASCII characters except for control characters.

To do this, I consulted the ASCII table, and it seems that all these characters have codes in the range x20 to xFF.

So I suppose

[\x20-\xFF]

should be able to match all the characters that I need. However, in reality, some characters can be matched, while others cannot. For example, if you test with the online tool http://regexhero.net/tester/, or write a simple C# program, you will find that some characters such as "ç" (xE7) can be matched, but some characters such as "œ" (x9C) cannot.

Does anyone have any idea why the regex does not work?

Cheder answered 5/3, 2015 at 14:44 Comment(2)
I copied your œ symbol from the question and checked it via (int)'œ': it shows 339 (0x153), which is outside the range. - Marzipan
"Extended ASCII" was a mistake in the previous century, responsible for the code page disaster. .NET uses Unicode. You'll have to recreate the disaster. - Krein

I've tried to reproduce your error and found nothing wrong with your code:

using System;
using System.Text.RegularExpressions;

String pattern = @"[\x20-\xFF]";

// Every character in U+0020..U+00FF must match
for (Char ch = ' '; ch <= 255; ++ch)
  if (!Regex.IsMatch(ch.ToString(), pattern))
    Console.Write("Failed!");

// No character above U+00FF may match
for (Char ch = (Char)256; ch < Char.MaxValue; ++ch)
  if (Regex.IsMatch(ch.ToString(), pattern))
    Console.Write("Failed!");

Then I examined your samples:

 ((int)'ç').ToString("X2"); // <- returns E7, OK
 ((int)'œ').ToString("X2"); // <- returns 153 NOT x9C 

Note that 'œ' (x153) is actually outside [0x20..0xFF], and that's why the match returns false. So I guess you've got a typo.

Marzipan answered 5/3, 2015 at 15:7 Comment(1)
Thank you so much. I realized that the numeric value in a .NET regex is the Unicode code point, not the extended ASCII code. In Unicode, œ is x153, and in extended ASCII it is x9C. - Cheder

As I wrote in https://mcmap.net/q/1627061/-expressing-byte-values-gt-127-in-net-strings, you can use

var enc = Encoding.GetEncoding("ISO-8859-1");

to decode the bytes into a string that uses the same codes:

string str = enc.GetString(yourBytes);

Then you can use the regex you wrote. Note that what I'm doing is a cheat: "ASCII" is too little information. You would need to tell me which code page you were using, because the block 80-FF was mapped in various ways depending on the locale (the "code pages"), so œ was not 9C everywhere. And if you look at the string generated by that encoder, you won't get an œ; you will get a character with the code 0x9C.
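A minimal sketch of that approach, assuming the input arrives as raw single-byte data. The sample bytes in yourBytes are made up for illustration. Because ISO-8859-1 maps byte N to the character U+00NN, the class [\x20-\xFF] ends up testing the byte values themselves:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical input: 'A', 0xE7, 0x9C and a line feed as raw bytes.
byte[] yourBytes = { 0x41, 0xE7, 0x9C, 0x0A };

// ISO-8859-1 decodes every byte value N to the character U+00NN.
Encoding enc = Encoding.GetEncoding("ISO-8859-1");
string str = enc.GetString(yourBytes);

// The character class from the question, applied one character at a time.
Regex printable = new Regex(@"[\x20-\xFF]");
foreach (char ch in str)
    Console.WriteLine($"0x{(int)ch:X2} -> {printable.IsMatch(ch.ToString())}");
// Prints True for 0x41, 0xE7 and 0x9C, and False for 0x0A.
```

Byte 0x9C passes here as the character U+009C, not as œ, which is exactly the cheat described above.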

If you want a C# string that "prints" the same as the text you have, you'll need to use

var enc = Encoding.GetEncoding("Windows-1252");

(it is a Microsoft extension of ISO-8859-1 that includes the œ character at 0x9C)

But note that in that case you won't be able to use such a simple regex, because some of your 80-FF codes (those in 80-9F, such as œ) will be mapped to code points scattered around the 0000-FFFF Unicode range.
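One possible workaround, sketched here rather than taken from the question, is to skip the regex for this check and ask the encoding itself whether a character has a Windows-1252 byte of 0x20 or above. The helper name IsPrintableCp1252 is hypothetical, and the RegisterProvider call assumes .NET Core 3.0+ or .NET 5+, where the Windows code pages are not registered by default:

```csharp
using System;
using System.Text;

// Assumption: .NET Core / .NET 5+, where Windows code pages need registering.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Throw on unmappable characters instead of silently substituting '?'.
Encoding cp1252 = Encoding.GetEncoding(
    "Windows-1252",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);

// Hypothetical helper: true if ch encodes to a single byte in 0x20..0xFF.
bool IsPrintableCp1252(char ch)
{
    try
    {
        byte[] bytes = cp1252.GetBytes(new[] { ch });
        return bytes.Length == 1 && bytes[0] >= 0x20;
    }
    catch (EncoderFallbackException)
    {
        return false; // the character has no Windows-1252 representation
    }
}

Console.WriteLine(IsPrintableCp1252('ç'));   // True  (byte 0xE7)
Console.WriteLine(IsPrintableCp1252('œ'));   // True  (byte 0x9C)
Console.WriteLine(IsPrintableCp1252('\n'));  // False (byte 0x0A)
```

Like the original range, this still accepts DEL (0x7F), so it only approximates "no control characters".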

Ah... and clearly you could have sidestepped this problem with:

[^\x00-\x1F]

(not 0x00-0x1F) :-)
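A short sketch of the negated-class idea, using \x1F (the last C0 control character) as the upper bound. For comparison it also shows \P{Cc}, the .NET regex syntax for "not in the Unicode control category", which is an alternative not mentioned in the answer. The sample strings are made up:

```csharp
using System;
using System.Text.RegularExpressions;

// Excludes only the C0 control characters U+0000..U+001F.
Regex notC0 = new Regex(@"^[^\x00-\x1F]+$");
// Excludes every character in Unicode category Cc (C0, DEL and C1 controls).
Regex notCc = new Regex(@"^\P{Cc}+$");

string[] samples = { "ç", "œ", "tab\there", "del\u007F" };
foreach (string s in samples)
    Console.WriteLine($"{notC0.IsMatch(s)} {notCc.IsMatch(s)}");
// "ç" and "œ" pass both checks, the tab fails both,
// and DEL passes the first but fails the second.
```

Both patterns also accept every character above U+00FF, so they are looser than the original [\x20-\xFF] range.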

Aminaamine answered 5/3, 2015 at 15:4 Comment(0)
