Match.Value and international characters

UPDATE: I hope this post helps other coders working with RichTextBoxes. The Match is correct for a normal string. I overlooked that, and I also did not notice that "ä" is stored as "\'e4" in richTextBox.Rtf! So Match.Value is correct; this was human error.

A Regex finds the correct text, but Match.Value is wrong because it replaces the German "ä" with "\'e4"!

Let example_text = "<em>Primär-ABC</em>" and let's use the following code:

string example_text = "<em>Primär-ABC</em>";
Regex em = new Regex(@"<em>[^<]*</em>");
Match emMatch = em.Match(example_text);        // Works!
// Match emMatch = em.Match(richTextBox.Rtf);  // Fails!
while (emMatch.Success)
{
    string matchValue = emMatch.Value;
    Foo(matchValue); // ...
    emMatch = emMatch.NextMatch(); // advance to the next match, otherwise the loop never ends
}

Then emMatch.Value contains "Prim\'e4r-ABC" instead of "Primär-ABC".

The German ä is turned into \'e4! Because I want to work with the exact string, I need emMatch.Value to be Primär-ABC. How do I achieve that?

Rosyrot answered 27/7, 2012 at 8:49 Comment(2)
The code looks good, but: 1. How do you know that emMatch.Value contains \'e4? Do you print it? 2. Could you show the value of example_text in the same way and make sure that it does not contain \'e4? Concertante
Ohhh!!! I am really sorry! The RTF contains the \'e4, but this is later displayed as "ä". Rosyrot

In what context are you doing this?

string example_text = "<em>Ich bin ein Bärliner</em>";
Regex em = new Regex(@"<em>[^<]*</em>" );
Match emMatch = em.Match(example_text);
while (emMatch.Success)
{
    Console.WriteLine(emMatch.Value);
    emMatch = emMatch.NextMatch();
}

This outputs <em>Ich bin ein Bärliner</em> in my console.

The problem probably isn't that you're getting the wrong value back; it's that you're getting a representation of the value that isn't displayed correctly. This can depend on a lot of things. Try writing the value to a text file using UTF-8 encoding and see if it is still incorrect.

Edit: Right. The thing is that you are getting the text from a WinForms RichTextBox using the Rtf property. This will not return the text as is, but will return the RTF representation of the text. RTF is not plain text; it's a markup format for displaying rich text. If you open an RTF document in e.g. Notepad, you will see that it has a lot of weird codes in it, including \'e4 for every 'ä' in your RTF document. If you had used some formatting (like bold text, color etc.) in the RTF box, the .Rtf property would return that code as well, looking something like {\rtlch\fcs1 \af31507 \ltrch\fcs0 \cf6\insrsid15946317\charrsid15946317 test}

So use the .Text property instead. It will return the actual plain text.
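To make the difference concrete, here is a minimal console sketch. The two hard-coded strings are only stand-ins for what richTextBox.Rtf and richTextBox.Text might return for the same content (the RTF header is abbreviated and made up for illustration); the Regex is the one from the question.

```csharp
using System;
using System.Text.RegularExpressions;

class RtfVersusTextDemo
{
    static void Main()
    {
        // Assumed stand-in for richTextBox.Rtf: RTF markup, where "ä" is escaped as \'e4.
        string rtf = @"{\rtf1\ansi\ansicpg1252\deff0 <em>Prim\'e4r-ABC</em>\par}";

        // Assumed stand-in for richTextBox.Text: the plain text the user sees.
        string text = "<em>Primär-ABC</em>";

        Regex em = new Regex(@"<em>[^<]*</em>");

        // The Regex matches in both cases; only the input differs.
        string fromRtf = em.Match(rtf).Value;   // value is <em>Prim\'e4r-ABC</em>
        string fromText = em.Match(text).Value; // value is <em>Primär-ABC</em>

        Console.WriteLine(fromRtf);
        Console.WriteLine(fromText);
    }
}
```

In both cases Match.Value returns exactly the characters that were in the input, so the escape comes from the RTF string, not from the Regex.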

Woodbury answered 27/7, 2012 at 8:54 Comment(4)
Äh.. very good question and example! It works with a normal string! The context is that my real code looks like this: Match emMatch = em.Match(richTextBox.Rtf), because I want to highlight in yellow (like a text marker) all text that is enclosed in <em> tags and remove these tags. Rosyrot
Have you tried completing the operation despite the match output looking wrong to you? It might still work. I'm not too big on RichTextBox. Is this WinForms? Woodbury
It is WinForms with .NET 2.0 :-( Sadly it does not work, the text is not replaced. But at least I have a new direction now. I did not think of using a normal string as a test, doh! See answer above. Rosyrot
Try using richTextBox.Text instead of richTextBox.Rtf. Then it works for me at least. But it depends on what you are actually doing with it later. RTF is not fun to play with. :) Woodbury
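For the highlighting goal mentioned in the comments, one possible approach is to match against .Text and then apply the formatting through the selection API instead of editing the RTF string. This is only a rough sketch, not the asker's actual code: EmHighlighter and HighlightEmTags are hypothetical names, and it assumes that match positions in box.Text line up with the positions RichTextBox.Select expects.

```csharp
using System.Drawing;
using System.Text.RegularExpressions;
using System.Windows.Forms;

static class EmHighlighter
{
    // Hypothetical helper: highlights the content of every <em>...</em> pair
    // in yellow and removes the tags themselves.
    public static void HighlightEmTags(RichTextBox box)
    {
        Regex em = new Regex(@"<em>([^<]*)</em>");
        MatchCollection matches = em.Matches(box.Text);

        // Walk backwards so that removing tags does not shift the
        // positions of matches that have not been processed yet.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            string inner = m.Groups[1].Value;

            // Replace "<em>inner</em>" with just "inner".
            box.Select(m.Index, m.Length);
            box.SelectedText = inner;

            // Re-select the inner text and mark it yellow.
            box.Select(m.Index, inner.Length);
            box.SelectionBackColor = Color.Yellow;
        }
    }
}
```

Calling EmHighlighter.HighlightEmTags(richTextBox) once after the text has been loaded should be enough. Note that replacing via SelectedText may drop any formatting that existed inside the tagged span.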
