Removing hidden characters from within strings
P

10

49

My problem:

I have a .NET application that sends out newsletters via email. When a newsletter is viewed in Outlook, Outlook displays a question mark in place of a hidden character it can't recognize. These hidden characters come from end users who copy and paste the HTML that makes up the newsletters into a form and submit it. C#'s Trim() removes these hidden characters if they occur at the beginning or end of the string. When the newsletter is viewed in Gmail, Gmail does a good job of ignoring them. When I paste these hidden characters into a Word document and turn on the “show paragraph marks and hidden symbols” option, each one appears as a rectangle inside a bigger rectangle. The text of the newsletters can be in any language, so accepting Unicode characters is a must. I've tried looping through the string to detect the character, but the loop doesn't recognize it and passes over it. Asking the end user to paste the HTML into Notepad before submitting it is out of the question.

My question:
How can I detect and eliminate these hidden characters using C#?

Paramorphism answered 6/3, 2013 at 22:21 Comment(6)
Put an example here. - Curtain
Example invalid values would be nice. I'm guessing it's Unicode strings in ASCII text, but again that's just a guess. - Slaw
Regex: only allow letters and numbers. - Scoria
Possible duplicate of How do I detect non-printable characters in .NET? - Denyse
I don't know what the hidden char is. It only appears once displayed in Outlook or in Word. If I view the text in a SharePoint list (where it is stored) it is hidden. - Paramorphism
It has been a while, but this hasn't been answered yet. How do you include the HTML content in the sending code? If you are reading it from a file, check the file encoding. If you are using UTF-8 with signature (the name varies slightly between editors), this may cause the weird char at the beginning of the mail. - Anishaaniso
T
101

You can remove all control characters from your input string with something like this:

string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
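
Since the newsletters can be in any language, a filter based on Unicode categories may be a gentler option than keeping letters and digits only. The following is a minimal sketch, not a drop-in fix: the names HiddenCharCleaner, Clean and IsAllowed are made up for illustration, and it assumes that tab, line feed and carriage return should survive because the input is HTML. It drops control characters (Cc) and format characters (Cf, the category that contains the left-to-right mark U+200E) and keeps everything else.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class HiddenCharCleaner
{
    // Hypothetical helper: removes control (Cc) and format (Cf) characters,
    // but keeps tab, line feed and carriage return.
    public static string Clean(string input)
    {
        return new string(input.Where(IsAllowed).ToArray());
    }

    private static bool IsAllowed(char c)
    {
        // Assumption: this whitespace is wanted in pasted HTML.
        if (c == '\t' || c == '\n' || c == '\r')
        {
            return true;
        }

        UnicodeCategory category = char.GetUnicodeCategory(c);
        return category != UnicodeCategory.Control
            && category != UnicodeCategory.Format;
    }

    public static void Main()
    {
        // Contains a left-to-right mark (U+200E) and an EOT control character (U+0004).
        string dirty = "Si\u200E\u00F1alizaci\u00F3n\u0004 ok";

        // The two hidden characters are removed; the accented letters are kept.
        Console.WriteLine(Clean(dirty));
    }
}
```

One caveat with this choice: the Cf category also contains the zero-width joiner and non-joiner, which carry meaning in some scripts, so the filter may need an allowlist for those if the newsletters use such scripts.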
Troostite answered 6/3, 2013 at 22:27 Comment(9)
Thanks, I will try this. I'll try encoding it and immediately decoding it back to see if the hidden char is stripped out. - Paramorphism
HtmlEncode/Decode does not remove any characters, so I'm not sure how you recommend using it. - Triturate
@AlexeiLevenkov Yes, sorry, I misread the question... I'll update my answer accordingly. - Troostite
@Paramorphism Sorry, I misread your question. I've updated my answer to address your need. - Troostite
Thanks for your answer, Yannick. The question mark is still appearing in Outlook, so it must not be a control char that Outlook is having trouble with. I need to figure out what type of characters these are. - Paramorphism
@Paramorphism You're welcome! Have you tried the second solution I've proposed above? - Troostite
I don't know why, but Char.IsControl returns false for the left-to-right mark. - Fortunia
@YannickBlondeau That will also remove punctuation and special characters ("£$%^" etc.), so in my opinion the best solution is a combination of the two, or the answer I've added. - Fortunia
@IgorMeszaros LRM is a "format" character, but luckily C# has a GetUnicodeCategory(char c) method that can identify the category of any character. string clean = new string(e.Value.Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.Format).ToArray()); works just fine to remove the LRM. - Backwardation
R
29

I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed and carriage return to be non-printable characters, but for me they are not.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
  • [^...] matches any character that is not one of the following:
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~, that is, every printable ASCII character.

See an ASCII table if you want to make changes. Remember that this also replaces every non-ASCII character, including accented and non-Latin letters.

To test the above, you can build a string yourself like this:

    string input = string.Empty;

    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }
Razz answered 17/2, 2014 at 5:27 Comment(1)
I think the first ^ inverts the set, while the other ^s should not be there (will exclude ^ from the output). - Onia
F
9

What worked best for me is:

string result = new string(value.Where(c =>  char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());

Here I first check whether the character is a letter or digit, so that I don't drop any non-English letters. If it is not, I check whether it falls between space and byte.MaxValue (U+00FF), which drops the ASCII control characters below space while keeping punctuation.

Some suggest using IsControl to check whether a character is non-printable, but that misses the left-to-right mark (U+200E), for example, because it is a format character rather than a control character.
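
To find out which hidden character is actually in the string, it can help to print the code point and Unicode category of anything suspicious. This is only a sketch for diagnosis: HiddenCharInspector and Describe are hypothetical names, and the sample text with an embedded U+200E is an assumption, not the asker's real data.

```csharp
using System;
using System.Globalization;

public static class HiddenCharInspector
{
    // Hypothetical helper: describes one UTF-16 code unit, e.g. "U+200E Format".
    public static string Describe(char c)
    {
        return "U+" + ((int)c).ToString("X4") + " " + char.GetUnicodeCategory(c);
    }

    public static void Main()
    {
        // Assumed sample: "abc", a left-to-right mark, then "def".
        string text = "abc\u200Edef";

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            UnicodeCategory category = char.GetUnicodeCategory(c);

            // Report only the categories that usually render as nothing.
            if (category == UnicodeCategory.Control || category == UnicodeCategory.Format)
            {
                Console.WriteLine("index " + i + ": " + Describe(c)
                    + ", IsControl=" + char.IsControl(c));
            }
        }
        // prints: index 3: U+200E Format, IsControl=False
    }
}
```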

Fortunia answered 30/11, 2016 at 12:30 Comment(0)
T
7
new string(input.Where(c => !char.IsControl(c)).ToArray());

IsControl misses some invisible characters, such as the left-to-right mark (LRM), which is a format character rather than a control character and commonly hides in a string after copy and paste. If you are sure that your string should contain only letters and digits, you can use IsLetterOrDigit:

new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())

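If your string also has punctuation and other special characters, you can keep every ASCII character instead (note that this still lets ASCII control characters through):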

new string(input.Where(c => c < 128).ToArray())
Terrorism answered 15/3, 2017 at 0:27 Comment(1)
Unfortunately, from my unit testing, the last suggestion (new string(input.Where(c => c < 128).ToArray())) will also strip out accented characters. For example, "Siñalizacíon" will become "Sializacon". - Nesselrode
B
4

You can do this:

var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
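
As an illustration of the blocklist idea, here is a small sketch. The original answer does not say which characters to put in the list, so the four entries below are assumptions (invisible characters that commonly arrive via copy and paste), and the names HiddenCharBlocklist and Strip are made up. A HashSet is used so that each lookup stays cheap as the list grows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HiddenCharBlocklist
{
    // Assumed examples only; extend or replace to match the characters you actually see.
    private static readonly HashSet<char> Hidden = new HashSet<char>
    {
        '\u200B', // zero width space
        '\u200E', // left-to-right mark
        '\u200F', // right-to-left mark
        '\uFEFF', // zero width no-break space / byte order mark
    };

    public static string Strip(string input)
    {
        return new string(input.Where(c => !Hidden.Contains(c)).ToArray());
    }

    public static void Main()
    {
        Console.WriteLine(Strip("\uFEFFHello\u200E, World\u200B!")); // prints "Hello, World!"
    }
}
```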
Botulin answered 6/3, 2013 at 22:27 Comment(0)
U
2

TLDR Answer

Use this regex...

[^\p{Cc}\p{Cn}\p{Cs}]+

Like this...

var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]+");

TLDR Explanation

The negated character class [^...] matches runs of characters that are in none of the listed Unicode categories:

  • \p{Cc} : control characters.
  • \p{Cn} : unassigned code points.
  • \p{Cs} : surrogate code units (which are invalid in UTF-8).

Working Demo

In this demo, I use this regex to search the string "Hello, World!" followed by (char)4, the END OF TRANSMISSION control character.

using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // Matches runs of characters that are not control, unassigned, or surrogate.
        var regex = new Regex(@"[^\p{Cc}\p{Cn}\p{Cs}]+");
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            Console.WriteLine("Result: " + match);
        }
    }
}

Full Working Demo at IDEOne.com

The output from the above code:

Results: 1
Result: Hello, World!

Alternatives

  • \P{C} : Match any character outside the "Other" categories, that is, no control, format, surrogate, private-use, or unassigned characters.
  • \P{Cc} : Match only non-control characters. Do not match any control characters.
  • [^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
  • [^\p{Cc}\p{Cn}\p{Cs}] : Match only assigned, non-control characters that are not surrogates. Do not match any control, unassigned, or surrogate characters.
  • [^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : As above, and also exclude formatting characters such as the left-to-right mark.

Source and Explanation

Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these classes in Microsoft .NET, Java, PHP, Ruby, Perl, Golang, and even Adobe, though support varies by engine: JavaScript needs the u flag, and Python's built-in re module does not support \p{...} (the third-party regex module does). Knowing Unicode character classes is very transferable knowledge, so I recommend using it!

Unpack answered 30/6, 2021 at 15:16 Comment(0)
S
1

I experienced an error with the AWS S3 SDK "Target resource path[name -‎3.‎30.‎2022 -‎15‎.‎27.‎00.pdf] has bidirectional characters, which are not supportedby System.Uri and thus cannot be handled by the .NET SDK"

The filename in my case contained the Unicode character 'LEFT-TO-RIGHT MARK' (U+200E) between the dots. These characters were not visible in HTML or in Notepad++. When the text was pasted into the Visual Studio 2019 editor, the hidden characters became visible and I was able to solve the issue.

U+200E Left to Right Mark

The problem was solved by removing all control and other non-printable characters from the filename using the following line.

var input = Regex.Replace(s, @"\p{C}+", string.Empty);

Credit Source: https://mcmap.net/q/161998/-c-regex-to-remove-non-printable-characters-and-control-characters-in-a-text-that-has-a-mix-of-many-different-languages-unicode-letters
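
One thing to keep in mind is that \p{C} also covers tab, carriage return and line feed, which is fine for a filename but probably not for pasted HTML. The sketch below is one possible variation, assuming those three characters should be kept; it relies on .NET's character class subtraction syntax, and the names RegexCleaner and Clean are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleaner
{
    // [\p{C}-[\t\r\n]] is .NET character class subtraction:
    // every "Other" (C) character except tab, CR and LF.
    private static readonly Regex Hidden = new Regex(@"[\p{C}-[\t\r\n]]+");

    public static string Clean(string input)
    {
        return Hidden.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Assumed sample modelled on the filename above, with U+200E between the parts.
        string name = "name -\u200E3.\u200E30.\u200E2022.pdf";
        Console.WriteLine(Clean(name)); // prints "name -3.30.2022.pdf"
    }
}
```

Because .NET regexes work on UTF-16 code units and \p{C} includes the surrogate category, this pattern would, as far as I can tell, also remove characters outside the Basic Multilingual Plane such as emoji, so it may need adjusting if those must be preserved.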

Sewellyn answered 6/3, 2013 at 22:21 Comment(0)
M
1

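If you need speed, create a static method that looks like this (it needs using System; and a runtime with Span<T> support):

private static string RemoveControlCharacters(ReadOnlySpan<char> input)
{
    // Use the stack only for small inputs; a large stackalloc can overflow the stack.
    // The 256-character threshold is an arbitrary choice.
    Span<char> output = input.Length <= 256 ? stackalloc char[input.Length] : new char[input.Length];
    int j = 0;

    foreach (char c in input)
    {
        // Copy every character that is not a control character.
        if (!char.IsControl(c))
        {
            output[j++] = c;
        }
    }

    return new string(output.Slice(0, j));
}

It uses stackalloc to allocate the scratch buffer on the stack, which is faster than a heap allocation; only the final string is allocated on the heap.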

Mammillate answered 21/3, 2023 at 16:8 Comment(0)
J
0
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

This solved the problem for me. I had a non-printable substitute character (ASCII 26) in a string, which was causing my app to break, and this line of code removed it.

Joan answered 29/9, 2016 at 15:56 Comment(1)
This is identical to the highest-voted and accepted answer. - Kibbutznik
S
0

I used this quick and dirty one-liner to clean some input of the LTR/RTL marks left over by the broken Windows 10 calculator app. It's probably a far cry from perfect, but it's good enough for a quick fix:

string cleaned = new string(input.Where(c => !char.IsControl(c) && (char.IsLetterOrDigit(c) || char.IsPunctuation(c) || char.IsSeparator(c) || char.IsSymbol(c) || char.IsWhiteSpace(c))).ToArray());
Scar answered 17/7, 2020 at 21:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.