Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so I must make my program handle them gracefully instead of trying to change the source of the problem.
For example, they can contain a zero width no-break space in the middle of the string. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. While everything seems correct, inspecting it with irb shows:
"he is a \uFEFFman of god".codepoints
=> [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100]
I believe that I know what a BOM is, and I even handle it nicely. However, sometimes I have such characters in the middle of the file, so it is not a BOM.
My current approach is to remove all characters that I consider evil, in a really smelly fashion:
text = (text.codepoints - CODEPOINTS_BlACKLIST).pack("U*")
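As a self-contained sketch of that approach: the actual contents of CODEPOINTS_BlACKLIST are not shown, so a one-element list holding only U+FEFF is assumed here, and strip_blacklisted is a hypothetical helper name.

```ruby
# Assumed blacklist; the real list is not shown in the question.
CODEPOINTS_BlACKLIST = [0xFEFF].freeze

# Hypothetical helper wrapping the one-liner above.
def strip_blacklisted(text)
  # Array#- drops every occurrence of each blacklisted codepoint,
  # and pack("U*") rebuilds a UTF-8 string from the remaining ones.
  (text.codepoints - CODEPOINTS_BlACKLIST).pack("U*")
end

cleaned = strip_blacklisted("he is a \uFEFFman of god")
puts cleaned.codepoints.inspect
```

This works, but every unwanted codepoint has to be listed by hand, which is the smelly part.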
The closest I got was following this post, which led me to the [[:print:]] character class in regexps. However, it was no good for me:
"\uFEFFm".scan(/[[:print:]]/).join.codepoints
=> [65279, 109]
So the question is: how can I remove all non-printable characters from a string in Ruby?
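One possible direction, offered as a sketch rather than a definitive answer: U+FEFF belongs to the Unicode general category Cf (format), which Ruby's [[:print:]] does not exclude, so the scan above keeps it. Assuming format characters are the ones to drop, matching \p{Cf} directly is an option. The helper name strip_format_chars is hypothetical.

```ruby
# Hypothetical helper: removes Unicode "format" characters (general category Cf),
# which includes U+FEFF (zero width no-break space).
def strip_format_chars(text)
  text.gsub(/\p{Cf}/, "")
end

sample  = "he is a \uFEFFman of god"
cleaned = strip_format_chars(sample)
puts cleaned.codepoints.inspect

# [[:print:]] matches U+FEFF, which is why the scan-based attempt kept it.
print_matches_feff = !("\uFEFF" =~ /[[:print:]]/).nil?
```

Note that Cf also covers characters such as the zero width joiner (U+200D) and the soft hyphen (U+00AD), which can be meaningful in some texts, so this may be too aggressive for certain input.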
String#dump produces a new string with non-printing characters replaced by escape notation and special characters escaped. See the docs for String#dump (Ruby 2.3.0); I can confirm it is in the docs as early as 1.8.7. – Dysphemism
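A small sketch of what dump does with the string from the question. As far as the documented behavior goes, dump turns such characters into visible escape sequences rather than dropping them, so it is more useful for inspecting a string than for cleaning it. The exact escape spelling may differ between Ruby versions, so no expected output is shown.

```ruby
sample = "he is a \uFEFFman of god"
dumped = sample.dump

# The result is wrapped in double quotes and contains only ASCII characters;
# U+FEFF shows up as an escape sequence instead of an invisible character.
puts dumped
```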