I have been given an export from a MySQL database whose encoding seems to have been muddled over time. It contains a mix of HTML character codes such as &uuml; and more problematic characters representing the same letters, such as Ã¼ and Ã. It is my task to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.
An example of the sort of string I am dealing with is
DesinfektionslÃ¶sungstÃ¼cher fÃ¼r FlÃ¤chen
which should equate to
50 Tattoo Desinfektionsl ö sungst ü cher f ü r Fl ä chen
Is there a method available in C#/.NET 4.5 that would successfully re-encode the likes of Ã¼ and Ã to UTF-8? If not, what approach would be advisable?
Also, is the paragraph character ¶ in the above example string an actual paragraph character, or part of some other character combination?
I have created a lookup table, shown below, in case I need to do a find and replace; however, I am unsure how complete it is.
Ã‰ -> É
â€œ -> "
â€ -> "
Ã‡ -> Ç
Ãƒ -> Ã
Ã© -> é
Ãº -> ú
â€¢ -> -
Ã˜ -> Ø
Ãµ -> õ
Ã -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã -> à
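As a possible alternative to maintaining a table like the one above by hand, here is a minimal sketch of a round-trip repair. It assumes the damage comes from UTF-8 bytes that were mis-read as Windows-1252, which is what pairs such as Ã¼ suggest but which is not confirmed for the whole export. The class name `MojibakeFixer` and the method `Fix` are hypothetical names chosen for illustration; `Encoding.GetEncoding`, `UTF8Encoding` and `WebUtility.HtmlDecode` are standard .NET APIs.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper, not part of any library.
public static class MojibakeFixer
{
    // Windows-1252 with exception fallbacks, so characters that do not exist
    // in that code page raise an error instead of silently becoming '?'.
    // (On .NET Core and later, code page 1252 must first be registered via
    // CodePagesEncodingProvider; on .NET Framework 4.5 it is built in.)
    private static readonly Encoding Windows1252 = Encoding.GetEncoding(
        1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

    // Strict UTF-8: no BOM, throw on invalid byte sequences.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    public static string Fix(string input)
    {
        string repaired = input;
        try
        {
            // Recover the original bytes by re-encoding as Windows-1252,
            // then decode those bytes as the UTF-8 they presumably were.
            byte[] rawBytes = Windows1252.GetBytes(input);
            repaired = StrictUtf8.GetString(rawBytes);
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both
            // derive from ArgumentException. Reaching this point means the
            // text was not pure "UTF-8 read as Windows-1252", so it is left
            // unchanged rather than damaged further.
        }

        // HTML entities are plain ASCII and survive the round trip above,
        // so they are converted afterwards: "&uuml;" becomes "ü".
        return WebUtility.HtmlDecode(repaired);
    }
}
```

Under that assumption, `MojibakeFixer.Fix("DesinfektionslÃ¶sungstÃ¼cher fÃ¼r FlÃ¤chen")` returns `Desinfektionslösungstücher für Flächen`. The sketch works on a whole string at a time, so a value that mixes already-correct letters such as ü with mojibake fails the strict UTF-8 check and is returned with only its entities decoded. Sequences that lost a byte along the way (for example â€ for a closing quote) will not round-trip either, and a small lookup table may still be needed for those.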
Ã¼ and Ã are not "special characters" exactly, but Mojibake. – Precocious
Desinfektionslösungstücher für Flächen, which seems to be correct but in your expected result you have spaces. – Carrew