In our application, we receive text files (`.txt`, `.csv`, etc.) from diverse sources. When we read them, these files sometimes contain garbage, because they were created in a different or unknown codepage.
Is there a way to (automatically) detect the codepage of a text file?
The `detectEncodingFromByteOrderMarks` parameter on the `StreamReader` constructor works for `UTF8` and other Unicode-marked files, but I'm looking for a way to detect code pages such as `ibm850` and `windows1252`.
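To illustrate why byte-order-mark detection does not help here, the following is a minimal sketch in Python (not the .NET API mentioned above). The helper name `sniff_bom` and the sample text are hypothetical. It shows that a file saved in a single-byte code page carries no marker at all, and that the same bytes decode without error under more than one code page, just to different characters.

```python
import codecs

# Check the UTF-32 marks first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a codec name if `data` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# A name written by a program that uses windows-1252: no BOM is present.
sample = "François".encode("cp1252")

print(sniff_bom(sample))         # None
print(sample.decode("cp1252"))   # the intended text
print(sample.decode("cp850"))    # also decodes, but to a different string
```

Because both decodes succeed, nothing in the bytes themselves says which code page is the right one; some outside knowledge about the expected text is needed.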
Thanks for your answers. This is what I've done.
The files we receive come from end users who do not have a clue about codepages. The receivers are also end users; by now, this is what they know about codepages: codepages exist, and they are annoying.
Solution:
- Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, you can use your human intelligence to guess what the text should have been.
- I've created a small app that the user can open the file with, and in which they enter a piece of text that they know will appear in the file when the correct codepage is used.
- Loop through all codepages, and display the ones that produce a match for the user-provided text.
- If more than one codepage pops up, ask the user to specify more text.
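
The loop in the steps above could be sketched roughly as follows. This is an illustration in Python, not the actual app: the function name `find_codepages` and its parameters are hypothetical, and the list of codecs comes from Python's standard library rather than from the code pages installed on a Windows machine.

```python
import encodings.aliases
from typing import Iterable, List, Optional


def find_codepages(data: bytes, expected: str,
                   codec_names: Optional[Iterable[str]] = None) -> List[str]:
    """Return the codecs under which `data` decodes to text containing `expected`."""
    if codec_names is None:
        # Assumption: "all codepages" means every codec the Python
        # standard library lists, deduplicated and sorted.
        codec_names = sorted(set(encodings.aliases.aliases.values()))

    matches = []
    for name in codec_names:
        try:
            text = data.decode(name)
        except (UnicodeError, LookupError):
            # The bytes are invalid in this codec, or the codec is not a
            # text encoding or is unavailable on this platform.
            continue
        if expected in text:
            matches.append(name)
    return matches


# Simulate a file written on a system that uses ibm850.
data = "Client: François Dupont".encode("cp850")
print(find_codepages(data, "François", ["cp1252", "cp850", "utf_8"]))  # ['cp850']
```

When the full codec list is searched, several closely related code pages may match the same short text, which is the situation where the user is asked to supply more text.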