Yes, this is a most frequent question, and this matter is vague for me since I don't know much about it. But I would like a very precise way to find a file's encoding, as precise as Notepad++ is.
The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM). If the file does not have a BOM, this approach cannot determine the file's encoding.
*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE
/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
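For instance, the detected encoding can be fed straight back into a read (a minimal usage sketch; the file path is a placeholder):

string path = @"C:\Temp\sample.txt"; // placeholder path
Encoding encoding = GetEncoding(path);
string text = File.ReadAllText(path, encoding);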
If you look at how StreamReader does its detection, that implementation is what more people will want. It creates new encodings rather than using the existing Encoding.Unicode objects, so equality checks will fail (which might rarely happen anyway because, for instance, Encoding.UTF8 can return different objects), but it (1) doesn't use the really weird UTF-7 format, (2) defaults to UTF-8 if no BOM is found, and (3) can be overridden to use a different default encoding. – Benzel

Oddly, if you create a UTF7Encoding object and call GetPreamble() on it, you get an empty array... – Downwash

When you detect the UTF-32BE signature (00 00 FE FF), you return the system-provided Encoding.UTF32, which is a little-endian encoding (as noted here). And also, as noted by @Nyerguds, you are still not looking for UTF-32LE, which has the signature FF FE 00 00 (according to en.wikipedia.org/wiki/Byte_order_mark). As that user noted, because it is subsuming, the UTF-32LE check must come before the 2-byte checks. – Bachelor

Encoding.UTF32 is little endian, therefore the if check should be bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 – Messing

file.Read(bom, 0, 4) does not guarantee that the first 4 bytes are read. By contract it reads at least 1 and at most 4 (the given number of) bytes and returns how many were actually read. This is a rather low-level method that needs to be used in a loop; best make an extension method that does this for you (a sketch follows below). Also make sure the method does not throw an exception when the file is smaller than 4 bytes. – Olympus
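A minimal sketch of such a looped read, assuming nothing beyond System.IO (the helper name ReadBomBytes is made up for illustration):

// Hypothetical helper: keeps calling Read until 4 bytes arrive or the file ends.
private static byte[] ReadBomBytes(string filename)
{
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        int total = 0;
        while (total < bom.Length)
        {
            int read = file.Read(bom, total, bom.Length - total);
            if (read == 0) break; // fewer than 4 bytes in the file; no exception thrown
            total += read;
        }
    }
    return bom; // unread slots stay 0x00, which none of the BOM signatures above fully match
}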
The following code works fine for me, using the StreamReader class:
using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
{
    reader.Peek(); // you need this!
    var encoding = reader.CurrentEncoding;
}
The trick is to use the Peek call; otherwise, .NET has not done anything (and it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, it works too.

If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in that case, the encoding will by default be set to UTF-8 before any read), but I recommend defining what you consider the default encoding in your context.
I have tested this successfully with files with BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
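Wrapped into a small reusable method, the trick could look like this (a sketch; the method name is my own):

public static Encoding DetectEncodingOrDefault(string fileName, Encoding defaultEncodingIfNoBom)
{
    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, detectEncodingFromByteOrderMarks: true))
    {
        reader.Peek(); // forces the reader to actually inspect the BOM
        return reader.CurrentEncoding;
    }
}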
In PowerShell: foreach($filename in $args) { $reader = [System.IO.StreamReader]::new($filename, [System.Text.Encoding]::default,$true); $peek = $reader.Peek(); $reader.currentencoding | select bodyname,encodingname; $reader.close() } – Merwin

What about UTF-8 without BOM? – Centillion

Hence the defaultEncodingIfNoBom parameter in my sample code. It's up to you to decide what that could be depending on your context. It's often UTF-8 these days. – Kitti

new StreamReader(@"C:\Temp\File without BOM.txt", true).CurrentEncoding.EncodingName returns Unicode (UTF-8) – Yearround

Providing the implementation details for the steps proposed by @CodesInChaos:
1) Check if there is a byte order mark (BOM)
2) Check if the file is valid UTF-8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF-8 are not valid UTF-8. https://mcmap.net/q/11407/-how-to-detect-the-character-encoding-of-a-text-file explains the tactic in more detail.
using System;
using System.IO;
using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage
public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
           detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception e)
        {
            // Failed to decode the file using the BOM/UTF8.
            // Assume it's local ANSI
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
    }
}
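To illustrate why the verifier works, here is a small sketch (not part of the answer's code) showing a typical "ANSI" byte being rejected by the strict UTF-8 decoder:

// 0xE9 is 'é' in ISO-8859-1 but an incomplete sequence in UTF-8,
// so the exception-fallback decoder throws instead of substituting '�'.
var strictUtf8 = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
try
{
    strictUtf8.GetString(new byte[] { 0x63, 0x61, 0x66, 0xE9 }); // "café" as Latin-1 bytes
}
catch (DecoderFallbackException)
{
    Console.WriteLine("Not valid UTF-8; fall back to the ANSI codepage.");
}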
[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);
    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
        // Consume your file
        var line = reader.ReadLine();
        ...
    }
}
You could use reader.Peek() instead of while (!reader.EndOfStream) { var line = reader.ReadLine(); } – Aurthur

reader.Peek() doesn't read the whole stream. I found that with larger streams, Peek() was inadequate. I used reader.ReadToEndAsync() instead. – Manado

Note the line var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback()); It is used in the try block when reading a line. If the decoder fails to parse the provided text (the text is not encoded with UTF-8), Utf8EncodingVerifier will throw. The exception is caught, and we then know the text is not UTF-8 and default to ISO-8859-1. – Mitosis
Check this.
This is a port of Mozilla Universal Charset Detector and you can use it like this...
public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}",
                cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}
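If the data is already in memory, the detector can also be fed a byte array; a sketch, assuming the port's Feed(byte[], int, int) overload:

// Sketch: detection from an in-memory buffer instead of a stream.
byte[] buffer = File.ReadAllBytes(filename);
Ude.CharsetDetector detector = new Ude.CharsetDetector();
detector.Feed(buffer, 0, buffer.Length);
detector.DataEnd();
string charset = detector.Charset; // null when detection fails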
The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").
– Landwehr

I'd try the following steps:

1) Check if there is a byte order mark (BOM)
2) Check if the file is valid UTF-8
3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF-8 are not valid UTF-8.
When you construct a UTF8Encoding, you can pass in an extra parameter that determines whether an exception should be thrown or whether you prefer silent data corruption. – Mistymisunderstand
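That parameter is the second argument of the UTF8Encoding constructor:

// throwOnInvalidBytes: true => DecoderFallbackException on malformed input
// instead of silent substitution characters.
var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);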
.NET is not very helpful, but you can try the following algorithm:
Here is the call:
var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
Here is the code:
public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null; // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE (Encoding.UTF32 is little endian)
        return null; // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {
                    textReader.ReadLine(); // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception ex) { }

        return null; // parsing with this encoding failed
    }
}
The solution proposed by @nonoandy is really interesting; I have successfully tested it and it seems to be working perfectly.

The nuget package needed is Microsoft.ProgramSynthesis.Detection (version 8.17.0 at the moment).

I suggest using EncodingTypeUtils.GetDotNetName instead of a switch for getting the Encoding instance:
using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;

...

public Encoding? DetectEncoding(Stream stream)
{
    try
    {
        if (stream.CanSeek)
        {
            // Read from the beginning if possible
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Detect encoding type (enum)
        var encodingType = EncodingIdentifier.IdentifyEncoding(stream);

        // Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
        var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);

        if (!string.IsNullOrEmpty(encodingDotNetName))
        {
            return Encoding.GetEncoding(encodingDotNetName);
        }
    }
    catch (Exception e)
    {
        // Handle exception (log, throw, etc...)
    }

    // In case of error return null or a default value
    return null;
}
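Usage could look like this (a sketch; the path and the UTF-8 fallback are my own choices):

using (var stream = File.OpenRead(@"C:\Temp\sample.txt")) // placeholder path
{
    var encoding = DetectEncoding(stream) ?? Encoding.UTF8; // pick your own fallback
    stream.Seek(0, SeekOrigin.Begin); // rewind before the real read
    using (var reader = new StreamReader(stream, encoding))
    {
        var content = reader.ReadToEnd();
    }
}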
Look here for C#:
https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx
string path = @"path\to\your\file.ext";
using (StreamReader sr = new StreamReader(path, true))
{
while (sr.Peek() >= 0)
{
Console.Write((char)sr.Read());
}
//Test for the encoding after reading, or at least
//after the first read.
Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
Console.ReadLine();
Console.WriteLine();
}
The following is my PowerShell code to determine whether some .cpp, .h, or .ml files are encoded in ISO-8859-1 (Latin-1) or in UTF-8 without BOM; if neither, it assumes GB18030. I am Chinese working in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.

The approach is simple: if all characters are within x00-x7E, then ASCII, UTF-8 and Latin-1 are all the same; but if I read a non-ASCII file as UTF-8, the special character � will show up, so I then try reading it as Latin-1. In Latin-1, the range \x7F to \xAF is empty, while GB uses the full range x00-xFF, so if I find any character between those two, the file is not Latin-1.

The code is written in PowerShell, but uses .NET, so it is easy to translate into C# or F#.
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h,*.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c = $regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c = $regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        }
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
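For reference, a C# sketch of the same detection logic (detection only, no file rewriting; the method name GuessEncoding is my own):

using System;
using System.IO;
using System.Linq;
using System.Text;

public static class EncodingGuesser
{
    // Same idea as the script above: decode as UTF-8 and look for '�';
    // if found, decide between Latin-1 and GB18030 by scanning \x7F-\xAF.
    public static string GuessEncoding(string path)
    {
        string utf8Text = File.ReadAllText(path, Encoding.UTF8);
        if (utf8Text.IndexOf('\uFFFD') < 0)
            return "UTF-8";

        string latin1Text = File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
        bool hasBytesInGap = latin1Text.Any(ch => ch >= '\x7F' && ch <= '\xAF');
        return hasBytesInGap ? "GB18030" : "ISO-8859-1";
    }
}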
This seems to work well.
First create a helper method:
private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
{
    try
    {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
    }
    catch (Exception e)
    {
        return null;
    }
}
Then create code to test the source. In this case, I've got a byte array I need to get the encoding of:
public static Encoding DetectCodePage(byte[] contents)
{
    if (contents == null || contents.Length == 0)
    {
        return Encoding.Default;
    }

    return TestCodePage(Encoding.UTF8, contents)
           ?? TestCodePage(Encoding.Unicode, contents)
           ?? TestCodePage(Encoding.BigEndianUnicode, contents)
           ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
           ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
           ?? TestCodePage(Encoding.ASCII, contents)
           ?? TestCodePage(Encoding.Default, contents); // likely Unicode
}
I have tried a few different ways to detect encoding and hit issues with most of them.
I made the following leveraging a Microsoft NuGet package, and it seems to work for me so far, but it needs a lot more testing.
Most of my testing has been on UTF8, UTF8 with BOM and ANSI.
static void Main(string[] args)
{
    var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
    List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();
    int i = 0;
    foreach (var line in contents)
    {
        i++;
        Console.WriteLine(line);
        if (i > 100)
            break;
    }
}

public static Encoding GetEncoding(string filename)
{
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
        switch (detectedEncoding)
        {
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
                return Encoding.UTF8;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
                return Encoding.BigEndianUnicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
                return Encoding.Unicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
                return Encoding.UTF32;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
                return Encoding.ASCII;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
            default:
                return Encoding.Default;
        }
    }
}
It may be useful:
string path = @"address/to/the/file.extension";
using (StreamReader sr = new StreamReader(path))
{
Console.WriteLine(sr.CurrentEncoding);
}