Encoding problems with dBase III .dbf files on different machines

I'm using C# and .NET 3.5, trying to import some data from old DBF files over ODBC with the Microsoft dBase Driver.

The DBF files are in dBase III format and use IBM850 (code page 850) encoding for strings.

When I run my program on my machine, all string data read from the OdbcDataReader comes out correctly converted to Unicode. I save it as UTF-8 and everything is fine. But when I run the same program on an XP box, some characters aren't converted correctly, 'Õ' for example. There may be others too. Characters like 'Ä', 'Ö' and 'Ü' are fine. My guess is that ODBC or the driver uses some machine-specific culture or code page setting for the conversion.

Is it possible to read strings from the database as binary, perhaps with functions like CONVERT or CAST? Or where could I find a reference for the SQL functions and syntax that work with this dBase driver or other drivers? I searched around and couldn't find anything. I feel so blind when using ODBC and SQL.

Right now I'm using a temporary hack that replaces all σ's with Õ's.

Thanks!

Example code:

System.Data.Odbc.OdbcConnection oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = @"Driver={Microsoft dBase Driver (*.dbf)};DriverID=277;Dbq=" + dbPath + ";";
oConn.Open();

System.Data.Odbc.OdbcCommand oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";

System.Data.Odbc.OdbcDataReader reader = oCmd.ExecuteReader();
reader.Read();

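// Re-encode the string returned by the driver as UTF-8 and dump the bytes for inspection.
byte[] buf = Encoding.UTF8.GetBytes(reader.GetString(0));
using (BinaryWriter writer = new BinaryWriter(File.Open(@"C:\DBF\Test.txt", FileMode.Create)))
{
    writer.Write(buf);
}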

Result:

E5 in dbf (Õ in 850)

Test.txt on pc1: C3 95 (Õ in UTF-8)

Test.txt on pc2: CF 83 (σ in UTF-8)

Thirzi answered 2/10, 2010 at 12:46 Comment(0)

If you are still having a problem with these files, I may be able to help you.

What is in the "codepage byte" aka "language driver id" (LDID) at offset 29 (decimal) in the file?

I have a Python-based DBF reader which can read just about any field data type and just about any codepage. It has a long list, compiled from various sources, of mappings from codepage byte to codepage number. The options are: (1) believe the LDID and deliver Unicode; (2) ignore the LDID and deliver undecoded bytes; (3) override the LDID and decode with a specific codepage into Unicode. The Unicode can of course then be encoded into UTF-8.
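The following is not that reader, only a minimal Python sketch of the three options under stated assumptions. `read_ldid`, `decode_field` and the three-entry `LDID_TO_CODEC` table are hypothetical names for illustration, the table is a tiny excerpt of the commonly cited LDID values, and the header is synthetic rather than taken from a real file.

```python
import io

# Assumption: a deliberately tiny excerpt of the commonly cited LDID table.
# A real reader carries a much longer mapping.
LDID_TO_CODEC = {0x01: "cp437", 0x02: "cp850", 0x03: "cp1252"}


def read_ldid(stream):
    """Return the language driver ID: the single byte at offset 29 of a DBF header."""
    stream.seek(29)
    data = stream.read(1)
    if len(data) != 1:
        raise ValueError("too short to be a DBF file")
    return data[0]


def decode_field(raw, ldid, override=None, keep_bytes=False):
    """Hypothetical helper mirroring the three options.

    keep_bytes=True   -> ignore the LDID, return the undecoded bytes
    override="cpNNN"  -> ignore the LDID, decode with that codec
    otherwise         -> believe the LDID
    """
    if keep_bytes:
        return raw
    codec = override or LDID_TO_CODEC.get(ldid)
    if codec is None:
        raise LookupError(f"no codec known for LDID 0x{ldid:02X}")
    return raw.decode(codec)


# Synthetic 32-byte header: version byte 0x03, LDID 0x01 (cp437 in the table above).
header = bytearray(32)
header[0] = 0x03
header[29] = 0x01
ldid = read_ldid(io.BytesIO(bytes(header)))

field = b"\xe5"  # the byte discussed in the question
print(repr(decode_field(field, ldid)))                    # option 1: believe the LDID
print(repr(decode_field(field, ldid, keep_bytes=True)))   # option 2: raw bytes
print(repr(decode_field(field, ldid, override="cp850")))  # option 3: override
```

With a real file, `open(path, "rb")` would replace the `io.BytesIO` wrapper. An LDID missing from the table raises `LookupError` here, which is a reasonable cue to fall back to option 3.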

The DBF reader also does a whole lot of reasonableness cross-checks, which may help in investigating why VFP thinks the file is corrupt.

How do you know that it's using IBM850? Another piece of Python code that I have is a prototype encoding detector which, unlike detectors such as 'chardet' that are derived from Mozilla code, is not web-centric and can happily recognise most old DOS codepages. This may help.

An observation: the Greek lowercase letter sigma (σ) is 0xE5 in codepage 437, which was superseded by codepage 850, so "pc2" appears to be decoding with the older codepage and seems a little outdated ...
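That observation can be checked with Python's built-in `cp437` and `cp850` codecs. The sketch below decodes the byte from the question both ways and prints the resulting UTF-8 bytes; `utf8_hex` is a hypothetical helper introduced only for display.

```python
def utf8_hex(text):
    """Format the UTF-8 encoding of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


raw = b"\xe5"  # the byte stored in the DBF, per the question

for codec in ("cp850", "cp437"):
    text = raw.decode(codec)
    print(codec, text, utf8_hex(text))
# cp850 Õ C3 95
# cp437 σ CF 83

# Ä, Ö and Ü sit at the same byte values in both codepages,
# which would explain why they survive on both machines.
shared = {ch: ch.encode("cp850") == ch.encode("cp437") for ch in "ÄÖÜ"}
print(shared)
```

The two printed byte sequences match the Test.txt contents reported for pc1 and pc2 in the question.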

If you think I can be of any help, feel free to e-mail me at insert_punctuation("sjmachin", "lexicon", "net")

Fanfaron answered 12/11, 2010 at 7:27 Comment(4)
Hi, I'm also having a problem reading a dBase file. It works fine when reading on my Swedish Windows client but messes up characters when run on an English OS. Are you still offering assistance? – Erastes
@Andreas: email me. What is the LDID of the file? What are you reading it with? "messes up characters" doesn't help. Show repr(expected characters), repr(actual characters). If possible, send me your code and your file. – Fanfaron
I sent you an email, sjmachin at lexicon dot net. – Erastes
Thanks, 850 is the right number. I can open the file with the correct encoding in LibreOffice Calc, using "Europe occidentale (DOS/OS2-850/International)". – Hendren

Try this code.

var oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = "Driver={Microsoft Visual FoxPro Driver};SourceType=DBF;SourceDB=" + dbPath;
oConn.Open();
var oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";
var reader = oCmd.ExecuteReader();
reader.Read(); 
// Recover the raw bytes, assuming the driver decoded them with the system ANSI code page.
byte[] rawBytes = Encoding.Default.GetBytes(reader.GetString(0));
// Decode those bytes with the code page the file was actually written in (850).
string p = Encoding.GetEncoding(850).GetString(rawBytes);
Caster answered 3/5, 2011 at 6:36 Comment(0)

When you read a dbf file, you need to take three encodings into account:

1. The encoding in which the database provider reads the file. It depends on the provider and the current operating system. Use this encoding to get back the raw byte array. For example, on my PC:

  • when I use the connection string "Data Source={0}; Provider=Microsoft.JET.OLEDB.4.0;Extended Properties=DBase IV;User ID=;Password=;", strings are read using code page 866 (Russian MS-DOS)

  • when I use the connection string "Data Source={0}; Provider=vfpoledb.1;Exclusive=No;Collating Sequence=Machine", strings are read using Encoding.Default (code page 1251)

2. The encoding in which the strings were written to the dbf file. It can be read from byte 29 of the dbf file, but in practice it doesn't matter how the file is marked; you just need to know which encoding was actually used. Use this as the source encoding during string conversion.

3. The encoding to which the strings should be converted. This is usually UTF-8.

So the string conversion should look like this:

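// codePage1: the code page the provider used when it decoded the file (encoding 1).
byte[] bytes = Encoding.GetEncoding(codePage1).GetBytes(reader.GetString(0));

// codePage2: the code page the strings were actually written in (encoding 2).
// Convert transcodes the bytes to UTF-8 (encoding 3); GetString then yields a .NET string.
string result = Encoding.UTF8.GetString(Encoding.Convert(Encoding.GetEncoding(codePage2), Encoding.UTF8, bytes));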
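For quick experimentation outside .NET, the same three-step round trip can be sketched in Python. This is only an illustration: `recode` is a hypothetical helper, and the code pages are the ones mentioned in this thread (437/850 from the question, 866/1251 from the example above).

```python
def recode(text, provider_codec, actual_codec):
    """Undo a wrong decode: re-encode with the provider's codec, decode with the real one."""
    raw = text.encode(provider_codec)  # encoding 1: recover the bytes as stored in the file
    return raw.decode(actual_codec)    # encoding 2: decode with the code page actually used


# The question's case: the file is cp850, but the driver decoded it as cp437.
fixed = recode("σ", "cp437", "cp850")
utf8_bytes = fixed.encode("utf-8")     # encoding 3: the target encoding
print(fixed, utf8_bytes)

# A Cyrillic case: text written as cp866 but decoded as cp1251.
original = "Привет"
garbled = original.encode("cp866").decode("cp1251")
print(recode(garbled, "cp1251", "cp866") == original)
```

The round trip only works when the wrong decode was lossless. If the provider's code page has no mapping for some byte, that byte is already gone by the time the string reaches your code, and the re-encode step cannot bring it back.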
Phyllis answered 2/11, 2016 at 8:57 Comment(0)

Have you tried using the Visual FoxPro "VFPOleDb" driver instead?

Store answered 4/10, 2010 at 11:5 Comment(1)
Yes, I have. The FoxPro driver didn't like my database. It tells me the file is corrupt, but everything looked fine when I opened it in a hex editor and compared it to the file format specs. – Thirzi
