How to read EBCDIC data with a non-standard code page without messing up the numbers?

T

3

Here is one for the old(er) hands :-)

I'm reading a binary dump from a mainframe DB2 table. The table has varchar, char, smallint, integer and float columns. To make it interesting, the DB2 uses code page 424 (Hebrew). I need my code to be codepage independent.

So I open the file with a StreamReader using System.Text.Encoding, like so:

Dim encoding As System.Text.Encoding = System.Text.Encoding.GetEncoding(20424)
Dim sr As New StreamReader(item.Key, encoding)

and proceed to read the VARCHAR and CHAR data according to their lengths into char arrays using

sr.ReadBlock(buffer, 0, iFieldBufferSize)

Always remembering the first 2 bytes in a VARCHAR column should be discarded and getting the correct string with

StringValue = encoding.GetString(encoding.GetBytes(buffer))

And all is Great!

But now I get to the SMALLINT column, and I'm in trouble. The value of the signed number is stored in 2 bytes, and because it's big-endian, I do

Dim buffer(iFieldBufferSize - 1) As Byte
buffer(1) = sr.Read ''switch the bytes around!
buffer(0) = sr.Read
Dim byteBuffer(iFieldBufferSize - 1) As Byte
Dim i16 As Int16 = BitConverter.ToUInt16(buffer, 0)

and I get wrong numbers! For example, if the bytes are 00 03, I get 0 in buffer(1) and 3 in buffer(0) - good. BUT when the two bytes are 00 20, I get 128 read into buffer(0)!

So after half a day of pulling my hair out, I dropped the encoding from the StreamReader declaration, and now I'm getting 32 read into buffer(0), like it should be!!!

Bottom line, the non-standard code page encoding messes up the byte readings!!!

Any idea how to get around this?

Tansey answered 24/2, 2011 at 19:12 Comment(0)

D

4

You can't read something like an EBCDIC file dump as a text stream. The StreamReader class is a type of TextReader and exists for reading characters. You're reading a record -- a complex data structure containing mixed binary and text.

You need to do the reads with a FileStream and read blocks of octets as needed. You'll need some trivial helper methods like these:

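private byte[] ReadOctets( Stream input , int size )
{
    if ( size < 0 ) throw new ArgumentOutOfRangeException() ;

    byte[] octets      = new byte[size] ;
    int    octets_read = 0 ;

    // Stream.Read() may return fewer octets than requested, so loop until the buffer is full
    while ( octets_read < size )
    {
        int n = input.Read( octets , octets_read , size - octets_read ) ;

        if ( n == 0 ) throw new InvalidDataException() ; // premature end of input

        octets_read += n ;
    }

    return octets ;
}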

public string readCharVarying( Stream input )
{
    short    size        = readShort( input ) ;

    return readCharFixed( input , size ) ;
}

public string readCharFixed( Stream input , int size )
{
    Encoding e           = System.Text.Encoding.GetEncoding(20424) ;
    byte[]   octets      = ReadOctets( input , size ) ;
    string   value       = e.GetString( octets ) ;

    return value ;
}

private short readShort( Stream input )
{
    byte[] octets            = ReadOctets(input,2) ;
    short  bigEndianValue    = BitConverter.ToInt16(octets,0) ;
    short  littleEndianValue = System.Net.IPAddress.NetworkToHostOrder( bigEndianValue ) ;

    return littleEndianValue ;
}

private int readInt( Stream input )
{
    byte[] octets            = ReadOctets(input,4) ;
    int    bigEndianValue    = BitConverter.ToInt32(octets,0) ;
    int    littleEndianValue = System.Net.IPAddress.NetworkToHostOrder( bigEndianValue ) ;

    return littleEndianValue ;
}

private long readLong( Stream input )
{
    byte[] octets            = ReadOctets(input,8) ;
    long   bigEndianValue    = BitConverter.ToInt64(octets,0) ;
    long   littleEndianValue = System.Net.IPAddress.NetworkToHostOrder( bigEndianValue ) ;

    return littleEndianValue ;
}

The IBM mainframe typically has fixed or variable length records in its file system. Fixed length is easy: you just need to know the record length and you can read all the bytes for the record in a single call to the Read() method, then convert the pieces as needed.

Variable length records are a little trickier. They start with a 4-octet record descriptor word, consisting of a 2-octet (16-bit) logical record length followed by a 2-octet (16-bit) 0 value. According to IBM's z/OS documentation, the logical record length includes the 4-octet record descriptor word itself.
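A minimal sketch of reading one such record might look like the following. The names VariableRecords, ReadFully and ReadVariableRecord are hypothetical, not part of the answer's helpers. Whether the 2-octet length counts the descriptor word itself is worth verifying against your own data, so the sketch takes that as a flag rather than assuming it.

```csharp
using System;
using System.IO;

static class VariableRecords
{
    // Reads exactly 'count' bytes or throws; Stream.Read() may return fewer than requested.
    static byte[] ReadFully(Stream input, int count)
    {
        byte[] buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int n = input.Read(buffer, total, count - total);
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
        return buffer;
    }

    // Returns the data portion of the next record, without its descriptor word.
    // lengthIncludesPrefix is an assumption to check against real data.
    public static byte[] ReadVariableRecord(Stream input, bool lengthIncludesPrefix)
    {
        byte[] rdw = ReadFully(input, 4);
        int length = (rdw[0] << 8) | rdw[1];   // big-endian 16-bit length
        int dataLength = lengthIncludesPrefix ? length - 4 : length;
        if (dataLength < 0) throw new InvalidDataException("Bad record length.");
        return ReadFully(input, dataLength);
    }
}
```

The returned byte[] can then be wrapped in a MemoryStream and passed to the field-level helpers above.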

You might also see variable, spanned records. These are similar to variable length records, except that the 4-octet prefix is a segment descriptor word. The first 2 octets contain the segment length, the next octet identifies the segment type and the last octet is NUL (0x00). The segment type occupies the two low-order bits of the third octet. Segment types are as follows:

0x00 (binary 00) indicates a complete logical record
0x01 (binary 01) indicates that this is the first segment of a spanned record
0x02 (binary 10) indicates that this is the last segment of a spanned record
0x03 (binary 11) indicates that this is an "internal" segment of a spanned record, that is, a "Segment of a multisegment record other than the first or last segment."

You can treat variable length and variable spanned records as identical. To read one of these, you first need to parse out the segment/record/descriptor word and read/assemble the complete record into a byte[] from its constituent segment(s), then do whatever needs to be done to convert that byte[] into a form that you can use.

Dody answered 24/2, 2011 at 20:52 Comment(7)

Nicholas, amazingly helpful! Can you be so kind as to add a helper method for FLOAT? I have several FLOAT(53) columns. – Tansey 24/2, 2011 at 21:21

Float is rather difficult. IBM mainframes don't use IEEE 754. They use a base-16 floating point format that predates IEEE 754. Microsoft has a KB article with some code at support.microsoft.com/kb/235856. Also look at IBM's Principles of Operation. You can get an older version from hack.org/mc/texts/principles-of-operation.pdf and the current versions from IBM at www-01.ibm.com/support/… (but you'll need to register with IBM). – Dody 24/2, 2011 at 22:20

Nicholas, I will look these up tomorrow... Thank you very much! – Tansey 24/2, 2011 at 22:49

One point about your implementation of readCharVarying. The way you have it, if the column width is bigger than the number of bytes actually used, the reader will be left at the wrong position. So I added an additional call to ReadOctets(ColumnWidth-size-2). – Tansey 24/2, 2011 at 22:52

So they're dumping varchar fields as fixed width, with a length prefix? Sheesh. One other thing: since 1998, IBM mainframes have had IEEE 754 support available as well: you need to know which flavor of floating point you've got. Even if it's IEEE floats, though, the byte order will still be network byte order (big-endian), so it will need to be reversed. – Dody 24/2, 2011 at 23:10
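For the FLOAT(53) columns discussed in these comments, a rough sketch of both conversions could look like this. The class and method names are hypothetical. It assumes the 8-byte "long" hexadecimal format (1 sign bit, a 7-bit excess-64 base-16 exponent, a 56-bit fraction) and big-endian byte order. This is not the code from the KB article mentioned above.

```csharp
using System;

static class MainframeFloat
{
    // Converts an 8-byte IBM hexadecimal floating-point value (big-endian) to a double.
    // A double holds 53 significant bits, so the lowest bits of the 56-bit fraction may be rounded.
    public static double FromIbmHex(byte[] b)
    {
        if (b == null || b.Length != 8) throw new ArgumentException("Expected 8 bytes.", "b");

        bool negative   = (b[0] & 0x80) != 0;
        int  exponent16 = (b[0] & 0x7F) - 64;     // excess-64, power of 16

        ulong fraction = 0;
        for (int i = 1; i < 8; i++) fraction = (fraction << 8) | b[i];

        // value = (fraction / 2^56) * 16^exponent16 = fraction * 2^(4 * exponent16 - 56)
        double value = fraction * Math.Pow(2.0, 4 * exponent16 - 56);
        return negative ? -value : value;
    }

    // Converts an 8-byte IEEE 754 double stored in big-endian order.
    public static double FromIeeeBigEndian(byte[] b)
    {
        if (b == null || b.Length != 8) throw new ArgumentException("Expected 8 bytes.", "b");

        long bits = 0;
        for (int i = 0; i < 8; i++) bits = (bits << 8) | b[i];
        return BitConverter.Int64BitsToDouble(bits);
    }
}
```

In the hexadecimal format, the bytes 41 10 00 00 00 00 00 00 encode 1.0 and C2 76 A0 00 00 00 00 00 encode -118.625, which makes them convenient test inputs for telling the two formats apart.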

Nicholas, I just posted a follow-on question at https://mcmap.net/q/1329662/-reading-db2-clob-from-binary-download/149769. I wonder if you can take a look. Thanks! – Tansey 9/5, 2012 at 13:7

T

4

Do not use a StreamReader to read this file. It is going to interpret the binary numbers in the file as though they are characters and that will mess up their value. Use a FileStream and a BinaryReader. Only use Encoding.GetString() when you are translating a group of bytes from the file that represents a string.
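A small sketch of the difference, using the two bytes from the question (00 20, a big-endian SMALLINT with value 32). The class name is hypothetical. On .NET Core and later, code page 20424 is assumed to need the System.Text.Encoding.CodePages provider registered first. On .NET Framework it is built in.

```csharp
using System;
using System.IO;
using System.Text;

class StreamReaderVersusBinaryReader
{
    static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ uncomment the next line
        // (requires the System.Text.Encoding.CodePages package on older targets).
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ebcdic = Encoding.GetEncoding(20424);

        byte[] raw = { 0x00, 0x20 };   // big-endian SMALLINT 32

        // StreamReader decodes each byte into a character, so Read() returns
        // Unicode code points rather than the raw byte values.
        // The question reports 0 and 128 for these two bytes.
        using (var sr = new StreamReader(new MemoryStream(raw), ebcdic))
        {
            int first  = sr.Read();
            int second = sr.Read();
            Console.WriteLine("{0} {1}", first, second);
        }

        // BinaryReader hands back the bytes untouched.
        using (var br = new BinaryReader(new MemoryStream(raw)))
        {
            byte[] b = br.ReadBytes(2);
            short value = (short)((b[0] << 8) | b[1]);   // assemble big-endian by hand
            Console.WriteLine(value);                     // prints 32
        }
    }
}
```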

Termless answered 24/2, 2011 at 19:52 Comment(1)

Thanks! Pointed me in the right direction. I didn't dream that a question about EBCDIC files would be answered within minutes of posting! – Tansey 24/2, 2011 at 21:22

A

3

@Hans Passant is correct. If you are reading a file that contains binary data (as your description indicates), then it is incorrect to read the file as though it were text.

Fortunately, the BinaryReader class includes a constructor that takes a character encoding as one of the parameters. You may use this to automatically convert any Hebrew EBCDIC strings in the file to ordinary Unicode strings without affecting the interpretation of the non-text (binary) portion.

Also, you should probably use the two-byte VARCHAR length field to read your strings instead of just throwing it away!

The ReadString() method will not work in this case, since the file was not encoded with the .NET BinaryWriter class. Instead you should get the length of the VARCHAR (or the hard-coded length of the CHAR field) and pass that to the ReadChars(int) method. Then construct your resulting string from the character array that is returned.
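The approach described above might be sketched as follows. EbcdicFields, ReadChar and ReadVarChar are hypothetical names. The sketch assumes a single-byte code page, so one character corresponds to one byte in the file, and a big-endian 2-byte VARCHAR length. The length is assembled by hand because BinaryReader.ReadInt16() reads little-endian values.

```csharp
using System;
using System.IO;
using System.Text;

static class EbcdicFields
{
    // Reads a CHAR field of 'length' characters using the reader's encoding.
    public static string ReadChar(BinaryReader reader, int length)
    {
        char[] chars = reader.ReadChars(length);
        if (chars.Length != length) throw new EndOfStreamException();
        return new string(chars);
    }

    // Reads a VARCHAR field: a 2-byte big-endian length followed by that many characters.
    public static string ReadVarChar(BinaryReader reader)
    {
        byte[] prefix = reader.ReadBytes(2);
        if (prefix.Length != 2) throw new EndOfStreamException();
        int length = (prefix[0] << 8) | prefix[1];
        return ReadChar(reader, length);
    }
}
```

The reader would be created with the encoding passed to its constructor, for example new BinaryReader(File.OpenRead(path), Encoding.GetEncoding(20424)), where path stands for your dump file. If the dump pads VARCHAR columns to their full width, the unused bytes still have to be skipped after the string is read.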

Aphrodisiac answered 24/2, 2011 at 20:43 Comment(0)
