Parsing variable length strings of fixed column widths
Asked Answered
M

3

5

I am trying to parse a text report that is formatted into columns. Each column appears to be right justified with a fixed length. For each line, there are times that not all the columns are used. In that case it appears that spaces are used to justify each column in the line. Example input:

031   91    1221,154
043   66     312,222    1      3,047                       3,047    1.5%    .9%
040  118     529,626    1      1,842                       1,842     .8%    .3%
037   45     427,710
019   80     512,153    1     14,685                      14,685    1.2%   2.8%
009   68     520,301                      1    16,085     16,085    1.4%   3.0%
030   13     106,689                      1     1,581      1,581    7.6%   1.4%
008   54     377,593    1      7,098                       7,098    1.8%   1.8%
018   24     171,264
022   25       8,884    1        433                         433    4.0%   4.8%
035    9      42,043
041   13     112,355

The column widths appear to be as follows (in character counts including white spaces): 3,5,12,6,10,7,10,11,8,7.

What is a good way to parse this? I have tried using a regular expression to do it, but it obviously fails on the first line being read in because I am using an expression that expects the whole line to have data:

string pattern = @"^(?.{3})(?.{5})(?.{12})(?thirtyeightyninenumber>.{6})(?{10})(?.{7}(?.{10})(?.{11})(?.{8})(?.{7})";

Looking for a good way to read this into appropriate variables depending on whether that column has data or not. I feel like I need to throw a bunch of if checks in, but am hoping there is a better way I am not thinking of.

Thanks for any help.

BTW - I am reading the lines using a StreamReader and ReadLine.

Megass answered 20/9, 2012 at 14:42 Comment(0)
D
6

Don't use regular expressions for this. You know the number of columns and the widths of those columns, so just use String.Substring and String.Trim:

string field1 = line.Substring(0, 5).Trim();
string field2 = line.Substring(5, 3).Trim();
string field3 = line.Substring(12, 8).Trim();
/* etc, etc */
Dalt answered 20/9, 2012 at 14:43 Comment(2)
As with my answer, this will fail on lines where the right-most column isn't populated, as the indexes will be out of range. However, your answer will be much easier to adapt to overcome this, I expect : )Beefy
I don't know who down voted because I was considering the Substring method also. Rawling gave me something new to consider (and learn), but with just adding padding to the input line, I suppose Substring would be the easiest solution (until I learn Rawlings method maybe).Megass
B
8

There is a TextFieldParser available that is specifically meant for reading fixed-width/delimited text files like this.

It's in the Microsoft.VisualBasic.FileIO namespace but you should can still call it from C#.

Add a reference to Microsoft.VisualBasic, a using Microsoft.VisualBasic.FileIO;, then the code looks like this:

TextFieldParser parser = new TextFieldParser(stream);
parser.TextFieldType = FieldType.FixedWidth;
parser.SetFieldWidths(3, 5, 12, 6, 10, 7, 10, 11, 8, 7);
while (!parser.EndOfData)
{
    //Processing row
    string[] fields = parser.ReadFields();

    // Treat each field appropriately e.g. int.TryParse,
    // remove the "%" then float.TryParse etc.
}
parser.Close();

Edit: That said, looking in Reflector, I think this fails if your shortened lines don't have a full width worth of spaces. I'm not sure how to suggest you fix this; you could pre-process your stream to insert any missing spaces per line?

Beefy answered 20/9, 2012 at 14:48 Comment(2)
I should mention that I tested this and this method is VERY slow on large files. File.ReadLines() worked faster and then I used Substring to get the values.Shoelace
To add one more authored package: github.com/borisdj/FixedWidthParserWriterOstler
D
6

Don't use regular expressions for this. You know the number of columns and the widths of those columns, so just use String.Substring and String.Trim:

string field1 = line.Substring(0, 5).Trim();
string field2 = line.Substring(5, 3).Trim();
string field3 = line.Substring(12, 8).Trim();
/* etc, etc */
Dalt answered 20/9, 2012 at 14:43 Comment(2)
As with my answer, this will fail on lines where the right-most column isn't populated, as the indexes will be out of range. However, your answer will be much easier to adapt to overcome this, I expect : )Beefy
I don't know who down voted because I was considering the Substring method also. Rawling gave me something new to consider (and learn), but with just adding padding to the input line, I suppose Substring would be the easiest solution (until I learn Rawlings method maybe).Megass
R
-1

Just check your regex (or whatever else) against myDataString + new string(' ', 1000);

Ruffina answered 27/10, 2014 at 16:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.