F# Read Fixed Width Text File

Hi, I'm looking for the best way to read a fixed-width text file in F#. The file is plain text, from one to a couple of thousand lines long and around 1000 characters wide. Each line contains around 50 fields of varying lengths. My initial thought was to have something like the following:

type MyRecord = {
    Name : string
    Address : string
    Postcode : string
    Tel : string
}

let format = [
    (0,10)
    (10,40)
    (50,7)
    (57,20)
]

and read each line one by one, assigning each field according to its format tuple (where the first item is the start character and the second is the number of characters wide).

Any pointers would be appreciated.

Consequent answered 16/9, 2015 at 21:8 Comment(1)
Parsing complex formats probably requires a special tool, like a parser combinator. Take a look at FParsec; there are plenty of questions here on the subject. The biggest advantage of this approach is that you can define (and debug) individual parsers separately and then chain them to process complex inputs. – Steam

The hardest part is probably to split a single line according to the column format. It can be done something like this:

let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))

This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list. This corresponds exactly to your format list. The line argument is a string, and the function returns a string list.

You can map a list of lines like this:

let result = lines |> List.map (splitLine format)

You can also use Seq.map or Array.map, depending on how lines is defined. Such a result will be a string list list, and you can now map over such a list to produce a MyRecord list.
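As a rough sketch of that last step, here is one way each string list could be turned into a MyRecord. The record type comes from the question; toRecord and sampleLine are hypothetical names, trimming the padding is an assumption, and the address column is taken to be 40 characters wide so that the columns do not overlap.

```fsharp
// Record type from the question.
type MyRecord = {
    Name : string
    Address : string
    Postcode : string
    Tel : string
}

// (start, width) pairs; the address width of 40 is an assumption.
let format = [ (0, 10); (10, 40); (50, 7); (57, 20) ]

let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))

// Hypothetical helper: turns one row of split fields into a record.
// Returns None when the row does not have exactly four fields.
let toRecord (fields : string list) =
    match fields |> List.map (fun f -> f.Trim()) with
    | [ name; address; postcode; tel ] ->
        Some { Name = name; Address = address; Postcode = postcode; Tel = tel }
    | _ -> None

// Build a 77-character sample line by padding each value to its column width.
let sampleLine =
    [ "Pat", 10; "Farringdon Road", 40; "EC1A1BB", 7; "02079460813", 20 ]
    |> List.map (fun (value : string, width : int) -> value.PadRight(width))
    |> String.concat ""

let records =
    [ sampleLine ]
    |> List.map (splitLine format)
    |> List.choose toRecord
```

With 50 fields the list pattern becomes long, so indexing into an array may read better in practice.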

You can use File.ReadLines to get a lazily evaluated sequence of strings from a file.
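A minimal sketch of how that could fit together, assuming the splitLine function from above. The readFixedWidth name, the three-column layout, and the temp file are made up for illustration only.

```fsharp
open System.IO

let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))

// Hypothetical helper: lazily reads a file and splits every line.
// Nothing is read until the resulting sequence is enumerated.
let readFixedWidth format (path : string) =
    File.ReadLines(path)
    |> Seq.map (splitLine format)

// Usage sketch: a throwaway temp file stands in for the real data file.
let path = Path.GetTempFileName()
File.WriteAllLines(path, [| "AAAABBCCC"; "DDDDEEFFF" |])

// Seq.toList forces the read before the file is deleted.
let rows = readFixedWidth [ (0, 4); (4, 2); (6, 3) ] path |> Seq.toList
File.Delete(path)
```

Because the sequence is lazy, it should be consumed while the file still exists and is accessible.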

Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
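One possible way to add the boundary check is to verify every (index, length) pair against the line length before calling Substring. This is only a sketch; trySplitLine is a hypothetical name, and returning an option rather than an error message is a design choice you may want to change.

```fsharp
// Hypothetical safe variant of splitLine: returns None when any field
// would fall outside the line instead of throwing an exception.
let trySplitLine format (line : string) =
    let inBounds (index, length) =
        index >= 0 && length >= 0 && index + length <= line.Length
    if format |> List.forall inBounds then
        Some (format |> List.map (fun (index, length) -> line.Substring(index, length)))
    else
        None

let ok = trySplitLine [ (0, 2); (2, 3) ] "abcde"        // Some ["ab"; "cde"]
let tooShort = trySplitLine [ (0, 2); (2, 3) ] "abc"    // None
```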

Chelsiechelsy answered 16/9, 2015 at 21:33 Comment(0)

Here's a solution with a focus on custom validation and error handling for each field. This might be overkill for a data file consisting of just numeric data!

First, for these kinds of things, I like to use the parser in Microsoft.VisualBasic.dll as it's already available without using NuGet.

For each row, we can return the array of fields and the line number (for error reporting):

#r "Microsoft.VisualBasic.dll"

// for each row, return the line number and the fields
let parserReadAllFields fieldWidths textReader =
    let parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader=textReader)
    parser.SetFieldWidths fieldWidths 
    parser.TextFieldType <- Microsoft.VisualBasic.FileIO.FieldType.FixedWidth
    seq {while not parser.EndOfData do 
           yield parser.LineNumber,parser.ReadFields() }

Next, we need a little error handling library (see http://fsharpforfunandprofit.com/rop/ for more)

type Result<'a> = 
    | Success of 'a
    | Failure of string list

module Result =

    let succeedR x = 
        Success x

    let failR err = 
        Failure [err]

    let mapR f xR = 
        match xR with
        | Success a -> Success (f a)
        | Failure errs -> Failure errs 

    let applyR fR xR = 
        match fR,xR with
        | Success f,Success x -> Success (f x)
        | Failure errs,Success _ -> Failure errs 
        | Success _,Failure errs -> Failure errs 
        | Failure errs1, Failure errs2 -> Failure (errs1 @ errs2) 

Then define your domain model. In this case, it is the record type with a field for each field in the file.

type MyRecord = 
    {id:int; name:string; description:string}

And then you can define your domain-specific parsing code. For each field I have created a validation function (validateId, validateName, etc). Fields that don't need validation can pass through the raw data (validateDescription).

In fieldsToRecord the various fields are combined using applicative style (<!> and <*>). For more on this, see http://fsharpforfunandprofit.com/posts/elevated-world-3/#validation.

Finally, readRecords maps each input row to a record Result and keeps only the successful ones. The failed ones are written to a log in handleResult.

module MyFileParser = 
    open Result

    let createRecord id name description =
        {id=id; name=name; description=description}

    let validateId (lineNo:int64) (fields:string[]) = 
        let rawId = fields.[0]
        match System.Int32.TryParse(rawId) with
        | true, id -> succeedR id
        | false, _ -> failR (sprintf "[%i] Can't parse id '%s'" lineNo rawId)

    let validateName (lineNo:int64) (fields:string[]) = 
        let rawName = fields.[1]
        if System.String.IsNullOrWhiteSpace rawName then
            failR (sprintf "[%i] Name cannot be blank" lineNo )
        else
            succeedR rawName

    let validateDescription (lineNo:int64) (fields:string[]) = 
        let rawDescription = fields.[2]
        succeedR rawDescription // no validation

    let fieldsToRecord (lineNo,fields) =
        let (<!>) = mapR    
        let (<*>) = applyR
        let validatedId = validateId lineNo fields
        let validatedName = validateName lineNo fields
        let validatedDescription = validateDescription lineNo fields
        createRecord <!> validatedId <*> validatedName <*> validatedDescription 

    /// print any errors and only return good results
    let handleResult result = 
        match result with
        | Success record -> Some record 
        | Failure errs -> printfn "ERRORS %A" errs; None

    /// return a sequence of records
    let readRecords parserOutput = 
        parserOutput 
        |> Seq.map fieldsToRecord 
        |> Seq.choose handleResult 

Here's an example of the parsing in practice:

// Set up some sample text
let text = """01name1description1
02name2description2
xxname3badid-------
yy     badidandname
"""

// create a low-level parser
let textReader = new System.IO.StringReader(text)
let fieldWidths = [| 2; 5; 11 |]
let parserOutput = parserReadAllFields fieldWidths textReader 

// convert to records in my domain and print each record
parserOutput 
|> MyFileParser.readRecords 
|> Seq.iter (printfn "RECORD %A")

The output will look like:

RECORD {id = 1;
 name = "name1";
 description = "description";}
RECORD {id = 2;
 name = "name2";
 description = "description";}
ERRORS ["[3] Can't parse id 'xx'"]
ERRORS ["[4] Can't parse id 'yy'"; "[4] Name cannot be blank"]

By no means is this the most efficient way to parse a file (I think there are some CSV parsing libraries available on NuGet that can do validation while parsing) but it does show how you can have complete control over validation and error handling if you need it.

Abisia answered 18/9, 2015 at 14:18 Comment(0)

A record of 50 fields is a bit unwieldy, so alternative approaches that allow dynamic generation of the data structure may be preferable (e.g. System.Data.DataRow).
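A rough sketch of the DataRow idea, assuming the layout is supplied as a list of (column name, width) pairs. The readIntoTable function and the three-column layout are hypothetical; every column is stored as a trimmed string.

```fsharp
open System.Data

// Hypothetical column layout: (column name, width), in file order.
let columns = [ "Name", 16; "City", 16; "Postcode", 8 ]

// Builds a DataTable with one string column per field and one row per line.
let readIntoTable (columns : (string * int) list) (lines : seq<string>) =
    let table = new DataTable()
    for (name, _) in columns do
        table.Columns.Add(name, typeof<string>) |> ignore
    for line in lines do
        let row = table.NewRow()
        let mutable index = 0
        for (name, width) in columns do
            row.[name] <- box (line.Substring(index, width).Trim())
            index <- index + width
        table.Rows.Add(row) |> ignore
    table

// Usage sketch with a single 40-character line.
let table =
    readIntoTable columns [ "Postman Pat".PadRight(16) + "London".PadRight(16) + "EC1A 1BB" ]

let city = table.Rows.[0].["City"] :?> string   // "London"
```

The trade-off is that field access is by name at run time, so a misspelled column only fails when the code executes.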

If it has to be a record anyway, you could spare at least the manual assignment to each record field and populate it with the help of Reflection instead. This trick relies on the field order as they are defined. I am assuming that every column of fixed width represents a record field, so that start indices are implied.

open Microsoft.FSharp.Reflection

type MyRecord = {
    Name : string
    Address : string
    City : string
    Postcode : string
    Tel : string } with
    static member CreateFromFixedWidth format (line : string) =
        let fields =
            format 
            |> List.fold (fun (index, acc) length ->
                let str = line.[index .. index + length - 1].Trim()
                index + length, box str :: acc )
                (0, [])
            |> snd
            |> List.rev
            |> List.toArray
        FSharpValue.MakeRecord(
            typeof<MyRecord>,
            fields ) :?> MyRecord

Example data:

"Postman Pat     " +
"Farringdon Road " +
"London          " +
"EC1A 1BB"         +
"+44 20 7946 0813"
|> MyRecord.CreateFromFixedWidth [16; 16; 16; 8; 16]
// val it : MyRecord = {Name = "Postman Pat";
//                      Address = "Farringdon Road";
//                      City = "London";
//                      Postcode = "EC1A 1BB";
//                      Tel = "+44 20 7946 0813";}
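
Since this trick depends on the number and order of the record fields, it may be worth checking the format against the record type before parsing. A minimal sketch, where Sample, fieldNames and formatMatches are hypothetical names; FSharpType.GetRecordFields reports the fields in declaration order.

```fsharp
open Microsoft.FSharp.Reflection

// Hypothetical record standing in for the real 50-field type.
type Sample = { Name : string; City : string; Postcode : string }

// Field names in declaration order, as reported by reflection.
let fieldNames (recordType : System.Type) =
    FSharpType.GetRecordFields(recordType)
    |> Array.map (fun p -> p.Name)
    |> Array.toList

// Hypothetical guard: true when the format has one width per record field.
let formatMatches (recordType : System.Type) (format : int list) =
    List.length format = List.length (fieldNames recordType)

let names = fieldNames typeof<Sample>                 // ["Name"; "City"; "Postcode"]
let ok = formatMatches typeof<Sample> [ 16; 16; 8 ]   // true
```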
Irina answered 17/9, 2015 at 5:25 Comment(0)
