Parsing numbers in FParsec
Asked Answered
I

1

8

I've started learning FParsec. It has a very flexible way to parse numbers; I can provide a set of number formats I want to use:

type Number =
    | Numeral of int
    | Decimal of float
    | Hexadecimal of int
    | Binary of int

let numberFormat = NumberLiteralOptions.AllowFraction
                   ||| NumberLiteralOptions.AllowHexadecimal
                   ||| NumberLiteralOptions.AllowBinary

let pnumber = 
    numberLiteral numberFormat "number"
    |>> fun num -> if num.IsHexadecimal then Hexadecimal (int num.String)
                   elif num.IsBinary then Binary (int num.String)
                   elif num.IsInteger then Numeral (int num.String)
                   else Decimal (float num.String)

However, the language I'm trying to parse is a bit strange. A number could be numeral (non-negative int), decimal (non-negative float), hexadecimal (with prefix #x) or binary (with prefix #b):

numeral: 0, 2
decimal: 0.2, 2.0
hexadecimal: #xA04, #x611ff
binary: #b100, #b001

Right now I have to do parsing twice by substituting # by 0 (if necessary) to make use of pnumber:

let number: Parser<_, unit> =  
    let isDotOrDigit c = isDigit c || c = '.'
    let numOrDec = many1Satisfy2 isDigit isDotOrDigit 
    let hexOrBin = skipChar '#' >>. manyChars (letter <|> digit) |>> sprintf "0%s"
    let str = spaces >>. numOrDec <|> hexOrBin
    str |>> fun s -> match run pnumber s with
                     | Success(result, _, _)   -> result
                     | Failure(errorMsg, _, _) -> failwith errorMsg

What is a better way of parsing in this case? Or how can I alter FParsec's CharStream to be able to make conditional parsing easier?

Ihab answered 6/2, 2012 at 11:41 Comment(0)
M
12

Parsing numbers can be pretty messy if you want to generate good error messages and properly check for overflows.

The following is a simple FParsec implementation of your number parser:

let numeralOrDecimal : Parser<_, unit> =
    // note: doesn't parse a float exponent suffix
    numberLiteral NumberLiteralOptions.AllowFraction "number" 
    |>> fun num -> 
            // raises an exception on overflow
            if num.IsInteger then Numeral(int num.String)
            else Decimal(float num.String)

let hexNumber =    
    pstring "#x" >>. many1SatisfyL isHex "hex digit"
    |>> fun hexStr -> 
            // raises an exception on overflow
            Hexadecimal(System.Convert.ToInt32(hexStr, 16)) 

let binaryNumber =    
    pstring "#b" >>. many1SatisfyL (fun c -> c = '0' || c = '1') "binary digit"
    |>> fun hexStr -> 
            // raises an exception on overflow
            Binary(System.Convert.ToInt32(hexStr, 2))


let number =
    choiceL [numeralOrDecimal
             hexNumber
             binaryNumber]
            "number literal"

Generating good error messages on overflows would complicate this implementation a bit, as you would ideally also need to backtrack after the error, so that the error position ends up at the start of the number literal (see the numberLiteral docs for an example).

A simple way to gracefully handle possible overflow exception is to use a little exception handling combinator like the following:

let mayThrow (p: Parser<'t,'u>) : Parser<'t,'u> =
    fun stream ->
        let state = stream.State        
        try 
            p stream
        with e -> // catching all exceptions is somewhat dangerous
            stream.BacktrackTo(state)
            Reply(FatalError, messageError e.Message)

You could then write

let number = mayThrow (choiceL [...] "number literal")

I'm not sure what you meant to say with "alter FParsec's CharStream to be able to make conditional parsing easier", but the following sample demonstrates how you could write a low-level implementation that only uses the CharStream methods directly.

type NumberStyles = System.Globalization.NumberStyles
let invariantCulture = System.Globalization.CultureInfo.InvariantCulture

let number: Parser<Number, unit> =
  let expectedNumber = expected "number"
  let inline isBinary c = c = '0' || c = '1'
  let inline hex2int c = (int c &&& 15) + (int c >>> 6)*9

  let hexStringToInt (str: string) = // does no argument or overflow checking        
      let mutable n = 0
      for c in str do
          n <- n*16 + hex2int c
      n    

  let binStringToInt (str: string) = // does no argument or overflow checking
      let mutable n = 0
      for c in str do
          n <- n*2 + (int c - int '0')
      n

  let findIndexOfFirstNonNull (str: string) =
      let mutable i = 0
      while i < str.Length && str.[i] = '0' do
          i <- i + 1
      i

  let isHexFun = id isHex // tricks the compiler into caching the function object
  let isDigitFun = id isDigit
  let isBinaryFun = id isBinary

  fun stream ->
    let start = stream.IndexToken
    let cs = stream.Peek2()        
    match cs.Char0, cs.Char1 with
    | '#', 'x' ->
        stream.Skip(2)
        let str = stream.ReadCharsOrNewlinesWhile(isHexFun, false)
        if str.Length <> 0 then
            let i = findIndexOfFirstNonNull str
            let length = str.Length - i
            if length < 8 || (length = 8 && str.[i] <= '7') then
                Reply(Hexadecimal(hexStringToInt str))
            else
                stream.Seek(start)
                Reply(Error, messageError "hex number literal is too large for 32-bit int")
        else 
            Reply(Error, expected "hex digit")

    | '#', 'b' ->
        stream.Skip(2)
        let str = stream.ReadCharsOrNewlinesWhile(isBinaryFun, false)
        if str.Length <> 0 then
            let i = findIndexOfFirstNonNull str
            let length = str.Length - i
            if length < 32 then 
                Reply(Binary(binStringToInt str))
            else
                stream.Seek(start)
                Reply(Error, messageError "binary number literal is too large for 32-bit int")
        else 
            Reply(Error, expected "binary digit")

    | c, _ ->
        if not (isDigit c) then Reply(Error, expectedNumber)
        else
            stream.SkipCharsOrNewlinesWhile(isDigitFun) |> ignore
            if stream.Skip('.') then
                let n2 = stream.SkipCharsOrNewlinesWhile(isDigitFun)
                if n2 <> 0 then
                    // we don't parse any exponent, as in the other example
                    let mutable result = 0.
                    if System.Double.TryParse(stream.ReadFrom(start), 
                                              NumberStyles.AllowDecimalPoint,
                                              invariantCulture, 
                                              &result)
                    then Reply(Decimal(result))
                    else 
                        stream.Seek(start)
                        Reply(Error, messageError "decimal literal is larger than System.Double.MaxValue")                    
                else 
                    Reply(Error, expected "digit")
            else
               let decimalString = stream.ReadFrom(start)
               let mutable result = 0
               if System.Int32.TryParse(stream.ReadFrom(start),
                                        NumberStyles.None,
                                        invariantCulture,
                                        &result)
               then Reply(Numeral(result))
               else 
                   stream.Seek(start)
                   Reply(Error, messageError "decimal number literal is too large for 32-bit int")

While this implementation parses hex and binary numbers without the help of system methods, it eventually delegates the parsing of decimal numbers to the Int32.TryParse and Double.TryParse methods.

As I said: it's messy.

Meehan answered 6/2, 2012 at 15:16 Comment(4)
+1, thanks for the fast response, Stephan. By "alter FParsec's CharStream...", I meant a low level manipulation of CharStream. I will go for the first approach, simple and understandable. BTW, what is the cost of using combinators with labels? Does it cost a lot if I use labels everywhere in the parser?Ihab
I've just added a comment on how one could deal with overflow exceptions in the first version more gracefully. Regarding the labels: it depends. choiceL is actually faster than choice, because it doesn't have to collect the individual error messages. In general the overhead of <?> and similar combinators should be hardly measurable in non-trivial applications. And if you actually spot a performance issue in an FParsec parser, there are always ways to make it faster...Meehan
Thanks for the detailed answer. Just one minor point, skipString should be preferred to pstring in this case, right?Ihab
There's no performance difference, as both parsers have to do no work to create the result value (which in both cases is a reference type constant). Hence, it's only a question of taste.Meehan

© 2022 - 2024 — McMap. All rights reserved.