Associate AST nodes with token locations in source file in menhir
Asked Answered
N

2

8

I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typecheck and other passes in error reports generated for a user, I would like to refer to source file position where it occurred. These are not parsing errors, and they generated after parsing is completed.

A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessary clumsy. What are the established practices to do that?

Nipa answered 11/7, 2017 at 2:4 Comment(6)
Why would additional position information affect matching AST nodes? And yes, even for dynamically constructed nodes you would want some debug informationReck
instead of: match x with Node x I will have to do match x with Node (pos,x)Nipa
Use records in your constructors instead, so you can pattern match on a subset of fieldsReck
Carrying all the positions of AST nodes is not native at all. OCaml compiler does it. You should look its parser/parsetree.mli.Betjeman
I was thinking about some sort of monadic approach...Nipa
Link to @camlspotter's recommend: github.com/ocaml/ocaml/blob/trunk/parsing/parsetree.mliCancroid
V
5

You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,

type node = 
  | Typedef of typdef
  | Typeexp of typeexp
  | Literal of string
  | Constant of int
  | ...

type annotated_node = { node : node; loc : loc}

Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,

 match node with
 | {node=Typedef t} -> pp_typedef t
 | ...

Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by @Isabelle Newbie.

A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you, technically, externalize node attributes. The drawback is that you can't guarantee the totality of an attribute since finite maps are inherently non-total. Thus, preserving an invariant that, for example, all nodes have a location is harder.

Since every heap-allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap-allocated objects without wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.

Of course, by saying an address, I do not mean the physical or virtual address of an object. OCaml uses a moving GC so the actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.

Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:

let c1 = Literal "hello"
let c2 = Constant 42

Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)

module Locations = Ephemeron.K1.Make(struct
   type t = node
   let hash = Hashtbl.hash (* or your own hash if you have one *)
   let equal = (==) (* aka the physical equality *)
end)

The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,

let locations = Locations.create 1337

(* let's assume semantics actions produced different constants `c1` and `c2` which are structurally equal, e.g., *)

# let c1 = Constant 42 and c2 = Constant 42 

(* with each constant we can associate its location *)

Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"

And later, you can extract the location:

# Locations.find locs c1;;
- : string = "hello.ml:12:32"

# Locations.find locs c2;;
- : string = "hello.ml:13:56"

As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nicely and easily, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:

let normalize = function
  | Literal str -> Literal (normalize_literal str)
  | node -> node 

It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.

Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.

Concerning the "monadic approach" that you mentioned in the comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) To attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not unique in a program run but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). Although it looks like generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as purely functional, as it doesn't contain observable effects.

Visconti answered 12/7, 2017 at 15:35 Comment(2)
I don't understand why you say that we're using an object address as a key. In your definition of Locations, you use structural equality (not physical), so I don't understand what it has to do with addresses.Query
@Hirrolot, you're right, there should be physical equality here, I corrected the answer. The semantics Ephemeron's hashtable interface is a little bit obscure. The ephemeron itself always operates on physical equality, e.g., let e1 = K1.make c1 1 and e2 = K1.make c2 2 in K1.query e1 c1, K1.query e1 c2 yields Some 1, None even when c1 = c2 structurally. The table of ephemerons will still store data by address, but it will use the provided equal function to find them so we should provide physical equality here.Visconti
E
6

I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli

This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.

For example, here is a part of the representation of expressions:

type ...

and exp = { 
  eid: int; (** unique identifier *)
  enode: exp_node; (** the expression itself *)
  eloc: location; (** location of the expression. *)
}

and exp_node =
  | Const      of constant              (** Constant *)
  | Lval       of lval                  (** Lvalue *)
  | UnOp       of unop * exp * typ
  | BinOp      of binop * exp * exp * typ
...

So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.

So simple, "top-level" matches on syntax are very easy:

let rec is_const_expr e =
  match e.enode with
  | Const _ -> true
  | Lval _ -> false
  | UnOp (_op, e', _typ) -> is_const_expr e'
  | BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r

To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:

let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
  | _ -> e

For comparison, on a pure AST without metadata, this would be something like:

let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, UnOp (Neg, e', _), _) -> e'
  | _ -> e

I find that Frama-C's approach works well in practice.

Elodiaelodie answered 12/7, 2017 at 15:37 Comment(0)
V
5

You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,

type node = 
  | Typedef of typdef
  | Typeexp of typeexp
  | Literal of string
  | Constant of int
  | ...

type annotated_node = { node : node; loc : loc}

Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,

 match node with
 | {node=Typedef t} -> pp_typedef t
 | ...

Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or recursively, like in Frama-C example by @Isabelle Newbie.

A similar but more general approach is to extend a node not with the location, but just with a unique identifier and to use a final map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data as you, technically, externalize node attributes. The drawback is that you can't guarantee the totality of an attribute since finite maps are inherently non-total. Thus, preserving an invariant that, for example, all nodes have a location is harder.

Since every heap-allocated object already has an implicit unique identifier, the address, it is possible to attach data to the heap-allocated objects without wrapping it in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to them, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (in case if it has, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit.

Of course, by saying an address, I do not mean the physical or virtual address of an object. OCaml uses a moving GC so the actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.

Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:

let c1 = Literal "hello"
let c2 = Constant 42

Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)

module Locations = Ephemeron.K1.Make(struct
   type t = node
   let hash = Hashtbl.hash (* or your own hash if you have one *)
   let equal = (==) (* aka the physical equality *)
end)

The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,

let locations = Locations.create 1337

(* let's assume semantics actions produced different constants `c1` and `c2` which are structurally equal, e.g., *)

# let c1 = Constant 42 and c2 = Constant 42 

(* with each constant we can associate its location *)

Locations.add c1 "hello.ml:12:32"
Locations.add c2 "hello.ml:13:56"

And later, you can extract the location:

# Locations.find locs c1;;
- : string = "hello.ml:12:32"

# Locations.find locs c2;;
- : string = "hello.ml:13:56"

As you see, although the solution is nice in the sense, that it doesn't touch the node data type, so the rest of your code can pattern match on it nicely and easily, it is still a little bit dirty, as it requires global mutable state, that is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function, that normalizes all literals:

let normalize = function
  | Literal str -> Literal (normalize_literal str)
  | node -> node 

It will create a new Literal node from the original nodes, so all the location information will be lost. That means, that you need to update the location information, every time you derive one node from another.

Another issue with ephemera is that they can't survive the marshaling or serialization. I.e., if you store your AST somewhere in a file, and then you restore it, all nodes will loose their identity, and the location table will become empty.

Concerning the "monadic approach" that you mentioned in the comments. Though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) To attach something to a node we still need to extend it with an extra attribute - either a location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter though, as instead of having a global reference to the last assigned identifier, we can use a state monad, to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUID and get identifiers that are not unique in a program run but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in the sane world). Although it looks like generating the UUID doesn't use any state, underneath the hood it still uses an imperative random number generator, so it is sort of cheating, but still can seen as purely functional, as it doesn't contain observable effects.

Visconti answered 12/7, 2017 at 15:35 Comment(2)
I don't understand why you say that we're using an object address as a key. In your definition of Locations, you use structural equality (not physical), so I don't understand what it has to do with addresses.Query
@Hirrolot, you're right, there should be physical equality here, I corrected the answer. The semantics Ephemeron's hashtable interface is a little bit obscure. The ephemeron itself always operates on physical equality, e.g., let e1 = K1.make c1 1 and e2 = K1.make c2 2 in K1.query e1 c1, K1.query e1 c2 yields Some 1, None even when c1 = c2 structurally. The table of ephemerons will still store data by address, but it will use the provided equal function to find them so we should provide physical equality here.Visconti

© 2022 - 2024 — McMap. All rights reserved.