I'm trying to implement a tree-processing algorithm in Haskell, and (due to the fact this is my first Haskell program!), am struggling with the design of the data structures. Can any of the FP gurus out there lend a hand?
I'll start by describing the important characteristics of the algorithm, sketch out how I would approach this using an imperative language, and finish with the stumbling baby-steps I have made so far in Haskell.
The problem
I won't describe the full algorithm in detail, but the salient points are as follows:
- The algorithm operates on two rose trees, X and Y.
- The first phase of the algorithm computes some derived properties for each node, based on its label and attributes, and those of its descendants.
- These derived properties are used to compute a partial mapping between the two trees, such that a node in X may be associated with a node in Y and vice versa. Because the mapping is partial, any node in X or Y may either be mapped (i.e. have a partner the other tree), or alternatively may be unmapped.
- The final phase of the algorithm optimises these mappings, via a sequence of operations which inspect parent / children / siblings of mapped nodes.
The data structures must therefore have the following characteristics:
- Given a reference to a node, provide access the parent of that node, siblings of that node, and children of that node.
- Given a node in an input tree, permit annotation of that node with additional information (derived properties, and an optional reference to a node in the other tree).
Sketch of an imperative solution
If I was to implement this algorithm using an imperative language, the solution would look something like the following.
Let's assume that the starting point is the following definition of the input tree:
struct node {
// Identifier for this node, unique within the containing tree
size_t id;
// Label of this node
enum label label;
// Attributes of this node
// An attribute can be assumed to be a key-value pair
// Details of the attributes themselves aren't material to this
// discussion, so the "attribute" type is left opaque
struct attribute **attributes;
size_t n_attributes;
// Pointer to parent of this node
// NULL iff this node is root
struct node *parent;
// Pointer to first child of this node
// NULL iff this node is leaf
struct node *child;
// Double-linked list of siblings of this node
struct node *prev;
struct node *next;
};
The pointers embedded in each node clearly support the up / down / left / right traversals required by the algorithm.
Annotation can be implemented by defining the following structure:
struct algo_node {
// Pointer to input node which has been wrapped
struct node *node;
// Derived properties computed by first phase of the algorithm
// Details of the properties themselves aren't material to this
// discussion, so the "derived" type is left opaque
struct derived props;
// Pointer to corresponding node in the other tree
// NULL iff this node is unmatched
struct node *match;
};
The first phase of the algorithm constructs an algo_node
for each node
in each of the input trees.
Mapping from an algo_node
to a node
is trivial: follow the embedded *node
pointer. Mapping in the other direction can be supported by storing algo_node
s in an array, indexed by the id
of the input node.
This is of course just one possible implementation. Many variations are possible, including
- Abstracting the children linked list behind a
list
orqueue
interface, rather than storing three raw pointers - Instead of associating the input tree with the algorithm tree via indices, encode parent / child / sibling relationships directly in
struct algo_node
Moving to Haskell
Let's start with the following definition of the input tree:
data Tree = Leaf Label Attributes
| Node Label Attributes [Tree]
Augmentation of each node with an id can be achieved as follows:
data AnnotatedTree = Tree Int
addIndex :: Int -> AnnotatedTree -> (AnnotatedTree, Int)
indexedTree = addIndex 0 tree
Similarly, we can write a function which computes derived properties:
data AnnotatedTree = Tree DerivedProperties
computeDerived :: DerivedProperties -> AnnotatedTree -> (AnnotatedTree, DerivedProperties)
derivedTree = computeDerived DefaultDerived tree
The above snippets can be adjusted with little work, such that AnnotatedTree
contains both the index and the derived properties.
However, I don't know where to begin with representing mappings between the two trees. Based on some reading, I have some half-baked ideas ...
- Define
AnnotatedTree
to contain a path from the root of the other tree to the mapped node - encoded as a list of indices into each successive children list,[Integer]
- Use a zipper (of which I currently have a fairly loose understanding) to access the mapped node (or its parent / children / siblings) via the path
- Or perhaps use a lens (... of which I have an even less clear understanding!) to do the same
- Define
AnnotatedTree
to contain a reference to the mapped node directly, as aMaybe Tree
- But then I don't see a way to walk to parent / siblings of the mapped node
... but I could really do with some guidance on which (if any) of these are worth pursuing.
Any help would be much appreciated!
x
inX
has a corresponding nodey
inY
, are all the nodes inY
that correspond to descendants ofx
also descendants ofy
? – Lepore