I'm trying to implement a tree-processing algorithm in Haskell, and (due to the fact this is my first Haskell program!), am struggling with the design of the data structures. Can any of the FP gurus out there lend a hand?

I'll start by describing the important characteristics of the algorithm, sketch out how I would approach this using an imperative language, and finish with the stumbling baby-steps I have made so far in Haskell.

The problem

I won't describe the full algorithm in detail, but the salient points are as follows:

The algorithm operates on two rose trees, X and Y.
The first phase of the algorithm computes some derived properties for each node, based on its label and attributes, and those of its descendants.
These derived properties are used to compute a partial mapping between the two trees, such that a node in X may be associated with a node in Y and vice versa. Because the mapping is partial, any node in X or Y may either be mapped (i.e. have a partner the other tree), or alternatively may be unmapped.
The final phase of the algorithm optimises these mappings, via a sequence of operations which inspect parent / children / siblings of mapped nodes.

The data structures must therefore have the following characteristics:

Given a reference to a node, provide access the parent of that node, siblings of that node, and children of that node.
Given a node in an input tree, permit annotation of that node with additional information (derived properties, and an optional reference to a node in the other tree).

Sketch of an imperative solution

If I was to implement this algorithm using an imperative language, the solution would look something like the following.

Let's assume that the starting point is the following definition of the input tree:

struct node {
    // Identifier for this node, unique within the containing tree
    size_t id;

    // Label of this node
    enum label label;

    // Attributes of this node
    // An attribute can be assumed to be a key-value pair
    // Details of the attributes themselves aren't material to this
    // discussion, so the "attribute" type is left opaque
    struct attribute **attributes;
    size_t n_attributes;

    // Pointer to parent of this node
    // NULL iff this node is root
    struct node *parent;

    // Pointer to first child of this node
    // NULL iff this node is leaf
    struct node *child;

    // Double-linked list of siblings of this node
    struct node *prev;
    struct node *next;
};

The pointers embedded in each node clearly support the up / down / left / right traversals required by the algorithm.

Annotation can be implemented by defining the following structure:

struct algo_node {
    // Pointer to input node which has been wrapped
    struct node *node;

    // Derived properties computed by first phase of the algorithm
    // Details of the properties themselves aren't material to this
    // discussion, so the "derived" type is left opaque
    struct derived props;

    // Pointer to corresponding node in the other tree
    // NULL iff this node is unmatched
    struct node *match;
};

The first phase of the algorithm constructs an algo_node for each node in each of the input trees.

Mapping from an algo_node to a node is trivial: follow the embedded *node pointer. Mapping in the other direction can be supported by storing algo_nodes in an array, indexed by the id of the input node.

This is of course just one possible implementation. Many variations are possible, including

Abstracting the children linked list behind a list or queue interface, rather than storing three raw pointers
Instead of associating the input tree with the algorithm tree via indices, encode parent / child / sibling relationships directly in struct algo_node

Moving to Haskell

Let's start with the following definition of the input tree:

data Tree = Leaf Label Attributes
          | Node Label Attributes [Tree]

Augmentation of each node with an id can be achieved as follows:

data AnnotatedTree = Tree Int

addIndex :: Int -> AnnotatedTree -> (AnnotatedTree, Int)

indexedTree = addIndex 0 tree

Similarly, we can write a function which computes derived properties:

data AnnotatedTree = Tree DerivedProperties

computeDerived :: DerivedProperties -> AnnotatedTree -> (AnnotatedTree, DerivedProperties)

derivedTree = computeDerived DefaultDerived tree

The above snippets can be adjusted with little work, such that AnnotatedTree contains both the index and the derived properties.

However, I don't know where to begin with representing mappings between the two trees. Based on some reading, I have some half-baked ideas ...

Define AnnotatedTree to contain a path from the root of the other tree to the mapped node - encoded as a list of indices into each successive children list, [Integer]
- Use a zipper (of which I currently have a fairly loose understanding) to access the mapped node (or its parent / children / siblings) via the path
- Or perhaps use a lens (... of which I have an even less clear understanding!) to do the same
Define AnnotatedTree to contain a reference to the mapped node directly, as a Maybe Tree
- But then I don't see a way to walk to parent / siblings of the mapped node

... but I could really do with some guidance on which (if any) of these are worth pursuing.

Any help would be much appreciated!

The problem

Sketch of an imperative solution

Moving to Haskell

Recommended topics

Hot tags