Efficiently find marked referenced records

I have

  • a few million records in a database that reference each other (a directed acyclic graph). There are direct references (A -> B) and indirect references (if A -> B and B -> C, then A -> C). Indirect references can have any recursion depth, but in reality the depth is at most 100. This is very similar to how objects in an object-oriented language can reference other objects, recursively, except that cycles are not allowed.
  • A record can have between zero and 100 direct references.
  • Each record can be marked or not (most records are not marked).

Problem

I'm looking for an efficient data structure and algorithm to find all marked referenced (directly or indirectly referenced) records, given a set of records (often just one, or up to 100). A referenced record can be directly marked (a directly referenced record is marked) or indirectly marked (an indirectly referenced record is marked).

Reading the records is relatively slow, let's say 2 milliseconds per record.

I'm not looking for faster storage or similar here. I know it is possible, but it is quite hard to keep in sync. I'm trying to add a secondary data structure that contains just the relevant data. This will speed things up quite a bit (maybe by a factor of 10 or even 100), but only brings a constant-factor improvement. I'm still interested in understanding whether it's possible to improve the algorithm as the amount of data grows.

Ideas

I have thought about the following options:

  • Brute force: One algorithm would be to search for all (directly or indirectly referenced) entries, and filter for marked entries. But that is slow, obviously, as I have to process all (directly or indirectly) referenced entries. Maybe none are marked, but 20'000 are referenced.

  • Shadow mark: Another algorithm would be to have a reverse index (which entries are referencing which other entries), and then, each time an entry is marked, also "shadow-mark" all the entries that reference this entry, recursively (see the sketch after this list). That way, when searching for marked entries, we can filter for those that have the shadow-mark set. The disadvantage is that many updates are needed if an entry is marked. A related option would be using a Bloom filter for shadow marking, but this would just reduce the memory usage.

  • Meet in the middle: Let's say we maintain a "maximum depth", which is the maximum depth of the graph (the maximum number of hops from any record). Then we use the shadow-mark algorithm from above, but only partially: we propagate the shadow-mark only up to maximum-depth / 2 recursion levels. For a query, we also limit the recursion depth to maximum-depth / 2. That way, we "meet in the middle" in the worst case. (I should probably draw a picture.) A sub-problem is then how to efficiently maintain this maximum depth.
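
A minimal sketch of the shadow-mark bookkeeping could look as follows (Java; ShadowMarkIndex and all names are just illustrative, and counters are used instead of boolean flags so that unmarking can decrement without a recomputation):

import java.util.*;

class ShadowMarkIndex {
    // reverse index: record id -> ids of records that directly reference it
    private final Map<Integer, Set<Integer>> referencedBy = new HashMap<>();
    // record id -> number of paths from it to marked records
    private final Map<Integer, Integer> shadowMarks = new HashMap<>();

    void addReference(int from, int to) {
        referencedBy.computeIfAbsent(to, k -> new HashSet<>()).add(from);
    }

    // Marking bumps the counter on every record that directly or indirectly
    // references 'id' - once per path, so this can be very many updates.
    void mark(int id) { propagate(id, +1); }
    void unmark(int id) { propagate(id, -1); }

    private void propagate(int id, int delta) {
        for (int parent : referencedBy.getOrDefault(id, Collections.emptySet())) {
            shadowMarks.merge(parent, delta, Integer::sum);
            propagate(parent, delta); // the graph is acyclic, so this terminates
        }
    }

    // The query side becomes a single lookup per record.
    boolean referencesMarked(int id) {
        return shadowMarks.getOrDefault(id, 0) > 0;
    }
}

The trade-off is then explicit: referencesMarked is a single lookup, but mark and unmark touch every referencing record once per path, which is exactly the many-updates disadvantage.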

I wonder, is there something similar to this approach? Something that doesn't require many updates when marking an entry, and doesn't require too many reads when querying? Or maybe a solution that allows gradually updating entries when an entry is marked?

Example

In this example (blue means "marked"), if I search for (indirectly) referenced marked records of 5, I would like to quickly find 1 and 3.

[image: graph of a few nodes pointing to each other]

Saddleback answered 18/1, 2023 at 17:38 Comment(13)
What does "indirectly marked" mean? Is it "references a marked entry", "referenced by a marked entry" or both? Is it a transitive relationship, or just one level?Adaptable
"2 milliseconds per record." Why so slow? What DB engine are you using? Are you reading across a network? I think you should do some optimization of your DB operations before `getting fancy with one particular problem.Align
@Adaptable I have updated the question: "indirectly" means having a transitive relationship of any depth.Saddleback
@Align This is definitely a very good question! Even if all entries are in memory, it takes 9 seconds to process 16'000 records. It is amazingly slow! And you can imagine that it's much slower if records are read from MongoDB, one at a time. I'm aware this is all very weird, but it's a large and old system, and trying to optimize this part is really, really hard. I have already added two caches, which tripled the speed, but more than that will take more time. What I'm looking for is an algorithmic improvement. If I switch to, say, PostgreSQL, then it is 70 ms using one query with a CTE.Saddleback
I assume that your records contain a lot of info that is irrelevant to this problem. Why not extract just the info you need ( record id, references and markings ) to a local SSD ( using a flat file or high performance DB engine ( e.g. SQLite ) )? Then update as you go along and run the brute force algorithm as required.Align
The MongoDB detail in your comments explains much. But what version of MongoDB? Based on my past experiences, this is likely to be less a question of "what is the right way?" and more a question of "what features can I use to make this least painful?" And the features available are very much MongoDB specific.Adaptable
BTW while an individual read is slow, you can also send a query with a $in of 1000 records. This should be much faster than doing 1000 reads. And now you can do the recursive logic outside of mongo, streaming queries back and forth. Honestly if I couldn't get rid of MongoDB, this is what I would try.Adaptable
I'm aware that you can speed up things by maybe a factor of 10 or so, if you spend a lot of time... But I'm interested in speeding up things by a factor of 1000 or more. Similar to the following: yes you can speed up Bubble sort. I did that when I was young, I wrote a bi-directional optimized Bubble sort. It didn't help much. Switching to Shell sort would have made a huge difference (without adding much complexity). I'm looking for an algorithmic improvement, not a micro-optimization. BTW this is Apache Jackrabbit Oak, and MongoDB is not the only backend.Saddleback
Do you want to find marked records that are reachable from: A: a specified record; B: any other marked record; C: any record?Align
In that case I would look at jackrabbit.apache.org/oak/docs/nodestore/document/…, note that you can migrate to PostgreSQL, and call it a day. As for algorithmic improvements, I can think of many. Which ones are faster or slower is a rabbit hole that depends on your use case. I once used something vaguely like this for a permission system, and updates turned out to be O(n^5)! (Later, very carefully, improved to O(n^3). But reads were O(log(n)).)Adaptable
@Adaptable Migrating to PostgreSQL doesn't help: even when using the segment store and all data is fully in memory, it takes 9 seconds. I would be very interested in your idea used for the permission system!Saddleback
I think you should show edge directions in your example diagram. It would help clarify whether you mean "referenced from" or "referenced by". In regards to the question: unless you have some kind of extra info to guide the search - like spatial distance - I think a depth first search is provably optimal.Jost
@Jost Thanks! I'll add the arrows. Distance is constant. Yes, it might be that depth-first or breadth-first search are best. Possibly bidirectional BFS / DFS is even better.Saddleback

You could keep a table on each node that records which marked nodes are reachable from it, and keep it updated whenever a node (or edge) is added to or removed from the graph, similar to how network routing tables are kept for each node in a network. There are a couple of specifics about your problem that make it simpler than a network routing table though:

  • You don't want to know the actual path to the marked nodes from a given node, only that one (or more) exists.
  • The graph is acyclic.
  • It's not a distributed system so you have full control (obviously ...).

Because you don't care about paths and because the graph is acyclic, the table on each node can be a map marked_node_id -> count, where count is the number of paths from the given node to the given marked node. When a new node is added, the new node's table is built as the union of the tables of all nodes adjacent to the new node, summing the counts. Additionally, the tables of all nodes adjacent from the new node have to be updated by adding the new node's table to each of them, and this has to be done recursively up the adjacent-from chain. When a node is deleted you have to do the analogous updates in reverse.

Basic complexity analysis: finding all marked nodes reachable from a given node is O(1) and can be done with info stashed on a single node - which is the whole point. In general, adding and removing an edge (or a new node plus its edges) will require updating the tables of all connected nodes recursively (up to a call depth of 100 and a branching factor of up to 100). Building the tables initially would be O(number of nodes), by reverse flooding from the marked nodes.


Code Example:

This is an abstract, in-code solution, but it should translate. I'm using Python (+ GraphViz) because you didn't specify a language, it's probably the most accessible to the widest audience, and it's easy to prototype in. I'm also only going to implement the add/remove node operations (to modify a node, remove it and then re-add it with a different initialization) and build the graph from scratch, which isn't really realistic, but you can build the tables initially for an existing graph by working backwards from the marked nodes pretty easily. Also note:

  • The following requires each node to maintain an adjacent_from list in addition to an adjacent_to list, so we can recurse up the adjacent-from paths when a given node is deleted.
  • I've assumed each marked node is reachable from itself - it just makes things a bit easier to implement.

def main():
  ''' Build a test graph, then test. '''
  graph = Graph()
  a = graph.add_node('a', marked=True)
  b = graph.add_node('b', marked=True)
  c = graph.add_node('c', marked=True)
  d = graph.add_node('d', adjacent_to=[a])
  e = graph.add_node('e', adjacent_to=[d])
  f = graph.add_node('f', adjacent_to=[c])
  g = graph.add_node('g', adjacent_to=[d,f])
  h = graph.add_node('h', adjacent_to=[e,g])
  i = graph.add_node('i')
  j = graph.add_node('j', marked=True, adjacent_to=[i])
  k = graph.add_node('k', adjacent_to=[j])
  l = graph.add_node('l', adjacent_to=[k])
  m = graph.add_node('m', adjacent_to=[j])
  with open('main0.dot', 'w') as fh:  # 'fh', so the node variable f is not shadowed
    fh.write(graph.gviz())
  graph.delete_node('f')
  with open('main1.dot', 'w') as fh:
    fh.write(graph.gviz())
  graph.delete_node('e')
  with open('main2.dot', 'w') as fh:
    fh.write(graph.gviz())
  graph.delete_node('g')
  with open('main3.dot', 'w') as fh:
    fh.write(graph.gviz())
  # Render the graphviz files with: for i in *.dot; do dot -Tpng $i > "${i%%.dot}.png"; done

class Graph:
  ''' Container for nodes. '''
  def __init__(self):
    self.nodes = {}

  def add_node(self, id, marked=False, adjacent_to=None):
    assert id not in self.nodes
    self.nodes[id] = Node(id, marked, adjacent_to or [])
    return self.nodes[id]

  def delete_node(self, id):
    assert id in self.nodes
    node = self.nodes[id]
    self._recursive_subtract_table_on_delete(node, node)
    for adjacent_from_node in node.adjacent_from:
      adjacent_from_node._remove_adjacent_node(node.id)
    for adjacent_to_node in node.adjacent_to:
      adjacent_to_node.adjacent_from.remove(node)  # drop the dangling back-reference
    del self.nodes[id]

  def _recursive_subtract_table_on_delete(self, node, deleted_node):
    for adjacent_from_node in node.adjacent_from:
      self._recursive_subtract_table_on_delete(adjacent_from_node, deleted_node)
    node._delete_reachability_table(deleted_node)

  def gviz(self):
    return 'strict digraph {\n%s}' % ''.join([n._gviz_edges() for n in self.nodes.values()])

class Node:
  def __init__(self, id, marked=False, adjacent_to=None):
    ''' Init node. Note that only adjacent_to (not adjacent_from) nodes are
    allowed here, which means we don't have to update adjacent_from
    reachable_marks. '''
    self.id = id
    self.marked = marked
    self.adjacent_to = adjacent_to or []  # avoid a shared mutable default
    self.adjacent_from = []
    self.reachable_marks = {}

    if marked:
      self.reachable_marks[id] = 1
    for adjacent_node in self.adjacent_to:
      adjacent_node.adjacent_from.append(self)
      self._add_reachability_table(adjacent_node)

  def _add_reachability_table(self, node):
    ''' Add the reachable_marks table from node to self. A marked node
    reachable via k paths from node is reachable via k more paths here. '''
    for (marked_node_id, k) in node.reachable_marks.items():
      self.reachable_marks[marked_node_id] = self.reachable_marks.get(marked_node_id, 0) + k

  def _delete_reachability_table(self, node):
    ''' Subtract the reachable_marks table of node from self. '''
    for (marked_node_id, k) in node.reachable_marks.items():
      self.reachable_marks[marked_node_id] = self.reachable_marks.get(marked_node_id, 0) - k
    self.reachable_marks = {k: v for k, v in self.reachable_marks.items() if v > 0}

  def _remove_adjacent_node(self, id):
    self.adjacent_to = list(filter(lambda n: n.id != id, self.adjacent_to))

  def _gviz_edges(self):
    ''' Helper to print graphviz edges adjacent to this node. '''
    _str = ''
    if self.marked:
      _str += ' %s[style=filled,fillcolor=blue]\n' % (self._gviz_name(),)
    else:
      _str +=  self._gviz_name() + '\n'
    for adjacent_node in self.adjacent_to:
      _str += ' %s -> %s\n' % (self._gviz_name(), adjacent_node._gviz_name())
    return _str

  def _gviz_name(self):
    ''' Helper to print graphviz name with reachable marks. '''
    return '"' + self.id + '(' + ','.join(self.reachable_marks.keys()) + ')"'

if __name__ == '__main__':
  main()

Results:

The output graph shows marked nodes reachable from each node in brackets.

Initial:

[graph image]

Remove node f:

[graph image]

Remove node e:

[graph image]

Remove node g:

[graph image]

Jost answered 28/1, 2023 at 5:15 Comment(4)
Thanks! Yes it makes sense. The challenge is that marking an entry can result in many updates (too many, in some cases). I'll investigate further and report what I find in a separate answer.Saddleback
You mean count updates are double handled? Interesting. I never actually went all the way through to a formal proof that the counts were consistent, just did some sketches on paper and couldn't come up with a counter example. I could be wrong. Anyway, the counts are supposed to maintain the total number of unique paths from one node to another (marked) node, so the algo should add and subtract unique path counts as nodes are added and deleted. Acyclic, and only one edge between nodes, are definitely preconditions for this to work.Jost
No, I don't mean "double handled". I mean, marking one entry can cause 1 million updates or more, depending on how many entries indirectly reference the entry. It fully moves the performance problem from query time to update time.Saddleback
Oh good, yeah I'm pretty sure the algo is correct. And yep, whether it makes sense depends on a number of factors, like the R/W ratio, how often you're doing "get-reachable-marks" queries, how connected the graph is, etc. You're basically doing the search upfront for all nodes with every modification of the graph and stashing the result, except this way the "search" is much more efficient for any given node.Jost

This problem is related to fully dynamic transitive closure. I'm not intimately familiar with the research literature on the latter (probably most of which is not practical), but there is one algorithmic trick that you might not know about, related to your "maximum depth" idea.

Add a binary flag ("open" or "closed") to each node, and store both incoming and outgoing arcs. The rules are, every node that can reach an open node is open, and (equivalently) every node that can be reached by a closed node is closed. Each closed node also stores the set of marked nodes that it can reach. To query, traverse forward (outgoing arcs) from the queried node via open nodes, stopping at closed nodes. To update, traverse backward (incoming arcs) from the updated node via closed nodes, stopping at open nodes.

A closed node with incoming arcs from open nodes only can be converted to open. An open node with outgoing arcs to closed nodes only can be converted to closed. Conversion requires updates proportional to (in- or out-) degree. At this scale, I would suggest dumping the whole graph periodically and computing a reasonable set of adjustments in main memory.
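
A minimal sketch of these rules, assuming the open/closed flags and invariants are already maintained (Java; all names are illustrative, and the conversions from the previous paragraph are omitted):

import java.util.*;

class OCNode {
    final String id;
    boolean marked;
    boolean open = true;  // invariant: all predecessors of an open node are open
    final Set<OCNode> out = new HashSet<>();  // outgoing arcs
    final Set<OCNode> in = new HashSet<>();   // incoming arcs
    final Set<OCNode> reachableMarked = new HashSet<>();  // only used when closed
    OCNode(String id) { this.id = id; }
}

class OpenClosedIndex {

    // Query: traverse forward via open nodes, stopping at closed nodes.
    static Set<OCNode> findReachableMarked(OCNode start) {
        Set<OCNode> result = new HashSet<>(), seen = new HashSet<>();
        Deque<OCNode> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            OCNode n = stack.pop();
            if (!seen.add(n)) continue;
            if (!n.open) {                       // closed: use the cached set
                result.addAll(n.reachableMarked);
                continue;
            }
            if (n.marked) result.add(n);
            for (OCNode next : n.out) stack.push(next);
        }
        result.remove(start);                    // report only referenced records
        return result;
    }

    // Update: traverse backward via closed nodes, stopping at open nodes.
    static void mark(OCNode m) {
        m.marked = true;
        if (m.open) return;  // no closed node can reach an open node
        Deque<OCNode> stack = new ArrayDeque<>();
        Set<OCNode> seen = new HashSet<>();
        stack.push(m);
        while (!stack.isEmpty()) {
            OCNode n = stack.pop();
            if (n.open || !seen.add(n)) continue;
            n.reachableMarked.add(m);            // cache m on this closed node
            for (OCNode prev : n.in) stack.push(prev);
        }
    }
}

The point of the flags is that a query never expands past a closed node, and a mark never propagates past an open node, so the cost of a full traversal is split between query time and update time.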

Trapp answered 23/1, 2023 at 13:40 Comment(5)
This is very useful! I'm afraid I don't quite understand yet how this works, and so I'm trying to implement it myself following your description... I couldn't find this in the literature; I'm wondering where would be a good place to look? Possibly the literature uses other terms ("open" and "closed" are quite generic terms, maybe it's just hard to find...). P.S. Interestingly, I was at a conference in January 2020 (ALENEX) presenting something else, where Monika Henzinger had a keynote speech about graph algorithms... I thought I'm probably never going to use any of that...Saddleback
@ThomasMueller another take on this idea: arxiv.org/pdf/2002.00813.pdfTrapp
Very interesting! They mention Bidirectional Breadth-First Search, and it seems very competitive. I'll implement that plus a few more algorithms, and then do a comparison. It would be great if changes wouldn't cause any updates: a read-only algorithm would be so much simpler to implement.Saddleback
I thought "open/closed" was just conventional terminology used in graph search: You mark nodes "closed" when you've already searched them, so if you encounter them again you know?Gorilla
@FrancisM.Bacon This usage is not standard AFAIK.Trapp

Finding all the marked records that are reachable from a given record is equivalent to enumerating the marked records in the part of the graph reachable from that record.

This can be done with breadth first or depth first search.
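
A minimal sketch of such a traversal (Java; Record and its fields are illustrative), collecting every marked record reachable from a given set of records:

import java.util.*;

class BruteForceSearch {

    static class Record {
        boolean marked;
        final List<Record> references = new ArrayList<>();
    }

    // Iterative breadth-first traversal from the given records; every
    // reachable record is read exactly once.
    static Set<Record> findMarked(Collection<Record> roots) {
        Set<Record> visited = new HashSet<>(roots);
        Deque<Record> queue = new ArrayDeque<>(roots);
        Set<Record> marked = new HashSet<>();
        while (!queue.isEmpty()) {
            Record r = queue.poll();
            for (Record ref : r.references) {
                if (visited.add(ref)) {
                    if (ref.marked) marked.add(ref);
                    queue.add(ref);
                }
            }
        }
        return marked;
    }
}

The cost is proportional to the size of the reachable set, which is exactly the problem the question describes.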

There is no faster algorithm. To improve your performance I believe you need to:

  1. Implement an efficient search code using an optimising compiler

  2. Switch to a high performance database engine

  3. Optimize your queries. ( Do not read records one at a time! )

  4. Optimize your hardware configuration ( no networks, no spinning disks )

Align answered 18/1, 2023 at 22:13 Comment(3)
I have updated the question. "a specified record" (actually one, or a few). Note that we are allowed to "shadow mark" other records if a record is marked. That means Dijkstra is not needed, at the cost of "shadow marking" all entries if we mark an entry. I'm looking for such a solution.Saddleback
The Dijkstra algorithm is used to calculate the shortest distance. Why would I need to know the shortest distance? It seems sufficient to find all the entries (that are marked). Breadth-first or depth-first seems sufficient for that, no?Saddleback
You are correct.Align

I have implemented 4 query algorithms and tested them with real-world data. None of the algorithms require new data structures or updates when marking an entry.

The test uses real-world data with 40'000 nodes, and for each node it tries to find a connection to 200 other random nodes. I expect the results to be similar with more data, because the "shape" of the data should stay very similar. For this experiment, I report the minimum, maximum, and average number of nodes read for each check.

  • Depth-first search: min: 1, max: 1659, avg: 4.03
  • Breadth-first search: min: 1, max: 1659, avg: 4.02
  • Reverse breadth-first search: min: 1, max: 102859, avg: 4.21
  • Bidirectional adaptive breadth-first search: min: 1, max: 174, avg: 1.29

The best algorithm is 3.11 times faster than breadth-first search (BFS). Reverse BFS is slower than BFS due to the "shape" of the data: each node seems to reference at most a few children, but a few nodes are referenced by a lot of other nodes. So the reverse search can be slow (the maximum is much higher for reverse BFS).

Bidirectional adaptive BFS uses the following algorithm:

  • The idea is to do a mix of BFS and reverse BFS, in a balanced way.
  • We use a mix of downward search (source to target), and upward search (target to source).
  • We remember the nodes we have seen so far in both directions.
  • At each step, we either expand downwards, or upwards.
  • Which direction we go at each step depends on how many nodes we have seen in the last step on the way down, versus on the way up: If we have seen more on the way down, then the next step is to go up. Otherwise, go down.

Implementation (Java):

    System.out.println("Number of nodes: " + nodes.size());
    System.out.println();
    int dfsCount = 0, bfsCount = 0, revCount = 0, biCount = 0;
    int dfsMin = Integer.MAX_VALUE, bfsMin = Integer.MAX_VALUE, revMin = Integer.MAX_VALUE, biMin = Integer.MAX_VALUE;
    int dfsMax = Integer.MIN_VALUE, bfsMax = Integer.MIN_VALUE, revMax = Integer.MIN_VALUE, biMax = Integer.MIN_VALUE;
    int totalCount = 0;
    Random r = new Random(1);
    for (int i = 0; i < nodeList.size(); i++) {
        int a = i;
        for (int j = 0; j < 200; j++) {
            int b;
            do {
                b = r.nextInt(nodes.size());
            } while (a == b);
            Node na = nodeList.get(a);
            Node nb = nodeList.get(b);
            totalCount++;
            AtomicInteger x = new AtomicInteger();
            
            boolean r1 = depthFirstSearch(x, na, nb);
            dfsCount += x.get();
            dfsMin = Math.min(dfsMin, x.get());
            dfsMax = Math.max(dfsMax, x.get());
            x.set(0);
            
            boolean r2 = breadthFirstSearch(x, na, nb);
            bfsCount += x.get();
            bfsMin = Math.min(bfsMin, x.get());
            bfsMax = Math.max(bfsMax, x.get());
            x.set(0);
            
            boolean r3 = reverseBreadthFirstSearch(x, na, nb);
            revCount += x.get();
            revMin = Math.min(revMin, x.get());
            revMax = Math.max(revMax, x.get());
            x.set(0);
            
            boolean r4 = bidirectionalAdaptiveBFS(x, na, nb);
            biCount += x.get();
            biMin = Math.min(biMin, x.get());
            biMax = Math.max(biMax, x.get());
            x.set(0);

            if (r1 != r2 || r1 != r3 || r1 != r4) {
                depthFirstSearchTrace(na, nb); // debug helper, not shown here
                bidirectionalAdaptiveBFS(x, na, nb);
                throw new AssertionError(r1 + " " + r2 + " " + r3 + " " + r4);
            }
        }
    }
    System.out.println("Depth-first search");
    System.out.printf("min: %d, max: %d, avg: %2.2f\n", dfsMin, dfsMax, ((double) dfsCount / totalCount));
    System.out.println();

    System.out.println("Breath-first search");
    System.out.printf("min: %d, max: %d, avg: %2.2f\n", bfsMin, bfsMax, ((double) bfsCount / totalCount));
    System.out.println();

    System.out.println("Reverse breath-first search");
    System.out.printf("min: %d, max: %d, avg: %2.2f\n", revMin, revMax, ((double) revCount / totalCount));
    System.out.println();

    System.out.println("Bidirectional adaptive breath-first search");
    System.out.printf("min: %d, max: %d, avg: %2.2f\n", biMin, biMax, ((double) biCount / totalCount));
    System.out.println();

static boolean depthFirstSearch(AtomicInteger count, Node source, Node target) {
    HashSet<Node> tested = new HashSet<>();
    tested.add(source);
    return depthFirstSearch(count, tested, source, target);
}

static boolean depthFirstSearch(AtomicInteger count, HashSet<Node> tested, Node source, Node target) {
    count.incrementAndGet();
    for(Node n : source.references) {
        if (n == target) {
            return true;
        }
        if (!tested.contains(n)) {
            tested.add(n);
            if (depthFirstSearch(count, tested, n, target)) {
                return true;
            }
        }
    }
    return false;
}

static boolean breadthFirstSearch(AtomicInteger count, Node source, Node target) {
    HashSet<Node> tested = new HashSet<>();
    tested.add(source);
    return breadthFirstSearch(count, tested, source, target);
}

static boolean breadthFirstSearch(AtomicInteger count, HashSet<Node> tested, Node source, Node target) {
    count.incrementAndGet();
    // check all direct references first, then recurse (one level of lookahead)
    for(Node n : source.references) {
        if (n == target) {
            return true;
        }
    }
    for(Node n : source.references) {
        if (!tested.contains(n)) {
            tested.add(n);
            // pass the shared 'tested' set, so nodes are not visited twice
            if (breadthFirstSearch(count, tested, n, target)) {
                return true;
            }
        }
    }
    return false;
}

static boolean reverseBreadthFirstSearch(AtomicInteger count, Node source, Node target) {
    HashSet<Node> tested = new HashSet<>();
    tested.add(target);
    return reverseBreadthFirstSearch(count, tested, source, target);
}

static boolean reverseBreadthFirstSearch(AtomicInteger count, HashSet<Node> tested, Node source, Node target) {
    count.incrementAndGet();
    for(Node n : target.referencedBy) {
        if (n == source) {
            return true;
        }
    }
    for(Node n : target.referencedBy) {
        if (!tested.contains(n)) {
            tested.add(n);
            // recurse via the reverse index, keeping the shared 'tested' set
            if (reverseBreadthFirstSearch(count, tested, source, n)) {
                return true;
            }
        }
    }
    return false;
}

static boolean bidirectionalAdaptiveBFS(AtomicInteger count, Node source, Node target) {
    HashSet<Node> allSources = new HashSet<>();
    HashSet<Node> sources = new HashSet<>();
    allSources.add(source);
    sources.add(source);
    HashSet<Node> allTargets = new HashSet<>();
    HashSet<Node> targets = new HashSet<>();
    allTargets.add(target);
    targets.add(target);
    return bidirectionalAdaptiveBFS(count, allSources, allTargets, sources, targets);
}

static boolean bidirectionalAdaptiveBFS(AtomicInteger count, Set<Node> allSources, Set<Node> allTargets, Set<Node> sources, Set<Node> targets) {
    while (!sources.isEmpty() && !targets.isEmpty()) {
        if (sources.size() <= targets.size()) {
            HashSet<Node> newSources = new HashSet<>();
            for(Node source: sources) {
                count.incrementAndGet();
                for(Node n : source.references) {
                    if (!allSources.contains(n)) {
                        newSources.add(n);
                        allSources.add(n);
                        if (allTargets.contains(n)) {
                            return true;
                        }
                    }
                }
            }
            sources = newSources;
        } else {
            HashSet<Node> newTargets = new HashSet<>();
            for(Node target: targets) {
                count.incrementAndGet();
                for(Node n : target.referencedBy) {
                    if (!allTargets.contains(n)) {
                        newTargets.add(n);
                        allTargets.add(n);
                        if (allSources.contains(n)) {
                            return true;
                        }
                    }
                }
            }
            targets = newTargets;
        }
    }
    return false;
}

static class Node {
    String name;
    HashSet<Node> references = new HashSet<>();
    HashSet<Node> referencedBy = new HashSet<>();
    boolean marked;
    Node(String name) {
        this.name = name;
    }
    void addReference(Node n) {
        references.add(n);
        n.referencedBy.add(this);
    }
    public String toString() {
        return name;
    }
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Node)) {
            return false;
        }
        return name.equals(((Node) other).name);
    }
    @Override
    public int hashCode() {
        return name.hashCode();
    }
}
Saddleback answered 2/2, 2023 at 11:28 Comment(0)
