What is the best/fastest way to construct a very large markov chain from simulation data?
I have written a C++ program that simulates a certain process I'm studying. It outputs discrete "states" at each timestep of the simulation. For example:

a
b
c
b
c
b

would be the output of a simulation run with a as the initial condition (set by me or randomly generated), after which the system keeps oscillating between states b and c.

I would like to combine many of these runs into a Markov chain, so that it turns into a graph with the following vertices and edges, preferably built at runtime, because saving the output first takes a lot of disk space. The numbers in parentheses indicate how many times a certain vertex or edge was encountered, so these counts should also be stored.

Vertices: a(1), b(3) and c(2).

Edges: a->b(1), b->c(2), c->b(2).
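
For illustration, here is a minimal sketch of the counting I have in mind, as a toy version that reads one state per line from standard input (the string keys are just for clarity; my real states are described below):

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    int main() {
        std::map<std::string, std::uint64_t> vertex_count;
        std::map<std::pair<std::string, std::string>, std::uint64_t> edge_count;

        std::string prev, state;
        while (std::cin >> state) {           // one state per timestep
            ++vertex_count[state];            // count the vertex
            if (!prev.empty())
                ++edge_count[{prev, state}];  // count the transition edge
            prev = state;
        }

        for (const auto& [v, n] : vertex_count)
            std::cout << v << "(" << n << ")\n";
        for (const auto& [e, n] : edge_count)
            std::cout << e.first << "->" << e.second << "(" << n << ")\n";
    }

Fed the example run above, this prints exactly the vertex and edge counts listed.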

The real states contain 112 bits of information, and I'm generating billions of these transitions. The problem is that I haven't found a graph library or program that can build the Markov chain quickly and efficiently. I have been toying around with:

  • Google sparse hash to construct my own graph class in C++.
  • Neo4J (I was just getting started with this one)
  • Lemon library

I just completed the "Google sparse hash graph", but it turns out to get really slow halfway through the runs. After about a day (when memory usage goes above 20 GB, which is not a problem in itself, since there is far more available), it slows down to the point where the run takes about three weeks to complete.

I have access to computers with 12 or 16 cores and 256 or 512 GB of memory, and my feeling is they should be up for the job.

Since I'm not a trained programmer and I code quite slowly, I'm looking for some information before I spend a lot of time working on another imperfect solution.

  • What would be the best program/library that can quickly accept large numbers of vertices and edges to construct the Markov chain?
  • Is the slowness a result of using the wrong tools or imperfect coding (which I suspect) or am I simply trying to do something that will always take a lot of time?

I hope I was able to make my issue clear. Thanks in advance for any wisdom or answers.

EDIT:

Based on the questions and answers in the comments, I guess my question should have been: what is a suitable fast (sparse) matrix library for C++?

Hathcock answered 27/10, 2013 at 10:38 Comment(7)
Not sure I follow: in your example it seems like you have a 3-state Markov chain, where a is a non-recurrent state. Do you want to generate a different Markov chain, or to generate a graph out of the output of an instance of this Markov chain? The definition of the Markov chain here is a bit unclear. – Pepperandsalt
OK, so you have a Markov chain with many states. What is the use of the edges and vertices, and why do you count them? A Markov chain is defined by an N×N matrix. – Pepperandsalt
Hm, perhaps my previous comment did not answer your question. Perhaps this helps: I'm trying to discover the Markov chain of a discrete process I can only simulate. I don't yet have the Markov chain, but I can infer it by observing the state transitions of my simulation. – Hathcock
OK, now I get it. Is it possible to store an N×N array in memory for counting, or is the number of states too big for that? – Pepperandsalt
@RonTeller, I made an estimate. With my current simulation parameters, the worst case is 2×10^9 unique states per simulation. This is unlikely, so I estimate 6×10^8 unique states. This estimate is based on guesswork :) – Hathcock
Do you have an estimate of the number of (unique) edges connected to a state on average? Because if this number is ~1000 or more, you won't be able to store this Markov chain in memory, even on a 512 GB system. – Pepperandsalt
I just looked at some preliminary data and extrapolated from there. I estimate that an average node will have no more than ~30 edges. Some will be above that, but they will be a very small subgroup. Anyway, I once got a simulation to complete; it just took very long. I'm pretty sure it will fit in memory, but it would have to be a sparse table, I guess, because I can't list all possible states (which would be 2^112). – Hathcock

Did you look at boost::numeric::ublas? It has a sparse matrix type that gives you matrix-like access, but instead of building an N×N array in memory it keeps a list of edges per node.

So if N is the number of nodes, instead of an N×N array you keep N×30 entries in memory (30 being the average number of edges per node).
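
A minimal sketch of what I mean, assuming you first remap the 112-bit states to dense indices 0..N-1 with a separate hash map (the tiny N and the hard-coded transitions below are just for illustration):

    #include <boost/numeric/ublas/matrix_sparse.hpp>
    #include <cstdint>
    #include <iostream>

    int main() {
        namespace ublas = boost::numeric::ublas;
        const std::size_t N = 3;  // toy size; yours would be ~6e8
        ublas::mapped_matrix<std::uint32_t> counts(N, N);

        // the example run a b c b c b, with a=0, b=1, c=2
        counts(0, 1) += 1;
        counts(1, 2) += 1;
        counts(2, 1) += 1;
        counts(1, 2) += 1;
        counts(2, 1) += 1;

        // iterate over the stored (non-zero) entries only
        for (auto it1 = counts.begin1(); it1 != counts.end1(); ++it1)
            for (auto it2 = it1.begin(); it2 != it1.end(); ++it2)
                std::cout << it2.index1() << "->" << it2.index2()
                          << "(" << *it2 << ")\n";
    }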

However, even assuming you can use a single byte to count the recurrence of each edge, you still have 600M nodes, each with a list of 30 edges.

Each list entry is the edge name (a uint32) plus a count of at least 1 byte, i.e. 5 bytes per entry and at least 150 bytes per list. That comes out to a minimum of 90 GB in memory, and in practice more, because there is per-element overhead in a list.

If you can keep all of this in memory without the OS swapping data to disk, there is no reason why it should not be fast. Of course, it is possible that an ordered map will outperform a hash_map; it depends on the implementation and the hash function used.

Naively: std::map<uint32, std::map<uint32, uint8>>. If the tree is balanced, the depth of the big (outer) tree is about 30 (log2 of 600M), and the small one is tiny, so access shouldn't take ages. It is possible that a hash_map will work better for the outer map, though that is not certain: hash_map<uint32, std::map<uint32, uint8>>. (Google sparse hash map is tuned for memory, not speed, and the outer map will be very big, which probably makes it a bad fit.)
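
As a sketch of that layout (the uint32 node ids again assume the states have been remapped to dense ids, which works since ~6×10^8 unique states fit in 32 bits, and the single-byte counter has to saturate rather than wrap around):

    #include <cstdint>
    #include <map>

    // outer key: source node id; inner key: target node id; value: hit count
    using EdgeCounts = std::map<std::uint32_t, std::map<std::uint32_t, std::uint8_t>>;

    void add_transition(EdgeCounts& g, std::uint32_t from, std::uint32_t to) {
        std::uint8_t& n = g[from][to];  // inserts a zero-initialized count if absent
        if (n < 255) ++n;               // saturate at 255 instead of wrapping to 0
    }

    int main() {
        EdgeCounts g;
        add_transition(g, 0, 1);  // a->b
        add_transition(g, 1, 2);  // b->c
    }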

Finally, you should consider holding this information on disk instead of in memory. In fact, you could use an external data service such as a database, with a table for the nodes (NodeId, NumOfHits) and a table for the edges (NodeId, NodeId, NumOfHits), though this representation takes up a lot more space.

I'd try something like Cassandra, which can manage the disk-versus-memory caching for you and can easily scale across multiple computers, and you don't need the overhead of complex transaction models, etc.

Greenebaum answered 30/10, 2013 at 8:30 Comment(0)
