Machine Learning Algorithm for Predicting Order of Events?

5

38

Simple machine learning question. Probably numerous ways to solve this:

There is an infinite stream of 4 possible events:

'event_1', 'event_2', 'event_3', 'event_4'

The events do not come in completely random order. We will assume that there are some complex patterns to the order that most events come in, and that the rest of the events are just random. We do not know the patterns ahead of time, though.

After each event is received, I want to predict what the next event will be based on the order that events have come in in the past. So my question is: What machine learning algorithm should I use for this predictor?

The predictor will then be told what the next event actually was:

predictor = new_predictor()

prev_event = None
while True:
    event = get_event()
    if prev_event is not None:
        predictor.last_event_was(prev_event)
    predicted_event = predictor.predict_next_event(event)
    prev_event = event

The question arises of how long a history the predictor should maintain, since maintaining infinite history will not be possible. I'll leave this up to you to answer, but for practicality the answer can't be infinite.

So I believe that the predictions will have to be done with some kind of rolling history. Adding a new event and expiring an old event should therefore be rather efficient, and not require rebuilding the entire predictor model, for example.

Specific code, rather than research papers, would add immense value to your responses. Python or C libraries are nice, but anything will do.

Update: What if more than one event can happen simultaneously on each round? Does that change the solution?

Arabel answered 26/3, 2010 at 15:52 Comment(0)
23

This is essentially a sequence prediction problem, so you want recurrent neural networks or hidden Markov models.

If you only have a fixed time to look back, time window approaches might suffice. You take the sequence data and split it into overlapping windows of length n (e.g. you split a sequence ABCDEFG into ABC, BCD, CDE, DEF, EFG). Then you train a function approximator (e.g. a neural network or linear regression) to map the first n-1 parts of each window onto the nth part.
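As a minimal illustration of the windowing (plain Python; make_windows is just a made-up helper name, not a library function):

def make_windows(sequence, n):
    # split a sequence into overlapping windows of length n and return
    # (first n-1 symbols, nth symbol) training pairs
    pairs = []
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        pairs.append((window[:-1], window[-1]))
    return pairs

print(make_windows('ABCDEFG', 3))
# [('AB', 'C'), ('BC', 'D'), ('CD', 'E'), ('DE', 'F'), ('EF', 'G')]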

Your predictor will not be able to look back in time longer than the size of your window. RNNs and HMMs can do so in theory, but are hard to tune or sometimes just don't work.

(State of the art RNN implementations can be found in PyBrain http://pybrain.org)

Update: Here is the pybrain code for your problem. (I haven't tested it, there might be some typos and stuff, but the overall structure should work.)

from pybrain.datasets import SequentialDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.shortcuts import buildNetwork
from pybrain.structure import LSTMLayer, SigmoidLayer

INPUTS = 4
HIDDEN = 10
OUTPUTS = 4

net = buildNetwork(INPUTS, HIDDEN, OUTPUTS, hiddenclass=LSTMLayer, outclass=SigmoidLayer, recurrent=True)

ds = SequentialDataSet(INPUTS, OUTPUTS)

# your_sequences is a list of lists of tuples which each are a bitmask
# indicating the event (so 1.0 at position i if event i happens, 0.0 otherwise)
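# For illustration only (this variable is an assumption, not part of PyBrain):
# two short sequences of the events 1, 3, 2 and 4, 4, 1 as one-hot bitmasks
your_sequences = [
    [(1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 0.0)],
]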

for sequence in your_sequences:
    ds.newSequence()
    for (inpt, target) in zip(sequence, sequence[1:]):
        ds.appendLinked(inpt, target)

net.randomize()

trainer = BackpropTrainer(net, ds, learningrate=0.05, momentum=0.99)
for _ in range(1000):
    print(trainer.train())

This will train the recurrent network for 1000 epochs and print out the error after every epoch. Afterwards you can check for correct predictions like this:

net.reset()
for i in sequence:
    next_item = net.activate(i) > 0.5
    print(next_item)

This will print an array of booleans for every event.

Indican answered 26/3, 2010 at 17:4 Comment(1)
Is it possible to provide a small example of how the "your_sequences" variable should look? Even with the description, I guess I'm not getting it right.Brahear
14

Rather than keeping a full history, one can keep aggregated information about the past (along with a relatively short sliding history, to be used as input to the Predictor logic).

A tentative implementation could go like this.
In a nutshell: manage a set of Markov chains of increasing order, and grade and average their predictions.

  • keep a table of individual event counts; the purpose is to calculate the probability of any of the 4 different events without regard to any sequence.
  • keep a table of bigram counts, i.e. a cumulative count of the pairs of consecutive events observed [so far]
    The table starts empty; upon the second event observed, we can store the first bigram, with a count of 1. Upon the third event, the bigram made of the 2nd and 3rd events is "added" to the table: either incrementing the count of an existing bigram or adding it, with an initial count of 1, as a new (never-seen-so-far) bigram. And so on.
    In parallel, keep a total count of bigrams in the table.
    This table and the total tally allow calculating the probability of a given event, based on the one preceding event.
  • In a similar fashion keep a table of trigram counts, and a running tally of total trigrams seen (note that this would be equal to the number of bigrams minus one, since the first trigram is added one event after the first bigram, and after that one of each is added with each new event). This trigram table allows calculating the probability of a given event based on the two preceding events.
  • likewise, keep tables for N-Grams, up to, say, 10-grams (the algorithm will tell if we need to increase or decrease this).
  • keep a sliding window of the last 10 events.
  • The above tables provide the basis for prediction; the general idea is to:
    • use a formula which expresses the probabilities of the next event as a weighted average of the individual probabilities based on the different N-grams.
    • reward the better individual N-gram lengths by increasing the corresponding weight in the formula; punish the worse lengths in the reverse fashion. (Beware: the marginal probability of individual events needs to be taken into account, lest we favor N-grams which happen to predict the most frequent events regardless of the relatively poor predictive value associated with them.)
    • Once the system has "seen" enough events, look at the current values of the weights associated with the long N-grams, and if these are relatively high, consider adding tables to keep aggregate info about bigger N-grams. (This unfortunately hurts the algorithm both in terms of space and time.)

There can be several variations on the general logic described above, in particular in the choice of the metric used to "grade" the quality of prediction of the individual N-gram lengths.
Other considerations concern detecting and adapting to possible shifts in the event distribution (the above assumes a generally ergodic event source). One possible approach is to use two sets of tables (combining the probabilities accordingly) and to periodically drop the contents of all tables of one of the sets. Choosing the right period for these resets is a tricky business, essentially balancing the need for statistically significant volumes of history against the need for a short enough period lest we miss the shorter modulations...
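For concreteness, here is a minimal, untested sketch of the counting-and-blending idea in Python (the class name, the 4-gram limit and the multiplicative 1.05/0.95 weight update are illustrative choices, not prescriptions; unlike the interface in the question, prediction here uses the internal sliding window rather than an explicit argument):

from collections import defaultdict, deque

EVENTS = ['event_1', 'event_2', 'event_3', 'event_4']

class NGramPredictor:
    def __init__(self, max_n=4):                      # the text above suggests going up to ~10-grams
        self.max_n = max_n
        # counts[n][context][event]: how often `event` followed `context`, where
        # `context` is a tuple of the n preceding events (n = 0 is the marginal table)
        self.counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_n + 1)]
        self.history = deque(maxlen=max_n)            # the short sliding window
        self.weights = [1.0] * (max_n + 1)            # one weight per context length

    def _context(self, n):
        return tuple(list(self.history)[-n:]) if n else ()

    def _prob(self, n, event):
        # probability of `event` given the last n events, with a uniform fallback
        table = self.counts[n][self._context(n)]
        total = sum(table.values())
        return table[event] / float(total) if total else 1.0 / len(EVENTS)

    def predict_next_event(self):
        # weighted average of the per-length conditional probabilities
        usable = range(min(len(self.history), self.max_n) + 1)
        scores = dict((e, sum(self.weights[n] * self._prob(n, e) for n in usable))
                      for e in EVENTS)
        return max(scores, key=scores.get)

    def last_event_was(self, event):
        # grade each context length on whether it would have predicted `event`,
        # then update its counts and slide the window forward
        for n in range(min(len(self.history), self.max_n) + 1):
            predicted = max(EVENTS, key=lambda e: self._prob(n, e))
            self.weights[n] *= 1.05 if predicted == event else 0.95
            self.counts[n][self._context(n)][event] += 1
        self.history.append(event)

Used in a loop like the one in the question, it would look like this:

predictor = NGramPredictor()
for e in ['event_1', 'event_2', 'event_1', 'event_2', 'event_1']:
    print(predictor.predict_next_event())             # prediction made before seeing e
    predictor.last_event_was(e)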

Imaginative answered 26/3, 2010 at 17:6 Comment(1)
This is what I would try. Unlike a neural network it should be straightforward to implement and more importantly straightforward to understand.Fructificative
0

The question arises of how long a history the predictor should maintain

The only answer is "it depends".

It depends on how accurate this needs to be. I don't believe this strategy could ever be 100% accurate even with an infinite history. Try out a history of 10 and you'll get x% accuracy, then try 100 and you'll get y% accuracy, etc etc...

Eventually you should find either that the system is as accurate as you desire, or that the rise in accuracy is not worth the increase in history length (and the increased memory usage, processing time, etc.). At that point the job is either done, or you need to find a new strategy.
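Roughly, that experiment could look like this (a sketch using the predictor interface from the question; the history argument to new_predictor and the recorded_events sample are assumptions):

def accuracy(events, history_length):
    # fraction of events correctly predicted, following the loop from the question
    predictor = new_predictor(history=history_length)   # hypothetical constructor argument
    correct = 0
    predicted = None
    prev_event = None
    for event in events:
        if predicted is not None and predicted == event:
            correct += 1
        if prev_event is not None:
            predictor.last_event_was(prev_event)
        predicted = predictor.predict_next_event(event)
        prev_event = event
    return float(correct) / max(1, len(events) - 1)

for n in (10, 100, 1000):
    print(n, accuracy(recorded_events, n))               # recorded_events: a finite logged sample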

For what it is worth, I think looking into a simple "soft" neural net might be a better plan.

Heti answered 26/3, 2010 at 16:10 Comment(2)
Although, you could probably do a bit of maths to work out the expected accuracy for a given history, but this would be dependent on your algorithm.Heti
You cannot determine the amount of time needed to look back since you don't know the underlying dynamics. You only know examples, and what you see might even be stochastic.Indican
0

We just studied branch predictors in computer architecture (because the processor would take too long to actually evaluate a condition if(EXPRESSION), it tries to 'guess' and save some time that way). I am sure more research has been done in this area, but that's all I can think of at the moment.

I haven't seen a unique setup like yours, so I think you might need to do some preliminary experimentation on your own. Try running your solution for X seconds with a history of N slots and check the correctness ratio; then compare runs with the same fixed X and varying N history slots to find the best memory-vs-accuracy trade-off (graphing them out).

If more than one event can happen simultaneously... that's a little mind-bending; there have to be some constraints there: what if an infinite number of events happen at a time? Uh oh, that's computationally impossible for you. I'd try the same approach as with just one event at a time, except that the predictor is enabled to predict multiple events at a time.

Marlo answered 26/3, 2010 at 16:11 Comment(2)
Branch predictors are however designed to work on hardware. You can use much more sophisticated algorithms when you don't care about microseconds that much.Indican
That's true - but the same underlying problems of memory vs correctness (minus microseconds) exist. Some popular Branch-Predictors use Markov Models.Marlo
0

Processors use a few really lightweight tricks to predict whether a branch statement will branch or not. This helps them with efficient pipelining. They may not be as general as Markov models, for instance, but they are interesting because of their simplicity. Here is the Wikipedia article on branch prediction (https://en.wikipedia.org/wiki/Branch_predictor); see the Saturating Counter and the Two-Level Adaptive Predictor.
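Not part of that article, but as a tiny illustration, a 2-bit saturating counter can be sketched in a few lines of Python (the state encoding and names are chosen for illustration):

class TwoBitSaturatingCounter:
    # states 0-1 predict "not taken", states 2-3 predict "taken"
    def __init__(self):
        self.state = 2                                  # start in "weakly taken"

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # move one step toward the observed outcome, clamped to [0, 3]
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)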

Microscopic answered 29/6, 2010 at 19:39 Comment(0)
