Sequence prediction of characters?

Asked 12/3, 2017 at 13:30 Answered 15/1, 2023 at 1:47

text machine-learning neural-network lstm n-gram

I am new to machine learning, so please go easy in case the problem is trivial.

I have been given a sequence of observed characters say, ABABBABBB..... (n characters). My goal is to predict the next characters by some "learning" mechanisms. My constraints are that the obeserved characters (training data?) is not too much i.e I have say a sequence of length 6000 to learn the underlying patterm

I am quite confused about what strategy to take to solve this problem,my initial bets are: 1) some kind of ngram model? 2) Neural networks (the likes of LSTM) etc.? 3) HMMs

Can you please give directions of the right approaches for solving this problem?

Pontus answered 12/3, 2017 at 13:30 Comment(4)

Is the underlying pattern of the sequence constant? (or does the pattern change per sequence/input) – Triphammer 14/3, 2017 at 9:51

I don't know the pattern, it is something I want to learn. – Pontus 16/3, 2017 at 6:47

How many characters are in your alphabet? Just 'A' and 'B'? Can you upload your data? – Cranky 16/3, 2017 at 9:50

it can contain a maximum of 20 alphabets – Pontus 17/3, 2017 at 5:22

If you're dealing with a fairly trivial pattern, where letters are based only off of the previous one, then you're spot on that a Hidden Markov Model (HMM) would solve it - in fact something as simple as a Markov Chain would work.

If you want to have a little fun, then here's a custom solution based on the HMM that you can fiddle around with.

Go over the sample data, and create a linked list of every element, in the order they were inserted. Now make another list for each different character, and put the index of every list element where it belongs. Here's a (very poorly drawn) visual representation of the linked list, and the buckets below it:

Now when you are presented a sequence, and asked to predict the next character, all you have to do is look at the most recent X characters, and see how sub-sequences that were similar to it acted.

To use my example above, then look at the most recent (the last) 3 characters to get BAC. You want to see if the sequence BAC has ever happened before, and what came after it when it did happen. If you check the bucket for the first letter of BAC (the B), you can see that the letter B showed up once before. Thankfully it follows the sequence - and an A came after it, so that will be the prediction.

You might want to not only check sequences of the past X, but also every number below X, giving each one less weight if the sequence matches, to create a better heuristic.

The difficult part is deciding how far behind to look - if you look too far, it'll take too long, and you might not get any matches. If you look too short, then you might miss a pattern, and have to guess.

Good luck - hopefully this is nice and easy to implement, and works for you.

Heavily answered 17/3, 2017 at 6:39 Comment(0)

Your problem looks to be a time-series analysis. For that you should also take the use of statistics and Exploratory Data Analysis (EDA) besides machine-learning algorithms into account.

I would start by assigning numbers to chars (A->1 , B->2 , and so on). It is usually not recommended to turn nominal variables (values without order) into ordinal ones, (2 is bigger than 1 but is "C" bigger than "A" or "Red" bigger than "Green"?!) but in this case, this will change your problem into an absolute time-series analysis one.
Then I would utilize some routine EDA approaches, like 4-plot or autocorrelation analysis. this would tell you a lot about the statistical behavior of data, like "is the mean of data shifting?" or "how much random the dataset might be?" Afterwards you probably would have a better decision on which machine learning algorithm to use
Depending on what you might find in the EDA analysis, you can go ahead with implementing ML algorithms. If you have a high correlated data (perceived from autocorrelation plot) then you would probably choose a sliding window approach in your feature selection i.e. assuming that each value depends on previous k values ( x_k = f(x_(k-1),x_(k-2),...,x_(k-m)) ). the value of "m" could be selected by analyzing autocorrelation plot. If you have a moving average, it would be a good idea to learn the average curve first and then go on with learning the offset of each instance from its average. If you perceived a degree of randomness in either the average curve or instance offsets, then you might want to choose a stochastic approach through your prediction problem.

Generally, EDA philosophy is "analysis should come before model selection" and I think it's true. If you know more about what you're dealing with, then you definitely will have a better call for model selection

Petrillo answered 22/3, 2017 at 9:46 Comment(0)

Genetic Algorithms could be a solution. It creates classifiers replacing letters by wildcards Since classifiers should have a given length, you pick the last N characters. ex (N-= 4): ABBACDC -> "ACDC" and one classifier example could be "A**C". To test it, it could be quite fast for a good programmer and the results could be astonishing.

Bug answered 15/1, 2023 at 1:47 Comment(0)

Recommended topics

Hot tags