Hidden Markov Model: Is it possible that the accuracy decreases as the number of states increases?

I trained a couple of Hidden Markov Models with the Baum-Welch algorithm, increasing the number of states each time. I noticed that beyond 8 states, the validation score starts to go down. So I wondered: is it possible that the accuracy of a Hidden Markov Model decreases as the number of states increases, due to some kind of overfitting?
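
For reference, here is a minimal sketch of the kind of experiment I mean (not my actual code: it assumes hmmlearn's GaussianHMM, whose fit() runs Baum-Welch, uses synthetic data as a stand-in for mine, and uses held-out log-likelihood as the validation score):

```python
import numpy as np
from hmmlearn import hmm  # assumption: hmmlearn provides the Baum-Welch training

rng = np.random.RandomState(0)
# Synthetic stand-in for the real data: a noisy signal alternating between two regimes.
train = np.concatenate([rng.normal(m, 1, 200) for m in (0, 5) * 10]).reshape(-1, 1)
valid = np.concatenate([rng.normal(m, 1, 200) for m in (0, 5) * 5]).reshape(-1, 1)

for n_states in range(2, 13):
    model = hmm.GaussianHMM(n_components=n_states, n_iter=100, random_state=0)
    model.fit(train)                      # Baum-Welch (EM) training
    print(n_states, model.score(valid))   # held-out log-likelihood, often peaks and then drops
```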

Thanks in advance!

Lesseps answered 7/7, 2015 at 12:37 Comment(1)
Do you find that surprising? Only one state is correct and you now offer more possibilities to choose from. Wouldn't you expect the problem to become harder? – Aquacade

For the sake of clarity, I propose here a very simplified illustration of the phenomenon.

Say you train your HMM on the data sequence (A-B-A-B), using a 2-state HMM. Naturally, state 1 will optimize itself to represent A and state 2 will represent B (or the other way around). Then you get a new sequence (A-B-A-B) and want to know its likelihood under your HMM. The Viterbi algorithm will find that the most probable state sequence is (1-2-1-2), and the forward algorithm will assign the sequence a high likelihood, since the state sequence and the "values" of the new sequence (if you are working with continuous data) clearly match your training sequence.
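
To make this concrete, here is a tiny numpy sketch of such a 2-state HMM (the start, transition and emission probabilities are made up for illustration; it computes the best-path probability, which is what Viterbi maximizes, and the total likelihood from the forward algorithm):

```python
import numpy as np

# Hypothetical 2-state discrete HMM that has learned the A<->B alternation:
# state 0 emits A, state 1 emits B, and the transitions force alternation.
start = np.array([0.5, 0.5])
trans = np.array([[0.0, 1.0],    # from state 0 (A) always to state 1 (B)
                  [1.0, 0.0]])   # from state 1 (B) always to state 0 (A)
emit  = np.array([[1.0, 0.0],    # state 0 emits A
                  [0.0, 1.0]])   # state 1 emits B
obs = [0, 1, 0, 1]               # the query sequence A-B-A-B (A = 0, B = 1)

# Viterbi recursion: probability of the single best state path.
v = start * emit[:, obs[0]]
for o in obs[1:]:
    v = (v[:, None] * trans).max(axis=0) * emit[:, o]
print("best-path probability:", v.max())   # 0.5: only the choice of start state costs anything

# Forward recursion: total likelihood, summed over all state paths.
a = start * emit[:, obs[0]]
for o in obs[1:]:
    a = (a @ trans) * emit[:, o]
print("sequence likelihood:", a.sum())      # also 0.5 here, carried by a single dominant path
```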

Say now that you train a 3-state HMM on the same training sequence (A-B-A-B). The initial clustering of your data will most probably assign the first two states of the HMM to represent symbol A and the last one to represent symbol B (or, once again, the other way around).

So now the query sequence (A-B-A-B) can be represented by the state sequence (1-3-1-3), (2-3-2-3), (1-3-2-3) or (2-3-1-3)! The probability mass gets split across these equally plausible paths, which means that for this 3-state HMM, a sequence identical to the training data (A-B-A-B) can still receive a low score. That is the reason why, for any HMM and any data set, performance will eventually decrease beyond a certain number of states.
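
Extending the same toy sketch to a hypothetical 3-state HMM in which states 0 and 1 both represent A shows the path splitting directly (again, all probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical 3-state HMM: states 0 and 1 both represent A, state 2 represents B.
# After emitting B, the model cannot "know" which A-state to return to.
start3 = np.array([0.25, 0.25, 0.5])
trans3 = np.array([[0.0, 0.0, 1.0],    # from A-state 0 to B
                   [0.0, 0.0, 1.0],    # from A-state 1 to B
                   [0.5, 0.5, 0.0]])   # from B back to either A-state, 50/50
emit3  = np.array([[1.0, 0.0],         # A-state 0
                   [1.0, 0.0],         # A-state 1
                   [0.0, 1.0]])        # B-state
obs = [0, 1, 0, 1]                     # the query sequence A-B-A-B (A = 0, B = 1)

v = start3 * emit3[:, obs[0]]
for o in obs[1:]:
    v = (v[:, None] * trans3).max(axis=0) * emit3[:, o]
print("best single path:", v.max())    # 0.125 instead of 0.5: the split start and the
                                        # 50/50 return to an A-state each cost a factor of 2
```

(In this perfectly clean toy the forward likelihood, summed over all paths, happens to stay the same; with real, noisy data the two A-states will not model A equally well, which is when the score of held-out sequences actually drops.)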

You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), the Minimum Message Length (MML) criterion, or, if you just want a rough idea, a k-means clustering combined with the percentage of variance explained. The first three criteria are interesting because they include a penalty term that grows with the number of parameters of the model.
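
As a sketch of how such a criterion could be applied (assuming hmmlearn's GaussianHMM with diagonal covariances; the parameter count below is specific to that model family):

```python
import numpy as np
from hmmlearn import hmm  # assumption: hmmlearn, Gaussian emissions, diagonal covariance

def bic(model, X):
    """BIC = p * ln(n) - 2 * log-likelihood; lower is better."""
    k, d = model.n_components, X.shape[1]
    # Free parameters: start probs (k-1) + transitions k*(k-1) + means k*d + variances k*d
    p = (k - 1) + k * (k - 1) + 2 * k * d
    return p * np.log(len(X)) - 2 * model.score(X)

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(m, 1, 200) for m in (0, 5) * 10]).reshape(-1, 1)

scores = {k: bic(hmm.GaussianHMM(n_components=k, n_iter=100, random_state=0).fit(X), X)
          for k in range(2, 9)}
print(min(scores, key=scores.get), "states minimizes the BIC")
```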

Hope it helps! :)

Gandhi answered 21/7, 2015 at 14:30 Comment(0)
