For the sake of clarity, here is a very simplified illustration of the phenomenon.
Say you train your HMM with the data sequence (A-B-A-B).
Say you use a 2-state HMM.
Naturally, state 1 will optimize itself to represent A and state 2 will represent B (or the opposite).
Then, you have a new sequence (A-B-A-B). You want to know the likelihood of this sequence under your HMM.
The Viterbi algorithm will find that the most probable state sequence is (1-2-1-2), and the forward algorithm (the same forward pass that Baum-Welch uses during training) will assign this sequence a high likelihood, since the state sequence and the "values" of the new sequence (if you work with continuous data) clearly match your training sequence.
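If you want to play with this concretely, here is a minimal sketch assuming hmmlearn's `CategoricalHMM` (the class name and defaults depend on your hmmlearn version). Symbols A and B are simply encoded as 0 and 1, and the training pattern is repeated so that EM has something to work with:

```python
# Minimal 2-state illustration (assumes hmmlearn >= 0.2.8 with CategoricalHMM).
import numpy as np
from hmmlearn import hmm

# Encode A as 0 and B as 1; repeat the pattern to give training enough data.
train = np.array([0, 1] * 10).reshape(-1, 1)   # A-B-A-B-...
query = np.array([0, 1, 0, 1]).reshape(-1, 1)  # A-B-A-B

model = hmm.CategoricalHMM(n_components=2, n_iter=100, random_state=0)
model.fit(train)

print(model.predict(query))  # Viterbi state path, e.g. [0 1 0 1] (or [1 0 1 0])
print(model.score(query))    # log-likelihood of the query (forward algorithm)
```

Here `predict` returns the Viterbi state path and `score` returns the log-likelihood computed by the forward algorithm.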
Say now that you train a 3-state HMM on the same training sequence (A-B-A-B). The initial clustering of your data will most probably assign the first two states of the HMM to the representation of symbol A and the last one to symbol B (or the opposite, once again).
So now, the query sequence (A-B-A-B) can be represented as the state sequence (1-3-1-3), (2-3-2-3), (1-3-2-3) or (2-3-1-3)!
This means that for the 3-state HMM, the probability of generating (A-B-A-B) is spread over several equally plausible state paths, so the score of the single best path drops even though the query is identical to the training sequence. That is the reason why, for any HMM and any data set, performance eventually decreases once the number of states grows beyond what the data supports.
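One rough way to see the effect numerically (same hmmlearn assumptions as above) is to compare the log-probability of the single best Viterbi path under a 2-state and a 3-state model trained on the same data. Depending on how EM happens to initialize, the 3-state model either splits symbol A over two states, in which case the best path's score drops, or effectively switches the extra state off:

```python
# Compare the Viterbi-path log-probability for 2 vs 3 states (hmmlearn sketch).
import numpy as np
from hmmlearn import hmm

train = np.array([0, 1] * 10).reshape(-1, 1)
query = np.array([0, 1, 0, 1]).reshape(-1, 1)

for n_states in (2, 3):
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=100, random_state=0)
    model.fit(train)
    viterbi_logprob, path = model.decode(query)  # best path and its log-probability
    print(n_states, "states -> path", path,
          "log P(best path, query) =", round(viterbi_logprob, 3))
```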
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC) or the Minimum Message Length (MML) criterion, or, if you just want a rough idea, k-means clustering combined with the percentage of variance explained. The first three criteria are interesting because they include a penalty term that grows with the number of parameters in the model.
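As a hedged sketch, BIC and AIC can be computed by hand from the log-likelihood hmmlearn returns (recent hmmlearn versions also expose `bic()`/`aic()` methods). The parameter count below is the standard one for a discrete HMM with N states and M symbols: (N-1) start probabilities, N(N-1) transition probabilities and N(M-1) emission probabilities:

```python
# Select the number of states by BIC/AIC on the training data (hmmlearn sketch).
import numpy as np
from hmmlearn import hmm

train = np.array([0, 1] * 50).reshape(-1, 1)
n_obs, n_symbols = len(train), 2

for n_states in (1, 2, 3, 4):
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=200, random_state=0)
    model.fit(train)
    log_l = model.score(train)
    # Free parameters of a discrete HMM: start probs + transitions + emissions.
    k = (n_states - 1) + n_states * (n_states - 1) + n_states * (n_symbols - 1)
    bic = k * np.log(n_obs) - 2 * log_l
    aic = 2 * k - 2 * log_l
    print(f"{n_states} states: logL={log_l:.1f}  BIC={bic:.1f}  AIC={aic:.1f}")
```

The state count with the lowest BIC/AIC is the one to keep; on alternating A-B data like this, you would expect 2 states to win.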
Hope it helps! :)