One set of methods for applying HMMs to gesture recognition would be to apply a similar architecture as commonly used for speech recognition.
The HMM would not be over space but over time, and each video frame (or set of extracted features from the frame) would be an emission from an HMM state.
Unfortunately, HMM-based speech recognition is a rather large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" (http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false) then following the references from there. Another resource is the CMU sphinx webpage (http://cmusphinx.sourceforge.net).
Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches like conditional random fields or max-margin recognizers (e.g. SVM-struct).
For an HMM-based recognizer the overall training process is usually something like the following:
1) Perform some sort of signal processing on the raw data
- For speech this would involve converting raw audio into mel-cepstrum format, while for gestures, this might involve extracting image features (SIFT, GIST, etc.)
2) Apply vector quantization (VQ) (other dimensionality reduction techniques can also be used) to the processed data
- Each cluster centroid is usually associated with a basic unit of the task. In speech recognition, for instance, each centroid could be associated with a phoneme. For a gesture recognition task, each VQ centroid could be associated with a pose or hand configuration.
3) Manually construct HMMs whose state transitions capture the sequence of different poses within a gesture.
Emission distributions of these HMM states will be centered on the VQ vector from step 2.
In speech recognition these HMMs are built from phoneme dictionaries that give the sequence of phonemes for each word.
4) Construct an single HMM that contains transitions between each individual gesture HMM (or in the case of speech recognition, each phoneme HMM). Then, train the composite HMM with videos of gestures.
- It is also possible at this point to train each gesture HMM individually before the joint training step. This additional training step may result in better recognizers.
For the recognition process, apply the signal processing step, find the nearest VQ entry for each frame, then find a high scoring path through the HMM (either the Viterbi path, or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.