HMM algorithm for gesture recognition

I want to develop an app for gesture recognition using Kinect and hidden Markov models. I watched a tutorial here: HMM lecture

But I don't know how to start. What should the state set be, and how should the data be normalized to make HMM learning possible? I know (more or less) how it would be done for signals and for simple "left-to-right" cases, but 3D space makes me a little confused. Could anyone describe how to begin?

Could anyone describe the steps for doing this? In particular, I need to know how to build the model and what the steps of the HMM algorithm should be.

Celloidin answered 28/1, 2013 at 22:32 Comment(0)

One way of applying HMMs to gesture recognition is to use an architecture similar to the one commonly used for speech recognition.

The HMM would not be over space but over time, and each video frame (or the set of features extracted from it) would be an emission from an HMM state.
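As a concrete illustration, here is a minimal sketch (Python with NumPy, both assumed) of what "one emission per frame" means in practice; extract_features is a hypothetical placeholder for whatever per-frame descriptor you choose:

```python
import numpy as np

def extract_features(frame):
    """Hypothetical per-frame descriptor; a real system would use joint
    positions, SIFT/GIST descriptors, etc. Here: a normalized flattening."""
    v = frame.astype(np.float64).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def video_to_observations(frames):
    """Map a video (one array per frame) to a (T, D) observation sequence;
    row t is the emission the HMM must explain at time t."""
    return np.stack([extract_features(f) for f in frames])

# Toy usage: 30 frames of 8x8 depth patches -> 30 emissions of dimension 64.
frames = np.random.rand(30, 8, 8)
obs = video_to_observations(frames)
print(obs.shape)  # (30, 64)
```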

Unfortunately, HMM-based speech recognition is a rather large area. Many books and theses have been written describing different architectures. I recommend starting with Jelinek's "Statistical Methods for Speech Recognition" (http://books.google.ca/books?id=1C9dzcJTWowC&pg=PR5#v=onepage&q&f=false) then following the references from there. Another resource is the CMU sphinx webpage (http://cmusphinx.sourceforge.net).

Another thing to keep in mind is that HMM-based systems are probably less accurate than discriminative approaches like conditional random fields or max-margin recognizers (e.g. SVM-struct).

For an HMM-based recognizer, the overall training process is usually something like the following (a toy sketch of these steps appears after the list):

1) Perform some sort of signal processing on the raw data

  • For speech this would mean converting raw audio into mel-cepstral (MFCC) features; for gestures, it might involve extracting image features (SIFT, GIST, etc.)

2) Apply vector quantization (VQ) to the processed data; other dimensionality-reduction techniques can also be used.

  • Each cluster centroid is usually associated with a basic unit of the task. In speech recognition, for instance, each centroid could be associated with a phoneme. For a gesture recognition task, each VQ centroid could be associated with a pose or hand configuration.

3) Manually construct HMMs whose state transitions capture the sequence of different poses within a gesture.

  • The emission distributions of these HMM states will be centered on the VQ centroids from step 2.

  • In speech recognition these HMMs are built from phoneme dictionaries that give the sequence of phonemes for each word.

4) Construct a single HMM that contains transitions between each individual gesture HMM (or, in the case of speech recognition, each phoneme HMM). Then train the composite HMM with videos of gestures.

  • It is also possible at this point to train each gesture HMM individually before the joint training step. This additional training step may result in better recognizers.
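By way of illustration, here is a self-contained toy sketch of steps 2-4 in Python/NumPy. The cluster count, gesture names, self-loop and switch probabilities are all illustrative assumptions, and the actual parameter training (Baum-Welch/EM) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    """Step 2: plain k-means VQ; returns centroids and per-frame labels."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def left_to_right_hmm(n_states, self_loop=0.7):
    """Step 3: transitions for one gesture; each pose state either repeats
    (holding the pose) or advances to the next pose; the last state absorbs."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s], A[s, s + 1] = self_loop, 1.0 - self_loop
    A[-1, -1] = 1.0
    return A

def composite_hmm(gesture_hmms, switch=0.1):
    """Step 4: block-diagonal composite; the final state of each gesture can
    jump to the start of any gesture with total probability `switch`."""
    sizes = [A.shape[0] for A in gesture_hmms]
    starts = np.cumsum([0] + sizes[:-1])
    big = np.zeros((sum(sizes), sum(sizes)))
    for A, s0, n in zip(gesture_hmms, starts, sizes):
        big[s0:s0 + n, s0:s0 + n] = A
    for s0, n in zip(starts, sizes):
        end = s0 + n - 1
        big[end, end] = 1.0 - switch
        for t0 in starts:
            big[end, t0] += switch / len(starts)
    return big, starts

# Toy usage: VQ 500 random "frames", then build a 2-gesture composite HMM.
# Emission model (not shown): e.g. a Gaussian per state, centered on a VQ centroid.
features = rng.random((500, 64))
centroids, labels = kmeans(features, k=8)   # 8 pose centroids
wave = left_to_right_hmm(3)                 # gesture 1: 3 poses
push = left_to_right_hmm(4)                 # gesture 2: 4 poses
A, starts = composite_hmm([wave, push])
assert np.allclose(A.sum(axis=1), 1.0)      # every row is a distribution
```

The block-diagonal structure keeps each gesture's left-to-right topology intact while letting the decoder move between gestures only at gesture boundaries.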

For the recognition process, apply the signal-processing step, find the nearest VQ entry for each frame, then find a high-scoring path through the HMM (either the Viterbi path, or one of a set of paths from an A* search) given the quantized vectors. This path gives the predicted gestures in the video.
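A minimal sketch of that decoding step, assuming discrete (VQ-index) emissions; the log_pi/log_A/log_B parameters are assumed to come from a trained model like the one sketched above:

```python
import numpy as np

def quantize(features, centroids):
    """Map each frame's feature vector to its nearest VQ centroid index."""
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely state path for a discrete-emission HMM, in log space.
    log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, V) emission log-probs; obs: sequence of VQ indices."""
    S, T = log_A.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)   # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # prev state -> current state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):     # trace the best path backwards
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Mapping each state index in path back to the gesture block that contains it (e.g. via the starts offsets from the earlier sketch) yields the predicted gesture sequence for the video.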

Iambic answered 4/2, 2013 at 20:34 Comment(3)
"probably less accurate than discriminative approaches" -- Wikipedia: "Despite the fact that discriminative models do not need to model the distribution of the observed variables, they cannot generally express more complex relationships between the observed and target variables. They don't necessarily perform better than generative models at classification and regression tasks.". I think your point is misleading, could you clarify? I am thinking the skeletonization with different methods such as generative and discriminatory.Autobahn
It is pretty generally accepted in the machine learning community that unless you have strong prior knowledge about a problem that you can incorporate into a generative model, discriminative approaches are better for tasks like classification and labeling because 1) the discriminative objective better matches the task and 2) it is easier to incorporate a wide variety of feature types into discriminative models.Iambic
Here is some literature: speech recognition - cseweb.ucsd.edu/~saul/papers/lmb08_cdhmm.pdf (this has a good review of different training techniques); NLP - cs.brown.edu/~th/papers/AltTsoHof-ICML2003.pdf; character recognition, NLP parsing, and machine translation - seas.upenn.edu/~taskar/pubs/max-margin-acl05-tutorial.pdfIambic

I implemented the 2D version of this for the Coursera PGM class, which has Kinect gestures as the final unit.

https://www.coursera.org/course/pgm

Basically, the idea is that an HMM alone can't decide poses very well. In our unit, I used a variation of k-means to segment the poses into probabilistic categories. The HMM was then used to decide which sequences of poses were viable as gestures. But any clustering algorithm run on a set of poses is a good candidate, even if you don't know in advance what kinds of poses they represent.

From there you can create a model that trains on the aggregate probabilities of each possible pose at each point of Kinect data.
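A rough sketch of those two pieces, under stated assumptions (Python/NumPy; the softmax temperature and smoothing constant are illustrative, not values from the course):

```python
import numpy as np

def soft_pose_probs(X, centroids, temp=1.0):
    """Probabilistic pose assignment: a softmax over negative distances to
    the k-means centroids, a rough stand-in for proper cluster posteriors."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    logits = -d / temp
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)       # (T, k): pose probs per frame

def pose_transition_matrix(label_seqs, k, smoothing=1.0):
    """Which pose sequences are viable as gestures: a transition matrix over
    pose clusters, estimated by counting with add-one style smoothing."""
    A = np.full((k, k), smoothing)
    for seq in label_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            A[a, b] += 1.0
    return A / A.sum(axis=1, keepdims=True)

# Toy usage: 3 pose clusters; two example gesture recordings as label sequences.
A = pose_transition_matrix([[0, 0, 1, 2], [0, 1, 1, 2]], k=3)
print(A.round(2))  # rows sum to 1; unseen transitions keep small probability
```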

I know this is a bit of a sparse answer. That class gives an excellent overview of the state of the art, but the problem in general is too difficult to be condensed into an easy answer. (I'd recommend taking it in April if you're interested in this field.)

Piddle answered 28/1, 2013 at 22:41 Comment(1)
I signed up, thank you. But I will still look for an answer and try to solve my problem.Celloidin
