Determine skeleton joints with a webcam (not Kinect)
H

8

28

I'm trying to determine skeleton joints (or at the very least to be able to track a single palm) using a regular webcam. I've looked all over the web and can't seem to find a way to do so.

Every example I've found is using Kinect. I want to use a single webcam.

There's no need for me to calculate the depth of the joints - I just need to be able to recognize their X, Y position in the frame. Which is why I'm using a webcam, not a Kinect.

So far I've looked at:

  • OpenCV (the "skeleton" functionality in it is a process of simplifying graphical models, but it's not a detection and/or skeletonization of a human body).
  • OpenNI (with NiTE) - the only way to get the joints is to use the Kinect device, so this doesn't work with a webcam.

I'm looking for a C/C++ library (but at this point would look at any other language), preferably open source (but, again, will consider any license) that can do the following:

  • Given an image (a frame from a webcam) calculate the X, Y positions of the visible joints
  • [Optional] Given a video capture stream call back into my code with events for joints' positions
  • Doesn't have to be super accurate, but would prefer it to be very fast (sub-0.1 sec processing time per frame)

Would really appreciate it if someone can help me out with this. I've been stuck on this for a few days now with no clear path to proceed.

UPDATE

2 years later a solution was found: http://dlib.net/imaging.html#shape_predictor

Hirundine answered 15/6, 2013 at 13:54 Comment(11)
This is really difficult with a single webcam, even more so in real time. Hence the Kinect. To only track a single palm you should be able to modify this real time tracker to do the job: www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm. It works really well and their C++ code uses OpenCV.Peccadillo
This is not a StackOverflow kind of question, is it?Anticyclone
It would help if you would give a little bit more context, so we have an idea why it should absolutely not involve Kinect (and maybe suggest a viable alternative within the bounds of this context)Votive
Since you're using an infrared camera, I imagine you have infrared LEDs somewhere?Silicify
Hi, I just want to ask if you've been able to proceed with this. Currently I am also looking at skeletonization but can't use OpenNI or any other NI libraries targeted for Kinect use. Currently we've been able to proceed with our project using image processing and analysis based on data collected but I'd rather have skeleton tracking moving forward.Carob
So far... no :( The only thing that even came close (based on claims) was XTR3D, but they failed to deliver. Failed so miserably... Their code wouldn't even launch, and tech support was not only less than useful but turned out to be extremely rude and dishonest. Personally I vowed to never deal with that company again.Hirundine
@YePhick Hi, I work at Extreme Reality as an Algorithms engineer. We have noticed your comment and we are sorry for your bad experience. Please feel free to download our SDK for multiple platforms here (xtr3d.com/developers/sdk-download) and contact [email protected] for any issue that may occur. We would love to help you out. YonatanPsychoanalysis
@YonatanSimson thank you for your attention. I suppose it has been almost 2 years since then and the horrible aftertaste has dulled down a bit. I'll give it a go :)Hirundine
Downloaded, installed, tried to compile the C++ sample (CConsoleSample) - failed for both Debug and Release (using MSVC 2015), uninstalled, manually cleaned up the clutter left behind. Vowed to never deal with XTR3D again. Thanks, but no thanks.Hirundine
Currently our SDK doesn’t support vs2015, but nevertheless when building after the default installation of vs2015 I got an error - fatal error RC1015: cannot open include file 'afxres.h'. A quick Google search told me I had to install MFC for C++ (Programming Languages -> Visual C++ -> Microsoft Foundation Classes for C++), which I did, and the sample compiled without any more problems and ran.Psychoanalysis
I'm glad it worked for you. I have MFC installed (with the sources, too) and it didn't work for me. And considering the amount of time I have already wasted in the past I'm not going to take anything less than an effort-less process. I'm sorry to be such a pain but I'm trying to be as polite and as cooperative here as I can and avoiding the detailed account of the full range of frustration I have experienced when dealing with the XTR3D in the past.Hirundine
H
2

At last I've found a solution. It turns out the open-source dlib project has a "shape predictor" that, once properly trained, does exactly what I need: it guesstimates the "pose" with pretty satisfactory accuracy. A "pose" is loosely defined as "whatever you train it to recognize as a pose": you train it with a set of images annotated with the shapes to extract from them.

The shape predictor is described on dlib's website: http://dlib.net/imaging.html#shape_predictor
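
For anyone wondering what using a trained predictor looks like, here is a minimal sketch with dlib's Python bindings. It assumes you already have a trained model file and a bounding box around the object of interest; the helper names and the idea of passing the box as a plain tuple are my own choices for illustration, not something from dlib's documentation.

```python
def landmarks_to_points(shape):
    """Turn a dlib full_object_detection into a list of (x, y) pixel tuples."""
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


def load_predictor(model_path):
    """Load a trained shape predictor (model_path is whatever file you trained)."""
    import dlib  # imported lazily so the helper above works without dlib installed
    return dlib.shape_predictor(model_path)


def predict_points(predictor, image, box):
    """Run the predictor inside box = (left, top, right, bottom) on an image array."""
    import dlib
    left, top, right, bottom = box
    shape = predictor(image, dlib.rectangle(left, top, right, bottom))
    return landmarks_to_points(shape)
```

Load the predictor once and reuse it for every frame. The predictor only refines landmark positions inside the box you give it, so you still need some detector or tracker to supply that box.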

Hirundine answered 8/4, 2017 at 22:54 Comment(2)
There are also pre-trained models available; for example, I used a frontal facial pose detector some time back.Fontenot
Definitely google first to find out whether a model is already available that does what you want it to do. Essentially, it's just trained feature weights.Fontenot
A
19

Tracking a hand with a single camera and no depth information is a hard problem and a topic of ongoing research. Here are some interesting and/or highly cited papers on the topic:

  • M. de La Gorce, D. J. Fleet, and N. Paragios, “Model-Based 3D Hand Pose Estimation from Monocular Video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, Feb. 2011.
  • R. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Transactions on Graphics (TOG), 2009.
  • B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, “Model-based hand tracking using a hierarchical Bayesian filter,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1372–84, Sep. 2006.
  • J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated objects,” in Proceedings of IEEE International Conference on Computer Vision, 1995, pp. 612–617.

Hand tracking literature survey in the 2nd chapter:

  • T. de Campos, “3D Visual Tracking of Articulated Objects and Hands,” 2006.

Unfortunately, I don't know of any freely available hand tracking library.

Academicism answered 18/6, 2013 at 14:21 Comment(5)
I do not require depth information, only the pixel position (or the center) of an object in the camera's view.Hirundine
Tracking an articulated 3D object, including the positions of its joints, is to my knowledge usually done by recovering the complete 3D position and orientation. In other words, you get the depth as well, even when you don't need it.Auctioneer
What you are describing requires stereo vision, which is not what I have listed in the requirements (a single webcam)Hirundine
I thought that all of them were using a single camera, but some multi-camera papers slipped in by mistake. I removed one that used multiple cameras and marked the thesis by Campos, which includes a possibly helpful literature survey. The rest really are single-view reconstructions of the hand pose and orientation. But the implementation would be hard and the performance may be unsatisfactory for your application.Auctioneer
Due to current constraints I'm looking for an implemented solution that is ready-to-useHirundine
D
9

There is a simple way of detecting a hand using skin tone. Perhaps this could help... you can see the results in this YouTube video. Caveat: the background shouldn't contain skin-colored things like wood.

Here is the code:

''' Detect human skin tone and draw a boundary around it.
Useful for gesture recognition and motion tracking.

Inspired by: https://mcmap.net/q/242844/-computer-vision-masking-a-human-hand

Date: 08 June 2013
'''

# Required modules
import cv2
import numpy

# Constants for finding range of skin color in YCrCb
min_YCrCb = numpy.array([0,133,77],numpy.uint8)
max_YCrCb = numpy.array([255,173,127],numpy.uint8)

# Create a window to display the camera feed
cv2.namedWindow('Camera Output')

# Get pointer to video frames from primary device
videoFrame = cv2.VideoCapture(0)

# Process the video frames
keyPressed = -1 # -1 indicates no key pressed

while(keyPressed < 0): # any key pressed has a value >= 0

    # Grab, decode and return the next video frame; stop if the camera yields nothing
    readSuccess, sourceImage = videoFrame.read()
    if not readSuccess:
        break

    # Convert image to YCrCb
    imageYCrCb = cv2.cvtColor(sourceImage,cv2.COLOR_BGR2YCR_CB)

    # Find region with skin tone in YCrCb image
    skinRegion = cv2.inRange(imageYCrCb,min_YCrCb,max_YCrCb)

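    # Do contour detection on skin region
    # [-2:] keeps this working whether findContours returns 2 values (OpenCV 2.x/4.x) or 3 (OpenCV 3.x)
    contours, hierarchy = cv2.findContours(skinRegion, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]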

    # Draw the contour on the source image
    for i, c in enumerate(contours):
        area = cv2.contourArea(c)
        if area > 1000:
            cv2.drawContours(sourceImage, contours, i, (0, 255, 0), 3)

    # Display the source image
    cv2.imshow('Camera Output',sourceImage)

    # Check for user input to close program
    keyPressed = cv2.waitKey(1) # wait 1 millisecond in each iteration of while loop

# Close window and camera after exiting the while loop
cv2.destroyWindow('Camera Output')
videoFrame.release()
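
To make the thresholding step less of a black box, here is a rough per-pixel sketch of what the color conversion plus cv2.inRange test amounts to. The conversion coefficients are the ones OpenCV documents for BGR to YCrCb on 8-bit images; the function names are made up for this illustration, and the sketch skips the rounding to uint8 that OpenCV performs, so values right at the range boundaries may differ.

```python
# Same skin-tone range as min_YCrCb / max_YCrCb in the code above
MIN_YCRCB = (0, 133, 77)
MAX_YCRCB = (255, 173, 127)


def bgr_to_ycrcb(b, g, r):
    """Approximate OpenCV's 8-bit BGR -> YCrCb conversion for one pixel (no rounding)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + 128
    cb = (b - y) * 0.564 + 128
    return y, cr, cb


def is_skin(b, g, r):
    """True if every YCrCb channel falls inside the skin-tone range."""
    channels = bgr_to_ycrcb(b, g, r)
    return all(lo <= v <= hi for v, lo, hi in zip(channels, MIN_YCRCB, MAX_YCRCB))
```

Note that the Y (brightness) channel is left unconstrained at 0 to 255, so the test depends only on the two chroma channels. That is why it tolerates lighting changes reasonably well but gets fooled by wood-colored backgrounds.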

cv2.findContours is quite useful: once you have the contours, you can find the centroid of a "blob" using cv2.moments. Have a look at the OpenCV documentation on shape descriptors.
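
As a sketch of what the centroid calculation does, here is a dependency-free version that works on a contour given as a list of (x, y) points. With OpenCV you would normally just call M = cv2.moments(c) and take (M['m10'] / M['m00'], M['m01'] / M['m00']); the function below is my own illustration of the same polygon formula, not OpenCV's code.

```python
def contour_centroid(points):
    """Centroid of the area enclosed by a closed polygon [(x, y), ...].

    Returns None for degenerate contours with zero area.
    """
    area2 = sx = sy = 0.0  # twice the signed area, and the weighted x / y sums
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        area2 += cross
        sx += (x0 + x1) * cross
        sy += (y0 + y1) * cross
    if area2 == 0:
        return None
    return (sx / (3.0 * area2), sy / (3.0 * area2))
```

For a palm, the centroid of the largest skin-colored contour gives a usable X, Y position, although it drifts toward the forearm when the arm is visible too.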

I haven't yet figured out how to make the skeletons that lie in the middle of the contour, but I was thinking of "eroding" the contours until only a single line is left. In image processing this process is called "skeletonization" or "morphological skeleton". Here is some basic info on skeletonization.

Here is a link that implements skeletonization in OpenCV and C++

Here is a link for skeletonization in OpenCV and Python
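
To show the "erode until a line is left" idea in one place, here is a small pure-Python sketch of the classic morphological skeleton: at each step, keep whatever the current shape loses when it is opened (eroded, then dilated), then erode and repeat. The set-of-pixels representation and the cross-shaped structuring element are assumptions made for readability; on real frames you would use cv2.erode and cv2.dilate on the binary mask instead.

```python
# Cross-shaped 3x3 structuring element, as (row, col) offsets
CROSS = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))


def erode(pixels):
    """Keep only pixels whose whole cross neighbourhood is foreground."""
    return {(r, c) for (r, c) in pixels
            if all((r + dr, c + dc) in pixels for dr, dc in CROSS)}


def dilate(pixels):
    """Grow the shape by one cross neighbourhood."""
    return {(r + dr, c + dc) for (r, c) in pixels for dr, dc in CROSS}


def morphological_skeleton(pixels):
    """Union, over successive erosions, of what each eroded shape loses when opened."""
    skeleton = set()
    current = set(pixels)
    while current:
        eroded = erode(current)
        opened = dilate(eroded)          # opening = erosion followed by dilation
        skeleton |= current - opened     # pixels the opening cannot restore
        current = eroded
    return skeleton


# A 3x7 horizontal bar: the skeleton contains the middle row plus the four corners
bar = {(r, c) for r in range(3) for c in range(7)}
skel = morphological_skeleton(bar)
```

Keep in mind that this yields a thin set of pixels, not named joints. It does not tell you which pixel is an elbow or a wrist, and the result can be disconnected.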

hope that helps :)

--- EDIT ----

I would highly recommend that you go through these papers by Deva Ramanan (scroll down after visiting the linked page): http://www.ics.uci.edu/~dramanan/

  1. C. Desai, D. Ramanan. "Detecting Actions, Poses, and Objects with Relational Phraselets" European Conference on Computer Vision (ECCV), Florence, Italy, Oct. 2012.
  2. D. Park, D. Ramanan. "N-Best Maximal Decoders for Part Models" International Conference on Computer Vision (ICCV) Barcelona, Spain, November 2011.
  3. D. Ramanan. "Learning to Parse Images of Articulated Objects" Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2006.
Diogenes answered 28/6, 2013 at 23:47 Comment(9)
Thank you, that was helpful. Unfortunately doesn't work for my needs - I am using a near-IR wavelength and it's much-much harder to predict the "color" of the background. As for the skeletonization - I have looked at it (see my initial post) and so far I don't have a good feeling about it in terms of translating a human outline into a skeleton. That probably only works if I stand with my legs and hands spread ;)Hirundine
Near-IR is interesting, but is there a special reason to use that range of the spectrum? A normal camera, I would suspect, should do the job. The alternative is to put "markers" on the joints that you are interested in and use a typical camera to detect them; using OpenCV you can draw a line between the detected points. There are also ways of obtaining 3D information from a single camera.Diogenes
@Hirundine some more info on articulated body parts added to answer :)Diogenes
The type of camera and the color information are very important. The fact that you're using a near-IR wavelength camera should be added to the original question.Silicify
@Diogenes yes, there's a specific reason I'm using the hardware that I'm using. Can't get into that though as I'm under NDA. In my case "markers" are best to be avoided - the solution should be generic enough to be independent of skin color recognition, of markers being put on joints, be fast, and not require Haar (or any similar) trainingHirundine
Do you have any idea to count hair using OpenCV?Gulden
Where did you get the values for min_YCrCb and max_YCrCb? Was it trial and error or did you read somewhere that those values work best?Southeastwards
@MattD the values for the thresholds were originally inspired from: https://mcmap.net/q/242844/-computer-vision-masking-a-human-handDiogenes
I had the following error: ValueError: too many values to unpack I fixed the error with this post #25505464Signora
L
2

The most common approach can be seen in the following youtube video. http://www.youtube.com/watch?v=xML2S6bvMwI

This method is not very robust, as it tends to fail if the hand is rotated too much (e.g., if the camera is looking at the side of the hand or at a partially bent hand).

If you do not mind using two cameras, you can look into the work of Robert Wang. His current company (3GearSystems) uses this technology, augmented with a Kinect, to provide tracking. His original paper uses two webcams but has much worse tracking.

Wang, Robert, Sylvain Paris, and Jovan Popović. "6d hands: markerless hand-tracking for computer aided design." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

Another option (again, if using more than a single webcam is possible) is to use an IR emitter. Your hand reflects IR light quite well, whereas the background does not. By adding a filter to the webcam that blocks visible light (and removing the standard filter that blocks IR), you can build quite effective hand tracking. The advantage of this method is that segmenting the hand from the background is much simpler. Depending on the distance and the quality of the camera, you may need more IR LEDs in order to reflect sufficient light back into the webcam. The Leap Motion uses this technology to track fingers and palms (it uses 2 IR cameras and 3 IR LEDs to also get depth information).

All that being said, I think the Kinect is your best option here. Yes, you don't need the depth, but the depth information does make it a lot easier to detect the hand (by using it for the segmentation).

Lindley answered 20/6, 2013 at 15:36 Comment(3)
Thank you for your suggestions, but I'm specifically looking for a non-Kinect solution. Very specifically :)Hirundine
Unfortunately, these don't exist within the parameters you've given.Lindley
@Lindley I'm pretty sure Adobe uses face tracking, and I think partial limb tracking, with only 1 webcam for Adobe AnimateErosion
T
2

My suggestion, given your constraints, would be to use something like this: http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html

Here is a tutorial for using it for face detection: http://opencv.willowgarage.com/wiki/FaceDetection?highlight=%28facial%29|%28recognition%29

The problem you have described is quite difficult, and I'm not sure that trying to do it using only a webcam is a reasonable plan, but this is probably your best bet. As explained here (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html?highlight=load#cascadeclassifier-load), you will need to train the classifier with something like this:

http://docs.opencv.org/doc/user_guide/ug_traincascade.html

Remember: Even though you don't require the depth information for your use, having this information makes it easier for the library to identify a hand.

Typescript answered 24/6, 2013 at 15:28 Comment(0)
C
0

I don't know of any existing solutions. If supervised (or semi-supervised) learning is an option, training decision trees or neural networks might already be enough (the Kinect uses random forests, from what I have heard). Before you go down that path, do everything you can to find an existing solution. Getting machine learning right takes a lot of time and experimentation.

OpenCV has machine learning components; what you would need is training data.

Cardialgia answered 24/6, 2013 at 13:34 Comment(1)
I've been playing with OpenCV's recognition components for a while now and have to say they tend to be quite bulky and not as accurate as I'd like them to be. Though so far that seems to be one of the very few viable options... Doesn't meet all the requirements I need, but at least comes somewhat closeHirundine
P
0

With the motion tracking features of the open source Blender project it is possible to create a 3D model based on 2D footage. No Kinect needed. Since Blender is open source, you might be able to use its Python scripts outside the Blender framework for your own purposes.

Putput answered 24/6, 2013 at 14:30 Comment(3)
That link to YouTube you put in here is jaw-dropping, truly amazing. But completely irrelevant to what I need :(Hirundine
It uses structure from motion. It uses the fact that the object you want to "scan" is at a location/orientation compared to the camera at each frame to estimate depths.Lindley
Once again - I don't need the depth (I do the depth myself using a different method), I just need to know "where" on the 2D image the object I'm looking for is :)Hirundine
T
0

Have you ever heard of EyesWeb?

I have been using it for one of my projects and I thought it might be useful for what you want to achieve. Here are some interesting publications: "LNAI 3881 - Finger Tracking Methods Using EyesWeb" and "Powerpointing-HCI using gestures".

Basically the workflow is:

  1. You create your patch in EyesWeb
  2. Prepare the data you want to send with a network client
  3. Use the processed data on your own server (your app)

However, I don't know if there is a way to embed the real-time image processing part of EyesWeb into your own software as a library.

Tannatannage answered 1/7, 2013 at 0:26 Comment(0)
