Using ofxCV or ofxOpenCV for hand tracking?

Hi guys, I was wondering if anyone has used either of the two computer vision addons I mentioned (ofxCv or ofxOpenCv) to do hand tracking? I have an Xcode project that uses ofxBox2d with repulsion/attraction forces on a joint driven by the mouse location, and I want to replace the mouse with some point on my hand. Are there recommended approaches to this? I know using a Kinect might be ideal, but I want to know how to do it without one.


There are several approaches using ofxCv and/or ofxOpenCv. Most of them require first extracting the hand contour (aka blob) from the background. Then, depending on the quality of the contour, it can be analyzed to calculate its centroid (aka its geometric center), which can stand in for a mouse position.
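To make the centroid idea concrete, here is a minimal sketch of the arithmetic behind it, independent of either addon (the function name `centroid` and the row-major 0/255 mask layout are my own choices; in practice ofxOpenCv's blob finder and ofxCv's `ContourFinder` give you this for free):

```cpp
#include <utility>
#include <vector>

// Compute the centroid (geometric center) of all "on" pixels in a
// binary mask stored row-major as 0/255 values. Returns {-1, -1}
// when the mask is empty so the caller can skip the frame.
std::pair<float, float> centroid(const std::vector<unsigned char>& mask,
                                 int width, int height) {
    long sumX = 0, sumY = 0, count = 0;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (mask[y * width + x] > 0) {
                sumX += x;
                sumY += y;
                ++count;
            }
        }
    }
    if (count == 0) return {-1.f, -1.f};
    return {static_cast<float>(sumX) / count,
            static_cast<float>(sumY) / count};
}
```

The resulting (x, y) is what you would feed into your ofxBox2d attraction/repulsion forces in place of `ofGetMouseX()` / `ofGetMouseY()`.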

More sophisticated contour analysis techniques can be used to determine concavities (these may represent the spaces between fingers) and convexities (these might represent the fingertips) (e.g. this video or just about any of these). Further analysis can be done to verify hand orientation by fitting the contours to a hand model. But that is a bit more sophisticated and at this point is probably better done with 3D sensing devices (e.g. this work presented at CHI earlier this year) rather than RGB cameras.
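As a rough sketch of the convexity side of that analysis, here is a plain convex hull (Andrew's monotone chain) with no dependencies: hull vertices are fingertip candidates, and the contour points that dip farthest inside the hull approximate the gaps between fingers. In OpenCV proper this is `cv::convexHull` plus `cv::convexityDefects`; this only shows the hull step.

```cpp
#include <algorithm>
#include <vector>

struct Pt { long x, y; };

// 2D cross product of OA and OB; sign gives the turn direction.
static long cross(const Pt& o, const Pt& a, const Pt& b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// Andrew's monotone-chain convex hull over a set of contour points.
// Builds the lower hull, then the upper hull, dropping any point that
// makes a non-counterclockwise turn.
std::vector<Pt> convexHull(std::vector<Pt> pts) {
    std::sort(pts.begin(), pts.end(), [](const Pt& a, const Pt& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
    std::vector<Pt> hull;
    for (int pass = 0; pass < 2; ++pass) {
        size_t start = hull.size();
        for (const Pt& p : pts) {
            while (hull.size() >= start + 2 &&
                   cross(hull[hull.size() - 2], hull.back(), p) <= 0) {
                hull.pop_back();
            }
            hull.push_back(p);
        }
        hull.pop_back(); // last point repeats as first of next pass
        std::reverse(pts.begin(), pts.end());
    }
    return hull;
}
```

Counting deep convexity defects (contour points far from the nearest hull edge) is a cheap way to estimate how many fingers are extended.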

Anyway, to track a hand with a normal RGB camera, the first step is usually to isolate the hand from the background. This can be done by a standard background subtraction and thresholding operation (see examples/addons/opencvExample or ofxCv/example-background for example) to produce a binary image. The resulting binary image is then passed to a contour finder for connected-component analysis resulting in a set of contours (aka blobs).
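A minimal sketch of that subtraction-plus-threshold step on raw pixel buffers (this is essentially what `ofxCvGrayscaleImage::absDiff` followed by `threshold` does, or `ofxCv::absdiff` / `ofxCv::threshold`; the function name here is mine):

```cpp
#include <cstdlib>
#include <vector>

// Frame-differencing background subtraction: take the per-pixel
// absolute difference between the current grayscale frame and a
// stored background frame, then threshold it into a 0/255 binary
// image suitable for a contour finder.
std::vector<unsigned char> subtractAndThreshold(
        const std::vector<unsigned char>& frame,
        const std::vector<unsigned char>& background,
        int thresh) {
    std::vector<unsigned char> binary(frame.size());
    for (size_t i = 0; i < frame.size(); ++i) {
        int diff = std::abs(int(frame[i]) - int(background[i]));
        binary[i] = diff > thresh ? 255 : 0;
    }
    return binary;
}
```

In a real app you would capture the background frame once at startup (or keep a running average) and tune `thresh` to your lighting.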

While there are color-based methods of isolating the hand from the background (see examples of skin-color based segmentation here), these can be less effective in real-world installation settings when presented with a wider range of skin tones.
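For illustration only, here is one commonly cited fixed-threshold RGB skin rule (often attributed to Peer et al.); the exact constants are from that literature, not something I tuned, and as noted above rules like this break down across skin tones and lighting:

```cpp
#include <algorithm>

// Classic fixed-threshold RGB skin heuristic: skin pixels tend to
// have R dominant over G and B within certain margins. A rough
// illustration of the approach, not a robust detector.
bool isSkinRGB(unsigned char r, unsigned char g, unsigned char b) {
    unsigned char mx = std::max({r, g, b});
    unsigned char mn = std::min({r, g, b});
    int rgDiff = (r > g) ? (r - g) : (g - r);
    return r > 95 && g > 40 && b > 20 &&
           (mx - mn) > 15 && rgDiff > 15 &&
           r > g && r > b;
}
```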

In the end, the most reliable way of tracking hands (or anything for that matter) is to reduce the “noise” (i.e. the background) in your “signal” (i.e. the hand) as much as possible. One of the easiest ways to do this is to pick a camera orientation that results in a high-contrast, homogeneous, stable background (think of a camera pointed down onto a monochromatic surface so that the hand is in sharp contrast to the background).

Another way is the “shadow puppet” approach. A shadow is very high contrast, and sometimes a simple threshold makes background subtraction unnecessary. To avoid shadows cast by light in the visible spectrum, shadows are often cast using near-infrared light. Shadows can be cast by IR LEDs (or flood lights with dark red or Wratten gel filters) and “sensed” by cameras with no IR filters. Many cheap webcams such as the PS3 Eye can be modified to remove the IR filter. IR illuminators can be made or found on eBay or in stores that sell closed-circuit surveillance cameras. I found the diagrams (e.g. 1 2 3) and technical rider of @zach and @golan’s Messa di Voce to be very instructive early on. Anyway, all of these simple (and not-so-simple) tricks have traditionally been used to make it easier to remove the background.

But all of that became much, much easier with the Kinect and subsequent RGBD cameras. They were cheap and all but solved the background subtraction problem (for most smaller-scale applications at least) by providing a clean “depth” image that could be separated into background and foreground based on physical location (i.e. typically, things that are farther away from the depth camera are in the background and things that are closer are in the foreground). Thus, a simple thresholding operation can say that only things within 1 meter of the sensor are in the foreground.

That said, depth can be determined (with pretty decent results) using stereo RGB cameras (see this for example). Previously, high-res/high-speed synchronized stereo camera rigs were super expensive and computationally intensive, so they weren’t used all that much in art contexts. But hacked PS3 Eye cameras can be synchronized to give pretty nice results (much of this RGB camera hacking was pioneered by various members of the oF community who spent a lot of time figuring that stuff out 5 or 6 years ago). Needless to say, the PS3 Eye is still alive and well in the oF community, and the cameras are now super inexpensive. Now that they are so cheap, some of us have recently purchased hundreds of them for various experiments :wink:
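For the curious, the core geometry behind stereo depth is tiny: for rectified cameras, depth falls out of the pixel disparity between the two views. A sketch (the numbers in the comment are hypothetical, not measured PS3 Eye parameters; the heavy lifting in real stereo is the matching that produces the disparity in the first place, e.g. OpenCV's `cv::StereoBM`):

```cpp
// Stereo depth from disparity: for rectified cameras with focal
// length f (pixels) and baseline B (meters), a feature offset by
// d pixels between the two views lies at depth Z = f * B / d.
float depthFromDisparity(float focalPx, float baselineM, float disparityPx) {
    if (disparityPx <= 0.f) return -1.f; // no match / at infinity
    return focalPx * baselineM / disparityPx;
}
```

Note the hyperbolic falloff: halving the disparity doubles the estimated depth, which is why stereo precision degrades quickly with distance.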

Anyway … hopefully that isn’t too much information. It’s been a lot of fun to watch how these things have changed since oF got started.


@bakercp – this has been extremely helpful. I knew that Kinect made things easier but did not know much about the process behind it all. Thanks; will definitely keep this in mind.


I’ve made a project where I had to track hands bottom-up with a Kinect.

I just put the code on GitHub. It’s quite a mess, but the tracking is programmed in its own class and you should be able to extract the important parts. The code also includes mapping to real-world points, forwarding of the contour over OSC to another program, and extraction and filtering of the movement into movement qualities.
The idea behind my code is described in the papers:

The tracking is inspired by and based on the work of KinectArms and on Wei-chao Chen (陳威詔)’s master’s thesis “Real-Time Palm Tracking and Hand Gesture Estimation Based on Fore-Arm Contour”.

Hope it helps.

P.S.: I use both ofxOpenCv and ofxCv, but that’s because I switched during development and never cleaned it up.
If I get time in the near future I’ll try to clean it.