I’m starting on a project soon, and I’d appreciate your opinions on the best way to solve the problem of tracking kids dancing.
The camera (a Kinect … thanks everyone for ofxKinect!) will be mounted up high pointing more or less straight down, and the depth will be calibrated to ‘see’ from their head to mid-torso. I don’t need to track specific moves/gestures, just whether they are moving rapidly enough to decipher, yes, they are dancing.
The development time for this is really short (2-3 days) so I need to come up with a quick but robust solution. My planned approach is to simply calculate the rate of change of the size of the bounding box of each blob, and then use a set of ranges to specify the vigorousness of their dancing based on this rate of change.
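To make the idea concrete, here's a minimal sketch of that rate-of-change classifier. The function name and the thresholds are made up; they'd need tuning against real footage, and in the app the areas would come from ofxOpenCV blob bounding boxes.

```cpp
#include <cmath>
#include <string>

// Hypothetical sketch: classify dancing "vigour" from the frame-to-frame
// rate of change of a blob's bounding-box area. Thresholds are placeholders.
std::string classifyVigour(float prevArea, float currArea, float dt) {
    // Normalise by the previous area so the measure is size-independent
    // (a small kid and a tall kid get comparable numbers).
    float rate = std::fabs(currArea - prevArea) / (prevArea * dt);
    if (rate < 0.1f) return "idle";
    if (rate < 0.5f) return "swaying";
    return "dancing";
}
```

Smoothing the rate over a few frames (a running average) would probably help, since bounding boxes from a contour finder jitter a lot frame to frame.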
So, I’m interested to see how you guys would expand on this to be able to keep track of individual kids when they move close enough together that openCV sees them as one blob. By tracking their heads as well? Maybe… but the camera is looking at the tops of their heads so there won’t be any facial features to track.
I’m going to look into ofxDelaunay, as perhaps a better approach is to track the mass of the contour rather than the bounding box. I figure ofxDelaunay would help with this - but maybe I can work out the contour mass without it? Still, that may not solve keeping the kids’ blobs separated.
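For what it's worth, the contour "mass" (area) can be computed directly with the shoelace formula, with no triangulation needed. A sketch, where `Pt` is a stand-in for whatever point type the contour finder gives you (e.g. the `pts` of an ofxCvBlob):

```cpp
#include <cmath>
#include <vector>

struct Pt { float x, y; };

// Area of a closed polygonal contour via the shoelace formula.
// No ofxDelaunay required; works on the raw contour points.
float contourArea(const std::vector<Pt>& pts) {
    float a = 0.0f;
    for (size_t i = 0; i < pts.size(); ++i) {
        const Pt& p = pts[i];
        const Pt& q = pts[(i + 1) % pts.size()]; // wrap to close the loop
        a += p.x * q.y - q.x * p.y;
    }
    return std::fabs(a) * 0.5f;
}
```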
I’d be really interested in hearing your opinions on how best to approach this problem from a newcomer’s perspective, as I’m sure many other ‘rookies’ like myself would. I’m researching and testing many approaches myself, so I’m not looking for the code on a plate - just some good guidance from people who know a lot more about this than me.
Fastest way would probably be using OpenNI: it will track every kid out of the box if you can use the calibration pose before the dance starts
Thanks for your reply naus3a - I’ve checked out openNI, it looks awesome but I don’t think it will work in this situation for 2 reasons:
it’s a store front installation, so kids will just be passing by and jumping in, no time for posing/calibrating
the camera is seeing a birds-eye-view, so openNI will never have a full body to track, only head/shoulders/arms
I know Kyle (kylemcdonald) was looking into whether you can store one person’s calibration data for use on someone else, but I’m not sure he got anywhere with it. I’ll get it installed later today and let you know how I get on (will it even work from a birds-eye view… I’ll find out)
Any other suggestions using functionality available with ofxOpenCV?
Hmm I might be oversimplifying…but maybe take advantage of people’s natural actions as they walk past?
So normally people don’t raise their arms above their heads when walking, right? Since you have a depth image, if you see a couple of fast-moving objects that are past a certain threshold, it could indicate people raising their arms and dancing. This also assumes the Kinect is stationary the entire time.
It could get tricky to differentiate a couple of very tall people just walking by from small children jumping around, though.
Not sure there’s a very simple solution, sounds to me like something you’d just have to experiment with and hack together some heuristics that work 90% of the time. Sounds do-able for sure.
Just capture that depth map and only register pixels above (or below) some depth threshold from the Kinect. Then have ofxOpenCV detect blobs in that image, count them, calculate a centre for each, and save that as a ‘reference’. Then swap out that ‘reference’ frame maybe every couple of seconds, or until the blobs disappear, and meanwhile compare changes between frames to determine the velocity of those centre-points for the blobs.
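The two pieces of that pipeline can be sketched roughly like this; names and the millimetre depth threshold are assumptions, and in practice the mask would be handed to ofxOpenCV's contour finder:

```cpp
#include <cmath>
#include <vector>

struct Centre { float x, y; };

// Keep only pixels nearer than `threshold` (Kinect raw depth is roughly
// millimetres); 0 means "no reading" so it's excluded too.
std::vector<unsigned char> thresholdDepth(const std::vector<unsigned short>& depth,
                                          unsigned short threshold) {
    std::vector<unsigned char> mask(depth.size());
    for (size_t i = 0; i < depth.size(); ++i)
        mask[i] = (depth[i] > 0 && depth[i] < threshold) ? 255 : 0;
    return mask;
}

// Speed of a blob centre between the saved 'reference' frame and now.
float centreSpeed(const Centre& ref, const Centre& now, float dt) {
    float dx = now.x - ref.x, dy = now.y - ref.y;
    return std::sqrt(dx * dx + dy * dy) / dt;
}
```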
Things moving fast = dancing…? Then again, this will only tell you that something resembling dancing has occurred, not really who might be doing it and how. That will be more work.
What nemik is suggesting is how I’d approach this: track blobs - even the canonical ofxCvBlob tracker should work fine with an appropriate ‘time to live’ for the space you’re tracking in - and then add an additional layer of info from the depth map. A blob growing but maintaining the same centroid indicates someone putting their arms out; from there you could begin to track orientation, and that’s where dancing can come from: depth changing + movement at the extremities is dancing. Depth mapping from the Kinect helps a lot because it allows you to filter out shadows/noise (and non-midget adults I suppose, if you want).
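That "growing blob, stationary centroid" test is simple to express. A sketch with made-up thresholds (grew by more than 30% while the centroid drifted less than 5 px):

```cpp
#include <cmath>

// Heuristic sketch: a blob whose area grows while its centroid stays put
// suggests arms going out, rather than someone walking through the frame
// (which moves the centroid). Thresholds are placeholders to be tuned.
bool looksLikeArmsOut(float prevArea, float currArea,
                      float cx0, float cy0, float cx1, float cy1) {
    float growth = currArea / prevArea;                // relative area change
    float drift  = std::hypot(cx1 - cx0, cy1 - cy0);   // centroid movement, px
    return growth > 1.3f && drift < 5.0f;
}
```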
Thanks for the input guys - it’s reassuring to know I’m on the right path. Now to get down with implementation, will let you know how I get on.
Regarding the ‘time to live’, do you mean a buffer of the n previous frames that I compare the current one to, the length of that buffer being the ‘time to live’? … just want to get my head straight on that. Thanks again…
ps. Josh, good to hear from you, we haven’t spoken for ages… hope you are well mate!
Yeah, time to live is time before deleting a blob that doesn’t have activity on it (usually a sign that the object that was being tracked has left the tracking area or that it wasn’t a proper object in the first place).
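In other words, it's a per-blob countdown rather than a frame buffer. A sketch of that bookkeeping, with illustrative names (a real tracker like ofxCvBlob's does the matching step for you):

```cpp
#include <algorithm>
#include <vector>

// 'Time to live' sketch: each tracked blob carries a countdown that is
// reset whenever it is matched to a fresh detection; unmatched blobs
// decay and are dropped once the counter reaches zero.
struct TrackedBlob {
    int id;
    int ttl; // frames of grace before deletion
};

void ageAndPrune(std::vector<TrackedBlob>& blobs,
                 const std::vector<int>& matchedIds, int resetTtl) {
    for (auto& b : blobs) {
        bool matched = std::find(matchedIds.begin(), matchedIds.end(), b.id)
                       != matchedIds.end();
        b.ttl = matched ? resetTtl : b.ttl - 1;
    }
    blobs.erase(std::remove_if(blobs.begin(), blobs.end(),
                               [](const TrackedBlob& b){ return b.ttl <= 0; }),
                blobs.end());
}
```

A longer TTL rides out kids briefly merging into one blob; a shorter one frees IDs faster when someone actually leaves the frame.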
Thanks Josh, this has all been really helpful.