User segmentation/3D blob tracking on multi-kinect data

Hi all. Creating a new thread here related to some discussion from here and here with @Caroline_Record and @bgstaal. I’m trying to do user segmentation, potentially via 3D blob tracking, on a space with multiple KinectV2 cameras. I’ve got a 3D scene with point clouds manually aligned (thanks to @NickHardeman & @theo for a great demo of this in the Connected World BTS video!).

cam1 on the left, cam2 on the right. You can see the gentleman in the middle with the coffee cup appears correctly aligned in 3D

Looking at my scene straight on with an ortho camera, I can do standard OpenCV blob tracking/contour finding on the combined point cloud and get silhouettes of people in the scene. Of course, as soon as there’s occlusion or people stand close together, the blobs for separate people merge into one larger blob.

I’m currently working on getting ofxKinectBlobFinder to work on data from two Kinects, which I elaborate on toward the bottom, but first here are some other techniques I looked at:

The Kinect 2.0 SDK does user segmentation via the BodyIndex (aka the “people image”) stream, though I’ve found that it’s fairly easy to walk past the camera and not have your body tracked. Seems like once you have your front or back to the cam it’s great & you get picked up quickly, but if you’re at a 90° angle it can miss you. Separate from that, the different perspective of each cam means you can’t combine these streams directly…instead you could use BodyIndex + depth together to make a mesh for each tracked body (and do some work to ensure people visible to both cams aren’t double-tracked), but the unreliability of BodyIndex tracking is making me explore other options.

I know that NiTE/OpenNI will do user segmentation, though at a quick glance it looks like these work by passing in an OpenNI device. I haven’t dug in to see if I can get around the device communication layer and just feed pixels to the user tracker, but I’m holding off on this for now thinking it might suffer from the same sideways/straight on issues as the MS SDK.

Picture from here

I was playing with PCL to do 3D blob tracking with the Euclidean cluster extraction class. For anyone looking to use PCL on Windows, there are excellent instructions, precompiled binaries for VS2012, 2013, and 2015, and property sheets here, and micuat’s fork of ofxPCL was handy for converting between OF & PCL types. The segmentation looked like it ran well on a single frame (I didn’t really mess with the parameters to get better results, or remove the floor for example), but dropped to ~5fps when trying to run it in real time. There is a GPU version, though compiling PCL on Windows with all of its dependencies isn’t something I’m looking to tackle yet.

Finally, there’s the ofxKinectBlobFinder addon (thanks to @robotconscience for his example & getting it working with ofxMultiKinectV2!) which works well and is fast. Before most of the work I mentioned above, I modified the addon to work with ofxKFW2 and also to accept a mesh, thinking I could feed it my combined point cloud. I realized though that it’s fast because it actually uses the depth map to only check for nearby neighbors within a couple of pixels (i.e. a 2D bounding box around the current pixel) and then checks those neighbors against 3D thresholds, rather than checking each vertex in the mesh against every other one.
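To make that neighbor-window idea concrete, here is a minimal, library-free sketch. It is not the addon's actual code: `GridPoint`, `BlobLabels` and `labelBlobs` are hypothetical names, and it assumes the points are stored in depth-map order (one point per pixel, row-major) with units in meters.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// One 3D point per depth-map pixel; valid == false marks pixels with no depth.
struct GridPoint { float x, y, z; bool valid; };

struct BlobLabels {
    std::vector<int> labels; // -1 for invalid pixels, otherwise a blob id
    int count;               // number of blobs found
};

// Flood-fill labeling: two pixels join the same blob if they are within
// `window` pixels of each other in the grid AND within `maxDist` in 3D.
// Only the small 2D window is searched, never the whole cloud.
BlobLabels labelBlobs(const std::vector<GridPoint>& pts, int width, int height,
                      int window, float maxDist) {
    assert(width > 0 && height > 0 && window >= 1);
    assert(pts.size() == static_cast<std::size_t>(width) * height);

    BlobLabels result{std::vector<int>(pts.size(), -1), 0};
    const float maxDistSq = maxDist * maxDist;

    for (int start = 0; start < width * height; ++start) {
        if (!pts[start].valid || result.labels[start] != -1) continue;

        // Start a new blob and grow it breadth-first.
        result.labels[start] = result.count;
        std::queue<int> open;
        open.push(start);
        while (!open.empty()) {
            const int cur = open.front();
            open.pop();
            const int cx = cur % width, cy = cur / width;
            for (int ny = cy - window; ny <= cy + window; ++ny) {
                for (int nx = cx - window; nx <= cx + window; ++nx) {
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
                    const int n = ny * width + nx;
                    if (!pts[n].valid || result.labels[n] != -1) continue;
                    const float dx = pts[n].x - pts[cur].x;
                    const float dy = pts[n].y - pts[cur].y;
                    const float dz = pts[n].z - pts[cur].z;
                    if (dx * dx + dy * dy + dz * dz > maxDistSq) continue;
                    result.labels[n] = result.count;
                    open.push(n);
                }
            }
        }
        ++result.count;
    }
    return result;
}
```

The cost per pixel is bounded by the window size, which is why this kind of search stays fast, but it only works when the points keep their image-grid layout. An unordered merged mesh loses that layout.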

So with that knowledge, it should be possible to feed it a virtual/“faux” depth map, i.e. a merged point cloud rendered to look like a depth map. What I’m having trouble with is getting the world coordinates back out of this depth map… I have a mesh, and so I know the 3D position of every vertex. Looking at the FBO I’ve rendered the mesh into, I know the depth value (in mm) of any pixel given an (x,y) input based on the color for that pixel. And so I should be able to use ofCamera’s screenToWorld() or cameraToWorld() (not sure which is the right one to use in this case) to get back world coordinates, but the values I’m getting don’t make sense…the z values are about 2 orders of magnitude higher than they should be. Below is the code I’m using to go from the depth map back to 3D coordinates. I’ve tried a number of combinations: screenToWorld vs cameraToWorld, passing in the FBO dimensions for the viewport vs leaving it empty for ofGetCurrentViewport() to fill in, passing in a z value of 0 and overwriting it later with data from the depth map, passing z at cm or mm scales, etc. I’m kind of at a loss at this point…any thoughts would be greatly appreciated! Also pinging @kylemcdonald who has done similar things with ofxVirtualKinect
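One thing that may be worth checking here: as far as I can tell from reading the ofCamera source, screenToWorld() maps the pixel x/y into the -1..1 range and passes z through as a normalized depth value, not a metric distance. Treat that as an assumption to verify against your OF version. If it holds, feeding it a depth in cm or mm would give wildly scaled results. The sketch below is library-free and only illustrates the arithmetic: `pixelToNdc` shows the pixel-to-normalized mapping as I understand it, and `unprojectOrtho` shows a manual alternative for the ortho case. Both function names, the `Vec3` type, and the `volumeWidthCm`/`volumeHeightCm` parameters are hypothetical, and it assumes an unrotated ortho camera at the origin looking down -Z.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

// Map a pixel position to normalized device coordinates in [-1, 1].
// Row 0 is the top of the image, so y is flipped. ndcZ is passed through
// unchanged: it is a normalized depth, not a distance in cm or mm.
Vec3 pixelToNdc(float px, float py, float ndcZ, float viewW, float viewH) {
    assert(viewW > 0.0f && viewH > 0.0f);
    return Vec3{2.0f * px / viewW - 1.0f,
                1.0f - 2.0f * py / viewH,
                ndcZ};
}

// Manual unprojection for an orthographic render.
// Assumptions (hypothetical setup): the camera sits at the origin, is not
// rotated, looks down -Z, and the rendered volume is volumeWidthCm wide and
// volumeHeightCm tall. depthMm is the value read from the faux depth map.
Vec3 unprojectOrtho(float px, float py, float depthMm,
                    float fboW, float fboH,
                    float volumeWidthCm, float volumeHeightCm) {
    assert(fboW > 0.0f && fboH > 0.0f);
    return Vec3{(px / fboW - 0.5f) * volumeWidthCm,   // pixels -> cm, centered
                (0.5f - py / fboH) * volumeHeightCm,  // flip y, centered
                -depthMm / 10.0f};                    // mm -> cm, into the scene
}
```

With an ortho view the x/y mapping is a plain linear scale, so doing it by hand like this sidesteps the question of what depth encoding the camera's unproject call expects.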

Virtual depth map of merged point clouds in perspective & ortho (these are actually from different frames)

in setup()

pointCloudFbo.allocate(640, 480, GL_RGB16);
mergedDepthMap.allocate(pointCloudFbo.getWidth(), pointCloudFbo.getHeight(), OF_IMAGE_GRAYSCALE);

in update()

// Draw point clouds into FBO
pointCloudFbo.begin();
	ofClear(0, 0);

	bOrthoCam ? cam.enableOrtho() : cam.disableOrtho();
	cam.begin();

	if (bDrawFloor) {
		ofDrawGridPlane(100, 100);
	}

	// scale by cam scaling factor since ortho has no "zoom"
	ofScale(camScale, camScale, camScale);

	if (bDrawMesh1) mesh1.draw();
	if (bDrawMesh2) mesh2.draw();

	cam.end();
pointCloudFbo.end();


// Create virtual depth map image from point cloud FBO
if (bDrawMergedMesh) {


	// build mesh from merged depth map
	for (int y = 0; y < mergedDepthMap.getHeight(); y++) {
		for (int x = 0; x < mergedDepthMap.getWidth(); x++) {
			int index = (y*mergedDepthMap.getWidth()) + x;

			// get depth from merged depth map, which is stored in mm just like a regular depth map
			// compare against thresholds, which are in m (so multiply by 1000)
			unsigned short depth = mergedDepthMap.getColor(x, y).r;
			if (depth <= thresholdNearMerged * 1000 || depth >= thresholdFarMerged * 1000) continue;

			// at this point, x & y should be in cm, which is the scale of the point clouds
			// and z is mm from the depth map. so convert z to cm by dividing by 10
			ofVec3f screen = ofVec3f(x, y, depth / 10.);
			ofVec3f world = cam.screenToWorld(screen, ofRectangle(0, 0, pointCloudFbo.getWidth(), pointCloudFbo.getHeight()));

I’m a little surprised the Euclidean Cluster Extraction technique works at 5fps! My first recommendation is to decimate the point cloud down to only 10% of the points, then feed it into PCL and see if you get enough of a speedup to make it usable.
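A rough sketch of the simplest way to do that decimation: keep every Nth point. `CloudPoint` and `decimate` are hypothetical names and the code is library-free, so it would need adapting to whatever point type you pass to PCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct CloudPoint { float x, y, z; };

// Keep every `stride`-th point. stride = 10 keeps roughly 10% of the cloud.
std::vector<CloudPoint> decimate(const std::vector<CloudPoint>& cloud,
                                 std::size_t stride) {
    assert(stride >= 1);
    std::vector<CloudPoint> out;
    out.reserve(cloud.size() / stride + 1);
    for (std::size_t i = 0; i < cloud.size(); i += stride) {
        out.push_back(cloud[i]);
    }
    return out;
}
```

Striding is cheap but keeps the sensor's uneven density (dense up close, sparse far away). PCL also has a `pcl::VoxelGrid` filter that downsamples to a uniform spatial resolution, which may suit clustering better if the stride approach gives patchy results.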

In theory, if you implemented ECE yourself you could optimize the way the kd-tree is built to avoid rebuilding it from scratch every frame, but I have a feeling that’s outside of the scope here.

If the 10% approach doesn’t work, I would start looking into more heuristic solutions. If you know everyone is going to be upright (not lying on the floor), I suggest taking a projection of the point cloud from above and finding places where the points are concentrated. This is what I do when I use ofxVirtualKinect: instead of using the frontal projection, I mostly use the overhead perspective.

This was pre-ofxCv but you could do it in less code now:
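Separately from the example referred to above (this is not that code, and it uses neither ofxVirtualKinect nor ofxCv), here is a library-free sketch of the overhead-projection idea: bin the points into a top-down grid over x/z and report cells where points pile up. `FloorPoint`, `Peak` and `findOverheadPeaks` are hypothetical names, and it assumes y is up and that one grid cell is roughly person-sized.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct FloorPoint { float x, y, z; }; // y is assumed to point up
struct Peak { int col, row, count; };

// Project the cloud straight down onto a cols x rows grid covering the floor,
// count points per cell, and return cells that hold at least minCount points
// and are a local maximum among their 8 neighbors.
std::vector<Peak> findOverheadPeaks(const std::vector<FloorPoint>& cloud,
                                    float minX, float minZ, float cellSize,
                                    int cols, int rows, int minCount) {
    assert(cols > 0 && rows > 0 && cellSize > 0.0f);
    std::vector<int> counts(static_cast<std::size_t>(cols) * rows, 0);

    // Height (y) is ignored: only the floor position decides the cell.
    for (const FloorPoint& p : cloud) {
        const int col = static_cast<int>(std::floor((p.x - minX) / cellSize));
        const int row = static_cast<int>(std::floor((p.z - minZ) / cellSize));
        if (col < 0 || row < 0 || col >= cols || row >= rows) continue;
        ++counts[row * cols + col];
    }

    std::vector<Peak> peaks;
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            const int idx = row * cols + col;
            const int c = counts[idx];
            if (c < minCount) continue;
            bool isPeak = true;
            for (int dr = -1; dr <= 1 && isPeak; ++dr) {
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const int nr = row + dr, nc = col + dc;
                    if (nr < 0 || nc < 0 || nr >= rows || nc >= cols) continue;
                    const int nIdx = nr * cols + nc;
                    // A denser neighbor wins; on ties the lower index wins,
                    // so two equal adjacent cells yield a single peak.
                    if (counts[nIdx] > c || (counts[nIdx] == c && nIdx < idx)) {
                        isPeak = false;
                        break;
                    }
                }
            }
            if (isPeak) peaks.push_back(Peak{col, row, c});
        }
    }
    return peaks;
}
```

Because an upright body stacks many points over a small patch of floor, two people standing side by side tend to show up as two separate peaks from above even when their frontal silhouettes merge.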

Have you guys spoken to Elliot Woods about all of this? It seems RULR would be perfect for this.