Multiple Kinect setup for real time volumetric reconstruction and tracking of people

Hi!

I’m working on a project these days where we are gonna do realtime tracking of people in a tight space with a low roof and moving obstacles (we will be moving these obstacles ourselves, so we know where they are at all times with a precision of ~10 mm). I have had success with the normal depth map -> threshold -> blob-tracking approach before, but this time around we will have to go for a more sophisticated approach.

I have done some research already (mostly theory) and I want to share what I’ve learned so far and hopefully gain some insight from you girls and guys before we start reinventing the wheel over here. It would be great if this thread could turn into a one-stop resource for multiple kinect setups with openFrameworks. (I have numbered the places in my post where I actually have questions.)

Hardware

We are planning to use four “kinect for windows” devices, as this project is already windows-specific and my experience with using xbox kinects and libfreenect on windows is not the best. Basically it’s proven to be quite unstable (slow startup, several reconnections before receiving data, random crashes). It would be great to hear if anyone has tips for improved stability, though we are already set on using the official drivers for this particular project.

I have read several posts describing issues with connecting multiple kinects to one computer, and it seems that the number of kinects that can be connected is directly linked to the number of internal usb buses on the system. I have seen claims that one bus can manage two kinect streams, while others claim you will need a dedicated bus per unit. Can anyone confirm?

To solve this issue we are looking at getting a quad-bus usb card like this: http://www.unibrain.com/products/dual-bus-usb-3-0-pci-express-adapter/

It is a USB 3.0 card but it claims to have “Legacy connectivity support for USB 2.0”. This might or might not be an issue. Does anyone have any experience with this? If some of you have a working setup with four kinects on a windows machine, do you mind sharing some details on the hardware setup?

Software

Our plan is to use (and maybe also contribute to?) the ofxKinectCommonBridge add-on. It says it only supports 32-bit windows for now and we are gonna run 64-bit windows. James George mentioned that an update for it was on the way. If so, what’s holding it back? Maybe we can help?

I’m also missing a method for converting from depth-map coordinates to 3d coordinates, like the getWorldCoordinateAt() method from ofxKinect. Is there an equivalent function in the official kinect SDK or do we need to implement this ourselves? If so, does anyone have any experience with which implementation yields the best approximation? (There’s one implementation described here http://openkinect.org/wiki/Imaging_Information but I haven’t tried it myself yet.)
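For reference, the approach on that wiki page boils down to a simple pinhole unprojection. A minimal sketch of what I have in mind, using the commonly quoted libfreenect intrinsics for the v1 depth camera (ballpark values; a per-device calibration or the SDK’s own mapper would be more accurate):

```cpp
#include "ofMain.h"

// Pinhole unprojection from depth-image coordinates to camera-space metres,
// along the lines of the openkinect wiki page linked above. The intrinsics
// are the commonly quoted libfreenect defaults for the Kinect v1 depth
// camera -- treat them as ballpark values.
static const float fx = 594.21f; // focal length in pixels, x
static const float fy = 591.04f; // focal length in pixels, y
static const float cx = 339.3f;  // principal point, x
static const float cy = 242.7f;  // principal point, y

// (u, v) is the pixel in the 640x480 depth map, depthInMeters is the value
// at that pixel already converted to metres.
ofVec3f depthToWorld(int u, int v, float depthInMeters) {
    ofVec3f p;
    p.x = (u - cx) * depthInMeters / fx;
    p.y = (v - cy) * depthInMeters / fy;
    p.z = depthInMeters;
    return p;
}
```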

Set up & Calibration

Our plan is to place one kinect in each corner of a room facing inwards, basically trying to cover as much of the room’s volume as possible. By extracting point clouds from each of the kinects and projecting them into the same coordinate space, we are hoping to reconstruct the geometry of people in the room as accurately as possible, while avoiding the normal “shadowing” effect you get from only using one kinect. (We will also have another layer of sensory information coming from a capacitive floor, but I’m not gonna include that in this discussion.)

As the installation will be exhibited at at least two different locations, we will have to make some sort of calibration routine for positioning and aligning the kinects. Here I’m quite inexperienced and would love to get some input on the best approach. Do we physically measure and input the position and angles of the kinects in relation to each other? Do we do some sort of reference-object calibration routine (checkerboard or similar)? Has anyone tried an ICP approach? (http://en.wikipedia.org/wiki/Iterative_closest_point)

I have done some tests with overlapping kinects and to my surprise the interference is not that bad. If it becomes a problem we might consider trying the vibration trick (see page 6 in this paper http://www.matthiaskronlachner.com/wp-content/uploads/2013/01/2013-01-07-Kronlachner-Kinect.pdf) or go to the extreme of implementing a way of triggering the readings in sequence (paper: https://isas.uka.de/Publikationen/IROS12_Faion.pdf). We are hoping to avoid all of this by not pointing the kinects directly at each other.

Tracking & predicting movement

The information we are mostly interested in is the position, height and roughly the volume of each person in the room. I haven’t worked much with point cloud data in this sense before (except performing a brute force distance calculation on all the points to find the closest one), so any pointers you guys have on how to work with this data in an effective way are very welcome. I’m already looking at PCL (http://www.pointclouds.org) for working with the point cloud data, and I got tipped that there’s already an ofxPCL out there so we will probably look into that. Another option I’m thinking about is generating some kind of height map based on all the data and performing normal opencv based blob tracking. That will probably require quite a lot of cleverness, though. Any takers?
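To sketch what I mean by the height map idea (rough and untested; it assumes the merged cloud is already transformed into a shared room coordinate system with y up and known room extents):

```cpp
#include "ofMain.h"
#include <algorithm>
#include <vector>

// Splat a merged point cloud (shared room coordinates, y up, metres) into a
// top-down grid where each cell stores the highest point seen. roomWidth,
// roomDepth and the cloud itself are assumptions here; the resulting
// ofFloatPixels can be thresholded and fed to a normal OpenCV/ofxCv
// contour finder for blob tracking.
ofFloatPixels buildHeightMap(const std::vector<ofVec3f>& cloud,
                             float roomWidth, float roomDepth,
                             int gridW = 320, int gridH = 240) {
    ofFloatPixels heightMap;
    heightMap.allocate(gridW, gridH, OF_PIXELS_GRAY);
    heightMap.set(0);

    for (const auto& p : cloud) {
        int gx = ofMap(p.x, 0, roomWidth, 0, gridW - 1, true);
        int gy = ofMap(p.z, 0, roomDepth, 0, gridH - 1, true);
        float& cell = heightMap[gy * gridW + gx];
        cell = std::max(cell, p.y); // keep the tallest point per cell
    }
    return heightMap;
}
```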

We will also have to try to predict where people will be in about 500 ms. I’m thinking an approximation based on the velocity/trajectory of each person will do? If anyone has experience with anything similar it would be awesome to get some details!
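Something as simple as this is what I have in mind for the prediction (made-up names, constant-velocity assumption, smoothing factor to be tuned):

```cpp
#include "ofMain.h"

// Simple constant-velocity prediction, one instance per tracked blob.
// The smoothing factor and the 0.5 s horizon are the only parameters.
struct TrackedPerson {
    ofVec2f position;  // current filtered floor position (metres)
    ofVec2f velocity;  // smoothed velocity (metres / second)

    void update(const ofVec2f& measuredPosition, float dt, float smoothing = 0.7f) {
        ofVec2f instantaneous = (measuredPosition - position) / dt;
        // exponential smoothing so one noisy frame doesn't throw the estimate
        velocity = velocity * smoothing + instantaneous * (1.0f - smoothing);
        position = measuredPosition;
    }

    ofVec2f predict(float secondsAhead = 0.5f) const {
        return position + velocity * secondsAhead;
    }
};
```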

It would be amazing if some of you have some input! I’ll try to keep this thread updated as we make discoveries that can be valuable to others.

Thanks!

-Bjørn


This sounds like an amazing project. I’m speaking with the MS Open Tech team right now to see what we can do to help you. For getWorldCoordinateAt(), it’s pretty easy for us to expose something like this http://msdn.microsoft.com/en-us/library/nuisensor.inuicoordinatemapper.mapdepthpointtoskeletonpoint.aspx so that should help you. Hopefully we’ll have more news soon!

That’s great news. Thanks a lot!

I wrote a big reply but it wouldn’t let me post it all, perhaps because I’m new.

The text is here: http://pastebin.com/7b3ASTS6


I’ll try posting it.

@joshblake’s response:

Yes, each Kinect sensor requires its own USB bus. This is because the OS reserves 10% of the USB 2.0 bandwidth and the Kinect reserves roughly 45% of the available bus bandwidth, regardless of which streams are in use. You can share the bus with a few very low-overhead devices such as a keyboard or mouse, but a webcam or USB drive is not recommended.
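(For a rough sense of where that 45% figure comes from: at 640 × 480 / 30 fps the raw depth stream is on the order of 100-150 Mbit/s depending on whether it is framed as 11 or 16 bits per pixel, and the Bayer color stream adds roughly 75 Mbit/s more, so a single sensor already eats around 200 Mbit/s of the 480 Mbit/s that USB 2.0 offers on paper, and less is available in practice once protocol and OS overhead are accounted for. These are back-of-the-envelope numbers, but they line up with the reservation and explain why a second Kinect won’t fit on the same bus.)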

For multiple Kinect sensors on one PC, you do need multiple USB buses. Most laptops expose a single external bus. Most PCs expose two - one in front and one in back. If you want more, you need to use a PC with expansion space.

For the Liberty University Video Wall project (featured here: http://blogs.msdn.com/b/kinectforwindows/archive/2014/01/27/kinect-for-windows-sets-tone-at-high-tech-library.aspx video is on its way), we built the system to use a single very powerful computer and run with four K4W 1.0 sensors. In the end, we only used three sensors but that was just to optimize the overlap of the field of view. The application and K4W SDK worked with four Kinects simultaneously at full frame rate just fine, with one caveat.

We added three “Rocketfish USB 3.0 PCI Express Card RF-P2USB3” cards, which use the Renesas USB chipset. We plugged one K4W sensor into each of these three cards, plus one into one of the built-in ports. For our particular machine (an HP Z820), we had to experiment with which of the built-in ports would work properly. When running all four Kinects at once, using one particular built-in port caused all the Kinects to be slow, another caused just the one Kinect to be slow, and a third worked. I suspect something weird with the Northbridge and interrupt handling. Either way, in the end we could have used four.

The caveat I mentioned was that in testing, sometimes the machine would hang on shutdown. I wasn’t able to isolate it to a USB driver or using a particular USB card or port before it resolved itself, but it could have been related.

Go ahead and try this (Renesas chipset) card, but I’m 95% sure that it will still only allow you to use one K4W sensor per card. “But why?” you object! “USB 3.0 has much more bandwidth, and this fancy card has four USB 3.0 channels!” Well yes, but K4W 1.0 is USB 2.0, and these cards implement that support by adding a single USB 2.0 chip somewhere, with the same USB 2.0 bandwidth limitations. It is possible a manufacturer might make a card with multiple USB 2.0 channels/buses with a Full TT for each port, but I haven’t seen one.

If you’re using the Kinect SDK, you’ll want to use CoordinateMapper:
C#:
http://msdn.microsoft.com/en-us/library/microsoft.kinect.coordinatemapper_members.aspx
C++:
http://msdn.microsoft.com/en-us/library/nuisensor.inuicoordinatemapper.aspx
It provides methods to transform single points as well as entire frames (much more efficient). You can transform color space to depth space and back, and depth space to skeleton (real-world meters) space and back. It works out of the box using the factory calibration.

Don’t bother physically measuring if you need things to actually line up. Since the Kinects will be opposed, you could use a calibration object, such as a large cube with a different AR marker on each face. Write something (using AR Toolkit or whatever is more popular now) to recover the 6 DOF transformation matrix when it sees the marker, and you will know the scale from physically measuring the cube. When Kinect 1 sees Face 1 and Kinect 2 sees Face 2, you can then get the Kinect 1 -> 2 transformation matrix by combining the individual measurements plus the physical offset of Face 1 to Face 2. Of course, you’ll want to do this measurement multiple times in different locations and average the results. Chessboards could also work, but I suggested AR markers because they can be automatically identified and it would be easier to do on site.
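To make the matrix bookkeeping for the cube idea concrete, here is a rough sketch using glm (column-vector convention; the pose matrices are whatever your marker tracker returns, and all names are made up):

```cpp
#include <glm/glm.hpp>

// Transform chaining for the AR-marker cube approach described above.
// T_k1_f1: pose of cube face 1 as seen by Kinect 1 (maps face-1 coords -> Kinect-1 coords)
// T_k2_f2: pose of cube face 2 as seen by Kinect 2 (maps face-2 coords -> Kinect-2 coords)
// T_f1_f2: fixed transform mapping face-2 coords -> face-1 coords, known from
//          physically measuring the cube.
// All matrices use the glm/OpenGL column-vector convention (p' = M * p).
glm::mat4 kinect2ToKinect1(const glm::mat4& T_k1_f1,
                           const glm::mat4& T_k2_f2,
                           const glm::mat4& T_f1_f2) {
    // Kinect-2 coords -> face-2 coords -> face-1 coords -> Kinect-1 coords
    return T_k1_f1 * T_f1_f2 * glm::inverse(T_k2_f2);
}

// A point seen by Kinect 2 can then be merged into Kinect 1's space:
// glm::vec4 pInK1 = kinect2ToKinect1(T_k1_f1, T_k2_f2, T_f1_f2) * glm::vec4(pInK2, 1.0f);
```

Doing this for several cube placements and averaging the results, as suggested above, should tighten up the estimate considerably.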

Yes, you’ll be fine here. You won’t need the vibration trick. If the Kinects are permanently mounted and there does happen to be interference, just hit one once so it doesn’t line up with the others’ IR patterns anymore. In your case the only interference you’ll likely see is a small spot if one can directly see the IR projector of the other, but that area is probably not of interest to your reconstruction.

For reference, for the video wall project referenced above, the interference with two Kinects 50% overlapping was not bad. Skeleton tracking still worked fine for both if you were visible to both. The depth image had a few holes, but the holes are typically only in one depth image or another. If you have three Kinects overlapping in one area, then there can be more trouble, but this probably won’t happen for you except maybe on the floor depending on how you line things up.

Heads up, PCL has lots of dependencies and they also tend to hard fork other projects. For example, if you want to use Kinect with PCL directly, you need to use their fork of an old version of OpenNI. Ugh. Not sure what the ofxPCL story is though. They also don’t tend to care much about running on Windows or with K4W at all.

Not sure what all the other junk will be in the room, but if Kinect SDK skeleton tracking ends up working, even just for a few seconds per person, you can get the height by smoothing the head joint and adding an offset (the distance from the center of the head to the top of the head, possibly scaled by a factor and the distance between shoulder and hip, if shorter people have smaller heads). This can also give you the position.
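A rough sketch of that height-from-head-joint idea (made-up names; it assumes the joint position has already been transformed so y = 0 is the floor, and the offset and smoothing values are just guesses to tune on site):

```cpp
// Exponentially smooth the head joint's height and add a head-centre-to-crown
// offset. headYMeters is assumed to be floor-relative (y = 0 at the floor).
struct HeightEstimator {
    float smoothedHeadY = 0;
    bool initialised = false;

    float update(float headYMeters, float headToCrownOffset = 0.11f,
                 float smoothing = 0.9f) {
        if (!initialised) {
            smoothedHeadY = headYMeters;
            initialised = true;
        } else {
            smoothedHeadY = smoothing * smoothedHeadY + (1 - smoothing) * headYMeters;
        }
        return smoothedHeadY + headToCrownOffset; // estimated standing height
    }
};
```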

If you do a blob based tracker on top of this, then you can simply associate the skeleton metrics with the blob and track the person even when the skeleton tracking fails.

Depending upon what this other stuff in the space will be, you might get false skeletons, or “mini-me’s”. Not sure.

If you have a merged point cloud space, then yes you could project all the points onto the floor to form an occupancy map, then blob track that. The blobs might be outlines only if you don’t get a lot of the head points.

Sounds reasonable. You’ll have to play with the prediction vs smoothing parameters to get a good effect for your visualization.

Hope my input was helpful! Let me know if you have any additional questions.


First of all @joshblake: Thanks a lot for taking the time to write such a detailed response! It has been very helpful already!

This is the kind of information that is pure gold to us! I’ll contact the manufacturer of that card to get it confirmed. If not I’ll do some more research and report back what we find out.

Again, thanks a lot!

Here is another resource on multi kinect calibration:

Hi Bjørn,

how is this project going? I am very interested in your results.
Have you already published something about this work?
Please share your improvements.

Best regards,
Gergely

I wish I had seen this sooner, as the calibration system I have developed for my thesis would have helped.

I did have multiple Kinects working on a PC using openframeworks in 2011. We had six. It was to register information along a 30 foot wall. My thesis has since evolved that program to generate what you are looking for (height of individuals). I’m not sure how it could be used to consider the volume of individuals, though.

How has your project progressed so far?

Hi nosarious,

Did you manage to correctly calibrate your six Kinects together with openframeworks? I am looking into having two Kinects capturing depth images with partial overlap and I have to find a way to calibrate them easily, so I’m interested in your results.

Please let me know about this,

Felix

My results were based on using the kinects on a projection wall, and trying to keep a fairly obvious on-screen rendition of the way they were lining up.

It worked very well for generating combined point clouds. I had four Kinects running but ended up using only 3 given the amount of space I had to work with.

The calibration technique would need some refinement to enable someone to use it for a wider volume of space. As I said, I needed it to be around the top edge of a screen, so I didn’t need to worry as much about the third dimension, but I am sure someone could use the technique to create a better way to calibrate them.

When everything is complete, I shall share a baseline of the code so people can try it. It is kind of messy, so I would be embarrassed to share it just now.

Hi @bgstaal, did you check out the unibrain card and get a response from the manufacturer regarding the usb 2 buses, or better yet, get a chance to try it out?

I’m about to embark on a very similar project so would like any advice, as I’ve been trying to get past 4 kinects (we need 8) all day with no love. Currently I’ve just got a cheap usb2 card and the onboard buses and can get 4 capturing, but only at around 17 fps, and this is a very specced out machine.

I was wondering about using 2 or 4 machines to capture all of the kinect info, generate the point clouds and then send this via an OSC blob to a central machine for merging them all into one scene. Does anyone have any experience or comments on whether this might work, tips and things to avoid?

Hello All, this is great. I’m currently writing a research proposal in this exact problem domain. I would be keen to collaborate and work on any part that could help me get through my thesis.
I have 4x v2 Kinects attached to a single beefy machine using 4x USB3.0 cards. I could get more PCs if needed.

Does anyone know if there are any active open source projects (working or not) that would like my help?

Thanks

@jasbro I’m also delving into a similar project, so I’m happy to exchange info as I make progress. For now: I’m just looking for kinect 2.0 based photogrammetry models, but after that I will be moving things up into videogrammetry through multiple kinect 2.0 units on one PC (if I can) to make an animated series of models. Feel free to reach out to my email, same username but at gmail.com.

Hi @jasbro. Depending on your setup, you may or may not actually be able to use more than one KinectV2 per computer. If you’re on Windows using Microsoft’s SDK (via either the ofxKinectForWindows2 addon or directly), they only support one sensor per machine. I’m working around this by streaming the depth image from one machine to another using ofxZmq which wraps ZeroMQ.

Ironically, on the Mac side you can use multiple Kinects via the ofxMultiKinectV2 addon which uses libfreenect2. I believe libfreenect2 is cross-platform, though the addon is only supported on OSX at the moment. I’m sure either of these projects would appreciate help! With this approach you can get access to the camera streams (color, depth, IR), and registration between depth & color space, but none of the Microsoft SDK features are available, such as the BodyIndex stream, skeletons/joints, etc.

Hope that helps!

Hi everyone!

I feel bad for not replying to this post before now; I’m really not good at this internet thing. The project we did last year worked really well though. You can see the results here:

http://www.creativeapplications.net/openframeworks/breaking-the-surface/

We ended up using the quad bus usb card from unibrain that I mentioned: http://www.unibrain.com/products/dual-bus-usb-3-0-pci-express-adapter/
Combined with the built-in usb bus, this enabled us to connect 5 kinect cameras to one windows machine.

We developed an addon on top of @joshuajnoble’s ofxKinectCommonBridge that enabled us to translate/rotate point clouds into the same coordinate space. We had to fork the original lib and tweak a lot of the code to suit our purposes specifically, so we never released the code as it is not very generalized. I can upload the code if it can be helpful to anyone though.

Since our purpose was to track people’s locations, we got away with projecting the point cloud into a 2d occupancy map and doing normal edge-detection/blob-tracking on that. We then combined this with data from a capacitive sensor floor to ensure redundancy (this is a safety critical application).

You can see a short video of the tracking software/physical application here:

We are working on installing this again at a different location right now. The new space is a lot bigger than the original one so we need to re-evaluate how we approach the kinect-setup. I have been doing some tests with kinect v2 recently to see what the expanded FOV actually means in practice and if we get more reliable depth data at a longer range than v1.

Here’s a couple of screens when comparing point clouds from the two versions:

As you can see from the comparison above, there is a noticeable difference in depth precision between the two models. Looking at the scene from above you can see the orange jacket (hanging on an office chair) clearly defined at the 6 m line in the kinect 2 screenshot. Comparing with v1 you can see the point cloud smudged out over more than half a meter. This problem becomes increasingly apparent the further away from the sensor we placed the reference object. We concluded that v1 has a practical range of ~5.5 m while v2 is actually very reliable up to 7.5 m. Comparing the two frustums gives us this result in terms of practical range for point cloud tracking:

In our application this has a significant impact as it would allow us to cover the whole area with 5 kinect v2 sensors while we would need about 7 or 8 v1 sensors to do the same thing.
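To put rough numbers on that, assuming the commonly quoted horizontal FOVs of ~57° for v1 and ~70° for v2: the usable footprint seen from above is roughly a triangle of width 2 · r · tan(FOV/2) at range r. That gives about 2 × 5.5 × tan(28.5°) ≈ 6 m wide by 5.5 m deep (~16 m²) for v1, versus 2 × 7.5 × tan(35°) ≈ 10.5 m wide by 7.5 m deep (~39 m²) for v2, so each v2 covers roughly 2.5 times the floor area. These are back-of-the-envelope figures, not measured values.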

I started porting our application to run on kinect v2 last week, until I realised that the new SDK does not support more than one sensor. This seems like an arbitrary limitation imposed by microsoft, in my mind. I know there are hardware limitations in terms of usb bandwidth, but I think it is weird that the SDK should put restrictions on the usage if one is able to overcome the hardware limitations. Any thoughts?

I have looked into the libfreenect2 version as @mattfelsen mentioned but I’m reluctant to resort to ‘experimental’ solutions as this is a safety critical application. At this point I haven’t even been able to successfully build and run the examples on windows.

We will probably resort to just adding more kinect v1 sensors as a short term solution this time around and hope that Microsoft starts supporting multiple v2 sensors soon. I have seen discussions on microsoft forums where it seems that this is something a lot of developers are asking for.

If anyone has contacts at Microsoft please give’em a push!


Wow, this looks really great! Thanks for sharing the project & your new research, and congrats on getting to do it all over again :smile:

As I mentioned, my solution for multiple cams is to stream the depth map across the network. Currently with ZeroMQ but I just got ofxLibwebsockets working for vs2015/x64 today so I may switch to that. Without compression it takes about 100mbps (!!), so I’m not sure if that would scale to ~5 Kinects. You could take a look at ofxTurboJpeg - this branch has 0.9/vs2015/x86 support, but you’ll have to compile libjpeg-turbo for x64 if you need that. There’s also a handy-looking class in ofxDepthKit with some PNG compression but I haven’t tried it. I had issues using ofSaveImage() – I forget the specifics but I think FreeImage couldn’t figure out the image format when loading if I passed it an ofBuffer with 16bit, grayscale image data.
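One generic workaround for the 16-bit format headaches (a sketch, not what any of the addons above actually do) is to pack each depth value into two 8-bit channels before encoding, and reassemble on the other end. This is only safe with lossless codecs like PNG, since lossy JPEG will corrupt the low byte:

```cpp
#include "ofMain.h"

// Pack a single-channel 16-bit depth image into an ordinary 8-bit RGB image
// (high byte in R, low byte in G) so it can go through any 8-bit lossless
// encoder, then unpack it on the receiving machine.
ofPixels packDepthToRGB(const ofShortPixels& depth) {
    ofPixels packed;
    packed.allocate(depth.getWidth(), depth.getHeight(), OF_PIXELS_RGB);
    for (size_t i = 0; i < depth.size(); i++) {
        unsigned short d = depth[i];
        packed[i * 3 + 0] = d >> 8;    // high byte
        packed[i * 3 + 1] = d & 0xff;  // low byte
        packed[i * 3 + 2] = 0;
    }
    return packed;
}

ofShortPixels unpackDepthFromRGB(const ofPixels& packed) {
    ofShortPixels depth;
    depth.allocate(packed.getWidth(), packed.getHeight(), OF_PIXELS_GRAY);
    for (size_t i = 0; i < depth.size(); i++) {
        depth[i] = (packed[i * 3 + 0] << 8) | packed[i * 3 + 1];
    }
    return depth;
}
```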

Regarding MS people, I think @joshuajnoble has some contacts there? I reached out to Jason Walters who is here in NYC – I’ll let you know if I hear anything back. Maybe you can hit up Rick Barraza as well?

Best of luck!

Hi, I also am bad at this internet thing sometimes. I think Rick Barraza would be a better person to hit up than me, but I do know that the 1 kinect limitation is imposed by MSFT. I think they’re assuming that people will be networking multiple machines together, though I’m not sure why that seems more reasonable than someone having a beast of a machine that can handle 2 or 3 streams.

Hi @bgstaal, what a great project!
If you are still interested in using multiple Kinect v2 sensors with skeletal tracking, you might be interested in an app that I made.
It uses Kinects on separate machines, connected to a server that governs them. The server can perform calibration (so that there is a uniform coordinate system between the kinects), filtering, body data separation etc.

The framerate of the live stream greatly depends on whether you use wifi, 100 mbps ethernet or gigabit ethernet. If you are only interested in the skeletons and body point clouds the streaming is very fast.

You can find its source and a link to binaries here:

And a video showing how it works here:


Shakir, it may be better to start a new question than to tag onto an existing question that does not really cover the topic.