ofxTensorFlow2 runMultiModel

Hey folks,

I am using MoveNet with ofxTensorFlow2 to examine multiple video feeds. I have managed to get it working using individual calls to model.runModel(), but the performance is not good enough. The two things I want to try next are (1) using model.runMultiModel() to see if that increases performance at all, and (2) using a threaded model with each video feed running in its own thread. I know #2 is more likely to improve performance, but I also want to understand what I am doing wrong with #1.

I can’t seem to get runMultiModel to work or find a good example to guide me (this discussion was somewhat helpful). Here are the input and output specifications:

The given SavedModel SignatureDef contains the following input(s):
  inputs['input'] tensor_info:
      dtype: DT_INT32
      shape: (1, -1, -1, 3)
      name: serving_default_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_0'] tensor_info:
      dtype: DT_FLOAT
      shape: (1, 6, 56)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

So I set up the model with 7 inputs/outputs: “serving_default_input:0”/“StatefulPartitionedCall:0” (I’ve also tried changing the ‘0’ to 0-6), and then populate a vector of tensors like:

for (int i = 0; i < 7; i++) {
	videos[i].update();
	if (videos[i].isFrameNew()) {
		ofPixels pix(videos[i].getPixels());
		pix.resize(nnWidth, nnHeight);
		cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
		auto inputCast = cppflow::cast(t, TF_UINT8, TF_INT32);
		inputs[i] = cppflow::expand_dims(inputCast, 0);
	}
}

But then, later, when I call vector<cppflow::tensor> output = model.runMultiModel(inputs); it crashes somewhere in model.cpp with an exception: std::out_of_range at memory location 0x0000000000148F00.

Can anyone point out what I am doing wrong, or perhaps point me to a working example? Or, more generally, I’d also appreciate any performance optimizations.

Thanks in advance!

For what it is worth, I tried threading the video matting example and, while it freed up the main thread, it didn’t help the rate of processing much.

Your best bet would probably be to make sure the TensorFlow / CUDA libs are using your GPU. I think that can give the biggest performance improvement.

Theo

In terms of the crash, I bet it is because isFrameNew() isn’t true for some of the videos right at launch, so some entries of the inputs vector haven’t been set.

A fix would be to make sure all 7 inputs have been set at least once before calling runMultiModel.
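
Something like this might work (a rough, untested sketch reusing your loop; I’m assuming inputs and a set of flags live as member variables):

std::vector<bool> inputSet = std::vector<bool>(7, false); // member: which slots have been filled at least once

for (int i = 0; i < 7; i++) {
	videos[i].update();
	if (videos[i].isFrameNew()) {
		ofPixels pix(videos[i].getPixels());
		pix.resize(nnWidth, nnHeight);
		cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
		t = cppflow::cast(t, TF_UINT8, TF_INT32);
		inputs[i] = cppflow::expand_dims(t, 0);
		inputSet[i] = true;
	}
}

// only call runMultiModel() once every slot has been filled at least once
bool allSet = true;
for (int i = 0; i < 7; i++) {
	if (!inputSet[i]) {
		allSet = false;
		break;
	}
}
if (allSet) {
	std::vector<cppflow::tensor> output = model.runMultiModel(inputs);
}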

Thanks, @Theo. I actually am checking the tensor shape with

ofxTF2::shapeVector shape = ofxTF2::getTensorShape(inputs[i]);
ofLogNotice(__FUNCTION__) << ofxTF2::vectorToString(shape);

and getting

[notice ] ofApp::update: [1, 288, 512, 3]

for all of the tensors I am passing in. So I am sure all of them have been populated.

I am seeing this output:

2022-01-19 15:47:13.575430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3949 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5

which I took to mean that it is using the GPU. Is that correct?

Ahh yes!
That does look like it is running on the GPU. :+1:

No idea about the exception though. You might get some help from @danomatika or via the Github Issues for the project.

What fps rate are you getting?

7 streams seems like a lot, honestly, even with the GPU, but I am no expert. Pinging @bytosaur

As Theo writes, using a threaded model only keeps the main thread from being blocked, but the video frames and the computed positions will be out of sync if the processing takes longer than the amount of time to grab one frame from the video/camera input. The model is crunching on a frame while new ones keep coming in, so the final computed positions are too old. If a slower processing frame rate is fine, just skip the input frames in between. If realtime is not an issue, i.e. batch processing video input to output, then there isn’t a need to thread the model and the processing frame rate can be slow. If you are somewhere in between, well then you need a faster system.

I’m on an RTX 2060, running 7 streams through the MoveNet model. Each stream is running at about 6-8 fps and not affecting the main loop very much.

7-streams (there’s an alignment problem here, but that’s a separate issue)

This will ultimately be running on a Quadro RTX 5000, which should have about 50% more effective speed, so maybe I’d get an extra 3 or 4 fps. But I don’t have access to that hardware yet. The obvious solution is for each camera to have its own machine to do the processing and send the skeleton data via OSC or something, but unfortunately we already put our money on the idea that it could all be done on one monster GPU :frowning: So I’m trying to figure out how badly we misjudged this.

It’s also worth noting that I get a bunch of this message:

2022-01-20 09:40:50.599848: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Perhaps obviously, there’s a clear inverse relationship between the number of cameras and the per-thread frame rate:

7 cameras = each thread 6-8fps*
6 cameras = each thread 8-9fps*
5 cameras = each thread 9-10fps*
4 cameras = each thread 11-13fps
3 cameras = each thread 13-15fps
2 cameras = each thread 17-20fps
1 camera = thread runs at 20-21fps
*GPU runs out of memory

But I’ve also been looking at OpenPose, which runs at around 10fps on A SINGLE 800x600 video feed. And I’m just starting to get MediaPipe running. But it does seem like MoveNet is pretty damn performant compared to other pose estimation solutions.

Thanks for the tips, @danomatika. Yes, 7 is probably too much.

FWIW, this is my update loop. Basically, I only take new frames from the camera when the threaded model is ready. This does need to run in real time, but accuracy is not a top priority, so I’m not super concerned about the sync issues at this point.

void TFCameraFeed::update() {
	vid.update();

	if (vid.isFrameNew() && model->readyForInput()) {
		ofPixels pix(vid.getPixels());
		pix.resize(nnWidth, nnHeight);

		cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
		t = cppflow::cast(t, TF_UINT8, TF_INT32);
		t = cppflow::expand_dims(t, 0);
		model->update(t);
	}

	if (model->isOutputNew()) {
		auto output = model->getOutput();
		parseSkeletons(output);

		uint64_t now = ofGetElapsedTimeMillis();
		outputRate.push_back(now - lastOutput);
		lastOutput = ofGetElapsedTimeMillis();
	}
}
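
For anyone else wondering how to parse the (1, 6, 56) output tensor, here is a rough sketch (this isn’t my exact parseSkeletons, and it assumes the documented MoveNet MultiPose layout of 17 keypoints as normalized y, x, score per detection, followed by a bounding box and an overall score, plus that ofxTF2::tensorToVector copies the tensor into a std::vector<float>):

std::vector<float> values;
ofxTF2::tensorToVector<float>(output, values); // flatten (1, 6, 56) into 6 * 56 = 336 floats

for (int person = 0; person < 6; person++) {
	const float *p = values.data() + person * 56;
	float detectionScore = p[55]; // overall score for this detection
	if (detectionScore < 0.3) continue; // skip empty / low-confidence detections
	for (int k = 0; k < 17; k++) {
		float y = p[k * 3];         // normalized 0..1
		float x = p[k * 3 + 1];
		float score = p[k * 3 + 2];
		// scale to pixels, e.g. x * vid.getWidth(), y * vid.getHeight()
	}
}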

hey @jefftimesten,

just to clarify what runMultiModel is doing:
TF graphs do not necessarily have exactly one input and one output tensor. runMultiModel handles the general case of naming your inputs and outputs, e.g. a model that wants a picture and a random vector as inputs.
I understand that the naming of the function is misleading…
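
For example, a made-up model that takes an image and a vector would be set up and run roughly like this (a sketch; the tensor names here are hypothetical and would come from saved_model_cli for the actual model, and I’m assuming ofxTF2::Model::setup() is given the input and output name lists, as in your post):

ofxTF2::Model model;
model.load("model");
model.setup({"serving_default_image:0", "serving_default_vector:0"}, // hypothetical input names
            {"StatefulPartitionedCall:0"});                          // output name
std::vector<cppflow::tensor> inputs {imageTensor, vectorTensor}; // one tensor per named input, in order
std::vector<cppflow::tensor> outputs = model.runMultiModel(inputs);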

Unfortunately, this model doesn’t let you input a batch of images, as the batch size is fixed to 1.

What are the dimensions of your inputs?
The model is from here and it says the input dims should be a multiple of 32, with 256 recommended for the larger dimension. Maybe reducing them gives you a benefit…

The error about memory is normal; I was going to comment the same as the others.

Maybe you can try scaling down the frames you analyze and then translating the positions back to full size; that might help a bit with the fps.

Just what bytosaur said :slight_smile: we are all typing at the same time.


Thanks, @bytosaur

Okay, good to know. I can eliminate that as an option.

Ah! I was under the impression that it MUST be 512x288, so I was scaling my 640x360 camera feeds down to that.

So I could send in 7 video frames munged together as one big tensor, as long as the dimensions are multiples of 32? A single 3584x288 image? (gonna try that now)

You can go even lower than that:

where H and W need to be a multiple of 32 and the larger dimension is recommended to be 256
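
For your 640x360 feeds, something like this would give a valid size (a quick sketch; note it squashes the aspect ratio a little, since 360 scaled down doesn’t land exactly on a multiple of 32):

// scale so the larger side is about 256, then round each side down to a multiple of 32
int srcW = 640, srcH = 360;
float scale = 256.0f / std::max(srcW, srcH);
int nnWidth  = (int(srcW * scale) / 32) * 32; // 256
int nnHeight = (int(srcH * scale) / 32) * 32; // 144 rounds down to 128
pix.resize(nnWidth, nnHeight);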

So I could send in 7 video frames munged together as one big tensor, as long as the dimensions are multiples of 32? A single 3584x288 image? (gonna try that now)

Ahh no… you can’t merge the images!

Also, have you seen the ofxMovenet model wrapper class? It handles either threaded or non-threaded usage already: https://github.com/zkmkarlsruhe/ofxTensorFlow2/tree/main/example_movenet/src

@bytosaur We should probably add this info plus the batch size of 1 to the ofxMovenet.h header comments.

To answer myself :slight_smile: https://github.com/zkmkarlsruhe/ofxTensorFlow2/commit/90551fbcb64c3c68b63ff118968c3d1f9c442a60

I have! I kind of deconstructed that and refactored it into my various different approaches. Very helpful. In the end, it was easier to use an ofxTF2::ThreadedModel.

In fact, I can! I made a 3584x288 texture and the performance is decent (12fps). I’m currently trying to balance the quality of the pose estimation with the processing speed.

I mean, sure, you can pass it through, but don’t you want to use multi-person detection? This model can only detect 6 people in a single image…

You’re absolutely right. I was getting ahead of myself… :frowning: