I am using MoveNet with ofxTensorFlow2 to examine multiple video feeds. I have managed to get it working using individual calls to model.runModel(), but the performance is not good enough. The two things I want to try next are (1) using model.runMultiModel() to see if that increases performance at all, and then (2) using a threaded model and having each video feed run in its own thread. I know #2 is more likely to improve performance, but I also want to understand what I am doing wrong with #1.
I can't seem to get runMultiModel to work or find a good example to guide me (this discussion was somewhat helpful). Here are the input and output specifications:
The given SavedModel SignatureDef contains the following input(s):
  inputs['input'] tensor_info:
      dtype: DT_INT32
      shape: (1, -1, -1, 3)
      name: serving_default_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_0'] tensor_info:
      dtype: DT_FLOAT
      shape: (1, 6, 56)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
So I set up the model with 7 inputs/outputs: "serving_default_input:0"/"StatefulPartitionedCall:0" (I've also tried changing the "0" to 0-6), and then populate a vector of tensors like:
for (int i = 0; i < 7; i++) {
    videos[i].update();
    if (videos[i].isFrameNew()) {
        ofPixels pix(videos[i].getPixels());
        pix.resize(nnWidth, nnHeight);
        cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
        auto inputCast = cppflow::cast(t, TF_UINT8, TF_INT32);
        inputs[i] = cppflow::expand_dims(inputCast, 0);
    }
}
But then, later, when I call vector<cppflow::tensor> output = model.runMultiModel(inputs); it crashes somewhere in model.cpp with an exception: std::out_of_range at memory location 0x0000000000148F00.
Can anyone point out what I am doing wrong, or perhaps point me to a working example? More generally, I'd also appreciate any performance optimization tips.
For what it is worth, I tried threading the video matting example, and while it freed up the main thread, it didn't help the processing rate much.
Your best bet would probably be to make sure the TensorFlow / CUDA libs are using your GPU. I think that can give the biggest performance improvement.
As for the crash, I bet it is because isFrameNew() isn't true for some of the videos right at launch, so some entries of the inputs vector haven't been set.
A fix would be to make sure all 7 inputs have been set at least once before calling runMultiModel.
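Roughly something like this (untested sketch building on your snippet above; the hasInput bookkeeping is just one way to do it):

// class members (alongside your model and videos):
std::vector<cppflow::tensor> inputs = std::vector<cppflow::tensor>(7);
std::vector<bool> hasInput = std::vector<bool>(7, false);

void ofApp::update() {
    for (int i = 0; i < 7; i++) {
        videos[i].update();
        if (videos[i].isFrameNew()) {
            ofPixels pix(videos[i].getPixels());
            pix.resize(nnWidth, nnHeight);
            cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
            t = cppflow::cast(t, TF_UINT8, TF_INT32);
            inputs[i] = cppflow::expand_dims(t, 0);
            hasInput[i] = true;
        }
    }
    // a default-constructed cppflow::tensor holds no data, which is a likely
    // cause of the std::out_of_range thrown inside runMultiModel
    bool allSet = true;
    for (int i = 0; i < 7; i++) {
        if (!hasInput[i]) { allSet = false; break; }
    }
    if (allSet) {
        std::vector<cppflow::tensor> output = model.runMultiModel(inputs);
        // ... handle output
    }
}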
7 streams seems like a lot, honestly, even with the GPU, but I am no expert. Pinging @bytosaur
As Theo writes, using a threaded model only keeps the main thread from being blocked, but the video frames and the computed positions will be out of sync if the processing takes longer than the time to grab one frame from the video/camera input. The model is crunching on a frame while new ones keep coming in, so the final computed positions are too old. If a slower processing frame rate is fine, just skip the input frames in between. If realtime is not an issue, i.e. batch processing video input to output, then there isn't a need to thread the model and the processing frame rate can be slow. If you are somewhere in between, well, then you need a faster system.
I'm on an RTX 2060, running 7 streams through the MoveNet model. Each stream is running at about 6-8 fps and not affecting the main loop very much.
(there's an alignment problem here, but that's a separate issue)
This will ultimately be running on a Quadro RTX 5000, which should have about +50% effective speed, so maybe I'd get +3 or +4 fps. But I don't have access to that hardware yet. The obvious solution is for each camera to have its own machine to do the processing and send the skeleton data via OSC or something, but unfortunately we already put our money on the idea that it could all be done on one monster GPU. So I'm trying to figure out how badly we misjudged this.
It's also worth noting that I get a bunch of this message:
2022-01-20 09:40:50.599848: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Perhaps obviously, there seems to be a roughly linear relationship between the number of camera feeds and the per-thread frame rate:
7 cameras = each thread 6-8fps*
6 cameras = each thread 8-9fps*
5 cameras = each thread 9-10fps*
4 cameras = each thread 11-13fps
3 cameras = each thread 13-15fps
2 cameras = each thread 17-20fps
1 camera = thread runs at 20-21fps
*GPU runs out of memory
But I've also been looking at OpenPose, which runs at around 10fps on a SINGLE 800x600 video feed. And I'm just starting to try MediaPipe. But it does seem like MoveNet is pretty damn performant compared to other pose estimation solutions.
Thanks for the tips, @danomatika. Yes, 7 is probably too many.
FWIW, this is my update loop. Basically, I only take new frames from the camera when the threaded model is ready. This does need to run in real time, but accuracy is not a top priority, so I'm not super concerned about the sync issues at this point.
void TFCameraFeed::update() {
    vid.update();
    // only hand a new frame to the model when the worker thread is idle
    if (vid.isFrameNew() && model->readyForInput()) {
        ofPixels pix(vid.getPixels());
        pix.resize(nnWidth, nnHeight);
        cppflow::tensor t = ofxTF2::pixelsToTensor(pix);
        t = cppflow::cast(t, TF_UINT8, TF_INT32);
        t = cppflow::expand_dims(t, 0);
        model->update(t);
    }
    if (model->isOutputNew()) {
        auto output = model->getOutput();
        parseSkeletons(output);
        // track the time between outputs to measure the effective processing rate
        uint64_t now = ofGetElapsedTimeMillis();
        outputRate.push_back(now - lastOutput);
        lastOutput = now;
    }
}
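In case it helps anyone, parseSkeletons() basically just unpacks the (1, 6, 56) output: each of the 6 detections is 17 keypoints * [y, x, score] followed by [ymin, xmin, ymax, xmax, score] for the bounding box, at least as I read the MoveNet MultiPose docs. Sketched from memory (Skeleton and minPoseScore are my own struct/threshold):

void TFCameraFeed::parseSkeletons(const cppflow::tensor & output) {
    // flatten the (1, 6, 56) tensor into 6 * 56 floats
    std::vector<float> vals;
    ofxTF2::tensorToVector<float>(output, vals);
    skeletons.clear();
    for (int p = 0; p < 6; p++) {
        const float * d = &vals[p * 56];
        float score = d[55]; // overall detection score is the last value
        if (score < minPoseScore) continue;
        Skeleton s;
        for (int k = 0; k < 17; k++) {
            // keypoints are stored as [y, x, confidence], normalized 0-1
            float y = d[k * 3];
            float x = d[k * 3 + 1];
            float c = d[k * 3 + 2];
            s.points.push_back(glm::vec3(x * nnWidth, y * nnHeight, c));
        }
        skeletons.push_back(s);
    }
}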
Just to clarify what runMultiModel is doing: TF graphs do not necessarily have exactly one input and one output tensor. runMultiModel handles the general case of naming your inputs and outputs, e.g. a model that wants a picture and a random vector as inputs.
I understand that the naming of the function is misleading…
Unfortunately, this model doesn't let you input a batch of images, as the batch size is fixed to 1.
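For a graph that really does have several named inputs, the usage would look roughly like this (hypothetical tensor names, and I am writing the setup-with-names call from memory, so check ofxTF2::Model in the addon):

// hypothetical two-input graph (e.g. a conditional generator), NOT MoveNet;
// the tensor names would come from the model's SignatureDef
ofxTF2::Model model;
model.load("path/to/model");
model.setup({"serving_default_image:0", "serving_default_noise:0"}, // input names (made up here)
            {"StatefulPartitionedCall:0"});                         // output name(s)

// pix is an ofPixels already resized to the model's expected input size
cppflow::tensor image = ofxTF2::pixelsToTensor(pix);   // (H, W, 3)
image = cppflow::expand_dims(image, 0);                // -> (1, H, W, 3)
cppflow::tensor noise = cppflow::fill({1, 128}, 0.5f); // -> (1, 128)

// one tensor per *named* input of the same graph, not one image per video feed
std::vector<cppflow::tensor> outputs = model.runMultiModel({image, noise});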
What are the dimensions of your inputs?
The model is from here and it says the input dims should be a multiple of 32, while 256 is recommended for the larger dimension. Maybe reducing them gives you a benefit…
Okay, good to know. I can eliminate that as an option.
Ah! I was under the impression that it MUST be 512x288, so I was scaling my 640x360 camera feeds down to that.
So I could send in 7 video frames munged together as one big tensor, as long as the dimensions are multiples of 32? A single 3584x288 image? (gonna try that now)
From the model page: "where H and W need to be a multiple of 32 and the larger dimension is recommended to be 256"
I have! I kind of deconstructed that and refactored it into my various different approaches. Very helpful. In the end, it was easier to use an ofxTF2::ThreadedModel.
In fact, I can! I made a 3584x288 texture and the performance is decent (12fps). I'm currently trying to balance the quality of the pose estimation with the processing speed.
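In case it's useful to anyone else, the tiling itself is just pasting the resized feeds side by side into one image before converting to a tensor, roughly like this (a simplified sketch using ofPixels, not my exact code; nnWidth/nnHeight are 512/288 as above):

// paste the 7 resized feeds side by side into one 3584x288 image,
// then run that single tensor through the threaded model
ofPixels combined;
combined.allocate(nnWidth * 7, nnHeight, OF_PIXELS_RGB); // 512 * 7 = 3584

for (int i = 0; i < 7; i++) {
    ofPixels pix(videos[i].getPixels());
    pix.resize(nnWidth, nnHeight);           // 512x288 per feed
    pix.pasteInto(combined, i * nnWidth, 0); // tile horizontally
}

cppflow::tensor t = ofxTF2::pixelsToTensor(combined);
t = cppflow::cast(t, TF_UINT8, TF_INT32);
t = cppflow::expand_dims(t, 0);              // -> (1, 288, 3584, 3)
model->update(t);

// when parsing the output, the normalized keypoint x values then have to be
// mapped back to a feed: tile = floor(x * 7), local x = fmod(x * 7, 1.0)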