Poor ofTexture::loadData() performance on Linux after update/upgrade

After finally getting my shader code to work well, a recent set of updates has totally wrecked my performance. I tried switching back to the old kernel and old NV drivers (using the same ofx), with no change. I tried everything I could think of, including fresh precise (and trusty) installs. No matter what I do, the code that used to perform well now performs poorly. I went from 0.033 seconds per frame to 0.2.

My previous code (following) improved greatly after switching from ofImage::setImageType() to ofImage::update(). Now both cases take the same time to render, 0.2s per frame.

maskIter = masks.begin();
for (imageIter = images.begin(); imageIter != images.end() and maskIter != masks.end(); imageIter++) {

    // Convert Mats to ofImages for uploading to GPU
    ofImage uploadImage, uploadMask; // or ofPixels? seems this will do the GPU upload, what we want?
    ofxCv::toOf(*imageIter, uploadImage);
    uploadImage.update();
    //uploadImage.setImageType(OF_IMAGE_COLOR);
    ofxCv::toOf(*maskIter, uploadMask);
    uploadMask.update();
    //uploadMask.setImageType(OF_IMAGE_GRAYSCALE);
    // Render to FBO
    perceptsFBO.begin(); // render to FBO
    shader.begin();
    shader.setUniformTexture("image", uploadImage.getTextureReference(), 0);
    shader.setUniformTexture("mask", uploadMask.getTextureReference(), 1);
    ofTranslate(640, 360); // 1280/2, 720/2, specific to drawing on the plane.
    plane.draw();
    shader.end();
    perceptsFBO.end();

    maskIter++;
}

(see this post for details.)

If I don’t call update() or setImageType(), then I get 0.033 seconds per frame. So maybe texture uploading has become slower?

Has anyone else seen anything like this? Any hints on what I should be looking at?

I ended up with a new trusty install, ofx 0.8.3, CUDA 6.5 (nv driver 340.29), running on my shuttle SH87R6 (QC i5-4670) with a GTX 780.

There is a chance it’s not the texture upload operations that have slowed down, but something else. Unfortunately the fast working config is long gone, so I have little to compare my issues with.

Thanks all!

Here (4.2 KB) is updated test source. Note, the input images should be 1280x720 with alpha channels, but could be anything after that. I used to get output like this:

0 0
1 0.033425
2 0.033294
3 0.033347
4 0.03336
5 0.033311
6 0.033353
7 0.033308
8 0.033332
9 0.033358

And now I get output like this:

0 0
1 0.21974
2 0.207283
3 0.214092
4 0.202567
5 0.200023
6 0.202854
7 0.196178
8 0.196054
9 0.245869

I can also confirm that according to callgrind, ofImage::update() uses 42% CPU, which seems very high.

What performance are others seeing with this same code? What is callgrind reporting as update() usage?

Thanks.

not sure i’m understanding… this used to run at 0.03s per frame and now runs at 0.2s per frame, but you don’t remember what changed? was it an older version of OF? maybe try swapping that.

also, try removing ofSetFrameRate(30) and add ofSetVerticalSync(false)

creating an image several times per frame is going to be slow, the best would be to create those images in setup and then update them every frame.

Kyle, exactly. Nothing in the code changed; all I did was install ubuntu package updates, and boom, performance went to hell. I did not change the OF code, the OF version, the nvidia driver version, or the opencv version. I’m going to try going back to a fresh precise install (I originally said lucid, but it was precise) and disabling updates to get the older OF dependency packages, and see if that works. Still, having to go back to such an old config seems to indicate a problem in the newer packages somewhere…

Arturo: The images are generated on the fly in openCV and are stored in cv::Mat classes. So there is no way to do it but generate ofImages on the fly. Would ofTexture or ofPixels be faster?

Still, I was getting great performance on this machine with this code not long ago, making it fast enough to run with updated packages would work, but I’d be surprised to get back down to 30fps by tweaking this code. I’m willing to give it a try.

Note that if I disable the ofImage::update() call, things go back to 0.033. It seems that it’s ofTexture::loadData() that is suddenly slow.
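To check whether the upload call alone accounts for the extra time, one option is to time it directly instead of inferring it from callgrind. The following is only a sketch using std::chrono; the names secondsTaken and TimingStats are hypothetical helpers, not part of OF.

```cpp
#include <chrono>
#include <utility>

// Run fn once and return the wall-clock time it took, in seconds.
template <typename Fn>
double secondsTaken(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Accumulates samples so a per-call average can be printed once per frame.
struct TimingStats {
    double total = 0.0;
    int count = 0;

    void add(double seconds) { total += seconds; ++count; }
    double mean() const { return count > 0 ? total / count : 0.0; }
};

// Hypothetical use inside the loop from the first post:
//   stats.add(secondsTaken([&] { uploadImage.update(); }));
```

Because OpenGL calls can return before the GPU has finished, timing update() this way mostly measures CPU time spent in the driver. Calling glFinish() inside the timed lambda would include the GPU side as well.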

Have either of you been able to run my code? How fast does it run on your machines?

So I tried going back to a stock precise and not install updates, but I can’t even get the ofx dependencies to install because the packages already installed are more up to date than those available to install with updates disabled. So I can’t even get back to the config I had before without a huge amount of per-package finagling. It seems the only way is forward, maybe some bleeding edge mesa packages?

looking at the original post, i see you are using nvidia 340? i’m on ubuntu 14.10 and the latest version i get is 331, might it be that you have some experimental ppa for the nvidia drivers?

also even if you create the images on the fly every update you should find a way of not having to recreate the ofImages every frame. you could have a pool of ofTexture, just create 2 vectors for example, one for used textures and the other for unused ones. whenever you need a new texture go to the unused vector, get a texture out of it, load data into it and put it in the used vector. when you are done with a texture remove it from the used vector and put it back in the unused one. that will avoid allocating pixels and textures every frame
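A minimal sketch of the two-vector pool described above. It deliberately assumes nothing about openFrameworks: T stands in for ofTexture, and the class name TwoListPool and its methods are hypothetical, not OF API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Pool with a "used" and an "unused" list. T is a placeholder for ofTexture.
template <typename T>
class TwoListPool {
public:
    // Move an object from the unused list to the used list and return it.
    // A new object is constructed only when the unused list is empty.
    T* acquire() {
        if (unused_.empty()) {
            unused_.push_back(std::unique_ptr<T>(new T()));
            ++constructed_;
        }
        used_.push_back(std::move(unused_.back()));
        unused_.pop_back();
        return used_.back().get();
    }

    // Move one object back to the unused list so it can be reused later.
    void release(T* obj) {
        auto it = std::find_if(used_.begin(), used_.end(),
            [obj](const std::unique_ptr<T>& p) { return p.get() == obj; });
        assert(it != used_.end() && "object was not acquired from this pool");
        if (it == used_.end()) return;
        unused_.push_back(std::move(*it));
        used_.erase(it);
    }

    // Return everything at once, e.g. at the end of a frame.
    void releaseAll() {
        for (auto& p : used_) unused_.push_back(std::move(p));
        used_.clear();
    }

    std::size_t usedCount() const { return used_.size(); }
    std::size_t unusedCount() const { return unused_.size(); }
    std::size_t constructedCount() const { return constructed_; }

private:
    std::vector<std::unique_ptr<T>> used_;
    std::vector<std::unique_ptr<T>> unused_;
    std::size_t constructed_ = 0;
};
```

In the render loop this would mean calling acquire() once per cv::Mat, loading the pixel data into the returned texture, and calling releaseAll() when the frame is done. It only removes the per-frame allocation; the upload itself still happens for every texture.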

Thanks Arturo.

I tried a bunch of different NV drivers (those I could install), and all give the same results. I originally was using the NV driver shipped with CUDA, so it was a more up-to-date version (using the NV-shipped .run installer).

Ok, so I know my images are all the same size, so in theory, could I just allocate a vector/array of ofTextures at startup, and then just populate those on the fly? I did notice that ofTexture::loadData() looks like where the high CPU load comes from. Even if I have a vector of ofTextures, I would still need to upload each one, so it seems this function would still be a problem. Would allocating an ofTexture (rather than an ofImage) decrease the load caused by ofTexture::loadData()?

I’ll try that… Still, it seems like a huge issue somewhere, probably in mesa… ugh.

I did try allocating the ofImage outside of the loop, but it makes no discernible difference. It really seems to be reducible to ofTexture::loadData(). I’ve installed the -dbg packages for the debug symbols for the mesa packages, but I still can’t see what calls inside loadData() are causing the issue, only memory locations. Which debug packages need I install to see what is going on? I installed:

freeglut3-dbg
libc6-dbg
libegl1-mesa-dbg
libegl1-mesa-drivers-dbg
libgl1-mesa-dri-dbg
libgl1-mesa-glx-dbg
libglapi-mesa-dbg
libgles2-mesa-dbg
libopenvg1-mesa-dbg
libwayland-egl1-mesa-dbg
libxcb-glx0-dbg

I’ve uploaded my callgrind output here. I don’t know enough to understand what part of the loadData() implementation could be at fault.

I’m showing a work that uses this code at ISEA in Dubai, and leaving in a couple days. I think I’ll have to admit defeat and forgo the smooth transitions I had working at a higher frame-rate. Disappointing to say the least.

I will have to figure out this mess at some point though…

Thanks for your help Arturo and Kyle.

not sure what might be going on i’m on ubuntu 14.10 and the performance seems to be the same as before, i have the standard driver though

if you are using OF from the master branch in github, you can use ofBufferObject to upload the data, which should be faster than using ofTexture loadData and can be even done in a different thread. if you are on 0.8.4 you can use https://github.com/arturoc/ofxPBO which also uses gl buffers to upload textures in different threads.

What versions of the mesa packages do you have installed? I can try installing those particular versions. Since I changed so many variables, I don’t expect it’s the NV driver, but some strange interaction between NV and mesa. I see the implementation of loadData() is quite different in the nightly build, so I’ll try that first and then have a look at PBO.

Do you have any example code using the PBO?

If you’re using the NV driver and Mesa, then you’re on the open source drivers, right? Have you tried with the proprietary ones?