FBO to ofImage or unsigned char :: very slow! Any advice?

Hi all OF lovers,

I am just finishing a custom version of CCV and reacTIVision.
It was running at 60 fps (on a 2006 MacBook Pro) with a PS3Eye, using two different shader workflows to detect fingers and fiducials.
Extremely promising!
Everything runs smoothly because I do the blurring, contrast, high-pass, etc. (the “pre-treatment” of the video) in FBOs before passing the result to the finger/fiducial detector.


The very bad news is that I now have to pass my resulting FBO through my detector, so I need the “pixels” on the CPU (an image or an “unsigned char*”).

I tried two different techniques:

image_fbo.grabScreen(0, 0, width, height);

or

unsigned char * pix = new unsigned char[width*height*4];
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pix);

Either way, my app gets extremely slow.
In the end, it is faster to do every step on the CPU than to do it on the GPU and download the result to the CPU…
That is a real shame…
It makes the FBOs and shaders useless (perhaps that is why the NUI Group did not use the GPU for CCV 1.5).

Any advice on that?

can you maybe try this addon and see if it helps? I find about a 10 fps improvement using PBO

also there’s some discussion here:

in terms of improving your speed, is there a way you can send a smaller image back? ie, draw into a smaller FBO? you can see big improvements if you can cut the size of the transfers down: cutting the resolution in half (in each dimension) cuts the data down to a quarter, etc.
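
To put rough numbers on that suggestion, here is a minimal sketch of the arithmetic, assuming a tightly packed readback with no row padding. The helper name `readbackBytes` is made up for illustration and is not an openFrameworks or OpenGL call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of any library): number of bytes moved by
// one tightly packed readback of a width x height image.
std::size_t readbackBytes(int width, int height, int bytesPerPixel) {
    assert(width > 0 && height > 0 && bytesPerPixel > 0);
    return static_cast<std::size_t>(width) * height * bytesPerPixel;
}
```

With these assumptions, a 640x480 RGBA frame is `readbackBytes(640, 480, 4)` = 1,228,800 bytes, while a 320x240 RGBA frame is 307,200 bytes, i.e. a quarter of the transfer.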

sorry, forgot the addon link:

requires this addon as well for the example:

Thank you, that is great.
I will try that addon.

So, to reduce the amount of data read back, could we use “grey image” FBOs?
How can we set up grey-image FBOs (and read them back)?
Since GL_RGBA has 4 values per pixel, using a “grey FBO” would reduce the amount of data by a factor of 4…

I will try… thanks.

the main advantage of using PBOs is that the download doesn’t block the main thread, but the speed at which the image is downloaded from the graphics card is still the same, so you’ll still get a big delay. in my experience, unless you have a really fast video card and bus, doing the image processing on the graphics card won’t make any difference, as you’ll lose the time you gain in the processing when downloading the image from the graphics card.

in this case the measure is not really how fast your app is running but how much time passes between getting the frame from the camera and getting the results from the analysis. using a PBO, or even doing the processing on the cpu but in a different thread, will give you higher fps for your application, but if the analysis can’t run in less than 16ms then at 60fps you are losing frames
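
The 16ms figure is just 1000 ms divided by 60 frames. As a rough sketch of that reasoning (the function names are hypothetical, and real pipelines with threads or PBOs will behave less neatly than this estimate):

```cpp
#include <cassert>
#include <cmath>

// Time available per camera frame, in milliseconds.
double frameBudgetMs(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Hypothetical estimate: camera frames that arrive, and are skipped, while
// one analysis pass lasting `analysisMs` is still running.
int framesSkippedPerAnalysis(double analysisMs, double fps) {
    assert(analysisMs >= 0.0);
    const int spanned =
        static_cast<int>(std::ceil(analysisMs / frameBudgetMs(fps)));
    return spanned > 1 ? spanned - 1 : 0;
}
```

Under this simple model, a 12 ms analysis at 60 fps fits inside one frame period and skips nothing, while a 20 ms analysis spans two frame periods and skips one camera frame per result.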

a solution would be to implement the detector in the gpu using openCL or cuda, that way you just need to download a few points instead of a full image from the graphics card which can make a big difference


Thank you for the good replies.

Yes, you are right…
Anyway, I am getting closer to making it worthwhile (processing in GLSL and reading the result back). Using a PBO gains a few fps, and so does using GL_RGB instead of GL_RGBA.
My graphics card (or something) gives an error with GL_LUMINANCE. A shame, because we could divide the data by 4…
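
One detail worth checking when moving from GL_RGBA to 3-byte or 1-byte formats: glReadPixels pads every row to GL_PACK_ALIGNMENT, which defaults to 4 bytes. The sketch below (helper names are made up for illustration) computes a safe buffer size for a given format. Whether this is related to the GL_LUMINANCE error above is not known.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in one row of a readback, rounded up to the pack alignment.
// packAlignment mirrors GL_PACK_ALIGNMENT (default 4 in OpenGL).
std::size_t packedRowBytes(int width, int bytesPerPixel, int packAlignment = 4) {
    assert(width > 0 && bytesPerPixel > 0 && packAlignment > 0);
    const std::size_t raw = static_cast<std::size_t>(width) * bytesPerPixel;
    const std::size_t a = static_cast<std::size_t>(packAlignment);
    return (raw + a - 1) / a * a;  // round up to a multiple of the alignment
}

// Safe destination buffer size for a width x height readback.
std::size_t readbackBufferBytes(int width, int height, int bytesPerPixel,
                                int packAlignment = 4) {
    assert(height > 0);
    return packedRowBytes(width, bytesPerPixel, packAlignment) * height;
}
```

For a 640-pixel-wide image every format is already a multiple of 4 bytes per row (1,228,800 bytes for RGBA, 921,600 for RGB, 307,200 for one channel). For a width such as 638, an RGB row is 1914 bytes but is padded to 1916 unless `glPixelStorei(GL_PACK_ALIGNMENT, 1)` is set first.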

So a question that I have now: is there a specific kind of shader and FBO for dealing with grayscale images only, i.e., only one value per coordinate in GLSL?

Another idea is to make my FBO RGBA at half the size and use some trick on the GPU to pack the pixels there, so that gray pixel (x, y) ends up at byte 4*(x + w*y) + col, where col is 0 for red, 1 for green, 2 for blue, 3 for alpha.
Surely somebody has done that before…
That would mean working with grayscale images inside RGBA FBOs half the size of the grayscale image.


So, after a long weekend of tests, I found a nice solution for running shader tasks on the GPU and reading back the results.
Thanks to this thread for the inspiration: the trick was to “encode” a 640x480 gray image into a 320x240 RGBA one.
On the GPU the processing happens in 640x480 RGBA textures, but to bring the result to the CPU I apply a shader that packs all the information of the resulting gray image into the first quarter of the image.

Then I decode that image in a simple loop.

Here are the shaders, functions and example:

void rgba_to_gray(unsigned char *src_rgba, unsigned char *dest){
		// converts a packed RGBA image (w/2,h/2) back into
		// the original GRAY image (w,h).
		// used with the shader gray2RGBA.frag .
		// width and height are the sizes of the GRAY image;
		// src_rgba and dest are both w*h bytes long.
		int w = width; 
		int h = height;
		int w2 = w/2; 
		int h2 = h/2; 
		unsigned char r,g,b,a;
		int k,x,y; 
		for (int i=0; i<w2*h2; i++) {
			x = i % w2; 
			y = i / w2; 
			k = 4*i; // offset of RGBA pixel i in src_rgba
			r = src_rgba[k]; 
			g = src_rgba[k+1];
			b = src_rgba[k+2];
			a = src_rgba[k+3];
			// decode the image: 
			dest[x+w*y] = r;               // dest[x,y] = r; 
			dest[x+w*y + w2] = g;          // dest[x+w2,y] = g; 
			dest[x+w*y + w*h2] = b;        // dest[x,y+h2] = b; 
			dest[x+w*y + w2 + w*h2] = a;   // dest[x+w2,y+h2] = a;
		}
}
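
A quick way to check the index mapping without a GPU is a CPU round trip. The sketch below is self-contained; `packGrayToRgba` is only a CPU stand-in for what the packing shader is assumed to compute (same quadrant-to-channel rule, same image orientation), and all function names here are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for the packing shader (an assumption about what it computes):
// quadrant (x <  w/2, y <  h/2) goes to R, (x >= w/2, y <  h/2) to G,
// quadrant (x <  w/2, y >= h/2) goes to B, (x >= w/2, y >= h/2) to A.
std::vector<unsigned char> packGrayToRgba(const std::vector<unsigned char>& gray,
                                          int w, int h) {
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    assert(gray.size() == static_cast<std::size_t>(w) * h);
    const int w2 = w / 2, h2 = h / 2;
    std::vector<unsigned char> rgba(gray.size());  // (w/2)*(h/2)*4 == w*h bytes
    for (int y = 0; y < h2; ++y) {
        for (int x = 0; x < w2; ++x) {
            const int k = 4 * (x + w2 * y);  // RGBA pixel (x, y) in the small image
            rgba[k + 0] = gray[x + w * y];
            rgba[k + 1] = gray[x + w2 + w * y];
            rgba[k + 2] = gray[x + w * (y + h2)];
            rgba[k + 3] = gray[x + w2 + w * (y + h2)];
        }
    }
    return rgba;
}

// Inverse mapping, using the same indices as the rgba_to_gray loop.
std::vector<unsigned char> unpackRgbaToGray(const std::vector<unsigned char>& rgba,
                                            int w, int h) {
    assert(w > 0 && h > 0 && w % 2 == 0 && h % 2 == 0);
    assert(rgba.size() == static_cast<std::size_t>(w) * h);
    const int w2 = w / 2, h2 = h / 2;
    std::vector<unsigned char> gray(rgba.size());
    for (int i = 0; i < w2 * h2; ++i) {
        const int x = i % w2, y = i / w2, k = 4 * i;
        gray[x + w * y] = rgba[k + 0];
        gray[x + w2 + w * y] = rgba[k + 1];
        gray[x + w * y + w * h2] = rgba[k + 2];
        gray[x + w2 + w * y + w * h2] = rgba[k + 3];
    }
    return gray;
}

// Packs and unpacks a synthetic pattern and reports whether it survives.
bool roundTripMatches(int w, int h) {
    std::vector<unsigned char> gray(static_cast<std::size_t>(w) * h);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        gray[i] = static_cast<unsigned char>((i * 7) % 256);
    }
    return unpackRgbaToGray(packGrayToRgba(gray, w, h), w, h) == gray;
}
```

Calling `roundTripMatches(640, 480)` exercises the same sizes as in the post. If the real readback comes out flipped vertically, the quadrant offsets would need adjusting, which this CPU-only check cannot detect.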

The shader ::

// packs a w,h GRAY image into a w/2,h/2 RGBA image 
// the rule is: 
// R channel holds the pixels x [0,w/2] y [0,h/2].
// G channel holds the pixels x [w/2,w] y [0,h/2].
// B channel holds the pixels x [0,w/2] y [h/2,h].
// A channel holds the pixels x [w/2,w] y [h/2,h].

// we read back the result by drawing the fbo into a w/2,h/2 RGBA image. 
// then, looping through those pixels, we recover a GRAY pixel vector of size w,h. 

uniform sampler2DRect tex;
uniform float width;
uniform float height;

vec4 vr = vec4(1,0,0,0); 
vec4 vg = vec4(0,1,0,0); 
vec4 vb = vec4(0,0,1,0); 
vec4 va = vec4(0,0,0,1); 

vec4 gray_to_rgba(vec2 pos){

   float w = width;
   float h = height;  
   float h2 = h/2.;
   float w2 = w/2.;     
   // offsets of the other three quadrants of the gray image
   vec2 pos_g = vec2(w2,0); 
   vec2 pos_b = vec2(0,h2); 
   vec2 pos_a = vec2(w2,h2);
   float cr =  texture2DRect(tex, pos).r;
   float cg =  texture2DRect(tex, pos + pos_g).r;
   float cb =  texture2DRect(tex, pos + pos_b).r;
   float ca =  texture2DRect(tex, pos + pos_a).r;
   vec4 col = vec4(cr,cg,cb,ca); 
   //vec4 col = vec4(cr,cg,cb,1); 
   //vec4 col = vec4(0,0,0,ca); 
   return col;
}

void main()
{
   vec2 pos = gl_TexCoord[0].xy; 
   float h2 = height/2.;
   float w2 = width/2.;  
   if( pos.x < w2 && pos.y < h2 ){
      // first quarter: one packed RGBA pixel per four gray pixels
      gl_FragColor = gray_to_rgba(pos);
   } else {
      gl_FragColor = vec4(0.,0.,0.,1.);
   }
}

An example of how to use it:

void do_grey_rgb(){

	ofClear(0,1); // we clear the fbo.		
	ofDisableAlphaBlending(); // IMPORTANT !!! the alpha channel carries data here.
	// drawing here the image from the camera (grey-scaled): 
	s_grey.begin(); // shader that packs the image as RGBA, using only w/2, h/2.
	s_grey.setUniform1f("width", width); 
	s_grey.setUniform1f("height", height); 
	int w = width; 
	int h = height; 
	glBegin(GL_QUADS);
	glTexCoord2f(0, 0); glVertex3f(0, 0, 0);
	glTexCoord2f(w, 0); glVertex3f(w, 0, 0);
	glTexCoord2f(w, h); glVertex3f(w, h, 0);
	glTexCoord2f(0, h); glVertex3f(0, h, 0);
	glEnd();
	s_grey.end(); // shader's end. 
	// this next fbo is half the size: 
	// we use it to draw only 1/4 of the previous fbo, 
	// which is the only info needed.
	ofClear(0,1); // we clear the fbo.
	// when the camera gets a new frame, read the image back from the GPU: 
	if(newFrame) {
		// this will decode the image and make one full-size grey image.
		// image_fbo_grey contains the result... 
	}
}
Hope this helps…
Now I can reach 60 fps (NO, THAT WAS NOT THE CASE… see below) with extensive GPU work on images from my PS3Eye…
Some source code soon…

Hmm, my problem remains: reading back the FBO goes from 2 ms to 12 ms when the GPU is doing intensive processing.
Here is some example code (for OF 0071):
fbo upload test
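
For anyone who wants to reproduce numbers like the 2 ms and 12 ms above, here is a minimal timing sketch using only the C++ standard library. The helper names are made up. One caveat: a synchronous glReadPixels has to wait for pending GPU work on that framebuffer, so the time measured around it includes that wait. Calling glFinish() before starting the timer is one way to separate the two costs.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the wall-clock time in ms.
template <typename F>
double elapsedMs(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Average over several runs, since single-frame timings tend to be noisy.
template <typename F>
double averageMs(F&& work, int runs) {
    assert(runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        total += elapsedMs(work);
    }
    return total / runs;
}
```

In an app, the lambda passed in would wrap the readback call, for example the glReadPixels line from the first post.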

So for the moment, no miracles with my old 2006 MacBook Pro.
Any comments on that?

Hi there, I had the same problem, though I wasn’t using any shaders. My workaround was simply to use a newer machine with faster memory. Your 2006 MBP probably has DDR2 RAM; mine is a 2009 white MacBook and also has DDR2. I first tried running my program on a newer MacBook Pro with DDR3 RAM, and it did run a bit faster: it went from something like 12-15 ms to 8-10 ms. Then I tried it on a desktop with similar but slightly lower specs (this machine also has the newer DDR3 RAM) BUT running Ubuntu Linux. I don’t know if it’s a driver thing or what, but on that machine I get around 1-2 ms every time, no matter what.

If you don’t mind using a different OS, I suggest you try a newer machine with Linux (or even install Linux on your MBP with Boot Camp and see if there’s a difference). If you do go this route and build a computer, get the fastest RAM you can buy (check the YouTube channels “Linus Tech Tips” and “Tech of Tomorrow” for reference and orientation on the fastest RAM out there); it will make a difference, trust me. Oh, and one last option: get the latest Haswell Core i7 with Iris Pro Graphics 5200. That chip comes with a 128 MB eDRAM cache, so if the FBO gets cached there, in theory you’d get the fastest performance.

Finally, if you are planning to get a new MBP for Christmas, get one of the newest 15-inch Retina models: even though only one of the two models comes with a discrete GPU, both come with the Core i7 with Iris Pro Graphics 5200.