Bypassing OpenGL render pipeline?

I have some projects coming up that really require very low latency. And anyway, I’ve always been a little bit perplexed by this:

OSX is clearly able to deal with my mouse movements with very low latency, whereas OpenGL/oF takes multiple frames to render anything through the usual pipeline. Hence the lag in tracking my mouse. This is especially noticeable and frustrating when I’m trying to pass video through the system from, say, a 240fps camera (latency: ~4ms) and it takes 100ms to render it to screen. Maybe this isn’t really an oF question? But is there any way to push pixels out to screen these days (in OSX or Windows) that doesn’t involve the gpu pipeline?



Hi @Jez,

There’s always a bit of latency between submitting frames on the CPU and frames being displayed on screen. In OpenGL, to get the smoothest framerates, some GPUs run up to 3 frames behind the CPU submitting the draw commands.

But all in all, the CPU->GPU latency should not be more than 16 to (at very worst) 48ms. It’s possible that most of the latency you’re experiencing comes from how the video signal is grabbed from the camera, copied around in memory, and submitted to the GPU.

Also, the GPU has to wait for the monitor to signal that the scanline has reached the bottom of the screen before drawing the next frame (v-sync). If you don’t want that, you can:

  1. Disable v-sync, both in your driver settings and in openFrameworks. For openFrameworks, that’s a line ofSetVerticalSync(false); in ofApp::setup().
  2. Also, set your framerate to the maximum: ofSetFrameRate(0);
  3. Use a monitor with G-Sync and an NVIDIA card, which can give the impression of slightly more responsive frame rates, because these monitors refresh as soon as a frame is presented rather than at fixed intervals.
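The two openFrameworks calls above would sit in setup() roughly like this (a minimal sketch of the settings, assuming a stock ofApp; it needs an openFrameworks project to build):

```cpp
#include "ofApp.h"

void ofApp::setup(){
    // Don't wait for the vertical blank before swapping buffers:
    // lower latency, at the cost of visible tearing.
    ofSetVerticalSync(false);

    // A target of 0 removes openFrameworks' frame-rate cap, so
    // update()/draw() run as fast as the hardware allows.
    ofSetFrameRate(0);
}
```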

This should bring down CPU->GPU latency by a couple of milliseconds, but the image will most likely show tearing.

I’m not sure there is a way to bypass OpenGL completely. In the latest version of OS X you could perhaps write a custom renderer and use Metal to render to screen, but even then the trickiest part would be making sure the video memory is copied no more than once, and grabbing your video signal directly into a Metal texture, which is not trivial and not something openFrameworks currently supports.

Thanks, tgfrerer. I’ll bet I could shave a few ms by disabling vsync and such, but I’m more concerned with the 3 or 4 frames (64+ ms?) of OpenGL pipeline activity that I’m pretty clearly seeing in my system. (See the lag between my mouse input/OSX draw and the OpenGL rendering.)

I posted a question to StackOverflow which led me to this:

I’ll report back if I can get anything to work…

Hi @Jez,

That latency looks a little strange. Can you post the code to your app?

Latency when uploading video data to a texture often isn’t due to OF, but to problems with CPU -> GPU syncing.

Whenever a buffer or texture is updated (copying from system RAM to GPU memory, for example), the upload stalls the GPU until it has finished, so that the texture is safe to use.

If you only have 1 texture that the video data is uploaded to each frame, then what happens is that as soon as you use that texture in a draw, everything stalls until the upload is finished so that the updated texture can be used.

Stalling the GPU is something that you want to avoid whenever possible. A lot of engine optimizations revolve around reducing GPU stalls and CPU->GPU synchronization points. One way around this is to use 2 or 3 buffers (you may have heard the term multi-buffered before).

If you have 2 textures that you are uploading the video to, you can do something like this:

Frame 1:
Upload video to Texture 1
Draw Texture 2

Frame 2:
Upload video to Texture 2
Draw Texture 1

Frame 3:
Upload video to Texture 1
Draw Texture 2

And so on… You can profile this and see if cycling through 2 textures is enough. I’ve found that I needed 3 in one project to avoid stalls. You’ll always be drawing a video data frame that’s 1 or 2 frames old, but that’s ok. In fact, all modern game engines do this for rendering - UE4 for example is always 1 or 2 frames behind in rendering compared to the game world state.

Arturo has a good example of using Pixel Buffer Objects (PBOs) to download a texture (you’ll see he has two buffers, a front and a back buffer). You can do something similar for uploading to a texture instead of downloading from it. Here’s his example:

The general idea with rendering engines is to keep the CPU busy preparing the current frame while the GPU is busy rendering the previous frame. If all is designed well, they line up :wink:




100ms of latency does sound a little excessive. This is all I’m doing:


  Frame *frame = camera->GetLatestFrame();



I measured roughly 100ms of latency with a high-speed camera and a clap test. That’s 100ms between clapping in meatspace and clapping in screenspace. The camera was streaming in at 240fps, so I can see it adding 5+ ms of latency, but… 50ms?

(Note: When I turn vsync off and my app runs around 240fps, tearing all over the place, my meatspace->screenspace latency drops to about 12ms, which is magical. I’m going to take this to mean my camera latency is really only around 5ms.)

Maybe that 100ms latency is a red herring. Even without the camera and textures in play, I’m still seeing 3 or 4 full frames of latency on that mouse latency test video I posted above.

I think I understand what you mean about stalling the GPU, and I can imagine that this would have an impact if I was rendering a complex scene in a game. But when updating a texture only takes 2 or 3ms and that’s ALL I’m doing, is this really going to shave entire frames off my latency?

I asked initially about pushing pixels into a buffer or texture because I thought there might be a way to access a video buffer somewhere a frame or two AFTER a tex.draw() call. Those pixels must be hiding SOMEWHERE in the pipeline, right?

I read through this:

…but couldn’t tell if there was anything here to help me. It looks like all of the optimizations anyone can discuss have to do with getting data into a buffer efficiently before it gets drawn. The trouble point for me is how to deal with the 3 or 4 buffer flips that happen after I draw.

I haven’t been able to get setDoubleBuffering(false); working in oF 0.8.4. (I just get a black screen.) With the new 0.9 beta it works (I mean: I can successfully draw to screen and see it), and I think it knocks me down to 3 frames of latency instead of the 3-4 I see normally.

Is this a NaturalPoint TrackIR camera by any chance? I’ve worked with them for IR tracking and they work super well.

Do you need 240fps? Or can you make do with less? If you’re only ever going to render at 60fps, can you try using 100fps (that’s what I used for the TrackIR camera and it worked well).

What resolution are you using for the camera?

Are you working in greyscale mode with the camera?

Try passing parameters into your Rasterize call, and check the frame for NULL before using it:

in setup():
m_bitmapData = new unsigned char[ m_width * m_height ];
std::fill( &m_bitmapData[0], &m_bitmapData[ m_width * m_height ], 0 );

in your update() or draw():
CamLib::Frame* frame = camera->GetLatestFrame();

if ( NULL != frame )
{
    frame->Rasterize( m_width, m_height, m_width, 8, m_bitmapData );

    // load contents of m_bitmapData into texture here

    // draw your frame
}
Your issue is actually one of desynchronisation.
The solution is simple: use glFinish() and glFlush().

You see, most people are unaware that the GPU is not required to complete its workload before additional work is allocated by the CPU, even if one of those commands is a screen swap / frame present…

With v-sync enabled and no explicit synchronization, you’re left observing stale data that was submitted to the GPU many frames ago (this is why your perceived latency disappeared when you disabled v-sync).

Eventually your GPU’s command buffer fills up, which is why you’re only ~5 frames behind. Calling glFlush() after you’re done drawing, and then glFinish() after calling swap on your window, is all you need to ensure you begin drawing each frame with very fresh data (the user’s mouse position, etc.).

The very best results in low-latency rendering come from timing and synchronising your CPU code manually, so that both CPU and GPU are entirely idle until the very last moments before each screen swap. At that point the keyboard and mouse inputs are read, the screen is cleared and drawn, and the extremely fresh frame is immediately presented on that screen swap (ensuring no tearing).
The very best possible results in low latency rendering are achieved by timing and synchronising your CPU code manually such that both CPU and GPU are entirely idle until the very last moments before each screen swap, at which point the keyboard and mouse inputs are read, the screen is cleared and drawn and the extremely fresh frame is immediately presented on this screen swap ( ensuring no tearing )