Volumetric rendering of real-time motion diffs, CPU vs GPU

Hey guys, I’ve been working on a time-distortion program for some time (haha), and it’s evolved into a volumetric renderer that operates on motion-diffed realtime video. I’ve been doing all of my processing on the CPU up until this point, which has been just fine with a 1.8GHz C2D; even on a dual-core Atom 330 it’s been pretty fast, even at 640x480 with my initial time-distortion algorithm. It let me rotate the “time cube” and slice through it at any arbitrary angle.

I had been doing this on the CPU and just realized I could replicate the effect really easily by using a 3D texture and setting the texcoords appropriately. The image quality is substantially better on the GPU because it automatically filters (interpolates) the texture. I wrote a test program that generates a random 256^3 3D texture and uploads it once, transforming only the texcoords every frame, and my framerate is ~300fps. The only problem is that in the real app I need to upload the 3D texture every frame with new video data, which is very slow.

My CPU volumetric renderer is reasonably fast at low resolutions but doesn’t scale well, since I only have a dual core. I haven’t tried GPU volume rendering yet, but I suspect GPU volumetrics would be super fast with a fragment shader sampling a 3D texture; once again, the bottleneck will be uploading the texture.

Here are some benchmarks:

Lenovo ThinkPad T60, 1.8GhzC2D, IntelGMA whatever
single slice CPU render @ 640x480: 30fps
single slice GPU render @ 128x128: 30fps
single slice GPU render @ 256x256: 10fps

128^3 volume render on the CPU: 20-28fps
256^3 volume render on the CPU: 3-10fps

I can’t get it to load 3D textures bigger than 256^3, probably due to the GMA’s crappy amount of graphics memory. I’ll try it on my Atom machine, which has an ION, to see if it fares any better. Hopefully it can handle larger textures and also upload faster; I’ll post back with results.

Enough talk, what’s your question!?
Does anyone know of any optimizations that can make uploading textures every frame more efficient?
Here’s a code snippet to show what I am doing:

void testApp::draw() {  
    if(use3dTexture) {  
        myFS->build3dTexture(vidTex3d); // copies the video data into a new buffer to be uploaded  
        glBindTexture(GL_TEXTURE_3D, tex3dHandle);  
        glTexImage3D(GL_TEXTURE_3D, 0, GL_RGB8, camWidth, camHeight, maxFrames, 0, GL_RGB, GL_UNSIGNED_BYTE, vidTex3d); // uploads the data  
        // draw full-screen quad with transformed tex-coords  
        glBegin(GL_QUADS);  
        glTexCoord3d(tl.x/(float)camWidth, tl.y/(float)camHeight, tl.z/(float)maxFrames);  
        glVertex2d(0, 0);  
        glTexCoord3d(tr.x/(float)camWidth, tr.y/(float)camHeight, tr.z/(float)maxFrames);  
        glVertex2d(ofGetWidth(), 0);  
        glTexCoord3d(br.x/(float)camWidth, br.y/(float)camHeight, br.z/(float)maxFrames);  
        glVertex2d(ofGetWidth(), ofGetHeight());  
        glTexCoord3d(bl.x/(float)camWidth, bl.y/(float)camHeight, bl.z/(float)maxFrames);  
        glVertex2d(0, ofGetHeight());  
        glEnd();  
    }  
}  

I know the bottleneck is the upload itself: if I comment out the step that copies my data into the buffer, I get no increase in fps, so the copy isn’t what’s slow.

Here are some images of the volumetrics at 256^3. I just waved my hands around a bit, using a ps3eye @ 125fps.

If you only need to upload a single slice (i.e. one frame) of data each time you draw, you can use the TexSubImage function for 3D textures, glTexSubImage3D:

This will result in a massive reduction in memory copying and bump your framerate up.
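A rough sketch of what that per-slice upload could look like. The struct and names here are hypothetical, and the GL call is shown as a comment since it needs a live context; the runnable part is just the ring-index bookkeeping:

```cpp
#include <cassert>

// Hypothetical helper: upload one new camera frame into a single slice
// of an existing 3D texture, instead of re-uploading the whole volume.
// The actual GL call would be:
//   glBindTexture(GL_TEXTURE_3D, tex3dHandle);
//   glTexSubImage3D(GL_TEXTURE_3D, 0,
//                   0, 0, zoffset,            // x, y, z offset into the volume
//                   camWidth, camHeight, 1,   // exactly one slice deep
//                   GL_RGB, GL_UNSIGNED_BYTE, framePixels);
struct SliceRing {
    int maxFrames; // depth of the 3D texture
    int next;      // slice the next frame will overwrite

    // Returns the z offset to pass to glTexSubImage3D and advances the ring.
    int acquireSlice() {
        int z = next;
        next = (next + 1) % maxFrames; // wrap around instead of shifting slices
        return z;
    }
};
```

This uploads camWidth * camHeight * 3 bytes per frame instead of camWidth * camHeight * maxFrames * 3.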

glTexSubImage3D sounds like it might do what I need. The problem is that I also need to shift the existing frames over, deleting the oldest and inserting the new one. I did some searching and found glPixelTransfer(), which takes arguments like GL_INDEX_SHIFT and GL_INDEX_OFFSET, but I’m unsure how to use them properly. I’ll test it out and see what I can come up with. Thanks!

Also, I tested the code on my ION ITX board: it can indeed handle larger textures, but uploading is just as slow.

That’s tricky and expensive no matter what hardware you use, since you’re moving a giant chunk of memory over by a little bit. If you can adapt your rendering algorithm, the ideal solution would be to treat your 3D texture as a circular buffer, writing each new frame into the oldest slice. You can then apply a shader at render time that knows which z-slice is the most recent and offsets and wraps texture coordinates accordingly. One issue you’ll have to deal with is interpolation at the slice boundaries, but the benefits are much greater efficiency and rendering detail, so it’s probably worth it.

So I tried glTexSubImage3D, and it is faster than replacing the entire texture: about a 2x fps increase.

Also, instead of somehow shifting the data, which I’m pretty sure is impossible, I just set the wrap mode to GL_REPEAT and shifted the texcoords appropriately. It works great with no performance penalty!
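The GL_REPEAT trick works because the wrap mode (e.g. glTexParameteri(GL_TEXTURE_3D, GL_TEXTURE_WRAP_R, GL_REPEAT)) makes the hardware do the modulo for you, so only the r (depth) texcoords move. A sketch of the coordinate math, where wrappedR is a made-up helper and not from the actual code:

```cpp
#include <cassert>

// 'newest' is the ring slice written last. Adding (newest + 1) / maxFrames to
// every r texcoord makes slice (newest + 1), the oldest frame, read as the
// start of the volume; coordinates past 1.0 wrap back around, which is what
// GL_REPEAT does in hardware. No slice data ever has to be shifted.
float wrappedR(float r, int newest, int maxFrames) {
    float offset = float(newest + 1) / float(maxFrames);
    float v = r + offset;
    if (v >= 1.0f) v -= 1.0f; // mirrors the hardware wrap for illustration
    return v;
}
```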

The GPU method is getting better; the CPU still beats it in raw performance, but the GPU’s image quality is much better, so it might be worth it. Next up is trying out GPU volumetrics… pray for me!

Amazing revelations!!!

I tried the new single-frame uploading on the Atom board with its beefy nVidia ION, and instead of 20fps at 256^3, I get >300fps. Memory access must be much more efficient than on the GMA.

I tried it at 512^3 and I still get over 150fps. AMAZING!

The only problem now is making sure I only upload a frame when I need to, which might be tricky because I am constantly capturing frames to a circular queue in a separate thread (a queue I apparently don’t need anymore, because my queue now lives on the GPU). Wow, GPUs are fun! The image quality at 256^3 on the GPU is easily better than 640x480 on the CPU due to its texture filtering, and now it’s about 10 times faster too!

Sorry for so many posts, but I am really excited about this! I made the changes to upload frames only when needed, or as many as have arrived at once, usually zero per frame, because the render thread runs much faster than the camera can capture. As an added bonus, the ION GPU supports arbitrary texture dimensions even in 3D, so I am unconstrained by powers of two, and there seems to be no performance hit.
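The “upload only when needed” step can be sketched roughly like this (hypothetical names; the per-frame upload would be one glTexSubImage3D call, shown as a comment since it needs a live GL context, and in the real app the queue also needs locking against the capture thread):

```cpp
#include <cassert>
#include <deque>
#include <vector>

// The capture thread pushes frames into 'pending'; draw() drains however
// many have arrived (usually zero) into successive ring slices of the
// 3D texture. Returns the number of frames uploaded this draw.
int drainPendingFrames(std::deque<std::vector<unsigned char>>& pending,
                       int& nextSlice, int maxFrames) {
    int uploaded = 0;
    while (!pending.empty()) {
        // glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, /*zoffset=*/nextSlice,
        //                 camWidth, camHeight, 1, GL_RGB, GL_UNSIGNED_BYTE,
        //                 pending.front().data());
        pending.pop_front();
        nextSlice = (nextSlice + 1) % maxFrames; // ring buffer, never shift
        ++uploaded;
    }
    return uploaded;
}
```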

As you can see in the images, the quality and framerate of the GPU method now beat the CPU, even on a lowly Zotac IONITX Atom 330 board. It’s all in the GPU, baby!
The GPU method on an ION is 16 times faster than rendering on the Atom CPU, and 8 times faster than rendering on a 1.8GHz C2D! This is all done without using any shaders at all.

Now I’ll need to look into shader land to do the volumetric rendering.

Thanks a lot, otherside, for telling me about the sub image uploading. I’ll post back here when I’ve made some progress with volumetrics.

I’ve finally got some time to look into GLSL to try and render volumes on the GPU.

I’ve found a raycasting tutorial which uses nVidia CG, so I need to translate this into GLSL:

I’m starting out by playing with super simple GLSL shaders from this tutorial:

I’m gonna be chugging away at this for a bit and posting my progress here, but if anyone has other references to look at or some ideas of how to integrate this with OF, I’d love to hear it!


You might be interested in
where I use a GLSL raycasting shader to render a series of frames of live video as a 3D volume.
The shader code and basic setup is based on the tutorial found here:

The key lies in the creation of the ray start and end textures.
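For anyone following along: the usual way those start and end textures are produced (in Peter Trier’s tutorial as well) is to render the volume’s bounding cube with its object-space position encoded as color, front faces into one FBO for the ray entry points and back faces (front-face culling) into another for the exit points. A minimal GLSL sketch of that position-as-color pass, not toneburst’s actual code:

```glsl
// Vertex shader: pass the cube's corner position straight through as data.
// Assumes the cube's vertices span [0,1]^3, so positions double as texcoords.
varying vec3 objPos;
void main()
{
    objPos = gl_Vertex.xyz;
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

// Fragment shader: write the interpolated position out as a color.
// Rendered with glCullFace(GL_BACK) this gives ray start points;
// with glCullFace(GL_FRONT) it gives ray end points.
varying vec3 objPos;
void main()
{
    gl_FragColor = vec4(objPos, 1.0);
}
```

Subtracting the two textures per pixel then gives the ray direction and length used in the raycasting loop.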

I’m afraid I can’t share the project with you unless you happen to use Quartz Composer on the Mac.

Hope this helps,


Hi toneburst,
I’ve seen your videos on vimeo–cool stuff, the time cube project you have is very similar to what I am interested in making.

I’ve recently gotten my hands on a 2.8GHz quad core AMD chip, and my CPU based renderer is pretty fast with 4 threads at 256x256x256.

I’m a bit too busy right now to focus on really getting my hands dirty with GLSL, but I’ve got Peter Trier’s blog post bookmarked; thanks for the tip! The GPU’s texture lookup function really makes this stuff smooth as butter. The CPU version comes out pretty pixelated as I’m not doing any multisampling for the textures to get higher framerate.

When I get some time to spend on this project again I’ll post here with my findings.

Alright, I’ve started writing a GLSL shader based on Trier’s CG shader. It’s pretty straightforward to convert to GLSL, except I’m not 100% sure I’m doing the translation from IN.pos correctly. Basically I used the built-in gl_Position inside the vertex shader and passed it to the frag:


varying vec4 pos;  
void main()  
{  
    // Transforming The Vertex  
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;  
    pos = gl_Position;  
    // Passing The Texture Coordinates Of Units 0 And 1 To The Fragment Shader  
    gl_TexCoord[0] = gl_MultiTexCoord0;  
    gl_TexCoord[1] = gl_MultiTexCoord1;  
}  


#extension GL_ARB_texture_rectangle : enable  
varying vec4 pos;  
uniform sampler2DRect backface;  
uniform sampler3D volume_tex;  
void main()  
{  
    float stepsize = 1.0;  
    // clip space -> window space: divide by w, map [-1,1] to [0,1], scale to pixels  
    vec2 texc = ((pos.xy / pos.w) + 1.0) / 2.0 * vec2(320.0, 240.0);  
    vec4 start = gl_TexCoord[0]; // the ray's start position is stored in the texture coordinate  
    vec4 back_position = texture2DRect(backface, texc);  
    vec3 dir = back_position.xyz - start.xyz;  
    float len = length(dir);  
    vec3 norm_dir = normalize(dir);  
    float delta = stepsize;  
    vec3 delta_dir = norm_dir * delta;  
    float delta_dir_len = length(delta_dir);  
    vec3 vec = start.xyz;  
    vec4 col_acc = vec4(0.0);  
    float alpha_acc = 0.0;  
    float length_acc = 0.0;  
    vec4 color_sample;  
    float alpha_sample;  
    for(int i = 0; i < 450; i++)  
    {  
        color_sample = texture3D(volume_tex, vec);  
        alpha_sample = color_sample.a * stepsize;  
        col_acc   += (1.0 - alpha_acc) * color_sample * alpha_sample * 3.0;  
        alpha_acc += alpha_sample;  
        vec += delta_dir;  
        length_acc += delta_dir_len;  
        if(length_acc >= len || alpha_acc > 1.0) break; // terminate if opacity > 1 or the ray left the volume  
    }  
    // Write The Accumulated Color To The Frame Buffer  
    gl_FragColor = col_acc;  
}  

The other problem I’m having is setting up the FBOs and textures within OF so that they get properly shipped out to the GPU. Here’s roughly what my draw looks like:

void testApp::draw(){  
    float angle = mouseX*360.0f/(float)ofGetWidth();  
    float size = ofGetHeight()/2;  
    // first pass: render the backfaces (ray end positions)  
    glTranslatef(camWidth/2-size/2, camHeight/2-size/2, -size);  
    drawQuads(1.0, 1.0, 1.0);  
    // second pass: bind the volume and draw with the raycasting shader  
    glBindTexture(GL_TEXTURE_3D, tex3dHandle);  
    myShader.setUniform("volume_texture", 2);  
    glTranslatef(camWidth/2-size/2, camHeight/2-size/2, -size);  
    drawQuads(1.0, 1.0, 1.0);  
}  

I’m also getting a pinkish color out of my FBO; even if I change the frag shader to output white no matter what, it still comes out a magenta-like color.

Ugh, shaders and FBOs and texture units, oh my! What a pain in the ass!

So the pinkish color came from the last vertex drawn being pink: when the FBO was drawn at the end, glColor was still set to pink, which tinted the output.

I’m still having problems with the shader, though: it only draws the cube’s faces and nothing inside the cube, which is the whole point.

I had to make some small changes to the shader code, but the problem was that the first fbo was not being linked to the shader properly. I’m not at my workstation now, so I dont have the code in front of me, but I’ll post an update with pretty pictures and code samples so we can all do volumetrics on the GPU. Oh, and its really fast. I was getting over 100fps on an ATI 5650 with a volume of 320x240x300, the image quality is excellent too. We’ll see how it fares on an nvidia ION gpu too.