A couple of tips regarding using OpenCL

Following the “1 million particles” system demo, I built a particle system in OF with OpenCL. The video is on Vimeo and in another post here if you didn’t see it. It uses ofxKinect and OpenCL. My wording here may not always be “correct”, but I try to use common-sense language where possible.

Some tips I came up with for OpenCL, in no particular order.

  1. Make sure you really understand what is happening in the OpenCL calls. There is very little useful documentation (other than the official Khronos material), and most of it concerns the minute details of cache optimization. Although that matters down the road, the major speed-ups come before that step. For instance, I was calling
  
openclA.readBuffer(numBytes, particlesIn, particles);  

which reads numBytes bytes from the OpenCL (server) buffer particlesIn into the host (OF) buffer particles. I was doing this because I thought that each time the OpenCL kernel was called it read the state of particles into the OpenCL particlesIn buffer. I was wrong and hadn’t understood what was going on, so the read above was incredibly expensive and did nothing of use. It’s an unusual case, but it showed that I didn’t understand the code I was calling, even though I thought I did and it worked perfectly all the time.

  2. Make sure you understand the bottlenecks. Memory transfer is dead slow even on super fast new machines (this is hopefully where we will see leaps forward in future hardware). I had 2 million particles, each around 28 bytes in size, so around 56 MB. Bus bandwidth is, I believe, in the gigabytes-per-second range for PCIe: a theoretical 8 GB/s per direction for a PCIe 2.0 x16 slot, so 8192 MB/s at 60 FPS gives a theoretical maximum of about 136 MB per frame one way. Either way, 56 MB is a big chunk of a frame at 60 FPS. Moving the 2M particles onto the video card cost almost as much time as actually computing the particle state in OpenCL.

So I changed the OpenCL target to the CPU. I run a dual quad-core Mac; it was still very fast, with no moving the particles across the PCIe bus onto the video card and, of course, no equally expensive move back. I guess this goes hand in hand with point 1. It’s worth noting, however, that in tests I found OpenCL on the CPU (server side) still far faster (maybe 10x) than straight OF C (host). This is because it natively runs across all compute units (OpenCL talk for each little processor, i.e. each core on the CPU). A 10x speed-up is immense when it comes to optimization, far greater than any other optimization you will do (albeit for quite a bit of work).

  3. Write your OpenCL code as a C function (or C++ class, whatever) first, and get it working and precisely understood before beginning a stepwise move to OpenCL. For the first few iterations of the OpenCL code, also update the C code, though it soon becomes expensive and slow to work like that.

  4. OpenCL is difficult to debug, has almost zero compiler help (i.e. your code just doesn’t compile, no reasons given), and has very little online help.

  5. Remember that memory transfers are expensive: think about what info needs to go to OpenCL (the server) to do the computation, then what needs to come back to OF C/C++ (the host) to do whatever needs doing, i.e. to draw. In my example I need 28 bytes of data per particle to compute the particle’s movement (state).

This is the structure I need to pass to OpenCL (in fact a giant 2M-element array of these structures):

  
typedef struct{  
    Vec3 pos;  
    Vec3 accel, vel;  
    float nvel;  
      
} Particle;  

I can’t do the work needed with less than these 28 bytes. However, I then have OpenCL copy the results of each particle computation into this structure:

  
Vec3 pos;  
    ilRGBA col;  

which then goes directly into a VBO. Move the minimum amount of data you possibly can. NOTE: there is a method where the OpenCL computation is done on the GPU and the results move directly into a VBO within the GPU, without any back-and-forth movement. This would probably be faster, and more so with a faster video card.

  6. I like the idea of Fire & Forget. Try to write (or, even earlier, visualize) your system with a single input to OpenCL on setup, then grab the results back each frame. As a side note, I eventually moved to this. I drop the particle buffer into OpenCL on setup, then each frame instruct it to move the particles (compute their state) and store the results within the OpenCL buffer, and grab back only the results needed to draw.

Note: you might then ask why I don’t move the whole computation onto the graphics card (and pipe straight out to the VBO, never needing to transfer memory between the host and graphics card after the initial setup). The simple answer is that my graphics card’s OpenCL baulks at the 56 MB. If I weren’t so lazy I would try to reduce that so it did fit, as the saving would be (I think) half a frame per frame.
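The Fire & Forget pattern can be sketched in plain C++ without any OpenCL at all, just to make the data flow visible. Everything here is a hypothetical stand-in: FakeDevice is not an OpenCL or ofxOpenCL type, the "device" memory is simply a second vector, and the update is a placeholder for the real particle computation. The point is only the shape: one upload on setup, an in-place update each frame, and a small read-back for drawing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for device-side buffers plus transfer counters.
// No real OpenCL calls are made.
struct FakeDevice {
    std::vector<float> state;     // [position, velocity] per particle
    std::vector<float> drawData;  // position only, one float per particle
    std::size_t bytesUploaded = 0;
    std::size_t bytesDownloaded = 0;

    // Setup: copy the full state across once.
    void upload(const std::vector<float>& hostState) {
        state = hostState;
        drawData.assign(hostState.size() / 2, 0.0f);
        bytesUploaded += hostState.size() * sizeof(float);
    }

    // Per frame: the "kernel" updates state in place; nothing crosses the bus.
    void step(float dt) {
        for (std::size_t i = 0; i < drawData.size(); ++i) {
            state[2 * i] += state[2 * i + 1] * dt;  // placeholder update
            drawData[i] = state[2 * i];
        }
    }

    // Per frame: bring back only what drawing needs.
    void readDrawData(std::vector<float>& hostOut) {
        hostOut = drawData;
        bytesDownloaded += drawData.size() * sizeof(float);
    }
};

// Runs `frames` frames and returns the device so the counters can be inspected.
FakeDevice runFireAndForget(std::size_t particleCount, int frames) {
    std::vector<float> hostState(particleCount * 2);
    for (std::size_t i = 0; i < particleCount; ++i) {
        hostState[2 * i] = 0.0f;      // position
        hostState[2 * i + 1] = 1.0f;  // velocity
    }
    FakeDevice device;
    device.upload(hostState);         // once, on setup
    std::vector<float> toDraw;
    for (int f = 0; f < frames; ++f) {
        device.step(1.0f);
        device.readDrawData(toDraw);  // each frame, draw data only
    }
    return device;
}
```

However many frames run, bytesUploaded stays at the size of one full state copy, while bytesDownloaded grows only by the size of the draw data per frame.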

For those not used to optimization: in general it’s worth bearing in mind that optimizations that give back a tenth or more of a frame for a few hours’ work are immense. My background is console games. As an example, when the PS3 and 360 first came around as the “next generation”, vast amounts (as in tens of thousands) of coder hours were spent trying to build job schedulers and message handlers, and dealing with the threading issues involved in the then-new multiprocessor units. To some extent these have never been totally solved for the current generation of consoles, for various reasons. OpenCL, however, changes all that: in a couple of weeks you can harness the vast amount of power sitting in your multiprocessor (compute unit) computer or video card. It is the cheapest mass optimization I have ever seen.

My thought is that what you are seeing with OpenCL (and CUDA, the more advanced but proprietary NVIDIA system) is a once-in-a-decade-or-two giant leap forward in computing. As an example, see this demo of particle turbulence in real time, running on NVIDIA hardware (nasty closed CUDA though, not shiny open OpenCL): http://www.youtube.com/watch?v=RuZQpWo9Qhs&feature=player-embedded

It is astonishing. This just freaks me out.

PS… Particle turbulence is what I am currently working on: bringing this other paper, http://www.cs.cornell.edu/~tedkim/WTURB/ (not nearly as good but still impressive), over to Xcode/OF and then OpenCL.

I think that’s about it. Hope it helps.

PS… in case anyone is interested, I am sure that the video above is totally based on this paper http://graphics.ethz.ch/~tpfaff/data/sga10.pdf, as I think one of the authors of the system is also an author of the paper, and the paper talks about a similar real-time capability. Unfortunately, whereas the wavelet method has been released as a code sample, the anisotropic method (the car one) has not, and the math is pretty crazy, so there is no chance of a mere mortal building code from the paper…volunteers, please step forward. As a further note, the released wavelet method code is under GPL 3, which causes some headaches in the long term too.

nik

thx for the insight. appreciated. :slight_smile:

Does anyone have any experience getting OpenCL / openFrameworks / and Linux to play nicely? I have gotten ofxOpenCL to work on my mac, but not all of my computers are Macs…