OpenCL/CUDA optimisation notes in a paper (particles, turbulence)

Hi folks, I am currently on a mission to work out how to code this paper: http://graphics.ethz.ch/~tpfaff/data/sga10.pdf. My math is poor, and I have yet to find any implementations. The only lead so far is that one of the authors is now a research engineer at NVIDIA, working on their new particle turbulence engine for APEX (or whatever they call it): http://developer.nvidia.com/apex-turbulence. The car video is excellent to watch, especially once you realize it’s real time.

Anyway, whilst looking through other related papers I found this excellent advice on coding for CUDA/OpenCL: http://www.jcohen.name/papers/Cohen-Fast-2009-final.pdf. Starting on page 5, under “Implementation”, it basically discusses what to run on the GPU, what to keep on the CPU, and what not to do.

Good advice for sure.