A guide to speeding up your oF app with Accelerate (OSX & iOS)

Hi oF! This is a short primer on the Accelerate framework, what it is, why you would use it, and how to do so.

What is it?

Accelerate is a collection of functions and objects you can use to get a big speed boost when you’re working with sets of data (arrays, vectors, etc). You can think of it as a collection of pre-made ‘for’ loops that are built to make the most out of your CPU (if “SIMD” means anything to you, Accelerate makes use of SSE on Intel processors and NEON on iOS). These functions do all manner of standard arithmetic operations (“multiply everything by 5”), analysis (“what’s the biggest value?”), utility (sorting, absolute values, casting…) and your standard DSP stuff (FFT, BiQuad filtering, correlation, convolution…).

Why / When?

Accelerate fills the gap between your everyday ‘for’ loop and more involved GPU-based techniques like OpenCL. Working with audio is the most ideal scenario in my experience (i.e. it never needs to touch the GPU, there’s usually a lot of data flying around, and it needs to be done ASAP). That said, any time you’ve got to make some numbers happen, it’s probably worth it to consider using Accelerate.

Another reason to use Accelerate is if you’re doing a fair amount of calculations on an iOS device, since it takes power usage into account. As in, your app will typically use less battery if you do your calculations with Accelerate.

When NOT to use Accelerate is if you’re working with data that’s already on the GPU (e.g. textures). In that case you’ll probably be much better served by a shader. A hefty portion of Accelerate is the vImage library, which I personally haven’t found too much use for because of this. This guide focuses entirely on the vDSP part of Accelerate.


How?

First, add Accelerate.framework to your project. Then, #include <Accelerate/Accelerate.h> in files where you want to make use of it.

Here’s an example of an Accelerate function. This will multiply everything in the array “input” by 5 and store the results in the array “output”:

const int array_size = 3;  
float input[array_size] = {1,2,3};  
float output[array_size];  
float factor = 5;  
vDSP_vsmul(input, 1, &factor, output, 1, array_size);  

So what’s worth noticing here?

  1. Accelerate’s function names look like a particularly nerdy cat walked across your keyboard.
  2. Accelerate typically doesn’t work “in place”. Meaning, operations usually take values from one array, work on them, then store the results in another. (EDIT: It seems you can actually use the same array as both input and output, though the Accelerate docs don’t seem to mention this explicitly).
  3. What’s up with those 1s in the function call?

First, the function names. There are two terms in particular that you’ll need to know to work with Accelerate:

scalar = one value
vector = one or more values

Note that Accelerate’s definition of “vector” is not the same as C++'s std::vector (though a std::vector can work as an Accelerate vector). As far as Accelerate is concerned, a “vector” is just a pointer to some values.

Breaking the function name vDSP_vsmul into parts, you get:

  • “vDSP” = This function is part of the “vDSP” section of Accelerate
  • “v” = it works on vectors
  • “s” = it uses a scalar
  • “mul” = it does multiplication

Knowing this, it stands to reason that vDSP_vsadd and vDSP_vsdiv are the same as vDSP_vsmul, but do addition and division respectively.

A couple more characters Accelerate uses:

  • D = works on doubles (as opposed to floats). Ex. vDSP_vsmulD
  • i = works on ints. Ex. vDSP_vsaddi

Secondly, Accelerate typically operates “out of place”. This means that just about every Accelerate function will take one or more inputs, and store the results somewhere else. When Accelerate requests an “input vector” or an “output vector” you can use one of a few things. Namely:

  • arrays
  • a std::vector (to get the pointer Accelerate wants, use &myVector[0])
  • some other chunk of contiguous data (e.g. ofPixels)
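As a rough sketch of the std::vector case: the helper below (scaleInto is a made-up name, not part of oF or Accelerate) scales one std::vector<float> into another. It calls vDSP_vsmul when built on an Apple platform and falls back to a plain loop elsewhere; the fallback is only my assumption so the snippet compiles anywhere. For a non-empty vector, data() gives the same pointer as &myVector[0].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

// Hypothetical helper: out[i] = in[i] * factor.
// The output vector must already be the right size, because Accelerate
// writes through a raw pointer and never resizes anything for you.
void scaleInto(const std::vector<float>& in, float factor, std::vector<float>& out) {
    assert(out.size() == in.size());
    if (in.empty()) return;
#if defined(__APPLE__)
    // Same call as in the array example, with pointers taken from the vectors.
    vDSP_vsmul(in.data(), 1, &factor, out.data(), 1, in.size());
#else
    // Portable fallback (assumption: only here so this builds off Apple platforms).
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * factor;
    }
#endif
}
```

Usage would be something like: std::vector<float> out(in.size()); scaleInto(in, 5.0f, out);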

Thirdly, what’s with those extra 1s in the vDSP_vsmul call above? In case you forgot, it looked like this :

vDSP_vsmul(input, 1, &factor, output, 1, array_size);  

The 1s here represent the “stride” of the data in the array. A stride of 1 means that all of the values are right next to each other in memory. In ASCII terms, this:

[ v ][ v ][ v ][ v ][ v ][ v ][ v ]

Where v represents a value. Why does Accelerate bother with this? Well, this lets you extract certain values from a dataset that uses a more complex packing scheme. For example, an ofPixels object can represent RGB data like this in memory:

[ r ][ g ][ b ][ r ][ g ][ b ][ r ][ g ][ b ]

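If you wanted to operate on just the red channel, for example, you could pass in a stride of 3. This would mean “work on every 3rd value”. The element count you pass is then the number of values actually processed (the number of pixels), not the length of the whole array.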
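To make the stride idea concrete, here is a plain-loop model of what a strided call computes. vsmulModel is a hypothetical stand-in that copies the argument order of vDSP_vsmul; it is not how Accelerate is implemented, and it only handles positive strides.

```cpp
#include <cassert>
#include <cstddef>

// Plain-loop model (hypothetical) of a strided vector-scalar multiply:
// reads every inStride-th value, writes every outStride-th value.
// count is the number of values processed, not the array length.
void vsmulModel(const float* in, std::size_t inStride,
                const float* factor,
                float* out, std::size_t outStride,
                std::size_t count) {
    assert(inStride >= 1 && outStride >= 1);
    for (std::size_t i = 0; i < count; ++i) {
        out[i * outStride] = in[i * inStride] * (*factor);
    }
}
```

With 3 RGB pixels packed as 9 floats, vsmulModel(rgb, 3, &factor, red, 1, 3) touches only the red values and packs the results tightly into red. Starting from rgb + 1 instead would pick out the green values.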

I’ll demonstrate some more interesting uses of Accelerate in a bit. But first! About that “speed” thing.

Here are the results of a simple test I ran on my laptop (MacBook Pro, mid 2009, Core 2 Duo, 2.66GHz), using the “Release” Xcode build scheme. The test loads some random values into an array, finds the max value, then divides the entire array by the max value to map it from 0 to 1. The times are in seconds (i.e. 1.0 would be one second).

Values: 220500  For loop: 0.0015861  Accelerate: 0.000147832  
Values: 441000  For loop: 0.0032297  Accelerate: 0.000890234  
Values: 661500  For loop: 0.00487999 Accelerate: 0.00182216  
Values: 882000  For loop: 0.00656921 Accelerate: 0.00307542  
Values: 1102500 For loop: 0.00826815 Accelerate: 0.0034329  
Values: 1323000 For loop: 0.014751   Accelerate: 0.00446377  
Values: 1543500 For loop: 0.0115801  Accelerate: 0.00548144  
Values: 1764000 For loop: 0.0130506  Accelerate: 0.00529263  
Values: 1984500 For loop: 0.0146551  Accelerate: 0.00612535  
Values: 2205000 For loop: 0.016566   Accelerate: 0.00694328  
Values: 2425500 For loop: 0.0183522  Accelerate: 0.00821662  
Values: 2646000 For loop: 0.0202095  Accelerate: 0.00862035  
Values: 2866500 For loop: 0.0211966  Accelerate: 0.00892981  
Values: 3087000 For loop: 0.0236832  Accelerate: 0.00998177  

That’s a little over twice as fast with Accelerate. More complex operations with bigger data sets will show more of a benefit for Accelerate.
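The test code itself isn't shown here, so this is only a guess at what the “for loop” side of such a comparison could look like. normalizeByMax is a hypothetical name; it does the same two steps described above (find the max, divide everything by it).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical for-loop baseline: find the maximum of in[0..count),
// then write in[i] / max into out[i]. Returns the maximum it found.
float normalizeByMax(const float* in, float* out, std::size_t count) {
    assert(count > 0);
    float maxValue = in[0];
    for (std::size_t i = 1; i < count; ++i) {
        if (in[i] > maxValue) maxValue = in[i];
    }
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = in[i] / maxValue;
    }
    return maxValue;
}
```

Timing this against the Accelerate version (for example with std::chrono::steady_clock around each call, in a Release build) should give numbers you can compare on your own machine.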

So then, some more complex operations.

These examples assume that there are 3 float arrays that exist already, called A, B and C. A, B and C could also be std::vectors, in which case they are used as &A[0] instead of just passing them into the functions directly. These examples also assume that there is a variable called data_size, which represents the size of the arrays (in terms of elements, so a float[5] would have a data_size of 5).

Here’s the max value & divide functions I used for the test above:

float max_value;  
// store the maximum value from array A in variable max_value
vDSP_maxv(A, 1, &max_value, data_size);  
// divide each value in array A by max_value, store the results  
// in array B. (A is unchanged after this)  
vDSP_vsdiv(A, 1, &max_value, B, 1, data_size);  

This adds each value in A to the value in B with the same index, then stores the results in C. For example, if A = [2, 5] and B = [3, 10], this would make C = [5,15].

vDSP_vadd(A, 1, B, 1, C, 1, data_size);  

This generates an array which ramps from one value to another. For example, if data_size is 5, this would fill A with [5, 7.5, 10, 12.5, 15].

float start = 5;  
float end   = 15;  
vDSP_vgen(&start, &end, A, 1, data_size);
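If the ramp is hard to picture, this plain-loop sketch produces the same values as the example above. rampModel is a made-up name, and the formula is just linear interpolation between the two endpoints, which matches the [5, 7.5, 10, 12.5, 15] result.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plain-loop model of a linear ramp: out[0] is start,
// out[count - 1] is end, and the values in between are evenly spaced.
void rampModel(float start, float end, float* out, std::size_t count) {
    assert(count >= 2);
    const float step = (end - start) / static_cast<float>(count - 1);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = start + static_cast<float>(i) * step;
    }
}
```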

This calculates the average (mean) of the values in A.

float mean;  
vDSP_meanv(A, 1, &mean, data_size);  

This stores clamped values from A into B. For example, if A = [1,2,3,4,5], this would make B = [2,2,3,4,4].

float min = 2;  
float max = 4;  
vDSP_vclip(A, 1, &min, &max, B, 1, data_size);

This calculates the FFT for an audio signal stored in A. This assumes A is a buffer holding 1024 audio samples. FFTs with Accelerate should typically be for data sets whose size is a power of 2 (i.e. 256, 512, 1024, 2048…).

// Setup -------------
// You should do this once, and keep these variables for subsequent FFTs.
UInt32 log2N          = 10; // 1024 samples
UInt32 N              = (1 << log2N);
FFTSetup FFTSettings  = vDSP_create_fftsetup(log2N, kFFTRadix2);
COMPLEX_SPLIT FFTData; // holds separate real and imaginary arrays
FFTData.realp         = (float *) malloc(sizeof(float) * N/2);
FFTData.imagp         = (float *) malloc(sizeof(float) * N/2);
float * hammingWindow = (float *) malloc(sizeof(float) * N);
// create an array of floats to represent a hamming window
vDSP_hamm_window(hammingWindow, N, 0);
// FFT Time ----------
// Moving data from A to B via hamming window
vDSP_vmul(A, 1, hammingWindow, 1, B, 1, N);
// Converting data in B into split complex form, i.e. separate real and
// imaginary arrays. B holds plain real samples, so the (COMPLEX *) cast just
// makes even-indexed samples land in realp and odd-indexed ones in imagp,
// which is the packing vDSP_fft_zrip expects for real input.
vDSP_ctoz((COMPLEX *) B, 2, &FFTData, 1, N/2);
// Doing the FFT
vDSP_fft_zrip(FFTSettings, &FFTData, 1, log2N, kFFTDirection_Forward);
// calculating square of magnitude for each value
vDSP_zvmags(&FFTData, 1, FFTData.realp, 1, N/2);
// At this point, FFTData.realp is an array of 512 squared magnitudes (1024/2).
// Cleanup -----------
// You should do this only when you're done doing FFTs.
vDSP_destroy_fftsetup(FFTSettings);
free(FFTData.realp);
free(FFTData.imagp);
free(hammingWindow);

Here’s rudimentary pitch detection. This is an addendum to the previous FFT example. It picks up after the vDSP_zvmags call above (before cleanup). Note that if you want serious pitch detection, you’ll have to delve into the DSP world a bit more.

// Doing an inverse FFT. (FFT -> magnitude squared -> IFFT = autocorrelation, sort of)
vDSP_fft_zrip(FFTSettings, &FFTData, 1, log2N, kFFTDirection_Inverse);
// Storing the autocorrelation results in B
vDSP_ztoc(&FFTData, 1, (COMPLEX *)B, 2, N/2);
// Calculating the zero-crossings in B. A "zero-crossing" is when a
// signal goes from above 0 to below (or vice versa). Since the autocorrelation
// results stored in B swing above and below 0, this will provide
// a rudimentary pitch detection for the signal. Emphasis on rudimentary.
vDSP_Length lastZeroCrossing;
vDSP_Length zeroCrossingCount;
vDSP_nzcros(B, 1, N, &lastZeroCrossing, &zeroCrossingCount, N);
// At this point zeroCrossingCount is roughly double the number of
// oscillations in the N-sample buffer (2 zero crossings = 1 oscillation).
// Multiply the oscillation count by (sample rate / N) to get a pitch in Hz.
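To spell out the last step, here is a portable sketch of counting sign changes and turning the count into a frequency. Both function names are hypothetical, and countZeroCrossings is only a rough model of the idea, not a drop-in replacement for vDSP_nzcros.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: count how many times the signal changes sign
// between neighbouring samples.
std::size_t countZeroCrossings(const float* signal, std::size_t count) {
    assert(count > 0);
    std::size_t crossings = 0;
    for (std::size_t i = 1; i < count; ++i) {
        const bool previousNegative = signal[i - 1] < 0.0f;
        const bool currentNegative  = signal[i] < 0.0f;
        if (previousNegative != currentNegative) ++crossings;
    }
    return crossings;
}

// Two crossings make one oscillation; count samples last count / sampleRate
// seconds, so frequency = (crossings / 2) * sampleRate / count.
float crossingsToHz(std::size_t crossings, std::size_t count, float sampleRate) {
    assert(count > 0);
    return (static_cast<float>(crossings) / 2.0f) * sampleRate
           / static_cast<float>(count);
}
```

For the example above you would pass zeroCrossingCount, N and your audio sample rate (44100, say) to crossingsToHz. Expect the estimate to be coarse, since it can only resolve steps of half an oscillation per buffer.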

The vDSP reference and list of functions can be found here

Shouts to Golan Levin and Kyle McDonald for their DSP workshop at Eyeo.


very nice tutorial. regarding the fft calculations, have you compared this to fftw? i’m curious if it performs any better speed-wise.

I haven’t, though fftw is supposed to be the f-est fft in the w…

FWIW I’ve always found vDSP’s FFT fast enough that the bottleneck is elsewhere. I’d like to know what the results are if someone wants to run that test. I’ll take a look tomorrow morning otherwise.

EDIT : after a simple test comparing vDSP to FFTW (via ofxFft), it looks like vDSP is roughly twice as fast. This is by no means conclusive. Test app here : https://gist.github.com/3452792

great tutorial!

I just wanted to mention that anyone trying to eke out more clock cycles with the Accelerate framework should experiment with switching the compiler (I’ve been switching between LLVM GCC 4.2 and LLVM 3) as well as enabling OpenMP, etc. There are differences in speed as a result.

Also, if you allocate your arrays like:

float blah[10]  

you should remember that the memory needs to be aligned, (malloc aligns properly), so use this:

float blah[10] __attribute__ ((aligned));  
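If you'd rather not rely on a compiler-specific attribute, standard C++11 alignas does a similar job. This is only a sketch: the 16-byte figure is my assumption (it matches the 128-bit SSE/NEON register width), and isAligned is a hypothetical helper for checking a pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Standard C++11 way to ask for a specific alignment (16 bytes assumed here).
alignas(16) float alignedBlah[10];

// Hypothetical helper: true if ptr sits on a multiple of alignment bytes.
// alignment must be a power of two.
bool isAligned(const void* ptr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}
```

Calling isAligned(somePointer, 16) on your own buffers is a quick way to see what your allocator actually hands back.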

also, I’ve posted a small example here w/ accelerate:


I have a question about “Converting data in B from interleaved complex to split complex”. If A is a buffer holding 1024 audio samples, it means that A is not interleaved complex because it contains only the real part. Even if B is windowed, it shouldn’t be interleaved complex either.

That’s correct, my mistake! I based that comment directly off of the documentation for vDSP_ctoz as found in the vDSP reference.

That line treats B as if it is interleaved complex (hence the (COMPLEX *) cast), but at that point in time B is just windowed, linear audio data.
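In other words, the cast is just a packing trick. A plain-loop model of what that vDSP_ctoz call does with real input (input stride 2, output stride 1) might look like this; splitEvenOdd is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the packing step: even-indexed samples go to realp,
// odd-indexed samples go to imagp. pairCount is N/2 for N real samples.
void splitEvenOdd(const float* samples, float* realp, float* imagp,
                  std::size_t pairCount) {
    assert(samples != nullptr && realp != nullptr && imagp != nullptr);
    for (std::size_t i = 0; i < pairCount; ++i) {
        realp[i] = samples[2 * i];
        imagp[i] = samples[2 * i + 1];
    }
}
```

So nothing in B is “complex” in the mathematical sense at that point; the real samples are simply split into two half-length arrays.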

What about creating an array using operator new?

float *foo = new float[10];

Is its memory aligned by default, or do I need to use __attribute__((aligned))?

What about the stride, is there a maximum value? I read somewhere that 4 is the maximum for iOS devices but can’t recall where I got that information.

And just one more question: is there a correlation between the value of stride and performance for a single instruction?

Great tutorial by the way, I’ve refactored some segments of code to gain up to 15% faster execution times, thanks to this guide.