Performance Proofs I - Arrays vs Vectors

Since I’ve got into trouble for posting these types of things in the middle of threads about other issues, I thought I’d start putting up some dedicated threads addressing performance proofs of common or uncommon C/C++ claims and counter-claims.

I’m hoping this won’t be taken the wrong way, or turned into Stack Overflow sledging sessions. I’m doing it purely in the spirit of curiosity.

I’m not a computer science graduate, so whenever I see someone on this forum, or some other C/C++ how-to/style guide, say ‘x’ is the best, or ‘y’ is for newbies, or ‘z’ is the most efficient, I want to do the sort of thing I imagine Comp Sci undergrads spend all day doing: namely, wrapping bits of code with timers and seeing which one goes fastest under a variety of conditions…

I’m hoping people have other code examples, demonstrations, proofs or improvements to testing procedures to offer.

…so vectors vs arrays…

This has come up a lot on the forums, especially in relation to particle systems, threads and http requests :wink: See for instance: topic=7248.0 and topic,1146.0.html

Arturo recently wrote a super-great-guide-to-vectors-vs-c-style-arrays and I was very surprised to see that vectors should perform as well as an array as long as you use them right.

So like a nerd I spent (wasted?) the day writing some code to test different sizes and types of allocations, deletions and manipulations using arrays and vectors. The src code is attached at the bottom.
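The basic pattern is just wrapping each fill in a timer and logging the elapsed microseconds. As a rough sketch (simplified, not the attached testApp.cpp verbatim, and assuming ofGetElapsedTimeMicros()/ofLog() for the timing and output):

#include "ofMain.h"

// time a single int fill of a (resized) vector and a new'd array, in microseconds
static void timeIntFills(int items1) {
    std::vector<int> v1;
    v1.resize(items1); // size the vector so operator[] is valid

    unsigned long long start = ofGetElapsedTimeMicros();
    for (int i = 0; i < items1; i++) {
        v1[i] = i;
    }
    ofLog(OF_LOG_NOTICE, "Fill (ints) vector [] access: " + ofToString(ofGetElapsedTimeMicros() - start));

    int * a1 = new int[items1];
    start = ofGetElapsedTimeMicros();
    for (int i = 0; i < items1; i++) {
        a1[i] = i;
    }
    ofLog(OF_LOG_NOTICE, "Fill (ints) array [] access: " + ofToString(ofGetElapsedTimeMicros() - start));
    delete[] a1;
}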

Some results with a little raw data and selected averages:

Results for items1 = 100 , items2 = 1 and 20 iterations of the test:

  
OF: OF_LOG_NOTICE: Fill (ints) reserved vector array-style[] access: 0.105263   
OF: OF_LOG_NOTICE: Fill (ints) reserved vector push_back-style access: 0.45   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector push_back-style access: 3.6   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector with resize 'optimisation' access: 6.8   
OF: OF_LOG_NOTICE: Fill (ints) array [] access: 0.05   
OF: OF_LOG_NOTICE: Fill (custom struct) reserved vector array-style[] access: 19.65   
OF: OF_LOG_NOTICE: Fill (custom struct) array [] access: 14   
OF: OF_LOG_NOTICE: Fill (unsigned char*) reserved vector array-style[] access: 7426.35   
OF: OF_LOG_NOTICE: Fill (unsigned char*) array[] access: 4779.7   

Results for items1 = 10000 , items2 = 5 and 20 iterations of the test:

  
OF: OF_LOG_NOTICE: Fill (ints) reserved vector array-style[] access: 10.1053   
OF: OF_LOG_NOTICE: Fill (ints) reserved vector push_back-style access: 24.2   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector push_back-style access: 51.4   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector with resize 'optimisation' access: 249.9   
OF: OF_LOG_NOTICE: Fill (ints) array [] access: 11.3   
OF: OF_LOG_NOTICE: Fill (custom struct) reserved vector array-style[] access: 1443.7   
OF: OF_LOG_NOTICE: Fill (custom struct) array [] access: 1343.4   
OF: OF_LOG_NOTICE: Fill (unsigned char*) reserved vector array-style[] access: 37032.6   
OF: OF_LOG_NOTICE: Fill (unsigned char*) array[] access: 24771.8   

Outlier was 33 for first fill of reserved vector.

Results for items1 = 100000 , items2 = 10 and 20 iterations of the test:

  
OF: OF_LOG_NOTICE: Fill (ints) reserved vector array-style[] access: 76.3158   
OF: OF_LOG_NOTICE: Fill (ints) reserved vector push_back-style access: 236.9   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector push_back-style access: 393.2   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector with resize 'optimisation' access: 2454.35   
OF: OF_LOG_NOTICE: Fill (ints) array [] access: 90.15   
OF: OF_LOG_NOTICE: Fill (custom struct) reserved vector array-style[] access: 15337.8   
OF: OF_LOG_NOTICE: Fill (custom struct) array [] access: 14368.1   
OF: OF_LOG_NOTICE: Fill (unsigned char*) reserved vector array-style[] access: 66839.1   
OF: OF_LOG_NOTICE: Fill (unsigned char*) array[] access: 44535.5    

Outlier was 280 for first fill of reserved vector.

Results for items1 = 1000000 , items2 = 20 and 20 iterations of the test:

  
OF: OF_LOG_NOTICE: Fill (ints) reserved vector array-style[] access: 983.053   
OF: OF_LOG_NOTICE: Fill (ints) reserved vector push_back-style access: 2280.05   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector push_back-style access: 6163.5   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector with resize 'optimisation' access: 25276.9   
OF: OF_LOG_NOTICE: Fill (ints) array [] access: 947.35   
OF: OF_LOG_NOTICE: Fill (custom struct) reserved vector array-style[] access: 196648   
OF: OF_LOG_NOTICE: Fill (custom struct) array [] access: 189263   
OF: OF_LOG_NOTICE: Fill (unsigned char*) reserved vector array-style[] access: 133421   
OF: OF_LOG_NOTICE: Fill (unsigned char*) array[] access: 89309   

Outlier was 4314 for first fill of reserved vector.

Results for items1 = 8294400 (ie., 1920*1080*4), items2 = 100 and 20 iterations of the test:

  
OF: OF_LOG_NOTICE: Fill (ints) reserved vector array-style[] access: 29833.6   
OF: OF_LOG_NOTICE: Fill (ints) reserved vector push_back-style access: 18624.9   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector push_back-style access: 72154.6   
OF: OF_LOG_NOTICE: Fill (ints) unreserved vector with resize 'optimisation' access: 200904   
OF: OF_LOG_NOTICE: Fill (ints) array [] access: 28400.8   
OF: OF_LOG_NOTICE: Fill (custom struct) reserved vector array-style[] access: 1.73499e+06   
OF: OF_LOG_NOTICE: Fill (custom struct) array [] access: 1.72908e+06   
OF: OF_LOG_NOTICE: Fill (unsigned char*) reserved vector array-style[] access: 651981   
OF: OF_LOG_NOTICE: Fill (unsigned char*) array[] access: 412695   

One funny thing to note is the very first time I fill a reserved vector using:

  
for (int i = 0; i < items1; i++) {  
	v1[i] = i;  
}  

I get an outlier value much higher than the average, hence excluding the first go…

So some questions/observations for comment by the even nerdier:

* Any ideas why the first iteration is so costly?
* Seems like for small allocations there is a very minor advantage to arrays (but so minor that the flexibility of the vector makes the advantage pretty much moot)
* Surprisingly, as we get into allocations of between 100000 and 1000000, the vector (reserved and accessed via []) actually outperforms the array until the upper end of the range (and again the difference is a matter of 20-40 microseconds over literally millions of accesses)
* But once we get into ‘pixel-sized’ allocations and accesses, ie 8294400+, the array begins to outperform the vector (reserved, [] access) by more significant amounts, but even more surprisingly ‘push_back’ access outperforms the array-style [] accessor for a vector by a non-trivial amount.
* Using custom data types rather than an integral type significantly changes the weighting of the results -> arrays consistently outperform the vector regardless of accessor by non-trivial amounts, particularly as the number of allocations rises (obviously ‘non-trivial’ is up for grabs here -> we’re talking microseconds for 100-1000 allocations, but it does get up to seconds for 100000+…)
* And for multidimensional arrays with ‘pixel-size’ allocations, [n][1920*1080*4] in these examples, the array is clearly far more efficient for any value of n (a rough sketch of the comparison I mean follows this list).
* Finally, and this is only anecdotal (I’m trying to work out a test for this), I’ve found that vectors of more complex types (ie., ofVideoPlayers) perform much worse than arrays of the same types -> I remember doing a lot of tests loading vectors of 20-30 videos vs doing the same with an array and getting much worse overall FPS…but I need to check this when I have a suitable number of videos on hand…
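For the multidimensional point above, this is roughly the shape of the comparison I mean (an illustrative guess at the pattern; the actual test is in the attached testApp.cpp):

#include <vector>

void timeMultiFills() {
    const int n = 10;                   // number of 'planes'
    const int pixels = 1920 * 1080 * 4; // one 'pixel-sized' allocation

    // vector-of-vectors version: each row is its own heap allocation
    std::vector< std::vector<unsigned char> > vv(n, std::vector<unsigned char>(pixels));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < pixels; j++) {
            vv[i][j] = (unsigned char)(j % 255);
        }
    }

    // c-style version: n separately new'd buffers
    unsigned char ** aa = new unsigned char*[n];
    for (int i = 0; i < n; i++) {
        aa[i] = new unsigned char[pixels];
        for (int j = 0; j < pixels; j++) {
            aa[i][j] = (unsigned char)(j % 255);
        }
    }

    // the fills above are what get timed; clean up afterwards
    for (int i = 0; i < n; i++) {
        delete[] aa[i];
    }
    delete[] aa;
}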

testApp.cpp

About the first access: it’s probably related to the fact that after the first access that memory gets copied into the cache, so subsequent accesses are much faster. That’s also a factor to take into account when doing this kind of test. If the array you allocate later ends up on the same memory page as the vector, it’s possible the array is already in the cache, so it will be fast even on the first access. Try changing the order of the tests to see if you get the same results.

To be completely accurate you should do each test in a different program: start the program, run the test once, repeat the program several times and take the average, so the cache won’t affect the results.

Are you working in release mode? If not, the results (with vectors especially) can vary greatly.

Also this:

  
  
v1.reserve(items1);  
for (int i = 0; i < items1; i++) {  
	v1[i] = i;  
}  
  

is wrong: if you reserve() you should be using push_back(). If you check the size afterwards you’ll see it’s still 0. It won’t crash, because the memory is allocated to the process, but you should be using resize() if you are going to use the [] operator.
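To make the difference concrete, here’s a minimal sketch of the two valid patterns:

#include <vector>

void fillExamples(int items1) {
    // pattern 1: reserve() allocates capacity but size() stays 0,
    // so elements have to be appended with push_back()
    std::vector<int> v1;
    v1.reserve(items1);
    for (int i = 0; i < items1; i++) {
        v1.push_back(i);
    }

    // pattern 2: resize() creates items1 value-initialised elements,
    // so operator[] is valid for indices 0..items1-1
    std::vector<int> v2;
    v2.resize(items1);
    for (int i = 0; i < items1; i++) {
        v2[i] = i;
    }
}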

About the problem with vectors + videoGrabbers, i’m pretty sure that cannot affect the fps at all; it must be something else in your code.

Try changing the order in the tests to see if you get the same results

Ok will do…although

To be completely accurate you should do each test in a different program, start the program, do each test once, run the program several times and get the average so the cache won’t affect the results.

…perhaps this is better, yes?

Are you working with release mode?

yes

if you reserve you should be using push_back

Doh…my bad! Changing it…

About the problem with vectors + videoGrabbers i’m pretty sure that cannot affect the fps at all, it should be something else different in your code.

Yes I remember this from 2 years or more ago…so very likely a lot of things were wrong with my code :wink:

Ok that was laborious :wink:

Using attached source code and running ten times with each #define (ie., methods run just once per program to eliminate caching) I get:

  
#define VECRESIZEARRAY 1; items = 10000; Average = 3.8  
#define VECRESERVEPUSH 1; items = 10000; Average = 46.3  
#define VECSTRAIGHTPUSH 1; items = 10000; Average = 100  
#define ARRAYNEWALLOC 1; items = 10000; Average = 23.3  
  
#define VECRESIZEARRAY 1; items = 1000000; Average = 761.4  
#define VECRESERVEPUSH 1; items = 1000000; Average = 4878  
#define VECSTRAIGHTPUSH 1; items = 1000000; Average = 9192  
#define ARRAYNEWALLOC 1; items = 1000000; Average = 3194.9  
  
#define VECRESIZEARRAY 1; items = 8294400; Average =  7239.4  
#define VECRESERVEPUSH 1; items = 8294400; Average = 40555.4  
#define VECSTRAIGHTPUSH 1; items = 8294400; Average = 74044.6  
#define ARRAYNEWALLOC 1; items = 8294400; Average = 26042.4  
  
#define VECRESIZEARRAYCUSTOM 1; items = 10000; Average = 2088.8  
#define ARRAYNEWALLOCCUSTOM 1; items = 10000; Average = 2007.9  
  
#define VECRESIZEARRAYCUSTOM 1; items = 1000000; Average = 175962.7  
#define ARRAYNEWALLOCCUSTOM 1; items = 1000000; Average = 205011.7  
  
#define VECRESIZEARRAYCUSTOM 1; items = 8294400; Average = 1593221.1  
#define ARRAYNEWALLOCCUSTOM 1; items = 8294400; Average = 1823864.6  
  
#define VECRESIZEARRAYMULTI 1; Items = 1; Average = 7351.1  
#define ARRAYNEWALLOCMULTI 1; Items = 1; Average = 4738.5  
  
#define VECRESIZEARRAYMULTI 1; Items = 10; Average = 67640.3  
#define ARRAYNEWALLOCMULTI 1; Items = 10; Average = 43507.3  

So in all cases now I get vector.resize(n) with array-style operator[] access faster than a new’d C-style array!

Except in the case of multidimensional data, where the array is still a clear winner.

Raw data is there too…worth noting that the array access times (for the int data type) exhibited the broadest variation (eg., for 10000 items, half the runs took 6 microseconds and the other half 25+)…curious…

testApp.cpp

VecVSArrayRawData.rtf.zip

…and oddly booleans are an exception:

  
#define VECRESIZEARRAY 1; items = 8294400; using bools instead of ints; Average = 11364.7  
#define ARRAYNEWALLOC 1; items = 8294400; using bools instead of ints; Average = 6335.8  

This seems to be relevant: When-a-Container-is-Not-a-Container, especially ‘The Life and Times of vector<bool>’.
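From what I can gather, std::vector<bool> is a bit-packed specialisation whose operator[] returns a proxy object rather than a bool&, which would explain some extra cost per write (that’s my reading of the article, so treat it with appropriate caution). A minimal sketch of the difference:

#include <vector>

void boolExample() {
    std::vector<int>  vi(8, 0);
    std::vector<bool> vb(8, false);

    // vector<int>::operator[] returns a real reference into contiguous storage
    int & ri = vi[0];
    ri = 1;

    // vector<bool> packs its elements into bits, so operator[] returns a proxy:
    // bool & rb = vb[0];                    // does not compile: can't bind bool& to the proxy
    std::vector<bool>::reference rb = vb[0]; // proxy that writes back into the packed bits
    rb = true;

    // a plain bool array has no such indirection
    bool ba[8] = { false };
    ba[0] = true;
}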

it looks like your results are generally consistent with what i’d expect.

i will add that as a former CS student, arturo is totally right about cache issues. when i had assignments about timing, they were just about timing the performance of the lowest level instructions (things like multiply vs divide, less than vs less than or equals, etc.). for anything more complex it’s better to actually read the assembly to get an idea of what’s happening. but of course, after a certain level of complexity (even STL-level complexity) you need to resort to timing again… and there are a lot of weird things that computers do that make it hard to test.

didn’t know about the vector issue, but yes, the difference there does seem significant.

about arrays being slower now, it can also be another caching effect. in vectors, resize() actually creates 1 element and copies it to all of the new positions, so that memory gets accessed and almost certainly ends up in the cache, while with arrays, new only allocates the memory, which is probably not enough for the OS to decide to put it in the cache yet.
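to make that concrete, a quick sketch (just illustrative, with int elements):

#include <vector>

void allocExample(int items) {
    // resize() value-initialises every element, so the whole buffer is written
    // once here and is likely to be sitting warm in the cache before the timed fill
    std::vector<int> v;
    v.resize(items);

    // new[] on a built-in type just allocates; nothing is touched yet, so the
    // timed fill is the first time these pages actually get written
    int * a = new int[items];

    // ...timed fill loops over v and a would go here...

    delete[] a;
}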

anyway, as a general rule you cannot trust these kinds of generic benchmarks unless the difference is really consistent and you are measuring small, isolated things. it’s better to understand what’s going on (in this case the vector and array memory layouts are the same, so they should be equally fast) and to benchmark/optimize individual cases in applications, and only once you notice you really need to optimize.

Ah yes it does appear to be a caching issue.

If I run each of the #defined tests in a for…loop with the array allocated and deleted inside each loop (and a resize for a vector) then I get largely the same results as my hand-made tests above.

If however I move the array allocation outside of the loop (ie., only allocate once) then it gets cached and I get vector access == array access times.
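In sketch form (the names here are just illustrative), the two variants are:

void cachingVariants() {
    const int numRuns = 20;    // how many times each test is repeated
    const int items = 1000000; // example allocation size

    // variant A: allocate and delete inside the loop -> every iteration starts
    // with a fresh, untouched buffer
    for (int run = 0; run < numRuns; run++) {
        int * a = new int[items];
        // ...timed fill of a[] goes here...
        delete[] a;
    }

    // variant B: allocate once outside the loop -> after the first iteration the
    // buffer has been touched and tends to stay cached, so later runs come out faster
    int * b = new int[items];
    for (int run = 0; run < numRuns; run++) {
        // ...timed fill of b[] goes here...
    }
    delete[] b;
}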

It probably is incredibly tedious for CS grads to have someone engaging in this sort of benchmarking (I know it has been for me :wink:) but then again I’ve learnt a few important things that I didn’t know before:

* The caching effect is something I vaguely knew was there, but if nothing else all this shows that it has a much larger effect than the type of ‘container’ that highly iterative functions act upon. The order of allocation/destruction is important, and done the right way, warming the cache is really efficient.
* I’ve almost always thought vectors were slower than arrays, regardless of implementation; now I know that is wrong (at least in cases where a resize is possible)
* Multidimensional data is faster in an array (is that true? or is that yet another cache effect or other by-product of my synthetic test?)
* And vector<bool> is just plain weird

In the end the benchmarking was not really the issue -> sometimes it takes me this sort of ‘rigor’ or ‘boneheadedness’ to understand what’s going on ‘under the hood’.

it’s better to…benchmark/try to optimize individual cases in applications and always once you notice you really need to optimize.

In general that’s my approach. But sometimes (and this is especially true of vector vs array arguments) it’s really very time consuming to go from one to the other (usually from a vector to an array) to see if you can scrape an extra 1, 2, 5, 10 fps…so this was good in that it demonstrated how just a tiny change in an implementation can make a major difference to performance.

So not a total waste of a day…even if it feels like it :wink:

Thanks for your patience.

If you’ve arrived at the point of understanding the effects of caching and even slightly understand how to use them in your favor, be sure it’s not been a waste of time :slight_smile:

About the multidimensional arrays, i’m not sure what’s going on there; it could be some optimization the compiler can do for arrays that it can’t do for vectors, but i’ve really no idea. There’s more going on in that part than in the rest, like lots of multiplications that you can avoid by incrementing a pointer to the current position, and i’m pretty sure that will make both versions faster and then you’ll probably have similar timings.
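for example, something like this (a rough sketch, assuming a flat width*height*4 buffer):

void fillFlatBuffer() {
    const int width = 1920, height = 1080, channels = 4;
    unsigned char * pixels = new unsigned char[width * height * channels];

    // indexed version: an index multiplication for every access
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width * channels; x++) {
            pixels[y * width * channels + x] = 255;
        }
    }

    // pointer-increment version: no per-element multiplication,
    // just a straight walk through the buffer
    unsigned char * p = pixels;
    for (int i = 0; i < width * height * channels; i++) {
        *p++ = 255;
    }

    delete[] pixels;
}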

I like this whole “let’s understand containers” thing. Too bad I can’t actually run your test suite as-is, since my computer doesn’t have enough memory and it bails with std::bad_alloc :slight_smile: The multidimensional array result is interesting. I’m betting you’re on Xcode 4? LLVM does some odd stuff under the hood. I’m still on GCC and I get:

  
OF: OF_LOG_NOTICE: Allocate unsigned char* vector: 150   
OF: OF_LOG_NOTICE: Fill unsigned char* vector: 509226   
OF: OF_LOG_NOTICE: Allocate unsigned char* array: 314   
OF: OF_LOG_NOTICE: Fill unsigned char* array: 407027   

So how’s that for fun? :slight_smile:

hi, this is definitely an interesting topic; this kind of thing can make your code faster w/o any compiler optimizations. A great website related to “fast c++” is http://fastcpp.blogspot.com/.

roxlu

@roxlu: that is some seriously nerdy stuff! Are those ‘SSE’ tricks available on all platforms?

@joshua actually I’m still using Xcode 3.x/GCC mostly (looking shy) -> I find the autocomplete promising but not always delivering in 4.02 (on 10.6.8), though better in 4.2 on 10.7 … but I’m not up for a Lion in the Wardrobe (at least not on my laptop) just yet, as I have too many projects happening to blunder off into a new OS at the moment.

Interested to find out about optimizing the multidimensional array. For a recent project I did, using multiple depth masks in ofxOpenNI was essential. These would have been much easier to deal with in vector form, but I couldn’t get the performance I needed without using an array…hence going blind testing all this stuff :wink: