Optimising a nested for loop

Hello everyone! just like the title says, Im looking for ways to optimise a for loop within another for loop, right now Im using it to scan an image in order to pick out the background pixels from the foreground based on a threshold value. The way sometimes this is done is like so:

   unsigned char * pixels = image.getPixels();
   for (int y = 0; y < image_height; y++) {
     for (int x = 0; x < image_width; x++) {
       int i = x + y * image_width;
       unsigned char R = pixels[i * 3];
       unsigned char G = pixels[i * 3 + 1];
       unsigned char B = pixels[i * 3 + 2];      
       // do something with the RGB values
     }
   }

if you dont really need to do anything with the x and y values you can rewrite this as a single for loop like this:

unsigned char * pixels = image.getPixels();
for (int i = 0; i < number_of_pixels; i++) {
  unsigned char R = pixels[i * 3];
  unsigned char G = pixels[i * 3 + 1];
  unsigned char B = pixels[i * 3 + 2];      
  // do something with the RGB values
}

this will make it a tiny bit faster (only fractions of a millisecond but still). We can also do 1 multiplication instead of 3 inside the index like this:

unsigned char * pixels = image.getPixels();
for (int i = 0; i < number_of_pixels; i++) {
      int j = i * 3;
      unsigned char R = pixels[j];
      unsigned char G = pixels[j + 1];
      unsigned char B = pixels[j + 2];      
      // do something with the RGB values
    }

This will again make it faster, another way is looping from 0 to the total number of RGB values, given that the total number of rgb values is total = number_of_pixels * 3 this would be:

  unsigned char * pixels = image.getPixels();
  for (int i = 0; i < total; i += 3) {    //  notice we increment by 3 and not 1
       unsigned char R = pixels[i];
       unsigned char G = pixels[i + 1];
       unsigned char B = pixels[i + 2];      
       // do something with the RGB values
     }

This is supossed to make it even faster (again only a little bit) because there is no multiplication going on, but what if you really need the x and y values, well, they can be calculated this way:

  unsigned char * pixels = image.getPixels();
  for (int i = 0; i < total; i += 3) {    //  notice we increment by 3 and not 1
       unsigned char R = pixels[i];
       unsigned char G = pixels[i + 1];
       unsigned char B = pixels[i + 2];      
       // do something with the RGB values

       int x = i % image_width;
       int y = y / image_width;
     }

Here we introduce 2 of the usually slowest arithmetic ops a cpu can do which in reality makes this chunk of code not run much faster than the first example with the 2 for loops. We could however improve the nested loop a little if we do something like:

unsigned char * pixels = image.getPixels();
for (int y = 0; y < image_height; y++) {
         int j = y * image_width;
         for (int x = 0; x < image_width; x++) {
           int i = x + j;
           unsigned char R = pixels[i * 3];
           unsigned char G = pixels[i * 3 + 1];
           unsigned char B = pixels[i * 3 + 2];      
           // do something with the RGB values
         }
       }

Now, what if you need to do this over and over again? say, this image actually is a frame coming from a webcam feed, then we could make it run faster if we precalculate most of the values in our setup function. We’d need something like a struct containing the x,y,r,g,b values associated with an index, this could be implemented as follows:

struct point {
  int x, y, r, g, b;
} points_lookupt[number_of_pixels];

void setup() {
   for (int i=0; i<number_of_pixels; i++) {
            points_lookupt[i].r = i * 3;
            points_lookupt[i].g = i * 3 + 1;
            points_lookupt[i].b = i * 3 + 2;                        
            points_lookupt[i].x = i % image_width;
            points_lookupt[i].y = i / image_width;
   }
}

at this point we would simply iterate over this array and fetch all the necessary values by accessing each structs fields like so:

unsigned char * pixels = image.getPixels();
for (int i = 0; i < number_of_pixels; i++) {
   int x = points_lookupt[i].x;
   int y = points_lookupt[i].y;
   unsigned char R = pixels[points_lookupt[i].r];
   unsigned char G = pixels[points_lookupt[i].g];
   unsigned char B = pixels[points_lookupt[i].b];
  // do something with the RGB values
}

and now finally my question: why is this last chunk of code not faster than :

unsigned char * pixels = image.getPixels();
      for (int i = 0; i < total; i += 3) {    //  notice we increment by 3 and not 1
           unsigned char R = pixels[i];
           unsigned char G = pixels[i + 1];
           unsigned char B = pixels[i + 2];      
           // do something with the RGB values

           int x = i % image_width;
           int y = y / image_width;
         }

they both take almost the same time to finish but this last one is a few microseconds faster, why ???

Saludos everyone!
JLafarga

1 Like

I’m curious to know what platform you are using, what compiler / compilation flags and what your ultimate goal is – when talking about “a few microseconds”, it’s important to understand the context, larger goals and other factors (i.e. there may be other bottlenecks that will produce speedups orders of magnitude larger).

I’m running this on osx mavericks with the latest version of xcode using all default compilation flags and settings. Should it be faster or am I missing something here? what do you think?

btw, my final goal is to optimise an algorithm known as “connected component labeling/analysis”, it is used to label & uniquely identify connected regions of pixels in an image. I’m trying to make it run as fast as possible, this is just one of many optimisations I have implemented.

It’s indeed weird to try to focus so much on that for parsing an image. Usually the good solution is to try first existing code ( openCV maybe?)to do the threshold. Usually pretty smart people worked really hard on various algorithm to have the degree of optimization you want.

And to answer the question, what is probably happening (or would in a regular x86 architecture) is that your image is a big chunk of data in memory, probably contiguous (I guess it might be wrong if you have a 1GB image…). If the compiler or processor realize that you try to access all the byte one by one, it’s going to cache the value (ie copy a chunk of data from RAM to some memory faster/closer to the CPU) and then access it super fast, until it reaches a value that was not cached, and may start again with a new chunk.
Your last example is really complicated, with your indexes stored somewhere else in memory (your look up table) so it gets harder to optimize that, and adds levels of indirection you should try to avoid : accessing the memory is ‘slow’ while doing arithmetic operations is probably faster.

The fastest might be something like:

int i=0;
unsigned char R,G,B;
int x =-1;
int y = -1;
while (i < number_of_pixels)
{
  R = pixels[i++];
  G = pixels[i++];
  B = pixels[i++];
  x++;
  x = x %  image_width;
  if (x ==0) y++;
}

If you seriously want to go that way, you should read books about High performance computing…

1 Like

hi there, this is for a school thingy where I’am not allowed to use openCV, I’am also still learning to program in C++ thats why I did not use any prefabricated code, I’am aware there are probably faster ways of doing this, (SSE instructions or what not), but my main goal here is to learn and understand how things work, the openCV code still seems a bit cryptic to me that is why I ask.

Saludos!

do you mean cache misses? like what happens when iterating over a linked list vs iterating over a vector type of scenario?

Mostly yes.

If you want to have a real understanding of what’s happening you should probably look at the dissassembly of your code ( the optimization you try to put in there might be modified by the compiler’s own ones) and use profiling tools like valgrind/cachegrind to see what happening in memory. And obviously run your things tons of time to be sure you average the results you get in a statisticaly relevant way, with a meaningful timer…

(And if it’s a class about image processing that’s probably not the right questions :D)

1 Like

You might try using Apple’s Accelerate Framework if you really need it to run faster. It gives you functions that do operations on vectors, e.g. summing two vectors, multiplying a vector by a scalar, etc.

Optimized graphics libraries almost always use GPU acceleration, so if you want to be as fast as those libraries, you should too.
I’ve only hacked at the Accelerate Framework, and badly at that, so I can’t help much more than point and grunt.

with:

int i=0;
int x = 640;
int y = 480;
for(int y=0;y<h;y++){
    for(int x=0;x<w;x++){
        pixels[i++]...
        pixels[i++]...
        pixels[i++]...
    }
}

you avoid multiplications or divisions at all and get x and y. also if you use constants for w and h as i’ve set in the example the compiler will be able to unroll the loops which can make a difference.

1 Like

Only using additions and ++ constructs really speeds things up like arturo shows.
If there are lots of cpu calculations going on in the nested loop, you might consider some openMP construction. Note the overhead for creating threads there!

I often use, in pseudocode:

int nrOfCores= openmp get cores
int w = 640;
int h = 480;
int h_per_core = h / nrOfCores;
#pragma parallel for
for (int core = 0; core < nrOfCores; core++) {
  int fromY = core * h_per_core;
  int toY = (core+1) * h_per_core;
  int i= fromY * w;

  for(int y=fromY ;y<toY;y++){
      for(int x=0;x<w;x++){
          pixels[i++] = heavyCalc()...
          pixels[i++] = heavyCalc()...
          pixels[i++] = heavyCalc()...
      }
  }
}

this would only create x threads, matching the nrs of cores you have (no 480 threads created but only 4/8 - less openmp overhead!) and process the image/pixels several times faster as without openmp.

Note that the OPENMP can only result in a performancegain when the work to be done is cpu intensive. also code has to be thread save ofcourse, but this mostly the case with such loops.
Basically, a frag shader does the same over all its shader processors (i think?)