Fastest way of blending hundreds of images together?

I am working on a project where I need to blend hundreds of images together dynamically and maintain a fast FPS. I am using a fixed camera that is constantly taking panographic pictures. My goal is to have 4 days of imagery that a person can navigate through at all times. I have already written this by using a greyscale mask and using the brightness value as the index into an array of images that were taken at various times.

for( int y=0; y< mskPix.getHeight(); y++){
    for( int x=0; x< mskPix.getWidth(); x++){
        if(mskPix.getColor(x,y).a >0 ){
            float bright = mskPix.getColor(x,y).getBrightness();
            ofColor temp = allDayImagesFull.at(int(bright)).getColor(x+imgScrollX, y);
            rsltImg.setColor(x, y,ofColor(temp.r, temp.g, temp.b,mskPix.getColor(x,y).a ));
        }
        else{
            rsltImg.setColor(x, y,ofColor(0,0,0,0));
        }
    }
}

rsltImg.update();

The effect is what I intended, but as you might imagine my FPS is abysmal (I was getting about 5 fps). Here is a sped up video of what I made to get a sense of what I am trying to make:


I am in the beginning of rewriting this using shaders. I make a series of gradients decreasing in size and apply them as alpha masks using img.getTexture().setAlphaMask(temp); to an array of images.

BUT I suspect that I should really just repurpose ofxSlitScan. Any words of wisdom before I invest a lot of time going in the wrong direction?


i’m not sure you can actually keep so many images on the graphics card.

multithreading should work, your task is perfect for parallelization. on a quad core machine that should get you up to 20fps already.

Hi, a complicated but very fast solution would be to use a GL_TEXTURE_3D. You’d be pretty limited in how many frames you can have at what size, but the mechanism you need is all there. And the new NVIDIA cards come with 8 GB as standard!

That’s how this was implemented, according to a note on the second link:
http://www.k2.t.u-tokyo.ac.jp/members/alvaro/Khronos/
http://www.k2.t.u-tokyo.ac.jp/members/alvaro/Khronos/test/test.html

/A

do you have to do these calculations on the fly? can you compute them at start up and put them into a vector?

i think getColor can be quite slow

Best
Ben

So would it be faster to store all the color information in vectors rather than in ofImages? I guess I could store all the info in a 3D vector of color information.

Good suggestion with checking out Alvaro’s work. I don’t mind the added complexity of GL_TEXTURE_3D if it gives me better performance. The resolution limit may be a problem. It is going to be on a very large, very prominent screen so it needs to be as crisp as possible.

I can buy whatever computer I need to make this happen (within reason). I was planning on getting a Mac mini with a 2.6GHz dual core and 16GB of RAM. Perhaps I should go with something more powerful? It needs to be something I can configure to run as a permanent installation (no automatic updating etc.)

Hi, for the GL_TEXTURE_3D approach it would be important to check how much RAM the Mac Mini GPU has access to. My MacBook Air’s Intel HD 5000 has access to 1.5 GB, which isn’t too bad, but you can get an Nvidia GTX 960 card with 4 GB for around 200 euros. I love the Mac Mini, but its price went up as its tech stood still…

With 1920x1080 images at 3 channels (one byte per channel) you could fit around 172 frames in 1 GB of RAM. I haven’t looked into it so it could be a world of pain, but there is some texture compression built into OpenGL that could help you squeeze in a few more.
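
The frame count is plain arithmetic. Here is a minimal sketch of it, assuming uncompressed, tightly packed 8-bit channels and 1 GB taken as 2^30 bytes. The function name `framesThatFit` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of uncompressed frames that fit in a memory budget.
// Assumes tightly packed 8-bit channels, no mipmaps and no padding.
std::uint64_t framesThatFit(std::uint64_t budgetBytes,
                            std::uint64_t width,
                            std::uint64_t height,
                            std::uint64_t channels) {
    const std::uint64_t bytesPerFrame = width * height * channels;
    assert(bytesPerFrame > 0);
    return budgetBytes / bytesPerFrame;
}
```

`framesThatFit(1ULL << 30, 1920, 1080, 3)` gives 172, and the same call with a 4 GB budget gives 690, before subtracting whatever the driver and the rest of the app need.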

/A

Thank you for your advice about the hardware! :relaxed: So just to be clear, would you recommend switching to a Linux or Windows machine to use the Nvidia GTX 960 card? I see there are some options out there for Mac, but they still seem a bit far out.

I am willing to take the hit and move to a different OS than my preference to increase performance, but I’d just need to make sure I could set it up to restart automatically and never pop up unexpected windows.

Any particular machine you would recommend? I was thinking of getting a Mac Mini- but those are only dual core, so perhaps it would not be enough…

Hi!

I definitely prefer the Mac for leaving something unattended, but the price/performance ratio is really in favour of the PC, especially with the Mac options at the moment (!?).

A cheap PC with an OK card like the NVIDIA 960 will be maybe 2-300 euros more than the Mac Mini, but that 960 is SOO much faster than the integrated Intel GPU in the Mac Mini, so if you end up putting a lot of stuff on the GPU then it’s totally worth it.

Oh and the 960 was just made ancient as NVIDIA just released new stuff, so it should come down in price pretty soon.


Thanks for the advice. If that is what gives me the GPU I need, then that is what I’ll have to do. I’ll just need to find an article for Windows machines equivalent to this article for Macs on how to leave one unattended.

Looks like Arturo has a solid write up of doing it for linux, which is much simpler than on either Windows or OSX imo.

Cool. I’ve always wanted to be into Linux! :sunglasses: I will get a Linux Machine + Nvidia 960 and try implementing it with GL_TEXTURE_3D and report back.

i’ve played around with this too much yesterday. if you wanna stay on the cpu side of things i think it’s possible, but it’ll require a fast computer. i have a 2.3ghz quad core and managed to get up to a bit over 20 fps in full hd, a bit faster would be nice.

i made a test video. as input i used two scenes from a nasa video. as mask i used a single circle.

so, before i post the code, here are some important performance pointers:

  • the contents of your mask will determine which source image is read. the smoother your mask is, the faster the code will run. the impact of this is huge (ie. going from 10 to 30 fps switching from a random to a linear mask). doing these completely random memory queries will make the cache fail a lot i guess.
  • using setColor/getColor is indeed slow. work with the raw buffer data as much as possible
  • use arrays instead of vectors if possible
  • parallelization doesn’t speed things up as much as expected. on my machine (i have 8 threads) using 3 threads yields maximum performance
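
To illustrate the raw-buffer pointer, here is a minimal sketch of the mask lookup working on plain byte buffers instead of per-pixel getColor/setColor calls. It is not the code from this thread: the function name `composeFromMask`, the `std::vector` storage and the one-byte-per-pixel mask are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Picks, for every pixel, the source frame selected by the mask value.
// frames: one tightly packed RGB buffer (w*h*3 bytes) per source image.
// mask:   one byte per pixel, used directly as the frame index.
std::vector<std::uint8_t> composeFromMask(
        const std::vector<std::vector<std::uint8_t>>& frames,
        const std::vector<std::uint8_t>& mask,
        std::size_t w, std::size_t h) {
    assert(!frames.empty());
    assert(mask.size() == w * h);
    std::vector<std::uint8_t> out(w * h * 3);
    for (std::size_t p = 0; p < w * h; ++p) {
        // Clamp so a bright mask value never indexes past the last frame.
        std::size_t idx = mask[p];
        if (idx >= frames.size()) idx = frames.size() - 1;
        assert(frames[idx].size() == w * h * 3);
        const std::uint8_t* src = frames[idx].data() + p * 3;
        std::uint8_t* dst = out.data() + p * 3;
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
    }
    return out;
}
```

A smooth mask makes neighbouring pixels read from the same frame buffer, so the reads stay close together in memory. A noisy mask jumps between buffers on every pixel.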

all in all it was too fun to play with and i got a bit sidetracked, so i made a lot of changes:

  • the effect looks quite bad when limiting the mask accuracy to brightness values. i’ve switched from the mask to a float buffer, and then linearly interpolate between neighbouring source frames. this gives very neat results, but dropped me down to ~10fps again.
  • no idea how you assemble the mask. i’m using a circle that also shifts through time, and got rid of your x scanning offset.
  • from your snippet i’m not sure what you use the mask transparency for. i’m not using it anymore.
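
The interpolation between neighbouring source frames can be sketched in isolation like this. It is a simplified version that leaves out the time offset and the triangle ramp used in the full code, and the names `FrameBlend`, `pickFrames` and `lerpChannel` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Which two frames to blend, and how much of the second one to use.
struct FrameBlend {
    int a;        // lower frame index
    int b;        // upper frame index (a + 1)
    float alpha;  // weight of frame b, 0...1
};

// Maps a float mask value (0...1) onto a position between two frames.
FrameBlend pickFrames(float m, int frameCount) {
    assert(frameCount >= 2);
    const float clamped = std::min(std::max(m, 0.0f), 1.0f);
    const float pos = clamped * (frameCount - 1);
    int a = static_cast<int>(std::floor(pos));
    if (a > frameCount - 2) a = frameCount - 2;  // keeps b in range when m == 1
    return {a, a + 1, pos - a};
}

// Linear blend of one 8-bit channel, rounded to the nearest value.
std::uint8_t lerpChannel(std::uint8_t ca, std::uint8_t cb, float alpha) {
    return static_cast<std::uint8_t>(ca * (1.0f - alpha) + cb * alpha + 0.5f);
}
```

Compared with an 8-bit brightness mask, the float position is not limited to 256 steps, which is where the smoother result comes from. The price is two source reads and three multiplies per channel instead of a single copy.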

alrighty, here’s the code:

#pragma once

#include "ofMain.h"

#define NUM_WORKERS 4

class MaskWorker;
class EffectWorker;

class ofApp : public ofBaseApp{

    public:
        void setup();
        void update();
        void draw();

        void keyPressed(int key);
        void keyReleased(int key);
        void mouseMoved(int x, int y );
        void mouseDragged(int x, int y, int button);
        void mousePressed(int x, int y, int button);
        void mouseReleased(int x, int y, int button);
        void mouseEntered(int x, int y);
        void mouseExited(int x, int y);
        void windowResized(int w, int h);
        void dragEvent(ofDragInfo dragInfo);
        void gotMessage(ofMessage msg);
    
    // not really used anymore,
    // except to tell everyone what size the images are.
    ofPixels mskPix;
    
    // that's the new mask buffer.
    float * mess;
    
    // final output goes here
    ofImage rsltImg;
    
    // source images
    vector<ofImage> allDayImagesFull;
    // raw pixel buffers of the source images (one pointer per image)
    unsigned char** allDayData;
    
    // workers to assemble the mask
    vector<shared_ptr<MaskWorker>> maskWorkers;
    
    // workers to assemble the result image
    vector<shared_ptr<EffectWorker>> effectWorkers;
    
    // repurposing this. it's now used for the global time offset
    int imgScrollX;
    
    // determines the size of the circle
    float fxFactor;
    
    float mouseX;
    float mouseY;
};



#include "ofApp.h"
#include <mutex>


// this generates the mask
// it is split by lines (for two workers, worker 1 does upper half, worker 2 does the lower half)
class MaskWorker : public ofThread{
private:
    ofApp & app;
    int offset;
    std::mutex m;
    
public:
    MaskWorker( ofApp & app, int offset ) : app(app), offset(offset){
        m.lock();
    }
    
    void threadedFunction(){
        while(isThreadRunning()){
            // Wait until main() sends data
            m.lock();
            
            unsigned char * maskData = app.mskPix.getData();
            unsigned char * resultData = app.rsltImg.getPixels().getData();
            
            int w = app.mskPix.getWidth();
            int h = app.mskPix.getHeight();
            float mx = app.mouseX;
            float my = app.mouseY;
            float of_width = ofGetWidth();
            float of_height = ofGetHeight();
            
            size_t dest_len = w*h;
            size_t dest_start = w*h*offset/NUM_WORKERS;
            size_t dest_end = w*h*(offset+1)/NUM_WORKERS;
            
            for( ; dest_start < dest_end; dest_start++ ){
                int x = (dest_start)%w;
                int y = (dest_start)/w;
                //app.mess[dest_start] = 1-ofClamp((fabsf(mx-x)+fabsf(my-y))/of_width/app.fxFactor,0,1);
                app.mess[dest_start] = 1-ofClamp(ofDist(x,y,mx,my)/of_width/app.fxFactor,0,1);
            }
            
            m.unlock();
        }
    }
    
    void work(){
        m.unlock();
    }
    
    void wait(){
        m.lock();
    }
};



// this processes a chunk of the image
// it is split by lines (for two workers, worker 1 does upper half, worker 2 does the lower half)
class EffectWorker : public ofThread{
private:
    ofApp & app;
    int offset;
    std::mutex m;
    
public:
    EffectWorker( ofApp & app, int offset ) : app(app), offset(offset){
        m.lock();
    }
    
    void threadedFunction(){
        while(isThreadRunning()){
            // Wait until main() sends data
            m.lock();

            
            unsigned char * maskData = app.mskPix.getData();
            unsigned char * resultData = app.rsltImg.getPixels().getData();
            
            int w = app.mskPix.getWidth();
            int h = app.mskPix.getHeight();
            size_t src_len = w*h*3;
            size_t src_start = w*h*3*(offset)/NUM_WORKERS;
            size_t src_end = w*h*3*(offset+1)/NUM_WORKERS;
            size_t scroll_offset = app.imgScrollX*3;
            
            int imgs = app.allDayImagesFull.size();
            
            for( ; src_start < src_end; src_start+=3 ){
                // instead of linear time, use a triangle ramp
                // so we go from 0...2*numImages and make the second half ramp down
                float brightness = 2*ofClamp(app.mess[src_start/3]*(imgs-1), 0, imgs-1.1); // app.mess is the mask (each value = 0...1)
                
                int a = floor(brightness);
                int b = a+1;
                float alpha = brightness-a;
                a = (a+app.imgScrollX)%(2*imgs);
                b = (b+app.imgScrollX)%(2*imgs);
                if(a>=imgs) a = 2*imgs-a-1;
                if(b>=imgs) b = 2*imgs-b-1;

                // i see potential for a huge speed up here, by writing
                // alpha, srca and srcb to separate result textures/arrays,
                // and then either blending with simd or the graphics card.
                unsigned char * srca = &app.allDayData[a][src_start];
                unsigned char * srcb = &app.allDayData[b][src_start];
                
                resultData[src_start+0] = (unsigned char)(srca[0]*(1-alpha)+srcb[0]*alpha);
                resultData[src_start+1] = (unsigned char)(srca[1]*(1-alpha)+srcb[1]*alpha);
                resultData[src_start+2] = (unsigned char)(srca[2]*(1-alpha)+srcb[2]*alpha);
            }
            
            
            m.unlock();
        }
    }
    
    void work(){
        m.unlock();
    }
    
    void wait(){
        m.lock();
    }
};

//--------------------------------------------------------------
void ofApp::setup(){
    fxFactor = 1;
    cout << "Loading images..." << endl;
    for( int i = 0; i <= 299; i++){
        cout << (i*100/300) << "%" << endl;
        
        ofImage nextImage;
        nextImage.setUseTexture(false);
        allDayImagesFull.push_back(nextImage);
        
        ofImage &img = allDayImagesFull.back();
        img.load("img/" + ofToString(i+1, 5, '0') + ".png");
    }
    
    allDayData = new unsigned char * [allDayImagesFull.size()];
    for( int i = 0; i < allDayImagesFull.size(); i++ ){
        allDayData[i] = allDayImagesFull[i].getPixels().getData();
    }
    cout << "Loaded all images" << endl;
    
    ofImage & first = allDayImagesFull.front();
    
    rsltImg.allocate(first.getWidth(), first.getHeight(), OF_IMAGE_COLOR);
    mskPix.allocate(first.getWidth(), first.getHeight(), OF_IMAGE_COLOR);
    mess = new float[(int)first.getWidth()*(int)first.getHeight()];
    
    // create some workers
    for( int i = 0; i < NUM_WORKERS; i++ ){
        effectWorkers.push_back(make_shared<EffectWorker>(*this, i));
        effectWorkers.back()->startThread();
        maskWorkers.push_back(make_shared<MaskWorker>(*this, i));
        maskWorkers.back()->startThread();
    }
}

//--------------------------------------------------------------
void ofApp::update(){

}

//--------------------------------------------------------------
void ofApp::draw(){
    
    bool exporting = false;
    if(exporting){
        mouseX = ofGetWidth()/2;
        mouseY = ofGetHeight()/2;
        fxFactor = 1.25;
    }
    else{
        mouseX = ofGetMouseX();
        mouseY = ofGetMouseY();
        if(ofGetMousePressed()){
            fxFactor = 10*ofGetMouseX()/(float)ofGetWidth();
        }
    }
    
    
    imgScrollX ++;
    imgScrollX %= 2*(int)allDayImagesFull.size(); // the workers wrap time with a period of 2*imgs
    
    // assemble the mask in parallel
    for( shared_ptr<MaskWorker> worker : maskWorkers )
        worker->work();
    for( shared_ptr<MaskWorker> worker : maskWorkers )
        worker->wait();
    
    // start work on all threads, then wait for them
    for( shared_ptr<EffectWorker> worker : effectWorkers )
        worker->work();
    for( shared_ptr<EffectWorker> worker : effectWorkers )
        worker->wait();
    
    
    
    rsltImg.update();
    rsltImg.draw(0,0);
    cout << ofGetFrameRate() << "/" << fxFactor << "/" << mouseX << "/" << mouseY << endl;
    
    if(exporting){
        rsltImg.save("out/" + ofToString(ofGetFrameNum(),5,'0') + ".tiff");
        if(ofGetFrameNum()>2*allDayImagesFull.size()){
            std::exit(0);
        }
    }
}

//--------------------------------------------------------------
void ofApp::keyPressed(int key){

}

//--------------------------------------------------------------
void ofApp::keyReleased(int key){

}

//--------------------------------------------------------------
void ofApp::mouseMoved(int x, int y ){

}

//--------------------------------------------------------------
void ofApp::mouseDragged(int x, int y, int button){
}

//--------------------------------------------------------------
void ofApp::mousePressed(int x, int y, int button){

}

//--------------------------------------------------------------
void ofApp::mouseReleased(int x, int y, int button){

}

//--------------------------------------------------------------
void ofApp::mouseEntered(int x, int y){

}

//--------------------------------------------------------------
void ofApp::mouseExited(int x, int y){

}

//--------------------------------------------------------------
void ofApp::windowResized(int w, int h){

}

//--------------------------------------------------------------
void ofApp::gotMessage(ofMessage msg){

}

//--------------------------------------------------------------
void ofApp::dragEvent(ofDragInfo dragInfo){ 

}
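
A note on the worker handoff above: the workers call lock() on one thread and unlock() on another, which std::mutex does not allow (unlocking a mutex the calling thread does not own is undefined behaviour), and nothing stops the main thread from re-acquiring the mutex before the worker has run. One alternative is a condition variable with a "pending" flag. The sketch below is not the code from this thread; the class name `FrameWorker` and its layout are assumptions, and it uses std::thread rather than ofThread to stay self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// A persistent worker that runs one job per frame.
// work() hands the job over, wait() blocks until it has finished.
class FrameWorker {
public:
    explicit FrameWorker(std::function<void()> job)
        : job_(std::move(job)), thread_([this] { loop(); }) {}

    ~FrameWorker() {
        {
            std::lock_guard<std::mutex> lock(m_);
            quit_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    void work() {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(!pending_);  // only one job in flight at a time
            pending_ = true;
        }
        cv_.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !pending_; });
    }

private:
    void loop() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return pending_ || quit_; });
            if (quit_) return;
            lock.unlock();
            job_();  // run the job without holding the lock
            lock.lock();
            pending_ = false;
            cv_.notify_all();
        }
    }

    std::function<void()> job_;
    std::mutex m_;
    std::condition_variable cv_;
    bool pending_ = false;
    bool quit_ = false;
    std::thread thread_;  // declared last so every other member exists before it starts
};
```

In draw() the pattern stays the same as before: call work() on every worker, then wait() on every worker, with each job processing its own slice of the image.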

probably i’m missing something but blending happens right after the fragment shader for anything you’ve drawn to the screen already, so if you just draw all the images one after another every frame it should blend them without doing anything else. you can adjust the blending mode using ofSetBlendMode or if you want something even more fine grained directly with opengl using glBlendEquation and glBlendFunc.
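
For reference, this is the arithmetic the blending stage performs per channel in the common "source over" setup, i.e. glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) with the default GL_FUNC_ADD equation. It is only a CPU sketch of the formula, with colours as floats in 0...1; the names `Rgb` and `blendOver` are made up for illustration.

```cpp
#include <cassert>

// One colour with channels in the range 0...1.
struct Rgb {
    float r;
    float g;
    float b;
};

// out = src * srcAlpha + dst * (1 - srcAlpha), per channel.
// dst is what is already in the framebuffer, src is the layer being drawn.
Rgb blendOver(Rgb dst, Rgb src, float srcAlpha) {
    assert(srcAlpha >= 0.0f && srcAlpha <= 1.0f);
    const float keep = 1.0f - srcAlpha;
    return {src.r * srcAlpha + dst.r * keep,
            src.g * srcAlpha + dst.g * keep,
            src.b * srcAlpha + dst.b * keep};
}
```

Drawing the layers one after another applies this step once per layer, each time with the previous result as dst, so the draw order affects the final colour.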


Wow! Thank you for all your research and code! What a big help. It’s good to have your recommendations on the ideal number of threads and using arrays instead of images or vectors.

The mask transparency is important because the layers need to blend together smoothly. I wonder if there is any advantage in staying on the CPU side as opposed to moving the heavy lifting over to the GPU?

I have to focus fully on another project for ~two weeks, but I am going to try out the various suggestions after that.

I’ve been working on this and I have implemented a solution that is running at ~60 fps. I am sure that will go down as I work with more images at higher resolutions, but this is what I have so far. It is working on Ubuntu Linux and running full screen across 4 monitors.

First I made a simple shader that I call repeatedly to make a series of staggered gradients that I use later as the alpha masks for each image.

Here is the shader:

uniform int begFadeStart;
uniform int begFadeEnd;
uniform int endFadeStart;
uniform int endFadeEnd;

uniform int width;

out vec4 outputColor;

void main()
{
    float texCoord = gl_FragCoord.x;
    
    float a = 0.0;
    if ((texCoord > begFadeStart) && (texCoord < begFadeEnd) ){
        a = (texCoord - begFadeStart) / (begFadeEnd - begFadeStart);
    }
    else if ((texCoord > begFadeEnd ) && (texCoord < endFadeStart)){
        a = 1 ;
    }
    else if ((texCoord > endFadeStart ) && (texCoord < endFadeEnd)){
        a = 1 - ((texCoord - endFadeStart) / (endFadeEnd - endFadeStart));
    }
    
    outputColor = vec4(0.0,0.0,0.0, a);
}
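
To check the gradient values without a GL context, the same trapezoid can be written as a plain function. This is a sketch rather than a line-by-line port: the name `trapezoidAlpha` is made up, the parameters are floats, and the boundaries are treated as inclusive, whereas the shader uses strict comparisons (which is harmless there because gl_FragCoord.x sits on pixel centres such as 0.5, 1.5, ...).

```cpp
#include <cassert>

// Alpha of the trapezoid mask at horizontal position x:
// 0 outside, ramping up over [begFadeStart, begFadeEnd],
// 1 over [begFadeEnd, endFadeStart], ramping down over [endFadeStart, endFadeEnd].
float trapezoidAlpha(float x,
                     float begFadeStart, float begFadeEnd,
                     float endFadeStart, float endFadeEnd) {
    assert(begFadeStart < begFadeEnd);
    assert(begFadeEnd <= endFadeStart);
    assert(endFadeStart < endFadeEnd);
    if (x <= begFadeStart || x >= endFadeEnd) return 0.0f;
    if (x < begFadeEnd) {
        return (x - begFadeStart) / (begFadeEnd - begFadeStart);
    }
    if (x <= endFadeStart) return 1.0f;
    return 1.0f - (x - endFadeStart) / (endFadeEnd - endFadeStart);
}
```

For a mask with fades at 0, 10, 20 and 30, `trapezoidAlpha(5, 0, 10, 20, 30)` is 0.5, position 15 gives 1 and position 25 gives 0.5 again.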

I’ve made each day its own class, and in the setup I make as many gradient alpha masks as I have images:

for (int i=0; i < singleImg.size(); i++){
    //singleImg.at(i).isWrapped = false;
    ofFbo mask;
    mask.allocate(imgWidth, imgHeight);
    mask.begin();
        ofClear(0, 0, 0, 0);
        gradientMaker.begin();
            gradientMaker.setUniform1i("begFadeStart",posMsk);
            singleImg.at(i).startDay = posMsk;
            gradientMaker.setUniform1i("begFadeEnd", posMsk + interval);
            gradientMaker.setUniform1i("endFadeStart",posMsk + interval*2 );
            gradientMaker.setUniform1i("endFadeEnd",posMsk + interval*3);
            singleImg.at(i).endDay = posMsk + interval*3;
            ofDrawRectangle(0, 0,imgWidth,imgHeight);
        gradientMaker.end();

         // TO DO: only wrap it if it is needed.
        // draw a second one with an offset to make it seamless.
    
        gradientMaker.begin();
            gradientMaker.setUniform1i("begFadeStart",offset + posMsk);
            gradientMaker.setUniform1i("begFadeEnd", offset + posMsk + interval );
            gradientMaker.setUniform1i("endFadeStart",offset +posMsk + interval*2 );
            gradientMaker.setUniform1i("endFadeEnd",offset +posMsk + interval*3);
            ofDrawRectangle(0, 0,imgWidth,imgHeight);
        gradientMaker.end();
      
   
    mask.end();
    posMsk += interval;
    ofTexture temp = mask.getTexture();
    singleImg.at(i).msk= temp;
}

Then I made a modified alpha shader that allows me to move around and loop the image and mask independently. This post documents that process:

Then I draw each image with its alpha mask.

for (int i = 0; i < singleImg.size(); i++){
    // only draw a layer if its wrapped start or end falls inside the window
    auto startX = wrapIt(singleImg.at(i).startDay);
    auto endX = wrapIt(singleImg.at(i).endDay);
    if ((startX > 0 && startX < windowWidth) || (endX > 0 && endX < windowWidth)){
        alphaShader.begin();
            alphaShader.setUniformTexture("imageMask", singleImg.at(i).msk, 1);
            alphaShader.setUniform1i("mskXPos", mskPos);
            alphaShader.setUniform1i("imgXPos", imgPos);
            singleImg.at(i).img.draw(0,0, imgWidth , singleImg.at(i).img.getHeight() );
        alphaShader.end();
    }
}
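
One thing to check in the visibility test: a layer whose start lies left of the window and whose end lies right of it has neither endpoint in view, so it would be skipped even though it covers the whole window. Since wrapIt is not shown here, the following is only a sketch of an overlap test under assumed conventions: positions wrap with a fixed `period`, the window covers 0...windowWidth, and `wrapPosition` and `layerVisible` are hypothetical names.

```cpp
#include <cassert>

// Wraps x into the range 0...period-1, also for negative x.
int wrapPosition(int x, int period) {
    assert(period > 0);
    const int r = x % period;
    return r < 0 ? r + period : r;
}

// True if the layer spanning [start, end), shifted by scroll and wrapped
// with the given period, overlaps the window [0, windowWidth).
bool layerVisible(int start, int end, int scroll, int period, int windowWidth) {
    assert(end > start && end - start <= period);
    assert(windowWidth > 0 && windowWidth <= period);
    const int s = wrapPosition(start + scroll, period);
    const int e = s + (end - start);
    // Either the layer begins inside the window,
    // or it runs past the period and wraps back onto the left edge.
    return s < windowWidth || e > period;
}
```

With a period of 1000 and a 300 pixel window, a layer from 900 to 1100 counts as visible because its tail wraps onto 0...100, while a layer from 400 to 500 is skipped.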

Is this a solid approach? Are there any spots where I could optimize?
