c++ multithreading opencv parallel-processing

Parallelisation of clustering method in OpenCV


I'm training a FabMap algorithm for loop-closure detection in my project. Training consists of creating the descriptors, the vocabulary and the Chow-Liu tree. I have a database of more than 10,000 images. I am working on a fairly powerful desktop (12 cores with two threads each, 32 GB of RAM and a 6 GB Nvidia graphics card), and I'd like to make the most of it when training my system. I am using OpenCV 3.0 with TBB enabled, on 64-bit Windows 7.

The problem is that only the descriptor extraction is multi-threaded. The clustering and the construction of the Chow-Liu tree each run in a single thread. The cluster() method of the BOWMSCTrainer class has three consecutive nested for() loops, each of which depends on the result of the previous one, and even the bounds of the inner loops are determined dynamically. This is the core of the cluster() method:

//_descriptors is a Matrix wherein each row is a descriptor

Mat icovar = Mat::eye(_descriptors.cols,_descriptors.cols,_descriptors.type());

std::vector<Mat> initialCentres;
initialCentres.push_back(_descriptors.row(0));
for (int i = 1; i < _descriptors.rows; i++) {
    double minDist = DBL_MAX;
    for (size_t j = 0; j < initialCentres.size(); j++) {
        minDist = std::min(minDist,
            cv::Mahalanobis(_descriptors.row(i),initialCentres[j],
            icovar));
    }
    if (minDist > clusterSize)
        initialCentres.push_back(_descriptors.row(i));
}

std::vector<std::list<cv::Mat> > clusters;
clusters.resize(initialCentres.size());
for (int i = 0; i < _descriptors.rows; i++) {
    int index = 0; double dist = 0, minDist = DBL_MAX;
    for (size_t j = 0; j < initialCentres.size(); j++) {
        dist = cv::Mahalanobis(_descriptors.row(i),initialCentres[j],icovar);
        if (dist < minDist) {
            minDist = dist;
            index = (int)j;
        }
    }
    clusters[index].push_back(_descriptors.row(i));
}

// TODO: throw away small clusters.

Mat vocabulary;
Mat centre = Mat::zeros(1,_descriptors.cols,_descriptors.type());
for (size_t i = 0; i < clusters.size(); i++) {
    centre.setTo(0);
    for (std::list<cv::Mat>::iterator Ci = clusters[i].begin(); Ci != clusters[i].end(); Ci++) {
        centre += *Ci;
    }
    centre /= (double)clusters[i].size();
    vocabulary.push_back(centre);
}

return vocabulary;
}
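One thing worth noting about the first loop nest: it only ever tests whether minDist > clusterSize, so the inner loop can stop at the first centre that lies within clusterSize without changing which descriptors become centres. The sketch below illustrates that idea under some assumptions: it uses plain std::vector instead of cv::Mat, the names Descriptor, euclidean and pickInitialCentres are hypothetical, and it relies on icovar being the identity matrix (as in the code above), in which case the Mahalanobis distance reduces to the Euclidean distance.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Descriptor = std::vector<double>;  // hypothetical stand-in for one Mat row

// Euclidean distance; equals the Mahalanobis distance when the inverse
// covariance matrix is the identity.
double euclidean(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Greedy centre selection with an early exit (hypothetical helper).
// Returns the indices of the descriptors chosen as initial centres.
std::vector<std::size_t> pickInitialCentres(
        const std::vector<Descriptor>& descriptors, double clusterSize) {
    std::vector<std::size_t> centres;
    if (descriptors.empty()) return centres;
    centres.push_back(0);  // the first descriptor is always a centre
    for (std::size_t i = 1; i < descriptors.size(); ++i) {
        bool farFromAll = true;
        for (std::size_t c : centres) {
            if (euclidean(descriptors[i], descriptors[c]) <= clusterSize) {
                farFromAll = false;  // minDist <= clusterSize is already certain
                break;
            }
        }
        if (farFromAll) centres.push_back(i);
    }
    return centres;
}
```

The outer loop stays sequential because each iteration depends on the centres found so far. How much the early exit saves depends on the data: descriptors that end up as new centres still have to be compared against every existing centre.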

To estimate how long the training will take, I down-sampled the database. I started with just 10 images (~20,000 descriptors), and it took about 40 minutes. With a sample of 100 images (~300,000 descriptors) the whole thing took about 60 hours. I am afraid that 1000 images (which should yield a decent vocabulary) may take around 8 months (if the method is O(n²), 60 hours × 10² = 6000 hours ≈ 8 months), and I don't want to imagine how long the whole database would take.
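The extrapolation can be checked with a few lines of arithmetic. The sketch below assumes a power law t ~ n^p; the function names are hypothetical. Fitting p through the two measurements (20,000 descriptors in 40 minutes, 300,000 descriptors in 3600 minutes) gives an exponent of roughly 1.66, slightly below quadratic, although two data points are too few to draw a firm conclusion.

```cpp
#include <cassert>
#include <cmath>

// Exponent p of an assumed power law t ~ n^p, fitted through two measurements
// (n1, t1) and (n2, t2). Times may be in any unit as long as both match.
double scalingExponent(double n1, double t1, double n2, double t2) {
    assert(n1 > 0 && n2 > 0 && t1 > 0 && t2 > 0 && n1 != n2);
    return std::log(t2 / t1) / std::log(n2 / n1);
}

// Predicted running time when the input grows by `factor`, given exponent p.
double extrapolate(double t, double factor, double p) {
    return t * std::pow(factor, p);
}
```

With p = 2, extrapolate(60.0, 10.0, 2.0) gives the 6000 hours quoted above. With the fitted exponent, and assuming 1000 images yield ten times as many descriptors as 100 images, the estimate drops to roughly 2750 hours, which is still months of computation.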

So, my question is: is it possible to parallelise the cluster() method somehow, so that training doesn't take a ridiculous amount of time? I've thought of applying OpenMP pragmas, or creating a thread for each loop, but I don't think that is possible given the dynamic nature of the for() loops. Although I am familiar with parallel programming and multi-threading, I am by no means an expert in this field.

Many thanks in advance!


Solution

  • For what it's worth, here is the code I came up with, using OpenCV's parallel_for_. I also added a feature: the code now discards all clusters smaller than a threshold. It does speed up the process:

    //The first nest of for loops remains untouched; the following ones become:
    
    std::vector<std::list<cv::Mat> > clusters;
    clusters.resize(initialCentres.size());
    
    Mutex lock = Mutex();
    // cv::Range is half-open [start, end), so the end is the element count
    parallel_for_(cv::Range(0, _descriptors.rows),
            for_createClusters(clusters, initialCentres, icovar, _descriptors, lock));
    
    Mat vocabulary;
    Mat centre = Mat::zeros(1,_descriptors.cols,_descriptors.type());
    parallel_for_(cv::Range(0, (int)clusters.size()), for_estimateCentres(clusters,
            vocabulary, centre, minSize, lock));
    

    And, in the header:

    //parallel_for_ for creating clusters:
    class CV_EXPORTS for_createClusters: public ParallelLoopBody {
    private:
    
    std::vector<std::list<cv::Mat> >& bufferCluster;
    const std::vector<Mat> initCentres;
    const Mat icovar;
    const Mat descriptorsParallel;
    Mutex& lock_for;
    
    public:
    for_createClusters(std::vector<std::list<cv::Mat> >& _buffCl,
            const std::vector<Mat> _initCentres, const Mat _icovar,
            const Mat _descriptors, Mutex& _lock_for)
    : bufferCluster (_buffCl), initCentres(_initCentres), icovar(_icovar),
      descriptorsParallel(_descriptors), lock_for(_lock_for){}
    
    
    virtual void operator()( const cv::Range &r ) const
    {
        for (int f = r.start; f != r.end; ++f)
        {
            int index = 0; double dist = 0, minDist = DBL_MAX;
            for (size_t j = 0; j < initCentres.size(); j++) {
                dist = cv::Mahalanobis(descriptorsParallel.row(f),
                        initCentres[j],icovar);
                if (dist < minDist) {
                    minDist = dist;
                    index = (int)j;
                }
            }
            {
    //              AutoLock Lock(lock_for);
                lock_for.lock();
                bufferCluster[index].push_back(descriptorsParallel.row(f));
                lock_for.unlock();
            }
        }
    }
    };
    
    class CV_EXPORTS for_estimateCentres: public ParallelLoopBody {
    private:
    
    const std::vector<std::list<cv::Mat> > bufferCluster;
    Mat& vocabulary;
    const Mat centre;
    const int minSizCl;
    Mutex& lock_for;
    
    public:
    for_estimateCentres(const std::vector<std::list<cv::Mat> > _bufferCluster,
            Mat& _vocabulary, const Mat _centre, const int _minSizCl, Mutex& _lock_for)
    : bufferCluster(_bufferCluster), vocabulary(_vocabulary),
      centre(_centre), minSizCl(_minSizCl), lock_for(_lock_for){}
    
    virtual void operator()( const cv::Range &r ) const
    {
        Mat ctr = Mat::zeros(1, centre.cols,centre.type());
    
        for (int f = r.start; f != r.end; ++f){
            ctr.setTo(0);
            //Not taking into account small clusters
            if(bufferCluster[f].size() >= (size_t) minSizCl)
            {
                for (std::list<cv::Mat>::const_iterator
                        Ci = bufferCluster[f].begin();
                        Ci != bufferCluster[f].end(); Ci++)
                            ctr += *Ci;
    
                ctr /= (double)bufferCluster[f].size();
    
                {
    //              AutoLock Lock(lock_for);
                    lock_for.lock();
                    vocabulary.push_back(ctr);
                    lock_for.unlock();
                }
            }
        }
    }
    };
    

    Hope this helps someone.
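A possible variation on the code above: because every thread pushes into shared containers under one mutex, the order of the rows in vocabulary can change from run to run. One way to avoid both the lock and that nondeterminism is to have each iteration write only its own slot of a label array, then compute the means in a cheap serial pass. The sketch below shows the idea with an OpenMP pragma, since the question mentions OpenMP. It is a minimal illustration, not a drop-in replacement: it uses std::vector instead of cv::Mat, the names Descriptor, squaredDistance, assignLabels and estimateCentres are hypothetical, and it assumes the identity icovar from the question, so the nearest centre under the Mahalanobis distance is also the nearest under the squared Euclidean distance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Descriptor = std::vector<double>;  // hypothetical stand-in for one Mat row

// Squared Euclidean distance; it has the same nearest centre as the
// Mahalanobis distance with an identity inverse covariance.
double squaredDistance(const Descriptor& a, const Descriptor& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double d = a[k] - b[k];
        sum += d * d;
    }
    return sum;
}

// Label every descriptor with the index of its nearest centre.
std::vector<int> assignLabels(const std::vector<Descriptor>& descriptors,
                              const std::vector<Descriptor>& centres) {
    assert(!centres.empty());
    const int n = static_cast<int>(descriptors.size());
    std::vector<int> labels(descriptors.size(), 0);
    // Iteration i reads shared data and writes only labels[i], so no mutex is
    // needed. If OpenMP is not enabled, the pragma is ignored and the loop
    // runs serially with the same result.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double best = squaredDistance(descriptors[i], centres[0]);
        int bestIndex = 0;
        for (std::size_t j = 1; j < centres.size(); ++j) {
            const double d = squaredDistance(descriptors[i], centres[j]);
            if (d < best) { best = d; bestIndex = static_cast<int>(j); }
        }
        labels[i] = bestIndex;
    }
    return labels;
}

// Serial mean per cluster. Clusters with fewer than minSize members are
// dropped, and the output follows the order of the centres.
std::vector<Descriptor> estimateCentres(const std::vector<Descriptor>& descriptors,
                                        const std::vector<int>& labels,
                                        std::size_t numCentres, std::size_t minSize) {
    assert(labels.size() == descriptors.size());
    const std::size_t dim = descriptors.empty() ? 0 : descriptors[0].size();
    std::vector<Descriptor> sums(numCentres, Descriptor(dim, 0.0));
    std::vector<std::size_t> counts(numCentres, 0);
    for (std::size_t i = 0; i < descriptors.size(); ++i) {
        const std::size_t c = static_cast<std::size_t>(labels[i]);
        for (std::size_t k = 0; k < dim; ++k) sums[c][k] += descriptors[i][k];
        ++counts[c];
    }
    std::vector<Descriptor> vocabulary;
    for (std::size_t c = 0; c < numCentres; ++c) {
        if (counts[c] == 0 || counts[c] < minSize) continue;  // skip small clusters
        for (double& v : sums[c]) v /= static_cast<double>(counts[c]);
        vocabulary.push_back(sums[c]);
    }
    return vocabulary;
}
```

The averaging pass is linear in the number of descriptors, so leaving it serial costs little compared with the distance computations. The same label-array idea should carry over to a ParallelLoopBody, with each call to operator() writing labels[f] for its own range.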