
Fastest way to multithread doing quickselect on all columns or all rows of a matrix in Rcpp - OpenMP, RcppParallel or RcppThread

I was using this Rcpp code to do a quickselect on a vector of values, i.e. obtain the kth smallest element of a vector in O(n) average time (I saved this as qselect.cpp):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <algorithm>
using namespace arma;

// [[Rcpp::export]]
double qSelect(arma::vec& x, const int k) {

  // x: vector to find the k-th smallest element in
  // k: rank of the desired element (1-based)

  // safety copy since nth_element modifies in place
  arma::vec y(x.memptr(), x.n_elem);

  // partially sort y in O(n) average time
  std::nth_element(y.begin(), y.begin() + k - 1, y.end());

  // the k-th smallest value
  const double kthValue = y(k-1);

  return kthValue;
}
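
Because std::nth_element compares with operator< by default, position k - 1 ends up holding the k-th smallest value, which is what sort(x, partial=k)[k] returns in R. Below is a minimal standalone sketch of both orderings, using std::vector instead of arma::vec. The helper names kthSmallest and kthLargest are hypothetical and not part of the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical helper: k-th smallest value (1-based k).
// The vector is taken by value, so the caller's data is left untouched.
double kthSmallest(std::vector<double> v, std::size_t k) {
    if (k < 1 || k > v.size()) throw std::out_of_range("k must be in [1, n]");
    const auto pos = v.begin() + static_cast<std::ptrdiff_t>(k - 1);
    std::nth_element(v.begin(), pos, v.end());  // default comparator: operator<
    return *pos;
}

// Hypothetical helper: k-th largest value (1-based k).
// Only the comparator changes: std::greater orders the data descending.
double kthLargest(std::vector<double> v, std::size_t k) {
    if (k < 1 || k > v.size()) throw std::out_of_range("k must be in [1, n]");
    const auto pos = v.begin() + static_cast<std::ptrdiff_t>(k - 1);
    std::nth_element(v.begin(), pos, v.end(), std::greater<double>());
    return *pos;
}
```

For a low percentile such as tau = 0.01, the k-th smallest element is the one that is wanted.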


I was using this as a fast way to calculate a desired percentile, e.g.:

library(microbenchmark)
Rcpp::sourceCpp("qselect.cpp")
n = 50000
x = rnorm(n=n, mean=100, sd=20)
tau = 0.01 # desired quantile (1st percentile)
k = tau*n+1 # k = 501, so we get the 501st smallest element
microbenchmark(qSelect(x,k)) # 53.32917, 548 µs
microbenchmark(sort(x, partial=k)[k]) # 53.32917, 694 µs = pure R solution

[This may look like it's already fast but I need to do this millions of times in my application]

Now I would like to modify this Rcpp function so that it does a multithreaded quickselect on all columns or all rows of an R matrix and returns the result as a vector. As I am a bit of a novice in Rcpp, I would like some advice on which framework is likely to be fastest for this and easiest to code: OpenMP, RcppParallel or RcppThread? It has to work easily cross-platform, and I need good control over the number of threads to use. Or even better, could someone demonstrate a fast and elegant way to do this?
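
For the row-wise case, one point worth keeping in mind is that R matrices are stored column-major, so a column is contiguous in memory while a row has to be gathered with a stride of n_rows before std::nth_element can run on it. A rough standalone sketch follows, assuming a plain std::vector<double> in column-major layout rather than an arma::mat. The function name kthSmallestByRow and its parameters are made up for illustration, and the OpenMP pragma is simply ignored if the compiler is not given an OpenMP flag.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: k-th smallest value (1-based k) of every row of a
// column-major n_rows x n_cols matrix stored in `data`.
std::vector<double> kthSmallestByRow(const std::vector<double>& data,
                                     std::size_t n_rows, std::size_t n_cols,
                                     std::size_t k, int nthreads) {
    if (data.size() != n_rows * n_cols || k < 1 || k > n_cols)
        throw std::invalid_argument("bad dimensions or k outside [1, n_cols]");

    std::vector<double> out(n_rows);
    const int nt = nthreads < 1 ? 1 : nthreads;
    (void)nt;  // silences unused-variable warnings when OpenMP is disabled
    const long long r = static_cast<long long>(n_rows);

    #pragma omp parallel for num_threads(nt) schedule(static)
    for (long long i = 0; i < r; ++i) {
        // Gather row i: element (i, j) lives at index j * n_rows + i.
        std::vector<double> row(n_cols);
        for (std::size_t j = 0; j < n_cols; ++j)
            row[j] = data[j * n_rows + static_cast<std::size_t>(i)];

        const auto pos = row.begin() + static_cast<std::ptrdiff_t>(k - 1);
        std::nth_element(row.begin(), pos, row.end());
        out[static_cast<std::size_t>(i)] = *pos;  // each thread writes its own slot
    }
    return out;
}
```

Whether gathering rows like this or transposing the matrix once and working on columns is faster would need to be benchmarked.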


  • Following the advice below I tried multithreading with OpenMP and this seems to give decent speedups using 8 threads on my laptop. I modified my qselect.cpp file to:

    // [[Rcpp::depends(RcppArmadillo)]]
    // [[Rcpp::plugins(openmp)]]
    #include <RcppArmadillo.h>
    #include <algorithm>
    #ifdef _OPENMP
    #include <omp.h>
    #endif
    using namespace arma;

    // [[Rcpp::export]]
    double qSelect(arma::vec& x, const int k) {
      // ARGUMENTS
      // x: vector to find the k-th smallest element in
      // k: k-th statistic to look up (1-based)
      // safety copy since nth_element modifies in place
      arma::vec y(x.memptr(), x.n_elem);
      // partially sorts y
      std::nth_element(y.begin(), y.begin() + k - 1, y.end());
      // the k-th smallest value
      const double kthValue = y(k-1);
      return kthValue;
    }

    // [[Rcpp::export]]
    arma::vec qSelectMbycol(arma::mat& M, const int k) {
      // ARGUMENTS
      // M: matrix for which we want to find the k-th smallest element of each column
      // k: k-th statistic to look up (1-based)
      // copy of M (not strictly needed, since each column is copied again below)
      arma::mat Y(M.memptr(), M.n_rows, M.n_cols);
      // we apply over columns
      int c = M.n_cols;
      arma::vec out(c);
      for (int i = 0; i < c; i++) {
        arma::vec y = Y.col(i); // per-column copy that nth_element may reorder
        std::nth_element(y.begin(), y.begin() + k - 1, y.end());
        out(i) = y(k-1); // the k-th smallest value of each column
      }
      return out;
    }

    // [[Rcpp::export]]
    arma::vec qSelectMbycolOpenMP(arma::mat& M, const int k, int nthreads) {
      // ARGUMENTS
      // M: matrix for which we want to find the k-th smallest element of each column
      // k: k-th statistic to look up (1-based)
      // nthreads: number of threads to use
      // copy of M (not strictly needed, since each column is copied again below)
      arma::mat Y(M.memptr(), M.n_rows, M.n_cols);
      // we apply over columns
      int c = M.n_cols;
      arma::vec out(c);
      // num_threads() makes the nthreads argument take effect
    #pragma omp parallel for num_threads(nthreads) shared(out) schedule(dynamic,1)
      for (int i = 0; i < c; i++) {
        arma::vec y = Y.col(i); // per-column copy, private to each iteration
        std::nth_element(y.begin(), y.begin() + k - 1, y.end());
        out(i) = y(k-1); // the k-th smallest value of each column
      }
      return out;
    }


    n = 50000
    x = rnorm(n=n, mean=100, sd=20)
    M = matrix(rnorm(n=n*10, mean=100, sd=20), ncol=10)
    tau = 0.01 # desired quantile (1st percentile)
    k = tau*n+1 # k = 501, so we get the 501st smallest element of each column
    microbenchmark(apply(M, 2, function (col) sort(col, partial=k)[k]),
                   apply(M, 2, function (col) qSelect(col,k)),
                   qSelectMbycol(M,k),
                   qSelectMbycolOpenMP(M,k,nthreads=8))
    Unit: milliseconds
                                                     expr      min       lq      mean    median        uq        max neval cld
     apply(M, 2, function(col) sort(col, partial = k)[k]) 8.937091 9.301237 11.802960 11.828665 12.718612  43.316107   100   b
               apply(M, 2, function(col) qSelect(col, k)) 6.757771 6.970743 11.047100  7.956696  9.994035 133.944735   100   b
                                      qSelectMbycol(M, k) 5.370893 5.526772  5.753861  5.641812  5.826985   7.124698   100  a 
                  qSelectMbycolOpenMP(M, k, nthreads = 8) 2.695924 2.810108  3.005665  2.899701  3.061996   6.796260   100  a 

    I was surprised that doing the apply in Rcpp without any multithreading (qSelectMbycol) already gave a roughly 2-fold speedup over the pure R version (median 5.6 ms vs 11.8 ms), and that OpenMP multithreading (qSelectMbycolOpenMP) gave a further 2-fold speedup (median 2.9 ms).

    Any advice on further optimizing the code is welcome, though.

    For small n (n<1000) the OpenMP version is slower than the single-threaded Rcpp version, probably because the individual per-column jobs are then too small to amortize the threading overhead. E.g. for n=500:

    Unit: microseconds
                                                     expr     min       lq      mean   median       uq      max neval cld
     apply(M, 2, function(col) sort(col, partial = k)[k]) 310.477 324.8025 357.47145 337.8465 361.5810 1782.885   100   c
               apply(M, 2, function(col) qSelect(col, k)) 103.921 114.8255 141.59221 119.3155 131.9315 1990.298   100  b 
                                      qSelectMbycol(M, k)  24.377  32.2885  44.13873  35.2825  39.3440  900.210   100 a  
                  qSelectMbycolOpenMP(M, k, nthreads = 8)  76.123  92.1600 130.42627  99.8575 112.4730 1303.059   100  b
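
One possible way to address the small-n slowdown is OpenMP's if clause, which runs the region serially when its condition is false, combined with one scratch buffer per thread so that each column is copied only once. The sketch below has not been benchmarked; it works on a plain column-major std::vector<double> and is not a drop-in Rcpp function. The name kthSmallestByCol and the min_rows_parallel threshold (default 1000, picked only because of the n<1000 observation above) are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: k-th smallest value (1-based k) of every column of a
// column-major n_rows x n_cols matrix stored in `data`.
// min_rows_parallel is an assumed tuning knob, not a measured optimum.
std::vector<double> kthSmallestByCol(const std::vector<double>& data,
                                     std::size_t n_rows, std::size_t n_cols,
                                     std::size_t k, int nthreads,
                                     std::size_t min_rows_parallel = 1000) {
    if (data.size() != n_rows * n_cols || k < 1 || k > n_rows)
        throw std::invalid_argument("bad dimensions or k outside [1, n_rows]");

    std::vector<double> out(n_cols);
    const int nt = nthreads < 1 ? 1 : nthreads;
    (void)nt;  // silences unused-variable warnings when OpenMP is disabled
    const long long c = static_cast<long long>(n_cols);

    // The if clause keeps small problems on a single thread.
    #pragma omp parallel num_threads(nt) if(n_rows >= min_rows_parallel)
    {
        std::vector<double> scratch(n_rows);  // one buffer per thread, reused
        #pragma omp for schedule(static)
        for (long long j = 0; j < c; ++j) {
            // Column j is contiguous: copy it once, then select in place.
            const double* col = data.data() + static_cast<std::size_t>(j) * n_rows;
            std::copy(col, col + n_rows, scratch.begin());
            const auto pos = scratch.begin() + static_cast<std::ptrdiff_t>(k - 1);
            std::nth_element(scratch.begin(), pos, scratch.end());
            out[static_cast<std::size_t>(j)] = *pos;
        }
    }
    return out;
}
```

In an Rcpp version the same pragma structure could be applied to the loop in qSelectMbycolOpenMP, with the threshold tuned against benchmarks on the target machine.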