multithreadingopenmprcppsimdmandelbrot

Multithreaded & SIMD vectorized Mandelbrot in R using Rcpp & OpenMP


As an OpenMP & Rcpp performance test I wanted to check how fast I could calculate the Mandelbrot set in R using the most straightforward and simple Rcpp+OpenMP implementation. Currently what I did was:

#include <Rcpp.h>
#include <omp.h>
// [[Rcpp::plugins(openmp)]]

using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericMatrix mandelRcpp(const double x_min, const double x_max, const double y_min, const double y_max,
                         const int res_x, const int res_y, const int nb_iter) {
  Rcpp::NumericMatrix ret(res_x, res_y);
  double x_step = (x_max - x_min) / res_x;
  double y_step = (y_max - y_min) / res_y;
  int r,c;
#pragma omp parallel for default(shared) private(c) schedule(dynamic,1) collapse(2)
  for (r = 0; r < res_y; r++) {
    for (c = 0; c < res_x; c++) {
      double zx = 0.0, zy = 0.0, new_zx;
      double cx = x_min + c*x_step, cy = y_min + r*y_step;
      int n = 0;
      for (n=0;  (zx*zx + zy*zy < 4.0 ) && ( n < nb_iter ); n++ ) {
        new_zx = zx*zx - zy*zy + cx;
        zy = 2.0*zx*zy + cy;
        zx = new_zx;
      }
      ret(c,r) = n;
    }
  }
  return ret;
}

And then in R:

library(Rcpp)
sourceCpp("mandelRcpp.cpp")
xlims=c(-0.74877,-0.74872);
ylims=c(0.065053,0.065103);
x_res=y_res=1080L; nb_iter=10000L;
system.time(m <- mandelRcpp(xlims[[1]], xlims[[2]], ylims[[1]], ylims[[2]], x_res, y_res, nb_iter)) 
# 0.92s
rainbow=c(rgb(0.47,0.11,0.53),rgb(0.27,0.18,0.73),rgb(0.25,0.39,0.81),rgb(0.30,0.57,0.75),rgb(0.39,0.67,0.60),rgb(0.51,0.73,0.44),rgb(0.67,0.74,0.32),rgb(0.81,0.71,0.26),rgb(0.89,0.60,0.22),rgb(0.89,0.39,0.18),rgb(0.86,0.13,0.13))
    cols=c(colorRampPalette(rainbow)(100),rev(colorRampPalette(rainbow)(100)),"black") # palette
par(mar=c(0, 0, 0, 0))
system.time(image(m^(1/7), col=cols, asp=diff(ylims)/diff(xlims), axes=F, useRaster=T)) 
# 0.5s

enter image description here

I was unsure though if there is any other obvious speed improvements I could take advantage of aside from OpenMP multithreading, e.g. via simd vectorization? (using simd options in the openmp #pragma didn't seem to do anything)

PS at first my code was crashing but I later found this was solved by replacing ret[r,c] = n; with ret(r,c) = n; Using Armadillo classes as suggested in the answer below make things very slightly faster, though the timings are almost the same. Also flipped around x and y so it comes out in the right orientation when plotted with image(). Using 8 threads speed is ca. 350 times faster than the vectorized plain R Mandelbrot version here and also about 7.3 times faster than the (non-multithreaded) Python/Numba version here (similar to PyCUDA or PyOpenCL speeds), so quite happy with that... Rasterizing/display now seems the bottleneck in R....


Solution

  • Do not use OpenMP with Rcpp's *Vector or *Matrix objects as they mask SEXP functions / memory allocations that are single-threaded. OpenMP is a multi-threaded approach.

    This is why the code is crashing.

    One way to get around this limitation is to use a non-R data structure to store the results. One of the following will be sufficient: arma::mat or Eigen::MatrixXd or std::vector<T>... As I favor armadillo, I will change the res matrix to arma::mat from Rcpp::NumericMatrix. Thus, the following will execute your code in parallel:

    #include <RcppArmadillo.h> // Note the changed include and new attribute
    // [[Rcpp::depends(RcppArmadillo)]]
    
    // Avoid including header if openmp not on system
    #ifdef _OPENMP
    #include <omp.h>
    #endif
    // [[Rcpp::plugins(openmp)]]
    
    // Note the changed return type
    // [[Rcpp::export]]
    arma::mat mandelRcpp(const double x_min, const double x_max,
                         const double y_min, const double y_max,
                         const int res_x, const int res_y, const int nb_iter) {
      arma::mat ret(res_x, res_y); // note change
      double x_step = (x_max - x_min) / res_x;
      double y_step = (y_max - y_min) / res_y;
      unsigned r,c;
    
      #pragma omp parallel for shared(res)
      for (r = 0; r < res_y; r++) {
        for (c = 0; c < res_x; c++) {
          double zx = 0.0, zy = 0.0, new_zx;
          double cx = x_min + c*x_step, cy = y_min + r*y_step;
          unsigned n = 0;
          for (;  (zx*zx + zy*zy < 4.0 ) && ( n < nb_iter ); n++ ) {
            new_zx = zx*zx - zy*zy + cx;
            zy = 2.0*zx*zy + cy;
            zx = new_zx;
          }
    
          if(n == nb_iter) {
            n = 0;
          }
    
          ret(r, c) = n;
        }
      }
    
      return ret;
    }
    

    With the test code (note y and x were not defined, thus I assumed y = ylims and x = xlims) we have:

    xlims = ylims = c(-2.0, 2.0)
    
    x_res = y_res = 400L
    nb_iter = 256L
    
    system.time(m <-
                  mandelRcpp(xlims[[1]], xlims[[2]],
                             ylims[[1]], ylims[[2]], 
                             x_res, y_res, nb_iter))
    
    rainbow = c(
      rgb(0.47, 0.11, 0.53),
      rgb(0.27, 0.18, 0.73),
      rgb(0.25, 0.39, 0.81),
      rgb(0.30, 0.57, 0.75),
      rgb(0.39, 0.67, 0.60),
      rgb(0.51, 0.73, 0.44),
      rgb(0.67, 0.74, 0.32),
      rgb(0.81, 0.71, 0.26),
      rgb(0.89, 0.60, 0.22),
      rgb(0.89, 0.39, 0.18),
      rgb(0.86, 0.13, 0.13)
    )
    
    cols = c(colorRampPalette(rainbow)(100),
             rev(colorRampPalette(rainbow)(100)),
             "black") # palette
    par(mar = c(0, 0, 0, 0))
    
    image(m,
          col = cols,
          asp = diff(range(ylims)) / diff(range(xlims)),
          axes = F)
    

    For:

    enter image description here