cvectorizationhpcintel-mkllarge-data-volumes

Does Intel MKL or some similar library provide a vectorized way to count the number of elements in an array fulfilling some condition in C?


The problem

I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of (tens of millions) of coefficients. Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.

However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.

Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.

Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the box method, but I'm coming up empty.

Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.

Thanks!


Solution

  • If you were using c++, count_if from the algorithms library with an execution policy of par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.

    It's not likely to be as easy in c. Because c doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count()-function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread parallelized) scalar code. OTOH, if you use for example gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use c++.

    However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.

    For example. gcc vectorizes the code on x86 if at least sse4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you can get wider vectors to do more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).

    Note that in the provided code, the conditions should not use short-circuiting or (||), because then the read from the max-vector is semantically forbidden if the comparison with the min-vector was already true for the current element, severely hindering vectorization (though avx512 could potentially vectorize this with somewhat catastrophic slowdown).

    I'm pretty sure gcc is not nearly optimal in the code it generates for avx512, since it could do the k-reg (mask register) or in the mask registers with kor[b/w/d/q], but maybe somebody with more experience in avx512 (*cougth* Peter Cordes *cough*) could weigh in on that.