I need to perform a very common and simple matrix operation.
However, I need it fast, really fast...
I am already considering a multi-threaded implementation, but for now I just want to see how fast I can get it on a single processor.
The matrix operation is as follows:
I am calculating the Euclidean distance between a vector of points (A) and a reference point (B).
The points are in 3D space and each point has a set of X, Y and Z coordinates.
Consequently the vector of points is described by three floating-point arrays holding the X, Y, Z coordinates for each point.
The output is another vector of length N holding the distances between each point in the array and the reference point.
The three XYZ arrays are arranged as columns of a Nx3 matrix.
x[0]    y[0]    z[0]
x[1]    y[1]    z[1]
x[2]    y[2]    z[2]
x[3]    y[3]    z[3]
 .       .       .
 .       .       .
 .       .       .
x[N-1]  y[N-1]  z[N-1]
In memory the matrix is stored as a 1D array with the X, Y and Z columns laid out one after the other (i.e. column-major order for the Nx3 matrix):
x[0], x[1], x[2], x[3] . . . x[N-1], y[0], y[1], y[2], y[3] . . . y[N-1], z[0], z[1], z[2], z[3] . . . z[N-1]
The whole thing is slightly complicated by the fact that we need to add a scalar to each member of the matrix before we take the square root.
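In other words, for each point i the output is:

    distances[i] = sqrt((x[i] - Bx)^2 + (y[i] - By)^2 + (z[i] - Bz)^2 + scalar)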
The following is the routine in naive C code:
#include <math.h>

void calculateDistances3D(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
{
    float *Ax = matrix;   // X column
    float *Ay = Ax + N;   // Y column
    float *Az = Ay + N;   // Z column
    int i;
    for (i = 0; i < N; i++) {
        float dx = Ax[i] - Bx;
        float dy = Ay[i] - By;
        float dz = Az[i] - Bz;
        float dx2 = dx * dx;
        float dy2 = dy * dy;
        float dz2 = dz * dz;
        float squaredDistance = dx2 + dy2 + dz2;
        float squaredDistancePlusScalar = squaredDistance + scalar;
        distances[i] = sqrtf(squaredDistancePlusScalar);  // sqrtf avoids the float->double->float round trip of sqrt
    }
}
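For completeness, here is a minimal, hypothetical usage sketch (not part of my actual code; the names and values are arbitrary) showing how the column-wise layout above maps onto a single buffer of 3*N floats before calling the routine:

    #include <stdlib.h>

    int exampleUsage(void)
    {
        int N = 1000;
        float *matrix    = malloc(3 * N * sizeof(float));  // [X column | Y column | Z column]
        float *distances = malloc(N * sizeof(float));

        for (int i = 0; i < N; i++) {
            matrix[i]         = (float)i;          // x[i]
            matrix[N + i]     = (float)i * 2.0f;   // y[i]
            matrix[2 * N + i] = (float)i * 3.0f;   // z[i]
        }

        calculateDistances3D(matrix, 1.0f, 2.0f, 3.0f, 0.5f, distances, N);

        free(matrix);
        free(distances);
        return 0;
    }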
Here is the naive Accelerate implementation (using vDSP and vecLib). Notice that all processing is performed in-place, so the input matrix gets overwritten:
#include <Accelerate/Accelerate.h>

void calculateDistances3D_vDSP(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
{
    float *Ax = matrix;
    float *Ay = Ax + N;
    float *Az = Ay + N;

    // for each point in the array, take the difference with the reference point
    Bx = -Bx;
    By = -By;
    Bz = -Bz;
    vDSP_vsadd(Ax, 1, &Bx, Ax, 1, N);
    vDSP_vsadd(Ay, 1, &By, Ay, 1, N);
    vDSP_vsadd(Az, 1, &Bz, Az, 1, N);

    // square each coordinate
    vDSP_vsq(Ax, 1, Ax, 1, N);
    vDSP_vsq(Ay, 1, Ay, 1, N);
    vDSP_vsq(Az, 1, Az, 1, N);

    // reduce the XYZ columns to a single column in Ax (reduction by summation)
    vDSP_vadd(Ax, 1, Ay, 1, Ax, 1, N);
    vDSP_vadd(Ax, 1, Az, 1, Ax, 1, N);

    // add the scalar
    vDSP_vsadd(Ax, 1, &scalar, Ax, 1, N);

    // take the square root
    vvsqrtf(distances, Ax, &N);
}
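Because this version overwrites the X, Y and Z columns, a caller that still needs matrix afterwards has to run it on a scratch copy, along these lines (illustrative fragment only; requires <stdlib.h> and <string.h>):

    float *scratch = malloc(3 * N * sizeof(float));
    memcpy(scratch, matrix, 3 * N * sizeof(float));   // preserve the original points
    calculateDistances3D_vDSP(scratch, Bx, By, Bz, scalar, distances, N);
    free(scratch);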
In the vDSP library the only functions that could be used to calculate distances between vectors are:
vDSP_vdist()
vDSP_distancesq()
vDSP_vpythg()
Maybe I am missing something, but as far as I can tell none of them supports the three input vectors needed to calculate a distance in 3D.
A couple of things to notice:
(1) I am not comparing distances, so I cannot live with the squared distance. I need the real distance, hence calculating the square root is absolutely necessary.
(2) Taking the reciprocal square root could be a possibility if you really think that doing so will make the code significantly faster.
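To be concrete about (2): assuming the consumer of the output could work with 1/distance, that would mean replacing the last vvsqrtf call above with the vectorized reciprocal square root from vForce, which has the same calling convention:

    // hypothetical variant: distances[i] = 1 / sqrt(Ax[i]), i.e. reciprocal distances
    vvrsqrtf(distances, Ax, &N);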
I have the impression that I am not using the Accelerate framework to its full potential.
I am looking for something a little smarter and more concise, doing more work in fewer function calls. Rearranging the memory in other ways would also be OK, although I think the current layout is pretty good as it is.
I am also open to suggestions about other highly optimized/vectorized linear algebra libraries that work on Intel processors. I don't care whether they are commercial or open-source, as long as they are fast and robust.
The question is: what is the best function, or combination of functions, in the Accelerate framework to achieve faster code than the above?
I am developing in Xcode 7 on a MacBook Pro (Retina, 15-inch, Mid 2014) running Mac OS X El Capitan.
Thanks.
Try this one. I timed it with N = 2^20 at 1000 repeats on both my iMac and my iPhone. Note that matrix can be treated as strictly read-only, because only distances is manipulated; the deviation from your original results is on the order of 10^-6.
In my opinion, vDSP is too high level for further "targeted" optimization. Instead, you may look into Ray Wenderlich's iOS Assembly Tutorial as a starting point for using NEON and writing your own SIMD instructions for this specific problem.
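(That tutorial targets NEON because of the iPhone. Since the question is about an Intel MacBook Pro, a rough, hypothetical equivalent with SSE intrinsics would look like the sketch below; the function name and the 4-wide unrolling are illustrative only, not taken from Accelerate.)

    #include <immintrin.h>
    #include <math.h>

    void calculateDistances3D_sse(const float *matrix, float Bx, float By, float Bz,
                                  float scalar, float *distances, int N)
    {
        const float *Ax = matrix;
        const float *Ay = Ax + N;
        const float *Az = Ay + N;
        __m128 bx = _mm_set1_ps(Bx);
        __m128 by = _mm_set1_ps(By);
        __m128 bz = _mm_set1_ps(Bz);
        __m128 s  = _mm_set1_ps(scalar);

        int i = 0;
        for (; i + 4 <= N; i += 4) {                        // 4 points per iteration
            __m128 dx = _mm_sub_ps(_mm_loadu_ps(Ax + i), bx);
            __m128 dy = _mm_sub_ps(_mm_loadu_ps(Ay + i), by);
            __m128 dz = _mm_sub_ps(_mm_loadu_ps(Az + i), bz);
            __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)),
                                   _mm_add_ps(_mm_mul_ps(dz, dz), s));
            _mm_storeu_ps(distances + i, _mm_sqrt_ps(d2));
        }
        for (; i < N; i++) {                                // scalar tail for N not divisible by 4
            float dx = Ax[i] - Bx, dy = Ay[i] - By, dz = Az[i] - Bz;
            distances[i] = sqrtf(dx * dx + dy * dy + dz * dz + scalar);
        }
    }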
Depending on the size N of your problem, you could also gain a further speed-up by using the GPU, for example with Metal.
void calculateDistances3D_vDSP(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
{
    float *Ax = matrix;   // only read, never written
    float *Ay = Ax + N;
    float *Az = Ay + N;

    float constants = Bx*Bx + By*By + Bz*Bz + scalar;
    Bx = -2.0f*Bx;
    By = -2.0f*By;
    Bz = -2.0f*Bz;

    vDSP_vsq(Ax, 1, distances, 1, N);                        // Ax^2
    vDSP_vma(Ay, 1, Ay, 1, distances, 1, distances, 1, N);   // Ax^2 + Ay^2
    vDSP_vma(Az, 1, Az, 1, distances, 1, distances, 1, N);   // Ax^2 + Ay^2 + Az^2
    vDSP_vsma(Ax, 1, &Bx, distances, 1, distances, 1, N);    // ... - 2*Ax*Bx
    vDSP_vsma(Ay, 1, &By, distances, 1, distances, 1, N);    // ... - 2*Ax*Bx - 2*Ay*By
    vDSP_vsma(Az, 1, &Bz, distances, 1, distances, 1, N);    // ... - 2*Ax*Bx - 2*Ay*By - 2*Az*Bz
    vDSP_vsadd(distances, 1, &constants, distances, 1, N);   // ... + constants = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2 + scalar

    /*
    // Alternative ordering with the same result:
    vDSP_vsq(Ax, 1, distances, 1, N);                        // Ax^2
    vDSP_vsma(Ax, 1, &Bx, distances, 1, distances, 1, N);    // Ax^2 - 2*Ax*Bx
    vDSP_vma(Ay, 1, Ay, 1, distances, 1, distances, 1, N);   // Ax^2 - 2*Ax*Bx + Ay^2
    vDSP_vsma(Ay, 1, &By, distances, 1, distances, 1, N);    // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By
    vDSP_vma(Az, 1, Az, 1, distances, 1, distances, 1, N);   // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By + Az^2
    vDSP_vsma(Az, 1, &Bz, distances, 1, distances, 1, N);    // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By + Az^2 - 2*Az*Bz
    vDSP_vsadd(distances, 1, &constants, distances, 1, N);   // ... + constants = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2 + scalar
    */

    // take the square root
    vvsqrtf(distances, distances, &N);
}
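For what it's worth, the algebra behind the rewrite is just the expansion of the squared differences, which folds everything that depends only on the reference point into a single scalar and therefore never has to write back into matrix:

    (x - Bx)^2 + (y - By)^2 + (z - Bz)^2 + scalar
        = (x^2 + y^2 + z^2) - 2*(Bx*x + By*y + Bz*z) + (Bx^2 + By^2 + Bz^2 + scalar)

The last group is the constants value computed up front, and the -2*Bx, -2*By, -2*Bz factors are exactly what the three vDSP_vsma calls multiply in.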