I have a disk (centered in 0, with radius 1) filled with triangles (not necessarily of same area/length). There could be a HUGE amount of triangle (let's say from 1k
to 300k
triangles)
My goal is to find as quick as possible in which triangle a point belongs.
The operation has to be repeated a large amount of time (around 10k
times).
For now the algorithm I'm using is: I'm computing the barycentric coordinates of the point in each triangle. If the first coefficient is between 0 and 1, I continue. If it's not, I stop. Then I compute the second coefficient with the same idea, and the third, and I do this for every triangle.
I can't think of a way to use the fact that I'm working on a disc (and the fact that I have an Euclidean distance to help me "target" the good triangles directly): If I try to compute the distance from my point to every "center" of triangles :
1) it's already more operations than what I'm doing when I brute force it with barycentric coordinates
2) I will have to order a vector containing the Euclidean distances of all triangles to my point.
3) I have absolutely no guarantee that the closest triangle to my point will be the good triangle.
I feel like I'm missing something, and that I could pre-compute, something to help me spot the good "area" before starting the brute force part.
The algorithm is already parallelised (using OpenMP): I'm calling the below function on a parallel for.
bool Triangle2D::is_in_triangle(Vector2d Point) {
double denominator = ((Tri2D(1, 1) - Tri2D(2, 1))*(Tri2D(0, 0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Tri2D(0, 1) - Tri2D(2, 1)));
// Computing the first coefficient
double a = ((Tri2D(1, 1) - Tri2D(2, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(2, 0) - Tri2D(1, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (a < 0 || a>1) {
return(false);
}
// Computing the second coefficient
double b = ((Tri2D(2, 1) - Tri2D(0, 1))*(Point(0) - Tri2D(2, 0)) + (Tri2D(0, 0) - Tri2D(2, 0))*(Point(1) - Tri2D(2, 1))) / denominator;
if (b < 0 || b>1) {
return(false);
}
// Computing the third coefficient
double c = 1 - a - b;
if (c < 0 || c>1) {
return(false);
}
return(true);
}
Next step is probably to look at GPU parallelisation, but I need to make sure that the idea behind the code is good enough.
For now it takes approximately 2min30
for 75k
triangles and 10k
points, but this isn't fast enough.
Edit:Triangle2D
uses Eigen matrix to store coordinates
All long-bearded HPC-professionals, kindly do permit a bit scholastically elaborated approach here, which may ( in my honest opinion ) become interesting, if not enjoyed by, for our Community Members, that feel themselves a bit more junior than you professionally feel yourselves and who may get interested in a bit deeper look into performance-motivated code-design, performance tweaking and other parallel-code risks and benefits, that you know on your own hard-core HPC-computing experience so well and so deep. Thank you.
a) ALGORITHM (as-is) can get ~2X speedup a low-hanging fruit+more surprises yet2come
b) OTHER ALGORITHM may get ~40~80X speedup boost due2geometry
c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE
GOAL : A target runtime for 10k
points in 300k
triangles would be 2-3min on a computer with an i5 8500, 3GHz, 6core, NVIDIA Quadro P400 (have to try GPU computing, not even sure if it's worth it)
While this may seem as a long journey, the problem is nice and deserves a bit closer look, so please, bear with me during the flow of utmost-performance motivated thinking.
a) ALGORITHM (as-is) ANALYSIS:
The as-is use of Barycentric coordinate system is a nice trick, the straight implementation of which costs a bit more than about (20 FLOPs + 16 MEM/REG-I/O-ops) in best case and slightly above (30 FLOPs + 30 MEM/REG-I/O-ops).
There are a few polishing touches, that may reduce these execution costs down right by avoiding some expensive and even not important operations from ever taking place:
--------------------------------------------------------------------------------------
double denominator = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB
) * ( Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
double a = ( ( Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB
) * ( Point(0) //------------------------ + OP-A.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-9.SUB
) + ( Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB
) * ( Point(1) //-------------- + OP-D.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-C.MUL
)
) / denominator; //-------------------------- 1x REG + OP-F.DIV //----------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever------------[3]
if (a < 0 || a>1) { // ----------------------------------------------------------------------------- a < 0 ~~ ( sign( a ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ a > 1 ~~ || a > denominator
}
// Computing the second coefficient
double b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //--------- 2x MEM + OP-16.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-17.SUB + OP-18.MUL
+ ( Tri2D( 0, 0 ) - Tri2D( 2, 0 ) ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD
* ( Point(1) - Tri2D( 2, 1 ) ) //--------- 2x MEM + OP-20.SUB + OP-21.MUL
) / denominator; //-------------------------- 1x REG + OP-23.DIV //---------- MAY DEFER THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever -----------[3]
if (b < 0 || b>1) { // ----------------------------------------------------------------------------- b < 0 ~~ ( sign( b ) * sign( denominator ) ) < 0
return(false); // ------------------------------------------------------------------------------ b > 1 ~~ || b > denominator
}
// Computing the third coefficient
double c = 1 - a - b; // ------------------------------------------- 2x REG + OP-24.SUB + OP-25.SUB
// 1 -(a - b)/denominator; //--------------------------------------------------------------- MAY DEFER THE MOST EXPENSIVE DIVISION EXECUTED BUT HERE, IFF INDEED FIRST NEEDED <---HERE <----------[3]
repeated re-evaluations, that appear in the original may get explicitly crafted out by manual assign/re-use, yet, there is a chance a good optimising compiler may get these evicted within a use of -O3
enforced optimisation flag.
do not hesitate to profile even this lowest-hanging fruit, to polish the most expensive parts.
//------------------------------------------------------------------
double Tri2D_11_sub_21 = ( Tri2D( 1, 1 )
- Tri2D( 2, 1 )
), //====================================================== 2x MEM + OP-a.SUB (REG re-used 2x)
Tri2D_20_sub_10 = ( Tri2D( 2, 0 )
- Tri2D( 1, 0 )
), //====================================================== 2x MEM + OP-b.SUB (REG re-used 2x)
Tri2D_00_sub_20 = ( Tri2D( 0, 0 )
- Tri2D( 2, 0 )
); //====================================================== 2x MEM + OP-c.SUB (REG re-used 1~2x)
//-----------------------
double denominator = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) // -------------------------- 2x MEM + OP-1.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-d.MUL
) * ( /*
Tri2D( 0, 0 ) //--------------------- + OP-3.MUL
- Tri2D( 2, 0 ) //--------------------- 2x MEM + OP-2.SUB (avoided by re-use) */
Tri2D_00_sub_20 //===================================== 1x REG + OP-f.ADD
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-7.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-4.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-e.MUL
) * ( Tri2D( 0, 1 ) //--------- + OP-6.MUL
- Tri2D( 2, 1 ) //--------- 2x MEM + OP-5.SUB
)
);
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the first coefficient ------------------------------------------------------------------------------------------------------
//
double enumer_of_a = ( ( /*
Tri2D( 1, 1 )
- Tri2D( 2, 1 ) //-------------------------- 2x MEM + OP-8.SUB (avoided by re-use) */
Tri2D_11_sub_21 //=========================================== 1x REG + OP-g.MUL
) * ( Point(0) //------------------------------------------ + OP-i.MUL
- Tri2D( 2, 0 ) //--------------------------------------- 2x MEM + OP-h.SUB
) + ( /*
Tri2D( 2, 0 ) //--------------- + OP-E.ADD
- Tri2D( 1, 0 ) //--------------- 2x MEM + OP-B.SUB (avoided by re-use) */
Tri2D_20_sub_10 //=============================== 1x REG + OP-l.ADD
) * ( Point(1) //-------------------------------- + OP-k.MUL
- Tri2D( 2, 1 ) //--------------------------- 2x MEM + OP-j.MUL
)
);/*denominator; *///------------------------ 1x REG + OP-F.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A CHEAPEST EVER J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_a > denominator // in a > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_a * denominator < 0 ) return(false); // in a < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the second coefficient
//
double enumer_of_b = ( ( Tri2D( 2, 1 ) - Tri2D( 0, 1 ) ) //---------------------------------------- 2x MEM + OP-m.SUB
* ( Point(0) - Tri2D( 2, 0 ) ) //---------------------------------------- 2x MEM + OP-n.SUB + OP-o.MUL
+ ( /*
Tri2D( 0, 0 ) - Tri2D( 2, 0 ) //--------- 2x MEM + OP-19.SUB + OP-22.ADD (avoided by re-use) */
Tri2D_00_sub_20 //======================================================== 1x REG + OP-p.ADD
)
* ( Point(1) - Tri2D( 2, 1 ) ) //---------------------------------------- 2x MEM + OP-r.SUB + OP-q.MUL
);/*denominator; *///------------------------ 1x REG + OP-23.DIV (avoided by DEFERRAL THE MOST EXPENSIVE DIVISION UNTIL a third coeff is indeed first needed, if ever-----------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR A 2nd CHEAPEST J.I.T./non-MISRA-C-RET-->
//
if ( enumer_of_b > denominator // in b > 1, THE SIZE DECIDES, the a / denominator > 1, iff enumer_of_a > denominator a rather expensive .FDIV is avoided at all
|| enumer_of_b * denominator < 0 ) return(false); // in b < 0, THE SIGN DECIDES, not the VALUE matters, so will use a cheaper .FMUL, instead of a rather expensive .FDIV ~~ ( sign( a ) * sign( denominator ) ) < 0
//\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
//
// Computing the third coefficient
//
double c = 1 - ( ( enumer_of_a
- enumer_of_b
)
/ denominator
); // --------------------------------------------- 3x REG + OP-s.SUB + OP-t.FDIC + OP-u.SUB <----THE MOST EXPENSIVE .FDIV BUT HERE, IFF INDEED FIRST NEEDED <---HERE <------------[3]
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~J.I.T./non-MISRA-C-RET-->
// TEST CONDITIONS FOR PRE-FINAL RET J.I.T./non-MISRA-C-RET-->
//
if ( c < 0 || c > 1 ) return( false );
return( true ); //~~~~~~~~~~~~~~~~~~~ "the-last-resort" RET-->
b) OTHER APPROACH TO THE ALGORITHM:
Let's review another approach, that seems both faster and a way cheaper, due to less data and lest instructions and is also promising to have high density, once smart use of vectorised AVX-2 or better AVX-512 vector-instructions will get harnessed per each core : ANIMATED, fully-INTERACTIVE explanation is altogether with analytical problem re-formulation here.
A triple-test of a point-to-line distance (each line being ax + by + c = 0
) comes at cheap cost of ~ 2 FMA3
are enough + a sign test (and even better if vector-4-in-1-compact AVX-2 / 8-in-1 AVX-512 VFMADD
-s)
While there may be possible to "fast" decide on whether a point makes sense to be tested against a respective triangle, possibly with statically pre-"framing" each triangle in polar(R,Theta)
coordinate space by a static, precomputed tuple of ( R_min, R_max, Theta_min, Theta_max )
and "fast" discriminate each point, if it does not fit inside such a polar-segment. Yet the costs of doing this (data-access pattern costs + the costs of these "fast" instructions) will grow beyond any potentially "saved" instruction-paths, that need not take place (if a point is found outside a polar-segment). Having achieved a range of performance of 24 points-in-triangle tests per a cost of ~9 CPU-instructions per 6 CPU-cores @ 3.0+ GHz, the polar-segment "pre-testing" will suddenly become prohibitively expensive, not speaking about a second order negative effect ( introduced by a way worse cache-hit / cache-miss ratios, given more data is to be stored and read into a "fast"-pre-test ~ +16B per a triangle "framing" polar-segment tuple +8B per point ( with the worst impact on the cache hit/miss-ratio ).
That is clearly not good direction for any further move as the performance will get decreased, not increased, which is our global strategy here.
Intel i5 8500 CPU can use but AVX-2, so the most compact use of 8-triangles per CPU-clock tick per core is left for achieving even 2X higher performance, if needed.
TRIPLE-"point-above-line"-TEST per POINT per TRIANGLE:
---------------------------------------------------------------------------------
PRE-COMPUTE STATIC per TRIANGLE CONSTANTS:
LINE_1: C0__L1, C1__L1, C2__L1, bool_L1DistanceMustBePOSITIVE
LINE_2: C0__L2, C1__L2, C2__L2, bool_L2DistanceMustBePOSITIVE
LINE_3: C0__L3, C1__L3, C2__L3, bool_L3DistanceMustBePOSITIVE
TEST per TRIANGLE per POINT (Px,Py) - best executed in an AVX-vectorised fashion
LINE_1_______________________________________________________________
C0__L1 ( == pre-calc'd CONST = c1 / sqrt( a1^2 + b1^2 ) ) //
Px * C1__L1 ( == pre-calc'd CONST = a1 / sqrt( a1^2 + b1^2 ) ) // OP-1.FMA REG-Px,C1__L1,C0__L1
Py * C2__L1 ( == pre-calc'd CONST = b1 / sqrt( a1^2 + b1^2 ) ) // OP-2.FMA REG-Py,C2__L1, +
.GT./.LT. 0 // OP-3.SIG
LINE_2_______________________________________________________________
C0__L2 ( == pre-calc'd CONST = c2 / sqrt( a2^2 + b2^2 ) ) //
Px * C1__L2 ( == pre-calc'd CONST = a2 / sqrt( a2^2 + b2^2 ) ) // OP-4.FMA REG-Px,C1__L2,C0__L2
Py * C2__L2 ( == pre-calc'd CONST = b2 / sqrt( a2^2 + b2^2 ) ) // OP-5.FMA REG-Py,C2__L2, +
.GT./.LT. 0 // OP-6.SIG
LINE_3_______________________________________________________________
C0__L3 ( == pre-calc'd CONST = c3 / sqrt( a3^2 + b3^2 ) ) //
Px * C1__L3 ( == pre-calc'd CONST = a3 / sqrt( a3^2 + b3^2 ) ) // OP-7.FMA REG-Px,C1__L3,C0__L3
Py * C2__L3 ( == pre-calc'd CONST = b3 / sqrt( a3^2 + b3^2 ) ) // OP-8.FMA REG-Py,C2__L3, +
.GT./.LT. 0 // OP-9.SIG
( using AVX-2 intrinsics or inlined assembler will deliver highest performance due to COMPACT 4-in-1 VFMADDs )
____________________________________________
| __________________________________________triangle A: C1__L1
| | ________________________________________triangle B: C1__L1
| | | ______________________________________triangle C: C1__L1
| | | | ____________________________________triandle D: C1__L1
| | | | |
| | | | | ______________________________
| | | | | | ____________________________triangle A: Px
| | | | | | | __________________________triangle B: Px
| | | | | | | | ________________________triangle C: Px
| | | | | | | | | ______________________triandle D: Px
| | | | | | | | | |
|1|2|3|4| | | | | |
| | | | | |1|2|3|4| ________________
| | | | | | | | | | | ______________triangle A: C0__L1
| | | | | | | | | | | | ____________triangle B: C0__L1
| | | | | | | | | | | | | __________triangle C: C0__L1
| | | | | | | | | | | | | | ________triandle D: C0__L1
| | | | | | | | | | | | | | |
|1|2|3|4| | | | | | | | | | |
| | | | | |1|2|3|4| | | | | |
| | | | | | | | | | |1|2|3|4|
(__m256d) __builtin_ia32_vfmaddpd256 ( (__v4df )__A, (__v4df )__B, (__v4df )__C ) ~/ per CPU-core @ 3.0 GHz ( for actual uops durations check Agner or Intel CPU documentation )
can
perform 4-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
24-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
using AVX-512 empowered CPU, can use 8-in-1 VFMADDs
could
perform 8-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU-core @ 3.0 GHz
48-( point-in-triangle ) PARALLEL-test in just about ~ 9 ASSEMBLY INSTRUCTIONS / per CPU
c) TIPS FOR THE BEST PARALLEL CODE + ULTIMATE PERFORMANCE:
Step -1: GPU / CUDA costs v/s benefits
If your PhD-mentor, professor, boss or Project Manager indeed insists you to develop a GPU-computing c++/CUDA code solution for this very problem, the best next step is to ask for getting any better suited GPU-card for such a task, than the one you posted.
Your indicated card, the Q400 GPU has just 2 SMX (48KB L1-cache each) is not quite fit for a seriously meant parallel CUDA-computing, having about 30X-less processing devices to actually do any SIMD-thread-block computing, not mentioning its small memory and tiny L1 / L2 on-SMX-caches. So, after all CUDA-related design and optimisation efforts, there will be not more, but a single (!) pair of the GPU-SMX warp32
-wide thread-block executions in the SIMD-threads, so there is no big circus to be expected from this GP107-based device and there are 30+X-better equipped devices for some indeed highly performing parallel-processing available COTS )
Step 0: pre-compute and pre-arrange data ( MAXIMISE cache-line COHERENCY ):
Here, it makes sense to optimise the best macroscoping looping of the algorithm, so that you "benefit" most from cache hits ( i.e. best re-use of "fast" data, pre-fetched already )
So, may test, if it is cache-wise faster to rather distribute the work over N-concurrent-workers, who process disjunct working-blocks of triangles, where they each loop over the smallest memory-area --- all the points ( ~ 10k * 2 * 4B ~ 80 kB ), before moving into next triangle in the working-block. Ensuring the row-first array alignment into memory-area is vital ( FORTRAN guys can tell a lot about the costs/benefits of just this trick for an HPC-fast compact/vectorised vector-algebra and matrix-processing )
The benefit?
The cached coefficients will be re-used ~ 10k times (at a cost of ~ 0.5~1.0 [ns]
, instead of re-fetching costs of + 100 ~ 300 [ns]
if these would have to be re-read by a RAM memory access). Difference, about ~ 200X ~ 600X
is worth an effort to best align the workflow sub-ordinated to data-access-patterns and cache-line resources.
Results are atomic - any point will belong to one and only one (non-overlapping) triangles.
This simplifies the a priori non-colliding and relatively sparse writes into a resulting vector, into where any detected point-inside-triangle can freely report an index of such triangle found.
Using this vector of results for potentially avoiding any re-testing on points, where there has already been performed a match ( index is non-negative ) isnot very efficient, as the costs of re-reading such indication and re-arranging the compact alignment of 4-in-1 or 8-in-1 point-in-triangle tests, would become adversely expensive to potential "savings" of not re-testing an already mapped point.
So the 10k
points over a block of 300k
triangles may yield a workflow of:
a split of some 300k / 6
cores~ 50k
triangles / 1 core ~ 50k * 10 k
point-in-triangle tests per core~ 500M
point-in-triangle tests per core @ 3.0+ GHz
~ 125M AVX-2
vector-4-in-1-compact test-executions per-core~ 125M
tests x 10 uops
-instructions @ 3.0 GHz
...
which is 1.25G uops @ 3.0+ GHz
... second(s)? Yes, there is a pretty strong motivation to go down here, towards this ultimate performance and direct the further work this way.
SO, HERE WE ARE:
The principally achievable HPC-target is in a range of a few seconds for 300+k Triangles / 10+k Points on just a 6-core AVX-2 i5 8500 @ 3.0+ GHz
Worth an effort, isn't it?