I am trying to implement an image processing algorithm for a gamut mapping filter in hardware using Vivado HLS. I created a synthesizable version from Halide code, but it is taking far too long: for a 256x512 image it takes around 135 seconds, which shouldn't be the case. I have already applied some optimization techniques, such as pipelining the innermost loop. With pipelining I set a target initiation interval of II=1 for the innermost loop, but the achieved II is 6. From the warnings thrown by the compiler, I understand this is caused by the accesses to ctrl_pts and weights, and from the tutorials I have seen, array partitioning and array reshaping should help with faster access to the weights. I have shared the code I used for synthesis below:
//header
include "hls_stream.h"
#include <ap_fixed.h>
//#include <ap_int.h>
#include "ap_int.h"
typedef ap_ufixed<24,24> bit_24;
typedef ap_fixed<11,8> fix;
typedef unsigned char uc;
typedef ap_uint<24> stream_width;
//typedef hls::stream<uc> Stream_t;
typedef hls::stream<stream_width> Stream_t;
struct pixel_f
{
float r;
float g;
float b;
};
struct pixel_8
{
uc r;
uc g;
uc b;
};
void gamut_transform(int rows,int cols,Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts);
//core
//include the header
#include "gamut_header.h"
#include "hls_math.h"
void gamut_transform(int rows,int cols, Stream_t& in,Stream_t& out, float ctrl_pts[3702][3],float weights[3702][3],float coefs[4][3],float num_ctrl_pts)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE fifo port=out
#pragma HLS dataflow
pixel_8 input;
pixel_8 new_pix;
bit_24 temp_in,temp_out;
pixel_f buff_1,buff_2,buff_3,buff_4,buff_5;
float dist;
for (int i = 0; i < 256; i++)
{
for (int j = 0; i < 512; i++)
{
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
for(int idx =0; idx < 3702; idx++)
{
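#pragma HLS PIPELINE II=1 // pipeline the innermost loop with a target of II=1, as described in the text above (the pragma was not in the pasted code, so its exact placement here is assumed)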
buff_2.r = buff_1.r - ctrl_pts[idx][0];
buff_2.g = buff_1.g - ctrl_pts[idx][1];
buff_2.b = buff_1.b - ctrl_pts[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights[idx][0] * dist);
buff_3.g = buff_2.g + (weights[idx][1] * dist);
buff_3.b = buff_2.b + (weights[idx][2] * dist);
}
buff_4.r = buff_3.r + coefs[0][0] + buff_1.r* coefs[1][0] + buff_1.g * coefs[2][0] + buff_1.b* coefs[3][0];
buff_4.g = buff_3.g + coefs[0][1] + buff_1.r* coefs[1][1] + buff_1.g * coefs[2][1] + buff_1.b* coefs[3][1];
buff_4.b = buff_3.b + coefs[0][2] + buff_1.r* coefs[1][2] + buff_1.g * coefs[2][2] + buff_1.b* coefs[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
}
Even with the achieved II=6, the time taken is around 6 seconds; the target is to bring the runtime down to milliseconds. I tried pipelining the second innermost loop instead, but then I run out of resources on my board because the innermost loop gets fully unrolled. I am using a Zynq UltraScale+ board, which has a fair amount of resources. Any suggestions on optimizing the code would be highly appreciated.
Also, can anyone suggest what type of interface would be best suited for ctrl_pts, weights and coefs? For reading the image I understood that a streaming interface helps, and for reading small values like the number of rows and columns, AXI-Lite is preferred. Is there a type of interface I can use for these variables that goes hand in hand with array partitioning and array reshaping?
Any suggestions will be highly appreciated. Thanks in advance.
Edit: I understand that a fixed-point representation can bring the latency down further, but my first goal is to get the best result with the floating-point representation and then analyze the performance with a fixed-point representation.
There are some steps you can take to optimize your design, but bear in mind that if you really need a floating-point square root operation, it will most likely have a huge latency penalty (unless properly pipelined, of course).
Your code might have a typo in the second inner loop: the index should be j, right?
First off: ctrl_pts is read multiple times from main memory (I assume). Since it's reused 256x512 times, it would be better to store it in a local buffer on the FPGA (like a BRAM, but it can be inferred), like so:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
}
}
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
// ...
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
// ...
The same reasoning goes for coefs and weights: just store them in local buffers before running the rest of the code.
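For reference, the local copies could be declared at the top of gamut_transform roughly like this (a minimal sketch; the optional RESOURCE pragma forcing dual-port BRAMs is my suggestion, not something the original code requires):
float ctrl_pts_local[3702][3];
float weights_local[3702][3];
float coefs_local[4][3];
// Optionally force the two large arrays into dual-port BRAM, so two reads per cycle are possible:
#pragma HLS RESOURCE variable=ctrl_pts_local core=RAM_2P_BRAM
#pragma HLS RESOURCE variable=weights_local core=RAM_2P_BRAM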
To access the arguments you can use a master AXI4 interface (m_axi) and configure it accordingly. Once the algorithm works on the local buffers, HLS should be able to partition them automatically. If not, you can place ARRAY_PARTITION pragmas (e.g. complete dim=0) to force it, as in the sketch below.
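As a concrete sketch, the pragmas could look like the following; the bundle names, the s_axilite ports for the scalars, and the choice of partitioning only the small dimension of the large local arrays (instead of complete dim=0, which would create thousands of registers) are my assumptions and need adapting to your system:
#pragma HLS INTERFACE axis      port=in
#pragma HLS INTERFACE axis      port=out
#pragma HLS INTERFACE m_axi     port=ctrl_pts offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=weights  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=coefs    offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=rows         bundle=control
#pragma HLS INTERFACE s_axilite port=cols         bundle=control
#pragma HLS INTERFACE s_axilite port=num_ctrl_pts bundle=control
#pragma HLS INTERFACE s_axilite port=return       bundle=control
// Partition only the small (size-3) dimension of the local copies, so that the
// r/g/b accesses of one iteration don't compete for the same BRAM port:
#pragma HLS ARRAY_PARTITION variable=ctrl_pts_local complete dim=2
#pragma HLS ARRAY_PARTITION variable=weights_local  complete dim=2
#pragma HLS ARRAY_PARTITION variable=coefs_local    complete dim=0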
Because of the way your algorithm works, another thing you could try is to break the main 256x512 loops down into three smaller processes running in dataflow, and therefore in parallel (plus the setup ones, if you count those).
The whole code will look something like this (I hope it renders correctly):
[Compute buff_1]-->[FIFO1]-->[compute buff_3]-->[FIFO2a]-->[compute buff_4 and buff_5 + stream out]
L-------------------------------->[FIFO2b]----^
One tricky bit is that buff_1 has to be streamed to both of the downstream processes.
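The code sketch that follows leaves the FIFO declarations out; inside the top-level function they might look like this (my sketch, assuming a struct of floats such as pixel_f can be carried by an hls::stream, which works for internal FIFOs):
hls::stream<pixel_f> fifo_1;  // buff_1 from Process_1 to Process_2
hls::stream<pixel_f> fifo_2a; // buff_3 from Process_2 to Process_3
hls::stream<pixel_f> fifo_2b; // buff_1 bypassing Process_2, going straight to Process_3
#pragma HLS DATAFLOW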
I haven't tried this code, so there might be compilation errors along the way, but the whole accelerator would look something like this:
for(int i =0; i < 3702; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
ctrl_pts_local[i][j] = ctrl_pts[i][j];
weights_local[i][j] = weights[i][j];
}
}
for(int i =0; i < 4; i++) {
for (int j = 0; j < 3; ++j) {
#pragma HLS PIPELINE II=1
coefs_local[i][j] = coefs[i][j];
}
}
Process_1:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
temp_in = in.read();
input.r = (temp_in & 0xFF0000)>>16;
input.g = (temp_in & 0x00FF00)>>8;
input.b = (temp_in & 0x0000FF);
buff_1.r = ((float)input.r)/256.0;
buff_1.g = ((float)input.g)/256.0;
buff_1.b = ((float)input.b)/256.0;
fifo_1.write(buff_1); // <--- WRITE TO FIFOs
fifo_2b.write(buff_1);
}
}
Process_2:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
for(int idx =0; idx < 3702; idx++) {
#pragma HLS LOOP_FLATTEN // <-- It shouldn't be necessary, since the if statements already help
#pragma HLS PIPELINE II=1 // <-- The PIPELINE directive can go here
if (idx == 0) {
buff_1 = fifo_1.read(); // <--- READ FROM FIFO
}
buff_2.r = buff_1.r - ctrl_pts_local[idx][0];
buff_2.g = buff_1.g - ctrl_pts_local[idx][1];
buff_2.b = buff_1.b - ctrl_pts_local[idx][2];
dist = sqrt((buff_2.r*buff_2.r)+(buff_2.g*buff_2.g)+(buff_2.b*buff_2.b));
buff_3.r = buff_2.r + (weights_local[idx][0] * dist);
buff_3.g = buff_2.g + (weights_local[idx][1] * dist);
buff_3.b = buff_2.b + (weights_local[idx][2] * dist);
if (idx == 3702 - 1) {
fifo_2a.write(buff_3); // <-- WRITE TO FIFO
}
}
}
}
Process_3:
for (int i = 0; i < 256; i++) {
for (int j = 0; j < 512; j++) {
#pragma HLS PIPELINE II=1
buff_3 = fifo_2a.read(); // <--- READ FROM FIFO
buff_1 = fifo_2b.read(); // <--- READ FROM FIFO
buff_4.r = buff_3.r + coefs_local[0][0] + buff_1.r* coefs_local[1][0] + buff_1.g * coefs_local[2][0] + buff_1.b* coefs_local[3][0];
buff_4.g = buff_3.g + coefs_local[0][1] + buff_1.r* coefs_local[1][1] + buff_1.g * coefs_local[2][1] + buff_1.b* coefs_local[3][1];
buff_4.b = buff_3.b + coefs_local[0][2] + buff_1.r* coefs_local[1][2] + buff_1.g * coefs_local[2][2] + buff_1.b* coefs_local[3][2];
buff_5.r = fmin(fmax((float)buff_4.r, 0.0), 255.0);
buff_5.g = fmin(fmax((float)buff_4.g, 0.0), 255.0);
buff_5.b = fmin(fmax((float)buff_4.b, 0.0), 255.0);
new_pix.r = (uc)buff_4.r;
new_pix.g = (uc)buff_4.g;
new_pix.b = (uc)buff_4.b;
temp_out = ((uc)new_pix.r << 16 | (uc)new_pix.g << 8 | (uc)new_pix.b);
out<<temp_out;
}
}
Be extremely careful in sizing the depth of the FIFOs, since Process_2 (the one with the sqrt operation) might have slower data consumption and production rates than the others! Also, FIFO 2b needs to account for that delay. If the rates and depths don't match, there can be deadlocks. Make sure to have a meaningful testbench and to co-simulate your design.
(A FIFO's depth can be changed with the STREAM pragma, e.g. #pragma HLS STREAM variable=fifo_1 depth=N, as sketched below.)
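For instance, with the structure above, something along these lines could be a starting point (the numbers are guesses on my part and must be validated in C/RTL co-simulation):
#pragma HLS STREAM variable=fifo_1  depth=4   // drained slowly by Process_2, so a small depth is enough
#pragma HLS STREAM variable=fifo_2a depth=4   // written only once every 3702 iterations
#pragma HLS STREAM variable=fifo_2b depth=16  // must hold roughly fifo_1's depth plus the pixels in flight inside Process_2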
There might be further, smaller optimizations to perform along the way, but I would start with the ones above, since they are the heaviest. Just bear in mind that floating-point processing is not optimal on FPGAs (as you noted) and is usually avoided.
EDIT: I tried the code with the modifications above and achieved II=1 with decent resource usage.
Since the II is now 1, the cycle count is bounded by Process_2, which performs 256x512x3702 iterations, and I'm essentially there: 485,228,587 cycles against an ideal of 485,228,544. One crazy idea I now want to propose to you is to split the innermost loop of Process_2 into two parallel branches (or even more than two), each feeding its own FIFO. Process_1 would supply the two branches, while an additional process/loop alternately reads the 256x512 elements from the two FIFOs and supplies them in the correct order to Process_3. In this way the total number of cycles should roughly halve, since Process_2 is the slowest process in the dataflow (and improving it improves the whole design). One possible drawback of this approach is a higher area/resource usage on the FPGA. A rough skeleton of the idea is sketched below.
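To make that idea a bit more concrete, the skeleton could look like the following; the FIFO names (fifo_1a, fifo_1b, fifo_3a, fifo_3b) are hypothetical and the per-pixel bodies are elided:
Process_1_split: // distributes pixels round-robin over the two branch FIFOs
for (int p = 0; p < 256 * 512; p++) {
#pragma HLS PIPELINE II=1
    // ... read from 'in' and compute buff_1 exactly as in Process_1 ...
    if (p % 2 == 0) fifo_1a.write(buff_1);
    else            fifo_1b.write(buff_1);
    fifo_2b.write(buff_1); // bypass path to Process_3 is unchanged
}
// Two identical copies of Process_2, each handling half of the pixels and each
// with its own copy of ctrl_pts/weights so they don't compete for BRAM ports:
// Process_2a reads fifo_1a and writes fifo_3a; Process_2b reads fifo_1b and
// writes fifo_3b (their bodies are identical to the original Process_2).
Merge: // restores the original pixel order for Process_3
for (int p = 0; p < 256 * 512; p++) {
#pragma HLS PIPELINE II=1
    pixel_f merged = (p % 2 == 0) ? fifo_3a.read() : fifo_3b.read();
    fifo_2a.write(merged);
}
// Process_3 stays exactly as before, reading from fifo_2a and fifo_2b.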
Good luck.