verilog, rounding, fixed-point

Rounding down the absolute value of signed fixed point numbers in Verilog


Context

Hello, I am working on building an R2MDC FFT engine in Verilog.

Currently, the engine outputs are exhibiting rounding errors (it is failing some provided testcases by a small margin), and I suspect it is due to how I am rounding the multiplication results of the Butterfly Unit.

In more detail...

Step 1 - Convert Q<16.16> to Q<24.8> (i.e. discarding extra fractional bits), which is where rounding comes into effect.

Referring to the following resources...

I note the following two points:

  1. Performing an arithmetic right shift (i.e. truncation) on a positive signed fixed-point number always rounds down the result's absolute value.

    • Thus, I can just truncate positive signed fixed-point numbers to achieve my desired rounding.
  2. Performing an arithmetic right shift (i.e. truncation) on a negative signed fixed-point number always rounds up the result's absolute value.

    • Thus, adding 1'b1 to the truncated negative signed fixed-point number will 'reverse' this effect, as it causes the absolute value to decrease. This achieves my desired rounding.

Step 2 - Convert Q<24.8> to Q<8.8> (i.e. discarding extra integer bits)

We just need to take the lower 16 bits of Q<24.8>; no rounding is needed.

Here is the (relevant) implementation in the Butterfly Unit...

module bf (
    input signed [15:0] A,
    input signed [15:0] B,
    output reg signed [15:0] C
);

localparam NUM_FRACTIONAL_BITS = 8;


// 1. Perform sign extension of inputs
wire signed [31:0] A_extended;
wire signed [31:0] B_extended;

assign A_extended = {{16{A[15]}}, A};
assign B_extended = {{16{B[15]}}, B};

// 2. Perform multiplication, and then collect the results from the LSB
wire signed [63:0] mult_result_extended;
wire signed [31:0] mult_result;

assign mult_result_extended = A_extended * B_extended;
assign mult_result = mult_result_extended[31:0];

// 3. Convert Q<16.16> to Q<8.8>
reg signed [31:0] mult_result_shifted;

always @(*) begin
    mult_result_shifted = mult_result >>> NUM_FRACTIONAL_BITS;    // Q<16.16> to Q<24.8> truncation

    // Check signedness of result (i.e. Check MSB of 2s complement representation)
    if (mult_result[31]) begin
        // Result is negative, we add 1 to round down its absolute value
        C = mult_result_shifted + 2'sb01;    // Round, then convert Q<24.8> to Q<8.8>
        // Edgecase of overflow is checked and handled (code not shown)
    end
    else begin
        // Result is positive, truncation has same effect as rounding down its absolute value
        C = mult_result_shifted;    // Round, then convert Q<24.8> to Q<8.8>
    end
end

endmodule

Help required

The above implementation does not work, as I am still observing rounding errors.

I would like to verify if the rounding logic presented above is correct. If not, what could I do to rectify it?

Any help would be greatly appreciated, thank you.

What I have tried

I have looked at other posts that might be related, such as https://stackoverflow.com/questions/73630956/truncated-signed-fixed-point-conversion-from-q2-28-to-q2-14-in-verilog .

However, they all concern rounding towards the nearest integer, which is not what I need.

I have also developed a testbench just for the Multiplication unit and tried to verify it.

The output results are in accordance with my desired rounding behavior. I suspect this might be due to my test inputs being too simplistic, but I'm not sure what else to do besides randomly inputting test vectors and hoping for an error.

Reproducible example

// Multiplication Unit

module mult (
    input signed [15:0] A,
    input signed [15:0] B,
    output reg signed [15:0] C
);

localparam NUM_FRACTIONAL_BITS = 8;


// 1. Perform sign extension of inputs
wire signed [31:0] A_extended;
wire signed [31:0] B_extended;

assign A_extended = {{16{A[15]}}, A};
assign B_extended = {{16{B[15]}}, B};

// 2. Perform multiplication, and then collect the results from the LSB
wire signed [63:0] mult_result_extended;
wire signed [31:0] mult_result;

assign mult_result_extended = A_extended * B_extended;
assign mult_result = mult_result_extended[31:0];

// 3. Convert Q<16.16> to Q<8.8>
reg signed [31:0] mult_result_shifted;

always @(*) begin
    mult_result_shifted = mult_result >>> NUM_FRACTIONAL_BITS;    // Q<16.16> to Q<24.8> truncation

    // Check signedness of result (i.e. Check MSB of 2s complement representation)
    if (mult_result[31]) begin
        // Result is negative, we add 1 to round down its absolute value
        C = mult_result_shifted + 2'sb01;    // Round, then convert Q<24.8> to Q<8.8>

        // Check for overflow
        if (mult_result_shifted + 2'sb01 == 32'b0)
            C = 16'hFFFF;
    end
    else begin
        // Result is positive, truncation has same effect as rounding down its absolute value
        C = mult_result_shifted;    // Round, then convert Q<24.8> to Q<8.8>
    end
end

endmodule
// Testbench

module tb_mult(

    );

    reg [15:0] tb_A;
    reg [15:0] tb_B;
    wire [15:0] tb_C;

    mult DUT (
        .A(tb_A),
        .B(tb_B),
        .C(tb_C)
    );

    // All inputs/outputs are in Q<8.8> format, using 2s complement
    // Q<8.8> can represent -128 to +127.99609375, with resolution 0.00390625

    initial begin
        tb_A = 16'b1111_1010_0111_1111;    // Q<8.8> = -5.50390625, 2s complement decimal value = -1409
        tb_B = 16'b0000_1101_0001_1010;    // Q<8.8> = +13.1015625, 2s complement decimal value = 3354
        //tb_C expected to be 16'b1011_0111_1110_0100; Equivalently Q<16.0> = -18460 or Q<8.8> = -72.109375

        /* Explanation
            1. tb_A * tb_B = 32'b1111_1111_1011_0111_1110_0011_1110_0110; lets call this result D.
            2. Note that D is in Q<16.16> format with the fixed-point representation -72.10977172851562.
            3. Naively, we could convert it directly by taking the middle 16 bits, i.e. D[23:8] = 16'b1011_0111_1110_0011, which has Q<8.8> = -72.11328125
                - However, we want the final result's absolute value to be rounded down. i.e We want 16'b1011_0111_1110_0100, which has Q<8.8> = -72.109375


            4. Suppose that D = 32'b1111_1111_1011_0111_1110_0011_0110_0110. (Notice that D[7] is different from the previous scenario)
                - This has fixed point representation -72.11172485351562.
            5. We still want the final result to be 16'b1011_0111_1110_0100 or -72.109375, as we want the final Q<8.8> result's absolute value to ALWAYS be rounded down.
        */
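
        // Let the combinational logic settle, then print the result for inspection
        #1 $display("tb_C = %b (signed decimal %0d)", tb_C, $signed(tb_C));
    end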

endmodule

Solution

  • I did not try the code, but I suspect the issue is in making a "rounding" adjustment when it is not needed.

    Always adjusting after truncating a negative number does not make sense. If all of the truncated bits are zero, then the number is already "perfectly represented" after truncation and does not need adjustment. Only when the truncated bits are non-zero should the adjustment be done.
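
A minimal sketch of that idea follows. Like the suggestion above, it has not been run against the FFT test cases. The module name mult_rtz, its internal signal names, and the small testbench are made up for illustration and are not from the question. It assumes the same Q<8.8> inputs and output as the question, and it leaves out overflow handling: the result simply wraps to the lower 16 bits, as described in Step 2 of the question.

```verilog
// Hypothetical sketch: Q<8.8> multiply that rounds the result toward zero.
module mult_rtz (
    input  signed [15:0] A,
    input  signed [15:0] B,
    output signed [15:0] C
);

localparam NUM_FRACTIONAL_BITS = 8;

// Full signed product in Q<16.16>. All operands are signed, so A and B
// are sign-extended to 32 bits before the multiplication.
wire signed [31:0] product;
assign product = A * B;

// The arithmetic right shift floors toward negative infinity.
wire signed [31:0] floored;
assign floored = product >>> NUM_FRACTIONAL_BITS;

// The result is inexact only if at least one discarded bit is non-zero.
wire inexact;
assign inexact = |product[NUM_FRACTIONAL_BITS-1:0];

// Add one LSB only when the product is negative AND inexact.
wire signed [31:0] rounded;
assign rounded = (product[31] && inexact) ? floored + 32'sd1 : floored;

// Keep the lower 16 bits (Q<24.8> to Q<8.8>); overflow is not handled here.
assign C = rounded[15:0];

endmodule

// Hypothetical self-checking testbench for the two interesting cases.
module tb_mult_rtz;
    reg  signed [15:0] a, b;
    wire signed [15:0] c;

    mult_rtz dut (.A(a), .B(b), .C(c));

    initial begin
        // Inexact negative product from the question: expect 16'hB7E4
        a = 16'hFA7F; b = 16'h0D1A; #1;
        if (c !== 16'hB7E4) $display("FAIL (inexact case): got %h", c);

        // Exact negative product, -1.0 * 1.0: expect 16'hFF00, no adjustment
        a = 16'hFF00; b = 16'h0100; #1;
        if (c !== 16'hFF00) $display("FAIL (exact case): got %h", c);

        $display("done");
        $finish;
    end
endmodule
```

The second test vector is the kind of input that separates the two approaches: all eight discarded bits are zero, so unconditionally adding 1 to a negative result would give 16'hFF01 (-0.99609375) instead of the exact 16'hFF00 (-1.0).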