
How can I schedule multiple inputs into an instantiated SystemVerilog module?


I am trying to build a module that takes a 32-bit input (with a parameterised width) and outputs the cube of the input. The naive approach would be the following:

module cuber #(
    BW = 32
) (
    input logic [BW-1:0] in0,
    output logic [BW*2-1:0] cubed_op
);

logic [BW*2-1:0] inter_l;

fast_mul #(.BW(BW))
fm_inst_1 (
  .input1(in0),
  .input2(in0),
  .product(inter_l)
);

fast_mul #(.BW(BW))
fm_inst_2 (
  .input1(inter_l),
  .input2(in0),
  .product(cubed_op)
);

endmodule

But I want to know if I can reuse the fm_inst_1 multiplier to perform both multiplications.

I am trying to use a FIFO to schedule the inputs, but I can't wrap my head around how the multiplier would perform these multiplications. Then I tried writing the first multiplication's output back to the intermediate register, hoping the multiplier would reuse it, but I am sure there is a better way to do this.


Solution

  • "I am sure there is a better way to do this."

    There is no need for multiple instances of a structural multiply in the context you have presented.
    Behavioral modeling works in simulation and synthesis workflows.

    A synthesis tool will typically infer cascaded FPGA DSP blocks to perform the two multiplies. The exact number of blocks depends on the operand width and on the size of the device's DSP multipliers.

    "I am trying to use a FIFO to schedule the inputs"

    A FIFO has nothing to do with the multiply. Use a FIFO when you need the buffering/delay/accumulation behavior of a FIFO. Use a register to store intermediate values if needed.

    If you prefer structural design, create two instances of the multiplier and you have the basic structure for the cuber. Most FPGAs have several DSP blocks available. Instantiating two modules will not re-use one physical resource.

    If you want to re-use a single multiplier, you will need a state machine to act as a controller, some registers, and a multiplexer at the multiplier input. The multiplexer selects whether the second multiplier input comes from the module input or from the registered intermediate value. The design would also require some sort of flag or strobe telling it when a new value arrives.

    You have at least one error in what you posted.
    Cubing N bits produces 3N bits, not 2N.

    If you prefer to perform the muls using structural modeling, the output of the first is 32 bits times 32 bits which is 64 bits. The output of the second is 32 bits times 64 bits which is 96 bits.

    If the problem definition needed a parameterized power, then a structural model might be better because you could use a generate loop to create as many multiplier stages as the parameter requires (raising to the Nth power takes N - 1 multiplies).
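
    As a rough illustration of the generate-loop idea, here is a minimal sketch. The module name power, the parameter POW, and the stage array are hypothetical names chosen for this example, and the sketch uses the * operator rather than your fast_mul, whose interface for unequal operand widths I don't know.

    ```systemverilog
    // Hypothetical sketch: raise in0 to the power POW with POW-1 chained multiplies.
    module power #(
        parameter int BW  = 32,  // input width
        parameter int POW = 3    // exponent, assumed to be >= 1
    ) (
        input  logic [BW-1:0]     in0,
        output logic [BW*POW-1:0] result
    );

      // stage[i] holds in0**(i+1). Every entry is declared at the full
      // output width for simplicity, so no intermediate product can overflow.
      logic [BW*POW-1:0] stage [POW];

      assign stage[0] = in0;  // zero-extended to BW*POW bits

      // One multiply per loop iteration: POW-1 multiplies in total.
      for (genvar i = 1; i < POW; i++) begin : g_mul
        assign stage[i] = stage[i-1] * in0;
      end

      assign result = stage[POW-1];

    endmodule
    ```

    With POW = 3 this reduces to the same two multiplies as the cuber. Declaring every stage at the full width is the simple choice, not necessarily the cheapest one; a tuned version would size stage i at BW*(i+1) bits.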

    If the design had a high-speed clock, then a structural model might be better because the output of each stage could be registered using flip-flops for timing closure/performance.
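
    A minimal sketch of that registered arrangement, assuming a free-running clock and no reset or valid signalling. The names cuber_pipe, sq_q and in0_q are made up for this example.

    ```systemverilog
    // Hypothetical sketch: two-stage pipelined cuber, one register per multiply.
    module cuber_pipe #(
        parameter int BW = 32
    ) (
        input  logic            clk,
        input  logic [BW-1:0]   in0,
        output logic [BW*3-1:0] cubed_op  // valid two clocks after in0
    );

      logic [BW*2-1:0] sq_q;   // stage 1: registered square
      logic [BW-1:0]   in0_q;  // in0 delayed one clock to line up with sq_q

      always_ff @(posedge clk) begin
        // stage 1: square the input and keep a delayed copy of it
        sq_q  <= in0 * in0;
        in0_q <= in0;
        // stage 2: multiply the registered square by the delayed input
        cubed_op <= sq_q * in0_q;
      end

    endmodule
    ```

    This version accepts a new input every clock, at the cost of two multipliers and two clocks of latency.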

    The best model depends on the context.

    Here is a behavioral model of the unsigned cuber, which I like better than what you did in the context you presented because it's more concise.

    module cuber #(
        BW = 32
    ) (
      input  logic [BW       -  1:0] in0,
      output logic [(BW * 3) -  1:0] cubed_op
    );
    
      always_comb
        cubed_op = in0 * in0 * in0;
    
    endmodule
    

    A small sim of this produces

    time = 0, in0 =        2, cubed_op =               8, log2 cubed =   3
    time = 1, in0 =        4, cubed_op =              64, log2 cubed =   6
    time = 2, in0 =        8, cubed_op =             512, log2 cubed =   9
    time = 3, in0 =       16, cubed_op =            4096, log2 cubed =  12
    time = 4, in0 =      256, cubed_op =        16777216, log2 cubed =  24
    time = 5, in0 =     2048, cubed_op =      8589934592, log2 cubed =  33
    time = 6, in0 = 4294967295, cubed_op = 79228162458924105385300197375, log2 cubed =  96
    

    I printed the log base 2 to display the number of bits used for the cube.

    The last vector, at time 6, is the max value of the input (2**32 - 1), so that you can see it works for big numbers and takes 3N bits.
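
    If you want to try this yourself, a self-checking testbench along the following lines should do. This is a sketch, not the testbench that produced the output above; the module name cuber_tb and the task check are made up for this example, and it assumes the behavioral cuber with the 3N-bit output.

    ```systemverilog
    // Hypothetical testbench sketch for the behavioral cuber (BW = 32).
    module cuber_tb;
      localparam int BW = 32;

      logic [BW-1:0]   in0;
      logic [BW*3-1:0] cubed_op;

      cuber #(.BW(BW)) dut (.in0(in0), .cubed_op(cubed_op));

      // Apply one vector, let the combinational logic settle, then compare.
      task automatic check(input logic [BW-1:0]   value,
                           input logic [BW*3-1:0] expected);
        in0 = value;
        #1;
        $display("time = %0t, in0 = %0d, cubed_op = %0d", $time, in0, cubed_op);
        if (cubed_op !== expected)
          $error("in0 = %0d: expected %0d, got %0d", value, expected, cubed_op);
      endtask

      initial begin
        check(32'd2,    96'd8);
        check(32'd4,    96'd64);
        check(32'd2048, 96'd8589934592);
        // (2**32 - 1)**3 = 2**96 - 3*2**64 + 3*2**32 - 1
        check(32'hFFFF_FFFF, 96'hFFFF_FFFD_0000_0002_FFFF_FFFF);
        $finish;
      end
    endmodule
    ```

    The max-value vector is the interesting one: it only passes when the output really is 3N bits wide.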


    Here is the state machine version, which uses only a single multiplier. The multiplier computes the square in the first clock and the cube in the second. The design accepts data at a 50% duty cycle, that is, one new input every other clock.

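    module cuber
       (input logic [7:0] data_in,   // must be held stable for both clocks
        input logic val_in,
        input logic clk,
        input logic rst,
        output logic val_out,
        // NOTE: 16 bits is enough for the small test vectors below, but the
        // cube of a full 8-bit input needs 24 bits
        output logic [15:0] mul_out);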
      
      // locals
      typedef enum logic [1:0] {SQUARE=2'b00,CUBE=2'b01} T_SM_ENUM;
      //  
      logic [31:0] mul_in_32; 
      logic [63:0] mul_in_64;
      logic [95:0] mul_out_96;
      logic        val_del1,val_del2;
      logic        mux_sel_out64_nout96;
      T_SM_ENUM    current_state, next_state;
      
      // rename
      assign mul_in_32 = data_in;
      
      // mux
      always_comb
        if(mux_sel_out64_nout96)
          mul_in_64 = data_in;
        else
          mul_in_64 = mul_out_96[63:0];
      
      // SM Combinational proc
      always_comb begin :SM
          // outputs
          mux_sel_out64_nout96 = 0;
          // NS
          next_state = current_state;
          
          case(current_state)
            SQUARE: begin
              // outputs
              if(val_in)
                mux_sel_out64_nout96 = 1;
              // NS
              if(val_in)
                next_state = CUBE;
            end
            
            CUBE: begin
              // outputs: keep the default mux select (0) so the registered
              // square is fed back into the multiplier
              // NS: the cube takes one clock, so accept new data next
              next_state = SQUARE;
            end
            
            default:
              next_state = SQUARE;
          endcase
        end :SM
      
      // sync proc general use
      always_ff @(posedge clk)
        if(rst) begin
          current_state <= SQUARE;  // nonblocking, like every other assignment here
          val_del1 <= 0;
          val_del2 <= 0;
        end
        else begin
          current_state <= next_state;
          val_del1 <= val_in;
          val_del2 <= val_del1;
        end
          
      // sync proc for single mul
      always_ff @(posedge clk)
        if(rst) 
           mul_out_96 <= '0;
         else       
           mul_out_96 <= mul_in_32 * mul_in_64;
      
      // rename
      assign mul_out = mul_out_96;
      assign val_out = val_del2;
      
    endmodule
    

    And a '$monitor' of the results:

    # time =   0, reset = 1,val_in = 0, data_in =  x, mul_out =     x, val_out = x
    # time =   5, reset = 0,val_in = 0, data_in =  x, mul_out =     0, val_out = 0
    # time =  15, reset = 0,val_in = 0, data_in =  x, mul_out =     x, val_out = 0
    # time =  35, reset = 0,val_in = 1, data_in =  4, mul_out =     x, val_out = 0
    # time =  45, reset = 0,val_in = 0, data_in =  4, mul_out =    16, val_out = 0
    # time =  55, reset = 0,val_in = 1, data_in = 16, mul_out =    64, val_out = 1
    # time =  65, reset = 0,val_in = 0, data_in = 16, mul_out =   256, val_out = 0
    # time =  75, reset = 0,val_in = 0, data_in = 16, mul_out =  4096, val_out = 1
    # time =  85, reset = 0,val_in = 0, data_in = 16, mul_out =     0, val_out = 0
    

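    The test drives 4 as the first vector, and the DUT produces 64 two clocks later; it then drives 16, and the DUT produces 4096 two clocks later. The final mul_out = 0 at time 85 is the feedback product 16 * 4096 = 65536 wrapping around in the 16-bit output.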