Timing closure problems in FIFO

I have a FIFO implementation in Verilog that is based on this article: CummingsSNUG2002SJ_FIFO

Whilst using this FIFO as a CDC FIFO, with the read side being clocked at 100Mhz and the write side at 8Mhz I ran into a timing issue. Please note that the issue is not reproducible in simulation whether it is behavioral, post-synthesis or post-implementation. It can only be see in the timing reports given by the Vivado tool. I also have a timing constraint for each clock in the synthesis flow.

Here are snippets of code that are of interest for the question:

module CDC_FIFO #(
    parameter DSIZE = 28,
    parameter ASIZE = 10
  )(
    output [DSIZE-1:0] rdata,
    output wfull,
    output rempty,
    output r_almost_empty,
    input [DSIZE-1:0] wdata,
    output [ASIZE:0] wr_diff,
    input winc,
    input wclk,
    input wrst_n,
    input rinc,
    input rclk,
    input rrst_n
  );

  wire [ASIZE-1:0] waddr, raddr;
  wire [ASIZE:0] wptr, rptr, wq2_rptr, rq2_wptr;

  localparam AWAIT_RINC = 0;
  localparam AWAIT_W_TO_R = 1;
  
  reg [ASIZE:0] rptr_save = 0;
  reg [1:0] state = AWAIT_RINC;
  reg r_to_w = 0;
  wire w_to_r;

  always @(posedge rclk) begin
    case (state)
        AWAIT_RINC: begin
            if (rinc && !w_to_r) begin
                state <= AWAIT_W_TO_R;
                r_to_w <= 1;
                rptr_save <= rptr;
            end
            else begin
                r_to_w <= 0;
            end
        end
        AWAIT_W_TO_R: begin
            if (w_to_r) begin
                state <= AWAIT_RINC;
                r_to_w <= 0;
            end
        end
        default: ;
    endcase
  end

  sync_r2w #(.ADDRSIZE(ASIZE)) sync_r2w
           (
            .r_to_w(r_to_w),
            .w_to_r(w_to_r),
             .wq2_rptr(wq2_rptr),
             .rptr(rptr_save),
             .wclk(wclk),
             .wrst_n(wrst_n)
           );
 // Other instantiations of other modules
endmodule;

module sync_r2w #(
    parameter ADDRSIZE = 4

  )(
    output reg w_to_r = 0,
    input r_to_w,
    output [ADDRSIZE:0] wq2_rptr,
    input [ADDRSIZE:0] rptr,
    input wclk, wrst_n
  );

  reg [ADDRSIZE:0] wq1_rptr = 0;
  reg [ADDRSIZE:0] wq2_rptr_reg = 0;

  assign wq2_rptr = wq2_rptr_reg;

  always @(posedge wclk)
  begin
    if (!wrst_n)
    begin
      { wq2_rptr_reg,wq1_rptr} <= 0;
    end
    else
    begin
      if (w_to_r || r_to_w) begin
        { wq2_rptr_reg,wq1_rptr} <= {wq1_rptr,rptr};        
      end
    end
  end

  always @(posedge wclk) begin
    if (!wrst_n) begin
      w_to_r <= 0;
    end
    else begin
      w_to_r <= r_to_w;
    end
  end
endmodule

The issue arises when the read pointer has to be passed to the write domain through the sync_r2w module. So I have added a state machine to CDC_FIFO, to initiate a sort of handshake between the two, so to be sure that the rptr_save holds its value long enough for the write side to process it. However it seems that this has not fixed my issue.

Here is a photo of the critical path that is currently failing (one of the bits of rptr_save)

What exactly is the problem in my logic that still does not fix the timing? It seems to me that the current FSM basically waits for the write side to assert that it has indeed taken the rptr_save value and than "switches off" the handshake condition to prevent further update.

Thanks

Solution

Timing constraints, are used by the Timing Analyzer tool, not simulation. A back-annotated simulation works like hardware using setup and hold times to create the waves/simulator output. RTL simulation uses the Verilog event scheduler simulation model. You will not see any affect of timing constraints in simulation.

The Vivado timing analyzer does not know about the behavior of your logic. It knows delays thru Xilinx primitives like slices, LUTs, wires, buffers. clk2out delay of registers & RAM outputs (not a comprehensive list, you get the idea). Delays from one domain to another don't make a lot of sense from an analysis pov. The timing analyzer does not know you put a nice CDC handling FIFO in that path with two clocks and two constraints.

By default, the timing analyzer times between every path to every other path in all clock domains that have a constraint. If you have paths that cross domains the tool will show that path as failing. The point of this is to show the user 'Hey you have crossed clock domains' If you have taken care to design logic that handles CDC (such as a CDC handling FIYO from Cummings) then the reported path can be ignored because you have handled the CDC. The native purpose of static timing analysis makes sense within the same clock domain only.

The normal use of the timing analyzer is to add up delays in all the paths from register to register in a clock domain (done for all clock domains). If a path's delay is shorter that the clock period, then the path passes otherwise it fails.

What to do A few options

Right click on the failing path(s) in the Timing Analyzer tool and do 'Set false path', save the false path to the same constraints file that the clock constraints are in. Run the tools again and the path will not be flagged as an error. More paths may pop up, do the same with those if you are confident that the corresponding RTL logic handles the CDC.
Create a max delay constraint. This maintains some relationship between the 2 domains, so that the logic for both sides of the fifo ends up near to the other in the design. Recommendations here async-fifos-how-can-we-effectively-make-and-constrain-them and here What constraints are needed
A more general approach is to set the clock groups as asynchronous How when why to set clock groups asynchronous. If this link is not enough, then search on 'set clock groups asynchronous' for more.

This Xilinx page recommends setting all clock groups asynchronous set clock groups async.
Here is the section in context

set_clock_groups -asynchronous -group {xilinx_jtag_tck}
set_clock_groups -asynchronous -group {clkA PLL1_c0 PLL1_c1 } 
set_clock_groups -asynchronous -group {clkB PLL2_c0 PLL2_c1 } 
set_clock_groups -asynchronous -group {dsp_clk PLL3_c0 PLL3_c1 PLL3_c2}

    In general, it is recommended to add all clocks to the group.
 Exceptions are:

1) DDR and GXB designs, where there are a large number of clocks added to the design that the user does not know about. 

95% of these clocks have paths only to domains that are relevant or are cut in the .xdc files that get created, so they do not need to be taken into consideration.

2) Virtual clocks are recommended for all I/O interfaces (except for source-synch outputs). 

Virtual clocks are not added here because the only paths they connect to are I/O's that are explicitly designated and they tend to only have real paths. 

If they are added, it will not cause any issues.

Refer to (UG903) Vivado Design Suite User Guide: Using Constraints for more information.

Xilinx refers to the process of setting up constraints for a new design as 'baselining' There is explanation here Baselining-the-Design

Alternate solution Use the Xilinx IP catalog to generate a CDC FIFO, the posted issue will not occur because Xilinx's IP contains false paths or max_delay paths where needed. IP Catalog also generates a simulation model to use for RTL simulations. Not necessarily recommending this; its something else you could do. Designing your own vs using Xilinx IP has its own trade offs.