Rocket core CPU: global variables initialized to 0 hold wrong values at run time

While benchmarking my Rocket core CPU, I found that the CoreMark benchmark fails. After some debugging, I narrowed the issue down to global variables that are initialized to 0: CoreMark initializes some volatile variables to 0x0, but at run time they hold wrong values instead.

Environment information for reproducing the issue:

Compiling env: Ubuntu 22.04, gcc 11.4.0
Compiler: riscv32-unknown-elf-gcc (g2ee5e430018) 12.2.0 (remark: from rocket-tools repo 04a559f)
Compiler options: see Makefile
Test program: test.c
Environment and linker files: see common/
UART printing programs: see uart/
Rocketchip repo: 0586532
Rocketchip config: RV32imac, 1 big core, without debug, with 2 GPIOs, with 1 UART, simulated SRAM 256MB
Simulation env: CentOS 7.9, Synopsys VCS R-2020.12-SP1

My simulation flow:

Compile the program into a .hex file:

$ make clean
$ make bmarks=test
$ riscv32-unknown-elf-objcopy -O binary test.riscv test.bin
$ hexdump -v -e '1/4 "%08x\n"' test.bin > test.hex
$ chmod 755 test.hex

Load the .hex program into the SRAM in testbench.v.
Compile the simulation executable.
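
To make the .hex format concrete: the hexdump step prints each group of 4 bytes from test.bin as one 32-bit value in host byte order, one value per line. A minimal C sketch of that formatting, assuming a little-endian host (the helper name format_word_le is hypothetical, chosen only for illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format 4 consecutive bytes of test.bin the way
 * hexdump's "%08x" does on a little-endian host (byte 0 is the least
 * significant byte). 'out' needs room for 8 hex digits plus the NUL. */
static void format_word_le(const uint8_t bytes[4], char out[9])
{
    uint32_t word = (uint32_t)bytes[0]
                  | ((uint32_t)bytes[1] << 8)
                  | ((uint32_t)bytes[2] << 16)
                  | ((uint32_t)bytes[3] << 24);
    snprintf(out, 9, "%08lx", (unsigned long)word);
}
```

Each line of test.hex is therefore one SRAM word, and a word that is absent from test.bin never reaches the SRAM model at all.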

Expected behaviour:
With PERFORMANCE_RUN=1, all globally defined seed values and all locally defined num values should be initialized to their predefined values:

seed1_volatile=0x0
seed2_volatile=0x0
seed3_volatile=0x66
seed4_volatile=0xa
seed5_volatile=0
num1=0x0
num2=0x0
num3=0x0
num4=0x0
num5=0x0

Actual behaviour:
The debug output shows that the seed values globally initialized to 0 come out as seemingly random numbers, while the other values are as expected:

seed1_volatile=0xdd232eba
seed2_volatile=0xedd684db
seed3_volatile=0x66
seed4_volatile=0xa
seed5_volatile=0xf870bef0
num1=0x0
num2=0x0
num3=0x0
num4=0x0
num5=0x0

My debug attempts:

UART printing may be faulty: verified by printing multiple data formats (%d, %lu, %x, %s, %f); the output is as expected.
Predefined parameters may not be applied: printed the PERFORMANCE_RUN flag value; it matches the setting.
Rocket core may not be running correctly: tested with Dhrystone; the output is as expected.
Program may be placed at the wrong memory locations: the assembly code shows that all instruction and data locations are as expected (within 0x80000000 - 0x90000000).

What I found out:

In the assembly code from line 1886 onwards: the global variables are read by loading data from memory with lw a1,offset(a1) or lw a1,offset(gp), while the local variables seem to be set by loading the value as an immediate with li a1,0.
In the .hex file from line 4195 onwards: the non-zero initial values are stored in the SRAM image, but the zero values are not.

At this point, I do not know how to get a global variable initialized to 0. If anyone can point out a possible cause or a direction for further debugging, it would be much appreciated. Thank you in advance for any hint!

Solution

[self posting answer]
After two weeks of debugging, I finally found the cause (which I had missed due to my lack of knowledge of compilation and assembly): the original crt.S and test.ld provided in the common folder of the rocket-tools repo do NOT clear the .bss section, because the developers expect the bootloader to do that initialization. Global variables initialized to 0 are placed in .bss/.sbss, which is not stored in the binary image, so without a clearing step they read back whatever the SRAM happens to contain. Other users have actually posted about this issue before: post 1, post 2.
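
To illustrate why only the zero-initialized seeds are affected, here is a minimal sketch modeled on the seed values listed in the question. The int32_t type and the helper zero_seeds_are_clear are assumptions for illustration, not copied from CoreMark. By default GCC places zero-initialized globals in .bss/.sbss, which typically occupy no bytes in the image produced by objcopy -O binary, while non-zero initializers are stored in .data/.sdata:

```c
#include <stdint.h>

/* Zero-initialized globals: by default GCC emits these into .bss/.sbss,
 * so the startup code (or a loader) must zero them before main() runs. */
volatile int32_t seed1_volatile = 0x0;
volatile int32_t seed2_volatile = 0x0;
volatile int32_t seed5_volatile = 0;

/* Non-zero initializers: stored in .data/.sdata as part of the image. */
volatile int32_t seed3_volatile = 0x66;
volatile int32_t seed4_volatile = 0xa;

/* Hypothetical check: returns 1 if the zero-initialized seeds read back
 * as 0, and 0 if they hold stale memory contents instead. */
int zero_seeds_are_clear(void)
{
    return seed1_volatile == 0 && seed2_volatile == 0 && seed5_volatile == 0;
}
```

On a hosted system the program loader zeroes .bss, so the check returns 1. On a bare-metal setup like this one, nothing does that unless the startup code clears the section itself.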

So my solution is to add the clearing code to crt.S, before the trap vector initialization:

  # init bss section
  la a0, __sbss  #load the starting address of bss to a0
  la a1, __ebss  #load the ending address of bss to a1
  bgeu a0, a1, done_bss  #do not clear if a0>=a1, i.e. bss section is empty

clear_bss:
  sw x0, (a0)            #store 0 to the address in a0 (bss)
  addi a0, a0, 4         #increment by 4
  bltu a0, a1, clear_bss #if bss end not reached, continue to store 0

done_bss:
  # <original crt.S continues>
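
For readers more comfortable with C, the loop above is roughly equivalent to the following sketch. The function name clear_words is hypothetical. On the target, the arguments would be the addresses of __sbss and __ebss, and the actual clearing is best left in crt.S, since it runs before the C environment is set up.

```c
#include <stdint.h>

/* Hypothetical C rendering of the clear_bss loop: store 0 to every 32-bit
 * word in [start, end). Does nothing when the range is empty
 * (start >= end), matching the bgeu guard in the assembly. */
void clear_words(volatile uint32_t *start, volatile uint32_t *end)
{
    while (start < end) {
        *start = 0;   /* sw x0, (a0)    */
        start++;      /* addi a0, a0, 4 */
    }
}
```

Like the assembly, this sketch writes whole words, so it assumes that both boundaries are 4-byte aligned.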

In addition, define the symbols __sbss and __ebss in test.ld by modifying the bss sections:

  .sbss : {
    __sbss = .;  /*starting address of bss*/
    *(.sbss .sbss.* .gnu.linkonce.sb.*)
    *(.scommon)
  }
  .bss : {
    *(.bss)
    __ebss = .; /*ending address of bss*/
  }

Now the simulation output of my test program is correct:

seed1_volatile=0x0
seed2_volatile=0x0
seed3_volatile=0x66
seed4_volatile=0xa
seed5_volatile=0
num1=0x0
num2=0x0
num3=0x0
num4=0x0
num5=0x0