I am trying to use the PAPI library to count cache misses. A cache-hit performance counter is not available on my hardware, so I am trying to infer cache hits from the absence of cache misses. I have tried a few things. The first version of my code is this:
int numEvents = 2;
long long values[2];
int events[2] = {PAPI_L1_DCM, PAPI_L2_TCM};
if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
}
_mm_mfence();
if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
    exit(1);
}
miss1 = values[0];
_mm_mfence();
for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}
_mm_mfence();
if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}
miss2 = values[0];
printf("before flush miss_1 %lli, miss_2 %lli \n", miss1, miss2);
The problem is that this code is supposed to produce cache hits, so the L1 miss count should be extremely low, but I get unexpectedly high values for miss_2. With an array size of 200, miss_2 is nearly 100. With that many misses, I cannot conclude that the accesses really hit in the cache.
I also tried to rewrite it like this:
if (PAPI_start_counters(events, numEvents) != PAPI_OK)
    printf("PAPI error: %d\n", 1);
for (int i = 0; i < arr_size; i++) {
    array[i].value = array[i].value + 9; // (int) sum
}
if (PAPI_stop_counters(values, numEvents) != PAPI_OK)
    printf("PAPI error: 2\n");
printf("before flush miss %lli\n", values[0]);
But this gives an even worse result: the miss count is more than 200. Is there anything I am not doing right? It was supposed to give a more precise result, but it does much worse. Or am I missing something?
I have also tried without the fences; I am fairly sure that at least they don't do any harm. I would really appreciate any suggestions.
The disadvantage of PAPI_read_counters is its overhead and poor performance, but right now I don't care about performance; I want to determine cache hits correctly.
I was also thinking of using RDPMC, but I have not found an example that uses it without writing inline asm. Is that really the only way to use rdpmc? Is there no already-defined function that I could call instead?
EDIT: adding the compiled code (disassembly) for the version that uses PAPI_read_counters:
./prog6: file format elf64-x86-64
Disassembly of section .init:
00000000000009c0 <_init>:
9c0: 48 83 ec 08 sub $0x8,%rsp
9c4: 48 8b 05 1d 16 20 00 mov 0x20161d(%rip),%rax # 201fe8 <__gmon_start__>
9cb: 48 85 c0 test %rax,%rax
9ce: 74 02 je 9d2 <_init+0x12>
9d0: ff d0 callq *%rax
9d2: 48 83 c4 08 add $0x8,%rsp
9d6: c3 retq
Disassembly of section .plt:
00000000000009e0 <.plt>:
9e0: ff 35 6a 15 20 00 pushq 0x20156a(%rip) # 201f50 <_GLOBAL_OFFSET_TABLE_+0x8>
9e6: ff 25 6c 15 20 00 jmpq *0x20156c(%rip) # 201f58 <_GLOBAL_OFFSET_TABLE_+0x10>
9ec: 0f 1f 40 00 nopl 0x0(%rax)
00000000000009f0 <puts@plt>:
9f0: ff 25 6a 15 20 00 jmpq *0x20156a(%rip) # 201f60 <puts@GLIBC_2.2.5>
9f6: 68 00 00 00 00 pushq $0x0
9fb: e9 e0 ff ff ff jmpq 9e0 <.plt>
0000000000000a00 <clock_gettime@plt>:
a00: ff 25 62 15 20 00 jmpq *0x201562(%rip) # 201f68 <clock_gettime@GLIBC_2.17>
a06: 68 01 00 00 00 pushq $0x1
a0b: e9 d0 ff ff ff jmpq 9e0 <.plt>
0000000000000a10 <getpid@plt>:
a10: ff 25 5a 15 20 00 jmpq *0x20155a(%rip) # 201f70 <getpid@GLIBC_2.2.5>
a16: 68 02 00 00 00 pushq $0x2
a1b: e9 c0 ff ff ff jmpq 9e0 <.plt>
0000000000000a20 <__stack_chk_fail@plt>:
a20: ff 25 52 15 20 00 jmpq *0x201552(%rip) # 201f78 <__stack_chk_fail@GLIBC_2.4>
a26: 68 03 00 00 00 pushq $0x3
a2b: e9 b0 ff ff ff jmpq 9e0 <.plt>
0000000000000a30 <PAPI_read_counters@plt>:
a30: ff 25 4a 15 20 00 jmpq *0x20154a(%rip) # 201f80 <PAPI_read_counters>
a36: 68 04 00 00 00 pushq $0x4
a3b: e9 a0 ff ff ff jmpq 9e0 <.plt>
0000000000000a40 <sched_setaffinity@plt>:
a40: ff 25 42 15 20 00 jmpq *0x201542(%rip) # 201f88 <sched_setaffinity@GLIBC_2.3.4>
a46: 68 05 00 00 00 pushq $0x5
a4b: e9 90 ff ff ff jmpq 9e0 <.plt>
0000000000000a50 <PAPI_start_counters@plt>:
a50: ff 25 3a 15 20 00 jmpq *0x20153a(%rip) # 201f90 <PAPI_start_counters>
a56: 68 06 00 00 00 pushq $0x6
a5b: e9 80 ff ff ff jmpq 9e0 <.plt>
0000000000000a60 <PAPI_stop_counters@plt>:
a60: ff 25 32 15 20 00 jmpq *0x201532(%rip) # 201f98 <PAPI_stop_counters>
a66: 68 07 00 00 00 pushq $0x7
a6b: e9 70 ff ff ff jmpq 9e0 <.plt>
0000000000000a70 <malloc@plt>:
a70: ff 25 2a 15 20 00 jmpq *0x20152a(%rip) # 201fa0 <malloc@GLIBC_2.2.5>
a76: 68 08 00 00 00 pushq $0x8
a7b: e9 60 ff ff ff jmpq 9e0 <.plt>
0000000000000a80 <PAPI_strerror@plt>:
a80: ff 25 22 15 20 00 jmpq *0x201522(%rip) # 201fa8 <PAPI_strerror>
a86: 68 09 00 00 00 pushq $0x9
a8b: e9 50 ff ff ff jmpq 9e0 <.plt>
0000000000000a90 <__printf_chk@plt>:
a90: ff 25 1a 15 20 00 jmpq *0x20151a(%rip) # 201fb0 <__printf_chk@GLIBC_2.3.4>
a96: 68 0a 00 00 00 pushq $0xa
a9b: e9 40 ff ff ff jmpq 9e0 <.plt>
0000000000000aa0 <getrusage@plt>:
aa0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201fb8 <getrusage@GLIBC_2.2.5>
aa6: 68 0b 00 00 00 pushq $0xb
aab: e9 30 ff ff ff jmpq 9e0 <.plt>
0000000000000ab0 <exit@plt>:
ab0: ff 25 0a 15 20 00 jmpq *0x20150a(%rip) # 201fc0 <exit@GLIBC_2.2.5>
ab6: 68 0c 00 00 00 pushq $0xc
abb: e9 20 ff ff ff jmpq 9e0 <.plt>
0000000000000ac0 <fwrite@plt>:
ac0: ff 25 02 15 20 00 jmpq *0x201502(%rip) # 201fc8 <fwrite@GLIBC_2.2.5>
ac6: 68 0d 00 00 00 pushq $0xd
acb: e9 10 ff ff ff jmpq 9e0 <.plt>
0000000000000ad0 <__fprintf_chk@plt>:
ad0: ff 25 fa 14 20 00 jmpq *0x2014fa(%rip) # 201fd0 <__fprintf_chk@GLIBC_2.3.4>
ad6: 68 0e 00 00 00 pushq $0xe
adb: e9 00 ff ff ff jmpq 9e0 <.plt>
Disassembly of section .plt.got:
0000000000000ae0 <__cxa_finalize@plt>:
ae0: ff 25 12 15 20 00 jmpq *0x201512(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
ae6: 66 90 xchg %ax,%ax
Disassembly of section .text:
0000000000000af0 <main>:
af0: 41 57 push %r15
af2: b9 0f 00 00 00 mov $0xf,%ecx
af7: 41 56 push %r14
af9: 41 55 push %r13
afb: 41 54 push %r12
afd: 55 push %rbp
afe: 53 push %rbx
aff: 48 81 ec 78 01 00 00 sub $0x178,%rsp
b06: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
b0d: 00 00
b0f: 48 89 84 24 68 01 00 mov %rax,0x168(%rsp)
b16: 00
b17: 31 c0 xor %eax,%eax
b19: 48 8d 9c 24 e0 00 00 lea 0xe0(%rsp),%rbx
b20: 00
b21: 48 b8 00 00 00 80 07 movabs $0x8000000780000000,%rax
b28: 00 00 80
b2b: 48 c7 84 24 e0 00 00 movq $0x1,0xe0(%rsp)
b32: 00 01 00 00 00
b37: 48 8d 53 08 lea 0x8(%rbx),%rdx
b3b: 48 89 84 24 c8 00 00 mov %rax,0xc8(%rsp)
b42: 00
b43: 31 c0 xor %eax,%eax
b45: 48 89 d7 mov %rdx,%rdi
b48: f3 48 ab rep stos %rax,%es:(%rdi)
b4b: e8 c0 fe ff ff callq a10 <getpid@plt>
b50: 48 89 da mov %rbx,%rdx
b53: be 80 00 00 00 mov $0x80,%esi
b58: 89 c7 mov %eax,%edi
b5a: e8 e1 fe ff ff callq a40 <sched_setaffinity@plt>
b5f: 85 c0 test %eax,%eax
b61: 0f 85 17 03 00 00 jne e7e <main+0x38e>
b67: 0f ae f0 mfence
b6a: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
b6f: bf 02 00 00 00 mov $0x2,%edi
b74: 0f ae f0 mfence
b77: e8 84 fe ff ff callq a00 <clock_gettime@plt>
b7c: 0f 31 rdtsc
b7e: bf 00 fa 00 00 mov $0xfa00,%edi
b83: 0f ae f0 mfence
b86: 48 c1 e2 20 shl $0x20,%rdx
b8a: 49 89 c6 mov %rax,%r14
b8d: 49 09 d6 or %rdx,%r14
b90: e8 db fe ff ff callq a70 <malloc@plt>
b95: 48 8d bc 24 c8 00 00 lea 0xc8(%rsp),%rdi
b9c: 00
b9d: be 02 00 00 00 mov $0x2,%esi
ba2: 49 89 c4 mov %rax,%r12
ba5: e8 a6 fe ff ff callq a50 <PAPI_start_counters@plt>
baa: 85 c0 test %eax,%eax
bac: 0f 85 88 02 00 00 jne e3a <main+0x34a>
bb2: 4d 89 e7 mov %r12,%r15
bb5: 49 8d 84 24 00 fa 00 lea 0xfa00(%r12),%rax
bbc: 00
bbd: 4c 89 e5 mov %r12,%rbp
bc0: c7 45 00 01 00 00 00 movl $0x1,0x0(%rbp)
bc7: 48 83 c5 40 add $0x40,%rbp
bcb: 48 39 e8 cmp %rbp,%rax
bce: 75 f0 jne bc0 <main+0xd0>
bd0: 4c 8d ac 24 d0 00 00 lea 0xd0(%rsp),%r13
bd7: 00
bd8: be 02 00 00 00 mov $0x2,%esi
bdd: 4c 89 ef mov %r13,%rdi
be0: e8 4b fe ff ff callq a30 <PAPI_read_counters@plt>
be5: 85 c0 test %eax,%eax
be7: 0f 85 b8 02 00 00 jne ea5 <main+0x3b5>
bed: 48 8b 84 24 d0 00 00 mov 0xd0(%rsp),%rax
bf4: 00
bf5: 4c 89 e3 mov %r12,%rbx
bf8: 48 89 44 24 08 mov %rax,0x8(%rsp)
bfd: 0f 1f 00 nopl (%rax)
c00: 83 03 09 addl $0x9,(%rbx)
c03: 48 83 c3 40 add $0x40,%rbx
c07: 48 39 dd cmp %rbx,%rbp
c0a: 75 f4 jne c00 <main+0x110>
c0c: 31 d2 xor %edx,%edx
c0e: 48 8d 35 88 04 00 00 lea 0x488(%rip),%rsi # 109d <_IO_stdin_used+0x2d>
c15: bf 01 00 00 00 mov $0x1,%edi
c1a: 31 c0 xor %eax,%eax
c1c: e8 6f fe ff ff callq a90 <__printf_chk@plt>
c21: be 02 00 00 00 mov $0x2,%esi
c26: 4c 89 ef mov %r13,%rdi
c29: e8 02 fe ff ff callq a30 <PAPI_read_counters@plt>
c2e: 85 c0 test %eax,%eax
c30: 0f 85 6f 02 00 00 jne ea5 <main+0x3b5>
c36: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
c3d: 00
c3e: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
c43: 48 8d 35 e6 04 00 00 lea 0x4e6(%rip),%rsi # 1130 <_IO_stdin_used+0xc0>
c4a: 31 c0 xor %eax,%eax
c4c: bf 01 00 00 00 mov $0x1,%edi
c51: e8 3a fe ff ff callq a90 <__printf_chk@plt>
c56: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
c5d: 00 00 00
c60: 41 0f ae 3c 24 clflush (%r12)
c65: 49 83 c4 40 add $0x40,%r12
c69: 49 39 dc cmp %rbx,%r12
c6c: 75 f2 jne c60 <main+0x170>
c6e: be 02 00 00 00 mov $0x2,%esi
c73: 4c 89 ef mov %r13,%rdi
c76: e8 b5 fd ff ff callq a30 <PAPI_read_counters@plt>
c7b: 85 c0 test %eax,%eax
c7d: 0f 85 22 02 00 00 jne ea5 <main+0x3b5>
c83: 48 8b ac 24 d0 00 00 mov 0xd0(%rsp),%rbp
c8a: 00
c8b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
c90: 41 83 07 09 addl $0x9,(%r15)
c94: 49 83 c7 40 add $0x40,%r15
c98: 49 39 df cmp %rbx,%r15
c9b: 75 f3 jne c90 <main+0x1a0>
c9d: be 02 00 00 00 mov $0x2,%esi
ca2: 4c 89 ef mov %r13,%rdi
ca5: e8 86 fd ff ff callq a30 <PAPI_read_counters@plt>
caa: 85 c0 test %eax,%eax
cac: 0f 85 f3 01 00 00 jne ea5 <main+0x3b5>
cb2: 48 8b 8c 24 d0 00 00 mov 0xd0(%rsp),%rcx
cb9: 00
cba: 48 8d 35 97 04 00 00 lea 0x497(%rip),%rsi # 1158 <_IO_stdin_used+0xe8>
cc1: bf 01 00 00 00 mov $0x1,%edi
cc6: 31 c0 xor %eax,%eax
cc8: 48 89 ea mov %rbp,%rdx
ccb: e8 c0 fd ff ff callq a90 <__printf_chk@plt>
cd0: be 02 00 00 00 mov $0x2,%esi
cd5: 4c 89 ef mov %r13,%rdi
cd8: e8 83 fd ff ff callq a60 <PAPI_stop_counters@plt>
cdd: 85 c0 test %eax,%eax
cdf: 0f 85 72 01 00 00 jne e57 <main+0x367>
ce5: 0f ae f0 mfence
ce8: 0f 31 rdtsc
cea: bf 02 00 00 00 mov $0x2,%edi
cef: 48 c1 e2 20 shl $0x20,%rdx
cf3: 48 89 c3 mov %rax,%rbx
cf6: 48 8d 74 24 20 lea 0x20(%rsp),%rsi
cfb: 48 09 d3 or %rdx,%rbx
cfe: e8 fd fc ff ff callq a00 <clock_gettime@plt>
d03: bf 01 00 00 00 mov $0x1,%edi
d08: 48 be db 34 b6 d7 82 movabs $0x431bde82d7b634db,%rsi
d0f: de 1b 43
d12: 0f ae f0 mfence
d15: 48 8b 4c 24 20 mov 0x20(%rsp),%rcx
d1a: 48 2b 4c 24 10 sub 0x10(%rsp),%rcx
d1f: 48 69 c9 00 ca 9a 3b imul $0x3b9aca00,%rcx,%rcx
d26: 48 03 4c 24 28 add 0x28(%rsp),%rcx
d2b: 48 2b 4c 24 18 sub 0x18(%rsp),%rcx
d30: 48 89 c8 mov %rcx,%rax
d33: 48 c1 f9 3f sar $0x3f,%rcx
d37: 48 f7 ee imul %rsi
d3a: 48 8d 35 3f 04 00 00 lea 0x43f(%rip),%rsi # 1180 <_IO_stdin_used+0x110>
d41: 31 c0 xor %eax,%eax
d43: 48 c1 fa 12 sar $0x12,%rdx
d47: 48 29 ca sub %rcx,%rdx
d4a: e8 41 fd ff ff callq a90 <__printf_chk@plt>
d4f: 48 89 da mov %rbx,%rdx
d52: bf 01 00 00 00 mov $0x1,%edi
d57: 31 c0 xor %eax,%eax
d59: 4c 29 f2 sub %r14,%rdx
d5c: 48 8d 35 53 03 00 00 lea 0x353(%rip),%rsi # 10b6 <_IO_stdin_used+0x46>
d63: e8 28 fd ff ff callq a90 <__printf_chk@plt>
d68: 31 d2 xor %edx,%edx
d6a: 48 8d 35 56 03 00 00 lea 0x356(%rip),%rsi # 10c7 <_IO_stdin_used+0x57>
d71: 31 c0 xor %eax,%eax
d73: bf 01 00 00 00 mov $0x1,%edi
d78: e8 13 fd ff ff callq a90 <__printf_chk@plt>
d7d: 31 ff xor %edi,%edi
d7f: 48 8d 74 24 30 lea 0x30(%rsp),%rsi
d84: e8 17 fd ff ff callq aa0 <getrusage@plt>
d89: 83 f8 ff cmp $0xffffffff,%eax
d8c: 0f 84 d6 00 00 00 je e68 <main+0x378>
d92: 48 8b 8c 24 b8 00 00 mov 0xb8(%rsp),%rcx
d99: 00
d9a: 48 8b 94 24 b0 00 00 mov 0xb0(%rsp),%rdx
da1: 00
da2: 48 8d 35 3e 03 00 00 lea 0x33e(%rip),%rsi # 10e7 <_IO_stdin_used+0x77>
da9: 31 c0 xor %eax,%eax
dab: bf 01 00 00 00 mov $0x1,%edi
db0: e8 db fc ff ff callq a90 <__printf_chk@plt>
db5: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
db9: bf 01 00 00 00 mov $0x1,%edi
dbe: c5 fb 10 0d 12 04 00 vmovsd 0x412(%rip),%xmm1 # 11d8 <_IO_stdin_used+0x168>
dc5: 00
dc6: 48 69 44 24 30 40 42 imul $0xf4240,0x30(%rsp),%rax
dcd: 0f 00
dcf: 48 03 44 24 38 add 0x38(%rsp),%rax
dd4: 48 8d 35 d5 03 00 00 lea 0x3d5(%rip),%rsi # 11b0 <_IO_stdin_used+0x140>
ddb: c4 e1 fb 2a c0 vcvtsi2sd %rax,%xmm0,%xmm0
de0: 48 69 54 24 40 40 42 imul $0xf4240,0x40(%rsp),%rdx
de7: 0f 00
de9: 48 03 54 24 48 add 0x48(%rsp),%rdx
dee: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
df2: c4 e1 fb 2c c0 vcvttsd2si %xmm0,%rax
df7: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0
dfb: c4 e1 fb 2a c2 vcvtsi2sd %rdx,%xmm0,%xmm0
e00: c5 fb 59 c1 vmulsd %xmm1,%xmm0,%xmm0
e04: c4 e1 fb 2c d0 vcvttsd2si %xmm0,%rdx
e09: 48 01 c2 add %rax,%rdx
e0c: 31 c0 xor %eax,%eax
e0e: e8 7d fc ff ff callq a90 <__printf_chk@plt>
e13: 31 c0 xor %eax,%eax
e15: 48 8b 8c 24 68 01 00 mov 0x168(%rsp),%rcx
e1c: 00
e1d: 64 48 33 0c 25 28 00 xor %fs:0x28,%rcx
e24: 00 00
e26: 75 51 jne e79 <main+0x389>
e28: 48 81 c4 78 01 00 00 add $0x178,%rsp
e2f: 5b pop %rbx
e30: 5d pop %rbp
e31: 41 5c pop %r12
e33: 41 5d pop %r13
e35: 41 5e pop %r14
e37: 41 5f pop %r15
e39: c3 retq
e3a: ba 01 00 00 00 mov $0x1,%edx
e3f: 48 8d 35 47 02 00 00 lea 0x247(%rip),%rsi # 108d <_IO_stdin_used+0x1d>
e46: bf 01 00 00 00 mov $0x1,%edi
e4b: 31 c0 xor %eax,%eax
e4d: e8 3e fc ff ff callq a90 <__printf_chk@plt>
e52: e9 5b fd ff ff jmpq bb2 <main+0xc2>
e57: 48 8d 3d 4a 02 00 00 lea 0x24a(%rip),%rdi # 10a8 <_IO_stdin_used+0x38>
e5e: e8 8d fb ff ff callq 9f0 <puts@plt>
e63: e9 7d fe ff ff jmpq ce5 <main+0x1f5>
e68: 48 8d 3d 62 02 00 00 lea 0x262(%rip),%rdi # 10d1 <_IO_stdin_used+0x61>
e6f: e8 7c fb ff ff callq 9f0 <puts@plt>
e74: e9 19 ff ff ff jmpq d92 <main+0x2a2>
e79: e8 a2 fb ff ff callq a20 <__stack_chk_fail@plt>
e7e: 48 8b 0d 9b 11 20 00 mov 0x20119b(%rip),%rcx # 202020 <stderr@@GLIBC_2.2.5>
e85: ba 18 00 00 00 mov $0x18,%edx
e8a: be 01 00 00 00 mov $0x1,%esi
e8f: 48 8d 3d de 01 00 00 lea 0x1de(%rip),%rdi # 1074 <_IO_stdin_used+0x4>
e96: e8 25 fc ff ff callq ac0 <fwrite@plt>
e9b: bf 01 00 00 00 mov $0x1,%edi
ea0: e8 0b fc ff ff callq ab0 <exit@plt>
ea5: 89 c7 mov %eax,%edi
ea7: e8 d4 fb ff ff callq a80 <PAPI_strerror@plt>
eac: 48 8b 3d 6d 11 20 00 mov 0x20116d(%rip),%rdi # 202020 <stderr@@GLIBC_2.2.5>
eb3: be 01 00 00 00 mov $0x1,%esi
eb8: 48 8d 15 49 02 00 00 lea 0x249(%rip),%rdx # 1108 <_IO_stdin_used+0x98>
ebf: 48 89 c1 mov %rax,%rcx
ec2: 31 c0 xor %eax,%eax
ec4: e8 07 fc ff ff callq ad0 <__fprintf_chk@plt>
ec9: bf 01 00 00 00 mov $0x1,%edi
ece: e8 dd fb ff ff callq ab0 <exit@plt>
ed3: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
eda: 00 00 00
edd: 0f 1f 00 nopl (%rax)
0000000000000ee0 <_start>:
ee0: 31 ed xor %ebp,%ebp
ee2: 49 89 d1 mov %rdx,%r9
ee5: 5e pop %rsi
ee6: 48 89 e2 mov %rsp,%rdx
ee9: 48 83 e4 f0 and $0xfffffffffffffff0,%rsp
eed: 50 push %rax
eee: 54 push %rsp
eef: 4c 8d 05 6a 01 00 00 lea 0x16a(%rip),%r8 # 1060 <__libc_csu_fini>
ef6: 48 8d 0d f3 00 00 00 lea 0xf3(%rip),%rcx # ff0 <__libc_csu_init>
efd: 48 8d 3d ec fb ff ff lea -0x414(%rip),%rdi # af0 <main>
f04: ff 15 d6 10 20 00 callq *0x2010d6(%rip) # 201fe0 <__libc_start_main@GLIBC_2.2.5>
f0a: f4 hlt
f0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0000000000000f10 <deregister_tm_clones>:
f10: 48 8d 3d f9 10 20 00 lea 0x2010f9(%rip),%rdi # 202010 <__TMC_END__>
f17: 55 push %rbp
f18: 48 8d 05 f1 10 20 00 lea 0x2010f1(%rip),%rax # 202010 <__TMC_END__>
f1f: 48 39 f8 cmp %rdi,%rax
f22: 48 89 e5 mov %rsp,%rbp
f25: 74 19 je f40 <deregister_tm_clones+0x30>
f27: 48 8b 05 aa 10 20 00 mov 0x2010aa(%rip),%rax # 201fd8 <_ITM_deregisterTMCloneTable>
f2e: 48 85 c0 test %rax,%rax
f31: 74 0d je f40 <deregister_tm_clones+0x30>
f33: 5d pop %rbp
f34: ff e0 jmpq *%rax
f36: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f3d: 00 00 00
f40: 5d pop %rbp
f41: c3 retq
f42: 0f 1f 40 00 nopl 0x0(%rax)
f46: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f4d: 00 00 00
0000000000000f50 <register_tm_clones>:
f50: 48 8d 3d b9 10 20 00 lea 0x2010b9(%rip),%rdi # 202010 <__TMC_END__>
f57: 48 8d 35 b2 10 20 00 lea 0x2010b2(%rip),%rsi # 202010 <__TMC_END__>
f5e: 55 push %rbp
f5f: 48 29 fe sub %rdi,%rsi
f62: 48 89 e5 mov %rsp,%rbp
f65: 48 c1 fe 03 sar $0x3,%rsi
f69: 48 89 f0 mov %rsi,%rax
f6c: 48 c1 e8 3f shr $0x3f,%rax
f70: 48 01 c6 add %rax,%rsi
f73: 48 d1 fe sar %rsi
f76: 74 18 je f90 <register_tm_clones+0x40>
f78: 48 8b 05 71 10 20 00 mov 0x201071(%rip),%rax # 201ff0 <_ITM_registerTMCloneTable>
f7f: 48 85 c0 test %rax,%rax
f82: 74 0c je f90 <register_tm_clones+0x40>
f84: 5d pop %rbp
f85: ff e0 jmpq *%rax
f87: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
f8e: 00 00
f90: 5d pop %rbp
f91: c3 retq
f92: 0f 1f 40 00 nopl 0x0(%rax)
f96: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
f9d: 00 00 00
0000000000000fa0 <__do_global_dtors_aux>:
fa0: 80 3d 81 10 20 00 00 cmpb $0x0,0x201081(%rip) # 202028 <completed.7696>
fa7: 75 2f jne fd8 <__do_global_dtors_aux+0x38>
fa9: 48 83 3d 47 10 20 00 cmpq $0x0,0x201047(%rip) # 201ff8 <__cxa_finalize@GLIBC_2.2.5>
fb0: 00
fb1: 55 push %rbp
fb2: 48 89 e5 mov %rsp,%rbp
fb5: 74 0c je fc3 <__do_global_dtors_aux+0x23>
fb7: 48 8b 3d 4a 10 20 00 mov 0x20104a(%rip),%rdi # 202008 <__dso_handle>
fbe: e8 1d fb ff ff callq ae0 <__cxa_finalize@plt>
fc3: e8 48 ff ff ff callq f10 <deregister_tm_clones>
fc8: c6 05 59 10 20 00 01 movb $0x1,0x201059(%rip) # 202028 <completed.7696>
fcf: 5d pop %rbp
fd0: c3 retq
fd1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
fd8: f3 c3 repz retq
fda: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000fe0 <frame_dummy>:
fe0: 55 push %rbp
fe1: 48 89 e5 mov %rsp,%rbp
fe4: 5d pop %rbp
fe5: e9 66 ff ff ff jmpq f50 <register_tm_clones>
fea: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
0000000000000ff0 <__libc_csu_init>:
ff0: 41 57 push %r15
ff2: 41 56 push %r14
ff4: 49 89 d7 mov %rdx,%r15
ff7: 41 55 push %r13
ff9: 41 54 push %r12
ffb: 4c 8d 25 36 0d 20 00 lea 0x200d36(%rip),%r12 # 201d38 <__frame_dummy_init_array_entry>
1002: 55 push %rbp
1003: 48 8d 2d 36 0d 20 00 lea 0x200d36(%rip),%rbp # 201d40 <__init_array_end>
100a: 53 push %rbx
100b: 41 89 fd mov %edi,%r13d
100e: 49 89 f6 mov %rsi,%r14
1011: 4c 29 e5 sub %r12,%rbp
1014: 48 83 ec 08 sub $0x8,%rsp
1018: 48 c1 fd 03 sar $0x3,%rbp
101c: e8 9f f9 ff ff callq 9c0 <_init>
1021: 48 85 ed test %rbp,%rbp
1024: 74 20 je 1046 <__libc_csu_init+0x56>
1026: 31 db xor %ebx,%ebx
1028: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
102f: 00
1030: 4c 89 fa mov %r15,%rdx
1033: 4c 89 f6 mov %r14,%rsi
1036: 44 89 ef mov %r13d,%edi
1039: 41 ff 14 dc callq *(%r12,%rbx,8)
103d: 48 83 c3 01 add $0x1,%rbx
1041: 48 39 dd cmp %rbx,%rbp
1044: 75 ea jne 1030 <__libc_csu_init+0x40>
1046: 48 83 c4 08 add $0x8,%rsp
104a: 5b pop %rbx
104b: 5d pop %rbp
104c: 41 5c pop %r12
104e: 41 5d pop %r13
1050: 41 5e pop %r14
1052: 41 5f pop %r15
1054: c3 retq
1055: 90 nop
1056: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
105d: 00 00 00
0000000000001060 <__libc_csu_fini>:
1060: f3 c3 repz retq
Disassembly of section .fini:
0000000000001064 <_fini>:
1064: 48 83 ec 08 sub $0x8,%rsp
1068: 48 83 c4 08 add $0x8,%rsp
106c: c3 retq
Each object is 64 bytes in size, and I added initialization as well:
typedef struct _object {
    int value;
    int pad_0;
    int * pad_2;
    int * pad_3;
    int * pad_4;
    int * pad_5;
    int * pad_6;
    int * pad_7;
    int * pad_8;
} object;
object * array;
int arr_size = 1000;
array = (object *) malloc(arr_size * sizeof(object));
for (int i = 0; i < arr_size; i++) {
    array[i].value = 1;
}
I've done some experiments using LIKWID, which is similar to PAPI, on Haswell. I found out that the calls to the functions that initialize and read the performance counters can cause more than 600 replacements in the L1 cache. Since the L1 cache has only 512 lines, this means that these functions may evict many of the lines that you would otherwise expect to be in the L1. By looking at the relatively large source code of PAPI_start_counters and _internal_hl_read_cnts, it seems to me that these functions may evict many lines from the L1, so the array elements don't survive in the L1 across these calls. I've verified this by using loads instead of stores and counting hits and misses using MEM_LOAD_RETIRED.*.
I think the solution would be to use the RDPMC instruction. I have not used this instruction directly before. The code snippets here look useful.
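As a hedged sketch of reading a counter without hand-written inline assembly: GCC and Clang expose the instruction as the `__rdpmc` intrinsic in `<x86intrin.h>`. The helper names below (`rdpmc_fixed_index`, `read_pmc`, `pmc_delta`) are hypothetical, and the sketch assumes that the kernel permits user-space RDPMC and that the selected counter has already been programmed with the event of interest; otherwise the instruction faults. The counter width passed to `pmc_delta` is also an assumption (often 48 bits on recent Intel cores).

```c
#include <assert.h>
#include <stdint.h>

/* Bit 30 of the RDPMC index selects the fixed-function counters on Intel. */
#define RDPMC_FIXED_FLAG (1u << 30)

/* Hypothetical helper: RDPMC index of fixed-function counter n. */
static inline uint32_t rdpmc_fixed_index(uint32_t n)
{
    assert(n < RDPMC_FIXED_FLAG);
    return RDPMC_FIXED_FLAG | n;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  /* provides __rdpmc on GCC and Clang */

/* Hypothetical wrapper: read the counter selected by index.
   Assumption: user-space RDPMC is enabled and the counter is already
   programmed; if not, this faults at run time. */
static inline uint64_t read_pmc(uint32_t index)
{
    return (uint64_t)__rdpmc((int)index);
}
#endif

/* Events between two raw readings. Counters are narrower than 64 bits
   and wrap around, so the difference is masked to the counter width. */
static inline uint64_t pmc_delta(uint64_t before, uint64_t after,
                                 unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 64);
    uint64_t mask = (width_bits == 64)
                        ? UINT64_MAX
                        : ((UINT64_C(1) << width_bits) - 1);
    return (after - before) & mask;
}
```

In use, one would call `read_pmc` immediately before and after the loop and pass both readings to `pmc_delta`, so that no library call sits between the measurement points and the measured code.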
Alternatively, you can put two copies of the loop after PAPI_start_counters/PAPI_read_counters and then subtract from the results the counts for one copy of the loop. This method works well.
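The subtraction can be sketched as follows. All names are hypothetical: `one_copy` is the miss count measured around a single copy of the loop, and `two_copies` is the count measured around two back-to-back copies in an otherwise identical run. The sketch assumes that the PAPI overhead and the first (cold) pass contribute equally to both measurements, so the difference isolates the second, warm pass.

```c
#include <assert.h>

/* Misses attributable to the second (warm) copy of the loop.
   one_copy:   counter value measured around one copy of the loop
   two_copies: counter value measured around two back-to-back copies
   Assumption: both values include the same API overhead and the same
   cold first pass, so those terms cancel in the subtraction. */
static long long warm_copy_misses(long long one_copy, long long two_copies)
{
    assert(one_copy >= 0 && two_copies >= 0);
    long long diff = two_copies - one_copy;
    /* Measurement noise can make the difference slightly negative. */
    return diff > 0 ? diff : 0;
}
```

A result close to zero would indicate that the second pass over the array mostly hit in the L1, which is the quantity the question is after.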
By the way, the L1D.REPLACEMENT counter seems to be fairly accurate on Haswell when the number of cache lines accessed is larger than about 10. Perhaps the count would be exact when using RDPMC.
From your previous question, it seems that you're on Skylake. According to the PAPI event mapping, PAPI_L1_DCM and PAPI_L2_TCM are mapped to the L1D.REPLACEMENT and LONGEST_LAT_CACHE.REFERENCE performance monitoring events on Intel processors. These are defined in the Intel manual as follows:
L1D.REPLACEMENT: Counts L1D data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.
LONGEST_LAT_CACHE.REFERENCE: This event counts core-originated cacheable demand requests that refer to the last level cache (LLC). Demand requests include loads, RFOs, and hardware prefetches from L1D, and instruction fetches from IFU.
Without getting into the details of when these events exactly occur, there are three important points here that are relevant to your question:
miss2.
On Skylake, there are other native events that you can use to count L1D misses and hits per load instruction. You can use MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that hit in the L1D. You can use MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.L1_HIT to count the number of retired load instructions that miss in the L1D. There don't seem to be PAPI events for them. According to the documentation, you can pass native event codes to PAPIF_start_counters.
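The arithmetic on these two native events can be sketched as below. The struct and function names are hypothetical, and the sketch only post-processes two counter values that are assumed to have been collected over the same code region; it does not show how to program the events.

```c
#include <assert.h>

/* Counter values collected over the same region (hypothetical container). */
typedef struct {
    long long all_loads; /* MEM_INST_RETIRED.ALL_LOADS */
    long long l1_hits;   /* MEM_LOAD_RETIRED.L1_HIT */
} load_counts;

/* Retired loads that missed in the L1D: ALL_LOADS - L1_HIT. */
static long long l1_load_misses(load_counts c)
{
    assert(c.all_loads >= 0 && c.l1_hits >= 0);
    long long misses = c.all_loads - c.l1_hits;
    return misses > 0 ? misses : 0; /* guard against skewed readings */
}

/* Fraction of retired loads that hit in the L1D (0.0 if no loads). */
static double l1_hit_ratio(load_counts c)
{
    if (c.all_loads <= 0)
        return 0.0;
    return (double)c.l1_hits / (double)c.all_loads;
}
```

Note that both events count load instructions only, so a loop that should be measured this way needs to read the array elements.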
Another issue is that it's not clear to me whether PAPIF_start_counters by default will count only user events or both kernel and user events. It seems that you can use PAPI_create_eventset to control the counting domain.
The calls to PAPI APIs can also impact the event counts. You can try to measure this using an empty block as follows:
if ((ret1 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret1));
    exit(1);
}
// Nothing.
if ((ret2 = PAPI_read_counters(values, numEvents)) != PAPI_OK) {
    fprintf(stderr, "PAPI failed to read counters: %s\n", PAPI_strerror(ret2));
    exit(1);
}
This measurement will give you an estimate of the error that may occur due to PAPI itself.
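One possible way to use that estimate, sketched under assumptions: repeat the empty-block measurement several times, take the median as the overhead (the median is my choice here, to damp outliers; it is not something PAPI prescribes), and subtract it from the count measured around the real loop. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for long long values. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Median of n empty-block counts. Sorts the samples in place and
   returns the upper median when n is even. */
static long long median_overhead(long long *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_ll);
    return samples[n / 2];
}

/* Raw count with the estimated overhead removed, clamped at zero. */
static long long subtract_overhead(long long raw, long long overhead)
{
    return raw > overhead ? raw - overhead : 0;
}
```

If the corrected count is still far above zero for an array that should fit in the L1, the remaining misses are more likely caused by evictions inside the measurement calls than by a constant offset.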
Also, I don't think you need to use _mm_mfence.