I have global const char definition in two different files:
f1:
const char foo1[] = "SAME_VALUE";
f2:
const char foo2[] = "SAME_VALUE";
Wanted to understand if in the final binary this will be optimized to take common space in memory. This is in context of GCC
This kind of optimization is called string interning.
GCC sets the -fmerge-constants
flag by default:
Attempt to merge identical constants (string constants and floating-point constants) across compilation units.
This option is the default for optimized compilation if the assembler and linker support it. Use -fno-merge-constants to inhibit this behavior.
Enabled at levels -O, -O2, -O3, -Os.
Let's make an executable with a 3rd file named f.c to reference the strings:
#include <stdio.h>
// For proposition#1
extern const char foo1[], foo2[];
// For proposition#2
//extern const char *foo1, *foo2;
int main(void) {
printf("%s\n", foo1);
printf("%s\n", foo2);
return 0;
}
When you define the following respectively in f1.c and f2.c (proposition#1):
const char foo1[] = "SAME_VALUE";
const char foo2[] = "SAME_VALUE";
This results in 2 different memory spaces into which the string "SAME_VALUE" is stored. So, the string is duplicated:
$ gcc f.c f1.c f2.c
$ objdump -D a.out
[...]
0000000000001060 <main>:
1060: f3 0f 1e fa endbr64
1064: 48 83 ec 08 sub $0x8,%rsp
1068: 48 8d 3d 99 0f 00 00 lea 0xf99(%rip),%rdi <-- foo1@2008
106f: e8 dc ff ff ff callq 1050 <puts@plt>
1074: 48 8d 3d 9d 0f 00 00 lea 0xf9d(%rip),%rdi <-- foo2@2018
107b: e8 d0 ff ff ff callq 1050 <puts@plt>
[...]
0000000000002008 <foo1>:
2008: 53 'S' <-- 1 string @ 2008
2009: 41 'A'
200a: 4d 'M'
200b: 45 5f 'E' '_'
200d: 56 'V'
200e: 41 'A'
200f: 4c 55 'L' 'U'
2011: 45 'E'
...
0000000000002018 <foo2>:
2018: 53 'S' <-- Another string @ 2018
2019: 41 'A'
201a: 4d 'M'
201b: 45 5f 'E' '_'
201d: 56 'V'
201e: 41 'A'
201f: 4c 55 'L' 'U'
2021: 45 'E'
But if you define the following respectively in f1.c and f2.c (proposition#2):
const char *foo1 = "SAME_VALUE";
const char *foo2 = "SAME_VALUE";
You define two pointers which will point on the same string. In this case, "SAME_VALUE" will likely not be duplicated. In the following raw disassembly, the string is at address 2004 and both foo1 and foo2 point to it:
$ gcc f.c f1.c f2.c
$ objdump -D a.out
[...]
2004: 53 'S' <-- 1 string @ 2004
2005: 41 'A'
2006: 4d 'M'
2007: 45 5f 'E' '_'
2009: 56 'V'
200a: 41 'A'
200b: 4c 55 'L' 'U'
200d: 45 'E'
[...]
0000000000001060 <main>:
1060: f3 0f 1e fa endbr64
1064: 48 83 ec 08 sub $0x8,%rsp
1068: 48 8b 3d a1 2f 00 00 mov 0x2fa1(%rip),%rdi <-- 106f+2fa1=foo1@4010
106f: e8 dc ff ff ff callq 1050 <puts@plt>
1074: 48 8b 3d 9d 2f 00 00 mov 0x2f9d(%rip),%rdi <-- 107b+2f9d=foo2@4018
[...]
0000000000004010 <foo1>:
4010: 04 20 <-- foo1 = @2004
[...]
0000000000004018 <foo2>:
4018: 04 20 <-- foo2 = @2004
To avoid the duplication with proposition#1, GCC provides -fmerge-all-constants
:
Attempt to merge identical constants and identical variables.
This option implies -fmerge-constants. In addition to -fmerge-constants this considers e.g. even constant initialized arrays or initialized constant variables with integral or floating-point types. Languages like C or C++ require each variable, including multiple instances of the same variable in recursive calls, to have distinct locations, so using this option results in non-conforming behavior.
Let's rebuild proposition#1 with this flag. We can see that foo2 is optimized out, only foo1 is kept and referenced:
$ gcc -fmerge-all-constants f.c f1.c f2.c
$ objdump -D a.out
[...]
0000000000001149 <main>:
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: 48 8d 3d b0 0e 00 00 lea 0xeb0(%rip),%rdi <-- 1158(RIP) + eb0 = 2008 <foo1>
1158: e8 f3 fe ff ff callq 1050 <puts@plt>
115d: 48 8d 3d a4 0e 00 00 lea 0xea4(%rip),%rdi <-- 1164(RIP) + ea4 = 2008 <foo1>
1164: e8 e7 fe ff ff callq 1050 <puts@plt>
1169: b8 00 00 00 00 mov $0x0,%eax
[...]
0000000000002008 <foo1>:
2008: 53 'S' <--- foo2 optimized out, only foo1 defined
2009: 41 'A'
200a: 4d 'M'
200b: 45 5f 'E' '_'
200d: 56 'V'
200e: 41 'A'
200f: 4c 55 'L' 'U'
2011: 45 'E'