Tags: c, gcc, linker, elf, loader

Is the statement “Uninitialized global variables get weak symbols” in CSAPP accurate?


In Computer Systems: A Programmer’s Perspective (CSAPP), the book mentions that “Uninitialized global variables get weak symbols.” After some experimentation, I found this description seems inconsistent with what actually happens.

Here’s what I did:

create a C file main.c:

__attribute__((weak)) int var1;
int var2;
int var3 = 1;

int main()
{
    return 0;
}

compile it with the GCC -fcommon option:

gcc -fcommon -c main.c -o main.o

use readelf -s to check the symbol table:

[symbol table output by readelf -s]

We can see that var2 is bound GLOBAL, contrary to CSAPP's statement that uninitialized globals get weak symbols, and that it is placed in a COMMON block.

So my question is: is CSAPP's statement that "uninitialized global variables get weak symbols" accurate?

Thank you for your attention, and I would appreciate any additional insights on this topic.


Solution

  • Is CSAPP wrong?

    In short, it is!

    The authors apply the term weak symbol incorrectly. They define the term correctly but then mis-apply it to symbols that are not weak per that definition: in linkage terms those symbols are common symbols, and in the sense of the C Standard they are tentatively defined.

    The semantic distinction between weak and common symbols is made in the ELF specification of 1995, predating even the 1st edition (2002) of CSAPP. The meaning of a tentative definition has not changed in the ISO C standards since the first (C90), except for a trivial correction.

    A charitable interpretation would be that the authors wished to simplify their presentation by conflating weak symbols with common symbols, because at the time they were writing, the way in which the GNU/Linux ELF linker handled the resolution of tentative definitions by default was in principle similar to the way it handled the resolution of weak symbols. But reading what they have to say about weak symbols together with the illustrative workings, it seems more likely that a confusion of weak and common symbols simply slipped through uncorrected.

    You didn't misread the text

    All the illustrative workings in the book are said to have been done on Linux.

    The context of the statement you picked up on is:

    7.6.1 How Linkers Resolve Multiply Defined Global Symbols At compile time, the compiler exports each global symbol to the assembler as either strong or weak, and the assembler encodes this information implicitly in the symbol table of the relocatable object file. Functions and initialized global variables get strong symbols. Uninitialized global variables get weak symbols. For the example program in Figure 7.1, buf, bufp0, main, and swap are strong symbols; bufp1 is a weak symbol. Given this notion of strong and weak symbols, Unix linkers use the following rules for dealing with multiply defined symbols:

    • Rule 1: Multiple strong symbols are not allowed.
    • Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol.
    • Rule 3: Given multiple weak symbols, choose any of the weak symbols.

    The example program of Figure 7.1 that they refer to is:

    /* swap.c */
    extern int buf[];
    
    int *bufp0 = &buf[0];
    int *bufp1;
    
    void swap()
    {
        int temp;
        
        bufp1 = &buf[1];
        temp = *bufp0;
        *bufp0 = *bufp1;
        *bufp1 = temp;
    }
    
    /* main.c */
    void swap();
    
    int buf[2] = {1, 2};
    
    int main()
    {
        swap();
        return 0;
    }
    

    In the 2nd ed. (2010) of the book they show us the readelf symbol table (in an antiquated format) of the resulting object file swap.o:

    ...Similarly, here are the symbol table entries for swap.o:

    Num:    Value   Size    Type    Bind    Ot  Ndx Name
      8:        0      4    OBJECT  GLOBAL   0    3 bufp0
      9:        0      0    NOTYPE  GLOBAL   0  UND buf
     10:        0     39    FUNC    GLOBAL   0    1 swap
     11:        4      4    OBJECT  GLOBAL   0  COM bufp1
    

    This symbol table does not appear in the 3rd ed. (2016).

    You can see that symbol bufp1, about which they said:

    `bufp1` is a weak symbol.
    

    has not been made a weak symbol by either of the methods that were known to the GCC C compiler back then or now: __attribute__((weak)) or #pragma weak. Its binding is not WEAK but GLOBAL, and it is distinguished from the regularly defined symbols bufp0 and swap by having no numeric section index (Ndx) denoting the section in which it is defined; instead it carries the annotation COM, meaning that it has a tentative definition held in a COMMON block.

    This is the same result for bufp1 that I will obtain by compiling:

    $ gcc -fcommon -c swap.c
    
    $ readelf --syms --wide swap.o
    
    Symbol table '.symtab' contains 7 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
         1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS swap.c
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 .text
         3: 0000000000000000     8 OBJECT  GLOBAL DEFAULT    5 bufp0
         4: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND buf
         5: 0000000000000008     8 OBJECT  GLOBAL DEFAULT  COM bufp1
         6: 0000000000000000    67 FUNC    GLOBAL DEFAULT    1 swap
         
    

    I need to specify -fcommon to get that, and CSAPP did not, because GCC defaulted to -fcommon prior to GCC 10 and has defaulted to -fno-common since.
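    This can be shown in isolation. Here is a sketch, assuming gcc and readelf are on the PATH; the file name tent_demo.c is hypothetical:

```shell
# Sketch: one tentative definition compiled both ways; only -fcommon
# yields the COM annotation, while -fno-common places it in .bss.
cat > tent_demo.c <<'EOF'
int *bufp1;   /* tentative definition */
EOF
gcc -fno-common -c tent_demo.c -o tent_nocommon.o
gcc -fcommon    -c tent_demo.c -o tent_common.o
readelf --syms --wide tent_nocommon.o | grep bufp1   # Ndx is a numbered section
readelf --syms --wide tent_common.o   | grep bufp1   # Ndx is COM
```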

    As you'd predict, a WEAK symbol results from:

    $ cat swap_weak.c
    /* swap_weak.c */
    extern int buf[];
    
    int *bufp0 = &buf[0];
    int *bufp1 __attribute__((weak));
    
    void swap()
    {
        int temp;
        
        bufp1 = &buf[1];
        temp = *bufp0;
        *bufp0 = *bufp1;
        *bufp1 = temp;
    }
    
    $ gcc -c swap_weak.c
    $ readelf --syms --wide swap_weak.o
    
    Symbol table '.symtab' contains 7 entries:
       Num:    Value          Size Type    Bind   Vis      Ndx Name
         0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
         1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS swap_weak.c
         2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 .text
         3: 0000000000000000     8 OBJECT  GLOBAL DEFAULT    5 bufp0
         4: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND buf
         5: 0000000000000000     8 OBJECT  WEAK   DEFAULT    4 bufp1
         6: 0000000000000000    67 FUNC    GLOBAL DEFAULT    1 swap
         
    

    And compiling with gcc -fcommon -c swap_weak.c instead makes no difference.

    What are the differences, relationship, and precedence between a WEAK symbol and a COMMON symbol? What happens if two files define symbols with the same name, one WEAK and one COMMON?

    Weak symbols

    __attribute__((weak)) is one of many non-standard GNU extensions to C that are implemented using GCC's __attribute__((...)) declaration modifier.

    The C Standard has always stipulated that if there is more than one external definition of a symbol in a program then the behaviour is undefined. Weak symbols carve out some defined behaviour within that undefined territory by letting the same symbol be defined multiple times in the actual linkage of the program while having only one definition survive into it.

    A weak symbol is handled by the linker as described by CSAPP.

    Here's an illustration:

    $ tail -n +1 prog.c strong.c weak*.c
    ==> prog.c <==
    #include <stdio.h>
    extern char const * file;
    int main(void) { 
        puts(file);
        return 0;
    } 
    
    ==> strong.c <==
    char const * file = __FILE__;
    
    ==> weak_a.c <==
    __attribute__((weak))  char const * file = __FILE__;
    
    ==> weak_b.c <==
    __attribute__((weak))  char const * file = __FILE__;
    
    $ gcc prog.c strong.c weak_a.c weak_b.c
    $ ./a.out
    strong.c
    $ gcc prog.c  weak_a.c weak_b.c strong.c
    $ ./a.out
    strong.c
    

    That shows that given any weak definitions of symbol file and one strong definition, the linker takes the strong one and discards the weak ones.

    $ gcc prog.c  weak_a.c weak_b.c
    $ ./a.out
    weak_a.c
    $ gcc prog.c weak_b.c weak_a.c
    $ ./a.out
    weak_b.c
    

    That shows that given multiple weak definitions of file and no strong ones, the linker, in principle, picks one definition arbitrarily and discards the rest. In practice it just picks the first one you input.

    Common symbols

    Per the C Standard (any edition will do, but for reference say Working Draft N2347, C17..C2x), the compiler classifies any uninitialized file-scope variable declaration that is not qualified extern (it may be qualified static) as a tentative definition. E.g.

    [static] int foo;
    

    At the end of a translation unit, if the compiler has seen a matching declaration that initialises the variable - and so defines it - e.g.

    [static] int foo = 42;
    

    then it considers all tentative definitions to be declarations of the symbol so defined. If it sees no such definition then it deems there to be an implicit 0-initialised definition in the translation unit - which is an external definition unless the symbol was declared static.

    Should there be more than one such implicit definition of a symbol in a program's linkage the C Standard makes it UB. But the Standard also expressly recognises many widely supported common extensions to the Standard, among which is the behaviour traditionally provided by -fcommon. From the reference draft:

    J.5.11 Multiple external definitions

    1 There may be more than one external definition for the identifier of an object, with or without the explicit use of the keyword extern; if the definitions disagree, or more than one is initialized, the behavior is undefined

    This extension empowers a compiler to steer the linker's handling of the implicit 0-initialised definitions that accrue from tentative definitions by means of the -fcommon and -fno-common options:

    For -fno-common, if there are multiple definitions of an implicit 0-initialized symbol in the program then the linker will give a multiple definition error, just as it would for an explicitly initialised regular symbol.

    For -fcommon, the linker does something analogous to its handling of weakly v. strongly defined symbols. It merges the input common blocks and appraises all the common definitions versus the regularly defined symbols. If there are any common definitions of a symbol in the program and one regular definition, then the regular definition wins. If there are multiple common definitions and no regular definition, then any one of the common definitions is picked and the rest are discarded.

    Here's an illustration:

    $ tail -n +1 prog1.c defined.c tent_*.c
    ==> prog1.c <==
    #include <stdio.h>
    extern char const * file;
    int main(void) {
        if (file) { 
            printf("%s\n",file);
        } else {
            printf("%p\n",file);
        }
        return 0;
    } 
    
    ==> defined.c <==
    char const * file = __FILE__;
    
    ==> tent_a.c <==
    char const * file;
    
    
    ==> tent_b.c <==
    char const * file;
    

    Without -fcommon:

    $ gcc prog1.c defined.c tent_a.c tent_b.c
    /usr/bin/ld: /tmp/cc1sTYzG.o:(.bss+0x0): multiple definition of `file'; /tmp/ccocLj6m.o:(.data.rel.local+0x0): first defined here
    /usr/bin/ld: /tmp/ccIEz1I4.o:(.bss+0x0): multiple definition of `file'; /tmp/ccocLj6m.o:(.data.rel.local+0x0): first defined here
    collect2: error: ld returned 1 exit status
    
    $ gcc prog1.c tent_a.c tent_b.c
    /usr/bin/ld: /tmp/ccKi5GC1.o:(.bss+0x0): multiple definition of `file'; /tmp/ccNJ2NWl.o:(.bss+0x0): first defined here
    collect2: error: ld returned 1 exit status
    

    With -fcommon:

    $ gcc -fcommon prog1.c defined.c tent_a.c tent_b.c
    $ ./a.out
    defined.c
    
    $ gcc -fcommon prog1.c tent_a.c tent_b.c
    $ ./a.out
    (nil)
    

    Linkage priority of common v. weak symbols

    The linker accepts both kinds of symbols in a program and gives priority to the winning common definition over the winning weak definition of the same symbol. See the ELF specification, Symbol Table section:

    ..if a common symbol exists (that is, a symbol whose st_shndx field holds SHN_COMMON), the appearance of a weak symbol with the same name will not cause an error. The link editor honors the common definition and ignores the weak ones.

    Here's an illustration of that

    $ gcc prog1.c weak_a.c
    $ ./a.out
    weak_a.c
    
    $ gcc -fcommon prog1.c weak_a.c tent_a.c
    $ ./a.out
    (nil)
    

    Isn't all this a bit of a mess?

    Yes. GCC and Standard C have lived a very long time: they carry cruft and legacy burdens. The idea of common variables hailed from Fortran - much older still - and infiltrated C from the get-go as the natural way to support the linkage of tentative definitions. It wasn't intentionally conceived as a means of supporting the linkage of multiple definitions of the same symbol; but it went some way there, just for the case of multiple tentative definitions of a variable, and at most one actual definition of the same. Weak symbols work for both variables and functions, but they have to be defined (not "tentatively"). So if you're willing to write C that eschews tentative definitions, or you're writing in a language that eschews them, like C++, then common variables are redundant and weak symbols are a well-considered solution for linking multiple definitions of any kind. Common variables in C are a legacy style nowadays, and that's why GCC recently switched its default to -fno-common.