c++ gcc clang avr machine-code

C++ optimization comparing inline classes and functions doesn't seem good enough


I'm studying C++ with a focus on the Arduino Nano (at a very beginner level), which is severely limited in memory (2KB) and compiled program size (32KB), though that is probably still within my expectations.

However, whenever I use a lib, this space is consumed quickly. And when I superficially analyze the code of these libs, I realize that they don't care that much about optimization. This made me curious about how C++ deals with optimizations.

For all code here, I'm using -Os (optimize for size, which is the default for PlatformIO).

Here is the situation: since I already know OOP, I decided to structure my Arduino projects with classes from the start. For maintenance and organization this is really great, but its effect on the size of the compiled application is confusing to me.

I've always heard that C++ compilers have some of the best optimizers, and that seems to be true to some extent.

Let's imagine I have a program like this:

extern void use(float value);
extern float read(int pin);

class Temperature {
   private:
     int pin;

   public:
     Temperature(int pin): pin(pin) {}

     float get() { return read(pin); }
};

Temperature temp(7);

void loop() {
   use(temp.get());
}

Note: I declared the use() and read() functions this way just as an example, since I'm using Compiler Explorer (https://godbolt.org/z/78ajPnPar) to get the machine code. This lets the code compile successfully and simulates functions over which I have no control.

As we can see in the code above, the code is very simple: it is a Temperature class that receives the pin in its constructor and stores it to later be read by get(). And perhaps the most important part: in my program, this class is initialized once with the parameter pin = 7, with no exceptions at the moment.

This should generate machine code with the following content (in GCC x64, but AVR or Clang generate something similar):

loop():
         push rax
         mov edi, DWORD PTR temp[rip]
         call read(int)
         pop rdx
         jmp use(float)
_GLOBAL__sub_I_temp:
         mov DWORD PTR temp[rip], 7
         ret
temp:
         .zero 4

This code shows that temp occupies 4 bytes because of the int, that the literal value 7 is stored into it at startup, and that loop() loads it, calls read() and passes the result to use().

However, if instead of giving the Temperature class a pin argument, which will essentially always be the same in my application, I use a template <int PIN>, things change a little:

//...

template <int PIN>
class Temperature {
   public:
     float get() { return read(PIN); }
};

Temperature<7> temp;

void loop() {
   use(temp.get());
}

The behavior will be the same, but the machine code will change a little:

loop():
         push rax
         mov edi, 7
         call read(int)
         pop rdx
         jmp use(float)
temp:
         .zero 1

You can see that the code is smaller and uses fewer instructions. This makes sense because the template argument is resolved at compile time. Note that 1 byte is still allocated, although it appears to be unused (this may be a bug). The point is only that there are differences in some cases, as the following version also shows:

// ...

void loop() {
   use(read(7));
}

Which produces:

loop():
         push rax
         mov edi, 7
         call read(int)
         pop rdx
         jmp use(float)

The machine code is visibly smaller. Here the only thing missing compared with the previous version is the allocation, but that is because I simplified the code for this example; in a real program the difference would be bigger.

So, although one thing or another makes sense, it seems to me that GCC, AVR or Clang are not always able to optimize things that, apparently, would be simple.

Let's see: the Temperature class is used once, it's quite simple, and the value 7 is defined as literal, which would easily allow the first version to become the "function only" version, leaving everything inline.

My question, finally, is: if I want to keep the code organized with classes, do I really have to watch out and replace them with functions, losing some scalability (assuming that a future version could use more than one instance of Temperature, for example)? Even worse: do the libs have the same problem, so that they cost much more than they should because of missed optimizations even in simple situations? Or, at the end of it all, am I just going crazy? Keep in mind that the size of the application is critical here because of the Nano's limitations.

In my real case I almost reached the 32KB limit and had to replace libs with lighter ones to save a few KB. The application is not finished yet, and it seems that I will have to worry about my own code too, replacing classes with functions until classes actually make more sense than functions (which means later refactoring that I don't yet know will be necessary). Is this the way to go?


Solution

  • Let's see: the Temperature class is used once, it's quite simple, and the value 7 is defined as literal, which would easily allow the first version to become the "function only" version, leaving everything inline.

    It has nothing to do with classes. Except for the restrictions placed by class member layout and virtual functions/bases, a class is effectively just a collection of variables (per instance) with a set of functions anyway. Whether you use a member function or a free function doesn't matter.
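
To make that concrete, here is a minimal sketch (an illustration, not code from the question) that puts the class version next to a plain struct with a free function. The `read()` body is a hypothetical stub so the example is self-contained, and the names `TemperatureData` and `temperature_get` are made up for this comparison.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's external read(); it returns a
// value derived from the pin so the example needs no hardware.
float read(int pin) {
    assert(pin >= 0);
    return pin * 1.5f;
}

// Class version: one int of state plus a member function.
class Temperature {
    int pin;

  public:
    explicit Temperature(int pin) : pin(pin) {}
    float get() const { return read(pin); }
};

// Free-function version: the same state in a plain struct, with the
// object passed explicitly instead of through the implicit `this`.
struct TemperatureData {
    int pin;
};

float temperature_get(const TemperatureData& self) { return read(self.pin); }

// Both types hold exactly one int, so they occupy the same storage.
static_assert(sizeof(Temperature) == sizeof(TemperatureData),
              "the class adds no per-object overhead");
```

Calling `Temperature(7).get()` and `temperature_get(TemperatureData{7})` does the same work; the member function is only a different spelling of a function that takes the object as its first argument.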

    The problem is that you are not comparing code with equal semantics. In the first version, temp is a non-const global with external linkage, so any other translation unit can change the stored 7 to something else at any time, and the current translation unit has no way to notice. So the compiler can't use a constant 7 in the code, even if it sees the initialization. It is exactly equivalent to making 7 a global int variable and using that in the function version.
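
A minimal sketch of that equivalent function version, under the assumption that `read()` is a simple stub and that `other_translation_unit()` stands in for code living in a different source file (both are hypothetical, as is `loop_value()`, which returns the value instead of passing it to `use()`):

```cpp
#include <cassert>

// Hypothetical stand-in for the question's external read().
float read(int pin) {
    assert(pin >= 0);
    return static_cast<float>(pin);
}

// Non-const global with external linkage, just like `Temperature temp(7);`.
// Any other translation unit may declare `extern int pin;` and write to it.
int pin = 7;

// Function version of loop(): when this file is compiled on its own, the
// compiler cannot assume pin is still 7 here, so it has to load it.
float loop_value() { return read(pin); }

// Stands in for code in some other translation unit (hypothetical).
void other_translation_unit() { pin = 3; }
```

After `other_translation_unit()` runs, `loop_value()` has to observe the new value, which is why the initial 7 cannot be baked into the generated code.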

    Similarly, although temp isn't really required in the second example, there could be a second translation unit which uses the address of temp and so the compiler must reserve at least one byte for it so that the linker can use it as the address of that variable in all translation units without it accidentally matching any other object's address. Again, you would have the exact same problem if you declared 7 as a const int variable in C.
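
The address requirement can be sketched as follows. This is only an illustration: the `read()` stub, the object names `first` and `second`, and the helper `have_distinct_addresses()` are assumptions, not part of the question.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's external read().
float read(int pin) {
    assert(pin >= 0);
    return static_cast<float>(pin);
}

template <int PIN>
struct Temperature {
    float get() const { return read(PIN); }
};

// Two objects of an empty class, both with external linkage.
Temperature<7> first;
Temperature<7> second;

// Even an empty class has a nonzero size, so each object can get its own
// address (typically one byte, matching the `.zero 1` in the listing).
static_assert(sizeof(Temperature<7>) >= 1, "objects must be addressable");

// Distinct objects must compare unequal by address, in every translation unit.
bool have_distinct_addresses() { return &first != &second; }
```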

    The solution is to actually make the semantics equal either:

    The last case is the easiest. In C++ (but not C) all you need to do is declare the variable as const:

    const Temperature<7> temp;
    

    or

    const Temperature temp(7);
    

    const non-extern non-inline non-template variables have internal linkage by default in C++ and so the compiler will now know, in both cases, that you are always using 7 at the call site and that there is no other use of the variable in the program.
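
A minimal compilable sketch of the non-template variant follows. The `read()` stub and `loop_value()` are hypothetical, and the `constexpr` constructor is an addition of this sketch (it makes the initialization a constant expression) rather than something the question's code has.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's external read().
float read(int pin) {
    assert(pin >= 0);
    return static_cast<float>(pin);
}

class Temperature {
    int pin;

  public:
    // constexpr is an assumption of this sketch, not part of the question.
    constexpr explicit Temperature(int pin) : pin(pin) {}

    // get() must be const-qualified, otherwise it cannot be called on the
    // const object below.
    float get() const { return read(pin); }
};

// const at namespace scope: internal linkage, so no other translation unit
// can name or modify this object.
const Temperature temp(7);

// Stand-in for loop(); returns the value instead of passing it to use().
float loop_value() { return temp.get(); }
```

Note that the question's `get()` is not const-qualified, so it needs the `const` shown here before either `const` declaration will compile at the call site.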


    In general, classes in C++ are not any worse for performance than alternative approaches. If you see a difference, then it is likely caused by circumstantial issues such as the one above. To get optimal results, whether you use classes or any other language feature, you need to have a very good understanding of the language specification, the general compilation process (especially the separate compilation of translation units) and then also compiler-specific optimization processes and deep knowledge about the hardware's behavior.

    When you use a library, you must trust the library writer to be aware of all of that and to put sufficient effort into optimizing. For many library writers reducing memory footprint as much as possible may not be their priority. You must look for a library that specifically targets your requirements or if there is none, write it yourself.

    However, OOP generally is not a good fit if you want to reduce memory footprint and improve runtime performance. The principles of OOP lead to much higher memory fragmentation and usually need indirect calls (i.e. virtual functions in C++) to be implemented, harming the optimizer's ability to see through function calls.

    But classes in C++ do not equate to OOP. That's only one paradigm that they can be used for, and currently there is more of a shift toward functional programming paradigms.
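
As a rough sketch of the indirect-call cost, compare an interface with a virtual function against a plain class used through a template. All names here (`Sensor`, `VirtualTemperature`, `PlainTemperature`, `read_through_interface`, `read_direct`) and the `read()` stub are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for the question's external read().
float read(int pin) {
    assert(pin >= 0);
    return static_cast<float>(pin);
}

// OOP-style interface: callers only know the base class.
struct Sensor {
    virtual float get() const = 0;
    virtual ~Sensor() = default;
};

struct VirtualTemperature : Sensor {
    int pin;
    explicit VirtualTemperature(int pin) : pin(pin) {}
    float get() const override { return read(pin); }
};

// Same behavior without virtual functions.
struct PlainTemperature {
    int pin;
    float get() const { return read(pin); }
};

// Indirect call: the target is looked up at run time unless the compiler
// can prove the dynamic type.
float read_through_interface(const Sensor& s) { return s.get(); }

// Direct call: the target is known at compile time and can be inlined.
template <typename S>
float read_direct(const S& s) { return s.get(); }
```

On the usual ABIs the virtual version also carries a hidden vtable pointer in every object, so `sizeof(VirtualTemperature)` is larger than `sizeof(PlainTemperature)`, which matters when only 2KB of RAM is available.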

    The uniquely important feature of classes in C++ is that they permit RAII and value semantics for any kind of object.
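
A minimal RAII sketch, assuming a hypothetical `write_pin()` helper backed by an in-memory array instead of real hardware (`kPinCount`, `pin_state`, `PinHigh` and `sample_while_high` are all made-up names):

```cpp
#include <cassert>

// Hypothetical in-memory model of the board's pins.
const int kPinCount = 20;
bool pin_state[kPinCount] = {};

void write_pin(int pin, bool level) {
    assert(pin >= 0 && pin < kPinCount);
    pin_state[pin] = level;
}

// RAII guard: drives the pin high for exactly as long as the object lives.
class PinHigh {
    int pin;

  public:
    explicit PinHigh(int pin) : pin(pin) { write_pin(pin, true); }
    ~PinHigh() { write_pin(pin, false); }

    // Copying would release the pin twice, so forbid it.
    PinHigh(const PinHigh&) = delete;
    PinHigh& operator=(const PinHigh&) = delete;
};

bool sample_while_high(int pin) {
    PinHigh guard(pin);     // pin goes high here
    return pin_state[pin];  // evaluated while the guard is still alive
}                           // guard is destroyed here and the pin goes low
```

The cleanup runs on every path out of the scope without any explicit call, and a guard like this holds a single int, so it need not cost more than the hand-written equivalent.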