I have two functions, both are similar to this:
void Bit_Delay()
{
//this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
volatile char z = 12;
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
}
(The second function is analogous instead it uses 18 instead of 12 for the counter).
The code works flawlessly as it is (with z appearing locally to each function internally), but I'm trying to cram a little more functionality into my executable before I hit the (horribly) limited FLASH memory available.
My thought was to promote the z
variable to be a global one (a volatile static). Because these two functions are effectively atomic operations (it's a single-threaded CPU and there are no interrupts at play to interfere), I figured that these two functions could share the single variable, thus saving a tiny bit of stack manipulation.
This didn't work. It is clear that the compiler is optimising-out much of the code related to z
completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.
I realised that I needed the z variable to be marked volatile to prevent the compiler from removing code it knows is counting a fixed (and thus reducible to a constant) number each time.
Question:
Can I optimise this any further, and trick the compiler into keeping both functions intact? I'm compiling with "-Os" (optimise for small binary).
Here's the entire program verbatim for those playing along at home...
#include <avr/io.h>
#define RX_PIN (1 << PORTB0) //physical pin 3
#define TX_PIN (1 << PORTB1) //physical pin 1
void Bit_Delay()
{
//this is a tuned tight loop for 8 MHz to generate timings for 9600 baud
volatile char z = 12;
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
}
void Serial_TX_Char(char c)
{
char i;
//start bit
PORTB &= ~TX_PIN;
Bit_Delay();
for(i = 0 ; i < 8 ; i++)
{
//output the data bits, LSB first
if(c & 0x01)
PORTB |= TX_PIN;
else
PORTB &= ~TX_PIN;
c >>= 1;
Bit_Delay();
}
//stop bit
PORTB |= TX_PIN;
Bit_Delay();
}
char Serial_RX_Char()
{
char retval = 0;
volatile char z = 18; //1.5 bits delay
//wait for idle high
while((PINB & RX_PIN) == 0)
{}
//wait for start bit falling-edge
while((PINB & RX_PIN) != 0)
{}
//1.5 bits delay
while(z)
{
z++;
z++;
z++;
z++;
z -= 5;
}
for(z = 0 ; z < 8 ; z++)
{
retval >>= 1; //make space for the new bit
retval |= (PINB & RX_PIN) << (8 - RX_PIN); //get the bit and store it
Bit_Delay();
}
return retval;
}
int main(void)
{
CCP = 0xd8; //protection signature for clock registers (see datasheet)
CLKPSR = 0x00; //set the clock prescaler to "div by 1"
DDRB |= TX_PIN;
PORTB |= TX_PIN; //idle high
while (1)
Serial_TX_Char(Serial_RX_Char() ^ 0x20);
}
The target CPU is an Atmel ATTiny5
microcontroller, the code above uses up 94.1% of the FLASH memory! If you connect to the chip using a serial port at 9600 Baud, 8N1, you can type characters in and it returns them with bit 0x20 flipped (uppercase to lowercase and vice-versa).
This is not a serious project of course, I'm just experimenting to see how much functionality I could cram into this chip. I'm not going to bother with rewriting this in assembly, I seriously doubt I could do a better job than GCC's optimiser!
@Frank asked about the IDE / compiler I'm using...
Microchip Studio (7.0.2542)
The "All Options" string that is passed to the compiler avr-gcc
...
-x c -funsigned-char -funsigned-bitfields -DDEBUG -I"C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -g2 -Wall -mmcu=attiny5 -B "C:\Program Files (x86)\Atmel\Studio\7.0\Packs\atmel\ATtiny_DFP\1.8.332\gcc\dev\attiny5" -c -std=gnu99 -MD -MP -MF "$(@:%.o=%.d)" -MT"$(@:%.o=%.d)" -MT"$(@:%.o=%.o)"
I question the following assumption:
This didn't work. It is clear that the compiler is optimising-out much of the code related to z completely! The code then fails to function properly (running far too fast), and the size of the compiled binary drops to about 50% or so.
Looking at https://gcc.godbolt.org/z/sKdz3h8oP, it seems like the loops are actually being performed, however, for whatever reason each z++
, when using a global volatile z
goes from:
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
ld r20,Y
subi r28,lo8((1))
sbci r29,hi8((1))
subi r20,lo8(-(1))
subi r28,lo8(-(1))
sbci r29,hi8(-(1))
st Y,r20
subi r28,lo8((1))
sbci r29,hi8((1))
to:
lds r20,z
subi r20,lo8(-(1))
sts z,r20
You will need to recalibrate your 12, 18, and 5 constants to get your baud rate correct (since fewer instructions are executed in each loop), but the logic is there in the compiled version.
To be clear: This looks really weird to me, the local volatile version is clearly not being compiled correctly. I did find an old gcc bug along these lines: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=33970, but it seems to not cover the local variable case.