I recently spent far too long fighting a pernicious, randomly occurring bug in the scheduler/context switcher code for my AVR operating system, AVRoxide. The bug had two causes: a good old-fashioned mistake, and my not properly understanding how differently the ATmega4809 processor handles interrupts compared to its predecessors.

When I was trying to work out what was going wrong, Google found remarkably little helpful information - so I’ll write up the explanation here in the hope it may help someone else when they Google “why are interrupts weird on the ATmega” :-).

A stupid bug…

So, essentially the bug in my code was in the context-restoring assembler code. For the uninitiated, a context save is where we save the entire state of the processor - all the registers etc. - some place safe, and a context restore is where we load them back again, restoring the processor state to exactly what it was when it was saved. A context switch, where we switch from one thread to another, is essentially just “context save the current thread, then context restore the one we are switching to”.

The bug in my code was that my context restore function was obediently restoring the SREG status register (as it should.) Unfortunately, I was overlooking the fact that on the AVR, the global “interrupts enabled” flag is part of SREG, meaning that as soon as I restored it I was re-enabling interrupts - in the middle of the context restore routine, which is something you most definitely do not want interrupted.
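To make the mistake concrete, here is a tiny host-side C sketch. It is not AVRoxide's actual assembler, just a model of what happens to the status byte. The only hardware fact it relies on is that the Global Interrupt Enable (I) flag is bit 7 of SREG; the function names are made up for illustration, and the "keep I" variant is one possible way to avoid the problem, not necessarily the fix I ended up with.

```c
#include <stdint.h>

/* Bit 7 of SREG is the Global Interrupt Enable (I) flag on AVR. */
#define SREG_I 0x80u

/* Naive restore (hypothetical name): the saved byte goes back unchanged,
 * so a context saved with interrupts enabled re-enables them on the spot. */
uint8_t restore_sreg_naive(uint8_t saved)
{
    return saved;
}

/* Safer restore (hypothetical name): take the arithmetic flags from the
 * saved context, but keep whatever I flag is currently in effect. */
uint8_t restore_sreg_keep_i(uint8_t saved, uint8_t current)
{
    return (uint8_t)((saved & (uint8_t)~SREG_I) | (current & SREG_I));
}
```

With a saved value of 0x82 (I flag plus the Zero flag), the naive version hands back 0x82, interrupts and all, while the second version called with a current SREG of 0x00 hands back 0x02 and leaves interrupts off until the restore routine is ready for them.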

Dumb, of course. But you’d think I would have found it pretty quickly… Except I didn’t, because it actually worked fine on the ATmega4809. In most of my operating system, context switches occur inside interrupt service routines anyway - and I never noticed this bug. It also worked fine on the ATmega328P. Only when I introduced “context switching in userland” (i.e. threads yielding, outside of an interrupt), did weird things start to happen.

The Critical Difference

This code should have broken from day one. But it didn’t… Why not?

Essentially, what’s different is this: The ATmega4809 (and I guess other “zero series” AVRs) actually doesn’t pay any attention to the global interrupt enable bit while it’s inside an Interrupt Service Routine:

On the older AVRs, when you enter an ISR, the chip clears the global interrupt enable flag in SREG. This is what stops interrupts interrupting themselves. Then, when you exit the ISR, the reti instruction sets the interrupt enable flag again.

This is why the bad code worked fine on the ‘328P. OK, it was actually restoring SREG badly (including the interrupt enable flag), but since interrupts were always disabled when a context was saved (because contexts were only ever saved from within an interrupt), restoring one never re-enabled interrupts by accident.

But on the newer “zero series” AVRs:

The ATmega4809 does not clear the interrupt enable bit when it enters an interrupt service routine!

OK, but in that case, how does the processor know not to interrupt an interrupt? There is a separate register, CPUINT.STATUS, whose flags indicate whether or not the processor is in an ISR, and this is what blocks interrupts from interrupting themselves.

So, on the ‘4809 this code worked for a different reason: In fact, I was incorrectly restoring the interrupt-enabled flag, but it didn’t matter because as long as I was inside an interrupt the processor was ignoring it anyway.
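The two behaviours can be put side by side in a small host-side C model. This is a sketch under some loud assumptions: it only models ordinary (level 0) interrupts and ignores priority levels and non-maskable interrupts entirely, the struct and function names are invented for illustration, and I'm using the datasheet's LVL0EX name for the "an ISR is executing" flag in CPUINT.STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

#define SREG_I        0x80u /* Global Interrupt Enable flag, bit 7 of SREG */
#define STATUS_LVL0EX 0x01u /* "level 0 ISR executing" flag in CPUINT.STATUS */

/* Hypothetical model of just the state that matters here. */
struct cpu_model {
    uint8_t sreg;          /* status register */
    uint8_t cpuint_status; /* only used by the zero-series model */
};

/* Older AVRs (e.g. ATmega328P): entering an ISR clears the I flag. */
void classic_enter_isr(struct cpu_model *cpu)
{
    cpu->sreg &= (uint8_t)~SREG_I;
}

/* Older AVRs: a pending interrupt is taken whenever I is set. */
bool classic_accepts_irq(const struct cpu_model *cpu)
{
    return (cpu->sreg & SREG_I) != 0;
}

/* Zero-series: entering an ISR leaves I alone and sets LVL0EX instead. */
void zero_enter_isr(struct cpu_model *cpu)
{
    cpu->cpuint_status |= STATUS_LVL0EX;
}

/* Zero-series: an interrupt needs I set AND no ISR already executing. */
bool zero_accepts_irq(const struct cpu_model *cpu)
{
    return (cpu->sreg & SREG_I) != 0 &&
           (cpu->cpuint_status & STATUS_LVL0EX) == 0;
}
```

In this model, a bad SREG restore inside an ISR is harmless on both families, but for different reasons: on the older chips every saved SREG has I clear, and on the zero-series the LVL0EX flag blocks the interrupt even when I is set.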

There is a corollary note to this:

On the ATmega4809, the reti instruction also does not set the interrupt enable bit. In fact, it just clears the relevant CPUINT.STATUS bit.

This is a pretty subtle but important change in the way interrupts are handled. We are used to assuming that reti can be effectively used as an atomic “enable interrupts and return” instruction - but on the zero-series devices, this is no longer true.
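Here is the same point as a host-side C sketch, again only a model with invented names, and again assuming the datasheet's LVL0EX name for the level 0 "ISR executing" flag. It shows what each flavour of reti does to the two pieces of state; the stack pop of the return address is left out because it is the same on both.

```c
#include <stdint.h>

#define SREG_I        0x80u /* Global Interrupt Enable flag, bit 7 of SREG */
#define STATUS_LVL0EX 0x01u /* "level 0 ISR executing" flag in CPUINT.STATUS */

/* Hypothetical model of the state reti can touch. */
struct cpu_state {
    uint8_t sreg;
    uint8_t cpuint_status;
};

/* Older AVRs: reti returns and sets the I flag. */
void classic_reti(struct cpu_state *cpu)
{
    cpu->sreg |= SREG_I;
}

/* Zero-series: reti returns and clears the executing flag; SREG is untouched. */
void zero_reti(struct cpu_state *cpu)
{
    cpu->cpuint_status &= (uint8_t)~STATUS_LVL0EX;
}
```

Start both models with interrupts disabled (SREG = 0x00), as you would at the end of a userland yield, and run the matching reti: the older chip comes out with I set, while the zero-series chip comes out with I still clear. So code on the zero-series that relies on reti to turn interrupts back on has to set the I flag some other way.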

Further Reading

This stuff is documented in the datasheets. But it’s not exactly “called out” in them. It’s a very subtle change in behaviour, and it may be that you never notice… But if you do, you could be banging your head against a wall for a couple of days.

In fact, it means that the behaviour of the ATmega4809 - and I guess of other zero-series chips - explicitly contradicts the AVR Instruction Set Manual, which states that:

RETI - Return from Interrupt

Returns from the interrupt. The return address is loaded from the STACK, and the Global Interrupt Enable bit is set.

For clarity - I say again… This is not true on the ATmega4809. reti has no effect on the Global Interrupt Enable bit on these chips, and actually operates on a different register, CPUINT.STATUS.

This is the sort of thing which Microchip could really do a better job of calling out in the datasheets - but anyway, consider it a lesson learned. Now I know, and so do you.

Links for anyone wanting to dig further: