volatile, memory barriers and reordering traps
Here’s a bug that has cost more embedded engineers a weekend than almost any other:
the code works at -O0, and the moment you turn the optimiser on it hangs. Or it
works until you add an interrupt, or DMA, or a second core. The variable you’re
watching clearly changes — you can see it in the debugger — but your loop never
notices. The culprit is almost always the same: a mismatch between the order you
wrote memory accesses and the order the compiler and CPU actually perform them.
volatile and memory barriers are the tools that close that gap, and they’re widely
misunderstood.
Why volatile exists
The C compiler assumes memory only changes when your code changes it. That assumption is what lets it keep a value in a register instead of re-reading RAM — a huge optimisation, and completely wrong for anything that changes behind the compiler’s back: a hardware register, a variable written by an ISR, a buffer filled by DMA.
Figure 1: Without volatile the compiler caches the read in a register and loops forever. volatile forces the access to happen on every iteration, so the loop sees the hardware change.
volatile tells the compiler: this thing can change at any time; never cache it,
never assume, perform every read and write exactly as written. That’s it. Three jobs:
- No caching — every read goes to memory, every write reaches memory.
- No elision — the compiler can’t delete a read whose result “isn’t used” or fold two writes into one. With memory-mapped I/O, the act of reading or writing has side effects (clearing a flag, popping a FIFO), so each one must survive.
- Program order among volatile accesses: two volatile accesses keep their relative order.
That covers the three classic cases: memory-mapped peripheral registers, flags shared with an ISR on a single core, and any DMA target the CPU reads while hardware writes it.
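The first two cases can be sketched in a few lines of C. This is a minimal host-buildable sketch, not code for any particular part: fake_status_reg, STATUS_REG, STATUS_READY, rx_done, rx_isr and wait_rx_done are all hypothetical names, and an ordinary variable stands in for what would be a fixed datasheet address on real hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status register. On a real MCU this would be a fixed address
 * from the datasheet; a plain variable stands in so the sketch builds on a host. */
static uint32_t fake_status_reg;
#define STATUS_REG   (*(volatile uint32_t *)&fake_status_reg)
#define STATUS_READY (1u << 0)

/* Flag written by an interrupt handler and polled by the main loop. */
static volatile bool rx_done;

/* Stand-in for the interrupt handler (assumed name). */
void rx_isr(void)
{
    rx_done = true;
}

/* Poll the flag at most max_spins times. Because rx_done is volatile, the
 * compiler must re-read it on every iteration instead of hoisting the load
 * out of the loop. */
bool wait_rx_done(uint32_t max_spins)
{
    while (max_spins--) {
        if (rx_done) {
            return true;
        }
    }
    return false;
}

/* Every call performs a real read of the register through the volatile lvalue. */
bool status_ready(void)
{
    return (STATUS_REG & STATUS_READY) != 0;
}
```

Drop the volatile qualifier from rx_done and an optimising compiler is entitled to read it once before the loop, which is exactly the hang described in Figure 1.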
What volatile does not do
This is where the weekend goes. volatile is not a concurrency primitive:
- It doesn’t order volatile against non-volatile. The compiler may freely move an ordinary write across a volatile one.
- It doesn’t stop the CPU from reordering. On a weakly-ordered core (most Cortex-A, multi-core, anything with a store buffer) the hardware can complete your writes in a different order than the program issued them, and volatile says nothing about that.
- It isn’t atomic. volatile uint32_t x; x++; is still load-modify-store; if an interrupt modifies x between the load and the store, its update is overwritten and lost. On a 32-bit MCU even a 64-bit volatile read can tear into two halves.
So volatile solves “the compiler optimised away my access,” not “two contexts touch
the same data safely.” For that you need ordering, and sometimes atomicity.
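To make the lost update concrete, here is a small deterministic sketch. It spells counter++ out as the three steps the CPU performs and calls a stand-in ISR between the load and the store. The names counter, tick_isr and increment_with_interrupt are made up for illustration, and a real interrupt would of course arrive asynchronously rather than as a function call.

```c
#include <stdint.h>

/* Shared counter; volatile keeps every access, but adds no atomicity. */
static volatile uint32_t counter;

/* Stand-in for an ISR that also increments the counter (assumed name). */
static void tick_isr(void)
{
    counter++;
}

/* counter++ written as its three machine-level steps, with the "interrupt"
 * landing between the load and the store. */
uint32_t increment_with_interrupt(void)
{
    uint32_t tmp = counter;  /* load */
    tick_isr();              /* interrupt fires here: counter becomes tmp + 1 */
    tmp = tmp + 1;           /* modify the stale copy */
    counter = tmp;           /* store: overwrites the ISR's increment */
    return counter;
}
```

Two increments ran, yet starting from zero the counter ends at 1. Every access happened exactly as written; volatile did its job and the update was still lost.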
The reordering trap
Compilers and CPUs reorder independent memory accesses for speed. That’s invisible in single-threaded code — but the instant a second observer (an ISR, DMA, another core) can see your memory mid-sequence, order matters:
Figure 2 — The producer/consumer handoff. Without a barrier, the “data ready” flag can become visible before the data itself; a barrier between the two writes forbids the reorder.
This producer/consumer pattern is everywhere: fill a buffer, then set a “ready” flag;
the reader checks the flag, then uses the buffer. If the flag write floats before the
data write, the reader sees ready == 1 and reads stale garbage. Making both variables
volatile doesn’t save you — it keeps each access, but doesn’t forbid the reorder
across them, and does nothing about the CPU. You need a barrier.
Barriers: compiler vs CPU
There are two independent kinds of reordering, so there are two kinds of barrier:
- Compiler barrier. asm volatile("" ::: "memory") (often wrapped as barrier()). It tells the compiler: don’t move any memory access across this point, and assume all memory may have changed. Zero instructions are emitted; it’s purely a constraint on code generation. Enough on a single core where the only reorderer is the compiler.
- Hardware memory barrier. DMB (data memory barrier) / DSB (data synchronisation barrier) on ARM. These emit a real instruction that stops the CPU from completing memory accesses out of order across the barrier. You need them on weakly-ordered or multi-core systems, around DMA handoffs, and sometimes between a peripheral config write and the action that depends on it. A DMB issued with a "memory" clobber (as the CMSIS __DMB() intrinsic does) is also a compiler barrier, so it covers both.
The decision in one line: single-core, compiler-only reordering → compiler barrier. Anything where hardware can reorder (other cores, DMA, store buffers) → hardware barrier. And when you also need the read-modify-write to be indivisible → an atomic.
Figure 3: Pick by what you actually need. volatile keeps accesses; a compiler barrier orders them for the compiler; a hardware barrier orders them for the CPU; an atomic adds indivisibility.
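Here is how the handoff from Figure 2 might look with explicit barriers. Treat it as a sketch under assumptions: it uses GCC/Clang inline-asm syntax for the compiler barrier, and it uses the C11 atomic_thread_fence as a portable stand-in for a hand-written DMB (on ARM targets GCC and Clang typically lower a sequentially consistent fence to one). SINGLE_CORE_ONLY, handoff_barrier, payload, ready, produce and consume are hypothetical names.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Compiler barrier (GCC/Clang syntax): emits no instruction. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Hardware barrier: portable stand-in for DMB; it also constrains the compiler. */
#define hw_barrier() atomic_thread_fence(memory_order_seq_cst)

/* Pick by what can reorder: only the compiler, or the hardware as well. */
#ifdef SINGLE_CORE_ONLY
#define handoff_barrier() compiler_barrier()
#else
#define handoff_barrier() hw_barrier()
#endif

#define PAYLOAD_WORDS 4

static uint32_t payload[PAYLOAD_WORDS];
static volatile uint32_t ready;

/* Producer: data write, then barrier, then flag write. */
void produce(const uint32_t *src)
{
    for (int i = 0; i < PAYLOAD_WORDS; i++) {
        payload[i] = src[i];
    }
    handoff_barrier();  /* the data must not sink below the flag */
    ready = 1;
}

/* Consumer: flag read, then barrier, then data read. Returns 1 if data was copied. */
int consume(uint32_t *dst)
{
    if (!ready) {
        return 0;
    }
    handoff_barrier();  /* the data reads must not float above the flag */
    for (int i = 0; i < PAYLOAD_WORDS; i++) {
        dst[i] = payload[i];
    }
    ready = 0;
    return 1;
}
```

Note that payload is an ordinary array. The barrier, not volatile, is what keeps the data writes ahead of the flag. For sharing across cores in portable C, release/acquire atomics are the more defensible form of the same pattern.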
Atomics: when you need order and indivisibility
C11 <stdatomic.h> (_Atomic, atomic_load_explicit/atomic_store_explicit with a memory
order) is the right modern tool when data is shared across cores or you need a true
read-modify-write. An atomic_store_explicit(&ready, 1, memory_order_release) paired with
an atomic_load_explicit(&ready, memory_order_acquire) gives you the producer/consumer
ordering of Figure 2, barriers and indivisibility included, in one portable construct:
the compiler emits the right DMB for your target. (Plain atomic_store and atomic_load
take no order argument and default to sequential consistency.) On bare-metal single-core
MCUs you often don’t need full atomics, but the moment you have an RTOS with preemption
or an SMP part, reach for them instead of hand-rolling volatile + barriers.
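A minimal sketch of that pairing, plus an indivisible counter, using only standard C11 calls. The variable and function names (message, msg_ready, event_count, publish, try_receive, count_event) are invented for the example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t message;          /* ordinary data, published via the flag */
static atomic_bool msg_ready;     /* release/acquire handoff flag */
static atomic_uint event_count;   /* counter updated from several contexts */

/* Producer: write the data, then publish with a release store. Everything
 * written before the store is visible to whoever acquires the flag. */
void publish(uint32_t value)
{
    message = value;
    atomic_store_explicit(&msg_ready, true, memory_order_release);
}

/* Consumer: an acquire load of the flag orders the later read of message. */
bool try_receive(uint32_t *out)
{
    if (!atomic_load_explicit(&msg_ready, memory_order_acquire)) {
        return false;
    }
    *out = message;
    return true;
}

/* Indivisible increment; returns the new count. Relaxed order is enough here
 * because the counter does not publish any other data. */
unsigned count_event(void)
{
    return atomic_fetch_add_explicit(&event_count, 1u, memory_order_relaxed) + 1u;
}
```

The consumer never touches message unless the acquire load saw true, and that is the whole contract. There are no hand-placed barriers to forget.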
The practical rules
- MMIO and ISR-shared flags → volatile. Always. A non-volatile peripheral access is a latent bug waiting for -O2.
- volatile is necessary but not sufficient for sharing. It stops caching; it does not order against ordinary memory or stop the CPU. Don’t reach for it to fix a race.
- Producer/consumer handoff → a barrier (or release/acquire atomics). Data write, then barrier, then flag write. Flag read, then barrier, then data read.
- DMA buffers → volatile and a hardware barrier (plus cache maintenance on M7-class parts; see the DMA gotchas).
- Need an indivisible update across contexts → atomic, not volatile++.
- Reproduce these bugs with the optimiser on. A race that “works” at -O0 is not fixed; it’s hidden.
Field notes
- volatile on a whole struct or a function pointer is usually a smell: it’s pointing at a design that should use a barrier or atomic. Apply it to the specific accesses that touch hardware or shared state.
- The compiler is allowed to do everything the standard permits, not what you hoped. “It worked before” means the optimiser hadn’t found that freedom yet.
- Disassemble the hot loop once. Seeing the single cached ldr outside the loop (Figure 1) versus the ldr inside it teaches volatile better than any article.
- On weakly-ordered cores, absence of a crash is not absence of a bug: reordering faults are timing-dependent and ship to the field looking fine on your bench.