volatile, memory barriers and reordering traps

Here’s a bug that has cost more embedded engineers a weekend than almost any other: the code works at -O0, and the moment you turn the optimiser on it hangs. Or it works until you add an interrupt, or DMA, or a second core. The variable you’re watching clearly changes (you can see it in the debugger), but your loop never notices. The culprit is almost always the same: a mismatch between the memory accesses you wrote and the ones the compiler and CPU actually perform, both which accesses happen and in what order. volatile and memory barriers are the tools that close that gap, and they’re widely misunderstood.

Why `volatile` exists

The C compiler assumes memory only changes when your code changes it. That assumption is what lets it keep a value in a register instead of re-reading RAM — a huge optimisation, and completely wrong for anything that changes behind the compiler’s back: a hardware register, a variable written by an ISR, a buffer filled by DMA.

Figure 1 — Without volatile the compiler caches the read in a register and loops forever. volatile forces the access to happen on every iteration, so the loop sees the hardware change.

volatile tells the compiler: this thing can change at any time; never cache it, never assume, perform every read and write exactly as written. That’s it. Three jobs:

No caching: every read goes to memory, every write reaches memory.
No elision: the compiler can’t delete a read whose result “isn’t used” or fold two writes into one. With memory-mapped I/O, the act of reading or writing has side effects (clearing a flag, popping a FIFO), so each one must survive.
Program order among volatile accesses: the compiler keeps two volatile accesses in their relative order. This constrains the compiler only, not the CPU.

That covers the three classic cases: memory-mapped peripheral registers, flags shared with an ISR on a single core, and any DMA target the CPU reads while hardware writes it.
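The ISR-flag case can be sketched as follows. This is a minimal illustration, not production code: the names rx_done, uart_rx_isr and wait_for_rx are hypothetical, and the "ISR" is an ordinary function so the sketch also builds on a host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag shared between an ISR and the main loop.
 * Without volatile, an optimising compiler may read it once,
 * keep the value in a register, and spin on that copy. */
static volatile bool rx_done = false;

/* Hypothetical ISR; on real hardware the vector table would call it. */
void uart_rx_isr(void)
{
    rx_done = true;
}

/* Poll the flag at most max_spins times.
 * Returns true if the flag was seen set (and clears it). */
bool wait_for_rx(uint32_t max_spins)
{
    while (max_spins--) {
        if (rx_done) {        /* volatile: a real load on every iteration */
            rx_done = false;
            return true;
        }
    }
    return false;
}
```

Removing the volatile qualifier and compiling with optimisation on is a quick way to see the load hoisted out of the loop in the disassembly.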

What `volatile` does not do

This is where the weekend goes. volatile is not a concurrency primitive:

It doesn’t order volatile against non-volatile. The compiler may freely move an ordinary write across a volatile one.
It doesn’t stop the CPU from reordering. On a weakly-ordered core (most Cortex-A parts, multi-core systems, anything with a store buffer) the hardware can complete your writes in a different order than the program issued them, and volatile says nothing about that.
It isn’t atomic. volatile uint32_t x; x++; is still a separate load, add and store; an interrupt that modifies x between the load and the store has its update silently overwritten. On a 32-bit MCU even a plain read of a 64-bit volatile can tear into two halves.

So volatile solves “the compiler optimised away my access,” not “two contexts touch the same data safely.” For that you need ordering, and sometimes atomicity.
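To make the lost update concrete, here is a deterministic simulation of what happens when an interrupt lands inside a volatile increment. It is only a sketch: fake_isr and increment_with_interrupt are hypothetical names, and the "interrupt" is a plain function call placed in the window between the load and the store.

```c
#include <stdint.h>

static volatile uint32_t counter;

/* Stand-in for an ISR that also increments the shared counter. */
static void fake_isr(void)
{
    counter++;
}

/* Spells out the three steps of `counter++` and lets the
 * simulated interrupt fire between the load and the store. */
uint32_t increment_with_interrupt(void)
{
    uint32_t tmp = counter;   /* load                              */
    fake_isr();               /* interrupt fires in the window     */
    tmp = tmp + 1;            /* modify the stale copy             */
    counter = tmp;            /* store: overwrites the ISR's write */
    return counter;
}
```

Each call performs two increments (one in the main path, one in the simulated ISR), yet the counter only advances by one. volatile kept every access; it did nothing to make the sequence indivisible.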

The reordering trap

Compilers and CPUs reorder independent memory accesses for speed. That’s invisible in single-threaded code — but the instant a second observer (an ISR, DMA, another core) can see your memory mid-sequence, order matters:

Figure 2 — The producer/consumer handoff. Without a barrier, the “data ready” flag can become visible before the data itself; a barrier between the two writes forbids the reorder.

This producer/consumer pattern is everywhere: fill a buffer, then set a “ready” flag; the reader checks the flag, then uses the buffer. If the flag write floats before the data write, the reader sees ready == 1 and reads stale garbage. Making the flag volatile doesn’t save you: the compiler may still move the ordinary buffer writes past it. Making both volatile stops the compiler from swapping them but does nothing about the CPU. You need a barrier.

Barriers: compiler vs CPU

There are two independent kinds of reordering, so there are two kinds of barrier:

Compiler barrier. asm volatile("" ::: "memory") in GCC and Clang (often wrapped as barrier()). It tells the compiler: don’t move any memory access across this point, and assume all memory may have changed. Zero instructions are emitted; it’s purely a constraint on code generation. That is enough on a single core where the only reorderer is the compiler.
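Hardware memory barrier. DMB (data memory barrier) / DSB (data synchronisation barrier) on ARM. These emit a real instruction that stops the CPU from completing memory accesses out of order across the barrier; DSB is the stronger of the two, also holding up later instructions until the earlier accesses have completed. You need them on weakly-ordered or multi-core systems, around DMA handoffs, and sometimes between a peripheral config write and the action that depends on it. Issued through an intrinsic that carries a "memory" clobber (CMSIS __DMB() with GCC, for example), a DMB is also a compiler barrier, so it covers both; a bare asm volatile("dmb") without the clobber leaves the compiler free to move ordinary accesses across it.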

The decision in one line: single-core, compiler-only reordering → compiler barrier. Anything where hardware can reorder (other cores, DMA, store buffers) → hardware barrier. And when you also need the read-modify-write to be indivisible → an atomic.
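The two barrier kinds and the handoff they protect might look like the sketch below. Everything here is an assumption for illustration: the macro and function names are hypothetical, the inline-asm syntax is GCC/Clang, and on non-ARM hosts the hardware barrier falls back to the GCC/Clang builtin __sync_synchronize(). In a CMSIS project you would more likely use __DMB() directly.

```c
#include <stdint.h>

/* Compiler-only barrier: emits no instruction (GCC/Clang syntax). */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Hardware barrier. The "memory" clobber makes it a compiler
 * barrier as well. Non-ARM hosts use the builtin full fence. */
#if defined(__aarch64__) || (defined(__ARM_ARCH) && __ARM_ARCH >= 7)
#define hw_barrier() __asm__ volatile("dmb sy" ::: "memory")
#else
#define hw_barrier() __sync_synchronize()
#endif

/* Assumption: hardware can reorder. On a single core where only
 * the compiler reorders, compiler_barrier() would be enough. */
#define handoff_barrier() hw_barrier()

static uint8_t           buf[4];   /* ordinary data            */
static volatile uint32_t ready;    /* flag polled by the reader */

void producer_publish(const uint8_t src[4])
{
    for (int i = 0; i < 4; i++)
        buf[i] = src[i];       /* 1. write the data            */
    handoff_barrier();         /* 2. data before flag          */
    ready = 1;                 /* 3. publish                   */
}

int consumer_take(uint8_t dst[4])
{
    if (!ready)                /* 1. read the flag             */
        return 0;
    handoff_barrier();         /* 2. flag before data          */
    for (int i = 0; i < 4; i++)
        dst[i] = buf[i];       /* 3. read the data             */
    handoff_barrier();         /* data reads before the release */
    ready = 0;                 /* hand the buffer back         */
    return 1;
}
```

Note that the buffer itself is not volatile here; the barrier's "memory" clobber is what stops the compiler from moving the buffer accesses across the flag access.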

Figure 3 — Pick by what you actually need. volatile keeps accesses; a compiler barrier orders them for the compiler; a hardware barrier orders them for the CPU; an atomic adds indivisibility.

Atomics: when you need order and indivisibility

C11 <stdatomic.h> (_Atomic, atomic_load/atomic_store, and the _explicit variants that take a memory order) is the right modern tool when data is shared across cores or you need a true read-modify-write. An atomic_store_explicit(&ready, 1, memory_order_release) paired with an atomic_load_explicit(&ready, memory_order_acquire) gives you the producer/consumer ordering of Figure 2, with the barriers and the indivisibility, in one portable construct: the compiler emits whatever barrier instructions your target needs (on ARM, typically a DMB or an acquire/release load/store instruction). On bare-metal single-core MCUs you often don’t need full atomics, but the moment you have an RTOS with preemption or an SMP part, reach for them instead of hand-rolling volatile + barriers.
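A minimal release/acquire handoff with C11 atomics could look like this. It is a one-shot sketch under assumptions: payload, ready_flag, publish and try_consume are hypothetical names, and a real queue would also need a way to hand the slot back to the producer.

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t    payload;      /* plain data, ordered by the flag      */
static atomic_uint ready_flag;   /* static storage, so it starts at zero */

/* Producer: write the data, then publish with release semantics so
 * the payload write cannot become visible after the flag write. */
void publish(uint32_t value)
{
    payload = value;
    atomic_store_explicit(&ready_flag, 1u, memory_order_release);
}

/* Consumer: an acquire load of the flag guarantees that, once the
 * flag reads as 1, the payload written before the release is visible. */
int try_consume(uint32_t *out)
{
    if (atomic_load_explicit(&ready_flag, memory_order_acquire) == 0u)
        return 0;
    *out = payload;
    return 1;
}
```

Neither variable is volatile: the atomic operations already forbid the compiler from caching or eliding the flag access, and the memory orders cover both compiler and CPU reordering.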

The practical rules

MMIO and ISR-shared flags → volatile. Always. A non-volatile peripheral access is a latent bug waiting for -O2.
volatile is necessary but not sufficient for sharing. It stops caching; it does not order against ordinary memory or stop the CPU. Don’t reach for it to fix a race.
Producer/consumer handoff → a barrier (or release/acquire atomics). Data write, then barrier, then flag write. Flag read, then barrier, then data read.
DMA buffers → volatile and a hardware barrier (plus cache maintenance on M7-class parts — see the DMA gotchas).
Need an indivisible update across contexts → atomic, not volatile++.
Reproduce these bugs with the optimiser on. A race that “works” at -O0 is not fixed; it’s hidden.
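For the indivisible-update rule, a hedged sketch of a counter shared between contexts, using an atomic read-modify-write instead of volatile++. The names event_count, on_event and events_seen are hypothetical. On cores without exclusive-access instructions the toolchain may implement this through library calls, so check what your target actually emits.

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint event_count;   /* static storage, so it starts at zero */

/* Safe to call from an ISR and from thread context: the increment
 * is a single indivisible read-modify-write. Relaxed ordering is
 * used because only the count matters here, not ordering against
 * other data. */
void on_event(void)
{
    atomic_fetch_add_explicit(&event_count, 1u, memory_order_relaxed);
}

/* Returns a snapshot of the number of events recorded so far. */
uint32_t events_seen(void)
{
    return (uint32_t)atomic_load_explicit(&event_count, memory_order_relaxed);
}
```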

Field notes

volatile on a whole struct or a function pointer is usually a smell — it’s pointing at a design that should use a barrier or atomic. Apply it to the specific accesses that touch hardware or shared state.
The compiler is allowed to do everything the standard permits, not what you hoped. “It worked before” means the optimiser hadn’t found that freedom yet.
Disassemble the hot loop once. Seeing the single cached ldr outside the loop (Figure 1) versus the ldr inside it teaches volatile better than any article.
On weakly-ordered cores, absence of a crash is not absence of a bug — reordering faults are timing-dependent and ship to the field looking fine on your bench.