Metastability and clock domain crossing, explained

This is the bug class that humbles experienced FPGA engineers: a design that simulates perfectly, passes timing, works on the bench for hours — and then glitches once a day in the field, or only on one board, or only when it gets warm. Almost always it’s a clock domain crossing done wrong. Whenever a signal generated by one clock is sampled by an unrelated clock, you’re exposed to metastability, and you can’t simulate your way out of it because it’s a physical, probabilistic effect. You can only contain it — and the techniques are simple once you see why they’re needed.

What metastability actually is

A flip-flop only captures a clean value if its input is stable for a setup time before the clock edge and a hold time after. Meet that and the output is a solid 0 or 1. Violate it — by changing the input right at the edge — and the output can enter a metastable state: hovering between 0 and 1, at an illegal voltage, for an unpredictable time before it randomly resolves one way or the other.

Figure 1 — A setup/hold violation. When the async input D changes inside the capture window, the flop output Q hovers between logic levels before resolving to a random value.

The crucial point: when two clocks are unrelated (different sources, or the same frequency with drifting phase), the relative timing of a signal and the capturing edge is effectively random. So given enough time, the signal will eventually change right at an edge, and the flop will go metastable. It’s not a question of if, but how often — and whether your circuit has settled before it uses the value.
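As a rough illustration of the "how often" part, the sketch below models unrelated clocks by giving every input transition a uniformly random phase within the destination clock period and counting how many land inside the capture window. The 100 MHz clock and 200 ps window are assumed, illustrative numbers, not values for any particular device, and `window_hit_fraction` is a hypothetical helper name.

```python
import random

def window_hit_fraction(t_clk, t_window, n_events, seed=0):
    """Fraction of input transitions that land inside the capture window.

    Each transition gets a uniformly random phase relative to the capturing
    edge, which is the usual model for two unrelated clocks.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_events):
        phase = rng.uniform(0.0, t_clk)  # time since the last capturing edge
        if phase < t_window:             # window modelled as [0, t_window)
            hits += 1
    return hits / n_events

T_CLK = 10e-9      # 100 MHz destination clock (assumed)
T_WINDOW = 0.2e-9  # 200 ps setup+hold window (assumed, illustrative)

measured = window_hit_fraction(T_CLK, T_WINDOW, 200_000)
expected = T_WINDOW / T_CLK  # analytic fraction of transitions that violate timing
print(f"measured {measured:.4f}, analytic {expected:.4f}")
```

Under this model the fraction is simply the window width divided by the clock period, so the rate of violations scales with both the data toggle rate and the clock frequency. Only a small share of those violations produce a long-lived metastable event, but the count is never zero.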

The two-flop synchronizer

You can’t stop metastability, but you can give it time to decay. The standard single-bit fix is two flip-flops in series, both clocked by the destination clock:

Figure 2 — The double-flop synchronizer. FF1 may go metastable, but it has a whole clock period to resolve before FF2 samples it — so the downstream logic sees a clean value.
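The structure in Figure 2 can be sketched as a cycle-level behavioural model. This is only a Python stand-in for what would normally be two registers in HDL. The class and method names are hypothetical, and the model captures the pipeline and its latency, not the analogue metastability itself.

```python
class TwoFlopSynchronizer:
    """Cycle-level model of two flops in series on the destination clock."""

    def __init__(self):
        self.ff1 = 0  # first stage: the only flop that sees the async input
        self.ff2 = 0  # second stage: the output downstream logic may use

    def clock_edge(self, async_in):
        # Both flops update on the same edge, so FF2 captures the value FF1
        # held before this edge, i.e. after FF1 has had a period to settle.
        self.ff2, self.ff1 = self.ff1, async_in
        return self.ff2

sync = TwoFlopSynchronizer()
outputs = [sync.clock_edge(bit) for bit in [0, 1, 1, 1, 0, 0]]
print(outputs)  # [0, 0, 1, 1, 1, 0]
```

The output is the input delayed through both stages. Nothing downstream ever reads `ff1` directly, which is the whole point of the idiom.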

FF1 samples the async signal and may go metastable. But its output isn’t used directly; it’s given nearly a full destination-clock period to settle (the period minus FF1’s clock-to-Q delay and FF2’s setup time), and only then does FF2 sample it. By the time FF2’s edge arrives, FF1’s output has (almost certainly) resolved to a valid level, so FF2’s output, the synchronized signal, is clean. “Almost certainly” is quantified by MTBF (mean time between failures), which rises exponentially with the settling time you allow: a faster clock gives less margin, so very high-speed domains sometimes use three flops. This is the single most important idiom in CDC, and it costs two flip-flops plus a cycle or two of latency.
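The exponential dependence can be made concrete with the commonly used first-order model MTBF = exp(t_resolve / tau) / (T_w * f_clk * f_data), where tau is the flop's resolution time constant and T_w its metastability window. The sketch below plugs in assumed, illustrative numbers. Real values of tau and T_w come from the device vendor's characterisation, and the function name is hypothetical.

```python
import math

def synchronizer_mtbf(t_resolve, tau, t_window, f_clk, f_data):
    """First-order MTBF estimate in seconds.

    t_resolve: settling time allowed before the value is used (s)
    tau:       flop resolution time constant (s)
    t_window:  metastability window of the first flop (s)
    f_clk:     destination clock frequency (Hz)
    f_data:    toggle rate of the asynchronous input (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)

# Illustrative assumptions only, not datasheet values.
TAU = 50e-12        # 50 ps resolution time constant
T_WINDOW = 100e-12  # 100 ps metastability window
F_CLK = 200e6       # 200 MHz destination clock
F_DATA = 10e6       # input toggles 10 million times per second
T_CLK = 1.0 / F_CLK
T_OVERHEAD = 1e-9   # assumed clock-to-Q + routing + setup per stage

two_flop = synchronizer_mtbf(T_CLK - T_OVERHEAD, TAU, T_WINDOW, F_CLK, F_DATA)
# A third flop adds roughly one more period of settling time.
three_flop = synchronizer_mtbf(2 * (T_CLK - T_OVERHEAD), TAU, T_WINDOW, F_CLK, F_DATA)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(f"two flops:   {two_flop / SECONDS_PER_YEAR:.3e} years")
print(f"three flops: {three_flop / SECONDS_PER_YEAR:.3e} years")
```

Every extra tau of settling time multiplies the MTBF by e. That is why one additional flop changes the result by many orders of magnitude, and why shortening the clock period erodes the margin so quickly.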

The trap: never synchronize a bus bit-by-bit

Here’s where good engineers still get bitten. The two-flop synchronizer works for one bit. Apply it independently to each bit of a multi-bit bus and you create a new bug: the bits arrive with slightly different delays and each one resolves on its own, so on a given destination edge some bits are captured with their new value and others still with their old one. For a cycle the destination can read a word that never existed.

Figure 3 — Why a bus needs more than synchronizers. Synchronizing each bit separately lets them resolve at different times, producing transient garbage words. Cross multi-bit data through an async FIFO with Gray-coded pointers, or a handshake.

Take a counter going from 0111 to 1000: all four bits change at once. Two-flop each bit and the reader might briefly see 0000, 1111, or any other mix of old and new bits, which for this transition means any of the 16 possible values. The fixes:

Async FIFO. The data sits in a dual-port RAM written by clock A and read by clock B; only the read/write pointers cross domains. Those pointers are Gray-coded — exactly one bit changes per increment — so each can be safely two-flop synchronized without the multi-bit problem. Full/empty flags come from comparing the synchronized pointers. This is the workhorse for streaming data between clocks.
Handshake. For occasional single transfers, a request/acknowledge protocol: the sender raises req (a single bit, synchronized), the receiver latches the now-stable data and raises ack (synchronized back). Slow, but trivially correct.
Gray code for any counter that crosses a domain, so the multi-bit value only ever changes one bit at a time.
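To see why Gray code removes the multi-bit problem, the sketch below enumerates every word a destination could capture during one counter step, assuming each changing bit independently shows either its old or its new value. The helper names are hypothetical. The binary-to-Gray conversion `n ^ (n >> 1)` is the standard reflected Gray code.

```python
from itertools import product

def bin_to_gray(n):
    """Standard reflected binary Gray code."""
    return n ^ (n >> 1)

def possible_samples(old, new, width):
    """Every word the destination could capture if each changing bit
    independently shows either its old or its new value."""
    changing = [i for i in range(width) if ((old ^ new) >> i) & 1]
    words = set()
    for choice in product([0, 1], repeat=len(changing)):
        word = old
        for bit, take_new in zip(changing, choice):
            if take_new:
                word ^= 1 << bit  # this bit has already switched to its new value
        words.add(word)
    return sorted(words)

binary = possible_samples(0b0111, 0b1000, 4)
gray = possible_samples(bin_to_gray(7), bin_to_gray(8), 4)

print(len(binary))                       # 16
print([format(w, "04b") for w in gray])  # ['0100', '1100']
```

In plain binary the step from 7 to 8 can be observed as any 4-bit value. In Gray code only one bit differs, so the destination sees either the old count or the new one, and both are valid.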

The rule compresses to: single control bit → synchronizer; multi-bit data → FIFO or handshake; counters → Gray code. Never run a raw bus straight through per-bit flops.
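The async FIFO's flag logic can be sketched in a few lines. The sketch assumes one common convention that the text above does not spell out: pointers carry one bit more than the RAM address, so a full FIFO can be told apart from an empty one. Under that assumption, empty means the Gray pointers are equal, and full means they are equal except for the top two bits. All names and the depth of 8 are illustrative.

```python
ADDR_BITS = 3                      # assumed FIFO depth of 8 entries
PTR_BITS = ADDR_BITS + 1           # one extra wrap bit (assumed convention)
DEPTH = 1 << ADDR_BITS
PTR_MOD = 1 << PTR_BITS
TOP_TWO = 0b11 << (PTR_BITS - 2)   # mask for the two most significant bits

def bin_to_gray(n):
    return n ^ (n >> 1)

def is_empty(rd_gray, wr_gray_synced):
    # Read side: compare its own pointer with the synchronized write pointer.
    return rd_gray == wr_gray_synced

def is_full(wr_gray, rd_gray_synced):
    # Write side: full when the pointers differ by exactly DEPTH, which in
    # Gray code shows up as the top two bits inverted and the rest equal.
    return wr_gray == (rd_gray_synced ^ TOP_TWO)

wr, rd = 0, 0
print(is_empty(bin_to_gray(rd), bin_to_gray(wr)))  # True

wr = (wr + DEPTH) % PTR_MOD  # eight writes, no reads
print(is_full(bin_to_gray(wr), bin_to_gray(rd)))   # True
```

Because each side compares against a pointer that is a couple of cycles stale, the flags err on the safe side: the writer may see full slightly longer than necessary and the reader may see empty slightly longer, but neither overruns the other.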

Why it survives simulation and fails in the field

CDC bugs are insidious because your tools are blind to them by default. A plain RTL simulation uses ideal clocks and zero-delay logic, so metastability simply doesn’t exist there — the design “works.” It’s silicon, temperature and clock drift that expose it, often rarely enough to pass every test and still fail in production. That’s why the discipline is structural, enforced by review and tooling, not by testing:

Use a CDC linter (the vendor tools and dedicated checkers) to find every crossing and verify each has a proper synchronizer, FIFO or handshake.
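Mark synchronizer flops with the right constraints/attributes (for example, the ASYNC_REG attribute in Xilinx Vivado, together with a false-path or max-delay constraint on the crossing) so the tools don’t optimise them away or try to “fix” their timing.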
Treat every clock boundary as a design decision with a named technique — not something to wire up ad hoc.
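To give a feel for what a structural CDC check does, here is a deliberately tiny toy, not a model of any real linter. It assumes a hypothetical netlist made only of flops, each recording its clock and the flop that drives its D input directly. A crossing is accepted only if the capturing flop feeds exactly one further flop in the same domain, which is the two-flop pattern.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Flop:
    name: str
    clock: str
    source: Optional[str] = None  # flop driving D directly; None for a port

def unsynchronized_crossings(flops):
    """Return (source, destination) pairs that cross clocks without a
    recognisable two-flop synchronizer on the destination side."""
    by_name = {f.name: f for f in flops}
    fanout = {}
    for f in flops:
        if f.source is not None:
            fanout.setdefault(f.source, []).append(f)

    bad = []
    for f in flops:
        src = by_name.get(f.source) if f.source else None
        if src is None or src.clock == f.clock:
            continue  # not a clock domain crossing
        loads = fanout.get(f.name, [])
        # Accept only: capturing flop drives exactly one flop, same clock.
        if not (len(loads) == 1 and loads[0].clock == f.clock):
            bad.append((src.name, f.name))
    return bad

design = [
    Flop("tx_valid", "clk_a"),
    Flop("sync1", "clk_b", source="tx_valid"),
    Flop("sync2", "clk_b", source="sync1"),
    Flop("tx_flag", "clk_a"),
    Flop("rx_flag", "clk_b", source="tx_flag"),  # single flop: should be flagged
]
print(unsynchronized_crossings(design))  # [('tx_flag', 'rx_flag')]
```

Real CDC tools work on the full netlist and also check combinational logic on the crossing, reconvergence of synchronized signals, multi-bit groups, and more. The value of even this toy is that the check is structural and exhaustive, which is exactly what simulation cannot offer.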

Field notes

Two flops for every single-bit crossing — always. It’s the cheapest insurance in digital design, and the most common omission.
Never two-FF a raw binary bus. If more than one bit crosses together, it’s a FIFO, a handshake, or Gray code, with no exceptions.
Constrain and name your synchronizers so synthesis doesn’t merge or retime them, and so the CDC linter can recognise them.
Run a CDC check as part of the build, not once at the end. A new crossing added late is exactly the one that ships broken.
“It works on the bench” proves nothing about CDC. Every capturing clock edge carries a small probability of failure; bench time just hasn’t rolled the dice enough yet. This is the hardware cousin of the software reordering bugs that only appear under optimisation: invisible until the conditions line up.