The three rules of Q arithmetic

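A Qn number stores a real value $x$ as the integer $x \cdot 2^{n}$ (rounded to the nearest integer) in a signed word with $n$ fractional bits and one sign bit, covering $[-1, 1 - 2^{-n}]$ with resolution $2^{-n}$. Q15, for example, fits in an int16. Three rules prevent most fixed-point bugs: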

1. Addition: same format in, same format out, but the result can overflow. Adding two Q15 values needs either saturation or one guard bit.
2. Multiplication: formats add. $Q15 \times Q15 = Q30$, so the int32 product must be shifted right by 15 to come back to Q15, and rounded by adding $2^{14}$ before the shift.
3. The asymmetry: $-1.0$ is representable but $+1.0$ is not, so $(-1) \times (-1)$ overflows. It is the only product of two in-range values that needs saturation in a hardware MAC unit. That's why CMSIS-DSP relies on the __SSAT saturation intrinsic.
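The three rules can be sketched in C for Q15 as follows. This is a minimal illustration, not CMSIS-DSP code: the helper names q15_sat, q15_add, and q15_mul are hypothetical, and the shift assumes the compiler treats >> on a negative int32 as an arithmetic shift, which is implementation-defined in C but typical on mainstream compilers.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of any library. */

/* Clamp a 32-bit intermediate to the Q15 range [-32768, 32767]. */
static int16_t q15_sat(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Rule 1: add in a wider type (the guard bit), then saturate. */
static int16_t q15_add(int16_t a, int16_t b)
{
    return q15_sat((int32_t)a + (int32_t)b);
}

/* Rule 2: Q15 x Q15 = Q30, so add 2^14 to round and shift right by 15.
 * Rule 3: (-1) x (-1) yields +32768 after the shift, so saturate.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product, fits in int32 */
    p = (p + (1 << 14)) >> 15;            /* round, then back to Q15 */
    return q15_sat(p);
}
```

With these helpers, q15_mul(16384, 16384), that is $0.5 \times 0.5$, returns 8192 (0.25), while q15_mul(-32768, -32768) clamps to 32767 instead of wrapping.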
