
Foldability: When a Symmetry Becomes a Free Engineering Move

This page is about the theorem that turns exact symmetry into an implementation trick: if a rotation commutes with RoPE, you can fold it through a full single-head Q/K/V attention block, then lift the same rewrite headwise through concatenation, a shared output projection, and the residual add.

Think of this as the “physics to engineering” bridge. The previous page said which rotations are allowed. This page says why allowed rotations can be compiled away as an exact checkpoint rewrite, not just a score-preserving curiosity.

Theorem Reference

Lean anchors. attentionBlockOutput_invariant_of_shared_rotation, folded_qkvo_attention_block_invariant_of_commutes, ropeAttentionHeadContext, concatProjectedMultiHeadRopeAttentionBlockOutput_invariant_of_commutes, residualConcatProjectedMultiHeadRopeAttentionBlockOutput_invariant_of_commutes

Math statement.

\[
\Theta R = R\Theta,\qquad R^\top R = I \;\Longrightarrow\; \operatorname{Attn}_{\Theta,\;RW_q,\;RW_k,\;RW_v,\;W_oR^\top}(x,T) = \operatorname{Attn}_{\Theta,\;W_q,\;W_k,\;W_v,\;W_o}(x,T).
\]

\[
B(R)=\operatorname{blockdiag}(R_1,\dots,R_h),
\]

\[
x + W_o B(R)^\top \operatorname{Concat}\!\big(\operatorname{Ctx}^{(h)}_{\Theta_h,\;R_hW_q^{(h)},\;R_hW_k^{(h)},\;R_hW_v^{(h)}}(x,T)\big) = x + W_o \operatorname{Concat}\!\big(\operatorname{Ctx}^{(h)}_{\Theta_h,\;W_q^{(h)},\;W_k^{(h)},\;W_v^{(h)}}(x,T)\big).
\]

In English. If the same orthogonal rewrite \(R\) commutes with the RoPE operator \(\Theta\), then you can fold it into \(W_q\), \(W_k\), and \(W_v\), absorb \(R^\top\) into the output projection \(W_o\), and leave the full single-head attention block output unchanged. The newer wrapper theorem says each head may carry its own commuting orthogonal rewrite \(R_h\); after concatenation, one shared output projection absorbs the inverse block-diagonal bundle rotation \(B(R)^\top\), and the whole residual block still stays exactly fixed.
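The reasoning behind the single-head statement fits in three lines. As a sketch, write \(q_i = W_q x_i\), \(k_j = W_k x_j\), \(v_j = W_v x_j\), and let \(\alpha_{ij}\) be the softmax weights. The notation here is an assumption of this sketch, not taken from the Lean source: RoPE is assumed to act at position \(t\) as the power \(\Theta^{t}\), which commutes with \(R\) whenever \(\Theta\) does.

\[
\begin{aligned}
\langle \Theta^{t_i} R q_i,\ \Theta^{t_j} R k_j \rangle &= \langle R\,\Theta^{t_i} q_i,\ R\,\Theta^{t_j} k_j \rangle = \langle \Theta^{t_i} q_i,\ \Theta^{t_j} k_j \rangle, \\
\sum_j \alpha_{ij}\, R v_j &= R \sum_j \alpha_{ij} v_j = R\,\mathrm{ctx}_i, \\
W_o R^\top \big(R\,\mathrm{ctx}_i\big) &= W_o\,\mathrm{ctx}_i .
\end{aligned}
\]

The first line uses commutation and then \(R^\top R = I\), so the scores and hence the weights \(\alpha_{ij}\) are unchanged. The second line uses those unchanged weights to pull \(R\) out of the context sum. The third line is where \(W_oR^\top\) cancels the frame change.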

Physical intuition. A commuting orthogonal rotation is not a new behavior. It is a coordinate rewrite that the network cannot detect even at the level of the final wrapped attention block: each head changes frame internally, then the shared output layer removes the whole bundle rotation at once.

physics: same state in a new frame
engineering: offline Q/K/V/O rewrite
systems: zero runtime cost if folded

Before vs After

Wrapped block before folding

\(x\) → head 1: \(W_q,W_k,W_v\) → \(\Theta_1,\mathrm{Ctx}^{(1)}\) → \(\mathrm{ctx}^{(1)}\)
\(x,T\) → head 2,...,h → \(\mathrm{Ctx}^{(2)},...,\mathrm{Ctx}^{(h)}\) → \(\operatorname{Concat}\to W_o\to +x\)

Wrapped block after folding

\(x\) → head 1: \(R_1W_q,R_1W_k,R_1W_v\) → \(\Theta_1,\mathrm{Ctx}^{(1)}\) → \(R_1\mathrm{ctx}^{(1)}\)
\(x,T\) → head 2,...,h → \(R_2\mathrm{Ctx}^{(2)},...,R_h\mathrm{Ctx}^{(h)}\) → \(\operatorname{Concat}\to W_oB(R)^\top\to +x\)

\[
\mathrm{weights}(R_hq_h,R_hK_h)=\mathrm{weights}(q_h,K_h),\qquad \mathrm{Ctx}_h \mapsto R_h\,\mathrm{Ctx}_h,\qquad W_o \mapsto W_o B(R)^\top.
\]

The visual point is simple: you are not adding a new runtime operation. You are rewriting the same wrapped block in a symmetry-compatible head bundle frame.

The picture is not “the model changed and somehow performance stayed similar.” The picture is “each head changed frame internally, then the shared output map removed that whole frame change before the residual stream saw anything.”
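The wrapped statement can be checked numerically on a toy block. The following is a minimal sketch under stated assumptions, not the formalized construction: all weights and inputs are made-up illustrative numbers, every helper name (`head_context`, `residual_block`, `block_diag_T`) is hypothetical, each head has width 2 so that any planar rotation commutes with that head's RoPE rotation, and the usual score scaling and causal mask are omitted.

```python
import math

def rot(a):
    """2x2 rotation by angle a."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s], [s, c]]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def block_diag_T(blocks):
    """Transpose of blockdiag(blocks), i.e. B(R)^T for 2x2 blocks."""
    n = 2 * len(blocks)
    out = [[0.0] * n for _ in range(n)]
    for h, blk in enumerate(blocks):
        for i in range(2):
            for j in range(2):
                out[2 * h + j][2 * h + i] = blk[i][j]
    return out

def head_context(theta, Wq, Wk, Wv, xs, ts):
    """Context vectors of one width-2 RoPE head (softmax over all positions)."""
    qs = [matvec(rot(theta * t), matvec(Wq, x)) for x, t in zip(xs, ts)]
    ks = [matvec(rot(theta * t), matvec(Wk, x)) for x, t in zip(xs, ts)]
    vs = [matvec(Wv, x) for x in xs]   # values are not position-rotated
    ctx = []
    for q in qs:
        scores = [q[0] * k[0] + q[1] * k[1] for k in ks]
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        ctx.append([sum(wi * v[c] for wi, v in zip(w, vs)) / z for c in range(2)])
    return ctx

def residual_block(thetas, heads, Wo, xs, ts):
    """x + Wo @ Concat(head contexts), token by token."""
    per_head = [head_context(th, *W, xs, ts) for th, W in zip(thetas, heads)]
    out = []
    for i, x in enumerate(xs):
        concat = [c for ctx in per_head for c in ctx[i]]
        out.append([xi + yi for xi, yi in zip(x, matvec(Wo, concat))])
    return out

def max_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical toy data: 3 tokens, d_model = 3, two heads of width 2.
xs = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7], [-0.6, 0.9, 1.1]]
ts = [0, 1, 2]
thetas = [1.0, 0.1]                            # one RoPE angle per head
heads = [
    ([[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]],     # W_q
     [[-0.3, 0.6, 0.2], [0.1, -0.8, 0.5]],     # W_k
     [[0.9, 0.0, -0.2], [-0.4, 0.5, 0.3]]),    # W_v
    ([[0.4, 0.1, -0.6], [-0.2, 0.8, 0.3]],
     [[0.5, -0.1, 0.7], [0.3, 0.2, -0.9]],
     [[-0.7, 0.4, 0.1], [0.6, -0.3, 0.8]]),
]
Wo = [[0.3, -0.2, 0.5, 0.1], [0.5, 0.8, -0.4, 0.2], [-0.6, 0.1, 0.3, -0.7]]

Rs = [rot(0.8), rot(-2.1)]                     # each head gets its own rotation R_h
folded_heads = [tuple(matmul(R, W) for W in head) for R, head in zip(Rs, heads)]
folded_Wo = matmul(Wo, block_diag_T(Rs))       # W_o B(R)^T

base = residual_block(thetas, heads, Wo, xs, ts)
folded = residual_block(thetas, folded_heads, folded_Wo, xs, ts)
uncompensated = residual_block(thetas, folded_heads, Wo, xs, ts)

print("folded vs original:      ", max_diff(base, folded))
print("without the W_o rewrite: ", max_diff(base, uncompensated))
```

The first difference should sit at floating-point noise, while the second should be clearly nonzero: rotating the heads without rewriting the shared output projection does change what the residual stream sees.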

Why This Matters

Quantization

You can choose a better coordinate frame for the whole Q/K/V block, then absorb it back into weights and output projection.

Compilation

A symmetry-compatible rewrite can be baked into the checkpoint instead of carried at inference.

Residual interface

The new exact wrapper theorem says the rewrite stays exact after concatenation, a shared output projection, and the residual add.

Feynman version: if two descriptions give the same measurable score, then the extra internal rotation was never a physical degree of freedom in the first place. It was a coordinate choice.

DIY Wrapped Foldability Checker

Toggle the theorem assumptions and what parts of the attention block you rotate. This separates “score-level invariance,” “single-head block invariance,” and the newer theorem that survives concatenation, a shared output projection, and the residual add.
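A rough offline stand-in for the toggles can be written in a few lines of plain Python. This is a sketch under assumptions, not the page's checker: the single width-2 head, its weights, and the names `attention` and `check` are all hypothetical. A reflection serves as the non-commuting orthogonal matrix, and values are assumed not to be position-rotated. Each call reports whether the softmax weights and the block output are unchanged for one toggle setting.

```python
import math

def rot(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s], [s, c]]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Hypothetical toy head: d_model = 3, head width 2, RoPE angle THETA per position.
THETA = 1.0
XS = [[0.5, -1.0, 0.3], [1.2, 0.4, -0.7], [-0.6, 0.9, 1.1]]
TS = [0, 1, 2]
WQ = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]
WK = [[-0.3, 0.6, 0.2], [0.1, -0.8, 0.5]]
WV = [[0.9, 0.0, -0.2], [-0.4, 0.5, 0.3]]
WO = [[0.3, -0.2], [0.5, 0.8], [-0.6, 0.1]]

def attention(Wq, Wk, Wv, Wo):
    """Return (softmax weights, block outputs) for the toy head."""
    qs = [matvec(rot(THETA * t), matvec(Wq, x)) for x, t in zip(XS, TS)]
    ks = [matvec(rot(THETA * t), matvec(Wk, x)) for x, t in zip(XS, TS)]
    vs = [matvec(Wv, x) for x in XS]
    weights, outs = [], []
    for q in qs:
        scores = [q[0] * k[0] + q[1] * k[1] for k in ks]
        top = max(scores)
        e = [math.exp(s - top) for s in scores]
        z = sum(e)
        w = [ei / z for ei in e]
        ctx = [sum(wj * v[c] for wj, v in zip(w, vs)) for c in range(2)]
        weights.append(w)
        outs.append(matvec(Wo, ctx))
    return weights, outs

def close(A, B, tol=1e-9):
    return all(abs(a - b) <= tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def check(commutes=True, rotate_qk=True, rotate_v=True, absorb_in_wo=True):
    """Return (weights unchanged?, block output unchanged?) for one toggle setting."""
    # A reflection is orthogonal but does not commute with a planar rotation.
    R = rot(0.9) if commutes else [[1.0, 0.0], [0.0, -1.0]]
    Wq = matmul(R, WQ) if rotate_qk else WQ
    Wk = matmul(R, WK) if rotate_qk else WK
    Wv = matmul(R, WV) if rotate_v else WV
    Wo = matmul(WO, transpose(R)) if absorb_in_wo else WO
    w0, y0 = attention(WQ, WK, WV, WO)
    w1, y1 = attention(Wq, Wk, Wv, Wo)
    return close(w0, w1), close(y0, y1)

for label, kwargs in [
    ("full fold, commuting R", {}),
    ("V rotated, W_o not rewritten", {"absorb_in_wo": False}),
    ("Q/K only, commuting R", {"rotate_v": False, "absorb_in_wo": False}),
    ("full fold, non-commuting R", {"commutes": False}),
]:
    print(label, check(**kwargs))
```

On this toy head the full commuting fold keeps both the weights and the output. Rotating the values without rewriting \(W_o\) keeps the weights but changes the output. The non-commuting choice changes the weights already, so no output rewrite can rescue it.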

Engineering Heuristic

Find a rotation that makes your representation easier to compress or analyze.
Check that it commutes with the relevant positional dynamics.
Fold it into Q/K/V offline and compensate with the matching output projection rewrite instead of paying for it online.
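Steps two and three of the heuristic can be sketched as a gate plus an offline rewrite. This is an illustrative sketch only: `is_foldable` and `fold` are hypothetical names, the 4-wide head with two RoPE planes and all weight values are made up, and step one (choosing a useful rotation) is taken as given.

```python
import math

def rot(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s], [s, c]]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def block_diag(blocks):
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    off = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, val in enumerate(row):
                out[off + i][off + j] = val
        off += len(b)
    return out

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def is_foldable(R, Theta, tol=1e-9):
    """True if R is orthogonal and commutes with the RoPE matrix Theta."""
    n = len(R)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    orthogonal = max_abs_diff(matmul(transpose(R), R), eye) <= tol
    commuting = max_abs_diff(matmul(Theta, R), matmul(R, Theta)) <= tol
    return orthogonal and commuting

def fold(Wq, Wk, Wv, Wo, R):
    """Offline rewrite: (R Wq, R Wk, R Wv, Wo R^T)."""
    return matmul(R, Wq), matmul(R, Wk), matmul(R, Wv), matmul(Wo, transpose(R))

# Hypothetical 4-wide head whose two RoPE planes rotate at different rates.
Theta = block_diag([rot(1.0), rot(0.1)])

in_plane = block_diag([rot(0.4), rot(-1.3)])   # rotates inside each RoPE plane
c, s = math.cos(0.4), math.sin(0.4)
cross_plane = [[c, 0.0, -s, 0.0],              # mixes coordinates of the two planes
               [0.0, 1.0, 0.0, 0.0],
               [s, 0.0, c, 0.0],
               [0.0, 0.0, 0.0, 1.0]]

print("in-plane rotation foldable:   ", is_foldable(in_plane, Theta))
print("cross-plane rotation foldable:", is_foldable(cross_plane, Theta))

# Made-up weights: d_model = 3, head width 4.
Wq = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4], [0.4, 0.1, -0.6], [-0.2, 0.8, 0.3]]
Wk = [[-0.3, 0.6, 0.2], [0.1, -0.8, 0.5], [0.5, -0.1, 0.7], [0.3, 0.2, -0.9]]
Wv = [[0.9, 0.0, -0.2], [-0.4, 0.5, 0.3], [-0.7, 0.4, 0.1], [0.6, -0.3, 0.8]]
Wo = [[0.3, -0.2, 0.5, 0.1], [0.5, 0.8, -0.4, 0.2], [-0.6, 0.1, 0.3, -0.7]]

new_Wq, new_Wk, new_Wv, new_Wo = fold(Wq, Wk, Wv, Wo, in_plane)
# The value-to-output product is untouched because R^T R = I.
print("change in W_o W_v:", max_abs_diff(matmul(new_Wo, new_Wv), matmul(Wo, Wv)))
```

The design choice is to refuse the rewrite unless both conditions hold numerically. Orthogonality alone is not enough: the cross-plane rotation is orthogonal but fails the commutation test, so folding it would alter the attention scores.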

That is the pattern the exact theorem justifies. It is not mere empirical convenience; it is algebra with systems consequences. The current wrapper theorem says the same reasoning survives one more architectural layer: headwise context computation, concatenation, a shared output projection, and the residual add.