This page is about the theorem that turns exact symmetry into an implementation trick: if a rotation commutes with RoPE, you can fold it through a full single-head Q/K/V attention block, then lift the same rewrite headwise through concatenation, a shared output projection, and the residual add.
Think of this as the “physics to engineering” bridge. The previous page said which rotations are allowed. This page says why allowed rotations can be compiled away as an exact checkpoint rewrite, not just a score-preserving curiosity.
Lean anchors.
attentionBlockOutput_invariant_of_shared_rotation
folded_qkvo_attention_block_invariant_of_commutes
ropeAttentionHeadContext
concatProjectedMultiHeadRopeAttentionBlockOutput_invariant_of_commutes
residualConcatProjectedMultiHeadRopeAttentionBlockOutput_invariant_of_commutes
Math statement.
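A hedged reconstruction of the statement, pieced together from the English summary rather than copied from the Lean source. It assumes a column-vector convention, writes \(\Theta_m\) for the RoPE operator at position \(m\) (the position subscript is an addition here), takes \(a_{mn}\) to be the softmax attention weights and \(c_m\) the head context, and assumes values are not RoPE-rotated. The Lean theorems may be phrased differently.

```latex
\[
\begin{aligned}
&\text{Assume } R^\top R = I \text{ and } R\,\Theta_m = \Theta_m R \text{ for every position } m. \\
&\text{Fold: } W_q' = R W_q,\quad W_k' = R W_k,\quad W_v' = R W_v,\quad W_o' = W_o R^\top. \\[4pt]
&\text{Scores: } (\Theta_m W_q' x_m)^\top(\Theta_n W_k' x_n)
   = (R\,\Theta_m W_q x_m)^\top (R\,\Theta_n W_k x_n)
   = (\Theta_m W_q x_m)^\top(\Theta_n W_k x_n), \\
&\text{so } a_{mn}' = a_{mn}. \\[4pt]
&\text{Context: } c_m' = \sum_n a_{mn}\, R W_v x_n = R\, c_m,
   \qquad W_o' c_m' = W_o R^\top R\, c_m = W_o c_m. \\[4pt]
&\text{Multi-head: } B(R) = \operatorname{diag}(R_1,\dots,R_H),\quad
   x_m + W_o B(R)^\top
   \begin{bmatrix} R_1 c_m^{(1)} \\ \vdots \\ R_H c_m^{(H)} \end{bmatrix}
   = x_m + W_o
   \begin{bmatrix} c_m^{(1)} \\ \vdots \\ c_m^{(H)} \end{bmatrix}.
\end{aligned}
\]
```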
In English. If the same orthogonal rewrite \(R\) commutes with the RoPE operator \(\Theta\), then you can fold it into \(W_q\), \(W_k\), and \(W_v\), absorb \(R^\top\) into the output projection \(W_o\), and leave the full single-head attention block output unchanged. The newer wrapper theorem says each head may carry its own commuting orthogonal rewrite \(R_h\); after concatenation, one shared output projection absorbs the inverse block-diagonal bundle rotation \(B(R)^\top\), and the whole residual block still stays exactly fixed.
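The single-head claim can be checked numerically. This is a minimal NumPy sketch, not the Lean development: the helper names (`plane_rotation`, `rope`, `attention_block`), the toy sizes, and the RoPE frequencies are all hypothetical. It uses a column-vector convention, no causal mask, and leaves values un-rotated by RoPE. The rotation \(R\) is built from one 2x2 rotation per RoPE plane, which is one family of orthogonal matrices that commutes with \(\Theta\) at every position.

```python
import numpy as np

def plane_rotation(angles):
    """Block-diagonal matrix with one 2x2 rotation per RoPE plane."""
    angles = np.asarray(angles, dtype=float)
    M = np.zeros((2 * angles.size, 2 * angles.size))
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return M

def rope(Z, freqs):
    """Rotate column t of Z (d_head x T) by Theta_t = plane_rotation(t * freqs)."""
    cols = [plane_rotation(t * freqs) @ Z[:, t] for t in range(Z.shape[1])]
    return np.stack(cols, axis=1)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def attention_block(X, Wq, Wk, Wv, Wo, freqs):
    """Single-head block: RoPE on queries and keys, then Wo applied to the context."""
    Q, K, V = rope(Wq @ X, freqs), rope(Wk @ X, freqs), Wv @ X
    A = softmax_rows(Q.T @ K / np.sqrt(Q.shape[0]))  # A[m, n]: weight of key n for query m
    return Wo @ (V @ A.T)                            # column m = Wo @ sum_n A[m, n] * v_n

rng = np.random.default_rng(0)
d_model, d_head, T = 8, 4, 5               # toy sizes (assumption)
freqs = np.array([1.0, 0.1])               # hypothetical RoPE frequencies, one per plane
X = rng.normal(size=(d_model, T))
Wq, Wk, Wv = (rng.normal(size=(d_head, d_model)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_head))

# A per-plane rotation: orthogonal, and it commutes with Theta_t at every position t.
R = plane_rotation([0.7, -1.3])
commutes = all(
    np.allclose(R @ plane_rotation(t * freqs), plane_rotation(t * freqs) @ R)
    for t in range(T)
)

baseline = attention_block(X, Wq, Wk, Wv, Wo, freqs)
# Fold R into W_q, W_k, W_v and absorb R^T into W_o.
folded = attention_block(X, R @ Wq, R @ Wk, R @ Wv, Wo @ R.T, freqs)
max_gap = float(np.abs(baseline - folded).max())
print(commutes, max_gap)
```

The folded call uses only rewritten weights, so nothing extra runs at inference time; any remaining gap should be floating-point rounding.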
Physical intuition. A commuting orthogonal rotation is not a new behavior. It is a coordinate rewrite that the network cannot detect even at the level of the final wrapped attention block: each head changes frame internally, then the shared output layer removes the whole bundle rotation at once.
The visual point is simple: you are not adding a new runtime operation. You are rewriting the same wrapped block in a symmetry-compatible head bundle frame.
The picture is not “the model changed and somehow performance stayed similar.” The picture is “each head changed frame internally, then the shared output map removed that whole frame change before the residual stream saw anything.”
You can choose a better coordinate frame for the whole Q/K/V block, then absorb it back into the weights and the output projection.
A symmetry-compatible rewrite can be baked into the checkpoint instead of carried at inference.
The newer wrapper theorem says this remains true after concatenation, a shared output projection, and the residual add.
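The three points above can be sketched for the wrapped multi-head block. This is an illustrative NumPy check under stated assumptions, not the Lean statement: the helper names (`head_context`, `residual_block`, `bundle`), the toy sizes, and the frequencies are hypothetical, there is no causal mask, and values are not RoPE-rotated. Each head gets its own per-plane rotation \(R_h\), and the shared output projection absorbs \(B(R)^\top\).

```python
import numpy as np

def plane_rotation(angles):
    """Block-diagonal matrix with one 2x2 rotation per RoPE plane."""
    angles = np.asarray(angles, dtype=float)
    M = np.zeros((2 * angles.size, 2 * angles.size))
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return M

def rope(Z, freqs):
    """Rotate column t of Z (d_head x T) by Theta_t = plane_rotation(t * freqs)."""
    cols = [plane_rotation(t * freqs) @ Z[:, t] for t in range(Z.shape[1])]
    return np.stack(cols, axis=1)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def head_context(X, Wq, Wk, Wv, freqs):
    """Per-head context before any output projection."""
    Q, K, V = rope(Wq @ X, freqs), rope(Wk @ X, freqs), Wv @ X
    A = softmax_rows(Q.T @ K / np.sqrt(Q.shape[0]))
    return V @ A.T

def residual_block(X, heads, Wo, freqs):
    """Concatenate head contexts, apply the shared Wo, then add the residual."""
    contexts = [head_context(X, Wq, Wk, Wv, freqs) for (Wq, Wk, Wv) in heads]
    return X + Wo @ np.concatenate(contexts, axis=0)

def bundle(rotations):
    """B(R): block-diagonal matrix with R_h as the h-th diagonal block."""
    d = sum(R.shape[0] for R in rotations)
    B, off = np.zeros((d, d)), 0
    for R in rotations:
        n = R.shape[0]
        B[off:off + n, off:off + n] = R
        off += n
    return B

rng = np.random.default_rng(1)
n_heads, d_head, T = 3, 4, 6               # toy sizes (assumption)
d_model = n_heads * d_head
freqs = np.array([1.0, 0.05])              # hypothetical RoPE frequencies
X = rng.normal(size=(d_model, T))
heads = [tuple(rng.normal(size=(d_head, d_model)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(d_model, n_heads * d_head))

# One commuting rotation per head, folded into that head's Q/K/V weights.
rotations = [plane_rotation(rng.uniform(-np.pi, np.pi, size=2)) for _ in range(n_heads)]
folded_heads = [(R @ Wq, R @ Wk, R @ Wv) for R, (Wq, Wk, Wv) in zip(rotations, heads)]
Wo_folded = Wo @ bundle(rotations).T       # shared projection absorbs B(R)^T

baseline = residual_block(X, heads, Wo, freqs)
folded = residual_block(X, folded_heads, Wo_folded, freqs)
print(float(np.abs(baseline - folded).max()))
```

Only the stored weights differ between the two calls, which is what makes this a checkpoint rewrite rather than an extra runtime step.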
Feynman version: if two descriptions give the same measurable score, then the extra internal rotation was never a physical degree of freedom in the first place. It was a coordinate choice.
Toggle the theorem assumptions and which parts of the attention block you rotate. This separates “score-level invariance,” “single-head block invariance,” and the newer theorem that survives concatenation, a shared output projection, and the residual add.
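The effect of dropping an assumption, or rotating only part of the block, can be sketched in the single-head case. This is a hedged NumPy illustration with hypothetical helper names, toy sizes, and frequencies; `G` is a random orthogonal matrix obtained from a QR factorization and is assumed, as is generically the case, not to commute with RoPE.

```python
import numpy as np

def plane_rotation(angles):
    """Block-diagonal matrix with one 2x2 rotation per RoPE plane."""
    angles = np.asarray(angles, dtype=float)
    M = np.zeros((2 * angles.size, 2 * angles.size))
    for i, a in enumerate(angles):
        c, s = np.cos(a), np.sin(a)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return M

def rope(Z, freqs):
    """Rotate column t of Z (d_head x T) by Theta_t = plane_rotation(t * freqs)."""
    cols = [plane_rotation(t * freqs) @ Z[:, t] for t in range(Z.shape[1])]
    return np.stack(cols, axis=1)

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def scores(X, Wq, Wk, freqs):
    """Pre-softmax attention scores with RoPE on queries and keys."""
    Q, K = rope(Wq @ X, freqs), rope(Wk @ X, freqs)
    return Q.T @ K / np.sqrt(Q.shape[0])

def block(X, Wq, Wk, Wv, Wo, freqs):
    """Single-head block output (no residual, no mask)."""
    A = softmax_rows(scores(X, Wq, Wk, freqs))
    return Wo @ ((Wv @ X) @ A.T)

rng = np.random.default_rng(2)
d_model, d_head, T = 8, 4, 5               # toy sizes (assumption)
freqs = np.array([1.0, 0.1])               # hypothetical RoPE frequencies
X = rng.normal(size=(d_model, T))
Wq, Wk, Wv = (rng.normal(size=(d_head, d_model)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_head))

R = plane_rotation([0.4, 2.0])                            # commutes with every Theta_t
G, _ = np.linalg.qr(rng.normal(size=(d_head, d_head)))    # orthogonal, generic

base_scores = scores(X, Wq, Wk, freqs)
base_out = block(X, Wq, Wk, Wv, Wo, freqs)

# Score-level invariance: rotate Q and K only, with a commuting R.
score_ok_commuting = np.allclose(base_scores, scores(X, R @ Wq, R @ Wk, freqs))
# Drop the commutation assumption: orthogonality alone is not enough.
score_ok_generic = np.allclose(base_scores, scores(X, G @ Wq, G @ Wk, freqs))
# Rotate Q/K/V but forget to absorb R^T into Wo: scores match, block output does not.
block_ok_without_wo = np.allclose(base_out, block(X, R @ Wq, R @ Wk, R @ Wv, Wo, freqs))
# Full fold: single-head block invariance.
block_ok_full = np.allclose(base_out, block(X, R @ Wq, R @ Wk, R @ Wv, Wo @ R.T, freqs))

print(score_ok_commuting, score_ok_generic, block_ok_without_wo, block_ok_full)
```

The middle two checks are the instructive ones: a generic orthogonal matrix breaks the scores because it does not pass through \(\Theta\), and a commuting rotation still changes the block output unless \(W_o\) absorbs its transpose.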
Folding a commuting rotation into the checkpoint is the pattern the exact theorem justifies. It is not a mere empirical convenience; it is algebra with systems consequences. The wrapper theorem says the same reasoning survives one more architectural layer: headwise context computation, concatenation, a shared output projection, and the residual add.