Rotational Attention Layers: Geometric-Algebraic Transformations within Transformer Architectures
Author: Richard Vermillion
Date: 2025-10-22
Abstract
We propose a modification to standard Transformer blocks wherein the value vector output of the attention mechanism is interpreted not as an additive residual update, but as a geometric rotation applied to the input representation. By framing attention as a rotor in geometric algebra (GA), we leverage the “sandwich” form \( R\,x\,\tilde R \) (or equivalently \( a\,b\,x\,b\,a \)) to perform rotation in high-dimensional embedding space without computing full \(N\times N\) rotation matrices. In practice we fix (or optionally learn) one reference vector \(a\) and let the attention output serve as \(b\), normalize appropriately, reflect/rotate the input \(x\), then pass through the subsequent MLP + residual path. We hypothesize that many semantic and numeric transformations—especially analogical and arithmetic relations—are naturally expressed as rotations rather than translations, and that providing an explicit rotational capability can reduce depth/parameter needs, improve interpretability, and better support tasks such as analogies and arithmetic reasoning (e.g., where numbers are embedded on helices). We detail the block design, initialization strategy (so as not to degrade pretrained models), and propose a set of ablation and evaluation experiments on analogy benchmarks and arithmetic tasks. We believe this rotational-layer hybrid (additive + rotational) is a promising direction for richer inductive biases in large language models (LLMs).
1. Introduction
The standard Transformer architecture (Vaswani et al., 2017) combines self-attention, whose output is a weighted sum of value vectors \(V\) selected by queries and keys, with additive residual updates. In the pre-LayerNorm form, each block computes:
\[ h = x_{\rm in} + \mathrm{SelfAttn}(\mathrm{LayerNorm}(x_{\rm in})), \qquad x_{\rm out} = h + \mathrm{MLP}(\mathrm{LayerNorm}(h)). \]
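For concreteness, a minimal pre-LN block of this form might look as follows. This is a sketch for illustration only; the module layout and the use of nn.MultiheadAttention are our own choices, not a reference implementation.

```python
import torch.nn as nn

class StandardBlock(nn.Module):
    """Pre-LN Transformer block: additive residuals around attention and MLP."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):  # x: (batch, seq, d_model)
        z = self.ln1(x)
        attn_out, _ = self.attn(z, z, z, need_weights=False)
        h = x + attn_out                      # additive residual after attention
        return h + self.mlp(self.ln2(h))      # additive residual after MLP
```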
Although extremely powerful, this additive update paradigm treats attention output purely as a vector shift. We argue there is value in decoupling directional re-orientation from magnitude shifts. In particular:
- Many semantic relations (e.g., king → queen, man → woman) may behave more like rotations in embedding space: preserving norm, re-orienting direction rather than purely translating.
- Recent work on numeric reasoning in LLMs shows numbers are represented as helical embeddings and arithmetic corresponds to phase shifts (i.e., rotations) on that helix (Kantamneni & Tegmark, 2025).
- By giving the model explicit rotational capacity, we might enable more efficient representation of structured transformations, reduce reliance on additive layers alone, and improve interpretability of learned latent geometry.
We propose inserting rotational transformer blocks periodically within a standard architecture, so that the model has both rotational and additive operations available. Our implementation uses geometric algebra: we define a reference direction \(a\) (e.g., a fixed basis vector) and attention output \(b\) to form a rotor that rotates input \(x\). This avoids computing a full \(N\times N\) rotation matrix, maintaining computational efficiency.
2. Rotational Block Design
2.1 Geometric Algebra Primer (very brief)
In geometric algebra (GA), a rotor \(R\) rotates a vector \(x\) via the sandwich product:
\[ x \mapsto R\,x\,\widetilde{R} \]
where \(\widetilde{R}\) is the reverse (conjugate) of \(R\). For a rotation in the plane spanned by unit vectors \(a\) and \(b\), one can take:
\[ R = a\,b \quad\Rightarrow\quad R\,x\,\widetilde{R} = a\,b\,x\,b\,a. \]
This reflects \(x\) in the hyperplane perpendicular to \(b\), then reflects the result in the hyperplane perpendicular to \(a\); the composition is a rotation, by twice the angle between \(a\) and \(b\), in the plane they span. Importantly, this requires only dot products and vector operations rather than constructing a full \(N\times N\) rotation matrix.
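The rotation property is easy to check numerically. The sketch below (plain NumPy, our own illustration rather than a GA library) composes the two reflections that realize \(a\,b\,x\,b\,a\) and verifies that the result preserves the norm of \(x\) and leaves the subspace orthogonal to \(\mathrm{span}(a,b)\) untouched.

```python
import numpy as np

def reflect_across(v, u):
    """Reflect v across the line spanned by unit vector u: u v u = 2(u.v)u - v."""
    return 2.0 * np.dot(u, v) * u - v

rng = np.random.default_rng(0)
d = 8
a = np.zeros(d); a[0] = 1.0                      # fixed reference direction e_0
b = rng.normal(size=d); b /= np.linalg.norm(b)   # arbitrary unit vector
x = rng.normal(size=d)

y = reflect_across(reflect_across(x, b), a)      # a b x b a

print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))   # True: norm preserved

# Components of x orthogonal to span(a, b) are untouched by the rotation:
P = np.eye(d) - np.outer(a, a)
b_perp = P @ b; b_perp /= np.linalg.norm(b_perp)
Q = P - np.outer(b_perp, b_perp)                 # projector onto span(a, b)^perp
print(np.allclose(Q @ x, Q @ y))                 # True: rotation confined to the plane
```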
2.2 Block Implementation
Let \(x \in \mathbb{R}^{d}\) (embedding dimension).
- Compute standard self-attention output \(b = \mathrm{SelfAttn}(\mathrm{LayerNorm}(x))\).
- Add a fixed reference direction: \(b_{\rm ref} = b + a\), where \(a\) is a fixed (or optionally learned) unit vector (e.g., basis vector \(e_0\)).
- Normalize \(b_{\rm ref}\): \(b \leftarrow b_{\rm ref} / |b_{\rm ref}|\).
- Reflect input \(x\) in direction \(b\): \( h = x - 2 (b \cdot x) b \).
- Reflect in direction \(a\): flip the sign of the reference component, \( h \leftarrow h - 2 (a \cdot h)\, a \) (for \(a = e_0\), simply \(h[0] \leftarrow -h[0]\)); together with the previous step this applies the rotor sandwich \(a\,b\,x\,b\,a\).
- Pass through MLP and residual: \( h_{\rm norm} = \mathrm{LayerNorm}(h) \), \( r = \mathrm{MLP}(h_{\rm norm}) \), \( \text{output} = h + r \).
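A minimal sketch of this block in PyTorch, assuming a fixed reference direction \(a = e_0\) and standard multi-head attention. Class and variable names are ours; this is an illustration of the procedure above, not an official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotationalBlock(nn.Module):
    """Rotational block: the attention output b, combined with a fixed reference
    direction a = e_0, defines two reflections whose composition rotates x."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        a = torch.zeros(d_model); a[0] = 1.0      # fixed reference direction e_0
        self.register_buffer("a", a)              # could instead be a learnable nn.Parameter

    def forward(self, x):  # x: (batch, seq, d_model)
        z = self.ln_attn(x)
        b, _ = self.attn(z, z, z, need_weights=False)     # attention output plays the role of b
        b = F.normalize(b + self.a, dim=-1)               # b <- (b + a) / |b + a|
        h = x - 2.0 * (x * b).sum(-1, keepdim=True) * b   # reflect x in hyperplane perp. to b
        h = h - 2.0 * h[..., :1] * self.a                 # reflect in hyperplane perp. to a (flip e_0 comp.)
        return h + self.mlp(self.ln_mlp(h))               # standard MLP + residual path
```

The degenerate case \( b_{\rm ref} \approx 0 \) is guarded by the epsilon inside F.normalize, consistent with the normalization mitigation noted in Section 6.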
2.3 Initialization for Compatibility with Pretrained Models
To avoid disrupting pretrained models:
- Initialize attention projection weights small so \(b \approx 0\).
- Then \(b + a \approx a\), and after normalization, \(b \approx a\), which makes the rotor a near-identity transformation.
- Thus, the block starts close to \(x \mapsto x\) and can be fine-tuned without destabilizing the model.
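A quick numerical check of the near-identity behaviour at initialization (NumPy, illustrative only): with the attention output at zero, the normalized reference equals \(a\) and the two reflections cancel.

```python
import numpy as np

d = 8
a = np.zeros(d); a[0] = 1.0
b = np.zeros(d)                                  # attention output b ~ 0 at initialization
b = (b + a) / np.linalg.norm(b + a)              # b_ref = b + a, normalized; here b == a exactly
x = np.random.default_rng(1).normal(size=d)

h = x - 2.0 * np.dot(b, x) * b                   # reflection in hyperplane perp. to b
h[0] = -h[0]                                     # reflection in hyperplane perp. to a = e_0
print(np.allclose(h, x))                         # True: the block starts as x -> x
```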
3. Rationale & Hypotheses
3.1 Semantic Analogue Transformations
Analogies often involve reorienting a subspace while preserving other properties. Rotational blocks may represent such analogies (e.g. gender, plurality) more naturally than additive residuals.
3.2 Numerical Reasoning and Helices
Kantamneni & Tegmark (2025) show that LLMs encode numbers on helices and implement addition as phase rotations along those helices. Our layers support such transformations directly.
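As a toy illustration of this picture (our own construction, not the probing setup of Kantamneni & Tegmark), a number \(n\) can be embedded with a linear component plus \(\cos/\sin\) phases over a few periods \(T\); "add \(k\)" is then a fixed block-diagonal rotation of the phase planes (plus a shift of the linear part).

```python
import numpy as np

periods = np.array([2, 5, 10, 100])

def embed(n):
    """Helix-style embedding: one linear coordinate plus (cos, sin) pairs per period."""
    phases = 2 * np.pi * n / periods
    return np.concatenate([[n], np.cos(phases), np.sin(phases)])

def add_k(v, k):
    """Apply 'plus k' as a rotation by 2*pi*k/T in each phase plane."""
    out = v.copy()
    out[0] += k
    for i, T in enumerate(periods):
        th = 2 * np.pi * k / T
        c, s = np.cos(th), np.sin(th)
        x, y = v[1 + i], v[1 + len(periods) + i]
        out[1 + i] = c * x - s * y
        out[1 + len(periods) + i] = s * x + c * y
    return out

print(np.allclose(add_k(embed(37), 25), embed(62)))   # True: rotation implements addition
```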
3.3 Additional Degree of Freedom
An additive residual shifts the representation by the computed update vector regardless of how that vector is oriented relative to \(x\); the rotational transformation instead depends on \(b \cdot x\), so the same \(b\) acts differently on differently oriented inputs, introducing a conditional transformation capacity.
3.4 Integration into Standard Architectures
Our blocks can be inserted periodically (e.g. every 7 layers) into standard transformer architectures, maintaining compatibility with existing models and training regimes.
4. Experimental Plan
4.1 Model Setup
- Baseline: transformer without rotational blocks.
- Experimental: insert 4 rotational layers into a 28-layer pretrained model (32 layers total). Initially, only the inserted layers are trained while the base model is frozen (see the interleaving sketch below).
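One plausible way to realize this interleaving, assuming the StandardBlock and RotationalBlock sketches from Section 2; the layer counts follow the "every 7 layers" example and the hyperparameters are placeholders.

```python
import torch.nn as nn

def build_stack(n_standard=28, every=7, d_model=512, n_heads=8, d_ff=2048):
    """Insert one RotationalBlock after every `every` StandardBlocks."""
    layers = []
    for i in range(1, n_standard + 1):
        layers.append(StandardBlock(d_model, n_heads, d_ff))
        if i % every == 0:
            layers.append(RotationalBlock(d_model, n_heads, d_ff))
    return nn.Sequential(*layers)

model_body = build_stack()
print(len(model_body))   # 32: 28 standard + 4 rotational layers
```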
4.2 Tasks & Datasets
Analogy Tasks:
- Google Analogy Dataset
- BATS (Bigger Analogy Test Set)
- SAT Analogy Questions
Arithmetic Tasks:
- Multi-digit addition/subtraction (2, 4, 8 digits)
- Extrapolation to 10+ and 20+ digit arithmetic (self-generated)
- RL fine-tuning (e.g., DeepSeek R1-style curriculum/self-improvement)
4.3 Ablations
- Fixed vs learnable reference vector \(a\)
- With vs without additive residuals
- Vary frequency/number of rotational blocks
4.4 Metrics
- Accuracy on analogy and arithmetic datasets
- Extrapolation generalization gap
- Training efficiency and convergence speed
- Activation norm stability
- Interpretability of rotation planes
4.5 Hypotheses
- Rotational layers improve analogy and arithmetic accuracy
- Generalization improves on longer-digit inputs
- Learnable \(a\) improves over fixed \(a\)
- Additive path is still useful (removal hurts)
- Rotational blocks yield more stable activation norms
5. Related Work
- Vaswani et al. (2017) - Transformers
- Kantamneni & Tegmark (2025) - Helical number representations
- Kobayashi et al. (2020) - Norms in attention
- Assaad et al. (2022) - Rotation-equivariant attention (VN-Transformers)
- Bounsi et al. (2024) - Neural algorithmic reasoning in transformers
- Lu & Guo (2023) - Helix encodings in transformers
6. Risks & Limitations
- Limited rotation planes if \(a\) is fixed
- Instability if \(b\) is poorly conditioned (mitigated via normalization)
- Interpretation in high dimensions is difficult
- Additional compute cost (though minor)
7. Timeline
- Implementation and verification (Weeks 0–4)
- Compatibility tests with frozen base (Weeks 4–6)
- Analogy task tuning and ablations (Weeks 6–12)
- Arithmetic extrapolation experiments (Weeks 12–20)
- Probing and interpretability (Weeks 20–24)
- Paper drafting and submission (Weeks 24–28)
8. Conclusion
We believe that enriching transformer architectures with explicit rotational capacity—via geometric-algebraic rotors built from attention output and a reference direction—offers a promising inductive bias for structured transformations, analogical reasoning, and numeric reasoning. The proposed block is efficient, integrates with existing models, and is easy to initialize near identity to avoid degradation. Through ablations and benchmark experiments we aim to test whether this approach yields tangible improvements.
References
- Vaswani, A., et al. (2017). Attention is All You Need.
- Kobayashi, G., et al. (2020). Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. EMNLP.
- Kantamneni, S., Tegmark, M. (2025). Language Models Use Trigonometry to Do Addition. arXiv preprint.
- Assaad, S., et al. (2022). VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. NeurIPS Workshop.
- Bounsi, W., et al. (2024). Transformers meet Neural Algorithmic Reasoners. arXiv.
- Lu, JHJ., Guo, Q. (2023). The Double Helix inside the NLP Transformer.