Canon layers

  • A Canon layer is quite simple

  • Its goal is to be a local sequence mixer over a short window

  • It can be summarized as

      • Canon(x)_t = Σ_{i=0}^{K−1} w_i ⊙ x_{t−i}, with learnable per-channel weights w_0, …, w_{K−1} over a short window of size K
    • zeros are used for padding at the boundaries (i.e., x_{t−i} is taken as 0 when t − i < 1)
  • Flexible Integration. Canon layers integrate at multiple points within each Transformer block:

    • Canon-A: Before the attention block (m = d if the hidden size is d).
    • Canon-B: Inside the attention block, applied to the Q/K/V projections (m = 3d).
    • Canon-C: Before the MLP block (m = d).
    • Canon-D: Within the MLP block (m = 4d for a standard MLP; the corresponding hidden width for a gated MLP).
  • In the paper, Canon layers are implemented as 1-D causal convolutions with kernel size 4, chosen for their efficient CUDA kernels and sufficient capacity (see the sketch after this list).

    • There is an explicit residual connection around the convolution, i.e., the output is x_t + Canon(x)_t.
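
  • A minimal PyTorch sketch of a Canon layer under these assumptions (depthwise causal conv1d with kernel size 4 plus residual); the class name, shapes, and hyperparameters are illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Canon(nn.Module):
    """Sketch of a Canon layer: depthwise 1-D causal convolution (kernel size 4)
    with an explicit residual connection. Details are assumptions, not the paper's code."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise conv: one short filter per channel, mixing a local window.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                    # (batch, dim, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))  # zero-pad on the left => causal
        h = self.conv(h).transpose(1, 2)         # back to (batch, seq_len, dim)
        return x + h                             # explicit residual connection


# E.g., a Canon-A placement would apply this right before the attention block (m = d).
x = torch.randn(2, 16, 64)
assert Canon(64)(x).shape == x.shape
```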

Synthetic Tasks for Decomposing Intelligence

Interpreting task failures. If a specific architecture (of a given size) fails at a certain difficulty level (e.g., large N or k), it does not imply the model cannot learn the skill given unlimited training. All comparisons use a fixed, limited training budget; thus, results should be read as differences in the speed of skill acquisition, not in absolute capability.

Design principles

  • We want to design synthetic tasks to systematically evaluate specific capabilities of language model architectures under controlled conditions, minimizing confounds and enabling clean comparisons.

  • Criterion 1: Tasks must not be shallow.

    • Shallow tasks, such as associative recall or copying, are easily solved by small, shallow models (e.g., 2-layer attention models).
  • Criterion 2: Emphasis on mental thinking.

    • Tasks should assess a model’s ability to reason internally without Chain-of-Thought (CoT).
  • Criterion 3: Avoid emphasis on length generalization.

    • Length generalization is often unstable—sensitive to random seeds and training order.
  • Criterion 4: Relevance to real-world skills.

    • Tasks should prioritize broadly applicable skills while avoiding capabilities better suited to external tools.
    • For example, large-number arithmetic (e.g., adding 10-digit numbers) is theoretically interesting but can be delegated to a Python interpreter; failures here typically reflect limited data exposure rather than architectural weakness (e.g., Llama3 70B miscalculates 452352 + 547647).

Task Depo: Mental reasoning depth

  • Reasoning depth represents a fundamental capability for LLMs, requiring models to retrieve information through multi-step computation.
  • Task Depo evaluates reasoning depth as k-hop traversal over directed permutations, where models compute the k-th successor for each query q entirely internally, without intermediate steps like Chain-of-Thought (CoT).
  • The dataset is defined by two parameters: the maximum permutation size N and the reasoning depth K.
  • Each problem instance is generated as follows:
    • First, a permutation length n is sampled uniformly from {3, 4, … , N }. A directed permutation of n nodes is then created, representing a cycle where each node points to its successor: x1 → x2 → · · · → xn → x1.
    • The permutation is presented as edges in the form of ordered pairs (xi, xi+1), but these edges are shuffled randomly into a sequence of 2n tokens.
  • Then, a query (q, k) is appended, asking for the k-th successor of q in the cycle, i.e., the node reached after k hops starting from q (a minimal generator is sketched below).
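
  • A sketch of how such an instance could be generated, assuming the description above; the function name and the uniform sampling of k from {1, …, K} are assumptions, not the paper's data pipeline:

```python
import random

def make_depo_instance(N: int, K: int):
    """Hypothetical Task Depo generator: a shuffled cycle of n <= N nodes
    plus a (q, k) query asking for the k-th successor of q."""
    n = random.randint(3, N)                     # permutation length from {3, ..., N}
    nodes = list(range(n))
    random.shuffle(nodes)                        # random cycle x1 -> x2 -> ... -> xn -> x1
    edges = [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]
    random.shuffle(edges)                        # edges presented in random order (2n tokens)
    succ = dict(edges)

    q = random.choice(nodes)
    k = random.randint(1, K)
    answer = q
    for _ in range(k):                           # the k-hop traversal the model must do mentally
        answer = succ[answer]
    return edges, (q, k), answer

edges, (q, k), answer = make_depo_instance(N=8, K=4)
print(edges, (q, k), answer)
```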

Task Brevo: Mental reasoning breadth

  • Task Brevo evaluates a model’s ability to process multiple dependencies simultaneously, as required in tree-like traversal or dependency-graph tasks.
  • The task defines m edges xi → yi, representing dependencies where yi depends on xi; together these edges form a DAG.
  • Upon receiving a query vertex q, the model outputs all vertices recursively reachable from q, sorted in topological order starting from the leaves (e.g., u → v → q yields output u followed by v).
  • A key finding from earlier Physics of LLM work is that, because valid outputs are not unique, the model must mentally precompute the topological order of the DAG before emitting the first token (one valid answer is sketched below).
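
  • A small sketch of computing one valid Brevo answer under the description above; the helper name and edge representation are assumptions:

```python
from collections import defaultdict

def brevo_answer(edges, q):
    """Hypothetical helper: one valid Task Brevo answer, i.e. all vertices from
    which q is (recursively) reachable, emitted leaves-first (topological order)."""
    parents = defaultdict(list)      # edge x -> y means y depends on x
    for x, y in edges:
        parents[y].append(x)

    order, seen = [], set()

    def visit(v):
        for p in parents[v]:
            if p not in seen:
                seen.add(p)
                visit(p)             # emit a vertex only after its own dependencies
                order.append(p)

    visit(q)
    return order

# u -> v -> q yields u followed by v, matching the example above.
print(brevo_answer([("u", "v"), ("v", "q")], "q"))   # ['u', 'v']
```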

Task Mano - Knowledge manipulation

  • Task Mano evaluates a distinct form of reasoning: the ability to manipulate stored knowledge internally.
    • contrasting with in-context reasoning tasks
  • Mano requires models to retrieve factual knowledge embedded in their parameters and perform hierarchical computation entirely mentally
    • For instance, questions like “Was [person X] born in an even or odd month?” or derived 2-hop queries such as “What is [person X]’s sister’s birthdate?” demand reasoning layered on top of stored knowledge.
    • These skills cannot reliably emerge through supervised fine-tuning alone (as shown in Physics of LLM 3.2) and require development during pretraining or continued pretraining.
  • The factual base consists of three 23 × 23 arithmetic tables (addition, subtraction, and multiplication), which models learn implicitly during pretraining.
    • It’s small and manageable.
  • Queries are defined as arbitrary-length compositions of arithmetic operations modulo 23.
    • For example, a length-3 instance is ((a × b) + (c − d)) mod 23.
    • Here the operands a, b, c, and d are integers sampled uniformly from {0, 1, …, 22} (a small evaluator of this example is sketched below).
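
  • As a quick check of the arithmetic, a tiny sketch evaluating the length-3 example above; the function name and sampling code are illustrative assumptions:

```python
import random

MOD = 23  # Task Mano's arithmetic lives in the 23 x 23 tables

def mano_length3(a: int, b: int, c: int, d: int) -> int:
    """The length-3 example from the notes: ((a * b) + (c - d)) mod 23."""
    return ((a * b) + (c - d)) % MOD

# Operands sampled uniformly from {0, ..., 22}; Python's % keeps the result in range.
a, b, c, d = (random.randrange(MOD) for _ in range(4))
print(f"(({a} * {b}) + ({c} - {d})) mod 23 = {mano_length3(a, b, c, d)}")
```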

Summary

  • Results 2–5: When Transformer Meets Canon.

    • Boost performance.
    • In the playground, Canon layers improve reasoning depth (200–400%), reasoning breadth (30%), knowledge manipulation length (30%), and more.
    • Reviving NoPE.
      • Integrating Canon layers transforms NoPE models into strong performers, often matching or surpassing RoPE(+Canon).
      • Canon layers outperform positional fixes like ALiBi or H-ALiBi, and reducing/removing RoPE usage improves length generalization.
    • Ablation study.
      • Canon layers contribute cumulatively across sublayer positions (Canon-A/B/C/D), independently of attention or MLP components.
      • Residual links improve training efficiency; minimal parameter tuning is required without compromising stability.
    • MLP and MoE.
      • Canon layers can recover some knowledge capacity lost in gated MLP or mixture-of-expert (MoE) architectures, via improved training efficiency and stability.
  • Results 6–7: When Linear Attention Meets Canon

    • Boost performance.
      • Canon layers elevate Gated Linear Attention from 1-hop to 4-hop reasoning depth and double its reasoning breadth and knowledge-manipulation length, making it comparable to Mamba2 and even surpassing it on tasks like Brevo (reasoning breadth).
    • Ablation study.
      • Residual links and full Canon (A/B/C/D) are essential for maximizing effectiveness in linear-attention models; partial implementations may underperform.
  • Results 8–9: When Mamba Meets Canon

    • Secret of success.
      • Mamba2’s performance is driven by its built-in conv1d mechanism, which acts as a non-linear Canon-B layer applied to selective coordinates.
      • Removing conv1d drops performance to match GLA, while replacing it with full Canon layers further boosts results, highlighting the importance of horizontal information flow over SSM design.
    • Ablation study.
      • Canon choices—such as integration points and residual links—can influence Mamba2’s performance.
      • Mimetic initialization, while optimized for length generalization, harms shorter-context tasks, underscoring the need for diverse pretraining environments.
  • Results 10–11: Comparing Architectures

    • Controlled comparisons.
      • Applying full Canon layers consistently across RoPE, NoPE, Mamba2, and GLA allows controlled comparisons, revealing that full transformers outperform linear models in hierarchical reasoning tasks, achieving twice the reasoning depth.
    • Reasoning depth challenges.
      • In GLA and Mamba2, limited reasoning depth stems from accumulated compression and retrieval errors—not memory capacity—pinpointing a key focus for future research on linear models.
        • This is the retrieval-error issue discussed in DeltaNet; approaches like TTT and DeltaNet may offer solutions.
      • Until this is resolved, hybrid designs (e.g., sliding-window Transformers with linear backbones) remain the most scalable path to deeper reasoning.