Canon layers
A Canon layer is quite simple.
Its goal is to act as a local sequence mixer over a short window.
It can be summarized as
- y_t = Σ_{i=0}^{K−1} W_i · x_{t−i} (a causal convolution over the last K positions),
- with x_{t−i} = 0 whenever t − i < 1, i.e., zeros are usually used for padding at the boundaries.
Flexible Integration. Canon layers integrate at multiple points within each Transformer block:
- Canon-A: before the attention block (m = d if the hidden size is d).
- Canon-B: inside the attention block, applied to the Q/K/V projections (m = 3d).
- Canon-C: before the MLP block (m = d).
- Canon-D: within the MLP block (m = 4d for a standard MLP, m = for a gated MLP).
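The four integration points can be sketched as wiring inside one Transformer block. This is a schematic only: every sub-module below is an identity placeholder with a hypothetical name (`norm1`, `attn_mix`, `canon_a`, etc.), not the paper's actual code; only the insertion positions are taken from the notes above.

```python
# Schematic wiring of Canon-A/B/C/D in one Transformer block.
# All sub-modules are identity stand-ins; only the placement is illustrated.
identity = lambda x: x
norm1 = norm2 = attn_mix = act = identity
canon_a = canon_b = canon_c = canon_d = identity  # placeholder Canon layers

def transformer_block(x):
    h = canon_a(norm1(x))       # Canon-A: before the attention block (m = d)
    q = k = v = canon_b(h)      # Canon-B: applied to Q/K/V projections (m = 3d)
    x = x + attn_mix(q)         # residual around attention
    h = canon_c(norm2(x))       # Canon-C: before the MLP block (m = d)
    u = canon_d(h)              # Canon-D: inside the MLP (m = 4d for a standard MLP)
    return x + act(u)           # residual around the MLP
```

With real modules, each `canon_*` would be a short causal convolution; here the point is only where they sit relative to the attention and MLP sublayers.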
In the paper, Canon layers are implemented as 1-D causal convolutions with kernel size 4, chosen for their efficient CUDA kernels and good capacity.
- An explicit residual connection is added around each Canon layer.
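As a toy illustration, a single Canon layer with the residual link can be sketched in plain Python. This is a hypothetical, unoptimized sketch (the paper uses an efficient conv1d kernel); for brevity the weights here are scalars shared across channels rather than per-channel parameters.

```python
def canon(xs, ws):
    """Causal local mixing with an explicit residual connection.
    xs: the sequence, a list of d-dim vectors (lists of floats).
    ws: K scalar weights; ws[i] multiplies the position i steps back.
    Positions before the start of the sequence are zero-padded."""
    out = []
    for t, x in enumerate(xs):
        mixed = [0.0] * len(x)
        for i, w in enumerate(ws):            # look back up to K-1 positions
            if t - i >= 0:                    # zero-padding at the boundary
                for c, v in enumerate(xs[t - i]):
                    mixed[c] += w * v
        out.append([a + b for a, b in zip(x, mixed)])  # explicit residual link
    return out
```

With `ws = [1, 0, 0, 0]` the mixer reduces to the identity, so the residual makes the layer exactly double its input; non-trivial weights blend each position with its recent neighbors.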
Synthetic Tasks for Decomposing Intelligence
Interpreting task failures. If a specific architecture (of a given size) fails at a certain difficulty level (e.g., large N or k), this does not imply the model could never learn the skill given unlimited training. All comparisons use a fixed, limited training budget; results should therefore be read as differences in the speed of skill acquisition, not in absolute capability.
Design principles
We want to design synthetic tasks to systematically evaluate specific capabilities of language model architectures under controlled conditions, minimizing confounds and enabling clean comparisons.
Criterion 1: Tasks must not be shallow.
- Shallow tasks, such as associative recall or copying, are easily solved by small, shallow models (e.g., 2-layer attention models).
Criterion 2: Emphasis on mental thinking.
- Tasks should assess a model’s ability to reason internally without Chain-of-Thought (CoT).
Criterion 3: Avoid emphasis on length generalization.
- Length generalization is often unstable—sensitive to random seeds and training order.
Criterion 4: Relevance to real-world skills.
- Tasks should prioritize broadly applicable skills while avoiding capabilities better suited to external tools.
- For example, large-number arithmetic (e.g., adding 10-digit numbers) is theoretically interesting but can be delegated to a Python interpreter; failures here typically reflect limited data exposure rather than architectural weakness (e.g., Llama3 70B miscalculates 452352 + 547647).
Task Depo: Mental reasoning depth
- Reasoning depth represents a fundamental capability for LLMs, requiring models to retrieve information through multi-step computation.
- Task Depo evaluates reasoning depth as k-hop traversal over directed permutations, where models compute the k-th successor for each query q entirely internally, without intermediate steps like Chain-of-Thought (CoT).
- The dataset is defined by two parameters: the maximum permutation size N and the reasoning depth K.
- Each problem instance is generated as follows:
- First, a permutation length n is sampled uniformly from {3, 4, … , N }. A directed permutation of n nodes is then created, representing a cycle where each node points to its successor: x1 → x2 → · · · → xn → x1.
- The permutation is presented as edges in the form of ordered pairs (xi, xi+1), but these edges are shuffled randomly into a sequence of 2n tokens.
- Then, a query (q, k) is appended, asking: what is the k-th successor of q in the cycle, i.e., the node reached after k hops?
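The generation procedure above can be sketched as follows. This is a hypothetical generator (function name and output format are illustrative; the paper's exact tokenization differs), but it follows the sampling steps in the notes: sample n, build a cycle, shuffle the edges, then compute the k-th successor.

```python
import random

def depo_instance(N, k, rng=random):
    """Sample one Depo problem: returns (shuffled edge list, query node q,
    answer = the k-th successor of q in the cycle). Sketch only; the paper's
    token-level formatting differs."""
    n = rng.randint(3, N)                  # permutation length n from {3, ..., N}
    nodes = list(range(n))
    rng.shuffle(nodes)                     # cycle order x1 -> x2 -> ... -> xn -> x1
    succ = {nodes[i]: nodes[(i + 1) % n] for i in range(n)}
    edges = list(succ.items())
    rng.shuffle(edges)                     # edges are presented in random order
    q = rng.choice(nodes)
    ans = q
    for _ in range(k):                     # k hops; the model must do this mentally,
        ans = succ[ans]                    # without emitting intermediate steps
    return edges, q, ans
```

Note that the answer is computed here by explicit iteration; the task requires the model to produce it in a single forward pass, with no Chain-of-Thought.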
Task Brevo: Mental reasoning breadth
- Task Brevo evaluates a model's ability to process multiple dependencies simultaneously, as required in tasks involving tree-like traversal or dependency graphs.
- The task defines m edges xi → yi, representing dependencies where yi depends on xi; together these form a DAG.
- Upon receiving a query vertex q, the model outputs all vertices recursively reachable from q, sorted in topological order starting from the leaves (e.g., u → v → q yields output u followed by v).
- A key finding from earlier Physics of Language Models work is that, because valid outputs are non-unique, the model must mentally preprocess the entire topological order of the DAG before generating the first token.
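A reference solver for Brevo's expected output can be sketched as below. Assumptions to note: the convention x → y means "y depends on x" (as in the notes), and ties between independent branches are broken by DFS post-order, which yields one of the many valid leaves-first topological orders.

```python
def brevo_answer(edges, q):
    """All vertices recursively reachable from q (i.e., everything q depends
    on, directly or transitively), emitted leaves-first in topological order.
    Sketch only; q itself is not part of the output."""
    preds = {}                              # vertex -> list of its dependencies
    for x, y in edges:
        preds.setdefault(y, []).append(x)
    order, seen = [], set()
    def visit(v):
        for p in preds.get(v, []):
            if p not in seen:
                seen.add(p)
                visit(p)
                order.append(p)             # emit after its own dependencies
    visit(q)
    return order
```

For the example in the notes, the edges u → v → q produce the output u followed by v.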
Task Mano: Knowledge manipulation
- Task Mano evaluates a distinct form of reasoning: the ability to manipulate stored knowledge internally.
- In contrast to in-context reasoning tasks, Mano requires models to retrieve factual knowledge embedded in their parameters and perform hierarchical computation entirely mentally.
- For instance, questions like "Was [person] born in an even or odd month?" or derived 2-hop queries such as "What is [person]'s sister's birthdate?" demand layers of reasoning over stored knowledge.
- These skills cannot reliably emerge through supervised fine-tuning alone (as shown in Physics of LLM 3.2) and require development during pretraining or continued pretraining.
- The factual base consists of three 23 × 23 arithmetic tables (addition, subtraction, and multiplication), which models learn implicitly during pretraining; it is small and manageable.
- Queries are defined as arbitrary length-ℓ manipulations of arithmetic modulo 23.
- For example, a length ℓ = 3 instance is
- ((a × b) + (c − d)) mod 23, where the operands a, b, c, and d are integers sampled uniformly from {0, 1, …, 22}.
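A Mano query generator can be sketched as below. This is hypothetical (the paper's exact expression shapes and surface format may differ): it samples n_ops + 1 operands in {0, …, 22} and repeatedly combines adjacent subexpressions with one of the three table operations, returning both the expression string and its value mod 23.

```python
import random

def mano_instance(n_ops, rng=random):
    """Sample a query with n_ops arithmetic operations modulo 23.
    Returns (expression string, value). Sketch only."""
    ops = {"+": lambda a, b: (a + b) % 23,
           "-": lambda a, b: (a - b) % 23,
           "*": lambda a, b: (a * b) % 23}
    # Start from n_ops + 1 leaf operands, combine until one expression remains.
    exprs = [(str(v), v) for v in (rng.randrange(23) for _ in range(n_ops + 1))]
    while len(exprs) > 1:
        i = rng.randrange(len(exprs) - 1)   # pick two adjacent subexpressions
        ls, lv = exprs.pop(i)
        rs, rv = exprs.pop(i)
        sym = rng.choice("+-*")
        exprs.insert(i, (f"({ls} {sym} {rs})", ops[sym](lv, rv)))
    return exprs[0]
```

Because +, −, and × commute with reduction mod 23, the stored value always agrees with evaluating the full expression and reducing at the end.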
Summary
Results 2–5: When Transformer Meets Canon.
- Boost performance.
- In the playground, Canon layers improve reasoning depth (200–400%), reasoning breadth (30%), knowledge manipulation length (30%), and more.
- Reviving NoPE.
- Integrating Canon layers transforms NoPE models into strong performers, often matching or surpassing RoPE(+Canon).
- Canon layers outperform positional fixes like ALiBi or H-Alibi, and reducing/removing RoPE usage improves length generalization.
- Ablation study.
- Canon layers contribute cumulatively across sublayer positions (Canon-A/B/C/D), independently of attention or MLP components.
- Residual links improve training efficiency; minimal parameter tuning is required without compromising stability.
- MLP and MoE.
- Canon layers can recover some knowledge capacity lost in gated MLP or mixture-of-expert (MoE) architectures, via improved training efficiency and stability.
Results 6–7: When Linear Attention Meets Canon
- Boost performance.
- Canon layers elevate Gated Linear Attention (GLA) from 1-hop to 4-hop reasoning depth and double its reasoning breadth and knowledge-manipulation length, making it comparable to Mamba2 and even surpassing it on tasks like Brevo (reasoning breadth).
- Ablation study.
- Residual links and full Canon (A/B/C/D) are essential for maximizing effectiveness in linear-attention models; partial implementations may underperform.
Results 8–9: When Mamba Meets Canon
- Secret of success.
- Mamba2’s performance is driven by its built-in conv1d mechanism, which acts as a non-linear Canon-B layer applied to selective coordinates.
- Removing conv1d drops performance to match GLA, while replacing it with full Canon layers further boosts results, highlighting the importance of horizontal information flow over SSM design.
- Ablation study.
- Canon choices—such as integration points and residual links—can influence Mamba2’s performance.
- Mimetic initialization, while optimized for length generalization, harms shorter-context tasks, underscoring the need for diverse pretraining environments.
Results 10–11: Comparing Architectures
- Controlled comparisons.
- Applying full Canon layers consistently across RoPE, NoPE, Mamba2, and GLA allows controlled comparisons, revealing that full transformers outperform linear models in hierarchical reasoning tasks, achieving twice the reasoning depth.
- Reasoning depth challenges.
- In GLA and Mamba2, limited reasoning depth stems from accumulated compression and retrieval errors—not memory capacity—pinpointing a key focus for future research on linear models.
- This is the retrieval-error issue discussed in the DeltaNet work; approaches like TTT and DeltaNet can be good solutions.
- Until this is resolved, hybrid designs (e.g., sliding-window Transformers with linear backbones) remain the most scalable path to deeper reasoning.