Canon layers

  • A Canon layer is quite simple

  • Its goal is to be a local sequence mixer over a short window

  • It can be summarized as

      • Canon(x)_t = Σ_{i=0}^{K−1} w_i ⊙ x_{t−i}, with learnable per-channel weights w_0, …, w_{K−1} over a short window of size K
    • zeros are used for padding at the boundaries (i.e., x_{t−i} is taken as 0 when t − i < 1)
  • Flexible Integration. Canon layers integrate at multiple points within each Transformer block:

    • Canon-A: Before the attention block (m = d if the hidden size is d).
    • Canon-B: Inside the attention block, applied to the Q/K/V projections (m = 3d).
    • Canon-C: Before the MLP block (m = d).
    • Canon-D: Within the MLP block (m = 4d for a standard MLP; the corresponding hidden width for a gated MLP).
  • In the paper, Canon layers are implemented as 1-D causal convolutions with kernel size 4, chosen for their efficient CUDA kernels and sufficient capacity (see the sketch after this list).

    • There is an explicit residual connection around the convolution, i.e., the output is x_t + Canon(x)_t.
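
  • A minimal PyTorch sketch of a Canon layer under these assumptions (depthwise causal conv1d with kernel size 4 plus residual); the class name, shapes, and hyperparameters are illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Canon(nn.Module):
    """Sketch of a Canon layer: depthwise 1-D causal convolution (kernel size 4)
    with an explicit residual connection. Details are assumptions, not the paper's code."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise conv: one short filter per channel, mixing a local window.
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        h = x.transpose(1, 2)                    # (batch, dim, seq_len)
        h = F.pad(h, (self.kernel_size - 1, 0))  # zero-pad on the left => causal
        h = self.conv(h).transpose(1, 2)         # back to (batch, seq_len, dim)
        return x + h                             # explicit residual connection


# E.g., a Canon-A placement would apply this right before the attention block (m = d).
x = torch.randn(2, 16, 64)
assert Canon(64)(x).shape == x.shape
```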

Synthetic Tasks for Decomposing Intelligence

Interpreting task failures. If a specific architecture (of a given size) fails at a certain difficulty level (e.g., large N or k), it does not imply the model cannot learn the skill given unlimited training. All comparisons use a fixed, limited training budget; thus, results should be read as differences in the speed of skill acquisition, not in absolute capability.

Design principles

  • We want to design synthetic tasks to systematically evaluate specific capabilities of language model architectures under controlled conditions, minimizing confounds and enabling clean comparisons.

  • Criterion 1: Tasks must not be shallow.

    • Shallow tasks, such as associative recall or copying, are easily solved by small, shallow models (e.g., 2-layer attention models).
  • Criterion 2: Emphasis on mental thinking.

    • Tasks should assess a model’s ability to reason internally without Chain-of-Thought (CoT).
  • Criterion 3: Avoid emphasis on length generalization.

    • Length generalization is often unstable—sensitive to random seeds and training order.
  • Criterion 4: Relevance to real-world skills.

    • Tasks should prioritize broadly applicable skills while avoiding capabilities better suited to external tools.
    • For example, large-number arithmetic (e.g., adding 10-digit numbers) is theoretically interesting but can be delegated to a Python interpreter; failures here typically reflect limited data exposure rather than architectural weakness (e.g., Llama3 70B miscalculates 452352 + 547647).

Task Depo: Mental reasoning depth

  • Reasoning depth represents a fundamental capability for LLMs, requiring models to retrieve information through multi-step computation.
  • Task Depo evaluates reasoning depth as k-hop traversal over directed permutations, where models compute the k-th successor for each query q entirely internally, without intermediate steps like Chain-of-Thought (CoT).
  • The dataset is defined by two parameters: the maximum permutation size N and the reasoning depth K.
  • Each problem instance is generated as follows:
    • First, a permutation length n is sampled uniformly from {3, 4, … , N }. A directed permutation of n nodes is then created, representing a cycle where each node points to its successor: x1 → x2 → · · · → xn → x1.
    • The permutation is presented as edges in the form of ordered pairs (xi, xi+1), but these edges are shuffled randomly into a sequence of 2n tokens.
  • Then, a query (q, k) is appended, asking for the k-th successor of q in the cycle, i.e., the node reached after k hops starting from q (a minimal generator is sketched below).
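
  • A sketch of how such an instance could be generated, assuming the description above; the function name and the uniform sampling of k from {1, …, K} are assumptions, not the paper's data pipeline:

```python
import random

def make_depo_instance(N: int, K: int):
    """Hypothetical Task Depo generator: a shuffled cycle of n <= N nodes
    plus a (q, k) query asking for the k-th successor of q."""
    n = random.randint(3, N)                     # permutation length from {3, ..., N}
    nodes = list(range(n))
    random.shuffle(nodes)                        # random cycle x1 -> x2 -> ... -> xn -> x1
    edges = [(nodes[i], nodes[(i + 1) % n]) for i in range(n)]
    random.shuffle(edges)                        # edges presented in random order (2n tokens)
    succ = dict(edges)

    q = random.choice(nodes)
    k = random.randint(1, K)
    answer = q
    for _ in range(k):                           # the k-hop traversal the model must do mentally
        answer = succ[answer]
    return edges, (q, k), answer

edges, (q, k), answer = make_depo_instance(N=8, K=4)
print(edges, (q, k), answer)
```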

Task Brevo: Mental reasoning breadth

  • Task Brevo evaluates a model’s ability to process multiple dependencies simultaneously, as required in tree-like traversal or dependency-graph tasks.
  • The task defines m edges xi → yi, representing dependencies where yi depends on xi; together these edges form a DAG.
  • Upon receiving a query vertex q, the model outputs all vertices recursively reachable from q, sorted in topological order starting from the leaves (e.g., u → v → q yields output u followed by v).
  • A key finding from earlier Physics of LLM work is that, because valid outputs are not unique, the model must mentally precompute the topological order of the DAG before emitting the first token (one valid answer is sketched below).
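
  • A small sketch of computing one valid Brevo answer under the description above; the helper name and edge representation are assumptions:

```python
from collections import defaultdict

def brevo_answer(edges, q):
    """Hypothetical helper: one valid Task Brevo answer, i.e. all vertices from
    which q is (recursively) reachable, emitted leaves-first (topological order)."""
    parents = defaultdict(list)      # edge x -> y means y depends on x
    for x, y in edges:
        parents[y].append(x)

    order, seen = [], set()

    def visit(v):
        for p in parents[v]:
            if p not in seen:
                seen.add(p)
                visit(p)             # emit a vertex only after its own dependencies
                order.append(p)

    visit(q)
    return order

# u -> v -> q yields u followed by v, matching the example above.
print(brevo_answer([("u", "v"), ("v", "q")], "q"))   # ['u', 'v']
```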

Task Mano - Knowledge manipulation

  • Task Mano evaluates a distinct form of reasoning: the ability to manipulate stored knowledge internally.
    • contrasting with in-context reasoning tasks
  • Mano requires models to retrieve factual knowledge embedded in their parameters and perform hierarchical computation entirely mentally
    • For instance, questions like “Was [person X] born in an even or odd month?” or derived 2-hop queries such as “What is [person X]’s sister’s birthdate?” demand reasoning layered on top of stored knowledge.
    • These skills cannot reliably emerge through supervised fine-tuning alone (as shown in Physics of LLM 3.2) and require development during pretraining or continued pretraining.
  • The factual base consists of three 23 × 23 arithmetic tables (addition, subtraction, and multiplication), which models learn implicitly during pretraining.
    • It’s small and manageable.
  • Queries are defined as arbitrary-length compositions of arithmetic operations modulo 23.
    • For example, a length-3 instance is ((a × b) + (c − d)) mod 23.
    • Here the operands a, b, c, and d are integers sampled uniformly from {0, 1, …, 22} (a small evaluator of this example is sketched below).
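
  • As a quick check of the arithmetic, a tiny sketch evaluating the length-3 example above; the function name and sampling code are illustrative assumptions:

```python
import random

MOD = 23  # Task Mano's arithmetic lives in the 23 x 23 tables

def mano_length3(a: int, b: int, c: int, d: int) -> int:
    """The length-3 example from the notes: ((a * b) + (c - d)) mod 23."""
    return ((a * b) + (c - d)) % MOD

# Operands sampled uniformly from {0, ..., 22}; Python's % keeps the result in range.
a, b, c, d = (random.randrange(MOD) for _ in range(4))
print(f"(({a} * {b}) + ({c} - {d})) mod 23 = {mano_length3(a, b, c, d)}")
```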

Summary

  • Results 2–5: When Transformer Meets Canon.

    • Boost performance.
    • In the playground, Canon layers improve reasoning depth (200–400%), reasoning breadth (30%), knowledge manipulation length (30%), and more.
    • Reviving NoPE.
      • Integrating Canon layers transforms NoPE models into strong performers, often matching or surpassing RoPE(+Canon).
      • Canon layers outperform positional fixes like ALiBi or H-ALiBi, and reducing/removing RoPE usage improves length generalization.
    • Ablation study.
      • Canon layers contribute cumulatively across sublayer positions (Canon-A/B/C/D), independently of attention or MLP components.
      • Residual links improve training efficiency; minimal parameter tuning is required without compromising stability.
    • MLP and MoE.
      • Canon layers can recover some knowledge capacity lost in gated MLP or mixture-of-expert (MoE) architectures, via improved training efficiency and stability.
  • Results 6–7: When Linear Attention Meets Canon

    • Boost performance.
      • Canon layers elevate Gated Linear Attention from 1-hop to 4-hop reasoning depth and double its reasoning breadth and knowledge-manipulation length, making it comparable to Mamba2 and even surpassing it on tasks like Brevo (reasoning breadth).
    • Ablation study.
      • Residual links and full Canon (A/B/C/D) are essential for maximizing effectiveness in linear-attention models; partial implementations may underperform.
  • Results 8–9: When Mamba Meets Canon

    • Secret of success.
      • Mamba2’s performance is driven by its built-in conv1d mechanism, which acts as a non-linear Canon-B layer applied to selective coordinates.
      • Removing conv1d drops performance to match GLA, while replacing it with full Canon layers further boosts results, highlighting the importance of horizontal information flow over SSM design.
    • Ablation study.
      • Canon choices—such as integration points and residual links—can influence Mamba2’s performance.
      • Mimetic initialization, while optimized for length generalization, harms shorter-context tasks, underscoring the need for diverse pretraining environments.
  • Results 10–11: Comparing Architectures

    • Controlled comparisons.
      • Applying full Canon layers consistently across RoPE, NoPE, Mamba2, and GLA allows controlled comparisons, revealing that full transformers outperform linear models in hierarchical reasoning tasks, achieving twice the reasoning depth.
    • Reasoning depth challenges.
      • In GLA and Mamba2, limited reasoning depth stems from accumulated compression and retrieval errors—not memory capacity—pinpointing a key focus for future research on linear models.
        • This is the retrieval-error issue discussed in DeltaNet; approaches like TTT and DeltaNet may offer solutions.
      • Until this is resolved, hybrid designs (e.g., sliding-window Transformers with linear backbones) remain the most scalable path to deeper reasoning.