Introduction
- The majority of sequence models of practical interest can formally be expressed as either linear systems ($y = Tx$) or systems with input-varying linear operators ($y = T(x)\,x$), the latter of which they abbreviate to LIV.
- The input-varying linear operator framework decouples the input-varying featurization from the linear mapping used to construct and apply the operator (see the sketch after this list).
- This decomposition enables a wide array of deep learning primitives to be uniformly formulated as linear systems, including models like convolutions, linear state-transition recurrences, and attention variants.
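As a rough illustration of the decomposition, the following NumPy sketch (hypothetical names, not the paper's code) builds a toy causal operator from a per-step gating featurization and then applies it linearly to the input:

```python
# Toy LIV decomposition: featurize the input, construct a causal operator
# from the features, then apply the operator linearly to the input.
import numpy as np

def causal_featurize(x):
    """Map inputs to per-step features; here a simple gate (assumed example)."""
    return 1.0 / (1.0 + np.exp(-x))                 # (L, d) gates in (0, 1)

def build_operator(feats):
    """Construct a lower-triangular (causal) per-channel operator from features."""
    L, d = feats.shape
    T = np.zeros((L, L, d))
    for i in range(L):
        for j in range(i + 1):
            # Toy rule: entry (i, j) decays with the product of gates j+1..i
            # (empty product = 1), mimicking a gated linear recurrence.
            T[i, j] = np.prod(feats[j + 1:i + 1], axis=0)
    return T

def apply_operator(T, x):
    """y_i = sum_{j <= i} T_{ij} * x_j, per channel."""
    return np.einsum('ijd,jd->id', T, x)

x = np.random.randn(6, 4)                           # (sequence length, channels)
y = apply_operator(build_operator(causal_featurize(x)), x)
print(y.shape)                                      # (6, 4)
```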
Background (recurrences)
Consider a general linear recurrence formulated as follows:
$$s_i = A_i s_{i-1} + B_i x_i, \qquad y_i = C_i s_i,$$
where $A_i \in \mathbb{R}^{n_i \times n_{i-1}}$, $B_i \in \mathbb{R}^{n_i \times d}$, $C_i \in \mathbb{R}^{d \times n_i}$; $s_i \in \mathbb{R}^{n_i}$ and $n_i$ are the state and state-size at sequence index $i$ respectively.
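A direct sequential implementation of this recurrence, assuming for simplicity a fixed state size $n$ (illustrative names, not the paper's code):

```python
# Sequential evaluation of the linear recurrence s_i = A_i s_{i-1} + B_i x_i,
# y_i = C_i s_i, with per-step matrices A_i, B_i, C_i.
import numpy as np

def linear_recurrence(A, B, C, x):
    """A: (L, n, n), B: (L, n, d), C: (L, d, n), x: (L, d) -> y: (L, d)."""
    L, n, _ = A.shape
    s = np.zeros(n)
    y = np.zeros_like(x)
    for i in range(L):
        s = A[i] @ s + B[i] @ x[i]   # state update at index i
        y[i] = C[i] @ s              # readout at index i
    return y

L, n, d = 8, 3, 2
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(L, n, n))
B = rng.normal(size=(L, n, d))
C = rng.normal(size=(L, d, n))
x = rng.normal(size=(L, d))
print(linear_recurrence(A, B, C, x).shape)   # (8, 2)
```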
Creating the operator T
- In this section, we begin by showing that most modern sequence models can effectively materialize a linear operator $T$.
- Using the flattened notation, we let $T \in \mathbb{R}^{\ell d \times \ell d}$, $x \in \mathbb{R}^{\ell d}$, and $y \in \mathbb{R}^{\ell d}$ denote the operator, inputs, and outputs respectively, $\ell$ denote the sequence length, and $d$ denote the channel dimension. Here, we index sequence indices with subscripts, i.e. $x_i$, $T_{ij}$, and channels (and other non-temporal dimensions) with superscripts, i.e. $x^a$, $T^{ab}$. For additional details on notation, refer to Section D.1. A sketch of materializing $T$ for the recurrence above follows this list.
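To make the materialization concrete, the sketch below unrolls the recurrence from the previous section into the flattened operator, with blocks $T_{ij} = C_i A_i \cdots A_{j+1} B_j$ (and $T_{ii} = C_i B_i$), and checks that $y = Tx$ reproduces the sequential computation. A fixed state size is assumed and all names are illustrative:

```python
# Materialize the recurrence as a block-lower-triangular operator T and
# verify that y = T x matches the sequential recurrence.
import numpy as np

def materialize_T(A, B, C):
    """Return T with blocks T_{ij} = C_i A_i ... A_{j+1} B_j, flattened to (l*d, l*d)."""
    L, n, _ = A.shape
    d = B.shape[2]
    T = np.zeros((L, L, d, d))
    for i in range(L):
        prod = np.eye(n)
        for j in range(i, -1, -1):
            T[i, j] = C[i] @ prod @ B[j]   # block mapping x_j to y_i
            prod = prod @ A[j]             # accumulate state transitions
    # Flatten with the sequence index outermost and the channel index innermost.
    return T.transpose(0, 2, 1, 3).reshape(L * d, L * d)

L, n, d = 6, 3, 2
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(L, n, n))
B = rng.normal(size=(L, n, d))
C = rng.normal(size=(L, d, n))
x = rng.normal(size=(L, d))

# y = T x in flattened notation ...
y_flat = materialize_T(A, B, C) @ x.reshape(-1)

# ... reproduces the sequential recurrence.
s, y_rec = np.zeros(n), np.zeros_like(x)
for i in range(L):
    s = A[i] @ s + B[i] @ x[i]
    y_rec[i] = C[i] @ s
assert np.allclose(y_flat.reshape(L, d), y_rec)
```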
A unified representation of sequence models
While typically nonlinear, most sequence models of interest can effectively materialize a linear operator $T(x) \in \mathbb{R}^{\ell d \times \ell d}$, where the following equation faithfully expresses the computation performed by the model (see Section E.2 for further elaboration):
$$y = T(x)\,x$$
- They make a distinction between linear systems (such as convolutions) and LIVs (such as attention and gated convolutions).
- In the former, the operator is input-invariant, whereas the latter are constructed via causal featurizers that map past inputs into features, i.e. $z_i = f(x_{1:i})$, which are then used to construct the elements of $T(x)$ as outlined above (see the attention sketch below).
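For instance, single-head causal softmax attention can be read as an LIV: the featurization produces queries and keys from the inputs, the resulting causal attention weights define the blocks of $T(x)$, and $y = T(x)\,x$ matches the usual attention output. The sketch below assumes a single head, no output projection, and folds the value projection into the operator; all names are illustrative:

```python
# Causal softmax attention written as y = T(x) x, with T(x)_{ij} = a_{ij} * Wv^T.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

L, d = 5, 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(L, d))

# Featurization: per-step queries and keys (each depends only on past inputs).
q, k = x @ Wq, x @ Wk

# Causal attention weights a_{ij} (zero for j > i).
scores = q @ k.T / np.sqrt(d)
scores[np.triu_indices(L, k=1)] = -np.inf
a = softmax(scores)

# Standard attention output (values folded in, no output projection).
y_attn = a @ (x @ Wv)

# The same computation as y = T(x) x in flattened notation.
T = np.einsum('ij,ab->iajb', a, Wv.T).reshape(L * d, L * d)
y_liv = (T @ x.reshape(-1)).reshape(L, d)
assert np.allclose(y_attn, y_liv)
```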