Introduction
- The majority of sequence models of practical interest can formally be expressed as either linear systems ($y = Tx$) or systems with input-varying linear operators ($y = T(x)\,x$), the latter of which they abbreviate to LIV.
- The input-varying linear operator framework decouples the input-varying featurization from the linear mapping used to construct and apply the operator (see the sketch after this list).
- This decomposition enables a wide array of deep learning primitives to be uniformly formulated as linear systems, including models like convolutions, linear state-transition recurrences, and attention variants.
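As a rough illustration of the decomposition, the following NumPy sketch (hypothetical names, not the paper's code) builds a toy causal operator from a per-step gating featurization and then applies it linearly to the input:

```python
# Toy LIV decomposition: featurize the input, construct a causal operator
# from the features, then apply the operator linearly to the input.
import numpy as np

def causal_featurize(x):
    """Map inputs to per-step features; here a simple gate (assumed example)."""
    return 1.0 / (1.0 + np.exp(-x))                 # (L, d) gates in (0, 1)

def build_operator(feats):
    """Construct a lower-triangular (causal) per-channel operator from features."""
    L, d = feats.shape
    T = np.zeros((L, L, d))
    for i in range(L):
        for j in range(i + 1):
            # Toy rule: entry (i, j) decays with the product of gates j+1..i
            # (empty product = 1), mimicking a gated linear recurrence.
            T[i, j] = np.prod(feats[j + 1:i + 1], axis=0)
    return T

def apply_operator(T, x):
    """y_i = sum_{j <= i} T_{ij} * x_j, per channel."""
    return np.einsum('ijd,jd->id', T, x)

x = np.random.randn(6, 4)                           # (sequence length, channels)
y = apply_operator(build_operator(causal_featurize(x)), x)
print(y.shape)                                      # (6, 4)
```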
Background (recurrences)
Consider a general linear recurrence formulated as follows:
$$s_i = A_i s_{i-1} + B_i x_i, \qquad y_i = C_i s_i,$$
where $A_i \in \mathbb{R}^{n_i \times n_{i-1}}$, $B_i \in \mathbb{R}^{n_i \times d}$, $C_i \in \mathbb{R}^{d \times n_i}$; $s_i \in \mathbb{R}^{n_i}$ and $n_i$ are the state and state-size at sequence index $i$ respectively.
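A direct sequential implementation of this recurrence, assuming for simplicity a fixed state size $n$ (illustrative names, not the paper's code):

```python
# Sequential evaluation of the linear recurrence s_i = A_i s_{i-1} + B_i x_i,
# y_i = C_i s_i, with per-step matrices A_i, B_i, C_i.
import numpy as np

def linear_recurrence(A, B, C, x):
    """A: (L, n, n), B: (L, n, d), C: (L, d, n), x: (L, d) -> y: (L, d)."""
    L, n, _ = A.shape
    s = np.zeros(n)
    y = np.zeros_like(x)
    for i in range(L):
        s = A[i] @ s + B[i] @ x[i]   # state update at index i
        y[i] = C[i] @ s              # readout at index i
    return y

L, n, d = 8, 3, 2
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(L, n, n))
B = rng.normal(size=(L, n, d))
C = rng.normal(size=(L, d, n))
x = rng.normal(size=(L, d))
print(linear_recurrence(A, B, C, x).shape)   # (8, 2)
```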
Creating the operator T
- In this section, we begin by showing that most modern sequence models can effectively materialize a linear operator $T$.
- Using the flattened notation, we let $T \in \mathbb{R}^{\ell d \times \ell d}$, $x \in \mathbb{R}^{\ell d}$, and $y \in \mathbb{R}^{\ell d}$ denote the operator, inputs, and outputs respectively, $\ell$ denote the sequence length, and $d$ denote the channel dimension. Here, we index sequence indices with subscripts, i.e. $x_i$, $T_{ij}$, and channels (and other non-temporal dimensions) with superscripts, i.e. $x^a$, $T^{ab}$. For additional details on notation, refer to Section D.1. A sketch of materializing $T$ for the recurrence above follows this list.
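To make the materialization concrete, the sketch below unrolls the recurrence from the previous section into the flattened operator, with blocks $T_{ij} = C_i A_i \cdots A_{j+1} B_j$ (and $T_{ii} = C_i B_i$), and checks that $y = Tx$ reproduces the sequential computation. A fixed state size is assumed and all names are illustrative:

```python
# Materialize the recurrence as a block-lower-triangular operator T and
# verify that y = T x matches the sequential recurrence.
import numpy as np

def materialize_T(A, B, C):
    """Return T with blocks T_{ij} = C_i A_i ... A_{j+1} B_j, flattened to (l*d, l*d)."""
    L, n, _ = A.shape
    d = B.shape[2]
    T = np.zeros((L, L, d, d))
    for i in range(L):
        prod = np.eye(n)
        for j in range(i, -1, -1):
            T[i, j] = C[i] @ prod @ B[j]   # block mapping x_j to y_i
            prod = prod @ A[j]             # accumulate state transitions
    # Flatten with the sequence index outermost and the channel index innermost.
    return T.transpose(0, 2, 1, 3).reshape(L * d, L * d)

L, n, d = 6, 3, 2
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(L, n, n))
B = rng.normal(size=(L, n, d))
C = rng.normal(size=(L, d, n))
x = rng.normal(size=(L, d))

# y = T x in flattened notation ...
y_flat = materialize_T(A, B, C) @ x.reshape(-1)

# ... reproduces the sequential recurrence.
s, y_rec = np.zeros(n), np.zeros_like(x)
for i in range(L):
    s = A[i] @ s + B[i] @ x[i]
    y_rec[i] = C[i] @ s
assert np.allclose(y_flat.reshape(L, d), y_rec)
```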
A unified representation of sequence models
While typically nonlinear, most sequence models of interest can effectively materialize a linear operator $T(x) \in \mathbb{R}^{\ell d \times \ell d}$, where the following equation faithfully expresses the computation performed by the model (see Section E.2 for further elaboration):
$$y = T(x)\,x$$
- They make a distinction between linear systems (such as convolutions) and LIVs (such as attention and gated convolutions).
- In the former, the operator is input-invariant, whereas the latter are constructed via causal featurizers that map past inputs into features, i.e. $z_i = f(x_{1:i})$, which are then used to construct the elements of $T(x)$ as outlined above (see the attention sketch below).
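For instance, single-head causal softmax attention can be read as an LIV: the featurization produces queries and keys from the inputs, the resulting causal attention weights define the blocks of $T(x)$, and $y = T(x)\,x$ matches the usual attention output. The sketch below assumes a single head, no output projection, and folds the value projection into the operator; all names are illustrative:

```python
# Causal softmax attention written as y = T(x) x, with T(x)_{ij} = a_{ij} * Wv^T.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

L, d = 5, 4
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(L, d))

# Featurization: per-step queries and keys (each depends only on past inputs).
q, k = x @ Wq, x @ Wk

# Causal attention weights a_{ij} (zero for j > i).
scores = q @ k.T / np.sqrt(d)
scores[np.triu_indices(L, k=1)] = -np.inf
a = softmax(scores)

# Standard attention output (values folded in, no output projection).
y_attn = a @ (x @ Wv)

# The same computation as y = T(x) x in flattened notation.
T = np.einsum('ij,ab->iajb', a, Wv.T).reshape(L * d, L * d)
y_liv = (T @ x.reshape(-1)).reshape(L, d)
assert np.allclose(y_attn, y_liv)
```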