Like standard RNNs, conventional discrete SSMs are inherently weak at modeling long-distance dependencies. In other words, they struggle to capture the relationship between steps in a sequence that are far apart, such as words at the beginning and end of a paragraph, which makes them poorly suited to long sequences (such as text data) in general.
To address this, Gu et al. proposed a technique called HiPPO (short for High-order Polynomial Projection Operators) to define the way the A and B matrices behave.
Polynomial functions combine one or more terms. Each term comprises a coefficient and a basis function of some variable. For instance, 3x² is a term whose coefficient is 3 and whose basis is x². A polynomial’s “order” is determined by the highest power of any basis it contains: 3x² + 5x is a “second-order polynomial.” The higher a polynomial’s order, the more intricate detail can be captured in its curves.
Orthogonal polynomial functions are special “families” of polynomials, spanning multiple orders, in which each polynomial is mathematically independent of the others, ensuring there’s no redundant overlap or informational dependency between them. They’re also very robust to minor rounding errors, making them useful for approximating more complex functions. Families of orthogonal polynomials are themselves generated by a rule called a three-term recurrence formula. The HiPPO method uses such recurrence formulae to construct the A and B matrices.
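To make orthogonality and the three-term recurrence concrete, here is a minimal NumPy sketch (the function name legendre_basis is illustrative, not from any library): it generates Legendre polynomials, the orthogonal family underlying the HiPPO-LegS variant, from their recurrence and numerically checks that distinct members have essentially zero overlap.

```python
import numpy as np

def legendre_basis(order, x):
    """Evaluate Legendre polynomials P_0 .. P_order at the points x using their
    three-term recurrence: (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)."""
    P = [np.ones_like(x), x]
    for n in range(1, order):
        P.append(((2 * n + 1) * x * P[n] - n * P[n - 1]) / (n + 1))
    return np.stack(P[: order + 1])

# Numerically check orthogonality on [-1, 1]: the inner product of P_m and P_n
# should be ~0 whenever m != n (and 2 / (2n + 1) when m == n).
x = np.linspace(-1.0, 1.0, 20001)
P = legendre_basis(4, x)
gram = (P @ P.T) * (x[1] - x[0])   # Riemann-sum approximation of the integrals
print(np.round(gram, 3))           # off-diagonal entries come out ~0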
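The near-zero off-diagonal entries of the printed matrix are exactly the “no redundant overlap” property described above: each polynomial in the family carries information the others do not.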
Each time the state hₜ is updated by the state equation (hₜ = Ahₜ₋₁ + Bxₜ), the elements of the state vector hₜ act as the coefficients of polynomial expressions that approximate the original input. Older inputs are approximated through lower-order polynomials that capture broad, low-frequency (long-term) details, while more recent inputs are approximated through higher-order polynomials that capture fine-grained, high-frequency (short-term) details. Since the chosen polynomials are orthogonal, no information is repeated. In essence, this structure forces the state space to “memorize” the entire input history by efficiently “compressing” it into a fixed-size vector of coefficients.
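As a hedged illustration of that compression (the function names hippo_legs and discretize_bilinear are my own, and the step size and bilinear discretization are illustrative choices rather than anything prescribed here), the sketch below builds the HiPPO-LegS A and B matrices as given in the S4 paper, discretizes them, and folds a 1,000-step signal into a 16-element state vector.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS matrices as written in the S4 paper:
    A[n, k] = -sqrt(2n+1)*sqrt(2k+1) if n > k, -(n+1) if n == k, 0 if n < k,
    with B[n] = sqrt(2n+1) (scaling conventions vary between papers)."""
    n = np.arange(N)
    full = -np.sqrt(2 * n[:, None] + 1.0) * np.sqrt(2 * n[None, :] + 1.0)
    A = np.tril(full, k=-1) + np.diag(-(n + 1.0))
    B = np.sqrt(2 * n + 1.0)
    return A, B

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin) step: returns Abar, Bbar so that h_t = Abar @ h_{t-1} + Bbar * x_t."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (dt / 2) * A)
    return inv @ (I + (dt / 2) * A), inv @ (dt * B)

# Fold a 1,000-step input signal into a fixed 16-dimensional state vector.
N, T = 16, 1000
A, B = hippo_legs(N)
Abar, Bbar = discretize_bilinear(A, B, dt=1.0 / T)

x = np.sin(2 * np.pi * 3 * np.linspace(0.0, 1.0, T))   # toy input sequence
h = np.zeros(N)                                         # fixed-size "memory"
for x_t in x:
    h = Abar @ h + Bbar * x_t                           # one state update per input step

print(h.shape)   # (16,) -- the whole history is summarized by 16 coefficients
```

In a full HiPPO setup, the original signal can be approximately reconstructed by evaluating the polynomial expansion those 16 coefficients define, which is what “memorizing” the input history means here.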
The S4 paper notes that “simply modifying an SSM from a random matrix A to [the HiPPO Matrix] improved its performance on the sequential MNIST benchmark from 60% to 98%,” effectively solving SSMs’ long-term memory problem. Later variations of structured SSMs, such as DSS, S5 and Mamba, use different (often simpler) initialization schemes for A and B, but retain the core HiPPO principles.