Compressed Convolutional Attention (CCA)

Only works at small scale?

One cause for concern, however, is that MLA and GQA, like MHA, apply only linear projections to q, k, and v, whereas CCA performs more involved operations on the compressed latents.
This builds an inductive bias into CCA that is less present in MLA and GQA. Because larger models trained on more data can often learn such structure for themselves, CCA may outperform at small scales while its benefits lessen at larger scales. Testing how CCA compares with these alternatives at larger scales is needed to verify or falsify this hypothesis.
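
To make the kind of inductive bias at stake concrete, here is a minimal NumPy sketch. It is not the CCA implementation: it assumes, purely for illustration, that the "more involved operation" is a causal depthwise convolution over the sequence axis of the latents. The function names, shapes, and kernel size are all hypothetical. The sketch perturbs one input token and reports which latent rows change.

```python
import numpy as np

def linear_latent(x, w_down):
    """Per-token linear down-projection (illustrative stand-in for MLA/GQA-style projections).

    Each output row depends only on the matching input row, so no tokens are mixed.
    """
    return x @ w_down

def causal_depthwise_conv(z, kernel):
    """Causal depthwise convolution over the sequence axis (assumed stand-in for CCA's extra ops).

    z: (seq_len, d_latent), kernel: (k, d_latent).
    Output row t mixes rows t-k+1 .. t of z, channel by channel.
    """
    k = kernel.shape[0]
    # Left-pad with zeros so that position t never sees future positions.
    padded = np.concatenate([np.zeros((k - 1, z.shape[1])), z], axis=0)
    out = np.zeros_like(z)
    for i in range(k):
        out += kernel[i] * padded[i : i + z.shape[0]]
    return out

def changed_rows(a, b, tol=1e-12):
    """Indices of rows that differ between a and b."""
    return [int(t) for t in np.flatnonzero(np.abs(a - b).max(axis=1) > tol)]

rng = np.random.default_rng(0)
seq_len, d_model, d_latent, k = 6, 8, 4, 3  # hypothetical toy sizes
x = rng.standard_normal((seq_len, d_model))
w_down = rng.standard_normal((d_model, d_latent))
kernel = rng.standard_normal((k, d_latent))

# Perturb a single token and see how far the change propagates.
x_pert = x.copy()
x_pert[2] += 1.0

lin_changed = changed_rows(linear_latent(x, w_down), linear_latent(x_pert, w_down))
conv_changed = changed_rows(
    causal_depthwise_conv(linear_latent(x, w_down), kernel),
    causal_depthwise_conv(linear_latent(x_pert, w_down), kernel),
)
print("linear projection only:", lin_changed)
print("projection + causal conv:", conv_changed)
```

With the pure projection, only the perturbed token's latent changes. With the convolution, the change also reaches the next k-1 positions. That fixed local mixing is a prior imposed by the architecture rather than learned, which is the sort of bias the scaling question above is about.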