- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Nice blog on LLM inference optimizations: https://vgel.me/posts/faster-inference/
- Tim Dettmers on quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
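
  As a reference for the core idea in that post, a minimal sketch of absmax int8 quantization (quantize a tensor to int8 and back); the function names here are my own, not the bitsandbytes API.

  ```python
  import torch

  def absmax_quantize(x: torch.Tensor):
      # Scale so the largest-magnitude value maps to 127, then round to int8.
      scale = 127.0 / x.abs().max()
      x_q = (x * scale).round().clamp(-128, 127).to(torch.int8)
      return x_q, scale

  def absmax_dequantize(x_q: torch.Tensor, scale: torch.Tensor):
      # Recover an approximation of the original float values.
      return x_q.to(torch.float32) / scale

  w = torch.randn(4, 8)
  w_q, s = absmax_quantize(w)
  w_hat = absmax_dequantize(w_q, s)
  print((w - w_hat).abs().max())  # quantization error
  ```

  The post's main point is that per-tensor schemes like this degrade on the outlier feature dimensions that emerge in large models, which is what motivates LLM.int8()'s mixed-precision decomposition.
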
- You can build a custom TorchDynamo backend (via torch.compile) to plug in a more efficient inference path; see the sketch below.
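
  A minimal sketch of what that looks like, using the documented torch.compile custom-backend hook: the backend receives the captured FX GraphModule plus example inputs and returns a callable. This one just inspects the graph and falls back to eager execution; a real backend would lower the graph to a faster compiler or kernel library.

  ```python
  import torch
  from typing import Callable, List

  def my_backend(gm: torch.fx.GraphModule,
                 example_inputs: List[torch.Tensor]) -> Callable:
      # Inspect the captured graph; a real backend would lower it here
      # and return the compiled function instead.
      print("captured ops:",
            [node.target for node in gm.graph.nodes if node.op == "call_function"])
      return gm.forward  # fall back to running the captured graph eagerly

  @torch.compile(backend=my_backend)
  def f(x, y):
      return torch.relu(x @ y) + 1.0

  f(torch.randn(8, 8), torch.randn(8, 8))
  ```
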
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Seems potentially useful for compressing the KV cache or developing alternative methods: https://arxiv.org/abs/2404.15574
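
  Not the paper's method, just a generic sketch of the kind of KV-cache eviction such work might inform: keep a recent window plus the positions that have received the most attention, and drop the rest. Names and the scoring rule here are illustrative only.

  ```python
  import torch

  def compress_kv(k: torch.Tensor, v: torch.Tensor,
                  attn_weights: torch.Tensor, keep: int, recent: int = 32):
      """Keep the `recent` most recent positions plus the `keep` positions
      with the highest accumulated attention mass; drop everything else.

      k, v:          (seq_len, head_dim) cached keys/values for one head
      attn_weights:  (num_queries, seq_len) attention probabilities seen so far
      """
      seq_len = k.size(0)
      score = attn_weights.sum(dim=0)        # total attention each position received
      score[-recent:] = float("inf")         # always keep the recent window
      keep_idx = score.topk(min(keep + recent, seq_len)).indices.sort().values
      return k[keep_idx], v[keep_idx], keep_idx

  # Example: shrink a 1024-token cache to ~160 entries for one head.
  k, v = torch.randn(1024, 64), torch.randn(1024, 64)
  attn = torch.softmax(torch.randn(16, 1024), dim=-1)
  k_c, v_c, idx = compress_kv(k, v, attn, keep=128)
  print(k_c.shape)
  ```

  In practice this would run per head and per layer inside the attention module, and the eviction criterion is exactly the part different methods disagree on.
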