- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Nice blog on LLM inference optimizations: https://vgel.me/posts/faster-inference/
- Tim Dettmers on quantization: https://timdettmers.com/2022/08/17/llm-int8-and-emergent-features/
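
  As a reference for the core idea in that post, a minimal sketch of absmax int8 quantization (quantize a tensor to int8 and back); the function names here are my own, not the bitsandbytes API.

  ```python
  import torch

  def absmax_quantize(x: torch.Tensor):
      # Scale so the largest-magnitude value maps to 127, then round to int8.
      scale = 127.0 / x.abs().max()
      x_q = (x * scale).round().clamp(-128, 127).to(torch.int8)
      return x_q, scale

  def absmax_dequantize(x_q: torch.Tensor, scale: torch.Tensor):
      # Recover an approximation of the original float values.
      return x_q.to(torch.float32) / scale

  w = torch.randn(4, 8)
  w_q, s = absmax_quantize(w)
  w_hat = absmax_dequantize(w_q, s)
  print((w - w_hat).abs().max())  # quantization error
  ```

  The post's main point is that per-tensor schemes like this degrade on the outlier feature dimensions that emerge in large models, which is what motivates LLM.int8()'s mixed-precision decomposition.
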
- You can build a custom TorchDynamo backend (via torch.compile) to plug in a more efficient inference path; see the sketch below.
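
  A minimal sketch of what that looks like, using the documented torch.compile custom-backend hook: the backend receives the captured FX GraphModule plus example inputs and returns a callable. This one just inspects the graph and falls back to eager execution; a real backend would lower the graph to a faster compiler or kernel library.

  ```python
  import torch
  from typing import Callable, List

  def my_backend(gm: torch.fx.GraphModule,
                 example_inputs: List[torch.Tensor]) -> Callable:
      # Inspect the captured graph; a real backend would lower it here
      # and return the compiled function instead.
      print("captured ops:",
            [node.target for node in gm.graph.nodes if node.op == "call_function"])
      return gm.forward  # fall back to running the captured graph eagerly

  @torch.compile(backend=my_backend)
  def f(x, y):
      return torch.relu(x @ y) + 1.0

  f(torch.randn(8, 8), torch.randn(8, 8))
  ```
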
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
- Seems potentially useful for compressing the KV cache or developing alternative methods: https://arxiv.org/abs/2404.15574
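
  Not the paper's method, just a generic sketch of the kind of KV-cache eviction such work might inform: keep a recent window plus the positions that have received the most attention, and drop the rest. Names and the scoring rule here are illustrative only.

  ```python
  import torch

  def compress_kv(k: torch.Tensor, v: torch.Tensor,
                  attn_weights: torch.Tensor, keep: int, recent: int = 32):
      """Keep the `recent` most recent positions plus the `keep` positions
      with the highest accumulated attention mass; drop everything else.

      k, v:          (seq_len, head_dim) cached keys/values for one head
      attn_weights:  (num_queries, seq_len) attention probabilities seen so far
      """
      seq_len = k.size(0)
      score = attn_weights.sum(dim=0)        # total attention each position received
      score[-recent:] = float("inf")         # always keep the recent window
      keep_idx = score.topk(min(keep + recent, seq_len)).indices.sort().values
      return k[keep_idx], v[keep_idx], keep_idx

  # Example: shrink a 1024-token cache to ~160 entries for one head.
  k, v = torch.randn(1024, 64), torch.randn(1024, 64)
  attn = torch.softmax(torch.randn(16, 1024), dim=-1)
  k_c, v_c, idx = compress_kv(k, v, attn, keep=128)
  print(k_c.shape)
  ```

  In practice this would run per head and per layer inside the attention module, and the eviction criterion is exactly the part different methods disagree on.
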