DeepSeek-V3

  • Unlike Gloeckle et al. (2024), who predict 𝐷 additional tokens in parallel using independent output heads, DeepSeek-V3 predicts the additional tokens sequentially and keeps the complete causal chain at each prediction depth.
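
The contrast between the two schemes can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the DeepSeek-V3 implementation: the sizes, the random weights, and every name (`predict_parallel`, `predict_sequential`, `merge_proj`, `shared_head`) are hypothetical, a `tanh` stands in for the per-depth Transformer block, and the normalize-concatenate-project merge and the shared output head are simplifications that should be checked against the technical report.

```python
import math
import random

rng = random.Random(0)
DIM, VOCAB, DEPTH = 4, 5, 2  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Random weights; a real model would learn these."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rms_norm(v):
    """Root-mean-square normalization without a learned gain."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + 1e-6)
    return [x / scale for x in v]


embedding = rand_matrix(VOCAB, DIM)                                # one row per token id
shared_head = rand_matrix(VOCAB, DIM)                              # assumed shared across depths
parallel_heads = [rand_matrix(VOCAB, DIM) for _ in range(DEPTH)]   # one independent head per offset
merge_proj = [rand_matrix(DIM, 2 * DIM) for _ in range(DEPTH)]     # per-depth merge projection


def predict_parallel(h):
    """Parallel scheme: every head reads the same trunk state h.

    No head sees another head's output or any of the intermediate tokens.
    """
    return [matvec(head, h) for head in parallel_heads]


def predict_sequential(h, next_tokens):
    """Sequential scheme: depth k builds on the state produced at depth k-1.

    next_tokens[k] is the token that is already known at depth k
    (the ground-truth token during training).
    """
    logits = []
    state = h
    for k in range(DEPTH):
        # Concatenate the previous-depth state with the next token's embedding.
        merged = rms_norm(state) + rms_norm(embedding[next_tokens[k]])
        # tanh is a stand-in for the Transformer block at this depth.
        state = [math.tanh(x) for x in matvec(merge_proj[k], merged)]
        logits.append(matvec(shared_head, state))
    return logits


h = [0.1, -0.2, 0.3, 0.4]               # trunk hidden state at one position
par = predict_parallel(h)               # DEPTH logit vectors, mutually independent
seq_a = predict_sequential(h, [1, 2])
seq_b = predict_sequential(h, [3, 2])   # only the depth-1 token differs
```

Changing only the first intermediate token (`seq_a` versus `seq_b`) also changes the depth-2 logits, because depth 2 consumes the depth-1 state; that dependency is what "keeping the causal chain" means here. The parallel heads take no intermediate tokens at all, so their outputs are fixed once `h` is fixed.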