Introducing Kimi K2 Thinking: A Deep-Reasoning Agent with Native INT4 Quantization
Kimi K2 Thinking is the latest and most capable open-source thinking model in the series. Designed from the ground up as a thinking agent, it reasons step by step while dynamically invoking tools to complete complex tasks. The model sets a new state of the art on benchmarks such as Humanity’s Last Exam (HLE) and BrowseComp by significantly extending its multi-step reasoning depth and maintaining stable performance across 200–300 consecutive tool calls.
A key technical highlight of K2 Thinking is its native INT4 quantization, which, combined with a 256K context window, reduces inference latency and GPU memory usage without degrading model quality.
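To see why the precision matters at this scale, here is a back-of-envelope comparison of the raw weight footprint at BF16 versus INT4, using the 1T total-parameter count from the specification table below. This is illustrative arithmetic only; real deployments add KV-cache, activations, and engine overhead on top of the weight storage.

```python
# Rough weight-storage arithmetic for a 1T-parameter model.
# Illustrative only: ignores KV-cache, activations, and runtime overhead.
TOTAL_PARAMS = 1_000_000_000_000  # 1T total parameters (from the spec table)

def weight_bytes(params: int, bits_per_param: float) -> float:
    """Raw weight storage in bytes for a given precision."""
    return params * bits_per_param / 8

bf16 = weight_bytes(TOTAL_PARAMS, 16)  # ~2.0 TB
int4 = weight_bytes(TOTAL_PARAMS, 4)   # ~0.5 TB

print(f"BF16 weights: {bf16 / 1e12:.1f} TB")
print(f"INT4 weights: {int4 / 1e12:.1f} TB ({bf16 / int4:.0f}x smaller)")
```

The 4x reduction in weight bytes is what makes both the memory savings and the low-latency speed-up plausible, since decoding at this scale is largely memory-bandwidth bound.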
Key Features
- Deep Thinking & Tool Orchestration: The model is end-to-end trained to interleave chain-of-thought reasoning with function calls. This enables it to handle autonomous research, coding, and writing workflows that can last for hundreds of steps without drifting from the original goal.
- Stable Long-Horizon Agency: K2 Thinking demonstrates coherent, goal-directed behavior for up to 200–300 consecutive tool invocations, a significant improvement over previous models that often saw performance degrade after 30–50 steps.
- Native INT4 Quantization: By employing Quantization-Aware Training (QAT) in its post-training stage, the model achieves a nearly 2x speed-up in low-latency mode without sacrificing performance.
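To make the INT4 idea concrete, the sketch below shows symmetric 4-bit quantization of a small weight vector with a single scale factor. This illustrates only the storage format; K2 Thinking's actual QAT recipe (group sizes, scale learning, which layers are quantized) is not described in this document.

```python
# Minimal symmetric INT4 quantization sketch (format illustration only;
# not the model's actual QAT recipe).

def quantize_int4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid divide-by-zero
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale for v in q]

w = [0.12, -0.05, 0.31, -0.27]
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Each code in q fits in 4 bits; w_hat matches w up to quantization error.
```

Quantization-aware training exposes the model to exactly this rounding during post-training, so the weights adapt to the reduced precision rather than losing quality to it after the fact.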
Model Architecture
Kimi K2 Thinking is built on a Mixture-of-Experts (MoE) architecture. Its key specifications are as follows:
| Specification | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 1T |
| Activated Parameters | 32B |
| Context Length | 256K |
| Vocabulary Size | 160K |
| Number of Layers | 61 (including 1 dense layer) |
| Number of Experts | 384 (8 selected per token, 1 shared) |
| Attention Mechanism | MLA |
| Activation Function | SwiGLU |
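The routing pattern in the table (8 experts selected per token out of 384, plus 1 always-on shared expert) can be sketched as a top-k selection over router scores. The router logits and gating details below are placeholder assumptions, not the model's actual implementation.

```python
import math
import random

# Toy MoE routing sketch matching the spec table: score 384 experts,
# activate the top 8, and always run the shared expert.
# Random logits stand in for a learned router.
NUM_EXPERTS, TOP_K = 384, 8

def route(logits):
    """Return top-k expert indices and softmax weights over just those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:TOP_K]
    z = [math.exp(logits[i]) for i in top]
    total = sum(z)
    return top, [v / total for v in z]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts, weights = route(logits)
# Token output = shared_expert(x) + sum of weight * expert(x) over the top 8.
```

Because only 8 of 384 routed experts run per token, roughly 32B of the 1T parameters are active for any given token, which is what keeps per-token compute tractable.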
Performance and Evaluation
Evaluation results show that Kimi K2 Thinking achieves state-of-the-art or highly competitive performance across a range of tasks. In reasoning tasks with tools, it scores 44.9 on HLE and 60.2 on BrowseComp, outperforming other leading models. It also demonstrates strong capabilities in coding, achieving a score of 71.3 on SWE-bench Verified and showing particular strength in multilingual coding benchmarks. All reported benchmark results were achieved using INT4 precision, underscoring the model’s efficiency.
Deployment and Usage
Developers can access Kimi K2 Thinking via an OpenAI/Anthropic-compatible API available at platform.moonshot.ai. For local deployment, the model is optimized to run on inference engines such as vLLM, SGLang, and KTransformers.
The model supports standard chat completion and advanced tool-calling functionalities. Users can define a list of available tools, and the model will autonomously decide when and how to use them to fulfill a request. The recommended temperature setting for general use is 1.0.
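The tool-calling flow described above can be sketched as follows: the caller declares tools in the standard function-call schema, and when the model returns a tool call, the caller executes it and feeds the result back. The `get_weather` tool and its stub implementation are illustrative assumptions; no API request is sent in this sketch.

```python
import json

# Hedged sketch of tool definition and dispatch for an OpenAI-compatible
# chat API. The get_weather tool is a made-up example; no network calls here.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub implementation for illustration

def dispatch(tool_call: dict) -> str:
    """Execute a model-issued tool call in the standard function-call shape."""
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])
    return {"get_weather": get_weather}[fn["name"]](**args)

# Shape of a tool call the model might return in a chat completion:
call = {"function": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
print(dispatch(call))  # → Sunny in Paris
```

In a real session, the tool result would be appended to the conversation as a tool message and the model would continue reasoning, repeating this loop for as many steps as the task requires.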
License
Both the model weights and the associated code repository are released under the Modified MIT License.