kimi-k2-thinking

Model Description

Introducing Kimi K2 Thinking: A Deep-Reasoning Agent with Native INT4 Quantization

Kimi K2 Thinking is the latest and most capable open-source thinking model in the Kimi series. Designed from the ground up as a thinking agent, it specializes in step-by-step reasoning while dynamically invoking tools to complete complex tasks. The model sets a new state of the art on benchmarks such as Humanity’s Last Exam (HLE) and BrowseComp by significantly extending its multi-step reasoning depth and maintaining stable performance across 200–300 consecutive tool calls.

A key technical highlight of K2 Thinking is its native INT4 quantization which, combined with a 256K context window, reduces inference latency and GPU memory usage without degrading output quality.
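
To put these savings in perspective, here is a back-of-the-envelope estimate of the weight memory footprint at the 1T-parameter scale. It is a rough sketch only: it ignores activation memory, the KV cache, and the overhead of quantization scales, and the byte widths are standard precision sizes rather than published figures.

```python
# Rough weight-memory estimate for a 1T-parameter model at common precisions.
# Ignores activation memory, KV cache, and per-group quantization scales.
TOTAL_PARAMS = 1e12

BYTES_PER_PARAM = {
    "FP16/BF16": 2.0,  # 16 bits per weight
    "INT8":      1.0,  # 8 bits per weight
    "INT4":      0.5,  # 4 bits per weight
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = TOTAL_PARAMS * nbytes / 1024**3
    print(f"{precision:>9}: ~{gib:,.0f} GiB of weights")

# INT4 quarters the weight footprint relative to BF16:
# roughly 0.5 TB instead of ~2 TB of weights for 1T parameters.
```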

Key Features

  • Deep Thinking & Tool Orchestration: The model is end-to-end trained to interleave chain-of-thought reasoning with function calls. This enables it to handle autonomous research, coding, and writing workflows that can last for hundreds of steps without drifting from the original goal.
  • Stable Long-Horizon Agency: K2 Thinking demonstrates coherent, goal-directed behavior for up to 200–300 consecutive tool invocations, a significant improvement over previous models that often saw performance degrade after 30–50 steps.
  • Native INT4 Quantization: By employing Quantization-Aware Training (QAT) in its post-training stage, the model achieves a nearly 2x speed-up in low-latency mode without sacrificing performance; a minimal sketch of the underlying technique follows this list.
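
Moonshot has not published the exact QAT recipe, so the following is a generic sketch of the standard technique QAT builds on: weight-only "fake quantization" in the forward pass with a straight-through estimator, so gradients flow through the non-differentiable rounding. The group size and symmetric INT4 scaling here are illustrative assumptions, not the model's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Simulate symmetric INT4 quantization of weights in the forward pass.

    Weights are split into groups, scaled into the signed 4-bit range
    [-8, 7], rounded, and de-quantized. Assumes w.numel() is divisible
    by group_size.
    """
    groups = w.reshape(-1, group_size)
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7)
    dq = (q * scale).reshape(w.shape)
    # Straight-through estimator: forward sees dq, backward sees identity.
    return w + (dq - w).detach()

class QATLinear(nn.Linear):
    """Linear layer whose forward pass always sees INT4-rounded weights."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, fake_quant_int4(self.weight), self.bias)

# Toy training step: the optimizer updates full-precision master weights,
# while every forward pass uses their INT4-quantized counterparts.
layer = QATLinear(64, 64)
loss = layer(torch.randn(8, 64)).sum()
loss.backward()
print(layer.weight.grad.shape)  # torch.Size([64, 64])
```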

Model Architecture

Kimi K2 Thinking is built on a Mixture-of-Experts (MoE) architecture. Its key specifications are as follows:

Specification          Value
---------------------  -------------------------------------
Architecture           Mixture-of-Experts (MoE)
Total Parameters       1T
Activated Parameters   32B
Context Length         256K
Vocabulary Size        160K
Number of Layers       61 (including 1 dense layer)
Number of Experts      384 (8 selected per token, 1 shared)
Attention Mechanism    MLA
Activation Function    SwiGLU
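
These numbers fix the per-token compute path: each token is processed by 1 shared expert plus 8 of the 384 routed experts, which is why only about 32B of the 1T total parameters are active per forward pass. The toy sketch below illustrates that top-k routing pattern; the hidden sizes are made up for brevity, and the real router and expert internals are not published here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS, TOP_K = 384, 8   # as in the spec table
D_MODEL, D_FF = 64, 128       # toy sizes, not the real hidden dims

def make_expert() -> nn.Module:
    # The real model uses SwiGLU experts; a plain SiLU MLP stands in here.
    return nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.SiLU(), nn.Linear(D_FF, D_MODEL))

class TinyMoE(nn.Module):
    """Toy MoE layer: 1 shared expert plus top-k routed experts per token."""
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(make_expert() for _ in range(NUM_EXPERTS))
        self.shared = make_expert()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, D_MODEL)
        gates = F.softmax(self.router(x), dim=-1)         # (num_tokens, NUM_EXPERTS)
        top_w, top_i = gates.topk(TOP_K, dim=-1)          # keep only 8 experts per token
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize kept weights
        outs = []
        for t in range(x.size(0)):                        # naive per-token dispatch
            y = self.shared(x[t])                         # shared expert always runs
            for w, i in zip(top_w[t], top_i[t]):
                y = y + w * self.experts[int(i)](x[t])
            outs.append(y)
        return torch.stack(outs)

print(TinyMoE()(torch.randn(4, D_MODEL)).shape)  # torch.Size([4, 64])
```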

Performance and Evaluation

Evaluation results show that Kimi K2 Thinking achieves state-of-the-art or highly competitive performance across a range of tasks. In reasoning tasks with tools, it scores 44.9 on HLE and 60.2 on BrowseComp, outperforming other leading models. It also demonstrates strong capabilities in coding, achieving a score of 71.3 on SWE-bench Verified and showing particular strength in multilingual coding benchmarks. All reported benchmark results were achieved using INT4 precision, underscoring the model’s efficiency.

Deployment and Usage

Developers can access Kimi K2 Thinking via an OpenAI/Anthropic-compatible API available at platform.moonshot.ai. For local deployment, the model is optimized to run on inference engines such as vLLM, SGLang, and KTransformers.

The model supports standard chat completion and advanced tool calling: users can define a list of available tools, and the model autonomously decides when and how to use them to fulfill a request. The recommended temperature setting for general use is 1.0.
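
As a concrete illustration, a tool-calling request through the OpenAI-compatible endpoint might look like the sketch below. The base URL, model identifier, and the get_weather tool are assumptions for illustration; consult the documentation at platform.moonshot.ai for the exact values.

```python
from openai import OpenAI

# Base URL and model name are illustrative; check the platform docs.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key="YOUR_API_KEY",
)

# A hypothetical tool the model may decide to call on its own.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="kimi-k2-thinking",
    messages=[{"role": "user", "content": "Do I need an umbrella in Beijing today?"}],
    tools=tools,
    temperature=1.0,  # recommended setting for general use
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    # The model chose to invoke a tool; execute it and return the result
    # in a follow-up "tool" message to continue the loop.
    for call in choice.message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(choice.message.content)
```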

License

Both the model weights and the associated code repository are released under the Modified MIT License.

🔔 How to Use

graph LR
    A("Purchase Now") --> B["Start Chat on Homepage"]
    A --> D["Read API Documentation"]
    B --> C["Register / Login"]
    C --> E["Enter Key"]
    D --> F["Enter Endpoint & Key"]
    E --> G("Start Using")
    F --> G
    style A fill:#f9f9f9,stroke:#333,stroke-width:1px
    style B fill:#f9f9f9,stroke:#333,stroke-width:1px
    style C fill:#f9f9f9,stroke:#333,stroke-width:1px
    style D fill:#f9f9f9,stroke:#333,stroke-width:1px
    style E fill:#f9f9f9,stroke:#333,stroke-width:1px
    style F fill:#f9f9f9,stroke:#333,stroke-width:1px
    style G fill:#f9f9f9,stroke:#333,stroke-width:1px

Recommended Models

kimi-k2.5

Kimi K2.5 is a native multimodal model that significantly advances visual understanding and coding capabilities while introducing a revolutionary multi-agent swarm system for tackling complex, large-scale tasks.

o3-pro

The o-series of models are trained with reinforcement learning to think before they answer and perform complex reasoning. The o3-pro model uses more compute to think harder and provide consistently better answers. o3-pro is available in the Responses API only to enable support for multi-turn model interactions before responding to API requests, and other advanced API features in the future. Since o3-pro is designed to tackle tough problems, some requests may take several minutes to finish. To avoid timeouts, try using background mode.

gpt-4o-mini-rev

This model is provided by reverse-engineering the model calls inside the official application and exposing them as an API.