kimi-k2-thinking

Model Description

Introducing Kimi K2 Thinking: A Deep-Reasoning Agent with Native INT4 Quantization

Kimi K2 Thinking is the latest and most capable model in Moonshot AI's series of open-source thinking models. Designed from the ground up as a thinking agent, it reasons step by step while dynamically invoking tools to complete complex tasks. The model sets a new state of the art on benchmarks like Humanity's Last Exam (HLE) and BrowseComp by significantly extending its multi-step reasoning depth and maintaining stable performance across 200–300 consecutive tool calls.

A key technical highlight of K2 Thinking is its implementation as a native INT4 quantization model, which, combined with a 256K context window, reduces inference latency and GPU memory usage without degrading output quality.

Key Features

  • Deep Thinking & Tool Orchestration: The model is end-to-end trained to interleave chain-of-thought reasoning with function calls. This enables it to handle autonomous research, coding, and writing workflows that can last for hundreds of steps without drifting from the original goal.
  • Stable Long-Horizon Agency: K2 Thinking demonstrates coherent, goal-directed behavior for up to 200–300 consecutive tool invocations, a significant improvement over previous models that often saw performance degrade after 30–50 steps.
  • Native INT4 Quantization: By employing Quantization-Aware Training (QAT) in its post-training stage, the model achieves a nearly 2x speed-up in low-latency mode without sacrificing performance. A generic sketch of the fake-quantization idea behind QAT follows this list.
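To make the QAT idea concrete, below is a minimal, generic sketch of INT4 fake quantization with a straight-through estimator, the standard mechanism behind quantization-aware training. This is an illustration only, not Moonshot's actual recipe; the group size and symmetric scheme are assumptions.

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Symmetric per-group INT4 fake quantization with a straight-through
    estimator (STE). Illustrative only; not Moonshot's actual QAT recipe."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)                   # quantize in small groups
    scale = g.abs().amax(dim=1, keepdim=True) / 7   # INT4 range is [-8, 7]
    scale = scale.clamp(min=1e-8)                   # avoid division by zero
    q = (g / scale).round().clamp(-8, 7)            # quantize to 4-bit grid
    deq = q * scale                                 # dequantize
    # STE: the forward pass uses the quantized values, while the backward
    # pass treats quantization as the identity so gradients still flow.
    out = g + (deq - g).detach()
    return out.reshape(orig_shape)

# During QAT, weights pass through fake quantization on every forward pass,
# so the network learns to be robust to INT4 rounding before deployment.
w = torch.randn(128, 256, requires_grad=True)
loss = fake_quant_int4(w).sum()
loss.backward()  # gradients reach w via the straight-through estimator
```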

Model Architecture

Kimi K2 Thinking is built on a Mixture-of-Experts (MoE) architecture. Its key specifications are as follows:

| Specification        | Value                                 |
|----------------------|---------------------------------------|
| Architecture         | Mixture-of-Experts (MoE)              |
| Total Parameters     | 1T                                    |
| Activated Parameters | 32B                                   |
| Context Length       | 256K                                  |
| Vocabulary Size      | 160K                                  |
| Number of Layers     | 61 (including 1 dense layer)          |
| Number of Experts    | 384 (8 selected per token, 1 shared)  |
| Attention Mechanism  | MLA                                   |
| Activation Function  | SwiGLU                                |
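As a rough sanity check on the sparsity implied by the table, the back-of-envelope below relates the top-8 routing to the activated parameter count. The assumptions (expert FFNs dominate the 1T total; all routed experts are equally sized) are hypothetical simplifications for illustration.

```python
# Back-of-envelope MoE sparsity check based on the spec table above.
total_params = 1_000e9   # ~1T total parameters
activated_params = 32e9  # ~32B activated per token
n_experts = 384          # routed experts
top_k = 8                # experts selected per token

# If nearly all parameters sat in routed experts, the activated fraction
# would simply be top_k / n_experts:
routed_fraction = top_k / n_experts  # 8/384 ≈ 2.1%
print(f"pure-MoE estimate: {routed_fraction * total_params / 1e9:.0f}B")   # ~21B

# The reported 32B (3.2% of 1T) is somewhat higher, consistent with some
# parameters (attention, embeddings, the shared expert, the dense layer)
# being activated for every token.
print(f"reported fraction: {activated_params / total_params:.1%}")         # 3.2%
```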

Performance and Evaluation

Evaluation results show that Kimi K2 Thinking achieves state-of-the-art or highly competitive performance across a range of tasks. In reasoning tasks with tools, it scores 44.9 on HLE and 60.2 on BrowseComp, outperforming other leading models. It also demonstrates strong capabilities in coding, achieving a score of 71.3 on SWE-bench Verified and showing particular strength in multilingual coding benchmarks. All reported benchmark results were achieved using INT4 precision, underscoring the model’s efficiency.

Deployment and Usage

Developers can access Kimi K2 Thinking via an OpenAI/Anthropic-compatible API available at platform.moonshot.ai. For local deployment, the model is optimized to run on inference engines such as vLLM, SGLang, and KTransformers.
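Because the endpoint is OpenAI-compatible, the standard OpenAI Python SDK can be pointed at it. The sketch below is a minimal chat-completion call; the base URL and model identifier are assumptions to verify against the platform.moonshot.ai documentation.

```python
from openai import OpenAI

# Minimal chat-completion sketch against the OpenAI-compatible endpoint.
# Base URL and model name are assumed; check platform.moonshot.ai docs.
client = OpenAI(
    base_url="https://api.moonshot.ai/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="kimi-k2-thinking",
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of INT4 quantization."},
    ],
    temperature=1.0,  # recommended setting for general use (see below)
)
print(response.choices[0].message.content)
```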

The model supports standard chat completion and advanced tool-calling functionalities. Users can define a list of available tools, and the model will autonomously decide when and how to use them to fulfill a request. The recommended temperature setting for general use is 1.0.
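The sketch below extends the previous example with a tool definition in the OpenAI function-calling format; the get_weather tool is hypothetical and stands in for whatever tools an application exposes.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_API_KEY")

# A hypothetical tool definition in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="kimi-k2-thinking",  # assumed model identifier
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    temperature=1.0,  # recommended for general use
)

# The model decides autonomously whether a tool call is needed; if so, the
# response carries tool_calls for the caller to execute and feed back.
message = response.choices[0].message
if message.tool_calls:
    for call in message.tool_calls:
        print(call.function.name, call.function.arguments)
else:
    print(message.content)
```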

License

Both the model weights and the associated code repository are released under the Modified MIT License.

🔔 How to Use

```mermaid
graph LR
    A("Purchase Now") --> B["Start Chat on Homepage"]
    A --> D["Read API Documentation"]
    B --> C["Register / Login"]
    C --> E["Enter Key"]
    D --> F["Enter Endpoint & Key"]
    E --> G("Start Using")
    F --> G
    style A fill:#f9f9f9,stroke:#333,stroke-width:1px
    style B fill:#f9f9f9,stroke:#333,stroke-width:1px
    style C fill:#f9f9f9,stroke:#333,stroke-width:1px
    style D fill:#f9f9f9,stroke:#333,stroke-width:1px
    style E fill:#f9f9f9,stroke:#333,stroke-width:1px
    style F fill:#f9f9f9,stroke:#333,stroke-width:1px
    style G fill:#f9f9f9,stroke:#333,stroke-width:1px
```
