Summary Points for Kimi K1.5's Innovations in AI:
The field of artificial intelligence (AI) continues to evolve rapidly, with large language models (LLMs) at the forefront of innovation. Kimi K1.5, the latest multimodal LLM developed by the Kimi team, represents a significant leap forward in scaling reinforcement learning (RL) for LLMs. This article explores the technical advancements, methodologies, and results achieved by Kimi K1.5, which has set new benchmarks in reasoning, multimodal understanding, and token efficiency.
Introduction: The Need for Scaling Reinforcement Learning
Traditional LLMs rely heavily on next-token prediction during pretraining, which is constrained by the availability of high-quality training data. Reinforcement learning offers a promising alternative by enabling models to explore and learn from rewards, effectively bypassing the limitations of static datasets. However, prior attempts to integrate RL with LLMs have struggled to produce competitive results. Kimi K1.5 addresses these challenges with a novel approach that combines long-context scaling, improved policy optimization, and multimodal training.
Key Innovations in Kimi K1.5
Kimi K1.5 introduces several groundbreaking techniques that redefine the capabilities of LLMs:
1. Long-Context Scaling
- Extended Context Windows: Kimi K1.5 scales the context window to 128k tokens, enabling the model to process and reason over significantly longer sequences. This is achieved through partial rollouts, which reuse large chunks of previous trajectories to improve training efficiency.
- Impact on Reasoning: The extended context length enhances the model's ability to plan, reflect, and correct its reasoning, resulting in state-of-the-art performance across multiple benchmarks.

2. Improved Policy Optimization
- Simplified RL Framework: Kimi K1.5 employs a variant of online mirror descent for robust policy optimization, deliberately avoiding more complex machinery such as Monte Carlo tree search or learned value functions.
- Length Penalty: To address the issue of overthinking, a length-based reward system is introduced, promoting concise yet accurate responses.
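The length-based reward can be sketched as follows. The function name and edge-case handling here are illustrative, but the shape follows the idea in the report: the shortest sampled response earns a bonus, the longest a penalty, and an incorrect response is never rewarded for brevity.

```python
def length_reward(length, correct, min_len, max_len):
    """Length-based reward that discourages overthinking.

    lam ranges from +0.5 for the shortest sampled response down to
    -0.5 for the longest. Incorrect responses only ever receive the
    penalty part (min(0, lam)), never a brevity bonus.
    """
    if max_len == min_len:  # degenerate batch: all responses equal length
        lam = 0.0
    else:
        lam = 0.5 - (length - min_len) / (max_len - min_len)
    return lam if correct else min(0.0, lam)
```

In practice this term is combined with the correctness reward, so a correct short answer dominates a correct long one without ever making a wrong short answer attractive.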

3. Multimodal Training
- Text and Vision Integration: The model is trained on both text and vision data, enabling it to reason across modalities. This includes tasks like image-grounded conversations, chart interpretation, and visual coding.
- Synthetic and Real-World Data: The training corpus includes a mix of real-world datasets, synthetic visual reasoning data, and text-rendered images, ensuring comprehensive multimodal capabilities.
4. Long2Short Techniques
- Reasoning Compression: Kimi K1.5 employs methods like model merging, shortest rejection sampling, and long2short RL to transfer the reasoning capabilities of long chain-of-thought models to short-context models. This improves token efficiency without compromising performance.
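Of these, shortest rejection sampling is the simplest to illustrate: sample several responses to the same prompt and keep the shortest correct one as training data for the short model. The helper names below (`sample_fn`, `is_correct`) are hypothetical stand-ins for the model's sampler and the reward checker:

```python
def shortest_correct_response(prompt, sample_fn, is_correct, k=8):
    """Sample k candidate responses and return the shortest correct one,
    or None if every candidate is wrong (the prompt is then skipped)."""
    candidates = [sample_fn(prompt) for _ in range(k)]
    correct = [c for c in candidates if is_correct(prompt, c)]
    return min(correct, key=len) if correct else None
```

The selected responses then serve as supervised targets, biasing the short-context model toward concise chains of thought that still reach the right answer.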

Training Methodology
The development of Kimi K1.5 involved a multi-stage training process:

1. Pretraining
- Diverse Data Sources: The model was pretrained on a multimodal corpus covering English, Chinese, code, mathematics, and general knowledge. Rigorous quality control ensured the relevance and diversity of the data.
- Vision-Language Integration: The model was gradually introduced to interleaved vision-language data, establishing robust multimodal capabilities.
2. Reinforcement Learning
- Prompt Set Curation: High-quality prompts were curated to ensure diverse coverage, balanced difficulty, and accurate evaluability. This included STEM problems, coding tasks, and general reasoning challenges.
- Curriculum Sampling: The training process began with easier tasks and progressively moved to more challenging ones, enhancing the model's adaptability.
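Curriculum sampling can be sketched as a schedule over difficulty-bucketed prompts. Scoring difficulty by a model's measured failure rate is one common proxy; the function and scoring convention here are illustrative, not the paper's exact pipeline:

```python
def curriculum_schedule(prompts_with_difficulty, n_stages):
    """Sort prompts from easy to hard and split them into training stages,
    so early stages see only the easiest problems.

    prompts_with_difficulty: list of (prompt, difficulty) pairs, where a
    higher difficulty score means a harder problem (e.g. 1 - pass rate).
    """
    ordered = sorted(prompts_with_difficulty, key=lambda pd: pd[1])
    stage_size = (len(ordered) + n_stages - 1) // n_stages  # ceiling division
    return [
        [prompt for prompt, _ in ordered[i * stage_size:(i + 1) * stage_size]]
        for i in range(n_stages)
    ]
```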
3. Long-Context Activation
- Extended Sequence Lengths: The model was trained with upsampled long-context data, gradually increasing the maximum sequence length from 4k to 128k tokens.
- Partial Rollouts: This technique allowed the model to handle long trajectories efficiently by segmenting responses across iterations.
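A partial rollout in this sense caps generation at a per-iteration token budget and carries the unfinished prefix into the next iteration. This toy version shows only the bookkeeping; the real system resumes model decoding from cached state rather than replaying the prefix:

```python
def partial_rollout(generate_next, prefix, budget, done):
    """Extend `prefix` by at most `budget` tokens.

    generate_next(prefix) -> next token; done(prefix) -> True when the
    trajectory is complete. Returns (prefix, finished) so unfinished
    trajectories can be resumed in a later iteration.
    """
    for _ in range(budget):
        if done(prefix):
            return prefix, True
        prefix = prefix + [generate_next(prefix)]
    return prefix, done(prefix)
```

Long responses thus cost only one budget's worth of decoding per iteration, which keeps batches of mixed-length trajectories from being bottlenecked by the longest one.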
Performance Benchmarks
Kimi K1.5 has achieved state-of-the-art results across a wide range of benchmarks:

Reasoning Benchmarks
- MATH-500: Achieved 96.2% exact-match accuracy, on par with the strongest reasoning models available at release.
- AIME 2024: Scored 77.5% on advanced math problems, demonstrating superior logical reasoning.
- Codeforces: Ranked in the 94th percentile, showcasing exceptional coding capabilities.
Multimodal Benchmarks
- MathVista: Scored 74.9% on visual mathematical reasoning tasks, highlighting its ability to integrate text and vision.
- MMMU: Achieved 70% accuracy on multimodal university-level questions, spanning diverse academic disciplines.
Token Efficiency
- The long2short RL algorithm significantly improved token efficiency, with the short-context model achieving competitive performance using fewer tokens.
Ablation Studies and Insights
1. Scaling Context Length vs. Model Size
- Smaller models with extended context lengths achieved comparable performance to larger models, demonstrating the effectiveness of long-context scaling.
2. Negative Gradients in Policy Optimization
- Incorporating negative gradients markedly enhanced training efficiency, outperforming methods like Reinforced Self-Training (ReST).
3. Curriculum Sampling
- Gradually increasing task difficulty during training led to better performance compared to uniform sampling strategies.
Infrastructure Optimizations
Kimi K1.5's training system is designed for scalability and efficiency:
- Hybrid Deployment: A Kubernetes-based framework enables seamless transitions between training and inference phases, minimizing idle GPU resources.
- Partial Rollouts: This technique optimizes the handling of long trajectories, reducing computational overhead and improving scalability.
- Code Sandbox: A secure environment for executing user-submitted code ensures reliable evaluation and feedback during RL training.
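The sandbox internals are not detailed here; as a minimal stand-in, untrusted code can at least be run in a separate process with a wall-clock timeout. A production sandbox would additionally isolate filesystem, network, and memory:

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code, timeout=5):
    """Run a snippet of Python in a child process with a timeout.
    Returns (returncode, stdout); returncode is None on timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        return None, ""
    finally:
        os.remove(path)
```

During RL training, the return code and output feed the reward signal for coding tasks, so isolation and timeouts directly protect the training loop from hostile or runaway submissions.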
Future Directions
While Kimi K1.5 represents a significant advancement, several areas remain open for exploration:
- Improved Credit Assignment: Enhancing the model's ability to assign credit to intermediate reasoning steps could further improve performance.
- Reducing Overthinking: Developing methods that balance exploration against efficiency could curb excessive reasoning length without compromising answer quality.
- Iterative Long2Short Training: Combining long2short methods with long-context RL in an iterative manner could yield even greater token efficiency.
Conclusion
Kimi K1.5 sets a new standard for LLMs by demonstrating the potential of reinforcement learning to scale reasoning capabilities. Its innovative approach to long-context scaling, multimodal training, and policy optimization has resulted in state-of-the-art performance across diverse benchmarks. As the field of AI continues to evolve, Kimi K1.5 serves as a testament to the transformative power of combining RL with LLMs, paving the way for more advanced and efficient models in the future.