Tencent's new Hunyuan-Large-Vision model now leads all Chinese entries on the LMArena Vision Leaderboard, ranking just behind GPT-5 and Gemini 2.5 Pro.
Built on a mixture-of-experts architecture with 389 billion parameters (52 billion active per token), it delivers performance comparable to Claude 3.5 Sonnet.
 The LMArena Vision Leaderboard ranks AI image models by community preference in head-to-head comparisons. | Image: LMArena Leaderboard/Screenshot by THE DECODER
Among Chinese entries, Hunyuan-Large-Vision leads the pack, overtaking the previously top-rated Qwen2.5-VL in its largest version. Tencent says the model scored an average of 79.5 on the OpenCompass Academic Benchmark and stands out on multilingual tasks.
 Hunyuan-Large-Vision leads nearly all visual QA, video, math, OCR, and 3D benchmarks, though the Western models in the comparison are not the newest releases. | Image: Tencent
Tencent demonstrated the model's capabilities with a range of tasks: identifying Iris lactea, composing a poem from a photo of the Seine, offering strategic advice in Go, and translating questions into Spanish. Compared to Tencent's earlier vision models, Hunyuan-Large-Vision also handles less common languages more effectively.
 Hunyuan-Large-Vision handles everything from species recognition to poetry, translation, and board games. | Image: Tencent
Hunyuan-Large-Vision is built around three main modules: a custom vision transformer with one billion parameters for image processing, a connector module to bridge vision and language, and a language model using the mixture-of-experts technique.
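Tencent hasn't released reference code for this design, but the general shape of such a stack is easy to sketch. The PyTorch toy below (all dimensions, expert counts, and layer choices are invented for illustration, not Tencent's actual configuration) shows how a connector feeds vision features into a sparse mixture-of-experts layer, where a router activates only a few experts per token; this routing is what lets a 389-billion-parameter model run with only 52 billion parameters active.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Projects vision-transformer features into the language model's space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class MoELayer(nn.Module):
    """Sparse mixture-of-experts feed-forward block: a router sends each
    token to its top-k experts, so only a fraction of all parameters runs."""
    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x).softmax(dim=-1)          # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # routing decision
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():                           # run only chosen experts
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# Toy forward pass: ViT features -> connector -> MoE block.
vision_tokens = torch.randn(32, 1024)     # stand-in for ViT output
lm_inputs = Connector(1024, 2048)(vision_tokens)
print(MoELayer(2048)(lm_inputs).shape)    # torch.Size([32, 2048])
```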
Tencent says the vision transformer was first trained to link images and text, then further refined with over a trillion multimodal text samples. In benchmarks, it outperforms other popular models on complex multimodal tasks.
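"Trained to link images and text" suggests a CLIP-style contrastive objective, though Tencent's report is the authority on the exact loss used. A minimal sketch of that kind of objective, where matching image/text pairs sit on the diagonal of a similarity matrix:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matching image/text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits))            # pair i matches pair i
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```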
 The model processes images, video, and 3D content, and, according to Tencent, excels at visual reasoning, video analysis, and spatial understanding. | Image: Tencent
New training pipeline for multimodal data
Tencent built a pipeline that transforms noisy raw data into high-quality instruction data using pre-trained AI and specialized tools—resulting in over 400 billion multimodal text samples across visual recognition, math, science, and OCR.
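The pipeline itself isn't public; the sketch below only illustrates the described idea of scoring noisy samples with a pre-trained model and filtering out failures and duplicates. The Sample type, the quality_model callable, and the threshold are hypothetical stand-ins, not Tencent's components:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    text: str
    score: float = 0.0

def clean_pipeline(raw: list[Sample], quality_model,
                   threshold: float = 0.8) -> list[Sample]:
    """Schematic cleaning pass: score noisy samples with a pre-trained
    model, drop low scorers, then remove redundant entries."""
    for s in raw:
        s.score = quality_model(s.image_path, s.text)  # hypothetical scorer
    kept = [s for s in raw if s.score >= threshold]
    seen, deduped = set(), []
    for s in kept:
        if s.text not in seen:                         # crude redundancy filter
            seen.add(s.text)
            deduped.append(s)
    return deduped
```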
The model was fine-tuned with rejection sampling, which generates multiple candidate responses and keeps only the best, while automated tools filter out errors and redundancies. To make reasoning more efficient, long, complex answers were distilled into concise ones.
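In code, rejection sampling reduces to a best-of-n loop. In this sketch, generate and reward_model are hypothetical stand-ins for the model being tuned and an automated scorer:

```python
def rejection_sample(prompt: str, generate, reward_model, n: int = 8) -> str:
    """Draw several candidate answers, score each one, and keep only
    the best as fine-tuning data."""
    candidates = [generate(prompt) for _ in range(n)]      # hypothetical generator
    scores = [reward_model(prompt, c) for c in candidates]  # hypothetical scorer
    return candidates[scores.index(max(scores))]
```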
Training used Tencent's Angel-PTM framework with a multi-level load-balancing strategy, which Tencent says reduced GPU bottlenecks by 18.8 percent and sped up training.
Hunyuan-Large-Vision is available exclusively via API on Tencent Cloud. Unlike some previous Tencent models, this one is not open source. With its 389 billion parameters, it would not be practical to run on consumer hardware anyway.
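A call might look like the snippet below, assuming an OpenAI-compatible chat endpoint. The base URL, model identifier, and message format are assumptions rather than documented values, so check Tencent Cloud's documentation before use:

```python
from openai import OpenAI

# Hypothetical: endpoint and model name are assumptions, not confirmed
# by the article; consult Tencent Cloud's docs for the real values.
client = OpenAI(
    base_url="https://api.hunyuan.cloud.tencent.com/v1",
    api_key="YOUR_TENCENT_CLOUD_KEY",
)
resp = client.chat.completions.create(
    model="hunyuan-large-vision",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What plant is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/iris.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```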
