After MiniMax tops token call volume, the next key battleground for model vendors emerges

  1. The token usage of major models has surged sharply since late January 2026;
  2. Domestic large models have taken center stage;
  3. When agents perform tasks, overall token consumption can rise more than tenfold, while the corresponding computing power demand grows more than a hundredfold.

Recently, the explosion in large model token calls has become a focal point. According to the latest data from OpenRouter, token usage of major models has jumped sharply since late January 2026.

Meanwhile, domestic large models have taken the spotlight. During the week of February 9-15, Chinese models recorded 4.12 trillion token calls, surpassing US models’ 2.94 trillion for the first time. The following week, February 16-22, Chinese models’ weekly calls surged further to 5.16 trillion tokens, a 127% increase over three weeks, while US models’ calls fell to 2.7 trillion. Four of the top five models by platform call volume come from Chinese companies: MiniMax’s M2.5, Moonshot AI’s Kimi K2.5, Zhipu’s GLM-5, and DeepSeek’s V3.2. Together, these four models account for 85.7% of total calls within the Top 5.

Notably, M2.5 made a remarkable debut, topping the OpenRouter trending list within 12 hours of release and leading the weekly call volume chart, with weekly calls soaring to 3.07 trillion tokens, more than Kimi K2.5, GLM-5, and DeepSeek V3.2 combined.

OpenRouter is the world’s largest large model API aggregation platform, providing developers with a unified API interface to access hundreds of large language models globally. Its core features include multi-model calls, intelligent routing optimization, and transparent performance rankings, aiming to solve the complexity of multi-model integration and vendor lock-in issues.
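As a concrete illustration of the “unified API” point, the sketch below shows how a developer might route the same request to different models through OpenRouter’s OpenAI-compatible endpoint. The model slugs and API key are illustrative placeholders, not confirmed identifiers.

```python
# Minimal sketch: one client, many models, via OpenRouter's
# OpenAI-compatible chat-completions endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder credential
)

# Hypothetical model slugs for illustration only.
for model in ["minimax/minimax-m2.5", "deepseek/deepseek-v3.2"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Summarize agent workflows in one sentence."}],
    )
    print(model, "->", resp.choices[0].message.content)
```

Switching vendors becomes a one-string change, which is exactly the lock-in problem the platform aims to solve.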

As focus shifts to AI application proliferation and the rise of domestic models, we should not overlook the structural changes behind the data.

Why has model call volume exploded recently? Why are emerging models like M2.5 leading the rankings?

Many institutions believe that, on one hand, the Chinese New Year holiday boosted AI application penetration, lifting overall token consumption; on the other hand, AI agents are being deployed widely across real scenarios, significantly increasing token consumption per task.

Examining the new AI industry trends that have accompanied this growth since late January helps answer these questions.

First, OpenClaw has become extremely popular. It is an open-source agent framework that grants large models local operating-system permissions, allowing AI to execute shell commands and manipulate the file system, achieving so-called “local agent sovereignty.” On February 15, local time, OpenClaw’s creator Peter Steinberger officially joined OpenAI to advance the development of next-generation personal agents.
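That pattern is easy to sketch. The code below is not OpenClaw’s actual API, only a generic, minimal version of the loop such frameworks enable; ask_model is a hypothetical stand-in for any chat-completion call.

```python
# Generic agent loop sketch (illustrative; not OpenClaw's real interface):
# the model proposes a shell command, the local harness executes it,
# and the output is appended to the transcript for the next step.
import subprocess

def execute(cmd: str) -> str:
    """Run a model-proposed shell command locally and capture its output."""
    done = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=30)
    return done.stdout + done.stderr

def agent_loop(ask_model, goal: str, max_steps: int = 10) -> list:
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = ask_model(history)                      # model plans next action
        history.append({"role": "assistant", "content": reply})
        if not reply.startswith("RUN:"):                # no command means done
            break
        output = execute(reply[len("RUN:"):].strip())
        history.append({"role": "tool", "content": output})  # grows every round
    return history
```

Every round re-sends the growing history, which is one reason agent workloads burn tokens so quickly.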

Subsequently, several large models targeting agent scenarios have been released, drawing strong reactions:

Xiyu Technology (MiniMax) released M2.5 on February 13, billing it as the world’s first production-grade flagship model designed natively for agent scenarios. Within seven days of release, its calls exceeded 3.07 trillion tokens. Thanks to strong performance in programming and agent workflows at very low cost, it has become a first choice among developers.

Moonshot AI released Kimi K2.5 on January 27. The model adopts a native multimodal architecture and can schedule up to 100 “agent clones” to work in parallel, boosting efficiency on complex tasks by 3 to 10 times. It ranks first in multiple subcategories (such as programming and tool invocation), with call volume far exceeding that of the Gemini 3 and Claude models.
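The “agent clone” idea maps onto a familiar fan-out pattern. The sketch below is a generic illustration, not Kimi’s internals; solve_subtask is a hypothetical stand-in for a full sub-agent run.

```python
# Fan-out sketch: a coordinator splits one task into subtasks and runs
# sub-agents concurrently. Illustrative only; not Kimi K2.5's internals.
from concurrent.futures import ThreadPoolExecutor

def solve_subtask(subtask: str) -> str:
    # Placeholder for one sub-agent's complete model-call loop.
    return f"result for {subtask}"

subtasks = [f"module-{i}" for i in range(100)]  # up to 100 parallel clones
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(solve_subtask, subtasks))

# Wall-clock time shrinks, but total token spend scales with the number
# of clones: parallelism trades tokens for latency, it does not save them.
```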

Zhipu released GLM-5 on February 12, with a further expanded parameter count and a sparse attention mechanism, designed specifically for complex systems engineering and long-horizon agent tasks. With advantages such as free access and a 200K context window, user growth accelerated after release, prompting Zhipu to impose purchase limits and price increases on its Coding Plan.

These models focus on enhancing programming capabilities and automating agent tasks, shifting AI applications from personal entertainment to production environments. Professional developers’ token consumption is far denser than that of casual conversation; once these needs are activated, call volume rises sharply.

OpenRouter’s official data confirms this: over 70% of token consumption comes from routine production calls by internet giants, large and medium-sized enterprises, and professional developers, whose single calls involve token counts far beyond those of individual users or small test projects. In recent weeks the platform has seen a marked rise in long-text generation requests of 100K to 1 million tokens, a range typical of agent workflows, with MiniMax M2.5 in the lead.

It can be said that this surge in token consumption directly reflects new trends in large model development. AI is shifting from “fast thinking” to “slow thinking” and from “tools” to “labor force.” The intelligent agent functions developed by leading AI companies fall into the “slow thinking” category.

When models face complex tasks (such as “write code for an e-commerce website”), they no longer produce a direct answer. Instead, they “talk to themselves”: breaking down requirements, designing the architecture, writing functions, debugging, and optimizing performance, because the AI has begun to “reason repeatedly in its mind.” Every reasoning step and every logical chain consumes tokens, and this rise in “reasoning density” makes token consumption grow much faster than the number of users or questions.
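A toy model makes the “reasoning density” effect concrete. All sizes below are assumptions for illustration, not measured figures; the key mechanism is that each step re-reads the growing transcript before adding to it.

```python
# Toy model of reasoning density (all numbers are assumed, illustrative).
# Each step re-reads the growing transcript (input tokens), then emits
# new output tokens, so cost grows much faster than the step count.
steps = ["decompose requirements", "design architecture",
         "write functions", "debug", "optimize performance"]

context = 500           # assumed initial prompt size, in tokens
output_per_step = 800   # assumed tokens emitted per step
total = 0
for step in steps:
    total += context + output_per_step  # re-read context, then emit
    context += output_per_step          # transcript grows each step

direct_answer = 500 + 800               # one prompt, one answer
print(f"direct: {direct_answer} tokens vs agent run: {total} tokens")
# With these assumptions, five chained steps cost roughly 11x one answer.
```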

CITIC Securities states that AI application scenarios are evolving from simple conversation toward multimodal (text/image/audio/video) and AI-agent use, with token consumption per task growing exponentially. The explosion in token usage fundamentally reflects an exponential expansion in AI reasoning demand.

Huatai Securities previously predicted that, with changes in reasoning paradigms and the accelerated deployment of agents, two multiplicative factors will drive future computing power demand:

  1. The relationship between reasoning and token calls is not linear, as multi-agent collaboration and multi-tool invocation accelerate token consumption;

  2. The relationship between computing power demand and token growth is not linear, because more complex reasoning processes will increase calculation time under the same hardware, demanding higher speed and interactivity.

They believe that, compared with chatbots, agents decompose tasks and write code during execution, increasing interaction frequency, task complexity, and usage. Overall token consumption could rise more than tenfold, while computing power demand could grow more than a hundredfold, supporting a long-term outlook of rapid growth in computing needs.
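To see how the two factors compound, here is a back-of-envelope calculation; every number is an assumption chosen for illustration, not a figure from the report.

```python
# Back-of-envelope: two multiplicative factors (assumed numbers only).
chat_tokens = 2_000                       # one chatbot answer

steps = 25                                # agent decomposes the task
tokens_per_step = 1_000                   # plan / tool call / result
agent_tokens = steps * tokens_per_step    # 25,000 tokens

compute_per_token = 10                    # relative cost: longer contexts,
                                          # tighter latency targets
chat_compute = chat_tokens * 1
agent_compute = agent_tokens * compute_per_token

print(agent_tokens / chat_tokens)         # 12.5x -> "more than ten times"
print(agent_compute / chat_compute)       # 125x  -> "over a hundred times"
```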

(Article source: Cailian Press)
