StepFun's Physical AI Gambit Highlights Divergent Paths Amid Sector Financial Strains
China's AI Giants Forge Divergent Paths: StepFun's Physical World Ambition Contrasts with Mounting Financial Pressures
In early 2026, just over three years after ChatGPT's public release, the Chinese artificial intelligence landscape presents a tableau of stark contrasts. On one side, a record-breaking private financing round signals a bold strategic pivot. On the other, the newly public financials of two industry leaders reveal the brutal economic realities of the foundational model race. This dichotomy marks a critical inflection point for the sector: the transition from a competition fought purely on paper benchmarks to a grueling contest defined by commercial viability and sustainable unit economics.
The latest tremor in the capital markets is the completion of a B+ funding round of more than 5 billion yuan (approximately $700 million) by StepFun (阶跃星辰), a Shanghai-based AI startup. The round drew a consortium of state-backed and private investors, including Shanghai Guotou Xiandao Fund, China Life Private Equity, and Pudong Venture Capital, with continued support from existing backers Tencent, Qiming Venture Partners, and 5Y Capital. It is the largest single funding round in China's large language model (LLM) sector in the past 12 months. Notably, the sum surpasses the recent $500 million Series C raised by Moonshot AI and even eclipses the IPO proceeds of both Zhipu AI and Minimax.
This massive capital injection arrives amid a flurry of activity. Meta's acquisition of Manus, Moonshot founder Yang Zhilin's public assertion that his company has ample funding, and the recent high-profile Hong Kong listings of Zhipu and Minimax have collectively heightened market fervor. However, StepFun's intended use of the funds marks a distinct strategic fork in the road. While its peers race towards the perceived safety of public markets, StepFun is charting a capital-intensive, long-term course into what it terms the "physical world."
Industry observers widely concur that the initial "regular season" of LLM competition, defined by contests over parameter counts and token volume, has concluded. The "playoffs," they argue, have begun, centered on a new core imperative: moving artificial intelligence out of data centers and into real-world applications and physical devices.
Leadership and Strategy: Integrating the "Physical World" Veteran
Beyond the financing headline, a significant leadership evolution at StepFun may hold greater strategic import. Yin Qi, Chairman of Qianli Technology (千里科技), has formally assumed the role of Chairman of StepFun, taking full responsibility for corporate strategy and industrial deployment.
Yin Qi is a pivotal figure from China's previous AI cycle, having co-founded the computer vision giant Megvii (旷视科技) in 2011. His profile contrasts with the purely academic or quantitative finance backgrounds of some peers' founders. Yin's defining hallmark is a proven track record in large-scale, real-world AI implementation. Megvii's IoT business serves over 100 Chinese cities and has expanded globally, managing an AIoT platform connecting hundreds of millions of terminals.
His appointment addresses a perceived gap in StepFun's otherwise technically elite team. CEO Jiang Daxin, former Corporate Vice President at Microsoft and Chief Scientist of Microsoft Asia Internet Engineering Academy, brings extensive experience in productizing AI at global scale. Chief Scientist Zhang Xiangyu is one of the four co-authors of the foundational ResNet paper. CTO Zhu Yibo has hands-on experience building 10,000-GPU clusters from scratch at ByteDance and previously supported Anthropic at Google Cloud.
"Technological leadership does not equate to commercial success," notes an industry analyst familiar with the company. "As the competition intensifies, model performance is no longer the sole market differentiator. The new watershed is the ability to integrate models into genuine application scenarios and generate tangible business value." Yin Qi, with his deep experience in deploying AI across hundreds of millions of devices, is seen as the crucial piece to complete this puzzle.
His strategic vision aligns with StepFun's parallel development track. While continuing to push the boundaries of its foundational "Step" series models, the company has aggressively pursued embedding its technology into consumer terminals. StepFun claims deep collaboration with 60% of China's top smartphone brands, including OPPO, Honor, and ZTE, with its models deployed on over 42 million devices, serving 20 million daily user queries. In automotive, partnerships with Qianli Technology and Geely have integrated its end-to-end voice model into the AgentOS smart cockpit. The Geely Galaxy M9, featuring this integration, sold nearly 40,000 units in its first three months, with StepFun targeting million-unit deployments this year.
Yin Qi's formal leadership role accelerates this "AI + Terminal" strategy from planning to execution, creating a combined entity that some observers liken, in structure if not in scale, to Elon Musk's xAI-Tesla-Optimus ecosystem—one providing the "soul," the other the "body."
The Multimodal Imperative: A Ticket to the Physical World
StepFun's terminal-centric strategy is fundamentally predicated on advancing multimodal AI. The company argues that multimodal capabilities serve as the essential "sensory system" for LLMs to interact with the physical environment. Text provides symbolic logic, but multimodal processing furnishes eyes, ears, and a voice.
The company has committed to a "native multimodal" approach from its inception in 2023, conducting end-to-end training directly on interleaved image-text data rather than adopting the more common "plug-in" method where visual encoders feed information separately to a language model. This approach, StepFun contends, allows for more native understanding and generation, enabling AI to comprehend the causal logic of the physical world in a more human-like fashion.
This principle extends to its audio models. The latest Step-Audio-R1.1 model employs Modal-Grounded Reasoning Distillation (MGRD) to generate reasoning chains based purely on acoustic features, aiming to solve the problem of over-reasoning degrading performance in audio AI. StepFun claims this model achieved the top ranking on the Artificial Analysis benchmark.
The next evolutionary step, from understanding to interaction, is being pursued through Vision-Language-Action (VLA) architecture. StepFun's open-source Step-GUI model series, particularly its 4-billion parameter edge-side version Step-GUI-Edge, is designed to enable AI agents to understand screen content and perform operations. The company states this compact model can outperform some models eight times its size in certain benchmarks, making sophisticated on-device agent capabilities feasible for consumer hardware.
The Other Side of the Coin: Soaring Valuations Meet Soaring Costs
The optimism fueling StepFun's massive raise exists alongside a sobering reality freshly detailed in the public filings of its now-listed rivals, Minimax and Zhipu AI. Both companies achieved valuations of approximately $6 billion in their recent Hong Kong IPOs, which were met with enthusiastic investor reception. However, a dissection of their financials reveals the profound economic challenges at the heart of the foundational model business.
Both companies exhibit a "short, lean, and fast" profile: fewer than 1,000 employees, rapid product iteration, and revenue growing from zero to nearly a $100 million annualized run rate within two to three years. Zhipu, with its primarily B2B model, maintains gross margins of around 50%, while the gross margin of Minimax's B2C-focused business has recently turned positive.
Yet these promising top-line figures are utterly dwarfed by operational expenditures. In 2024, total costs and operating expenses at both companies ran to roughly ten times revenue. In the first nine months of 2025, even as Minimax scaled its revenue, its expenses remained over five times revenue, while Zhipu's ratio appeared to worsen, suggesting a troubling lack of scale economies.
The core question laid bare by these financials is whether the LLM business follows traditional internet scale economics, where margins improve with growth, or an inverse scale model where expansion necessitates ever-greater losses.
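The arithmetic behind those multiples can be made concrete. The following is a minimal sketch, assuming only the rounded ratios quoted above (roughly 10x in 2024, and 5x as a floor for Minimax in the first nine months of 2025). The function names and the toy cost model with its `fixed_cost` and `variable_rate` parameters are illustrative assumptions, not figures from either company's filings.

```python
def loss_per_unit_revenue(expense_multiple: float) -> float:
    """Operating loss per unit of revenue when total costs and
    expenses equal `expense_multiple` times revenue."""
    return expense_multiple - 1.0


def expense_multiple(revenue: float, fixed_cost: float, variable_rate: float) -> float:
    """Toy cost model (an assumption, not from the filings):
    total cost = fixed_cost + variable_rate * revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (fixed_cost + variable_rate * revenue) / revenue


# Rounded multiples quoted in the article.
print(loss_per_unit_revenue(10.0))  # 2024, both companies
print(loss_per_unit_revenue(5.0))   # first nine months of 2025, Minimax (floor)

# Traditional scale economics: with a cost base that stays fixed,
# the expense multiple falls as revenue grows.
for revenue in (10.0, 50.0, 100.0):
    print(revenue, expense_multiple(revenue, fixed_cost=90.0, variable_rate=1.0))
```

In this toy model the multiple falls only because `fixed_cost` is held constant. If each model generation raised that fixed base faster than revenue grew, the multiple would stay flat or rise, which is the "inverse scale" outcome the question above describes.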
Deconstructing the Cost Black Hole: It's the Compute, Stupid
The analysis points to compute, not human capital, as the primary "black hole." Both companies employ small, highly compensated teams; Minimax's R&D personnel reportedly cost an average of 160,000 RMB (about $22,500) per person per month. Salary expenses, though significant, can ultimately be diluted as revenue grows. Minimax's total annual compensation, around $100 million or 90% of its revenue, is deemed substantial but not prohibitive given the global AI talent war.
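A quick consistency check on the figures quoted above is sketched below. It uses only numbers that appear in the article; the variable names are illustrative, and the derived values are implied by the rounded inputs rather than reported by the company.

```python
# Figures as quoted in the article (rounded).
MONTHLY_COST_RMB = 160_000       # average monthly cost per R&D employee
MONTHLY_COST_USD = 22_500        # the article's dollar conversion
TOTAL_COMP_USD_M = 100.0         # total annual compensation, in $ millions
COMP_SHARE_OF_REVENUE = 0.90     # compensation as a share of revenue
FUNDING_RMB_M = 5_000.0          # StepFun round, in millions of yuan
FUNDING_USD_M = 700.0            # the article's dollar conversion

# Exchange rates implied by the two conversions.
implied_rate_salary = MONTHLY_COST_RMB / MONTHLY_COST_USD
implied_rate_funding = FUNDING_RMB_M / FUNDING_USD_M

# Annualized cost per R&D employee, in dollars.
annual_cost_per_head_usd = MONTHLY_COST_USD * 12

# Revenue implied by "compensation is 90% of revenue", in $ millions.
implied_revenue_usd_m = TOTAL_COMP_USD_M / COMP_SHARE_OF_REVENUE

print(round(implied_rate_salary, 2), round(implied_rate_funding, 2))
print(annual_cost_per_head_usd)
print(round(implied_revenue_usd_m, 1))
```

Both conversions imply a rate of roughly 7.1 yuan per dollar, and the implied revenue of about $111 million is in the same range as the roughly $100 million annualized run rate cited earlier, so the quoted figures are at least mutually consistent.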
The overwhelming cost driver is compute expenditure, which splits into training and inference. Training costs, the "sunk" investment required to develop a model before it can generate revenue, are recorded as R&D expenses. Inference costs, the compute consumed when customers use a model, are recorded as cost of revenue.
For both Minimax and Zhipu, training compute alone constitutes over 50% of total expenses, accounting for more than half of their staggering loss ratios. This highlights a fundamental tension: the relentless, non-linear scaling of training costs against the linear growth of revenue. Furthermore, both companies own minimal fixed assets, which indicates a heavy reliance on third-party cloud services for this compute. That is potentially a more expensive model over the long term than the owned data center infrastructure pursued by OpenAI.
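To see what a 50% training share means relative to revenue, the sketch below combines it with the rounded expense multiples quoted earlier (about 10x and 5x revenue). Treating 50% as a floor and applying it to both multiples is an assumption made for illustration, and the function names are hypothetical.

```python
def training_to_revenue(expense_multiple: float, training_share: float) -> float:
    """Training compute as a multiple of revenue, given total expenses as a
    multiple of revenue and training's share of those expenses."""
    return expense_multiple * training_share


def training_share_of_loss(expense_multiple: float, training_share: float) -> float:
    """Training compute as a fraction of operating loss, where
    operating loss = (expense_multiple - 1) * revenue."""
    if expense_multiple <= 1.0:
        raise ValueError("no operating loss when expenses do not exceed revenue")
    return training_to_revenue(expense_multiple, training_share) / (expense_multiple - 1.0)


# Rounded expense multiples from the article, with the 50% floor.
for multiple in (10.0, 5.0):
    print(
        multiple,
        training_to_revenue(multiple, 0.5),
        round(training_share_of_loss(multiple, 0.5), 3),
    )
```

Under these assumptions, training compute alone comes to at least 2.5 to 5 times revenue and to roughly 56% to 63% of the operating loss, which is consistent with the claim that it accounts for more than half of the loss ratio.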
"The model economics are brutal," summarizes a financial analyst covering the sector. "You have a cost curve for training that seems to steepen with each generation, racing against a revenue curve that, while growing fast, is battling commoditization pressure in the inference market. The path to profitability is a question of endurance and finding monopolistic advantages beyond pure model performance."
Convergence on a New Battlefield
The narratives of StepFun's strategic pivot and the financial disclosures of Minimax and Zhipu, while distinct, are converging on the same existential challenge: the urgent need for viable, scalable business models beyond API calls and pure software services.
StepFun's answer, backed by its new war chest and leadership, is a deep integration into the hardware ecosystem—smartphones, cars, and eventually other IoT devices. This path seeks to create value through tightly coupled, on-device experiences and leverage hardware sales as a distribution and monetization channel. It is a "heavy and slow" path, but one potentially insulated from the margin-crushing competition of pure cloud-based inference.
For the listed companies, the pressure is now acutely public. Their financials serve as a canonical case study for investors, validating both the enormous market opportunity and the frightening burn rates required to compete. Their future hinges on accelerating the commercialization of their technology, improving inference efficiency, and discovering high-margin, defensible applications before their capital reserves deplete.
The "playoffs" of China's AI industry have thus commenced on two interconnected fronts: one focused on technological integration into the fabric of daily life, and the other on solving the fundamental equation of model economics. The success or failure of these divergent yet complementary strategies will define the next chapter for China's AI ambitions, determining which companies transition from cutting-edge research labs into enduring, profitable enterprises.