Alibaba's proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic's Claude Code

Published: 2 weeks ago (May 21, 2026 at 07:53 PM EDT)

8 min read

Source: VentureBeat

The AI Industry’s “Agent Era”

The AI industry has fully entered the “agent era,” a paradigm where AI models do far more than generate text — they now actively plan, execute, and course‑correct complex tasks over days rather than seconds.

Thus, it’s perhaps unsurprising to see Chinese e‑commerce giant Alibaba’s famed Qwen Team of AI researchers release a model capable of performing autonomous, agentic AI work over multiple days: that model has arrived in the form of Qwen‑3.7‑Max, which the company reports in a blog post achieved ≈ 35 hours of continuous autonomous execution — albeit, in a proprietary, not open‑source format, as prior Qwen Team releases were.

This is also to be expected — it’s what many analysts and industry experts feared in the wake of the departure of several key Qwen Team leaders earlier this year.

But it makes sense for Alibaba financially, at least in the short term: training AI models—especially ones as powerful as Qwen‑3.7‑Max—is expensive, and giving them away essentially for free, as open‑source models are, does not immediately help recoup any costs.

In that sense, Alibaba is simply aligning its efforts with American AI giants like OpenAI and Google by offering the latest and greatest models only through paid APIs and subscription or paid web‑plan bundles, and slightly less‑performant ones through open source.

Still, the arrival of Qwen‑3.7‑Max offers further optionality to enterprises and individual users, and more competition for American AI labs — rarely a bad thing for consumers at all budget levels. Yet, the fact that the model is only accessible from Chinese‑based endpoints means it may be limited in its appeal to American and European enterprises seeking to maximize compliance and security posturing when fulfilling government contracts, or even just attempting to comply with all relevant state, local, and national data‑sovereignty regulations.

The Marathon AI Era

To understand why Qwen‑3.7‑Max is a departure from previous models, one must look at how it was trained and how it operates in practice.

Language models typically degrade when forced to maintain a single train of thought over thousands of conversational turns; they forget instructions, hallucinate variables, or simply get stuck in logical loops. Qwen‑3.7‑Max was specifically designed as a “versatile agent foundation” capable of “long‑horizon reasoning” to overcome this exact bottleneck.

Demonstration: Autonomous Engineering Task

The starkest demonstration of this capability is an autonomous engineering task detailed by the Qwen team. The model was given access to an isolated server equipped with a T‑Head ZW‑M890 PPU—a hardware architecture the model had never encountered during its training. Its task was to optimize an attention kernel.

Over the course of 35 straight hours, Qwen‑3.7‑Max operated entirely autonomously.
It executed 1,158 distinct tool calls, performed 432 kernel evaluations, diagnosed compilation failures, and iteratively improved the code to achieve a 10.0× geometric‑mean speedup.

By comparison, Chinese competitor models like z.ai’s GLM‑5.1 and Moonshot’s Kimi K2.6 capped out at 7.3× and 5.0× speedups respectively, often voluntarily terminating their sessions when they failed to make progress. However, both are available open source.

Environment Scaling

This endurance is achieved through what Alibaba calls “environment scaling.” Just as early LLMs grew smarter by ingesting more diverse text, Qwen‑3.7‑Max was trained across a vast, scaled array of dynamic agentic environments.

It can simulate a one‑year lifecycle of a startup in the “YC‑Bench” evaluation, navigating hundreds of decision‑making rounds encompassing personnel management and contract screening. In this simulation, the model generated $2.08 million in virtual revenue, nearly doubling the performance of the prior generation, Qwen‑3.6‑Plus.
The model has built‑in reward‑hacking self‑monitoring, autonomously detecting when it attempts to cheat a training environment and adding heuristic rules to correct its own behavior.

A Brain for Any Scaffold

From a product perspective, Qwen‑3.7‑Max is designed to be the cognitive engine for modern software development and enterprise automation.

Context window: 1 million tokens
Maximum output limit: 64 K tokens

These specifications provide immense overhead for processing sprawling codebases or lengthy technical documents.

Cross‑Harness Generalization

One of its most compelling features is “cross‑harness generalization.” Rather than being hard‑coded to work best within a specific proprietary interface, Qwen‑3.7‑Max is built to act as a drop‑in intelligence layer for diverse agent frameworks. It supports the Anthropic API protocol natively, allowing developers to plug it directly into existing tools like Claude Code or OpenClaw.

Benchmark Performance

The benchmark data provided by Alibaba indicates that this generalized approach has paid massive dividends.

Benchmark	Qwen‑3.7‑Max	Claude Opus‑4.6 Max	DeepSeek V4‑Pro Max
Apex Math Reasoning	44.5	34.5	38.3
Humanity’s Last Exam (HLE)	41.4	—	—
Realistic Coding Agent (MCP‑Atlas)	76.4	—	—

These scores translate into tangible utility for end‑users. Through open‑source Model Context Protocol (MCP) integrations, the model can operate as an autonomous office assistant, capable of reading university formatting specs and automatically reformatting a messy Word document via command‑line tools without human intervention.

Pricing

Running this level of intelligence comes at a distinct cost. Developers accessing the API via Alibaba Cloud Model Studio will pay:

Resource	Price (USD)
Input tokens (per 1 M)	$2.50
Output tokens (per 1 M)	$7.50
Integrated web‑search calls (per 1 000)	$10.00
Code‑interpreter tools	Free (limited time)

Qwen‑3.7‑Max occupies a strategic middle ground in the current API economy. While it demands a notable premium over aggressively priced domestic rivals—costing nearly doub

Frontier AI Model Pricing Snapshot

For context, running heavy agentic workflows through OpenAI’s GPT‑5.4 or Anthropic’s Claude Opus 4.7 will cost developers $17.50 and $30.00 per million tokens, respectively.

Model	Input (¢/1 K tok)	Output (¢/1 K tok)	Total Cost (¢/1 K tok)	Source
MiMo‑V2.5 Flash	$0.10	$0.30	$0.40	Xiaomi MiMo
MiniMax M2.7	$0.30	$1.20	$1.50	MiniMax
Gemini 3.1 Flash‑Lite	$0.25	$1.50	$1.75	Google
MiMo‑V2.5	$0.40	$2.00	$2.40	Xiaomi MiMo
Kimi‑K2.6	$0.95	$4.00	$4.95	Moonshot/Kimi
GLM‑5	$1.00	$3.20	$4.20	Z.ai
Grok 4.3 (low context)	$1.25	$2.50	$3.75	xAI
DeepSeek V4 Pro	$1.74	$3.48	$5.22	DeepSeek
GLM‑5.1	$1.40	$4.40	$5.80	Z.ai
Claude Haiku 4.5	$1.00	$5.00	$6.00	Anthropic
Grok 4.3 (high context)	$2.50	$5.00	$7.50	xAI
Qwen 3.7‑Max	$2.50	$7.50	$10.00	Alibaba Cloud
Gemini 3.5 Flash	$1.50	$9.00	$10.50	Google
Gemini 3.1 Pro Preview (≤200K)	$2.00	$12.00	$14.00	Google
GPT‑5.4	$2.50	$15.00	$17.50	OpenAI
Gemini 3.1 Pro Preview (>200K)	$4.00	$18.00	$22.00	Google
Claude Opus 4.7	$5.00	$25.00	$30.00	Anthropic
GPT‑5.5	$5.00	$30.00	$35.00	OpenAI

Interpretation: By positioning Qwen 3.7‑Max just below Google’s Gemini 3.5 Flash ($10.50) yet well above budget‑tier models, Alibaba signals that this is a flagship reasoning engine aimed at luring enterprise workloads away from the most expensive Silicon‑Valley offerings.

Licensing Remains Proprietary (for now)

For all its technical brilliance, the most controversial aspect of Qwen 3.7‑Max is how it is distributed.

Proprietary model – released API‑only.
Historically, Alibaba’s Qwen series (e.g., Qwen 2.5, Qwen 3.6) were open‑weight, allowing developers to:
- Download the model,
- Run it on‑premise,
- Fine‑tune for data‑sensitive or highly specific use cases without sending proprietary information to a third‑party server.

Impact of the shift:

Audience	Consequence
Enterprise users	Must trust Alibaba Cloud with their data streams and rely on constant internet connectivity for agentic workflows.
Open‑source community	Loses access to one of the most capable models on the planet, limiting local‑hardware experimentation and private‑cluster deployments.

Community Reactions: Awe Meets Disappointment

The developer community reacted swiftly, mixing profound respect for the engineering achievement with frustration over the licensing model.

Sudo su (@sudoingX) on X (formerly Twitter):
“qwen is unreal, they just dropped 3.7 max and it is beating opus 4.6 max on most of the benchmarks they ran.”

Highlights from the discussion

Performance metrics:
- “the apex math number, 44.5 against opus 34.5, that is not a small gap,” noted Sudo su.
- “35 hours straight on a kernel‑optimization task with 1000+ tool calls” – a vivid illustration of the agent era in action.
Speed of iteration:
- Qwen 3.6 released just last month; the leap to 3.7‑Max showcases a relentless development cadence.
- “nobody else is moving like this,” Sudo su added.
Open‑source concerns:
- “one thing though, please open source this one too,” the commentator pleaded.
- “3.6 dense made the entire local LLM ecosystem better. the max tier going API‑only would close a door we have been keeping open. give us the weights eventually.”

Bottom line

Qwen 3.7‑Max proves that the autonomous‑agent era is no longer a theoretical projection—it’s a present reality capable of executing complex engineering feats while humans sleep. The lingering question is whether this new frontier of AI will become a democratized resource you can download to your laptop, or remain an intelligence utility rented strictly from the cloud. For now, it is undeniably the latter.