Qwen 3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

Published: February 28, 2026 at 03:20 PM EST
5 min read

Source: Hacker News

Alibaba’s Qwen 3.5 Medium Model Series

A little more than a day ago, the Qwen AI team released the Qwen 3.5 Medium model series, a family of four new large language models (LLMs) that support agentic tool calling. Three of the models are available for commercial use under the Apache 2.0 license:

  • Qwen 3.5‑35B‑A3B
  • Qwen 3.5‑122B‑A10B
  • Qwen 3.5‑27B

The models can be downloaded from Hugging Face and ModelScope.

A fourth model, Qwen 3.5‑Flash, is proprietary and only accessible via the Alibaba Cloud Model Studio API, but it offers a strong cost advantage compared with Western alternatives (see the pricing table below).

Why the Open‑Source Models Matter

  • Benchmark performance – On third‑party tests, the open‑source Qwen 3.5 models match or beat similarly sized proprietary models from OpenAI and Anthropic, including OpenAI's GPT‑5‑mini and Anthropic's Claude Sonnet 4.5 (released only five months ago).
  • Quantization‑friendly – The team reports that the models stay highly accurate even when quantized, i.e., when the numeric precision of weights and KV‑cache values is reduced.
  • Frontier‑level context windows on the desktop – The flagship Qwen 3.5‑35B‑A3B can handle a context of more than 1 million tokens on consumer‑grade GPUs with 32 GB of VRAM, requiring far less compute than many competing solutions.
  • Near‑lossless 4‑bit quantization – Enables massive datasets to be processed on modest hardware.
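
To see why 4‑bit weights matter for local deployment, a back‑of‑the‑envelope memory estimate helps. The helper below is a hypothetical sketch that counts weight storage only (KV cache and activations need additional memory on top of this):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

full_precision = weight_memory_gb(35, 16)  # BF16: 70 GB, beyond consumer GPUs
quantized = weight_memory_gb(35, 4)        # 4-bit: 17.5 GB, fits in 32 GB VRAM
```

Under this rough arithmetic, 4‑bit quantization shrinks the 35 B‑parameter model's weights from 70 GB to 17.5 GB, which is what puts it within reach of a 32 GB consumer GPU with headroom left for the KV cache.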

Technology: Delta Force

Qwen 3.5’s performance stems from a hybrid architecture that blends Gated Delta Networks with a sparse Mixture‑of‑Experts (MoE) system. Highlights from the Qwen 3.5‑35B‑A3B specifications:

| Feature | Detail |
| --- | --- |
| Parameter efficiency | 35 B total parameters, but only 3 B are active for any given token. |
| Expert diversity | The MoE layer contains 256 experts; 8 are routed per token plus 1 shared expert, reducing inference latency. |
| Near‑lossless quantization | Maintains high accuracy with 4‑bit weights, shrinking the memory footprint for local deployment. |
| Base model release | Alibaba open‑sourced the Qwen 3.5‑35B‑A3B‑Base model alongside the instruction‑tuned variants. |
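
A minimal sketch of the top‑k expert routing described above (256 experts, 8 routed per token). The softmax‑over‑selected‑logits scheme shown here is a common MoE convention, assumed for illustration rather than taken from Qwen's actual implementation:

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8  # per the Qwen 3.5-35B-A3B spec table

def route(logits):
    """Select the top-k experts by router logit and weight them with a
    softmax computed over just the selected logits."""
    topk = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:TOP_K]
    exps = [math.exp(logits[i]) for i in topk]
    total = sum(exps)
    return {i: e / total for i, e in zip(topk, exps)}

random.seed(0)
weights = route([random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)])
```

Only the 8 selected experts (plus the shared expert) execute for each token, which is how 35 B total parameters translate into roughly 3 B active ones per forward pass.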

Product: Intelligence That “Thinks” First

Qwen 3.5 introduces a native "Thinking Mode." Before emitting a final answer, the model generates an internal reasoning chain wrapped in <think>…</think> tags, allowing it to work through complex logic.
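
As a rough illustration, the reasoning chain can be separated from the final answer before display. The <think> tag convention follows the description above; the helper itself is a hypothetical sketch, not an official utility:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, answer) using <think> tags."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()  # no thinking block emitted
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4</think>The answer is 4.")
```

Agent frameworks typically log the reasoning for debugging while showing only the answer to end users.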

| Model | Target Hardware | Context Length | Notable Traits |
| --- | --- | --- | --- |
| Qwen 3.5‑27B | High‑efficiency GPUs | >800 K tokens | Optimized for low‑resource environments. |
| Qwen 3.5‑Flash | Hosted on Alibaba Cloud | 1 M+ tokens (default) | Production‑grade, includes official tools. |
| Qwen 3.5‑122B‑A10B | Server‑grade GPUs (80 GB VRAM) | 1 M+ tokens | Bridges the gap to the world's largest frontier models. |

Benchmark results show the 35B‑A3B model surpasses larger predecessors (e.g., Qwen‑3‑235B) and the proprietary GPT‑5‑mini and Claude Sonnet 4.5 in knowledge (MMMLU) and visual reasoning (MMMU‑Pro).

Alibaba Qwen 3.5 Medium models benchmark comparison chart. Credit: Alibaba

Pricing & API Integration

For users who prefer not to host the weights themselves, Alibaba Cloud Model Studio offers an API for Qwen 3.5‑Flash with the following rates:

| Operation | Price (per 1 M tokens) |
| --- | --- |
| Input | $0.10 |
| Output | $0.40 |
| Cache Creation | $0.125 |
| Cache Read | $0.01 |
| Tool Calling – Web Search | $10 per 1,000 calls |
| Tool Calling – Code Interpreter | Free (limited‑time offer) |
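
Using the rates above, the cost of a single request is straightforward arithmetic. A minimal sketch, with the rate table hard‑coded from this article and illustrative token counts:

```python
# USD per 1M tokens, taken from the Qwen 3.5-Flash pricing table above.
RATES = {"input": 0.10, "output": 0.40, "cache_creation": 0.125, "cache_read": 0.01}

def request_cost(input_toks: int, output_toks: int,
                 cache_read_toks: int = 0, cache_create_toks: int = 0) -> float:
    """Estimate the USD cost of one API request from its token counts."""
    per_m = lambda n, key: n / 1_000_000 * RATES[key]
    return (per_m(input_toks, "input") + per_m(output_toks, "output")
            + per_m(cache_read_toks, "cache_read")
            + per_m(cache_create_toks, "cache_creation"))

# e.g. a long-context request: 800K input, 50K output, 200K served from cache
cost = request_cost(800_000, 50_000, cache_read_toks=200_000)  # ≈ $0.102
```

Even a near‑megatoken request lands at roughly a tenth of a cent per thousand input tokens, which is what makes the long‑context use cases discussed above economical.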

Cost Comparison with Other Major LLM APIs

| Model | Input | Output | Total Cost* | Source |
| --- | --- | --- | --- | --- |
| Qwen 3 Turbo | $0.05 | $0.20 | $0.25 | Alibaba Cloud |
| Qwen 3.5‑Flash | $0.10 | $0.40 | $0.50 | Alibaba Cloud |
| DeepSeek‑Chat (v3.2‑Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| DeepSeek‑Reasoner (v3.2‑Exp) | $0.28 | $0.42 | $0.70 | DeepSeek |
| Grok 4.1 Fast (reasoning) | $0.20 | $0.50 | $0.70 | xAI |
| Grok 4.1 Fast (non‑reasoning) | $0.20 | $0.50 | $0.70 | xAI |

*Total cost = Input + Output (per 1 M tokens).

Qwen 3.5‑Flash is therefore among the most affordable LLM APIs worldwide.

All information is current as of 28 Feb 2026.

Model Pricing Overview

| Model | Input $ / 1 M tokens | Output $ / 1 M tokens | Total $ / 1 M tokens* | Provider |
| --- | --- | --- | --- | --- |
| MiniMax M2.5 | 0.15 | 1.20 | 1.35 | MiniMax |
| MiniMax M2.5‑Lightning | 0.30 | 2.40 | 2.70 | MiniMax |
| Gemini 3 Flash Preview | 0.50 | 3.00 | 3.50 | Google |
| Kimi‑k2.5 | 0.60 | 3.00 | 3.60 | Moonshot |
| GLM‑5 | 1.00 | 3.20 | 4.20 | Z.ai |
| ERNIE 5.0 | 0.85 | 3.40 | 4.25 | Baidu |
| Claude Haiku 4.5 | 1.00 | 5.00 | 6.00 | Anthropic |
| Qwen3‑Max (2026‑01‑23) | 1.20 | 6.00 | 7.20 | Alibaba Cloud |
| Gemini 3 Pro (≤200K) | 2.00 | 12.00 | 14.00 | Google |
| GPT‑5.2 | 1.75 | 14.00 | 15.75 | OpenAI |
| Claude Sonnet 4.5 | 3.00 | 15.00 | 18.00 | Anthropic |
| Gemini 3 Pro (>200K) | 4.00 | 18.00 | 22.00 | Google |
| Claude Opus 4.6 | 5.00 | 25.00 | 30.00 | Anthropic |
| GPT‑5.2 Pro | 21.00 | 168.00 | 189.00 | OpenAI |

*Total = Input + Output cost per 1 M tokens (rounded to two decimals).

What It Means for Enterprise Technical Leaders and Decision‑Makers

With the launch of the Qwen 3.5 Medium models, rapid iteration and fine‑tuning, once the exclusive domain of well‑funded labs, are now accessible for on‑premises development at many non‑technical firms. This effectively decouples sophisticated AI from massive capital expenditure.

Across the organization, this architecture transforms how data is handled and secured. The ability to ingest massive document repositories or hour‑scale videos locally enables deep institutional analysis without the privacy risks of third‑party APIs.

By running these specialized Mixture‑of‑Experts models within a private firewall, organizations can maintain sovereign control over their data while leveraging native “thinking” modes and official tool‑calling capabilities to build more reliable, autonomous agents.

Early adopters on Hugging Face have specifically lauded the model’s ability to “narrow the gap” in agentic scenarios where previously only the largest closed models could compete.

This shift toward architectural efficiency over raw scale ensures that AI integration remains cost‑conscious, secure, and agile enough to keep pace with evolving operational needs.
