Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks

Published: April 2, 2026 at 01:49 PM EDT
8 min read
Source: VentureBeat

Gemma 4: Google’s Open‑Weight Model Family Goes Fully Apache 2.0

For the past two years, enterprises evaluating open‑weight models have faced an awkward trade‑off. Google’s Gemma line consistently delivered strong performance, but its custom license — with usage restrictions and terms Google could update at will — pushed many teams toward Mistral or Alibaba’s Qwen instead. Legal review added friction, compliance teams flagged edge cases, and “open” with asterisks isn’t the same as truly open.

Gemma 4 eliminates that friction entirely. Google DeepMind’s newest open‑model family ships under a standard Apache 2.0 license — the same permissive terms used by Qwen, Mistral, Arcee, and most of the open‑weight ecosystem.

  • No custom clauses.
  • No “Harmful Use” carve‑outs that require legal interpretation.
  • No restrictions on redistribution or commercial deployment.

For enterprise teams that have been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.

Timing note – While some Chinese AI labs have begun pulling back from fully open releases (most notably Alibaba with its latest Qwen models, Qwen 3.5 Omni and Qwen 3.6 Plus), Google is moving in the opposite direction: it is opening up its most capable Gemma release yet while explicitly stating that the architecture draws from its commercial Gemini 3 research.


Four Models, Two Tiers: Edge to Workstation in a Single Family

Gemma 4 arrives as four distinct models organized into two deployment tiers.

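Tier        | Model                            | Parameters                    | MoE? | Modalities           | Context Window
------------|----------------------------------|-------------------------------|------|----------------------|---------------
Workstation | Gemma‑4‑31B (dense)              | 31 B                          | No   | Text + Image         | 256 K tokens
Workstation | Gemma‑4‑A4B (Mixture‑of‑Experts) | 25.2 B total (3.8 B active)   | Yes  | Text + Image         | 256 K tokens
Edge        | Gemma‑4‑E2B                      | 2.3 B effective (5.1 B total) | No   | Text + Image + Audio | 128 K tokens
Edge        | Gemma‑4‑E4B                      | 4 B effective (≈ 9 B total)   | No   | Text + Image + Audio | 128 K tokens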

Naming conventions

  • E – “effective parameters.”
    E2B has 2.3 B effective parameters but 5.1 B total because each decoder layer carries its own small embedding table via Per‑Layer Embeddings (PLE). The tables are large on disk but cheap to compute, so the model runs like a 2 B model while technically weighing more.

  • A – “active parameters.”
    In the 26 B A4B MoE model, only 3.8 B of the total 25.2 B parameters activate during inference, delivering roughly 26 B‑class intelligence with compute costs comparable to a 4 B model.
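
The two naming conventions come down to the same arithmetic: how much of the stored model does per-token work. A minimal sketch using only the parameter counts quoted above; treating the "effective" count as the compute-relevant share is a simplification, and the dictionary layout and function name are illustrative.

```python
# Parameter counts quoted in the article, in billions.
MODELS = {
    "E2B": {"total": 5.1, "compute": 2.3},       # "effective" parameters (PLE tables excluded)
    "26B A4B": {"total": 25.2, "compute": 3.8},  # "active" parameters per token (MoE)
}

def compute_fraction(total_b: float, compute_b: float) -> float:
    """Share of stored parameters that contribute to per-token compute."""
    return compute_b / total_b

for name, p in MODELS.items():
    frac = compute_fraction(p["total"], p["compute"])
    print(f"{name}: {p['compute']} B of {p['total']} B ({frac:.1%}) used per token")
```

In both cases the full parameter set still has to be stored, which is why the total count matters for disk and memory even when the compute cost tracks the smaller number.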

Deployment implications

  • The MoE model can run on consumer‑grade GPUs and should appear quickly in tools like Ollama and LM Studio.
  • The 31 B dense model needs more headroom: think an NVIDIA H100 or RTX Pro 6000 for unquantized inference. Google also ships Quantization‑Aware Training (QAT) checkpoints to maintain quality at lower precision.
  • On Google Cloud, both workstation models can run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.
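
To see why the dense model calls for data-center-class memory while quantized checkpoints open up smaller cards, a rough weight-size estimate helps. This is a back-of-the-envelope sketch: the precision names and bytes-per-parameter values are standard conventions rather than details from Google's release, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
# Bytes needed to store one parameter at common precisions (assumed, not from the article).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return params_billion * BYTES_PER_PARAM[precision]

for name, params in [("31B dense", 31.0), ("26B A4B MoE", 25.2)]:
    sizes = ", ".join(
        f"{prec}: {weight_memory_gb(params, prec):.1f} GB" for prec in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

Note that the MoE estimate uses the total parameter count: only 3.8 B parameters are active per token, but all experts generally need to be resident in memory.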

The MoE Bet: 128 Small Experts to Save on Inference Costs

The architectural choices inside the 26 B A4B model deserve particular attention from teams evaluating inference economics.

  • Instead of a handful of large experts, Google uses 128 small experts, activating 8 per token plus one shared always‑on expert.
  • This yields a model that benchmarks competitively with dense models in the 27 B–31 B range while running at roughly the speed of a 4 B model during inference.
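
The routing scheme described above (128 routed experts, 8 selected per token, one shared expert that always runs) can be sketched in plain Python. This is an illustrative toy, not Google's implementation: the scalar "experts", the softmax router, and the renormalization of the selected weights are all assumptions chosen for brevity.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the article)
TOP_K = 8          # experts activated per token (from the article)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are softmax probabilities renormalized over the selected experts;
    this normalization is an assumption, one common choice among several.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

def moe_layer(x, router_logits, experts, shared_expert):
    """Add the shared expert's output to the weighted top-k expert outputs."""
    out = shared_expert(x)
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)
    return out

# Toy experts: each scales a scalar input (hypothetical stand-ins for FFN blocks).
rng = random.Random(0)
experts = [lambda x, s=rng.uniform(0.5, 1.5): s * x for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.1 * x
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(logits)
print(len(selected), "of", NUM_EXPERTS, "routed experts run for this token")
print("output:", moe_layer(2.0, logits, experts, shared_expert))
```

The point of the sketch is the cost profile: every token pays for 8 routed experts plus the shared one, regardless of how many experts exist in total.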

Why it matters

  • Cost efficiency – Delivering 27 B‑class reasoning at 4 B‑class throughput means fewer GPUs, lower latency, and cheaper per‑token inference in production.
  • Practicality – For organizations running coding assistants, document‑processing pipelines, or multi‑turn agentic workflows, the MoE variant may be the most practical choice in the family.

Both workstation models use a hybrid attention mechanism that interleaves local sliding‑window attention with full global attention (the final layer is always global). This design enables the 256 K context window while keeping memory consumption manageable — an important consideration for teams processing long documents, codebases, or multi‑turn agent conversations.
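
The memory argument can be made concrete by counting how many token positions each layer must keep in its KV cache. The sketch below is hypothetical in every numeric detail: the article does not state the layer count, the local-to-global ratio, or the window size, so the values here are placeholders. Only the interleaving and the always-global final layer come from the description above.

```python
def layer_pattern(num_layers: int, locals_per_global: int) -> list[str]:
    """Interleave sliding-window ("local") layers with full-attention ("global") layers.

    Every (locals_per_global + 1)-th layer is global, and the final layer is
    forced to be global, as the article describes.
    """
    pattern = [
        "global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
        for i in range(num_layers)
    ]
    pattern[-1] = "global"
    return pattern

def cached_positions(pattern: list[str], context_len: int, window: int) -> int:
    """Token positions held in the KV cache, summed over layers.

    A local layer keeps at most `window` positions; a global layer keeps all.
    """
    return sum(
        context_len if kind == "global" else min(window, context_len)
        for kind in pattern
    )

# Hypothetical configuration: layer count, ratio, and window size are NOT from the article.
NUM_LAYERS, LOCALS_PER_GLOBAL, WINDOW = 48, 5, 1024
CONTEXT = 256 * 1024  # 256 K tokens

hybrid = layer_pattern(NUM_LAYERS, LOCALS_PER_GLOBAL)
all_global = ["global"] * NUM_LAYERS
ratio = cached_positions(hybrid, CONTEXT, WINDOW) / cached_positions(all_global, CONTEXT, WINDOW)
print(f"{hybrid.count('global')} global layers of {NUM_LAYERS}")
print(f"KV-cache positions vs. all-global: {ratio:.1%}")
```

Under these placeholder values most of the cache is spent on the few global layers, which is the general reason interleaving keeps long-context memory manageable.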


Native Multimodality: Vision, Audio, and Function Calling Baked In from Scratch

Previous generations of open models typically treated multimodality as an add‑on (vision encoders bolted onto text backbones, audio via external ASR pipelines, function calling via prompt engineering). Gemma 4 integrates all of these capabilities at the architecture level.

Vision

  • All four models handle variable‑aspect‑ratio image input with configurable visual‑token budgets.
  • Budgets range from 70 to 1,120 tokens per image, letting developers trade off detail against compute.
    • Low budgets – classification, captioning.
    • High budgets – OCR, document parsing, fine‑grained visual analysis.
  • Multi‑image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.
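
A quick calculation shows how much the budget choice matters for multi-image prompts. This is a rough sketch: the helper names are hypothetical, the range check only mirrors the 70 to 1,120 figures quoted above (whether every value in between is accepted is not stated), and "128 K" is taken as 128,000 tokens.

```python
MIN_BUDGET, MAX_BUDGET = 70, 1120  # per-image token range quoted in the article

def visual_tokens(num_images: int, budget: int) -> int:
    """Total visual tokens for `num_images` images at one per-image budget."""
    if not MIN_BUDGET <= budget <= MAX_BUDGET:
        raise ValueError(f"budget must be between {MIN_BUDGET} and {MAX_BUDGET}")
    return num_images * budget

def context_share(tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by `tokens`."""
    return tokens / context_window

# Example: a 20-page scanned document, one image per page.
for label, budget in [("low (captioning)", MIN_BUDGET), ("high (OCR)", MAX_BUDGET)]:
    total = visual_tokens(20, budget)
    print(f"{label}: {total} tokens, {context_share(total, 128_000):.1%} of a 128 K window")
```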

Audio (Edge models only)

  • Native automatic speech recognition (ASR) and speech‑to‑translated‑text run on‑device.
  • The audio encoder has been compressed to 305 M parameters (down from 681 M in Gemma 3n), and frame duration dropped from 160 ms to 40 ms for more responsive transcription.
  • Ideal for voice‑first applications that need to keep data local (e.g., healthcare, field service, multilingual customer interaction).
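
The frame-duration change translates directly into how often the encoder emits output. A small sketch of that arithmetic, assuming non-overlapping frames (the article does not say whether frames overlap); the function names are illustrative.

```python
import math

def frames_per_second(frame_ms: float) -> float:
    """Frames emitted per second of audio at a given frame duration."""
    return 1000.0 / frame_ms

def frames_for_clip(clip_seconds: float, frame_ms: float) -> int:
    """Frames covering a clip, rounding a partial final frame up."""
    return math.ceil(clip_seconds * 1000.0 / frame_ms)

for label, frame_ms in [("Gemma 3n", 160), ("Gemma 4", 40)]:
    print(
        f"{label}: {frames_per_second(frame_ms)} frames/s, "
        f"{frames_for_clip(30, frame_ms)} frames for a 30 s clip"
    )
```

Going from 160 ms to 40 ms means four times as many frames per second, so the transcript can be updated at a finer granularity.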

Function calling

  • Function‑calling support is baked into the model, eliminating the need for fragile prompt‑engineering tricks.

Bottom line for IT leaders

  • Licensing – Fully Apache 2.0, no hidden clauses.
  • Flexibility – Four models covering edge to workstation workloads, with both dense and MoE options.
  • Cost – MoE design delivers high‑capacity reasoning at low inference cost.
  • Multimodality – Vision, audio, and function calling are first‑class citizens, ready for on‑device or server‑side deployment.

Gemma 4 positions Google as a serious open‑weight contender, offering enterprises a permissively licensed, high‑performance, and truly multimodal model family.

Function Calling – Native Across All Four Models

Function calling is also native across all four models, drawing on research from Google’s FunctionGemma release late last year.

  • Unlike previous approaches that relied on instruction‑following to coax models into structured tool use, Gemma 4’s function calling was trained into the model from the ground up.
  • It is optimized for multi‑turn agentic flows with multiple tools.
  • This shows up in agentic benchmarks and, more importantly, reduces the prompt‑engineering overhead that enterprise teams typically invest when building tool‑using agents.
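
Whatever the model emits, the application still has to declare its tools and execute the calls. The sketch below shows that application side only. The JSON call format, the schema layout, and the `get_weather` tool are hypothetical illustrations, not Gemma 4's actual tool-call syntax, and no model is invoked.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return {"city": city, "forecast": "sunny", "high_c": 21}

TOOLS = {"get_weather": get_weather}

# Tool declaration handed to the model. The JSON-Schema-style layout is an assumption.
TOOL_SPECS = [{
    "name": "get_weather",
    "description": "Get the forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def dispatch(tool_call_json: str) -> str:
    """Parse one model-emitted tool call, run the tool, and return a JSON result."""
    call = json.loads(tool_call_json)
    name, args = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        return json.dumps({"result": TOOLS[name](**args)})
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"error": str(exc)})

# Stand-in for what a model might emit in an agentic turn.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch(model_output))
```

In a multi-turn flow, the returned JSON would be appended to the conversation so the model can decide whether to call another tool or answer the user.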

Benchmarks in Context: Where Gemma 4 Lands in a Crowded Field

The benchmark numbers show a large generational jump: on AIME 2026, the 31 B dense model scores 89.2 %, against 20.8 % for Gemma 3 27 B.

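Model           | Size                 | AIME 2026 | LiveCodeBench v6 | Codeforces ELO | MMMU‑Pro (Vision) | MATH‑Vision | GPQA Diamond
----------------|----------------------|-----------|------------------|----------------|-------------------|-------------|-------------
Gemma 4 (dense) | 31 B                 | 89.2 %    | 80.0 %           | 2,150          | 76.9 %            | 85.6 %      | -
Gemma 4 (MoE)   | 26 B (3.8 B active)  | 88.3 %    | 77.1 %           | -              | -                 | -           | 82.3 %
Gemma 3 (dense) | 27 B                 | 20.8 %    | 29.1 %           | -              | -                 | -           | -
Edge‑E4B        | 4 B effective        | 42.5 %    | 52.0 %           | -              | -                 | -           | -
Edge‑E2B        | 2.3 B effective      | 37.5 %    | 44.0 %           | -              | -                 | -           | -

(- = not reported)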

Key take‑aways

  • The 31 B dense model’s scores would have been frontier‑class for proprietary models not long ago.
  • The MoE variant closes the gap with only a modest performance trade‑off while offering a significant inference‑cost advantage.
  • Edge models (E4B & E2B) punch above their weight class, delivering strong results on a T4 GPU despite being a fraction of the size of Gemma 3 27 B.
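
The take-aways above rest on simple differences between the quoted scores. A short sketch that computes them, using only the AIME 2026 and LiveCodeBench v6 figures from the table; the dictionary keys and function name are illustrative.

```python
# Scores quoted in the article (percent).
SCORES = {
    "Gemma 4 31B dense": {"AIME 2026": 89.2, "LiveCodeBench v6": 80.0},
    "Gemma 4 MoE":       {"AIME 2026": 88.3, "LiveCodeBench v6": 77.1},
    "Gemma 3 27B dense": {"AIME 2026": 20.8, "LiveCodeBench v6": 29.1},
}

def point_gap(a: str, b: str, benchmark: str) -> float:
    """Percentage-point difference (a minus b) on one benchmark."""
    return round(SCORES[a][benchmark] - SCORES[b][benchmark], 1)

for bench in ("AIME 2026", "LiveCodeBench v6"):
    gen = point_gap("Gemma 4 31B dense", "Gemma 3 27B dense", bench)
    moe = point_gap("Gemma 4 31B dense", "Gemma 4 MoE", bench)
    print(f"{bench}: +{gen} points over Gemma 3, MoE trails dense by {moe} points")
```

On these two benchmarks the generational gain is tens of points, while the dense-to-MoE gap stays under three points.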

Competitive Landscape

  • Qwen 3.5, GLM‑5, and Kimi K2.5 all compete aggressively in this parameter range.
  • What distinguishes Gemma 4 is not a single benchmark, but the combination of:
    • Strong reasoning performance
    • Native multimodality (text, vision, audio)
    • Function calling
    • 256 K context window
    • A genuinely permissive Apache 2.0 license
    • Deployment flexibility from edge devices to cloud serverless

What Enterprise Teams Should Watch Next

  1. Model Availability

    • Google is releasing both pre‑trained base models and instruction‑tuned variants.
    • The base models have historically been strong foundations for custom training.
    • The Apache 2.0 license removes ambiguity about commercial deployment of fine‑tuned derivatives.
  2. Serverless Deployment via Cloud Run

    • GPU‑enabled Cloud Run offers pay‑as‑you‑go inference that scales to zero.
    • This can significantly improve economics for internal tools and lower‑traffic applications compared to always‑on GPU instances.
  3. Future Model Sizes

    • Google has hinted that the current lineup is not the complete Gemma 4 family; additional sizes are likely to follow.
  4. Overall Value Proposition

    • The current offering—workstation‑class reasoning models and edge‑class multimodal models, all under Apache 2.0 and built on Gemini 3 research—represents the most complete open‑model release Google has shipped to date.
    • For enterprises waiting for open models that compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.
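
The scale-to-zero economics mentioned under Serverless Deployment reduce to a break-even calculation. The sketch below uses placeholder hourly rates in arbitrary currency units, not real Cloud Run or GPU-instance pricing, and ignores cold-start time and any minimum-billing rules; all names are illustrative.

```python
def monthly_cost_always_on(hourly_rate: float, hours_in_month: float = 730.0) -> float:
    """Cost of a GPU instance that is billed for every hour of the month."""
    return hourly_rate * hours_in_month

def monthly_cost_scale_to_zero(hourly_rate: float, busy_hours: float) -> float:
    """Cost when billing stops while idle; only busy hours are charged."""
    return hourly_rate * busy_hours

def break_even_utilization(always_on_rate: float, serverless_rate: float) -> float:
    """Fraction of the month above which the always-on instance becomes cheaper."""
    return min(1.0, always_on_rate / serverless_rate)

# Placeholder prices per GPU-hour, NOT real pricing.
ALWAYS_ON_RATE, SERVERLESS_RATE = 2.0, 3.0
BUSY_HOURS = 100.0  # e.g. an internal tool used a few hours per working day

print("always-on:", monthly_cost_always_on(ALWAYS_ON_RATE))
print("scale-to-zero:", monthly_cost_scale_to_zero(SERVERLESS_RATE, BUSY_HOURS))
print(f"break-even utilization: {break_even_utilization(ALWAYS_ON_RATE, SERVERLESS_RATE):.0%}")
```

With these placeholder numbers, the serverless option stays cheaper until the service is busy for roughly two thirds of the month; real rates would shift that threshold.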