Google DeepMind has just launched Gemma 4, its latest open-weight family, with sizes E2B, E4B, 26B A4B, and 31B. Alibaba's Qwen 3.5 is already one of the most robust open families on the market, with sizes 2B, 4B, 9B, 27B, 35B-A3B, 122B-A10B, and 397B-A17B.
If you're choosing an open model for local agents, laptop-based inference, or self-hosted production, the useful question isn't "which family has the most outstanding model?" but rather "which family wins at the size you can actually run?"
Currently, there are no third-party, Vals-style comparison charts that cover every size of Gemma 4 and Qwen 3.5. The most reliable comparison today therefore has to distinguish between two different kinds of evidence:
- Third-party chat-preference evidence, where Google cites Arena AI's open-weight text ranking.
- Official model-card overlap, where we match models by deployment class and only compare benchmark rows that both sides actually publish.
On the currently published official overlap, Qwen 3.5 wins more rows in the 2B, 4B, and mid-size MoE classes, while Gemma 4 is most competitive in the dense ~30B class and does better on edge audio, multilingual, and some multimodal workloads. On Arena AI, however, both Gemma 4 31B and Gemma 4 26B A4B rank higher on chat preference than the comparable large Qwen 3.5 models.
Quick comparison
| Size class | Gemma 4 model | Qwen 3.5 model | Best reading of the evidence today |
|---|---|---|---|
| Edge / mobile | Gemma 4 E2B | Qwen3.5-2B | No reliable third-party rankings yet; official results suggest Qwen has a slight edge. |
| 4B class | Gemma 4 E4B | Qwen3.5-4B | No reliable third-party rankings yet; official results suggest Qwen has a slight edge. |
| Large dense | Gemma 4 31B | Qwen3.5-27B | The official overlap is evenly split; Arena AI chat preference leans towards Gemma 4 31B. |
| Efficient MoE | Gemma 4 26B A4B | Qwen3.5-35B-A3B | The official overlap favors Qwen; Arena AI chat preference favors Gemma 4 26B A4B. |
Qwen also covers sizes that Gemma 4 simply doesn't offer: 9B, 122B-A10B, and 397B-A17B.
Why might Google's claim still be true?
If you only read the model-card tables below, you might easily conclude that Qwen 3.5 is far ahead of Gemma 4. That conclusion is too hasty. Google's launch post is presenting a different kind of evidence.
On Arena AI's open-weight text ranking, as of March 31, 2026:
- Gemma 4 31B ranked 3rd among open-weight models with a score of 1452 ± 9.
- Qwen3.5-397B-A17B ranked 4th with 1449 ± 6.
- Gemma 4 26B A4B ranked 6th with 1441 ± 9.
- Qwen3.5-122B-A10B reached 1416 ± 6.
- Qwen3.5-27B reached 1404 ± 6.
- Qwen3.5-35B-A3B reached 1400 ± 6.
This is third-party evidence in Gemma's favor, particularly for Google's "byte for byte" and "intelligence-per-parameter" framing. It doesn't mean Gemma 4 beats Qwen 3.5 on every task benchmark. It does mean that on a large-scale chat-preference ranking, the two larger Gemma 4 models currently sit above the main open Qwen 3.5 models.
A balanced assessment is:
- Arena AI currently favors Gemma 4 on assistant quality at the large end.
- The official model-card overlap leans towards Qwen 3.5 on many static reasoning, coding, and agent rows.
- Third-party evidence for small models remains limited, so conclusions about 2B and 4B are still more provisional than those for large models.
The bigger trend: open weights are improving almost as fast as closed weights
A helpful way to frame the Gemma 4 versus Qwen 3.5 question is to chart the best-performing open-weight model and the best-performing closed-weight model at each point in time over the past year, using a single third-party index.
For that purpose, Arena AI isn't perfect, but it is useful. It measures conversation quality from human preferences, not accuracy on static benchmarks, so it shouldn't replace the underlying model-card tables. But it puts open and closed models on the same scale, which makes it a good indicator of how fast each frontier is moving.
On this index, the open-weight frontier rose from 1398 to 1456 over the past year (+58), while the closed-weight frontier rose from 1448 to 1504 (+56). Open weights improved almost exactly as fast as closed weights on this ranking, but the absolute gap barely moved: 50 points at the start of the window versus 48 points at the end.
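As a sanity check on that arithmetic, here is a minimal sketch that recomputes both deltas and the gap from the endpoint scores quoted above (the numbers come from the milestones table below; nothing else is assumed):

```python
# Endpoint scores from the Arena AI milestones table (as of March 31, 2026).
open_start, open_end = 1398, 1456      # DeepSeek-R1 -> GLM-5
closed_start, closed_end = 1448, 1504  # Gemini 2.5 Pro Experimental -> Claude Opus 4.6 Thinking

print("open-weight delta:  ", open_end - open_start)      # +58
print("closed-weight delta:", closed_end - closed_start)  # +56
print("gap at window start:", closed_start - open_start)  # 50
print("gap at window end:  ", closed_end - open_end)      # 48
```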
Key milestones behind the chart
| Track | Became the frontier | Model | Arena AI score |
|---|---|---|---|
| Open weight | Start of window (April 2, 2025) | DeepSeek-R1 | 1398 ± 5 |
| Open weight | August 21, 2025 | DeepSeek-V3.1 | 1418 ± 6 |
| Open weight | December 22, 2025 | GLM-4.7 | 1443 ± 6 |
| Open weight | January 27, 2026 | Kimi K2.5 Thinking | 1453 ± 5 |
| Open weight | February 11, 2026 | GLM-5 | 1456 ± 6 |
| Closed weight | Start of window (April 2, 2025) | Gemini 2.5 Pro Experimental | 1448 ± 3 |
| Closed weight | August 5, 2025 | Claude Opus 4.1 Thinking | 1449 ± 3 |
| Closed weight | November 12, 2025 | GPT-5.1 High | 1455 ± 4 |
| Closed weight | November 24, 2025 | Claude Opus 4.5 Thinking | 1474 ± 4 |
| Closed weight | February 5, 2026 | Claude Opus 4.6 Thinking | 1504 ± 6 |
This chart clarifies three things:
- Open-weight progress is no longer slower. On this third-party chat index, the two slopes are now very similar.
- The gap remains. Open-weight models gained nearly as many points, but they haven't meaningfully closed the lead that closed models held throughout the year.
- Much of the open-weight catch-up comes from a few major releases rather than steady monthly gains. The same is true on the closed side, particularly around Claude Opus 4.5 and Claude Opus 4.6.
Methodology note: the chart uses Arena AI's overall text ranking as of March 31, 2026, as the single common scale, then plots a best-so-far envelope by each model's release date. This is helpful for comparing rates of progress, but it remains a chat-preference indicator, not a lab benchmark, and score movements within the stated uncertainty ranges should be read with caution. GLM-5 is plotted at its first public Hugging Face release date (February 11, 2026) because Z.ai's announcement post was not available in machine-readable form when this article was updated.
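To make the envelope construction concrete, here is a minimal sketch of that best-so-far logic; the data is copied from the open-weight rows of the milestones table, and the function is a generic illustration, not Arena AI's actual code:

```python
from datetime import date

# (release date, model, Arena AI score) — copied from the milestones table above.
open_milestones = [
    (date(2025, 4, 2), "DeepSeek-R1", 1398),
    (date(2025, 8, 21), "DeepSeek-V3.1", 1418),
    (date(2025, 12, 22), "GLM-4.7", 1443),
    (date(2026, 1, 27), "Kimi K2.5 Thinking", 1453),
    (date(2026, 2, 11), "GLM-5", 1456),
]

def frontier_envelope(milestones):
    """Running maximum of score by release date: the 'best model so far' line."""
    best = float("-inf")
    envelope = []
    for day, model, score in sorted(milestones):
        if score > best:
            best = score
            envelope.append((day, model, best))
    return envelope

for day, model, score in frontier_envelope(open_milestones):
    print(day, model, score)
```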
Interactive benchmark explorer
The static chart above is the simplest view because it keeps open and closed models on the same Arena AI chat-preference scale. But if you want to check how durable the result is on task benchmarks, the explorer below recomputes the open and closed frontiers over the same one-year period using rows taken from Vals AI, whose benchmark pages also keep rows for older models visible.
Note: this chart uses current Vals AI benchmark pages rather than model-card snapshots from each release date. That makes it useful for comparing the shape of progress across benchmarks, but it should not be read as a historical record of release-day scores. If a model is missing a row on a given benchmark, that milestone simply won't appear for that metric, and a milestone may be missing models that have since been removed, even if they were actually the best on a benchmark at the time.
How the models are matched
There are two important points to note before reviewing the tables:
- Gemma's small models use effective parameters. Gemma 4 E2B has 2.3B effective parameters (5.1B loaded with embeddings), and Gemma 4 E4B has 4.5B effective parameters (8B loaded). These are treated as the counterparts of Qwen's 2B and 4B at the deployment level, not as exact raw-weight matches.
- Qwen publishes different benchmark modes by size. Qwen3.5-4B, Qwen3.5-27B, and Qwen3.5-35B-A3B run in thinking mode by default on their model cards. Qwen3.5-2B publishes separate scores for thinking and non-thinking mode; this post uses the thinking-mode score whenever a card shows both (see the sketch below).
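As an illustration of that selection rule, here is a minimal sketch; the row format and the numbers are made-up stand-ins, not the actual model-card schema:

```python
def pick_score(row: dict) -> float:
    """Prefer the thinking-mode score when a card publishes both modes.

    `row` is a hypothetical per-benchmark record, e.g.
    {"thinking": 71.3, "non_thinking": 64.8} or {"default": 58.2}.
    """
    if "thinking" in row:
        return row["thinking"]
    # Otherwise fall back to whichever single score the card publishes.
    return row.get("non_thinking", row.get("default"))

# Example: a card with both modes vs. a card with one published score.
print(pick_score({"thinking": 71.3, "non_thinking": 64.8}))  # 71.3
print(pick_score({"default": 58.2}))                         # 58.2
```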
This post also excludes benchmarks that are too methodologically sensitive for a clean, direct comparison, such as:
- AIME 2026, which Gemma reports without tools while the corresponding Qwen sizes do not report that setting.
- SWE-bench Verified, which Qwen reports for its larger models but Gemma 4 does not report at all.
- Codeforces, because Qwen notes that it measures its results on its own query set, which weakens a direct comparison to the Codeforces Elo scores Google publishes.
This means the tables below should be read as reliable but narrow: they are good for comparing the published overlap, but they are not the whole story about overall assistant quality.
The benchmark rows used below are:
- MMLU-Pro: General Knowledge and Reasoning
- GPQA Diamond: Expert Scientific Reasoning
- LiveCodeBench v6: Programming
- Tau2 / TAU2-Bench: Tool/Agent Usage Behavior
- MMMLU: Multilingual Reasoning
- MMMU-Pro: Multimodal Reasoning
Size matching
| Deployment tier | Gemma 4 | Qwen 3.5 | Why this pairing works |
|---|---|---|---|
| Edge / mobile | E2B | 2B | The smallest local models |
| 4B class | E4B | 4B | Small laptop / edge-plus tier |
| Large dense | 31B dense | 27B dense | The largest dense model in each family |
| Efficient MoE | 26B A4B | 35B-A3B | The nearest mid-size MoE class, with roughly 4B versus roughly 3B active parameters |
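If you want to script a shortlist from this pairing, here is a minimal sketch that encodes the table as a lookup; the tier keys are illustrative labels for this sketch, not official names:

```python
# Size-class pairing from the table above: tier -> (Gemma 4, Qwen 3.5).
# Tier keys are illustrative labels, not official identifiers.
SIZE_CLASS_MATCH = {
    "edge_mobile":   ("Gemma 4 E2B", "Qwen3.5-2B"),
    "class_4b":      ("Gemma 4 E4B", "Qwen3.5-4B"),
    "large_dense":   ("Gemma 4 31B", "Qwen3.5-27B"),
    "efficient_moe": ("Gemma 4 26B A4B", "Qwen3.5-35B-A3B"),
}

def shortlist(tier: str) -> tuple[str, str]:
    """Return the (Gemma 4, Qwen 3.5) pair to evaluate for a deployment tier."""
    return SIZE_CLASS_MATCH[tier]

print(shortlist("efficient_moe"))  # ('Gemma 4 26B A4B', 'Qwen3.5-35B-A3B')
```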
Should I choose Qwen 3.5 or Gemma 4?
For 2B edge deployments: Qwen3.5-2B if you care most about text quality, tool use, and long context; Gemma 4 E2B if audio and Google edge integration matter more.
For 4B: Qwen3.5-4B. This is the easiest call in this article.
For large dense models: Qwen3.5-27B for text-first assistants and agents; Gemma 4 31B for more balanced multilingual and multimodal workloads.
For mid-size MoE: Qwen3.5-35B-A3B, unless your core KPI is heavily multilingual work and you specifically want to test Gemma 4 26B A4B.
The important point is that the family-level signals are now clear enough to guide a shortlist. But they aren't one-sided: if you care about static task benchmarks, the published overlap often points to Qwen 3.5; if you care about assistant-style chat quality, the strongest third-party signal currently points to Gemma 4.
However, final quality still depends on your prompts, context lengths, tool calls, and latency tolerances. That's why real-world deployment teams should treat public benchmarks as a filter, not the final answer.