Caught red-handed? — The launch of Meta’s new Llama 4 AI model family this past weekend made quite a splash in tech circles.
Touted as heavyweights in artificial intelligence, the Scout and Maverick models from the Llama 4 family were announced as Meta’s first to use a Mixture of Experts (MoE) architecture, a design in which only a handful of specialized “expert” sub-networks are activated for each token, increasing total model capacity while keeping per-query compute and cost down.
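To give a rough intuition of what MoE means in practice, here is a minimal sketch of top-k expert routing in Python/NumPy. It is purely illustrative and not Meta’s implementation; the dimensions, the `top_k` value, and the `moe_layer` helper are assumptions made up for the example.

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Illustrative Mixture-of-Experts layer (a sketch, not Meta's code).

    x       : input vector for one token, shape (d_model,)
    gate_w  : router weights, shape (d_model, n_experts)
    experts : list of callables, each mapping (d_model,) -> (d_model,)
    top_k   : number of experts activated per token
    """
    logits = x @ gate_w                 # one routing score per expert
    top = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()            # softmax over the selected experts only
    # Only the selected experts run: per-token compute stays low even though
    # the total parameter count (all experts combined) is much larger.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 experts, only 2 of them run for this token
rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [lambda v, W=rng.standard_normal((d_model, d_model)) * 0.1: v @ W
           for _ in range(n_experts)]
gate_w = rng.standard_normal((d_model, n_experts)) * 0.1
print(moe_layer(rng.standard_normal(d_model), gate_w, experts).shape)  # (16,)
```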
But beyond the technical promises, controversy quickly followed: Meta is said to have submitted a non-public version of Llama 4 to the LMArena benchmarking platform in order to boost its ranking and score.
An experimental model, tuned to charm
LMArena is a community-driven site where language models face off in anonymous head-to-head matchups. Visitors submit a prompt, compare the two models’ responses, and vote for the better one. A scoring system then ranks the models according to these human preferences.
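For context on how such a leaderboard turns individual votes into a ranking, the sketch below shows a generic Elo-style update, the chess-inspired rating scheme that arena-style leaderboards are loosely based on. It is an illustration only, not LMArena’s actual scoring code; the K-factor of 32 and the 1,000-point starting rating are assumed values.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Generic Elo-style update after one head-to-head vote (illustrative only)."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))  # winner's expected score
    delta = k * (1.0 - expected)   # bigger gain when the win was less expected
    return r_winner + delta, r_loser - delta

# Assumed starting ratings; voters prefer model_a in three straight battles
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for _ in range(3):
    ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
print(ratings)  # model_a climbs, model_b drops by the same amount
```

Because every vote moves the numbers, a model whose answers voters simply like more (longer, friendlier, emoji-sprinkled) climbs the ranking, which is exactly what made the experimental Maverick variant contentious.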
Among the contenders, Meta’s Llama-4-Maverick-03-26-Experimental model quickly climbed to second place, just behind Google’s Gemini 2.5 Pro.
The only issue: this experimental version wasn’t publicly available and appeared to have been designed specifically to perform well on this kind of human-vote benchmark, notably by adjusting its tone and response style to win over voters.
According to LMArena, the results showed that this version produced longer, more engaging answers, sometimes peppered with emojis — in contrast to the public version, which was far more concise and formal. This stylistic tuning may have given Meta an unfair edge over competitors, who submitted open models that anyone could use.
“Meta should have made it clearer that ‘Llama-4-Maverick-03-26-Experimental’ was a customized model to optimize for human preference”, LMArena said in a post yesterday:
We've seen questions from the community about the latest release of Llama-4 on Arena. To ensure full transparency, we're releasing 2,000+ head-to-head battle results for public review. This includes user prompts, model responses, and user preferences. (link in next tweet)
Early…
— lmarena.ai (formerly lmsys.org) (@lmarena_ai) April 8, 2025
In response to the controversy, LMArena published over 2,000 head-to-head matchups including the prompts, responses, and user votes.
The platform also said it has updated its leaderboard policies to better regulate future submissions and prevent similar confusion, adding that “Meta’s interpretation of our policy did not match what we expect from model providers”.
LMArena also plans to add the public, non-experimental version of Llama 4 Maverick from Hugging Face to its leaderboard, to provide a more transparent comparison baseline.
Meta owns it, but doesn’t apologize
Meta, for its part, doesn’t deny anything. The company admits to submitting an experimental version optimized for conversation, but insists it was part of a broader “exploratory approach”.
A spokesperson confirmed: “Llama-4-Maverick-03-26-Experimental is a chat-optimized version we experimented with that performed well on LMArena.”
In the official blog post announcing the launch of Llama 4, Meta does mention this version and its LMArena score of 1417, but without making clear that it differed from the model released to the public.
Still, many observers, including AI researchers, felt the distinction was far too easy to miss, creating a gap between the public model’s actual performance and the one implied by Meta’s benchmark claims.
All’s fair in love and benchmarks
This episode is all the more sensitive because Llama 4 Maverick is being positioned as a serious challenger to the closed models from OpenAI, Anthropic, and Google.
Meta claims that Maverick outperforms GPT-4o and Gemini 2.0 Flash in many benchmark tests. But that announcement was quickly overshadowed by these accusations of biased optimization.
Ahmad Al-Dahle, head of GenAI at Meta, attributed the performance gap to variability across the platforms and services hosting the model, whose implementations are still being stabilized. He also denied accusations that the models had been trained on benchmark test sets.
This controversy comes as Meta also makes headlines for addressing political bias in its models.
The company now claims that Llama 4 is less biased, more open to a variety of viewpoints, and less likely to refuse to answer sensitive prompts. A deliberate shift, backed by new safety testing efforts (including a program called GOAT, for “Generative Offensive Agent Testing”).
The Llama 4 models are available on Hugging Face as open source — although this label has been disputed by the Open Source Initiative, which points out that EU users are restricted from certain rights granted elsewhere.