Claude Sonnet 5: how close to Opus, and at what cost?

Alex Berriman

1 July 2026

Every model launch arrives with a benchmark deck. Anthropic's pitch for Claude Sonnet 5 is unusually specific, and it is a good one: near-Opus quality, at Sonnet prices. For a shop like ours that is not a curiosity. We run these models in production every day, so the question is simple and expensive. Is Sonnet 5 good enough to be the default, and what does it actually cost to use?

We do not take launch numbers on faith. Vendor benchmarks run under vendor conditions, and independent harnesses routinely land 17 to 21 points lower on the same task. So we did what we always do when a model matters to the business. We ran it through our own lab.

What Anthropic actually shipped

The facts first. Sonnet 5 (claude-sonnet-5) launched on 30 June 2026. It carries a 1M token context window and a 128K output ceiling, with knowledge to January 2026. It is the first Sonnet to get the higher xhigh effort setting, and adaptive thinking is on by default. It is now the default model inside the Claude apps, on the API, in Claude Code, on AWS and Azure, and generally available in GitHub Copilot.

The headline that everyone repeats is the price: $3 per million input tokens and $15 per million output. That is the same sticker as Sonnet 4.6, and roughly 40% under Opus 4.8 at $5 and $25. There is introductory pricing of $2 and $10 running until 31 August 2026. Hold that detail. It matters more than it looks.
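For readers who want to check the price gap, here is a minimal Python sketch using only the list prices quoted above. The dictionary layout and the `discount_vs` helper are our own illustration, not anything from Anthropic's pricing API.

```python
# Prices in US dollars per million tokens, as quoted in this article.
PRICES = {
    "Sonnet 5 (standard)": {"input": 3.00, "output": 15.00},
    "Sonnet 5 (intro)": {"input": 2.00, "output": 10.00},
    "Opus 4.8": {"input": 5.00, "output": 25.00},
}


def discount_vs(base: str, other: str, kind: str) -> float:
    """Fraction by which `other` undercuts `base` for one token kind."""
    return 1 - PRICES[other][kind] / PRICES[base][kind]


for kind in ("input", "output"):
    std = discount_vs("Opus 4.8", "Sonnet 5 (standard)", kind)
    intro = discount_vs("Opus 4.8", "Sonnet 5 (intro)", kind)
    print(f"{kind}: standard {std:.0%} under Opus, intro {intro:.0%} under Opus")
```

The standard rate comes out 40% under Opus 4.8 on both input and output, and the introductory rate 60% under. Both are per-token figures and say nothing yet about how many tokens a task consumes.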

On the benchmarks Anthropic actually led with, the "narrows the gap" framing holds up where it has been verified:

On SWE-bench Pro, the coding metric Anthropic headlined, Sonnet 5 scores 63.2% against Opus 4.8's 69.2% and Sonnet 4.6's 58.1%. That is about 91% of Opus, and a clear jump over the previous Sonnet.
On Humanity's Last Exam with tools, Sonnet 5 lands 57.4% against Opus 4.8's 57.9%. Half a point. That is the single cleanest piece of evidence for the near-Opus claim.
On Terminal-Bench it climbs to around 80% from Sonnet 4.6's 67%, and on OSWorld computer-use tasks it sits at 81.2%.

One caveat is worth stating plainly, because it shaped everything we did next. The louder numbers going around are not real. The "82% on SWE-bench Verified, first model past 80" line has no Anthropic source and is contradicted by models that were already above 80% months ago. The "96% GPQA" figures trace back to an April Fools post. If you are going to pick a model on benchmarks, be careful which benchmarks you pick.

So we ran it through our own lab

We keep an internal model-evaluation suite for exactly this moment. It is 87 tasks across the work we actually sell: production coding, UI components, SVG and diagrams, full web pages, and Australian accounting and legal reasoning. Every task is graded three ways: deterministic checks where there is a right answer, a model judge for quality, and human review for taste. Every model runs the identical suite, single-shot, at temperature zero. No vendor conditions. The same harness for Claude, GPT, Gemini, DeepSeek and the rest.

Here is where Sonnet 5 landed, on composite score across all 87 tasks:

Claude Sonnet 5: 0.773
Claude Opus 4.8: 0.778
Claude Sonnet 4.6: 0.734
Claude Haiku 4.5: 0.604

Read the top pair again. Five thousandths of a point separate Sonnet 5 from Opus 4.8, and their 95% confidence intervals sit almost exactly on top of each other. On our suite, single-shot, this is a tie. Against its own predecessor it is a real and clear step up.
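To make "intervals sit on top of each other" concrete, here is an illustrative Python sketch of one common way to get such intervals, a percentile bootstrap over per-task scores. It is not our harness code, the function names are hypothetical, and the per-task scores are synthetic placeholders rather than lab data.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the tasks with replacement and record the mean each time.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic placeholder scores, NOT the lab's data: 87 made-up values per model.
rng = random.Random(42)
model_a = [rng.uniform(0.4, 1.0) for _ in range(87)]
model_b = [rng.uniform(0.4, 1.0) for _ in range(87)]

ci_a = bootstrap_ci(model_a)
ci_b = bootstrap_ci(model_b, seed=1)
print("tie" if intervals_overlap(ci_a, ci_b) else "ranked")
```

The design point is the last line: when the two intervals overlap, the sketch reports a tie instead of ranking the models by their point estimates.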

It is not uniformly behind Opus, either. Sonnet 5 cleared every programming task we set (1.00, against Opus at 0.83) and did better on full web builds (0.54 to 0.42). Opus held its lead on Australian accounting (0.87 to 0.83), on UI components, and on the open-ended creative tasks our judge scores hardest, where Sonnet 5 was the weaker of the two. The picture is two strong models trading domains, not one beating the other. The chart above is that run. Every bar is a model that sat the same exam.

If you want the receipts, the whole thing is public. Every model, every task, every score, with confidence intervals and cost, lives at banja.au/lab/benchmarks.

The cost twist

Now the part the launch deck does not show you.

In that single-shot run, Sonnet 5 produced its 0.773 using 275,000 tokens, at about $3.48. Opus 4.8 produced its near-identical 0.778 using 199,000 tokens, at about $3.87. So Sonnet 5 was cheaper here, by about 10%. But look at the token counts. It spent 38% more tokens than Opus to land in the same place. It is cheaper per token and hungrier per task, and those two facts pull against each other.
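The tension between token count and price is easy to reproduce from the four numbers above. A short Python sketch follows. The variable names are ours, and the "blended" price is simply total cost divided by total tokens for the run, not a published rate.

```python
# Figures from the single-shot run reported in this article.
runs = {
    "Sonnet 5": {"score": 0.773, "tokens": 275_000, "cost_usd": 3.48},
    "Opus 4.8": {"score": 0.778, "tokens": 199_000, "cost_usd": 3.87},
}

s, o = runs["Sonnet 5"], runs["Opus 4.8"]

extra_tokens = s["tokens"] / o["tokens"] - 1     # how much hungrier Sonnet 5 was
cost_saving = 1 - s["cost_usd"] / o["cost_usd"]  # how much cheaper the run was

# Effective price actually paid, in dollars per million tokens (input and output mixed).
blended = {name: r["cost_usd"] / r["tokens"] * 1e6 for name, r in runs.items()}

print(f"Sonnet 5 used {extra_tokens:.0%} more tokens than Opus 4.8")
print(f"Sonnet 5 cost {cost_saving:.0%} less for the run")
for name, price in blended.items():
    print(f"{name}: ${price:.2f} per million tokens, blended")
```

The saving survives only because the blended per-token price is lower by more than the token count is higher. If the price gap shrinks or the token gap grows, the sign flips.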

The hunger is structural, not a quirk of one run. Sonnet 5 ships a new tokeniser that emits roughly 30% more tokens for the same text, and adaptive thinking is on by default. The introductory price exists, by Anthropic's own account, to keep the move "roughly cost-neutral" against that inflation. That is another way of saying the real per-task cost steps up on 1 September, when the introductory rate ends and standard pricing takes effect.
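A back-of-envelope version of that step, as a Python sketch. It assumes the only change from Sonnet 4.6 is the roughly 30% tokeniser inflation applied to the same text, and it ignores any extra tokens from adaptive thinking, so treat it as a floor rather than a forecast. The constant and function names are ours.

```python
TOKEN_INFLATION = 1.30  # "roughly 30% more tokens for the same text"
OLD_PRICE = 3.00        # Sonnet 4.6 input price, $ per million tokens
INTRO_PRICE = 2.00      # Sonnet 5 introductory input price
STANDARD_PRICE = 3.00   # Sonnet 5 standard input price


def relative_cost(new_price: float) -> float:
    """Cost of the same text on Sonnet 5 relative to Sonnet 4.6 (1.0 = unchanged)."""
    return TOKEN_INFLATION * new_price / OLD_PRICE


print(f"during intro pricing: {relative_cost(INTRO_PRICE):.2f}x")
print(f"at standard pricing:  {relative_cost(STANDARD_PRICE):.2f}x")
```

Under these assumptions the same text costs about 0.87x the Sonnet 4.6 bill during the introductory window and 1.30x once standard pricing applies. The output side gives the same ratios, since $10 against $15 is the same two-thirds as $2 against $3.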

And the moment you reach for the top of the new effort dial, the value can invert. This is not just our reading. The same complaint turns up wherever practitioners are testing it. From the Hacker News thread on launch day:

"Cost per task is shockingly high. More expensive than Opus 4.8."

"Incredibly inefficient at max reasoning, and even at high and xhigh it uses far more tokens than other models."

"Opus always performs better for a given cost. If Sonnet 5 medium is not good enough, switch models, not effort levels."

CodeRabbit ran it on bug detection and found the same shape from another angle: max effort roughly doubled the cost without catching more bugs. Our own agentic runs, where each model iterates with full tool access at high effort, told the story from yet another direction. Everything bunches up near the top. Sonnet 5 at 0.83, Sonnet 4.6 at 0.84, Opus at 0.84, all inside each other's noise. Give a capable model a good scaffold and a generous effort budget and the benchmark ceiling compresses. The differences you are paying for stop showing up in the score.

What the benchmarks still will not tell you

A few honest limits, including on our own numbers.

Our suite is 87 tasks. The confidence intervals are wide and they overlap. A 0.773 against a 0.778 is a tie, not a ranking, and you should not read anything into the order of any two models whose intervals cross. It is one run of one suite, built around the work we do, which is not the work you do.

The wider field matters too, and it is not a Claude clean sweep. On the same single-shot run, GPT-5.5 came out a touch ahead at 0.791, and Gemini 3.1 Pro effectively tied Sonnet 5. On shell-native and terminal work, GPT-5.5 is reported to lead. The fair read is that Sonnet 5 is a price-for-intelligence leader in its tier and genuinely competitive on agentic coding. It is not a categorical champion, and anyone selling it as one is quoting vendor numbers.

It is also slower than Sonnet 4.6, and it can be over-eager on small jobs. Ask for a one-line change and it may hand you back a helper and a test file you never asked for.

How we are using it

For us the verdict is practical, and it is mostly good news.

Sonnet 5 is our new default for agentic and coding work at low and medium effort. It is near-Opus on the things we measured, clearly better than Sonnet 4.6, and at the bottom of the effort dial it is genuinely cheap for what it returns. That is a strong default to have.

We keep Opus 4.8 for the hardest jobs. The long, ambiguous, high-stakes ones where the extra reasoning earns its premium. And we treat the effort dial as a real cost decision, not a free quality knob. Past medium, the right move is usually to change models, not to push Sonnet 5 to max and pay for tokens that do not move the answer.
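Written down as a rule, the policy looks something like the Python sketch below. It is a simplification for illustration: `pick_model` and the `hardest_job` flag are hypothetical names, the effort ladder is just the levels mentioned in this article, and the text says "usually" where the code draws a hard line.

```python
# Effort levels mentioned in this article, lowest to highest.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")


def pick_model(effort: str, hardest_job: bool = False) -> str:
    """Return the model name for a job, following the policy described above."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    if hardest_job:
        # Long, ambiguous, high-stakes work goes to Opus 4.8 regardless of effort.
        return "Opus 4.8"
    past_medium = EFFORT_LEVELS.index(effort) > EFFORT_LEVELS.index("medium")
    # Past medium, change models rather than pushing Sonnet 5 up the dial.
    return "Opus 4.8" if past_medium else "Sonnet 5"


print(pick_model("low"))
print(pick_model("xhigh"))
print(pick_model("medium", hardest_job=True))
```

In practice the `hardest_job` call is a human judgement made per project, not something the code can infer.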

The launch line was "near-Opus quality at a lower price." On our bench the first half is true, and the second half is true at low effort and gets complicated above it. That is a more useful answer than any deck can give you, and it is exactly why we keep our own lab.

Banja - we build products, and we measure them.
