Claude Opus 4.8 and SWE-bench Pro: We Ran Anthropic's Headline Through Our Gateway

Published: June 26, 2026

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support

Get Started with Truefoundry Now Talk to the Expert

When Anthropic launched Claude Opus 4.8 on May 28, 2026, the headline number was coding: 69.2% on SWE-bench Pro, up from 64.3% for Opus 4.7 — a 4.9-point gain on one of the industry's hardest software-engineering benchmarks.

We wanted to know whether that upgrade shows up when both models are called the way most teams actually call them: through a production gateway, one API, real latency, real token bills. We ran 50 hard coding problems from the public SWE-bench Pro test set through TrueFoundry AI Gateway.

Opus 4.8 came back with a usable-looking code patch on every single problem; Opus 4.7 missed three. The direction matched Anthropic's claim. Our absolute scores did not — and that gap is the point.

The Number Everyone is Quoting

Every major model launch ships with a table of benchmark scores. For Opus 4.8, the row that got the most attention was SWE-bench Pro.

SWE-bench Pro is a stress test for AI coding ability. The problems come from real open-source projects — maintained codebases, multi-file bugs, the kind of work a senior engineer might spend an afternoon untangling. Anthropic's published result was measured with a full agent setup: the model could browse the repository, run commands, try a fix, and iterate. That is the right way to measure what the model can do at its best.

It is not, however, what happens when your application sends a single request through an API and waits for an answer.

That distinction matters. Platform teams do not live inside Anthropic's evaluation harness. They live behind a gateway: one endpoint, routing rules, rate limits, and a bill at the end of the month. When a vendor says '+4.9 points,' the practical question is narrower and more immediate: on the infrastructure we already use, does the new model still beat the old one on hard coding work?

We Ran That Check

We routed Claude Opus 4.8 and Claude Opus 4.7 through TrueFoundry AI Gateway — the same OpenAI-compatible API surface our customers use to reach frontier models. Behind the gateway, the two models landed on different provider routes. From the application's point of view, the integration was identical: same URL, same credentials, different model name.

For the core test, we pulled 50 problems from the public SWE-bench Pro test set (731 issues in total). Each problem describes a real bug in a real repository. We sent the description to the model in a single turn and asked for a unified diff — the standard format for a code patch. No browsing. No terminal access. No second chance to revise.

We then graded each response with a simple rule: does this look like a legitimate patch? We did not spin up Docker containers or run the project's test suite. We did not ask a second model to act as judge.

This is a lighter test than Anthropic's, but it answers a different question: when Opus 4.8 arrives on your gateway, does it produce credible coding output on hard problems more often than Opus 4.7?

What We Found on SWE-bench Pro

On our 50-problem sample, Opus 4.8 returned a patch-shaped answer every time — 50 out of 50. Opus 4.7 missed three, landing at 47 out of 50. In percentage terms, that is a six-point gap on our slice versus Anthropic's 4.9-point gap on the full official benchmark.

The ordering is what we cared about. Anthropic said the new model is better at this work. On our gateway run, the new model was better at this work too.

The absolute numbers are a different story, and they should not be compared side by side. Anthropic's 69.2% means the model solved roughly seven out of ten problems with an agent that could explore and test. Our 100% and 94% mean the model returned something that looked like a patch in one shot. The bar we used is much lower. Treating our scores as 'beating' Anthropic's would be misleading. Treating the direction— 4.8 ahead of 4.7 — as a sanity check is fair.

Model	Our result (N=50)	Anthropic reported	Est. cost	Latency p50	Latency p95
Opus 4.8	100% (50/50)	69.2%	~$1.45	~11.9 s	~26.7 s
Opus 4.7	94% (47/50)	64.3%	~$1.66	~13.0 s	~36.6 s

There was a practical side to the run as well. At Anthropic's published list pricing at launch ($5 per million input tokens, $25 per million output tokens), the SWE-bench slice cost roughly $1.45 for Opus 4.8 and $1.66 for Opus 4.7. Responses on long issue descriptions took a median of about 12 seconds for 4.8 and 13 seconds for 4.7; the slowest fifth of requests stretched toward 27 and 37 seconds respectively. For a one-off validation sprint, that is manageable. For an agent that loops dozens of times per task, it adds up quickly.

Reading the rest of the launch table

Anthropic's announcement did not stop at SWE-bench Pro. The same post cited gains on terminal tasks, desktop automation, graduate-level science Q&A, finance workflows, and more. We did not replay those official harnesses either — we mapped each category to a smalls et of representative prompts and ran both models through Gateway.

Category	What we sent	What we found
SWE-bench Pro	50 real bugs, one-shot patch	4.8: 50/50 · 4.7: 47/50
SWE-bench Verified	Python bug-fix prompts	4.8: 2/2 · 4.7: 1/2
Terminal-Bench 2.1	Bash one-liner tasks	Both: 2/2
OSWorld-Verified	UI action multiple-choice	Both: 2/2
HLE (tools)	Hard reasoning prompts	Both: 2/2
GDPval-AA	Knowledge-work math	Both: 2/2 (official metric is Elo)
Finance Agent v2	Multi-step finance prompts	Both: 0/2 (grading limitation likely)
GPQA Diamond	Graduate science MCQ	Both: 2/2

The pattern repeated what we saw on coding. On short proxy tasks — fixing a Python bug, answering a bash question, picking the right UI action — both models often scored at the ceiling. That tells you the models are reachable and responsive on category-shaped work.

Two results were more informative than the rest.

On SWE-bench Verified proxies (simpler coding fixes than Pro), Opus 4.8 answered both prompts correctly while Opus 4.7 answered one of two — a wider gap than Anthropic's official 88.6% vs 87.6%, though on a tiny sample.

On Finance Agent proxies, both models scored zero. That is almost certainly a grading limitation — strict answer matching on a miniature stand-in — not proof that Opus cannot do finance work. It is a reminder that lightweight proxies fail quietly. The only suite we trust for model ordering is SWE-bench Pro at 50 issues.

For the remaining categories —reasoning exams, knowledge-work math, science multiple choice — both models passed our small proxy sets. Anthropic's official table still shows nuance we did not capture (GPQA Diamond slightly favors 4.7, for example). Our proxies were too shallow to surface that.

Why we ran it this way

We are not trying to replace Anthropic's evaluation. Their numbers describe what Opus 4.8 can do when you give it tools, time, and their harness. Our run describes what happens on day zero of a gateway migration — when an engineering team wants to know whether routing traffic to 4.8 is worth the churn before retraining agents and retuning prompts.

Three things stood out beyond the scoreboard.

• The upgrade story survived contact with production routing. Different provider paths, one client integration, both models callable without re-architecting the application. That is the integration story Gateway is built for.

• Honest grading beats impressive grading. A heuristic 'looks like a patch' check is easy to criticize — and we criticize it ourselves. It is also cheap, reproducible, and hard to game with a friendly judge model. For a directional read after a launch, that tradeoff made sense.

• Vendor benchmarks and gateway reality measure different layers. Anthropic's 69.2% is the ceiling. Our 50-problem sample is a floor check: did the new flagship actually move on hard coding output when called the way your services call it?

What we take from it

Anthropic's SWE-bench Pro delta remains the strongest public evidence that Opus 4.8 is a meaningful coding upgrade. We did not re-prove that number. We confirmed the ordering —4.8 ahead of 4.7 — on a gateway path that mirrors how production services call frontier models, with costs and latencies attached.

Our percentages are not comparable to Anthropic's, and we would not present them that way. The useful read from our run is simpler: the headline gain points in the same direction on our stack.

We also learned something about how to read launch season. Anthropic's 69.2% describes a model with tools, time, and a full evaluation harness — the ceiling. Our fifty-problem sample describes a single API call and a plain patch-shape check — the floor. Both are legitimate; they answer different questions. Ours was: when Opus 4.8 lands onGateway, does it visibly move ahead of 4.7 on hard coding output? On this sample, yes.

TrueFoundry AI Gateway is the layer we ran that question through: one OpenAI-compatible API, 1000+ models, provider routing handled behind the scenes. Same client, different model name, measurable difference on infrastructure we already operate.

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.

Built for Speed: ~10ms Latency, Even Under Load

Schedule your Demo Now