Four Days In, Here Is What We Are Actually Seeing
Anthropic released Claude Opus 4 and Sonnet 4 on May 22, 2025. We had access within hours. Since then, we have been running both models through real client work -- not benchmarks, not toy problems, not "write me a poem about coding." Actual business tasks that pay our bills and determine whether our clients trust our recommendations.
This is not a review. It is a field report. Four days is not enough to make sweeping claims, but it is enough to notice patterns. Some of those patterns are genuinely impressive. Others are not. Both are worth sharing.
We are going to be specific because vague praise is useless. If you are trying to decide whether to upgrade your workflows, you need concrete examples, not adjectives.
Coding Got Meaningfully Better
This is the area where the jump is most obvious, and it is the one that matters most for our day-to-day work. We build client applications with Claude Code, and Opus 4 is noticeably sharper at tasks that used to require multiple rounds of correction.
A specific example. We had a client project that needed a multi-step data pipeline: pull records from a Supabase database, cross-reference with a third-party API, apply business logic that involved seven conditional branches, and write the results to a formatted report. With Sonnet 3.5, this took about four iterations. The model would get the broad strokes right but miss edge cases in the conditional logic -- things like handling null values from the API response or properly sequencing async operations when one depended on the result of another.
Opus 4 got this right on the second attempt. The first attempt had a minor issue with how it handled rate limiting on the third-party API, but the conditional logic, the async sequencing, and the null handling were all correct out of the gate. That is not a miracle. It is a meaningful reduction in iteration cycles that compounds across dozens of tasks per week.
The difference is not that Opus 4 gets everything right the first time. It is that the things it gets wrong are smaller and less fundamental. You spend time polishing instead of restructuring.
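To make the pipeline shape concrete, here is a minimal sketch of the two failure points described above: null handling and dependent async sequencing. The row fields, the enrich() call, and the scoring are hypothetical stand-ins, not the client's actual schema or API.

```typescript
// Hypothetical row shapes -- the real pipeline had more fields and
// seven conditional branches of business logic.
type Row = { id: string; email: string | null };
type Enriched = { id: string; email: string; score: number };

// Null handling: drop rows the third-party API cannot enrich,
// rather than letting nulls propagate into the report.
function filterUsable(rows: Row[]): { id: string; email: string }[] {
  return rows.flatMap((r) =>
    r.email === null ? [] : [{ id: r.id, email: r.email }],
  );
}

// Async sequencing: each step awaits the previous one, because the
// enrich call depends on the fetched rows and must respect the
// third-party API's rate limit -- so we do not fire calls concurrently.
async function runPipeline(
  fetchRows: () => Promise<Row[]>,
  enrich: (email: string) => Promise<number>,
): Promise<Enriched[]> {
  const usable = filterUsable(await fetchRows());
  const out: Enriched[] = [];
  for (const r of usable) {
    out.push({ ...r, score: await enrich(r.email) });
  }
  return out;
}
```

The sequential loop is the part earlier models tended to get wrong, either by running dependent calls in parallel or by letting a null slip through to the report stage.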
Another example. We were refactoring an authentication flow across 12 files for a client application. This involved migrating from a custom JWT implementation to Supabase Auth, which meant updating middleware, route handlers, client-side hooks, and database triggers. With previous models, this kind of cross-file refactor was where things fell apart -- the model would update the middleware correctly but forget to propagate the change to a utility function three files away.
Opus 4 tracked the dependencies across all 12 files without losing the thread. It even caught a stale import in a test file that we had not asked it to update. That kind of ambient awareness of the codebase is new. Not perfect -- it still missed a type definition in one file -- but the ratio of "things it caught on its own" to "things we had to point out" shifted dramatically.
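For context on what the middleware layer of that refactor looks like, here is a hedged sketch of the post-migration shape. The verifier is injected so the example stays self-contained; in the real code it would wrap Supabase Auth's token verification. The requireUser name and the Express-style request shape are illustrative, not the client's actual code.

```typescript
// Minimal user shape and an injected verifier -- in production this
// would delegate to Supabase Auth rather than a local function.
type User = { id: string };
type Verify = (token: string) => Promise<User | null>;

// Pull the token out of an "Authorization: Bearer <token>" header.
function getBearerToken(header: string | undefined): string | null {
  if (!header || !header.startsWith("Bearer ")) return null;
  return header.slice("Bearer ".length);
}

// Express-style middleware factory: reject the request with 401 if the
// token is missing or fails verification, otherwise attach the user.
function requireUser(verify: Verify) {
  return async (req: any, res: any, next: (err?: unknown) => void) => {
    const token = getBearerToken(req.headers?.authorization);
    const user = token ? await verify(token) : null;
    if (!user) return res.status(401).json({ error: "unauthorized" });
    req.user = user;
    next();
  };
}
```

The refactor's difficulty was never any one of these functions; it was keeping this shape consistent across route handlers, client-side hooks, and database triggers at once.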
Reasoning on Business Documents Is Sharper
We do a lot of document analysis for clients -- contracts, proposals, financial summaries, competitive research. This is where Opus has always been strong relative to Sonnet, and Opus 4 extends that lead.
The improvement is most visible in multi-document analysis where the model needs to hold context from Document A while analyzing Document B and then synthesize findings across both. We ran a contract comparison for a real estate client -- two competing vendor agreements, each around 30 pages, with a brief to identify material differences in liability caps, termination clauses, and indemnification language.
Opus 4 produced a comparison that our client's attorney described as "thorough enough to use as a starting point for their own review." That is high praise from a lawyer. The model correctly identified a subtle difference in how the two contracts defined "material breach" that had genuine financial implications. Previous versions of Opus would sometimes catch this, sometimes not. Opus 4 caught it consistently across three separate runs.
We also tested it on a competitive analysis project -- pulling insights from 15 different company reports and synthesizing a market landscape. The model maintained consistent analytical criteria across all 15 sources, which was a problem area for Claude 3 Opus when documents exceeded a certain cumulative length. Opus 4 did not drift.

Where It Still Falls Short
Credibility means being honest about limitations, so here they are.
Speed
Opus 4 is slow. Not unusably slow, but noticeably slower than Sonnet 4, and for interactive workflows where you are iterating quickly -- debugging, for instance, or back-and-forth document editing -- the latency adds up. We timed it informally: Opus 4 responses for complex coding tasks averaged around 45-60 seconds. Sonnet 4 was typically under 20 seconds for comparable tasks.
This is not a dealbreaker, but it changes how you work. We find ourselves batching requests with Opus 4 rather than having the rapid back-and-forth that Sonnet enables. You front-load your instructions, give more context upfront, and try to get it right in fewer exchanges. That is a different workflow, and it is not always the better one.
Cost
Opus 4 is significantly more expensive than Sonnet 4 on the API. For clients who are paying per token, this matters. A task that costs $0.30 with Sonnet 4 might cost $2.50 with Opus 4. The quality improvement does not always justify an 8x cost increase, especially for tasks where Sonnet 4 is already good enough.
We have already started building routing logic into our client deployments -- use Opus 4 for complex reasoning tasks and document analysis, route everything else to Sonnet 4 or Haiku. The model that is "best" is not always the model you should use. Cost-per-quality is the metric that matters in production.
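A minimal sketch of that routing idea follows. The task-kind taxonomy and the model labels are our own illustrative names, not an Anthropic API; your deployment's categories will differ.

```typescript
// Hypothetical task taxonomy -- real deployments would classify tasks
// upstream (by prompt heuristics or a cheap classifier model).
type TaskKind =
  | "multi-doc-analysis"
  | "contract-review"
  | "multi-file-refactor"
  | "coding"
  | "content"
  | "classification"
  | "extraction";

type Model = "opus-4" | "sonnet-4" | "haiku";

function routeModel(kind: TaskKind): Model {
  switch (kind) {
    case "multi-doc-analysis":
    case "contract-review":
    case "multi-file-refactor":
      return "opus-4"; // complex reasoning: pay for quality
    case "coding":
    case "content":
      return "sonnet-4"; // good enough, much faster and cheaper
    case "classification":
    case "extraction":
      return "haiku"; // high volume, low complexity, lowest cost
  }
}
```

The point of making the routing explicit is that cost-per-quality becomes a deliberate decision per task class, rather than an accident of whichever model a developer happened to default to.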
Overconfidence on Ambiguous Tasks
This surprised us. Opus 4 is occasionally more confident than it should be on tasks where the right answer is "I need more information." We noticed this during a financial analysis task -- the model produced a definitive recommendation based on incomplete data where the correct response was to flag the gaps. Claude 3 Opus was more likely to hedge in these situations. We are still gathering data on this, but it is something we are watching closely.
When to Use Opus 4 vs Sonnet 4 vs Haiku
After four days of testing, here is our working framework. This will evolve, but it is where we are starting.
Use Opus 4 for: Complex multi-document analysis. Contract review and legal document work. Multi-file code refactors that require tracking dependencies across a large codebase. Strategic analysis where nuance and depth matter. Anything where getting it wrong has a real cost.
Use Sonnet 4 for: Day-to-day coding tasks. Single-document analysis. Content generation and editing. Customer-facing agents where response speed matters. Most interactive workflows where you are iterating back and forth. This is our default model for Claude Code now -- the speed improvement over Opus 4 means faster iteration cycles, and the quality is high enough for 80% of coding tasks.
Use Haiku for: High-volume, low-complexity tasks. Classification and routing. Quick data extraction from structured documents. Anything where you need thousands of API calls and cost is a primary concern. Haiku is absurdly fast and absurdly cheap, and it handles straightforward tasks without breaking a sweat.
What This Means for Client Work
The practical impact is that we can take on more complex projects with higher confidence. The contract review work that used to require significant human oversight can now run with lighter supervision. The coding projects that used to take three sprint cycles can often compress to two. Not because the AI replaces human judgment, but because it produces higher-quality first drafts that require less correction.
We are also seeing new use cases open up. One client asked us to build an agent that analyzes sales call transcripts, extracts commitments made by both parties, and flags discrepancies with the subsequent proposal. With Claude 3 Opus, the extraction was good but the discrepancy detection was unreliable. Opus 4 handles this end-to-end with enough accuracy that the client's sales ops team trusts the output.
That last point matters. Trust is the bottleneck for AI adoption, not capability. When the output is reliable enough that people stop double-checking everything, the time savings become real. Opus 4 moves the trust needle forward.
The Bigger Picture
Model releases used to feel like marketing events. This one feels like an actual capability jump. Not a revolution -- we are not going to claim that Opus 4 "changes everything" because that kind of language is lazy and usually wrong. But it is a solid, measurable improvement in the areas where we need improvement most: complex reasoning, cross-file code generation, and multi-document analysis.
If you are already building on Claude, Opus 4 is a straightforward upgrade for your highest-stakes tasks. If you are not building on Claude yet, Opus 4 and Sonnet 4 together represent the strongest model lineup available for business applications right now.
We will keep testing. We will keep sharing what we find. Four days is a start, not a conclusion. But the start is promising, and the specific improvements in coding and document analysis are already saving us real time on real client work.
That is the test that matters. Not benchmarks. Not demos. Does it make the work better? After four days, the answer is a qualified yes -- with the caveat that you need to be thoughtful about when to use Opus 4 versus Sonnet 4, and that the cost and speed trade-offs are real.