The Experiment
I took 10 real coding tasks — the kind developers actually do every day — and submitted the exact same prompt to both GPT-5.4 and Claude Opus 4.6. Same system prompt, same context, same evaluation criteria.
No synthetic benchmarks. No cherry-picked examples. Just real tasks scored on three dimensions:
- Correctness (does it work without modifications?)
- Code quality (readability, types, error handling, edge cases)
- Efficiency (token usage, response time, number of follow-up prompts needed)
Each dimension is scored 1-10. Maximum possible score per task: 30.
The models were accessed via their respective APIs at standard pricing: GPT-5.4 at $2.50/$15 per million tokens and Claude Opus 4.6 at $15/$75 per million tokens.
Here are the 10 tasks and exactly what happened.
Task 1: Build a REST API Endpoint
Prompt: "Create a POST /api/users endpoint in Express.js with TypeScript. Validate email format and password strength (min 8 chars, 1 uppercase, 1 number). Hash the password with bcrypt. Store in PostgreSQL via Prisma. Return the user without the password field. Handle duplicate emails with a 409 status."
GPT-5.4 Result
Clean, production-ready code. The Zod validation schema was precise. The bcrypt hashing used a proper salt round constant. The Prisma query used select to exclude the password field at the database level rather than deleting it from the response object — a subtle but important security practice. TypeScript types were tight.
Claude Opus 4.6 Result
Also clean and correct. Used a similar Zod validation approach but added rate limiting middleware for the endpoint and included a comment explaining why. The password exclusion used Prisma's omit feature. Added a try/catch with specific error types for Prisma unique constraint violations.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 10 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 27 |
Winner: GPT-5.4 (marginally, on speed and conciseness)
Both outputs were excellent. GPT-5.4 was faster and used fewer tokens. Opus added the rate limiting middleware unprompted — useful but not requested. For well-defined API tasks, the models are essentially interchangeable.
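For reference, the prompt's validation rules are simple enough to sketch in a few lines. This is an illustrative Python version, not either model's output (both submissions expressed the same rules as Zod schemas in TypeScript):

```python
import re

# Sketch of the prompt's rules: valid email shape, password with
# 8+ chars, at least one uppercase letter and one digit.
PASSWORD_RE = re.compile(r"^(?=.*[A-Z])(?=.*\d).{8,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose

def validate_signup(email: str, password: str) -> list[str]:
    errors = []
    if not EMAIL_RE.match(email):
        errors.append("invalid email format")
    if not PASSWORD_RE.match(password):
        errors.append("password must be 8+ chars with an uppercase letter and a number")
    return errors

assert validate_signup("dev@example.com", "Passw0rd") == []
assert validate_signup("dev@example.com", "password") != []
```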
Task 2: Build a React Component
Prompt: "Create a React component called DataTable that accepts generic typed data, supports sortable columns, pagination (client-side), a search filter, and row selection with checkboxes. Use TypeScript generics. No UI library — just HTML/CSS with CSS modules. Include proper ARIA attributes."
GPT-5.4 Result
Delivered a well-structured generic component. TypeScript generics were used correctly for the column definition and data types. Sorting logic was clean with a custom useSortable hook extracted. Pagination used useMemo for performance. ARIA attributes were correct — role="grid", aria-sort on sortable headers, aria-selected on checkboxes.
Claude Opus 4.6 Result
Similar structure but with a few differences. Opus created a useDataTable hook that encapsulated sorting, pagination, and filtering logic — cleaner separation but more abstraction. TypeScript generics were equally correct. Missing aria-sort on the header cells. The CSS module included a responsive layout that switched to card view on mobile, which was not requested but was a thoughtful addition.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 9 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 26 |
Winner: GPT-5.4
GPT-5.4's ARIA implementation was more complete, which matters for a component that will be used across an application. As noted by MindStudio's comparison, GPT-5.4 excels at boilerplate generation including React components and TypeScript interfaces.
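The generic column definition both components were built around carries over to any typed language. A minimal Python sketch of the idea, illustrative only (the actual deliverables were TypeScript components with hooks):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")  # the row type the table is generic over

@dataclass
class Column(Generic[T]):
    """One column: a header label plus a typed accessor into a row."""
    header: str
    accessor: Callable[[T], str]
    sortable: bool = True

def sort_rows(rows: list[T], col: Column[T], ascending: bool = True) -> list[T]:
    # The core of a useSortable-style hook, minus the React state.
    return sorted(rows, key=col.accessor, reverse=not ascending)

people = [{"name": "Bob"}, {"name": "Alice"}]
by_name: Column[dict] = Column(header="Name", accessor=lambda row: row["name"])
assert [r["name"] for r in sort_rows(people, by_name)] == ["Alice", "Bob"]
```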
Task 3: Write a Complex SQL Query
Prompt: "Write a PostgreSQL query that returns the top 10 customers by lifetime value (total order amount) who have placed at least 3 orders in the last 12 months, including their most recent order date, average order value, and the percentage change in their spending compared to the previous 12-month period. Use CTEs for readability."
GPT-5.4 Result
Three CTEs: one for current period aggregation, one for previous period aggregation, one for the percentage calculation. Clean, correct, well-formatted. Used COALESCE for handling customers with no previous period data. Added an index hint comment.
Claude Opus 4.6 Result
Four CTEs with a slightly different structure: separated the "last order date" calculation into its own CTE to avoid a correlated subquery. Added a NULLIF to prevent division by zero in the percentage calculation — a real edge case GPT-5.4 missed. Included a window function alternative in a comment block.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 9 | 10 |
| Code quality | 8 | 9 |
| Efficiency | 9 | 8 |
| Total | 26 | 27 |
Winner: Claude Opus 4.6
The division-by-zero edge case was the differentiator. In production PostgreSQL, a single customer with no prior-period spending would make the entire query fail at runtime. Opus consistently surfaces edge cases that matter in real-world data pipelines.
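The NULLIF guard is easy to verify. This sketch uses SQLite via Python's stdlib rather than PostgreSQL, but NULLIF has the same semantics in both: with a zero previous-period spend, the divisor becomes NULL and the row yields NULL instead of breaking the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Percent-change expression from the task, guarded with NULLIF so a
# customer with zero previous-period spend yields NULL, not an error.
pct_change = "SELECT 100.0 * (? - ?) / NULLIF(?, 0)"

(grew,) = conn.execute(pct_change, (75, 50, 50)).fetchone()
(fresh,) = conn.execute(pct_change, (75, 0, 0)).fetchone()

assert grew == 50.0   # spending up 50% vs the previous period
assert fresh is None  # no previous period: NULL, query still succeeds
```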
Task 4: Debug a Race Condition
Prompt: I provided 3 files (~200 lines total) from a Node.js application with an intermittent test failure. The bug was a race condition in a caching layer where concurrent cache misses could trigger duplicate database queries and inconsistent state. "Find the bug, explain why it only manifests intermittently, and provide a fix."
GPT-5.4 Result
Identified the correct cache miss code path. Suggested adding a mutex lock using async-mutex. The fix was correct but treated the symptom rather than the root cause — it serialized all cache accesses, which would hurt performance under load.
Claude Opus 4.6 Result
Identified the same code path but also traced the state inconsistency to a second issue: the cache update was not atomic — there was a window between the read check and the write where another request could interleave. Opus suggested a "single-flight" pattern (coalescing concurrent identical requests) rather than a global mutex. The fix was more surgical and preserved concurrency for non-conflicting cache keys.
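The single-flight pattern is worth sketching. This is a hedged asyncio illustration of the idea, not Opus's actual fix: the first miss for a key starts the load, later misses for the same key await that in-flight task, and loads for different keys still run concurrently.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent loads for the same key into one call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str, loader):
        if key in self._inflight:
            # shield() so one waiter being cancelled doesn't cancel
            # the shared load for everyone else.
            return await asyncio.shield(self._inflight[key])
        task = asyncio.ensure_future(loader(key))
        self._inflight[key] = task
        try:
            return await task
        finally:
            # Clear the slot so later misses trigger a fresh load.
            self._inflight.pop(key, None)

# Demo: five concurrent misses for one key trigger exactly one load.
load_count = 0

async def slow_db_load(key: str) -> str:
    global load_count
    load_count += 1
    await asyncio.sleep(0.01)  # simulated query latency
    return f"row-for-{key}"

async def main() -> None:
    sf = SingleFlight()
    results = await asyncio.gather(*(sf.do("user:1", slow_db_load) for _ in range(5)))
    assert results == ["row-for-user:1"] * 5
    assert load_count == 1

asyncio.run(main())
```

Unlike a global mutex, nothing here serializes loads for distinct keys, which is why the fix preserves throughput under load.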
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 7 | 10 |
| Code quality | 7 | 9 |
| Efficiency | 8 | 8 |
| Total | 22 | 27 |
Winner: Claude Opus 4.6
A clear gap. Opus understood the concurrency model deeply enough to suggest a targeted fix. This aligns with Claude Opus 4.6's 80.8% score on SWE-bench Verified, which tests exactly this kind of real-world bug resolution.
Task 5: Code Review
Prompt: I provided a 350-line pull request adding a new payment processing module. "Review this PR for bugs, security issues, performance problems, and code quality. Prioritize findings by severity."
GPT-5.4 Result
Found 5 issues: a missing null check on the payment response, an unhandled promise rejection, a hardcoded timeout that should be configurable, a missing idempotency key, and a suggestion to extract magic numbers into constants. Organized by severity. Clear and actionable.
Claude Opus 4.6 Result
Found 8 issues: the same 5 GPT-5.4 found plus three more — a TOCTOU (time-of-check-time-of-use) vulnerability in the amount validation, a potential information leak in the error response that exposed internal stack traces, and a subtle issue where retry logic could cause double-charging if the first request succeeded but the response was lost. Each finding included the specific line number and a suggested fix.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 8 | 10 |
| Code quality | 8 | 10 |
| Efficiency | 9 | 8 |
| Total | 25 | 28 |
Winner: Claude Opus 4.6
The three additional findings were all security-critical. The double-charging bug alone could cost a company significant money and reputation. Opus's 76% on MRCR v2, a long-context retrieval benchmark, translates directly to better review of large, multi-file changes.
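The double-charging finding pairs with the missing idempotency key both models flagged, and the standard defense is the same: store each request's result under its idempotency key so a retry replays the stored response instead of charging again. A minimal in-memory sketch with hypothetical names, not the PR's code:

```python
charges: list[int] = []            # simulated payment processor ledger
_responses: dict[str, dict] = {}   # idempotency key -> stored response

def charge(idempotency_key: str, amount: int) -> dict:
    """Charge once per key; a retry after a lost response replays the
    stored result rather than hitting the processor a second time."""
    if idempotency_key in _responses:
        return _responses[idempotency_key]
    charges.append(amount)  # the only place money actually moves
    response = {"status": "succeeded", "amount": amount}
    _responses[idempotency_key] = response
    return response

first = charge("order-42", 100)
retried = charge("order-42", 100)  # client never saw `first`, so it retries
assert retried == first
assert charges == [100]            # charged exactly once
```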
Task 6: Write a Test Suite
Prompt: "Write comprehensive tests for this authentication middleware using Vitest. Cover: valid tokens, expired tokens, malformed tokens, missing authorization header, revoked tokens, rate limiting, and concurrent authentication requests." I provided the middleware source file (~120 lines).
GPT-5.4 Result
Generated 18 test cases organized in clean describe blocks. Every scenario from the prompt was covered. Added three extra edge cases: empty string token, token with wrong algorithm, and whitespace-only authorization header. Mocks were well-structured using vi.mock. Test descriptions were clear and followed the "should X when Y" pattern.
Claude Opus 4.6 Result
Generated 15 test cases. All prompted scenarios covered. The test structure used a helper factory for creating tokens with different properties — clever but added complexity. Missing the "concurrent authentication requests" test that was explicitly requested. The mocks were cleaner but the test count was lower.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 8 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 25 |
Winner: GPT-5.4
GPT-5.4 followed the prompt more faithfully and added meaningful edge cases. As multiple comparisons note, GPT-5.4's test generation is among the best, writing comprehensive suites with strong edge case coverage.
Task 7: Refactor a Monolithic Module
Prompt: I provided a 500-line Python module that handled user management — registration, authentication, profile updates, password resets, and email notifications all in one file. "Refactor this into a clean module structure following SOLID principles. Maintain backward compatibility with the existing public API."
GPT-5.4 Result
Split into 5 modules: auth.py, registration.py, profile.py, password.py, notifications.py. Added an __init__.py that re-exported the original public functions for backward compatibility. Clean separation. Each module was self-contained.
However, it introduced a circular import between registration.py and notifications.py and never resolved it: registration sends a welcome email, and the notifications module needed a reference back to user data. The code would crash on import.
Claude Opus 4.6 Result
Split into 6 modules with the same breakdown plus a types.py for shared data classes. Crucially, it identified the circular dependency issue and resolved it by introducing an event-based pattern — registration emits a "user_created" event, and the notification module subscribes to it. The backward-compatible __init__.py was identical in approach.
Opus also added a brief comment at the top of each module explaining what belongs there and what does not — acting as a guide for future developers.
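The event-based decoupling is worth sketching, since it is the standard cure for this kind of circular import. The helper names below are hypothetical, but the shape matches the description: registration emits, notifications subscribes, and neither module imports the other.

```python
from collections import defaultdict
from typing import Callable

# Minimal in-process event bus.
_subscribers: dict[str, list[Callable]] = defaultdict(list)

def subscribe(event: str, handler: Callable) -> None:
    _subscribers[event].append(handler)

def emit(event: str, payload) -> None:
    for handler in _subscribers[event]:
        handler(payload)

# notifications side: reacts to the event, never imports registration.
welcome_emails: list[str] = []
subscribe("user_created", lambda user: welcome_emails.append(user["email"]))

# registration side: emits the event, never imports notifications.
def register(email: str) -> dict:
    user = {"email": email}
    emit("user_created", user)
    return user

register("dev@example.com")
assert welcome_emails == ["dev@example.com"]
```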
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 6 | 10 |
| Code quality | 8 | 10 |
| Efficiency | 8 | 7 |
| Total | 22 | 27 |
Winner: Claude Opus 4.6
The circular dependency bug would have caused a production failure. This is the type of multi-file reasoning where Opus excels — it understands cross-file dependencies and architectural implications before generating code.
Task 8: Write Technical Documentation
Prompt: "Write API documentation for this payment processing SDK. Include: overview, authentication, rate limits, error codes, 5 endpoint descriptions with request/response examples, a webhook section, and a migration guide from v1 to v2." I provided the SDK source code.
GPT-5.4 Result
Comprehensive documentation covering all requested sections. The endpoint descriptions were detailed with curl examples and response schemas. The error codes section was well-organized as a table. The migration guide was clear with before/after code examples. Clean markdown formatting.
Claude Opus 4.6 Result
Also comprehensive, with a slightly different structure — it led with a "Quick Start" section before the detailed docs, which is a good pattern for developer documentation. The webhook section was more detailed, including retry behavior, signature verification code, and testing guidance. The migration guide included a deprecation timeline that was not in the source code — it inferred this from versioning patterns.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 9 | 9 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 27 | 26 |
Winner: Effectively a tie (GPT-5.4 edges ahead by a single efficiency point)
Both produced excellent documentation. The quality difference is negligible. GPT-5.4 was slightly faster. For documentation tasks, either model works well — this aligns with developer reports that documentation quality is comparable across frontier models.
Task 9: Design a System Architecture
Prompt: "Design the architecture for a real-time collaborative document editor supporting 10,000 concurrent users. Cover: data model, conflict resolution strategy (CRDTs vs OT), WebSocket infrastructure, storage layer, presence system, and deployment topology. Provide a diagram in Mermaid syntax."
GPT-5.4 Result
Chose OT (Operational Transformation) with a central server. Reasonable architecture with Redis for presence, PostgreSQL for document storage, and a WebSocket gateway behind a load balancer. The Mermaid diagram was clean. The analysis was competent but followed a standard playbook — it did not deeply analyze the tradeoffs between CRDTs and OT for this specific scale.
Claude Opus 4.6 Result
Started by asking a clarifying question about the document model (rich text vs. plain text vs. structured data), which I answered as "rich text." Then recommended CRDTs (specifically Yjs) over OT, with a detailed explanation of why CRDTs are superior at this scale — eventual consistency without a central sequencer eliminates the single point of failure.
The architecture included a novel detail: a "document gateway" layer that handles CRDT merge operations and acts as both a WebSocket terminator and a state persistence layer. The Mermaid diagram included data flow arrows with protocol annotations. The deployment section recommended a specific partitioning strategy (shard by document ID) with reasoning about hot partitions.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 8 | 10 |
| Code quality | 7 | 10 |
| Efficiency | 8 | 7 |
| Total | 23 | 27 |
Winner: Claude Opus 4.6
Architecture is where the reasoning depth gap between these models is most visible. Opus reasons more explicitly about the problem before generating output, working through edge cases and asking clarifying questions when requirements are genuinely ambiguous.
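The CRDT property Opus leaned on is that replicas converge by merging state, with no central sequencer. It can be illustrated with the simplest CRDT, a grow-only counter. This is a toy stand-in; the actual recommendation was Yjs, which implements far richer text CRDTs.

```python
class GCounter:
    """Grow-only counter CRDT. Each replica increments only its own
    slot; merge takes the per-replica max, so merges are commutative,
    associative, and idempotent, and replicas converge in any order."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas edit independently, then sync in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```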
Task 10: Write a DevOps Deployment Script
Prompt: "Write a GitHub Actions workflow that: builds a Docker image, runs tests, pushes to ECR, deploys to ECS Fargate with blue-green deployment, runs a smoke test against the new deployment, and rolls back automatically if the smoke test fails. Use OIDC for AWS authentication — no hardcoded credentials."
GPT-5.4 Result
A complete workflow file with all requested steps. OIDC configuration was correct using aws-actions/configure-aws-credentials with the role ARN. Blue-green deployment used ECS service update with CODE_DEPLOY deployment controller. The smoke test was a curl-based health check. Rollback was triggered by the smoke test exit code. Well-commented, production-ready.
Claude Opus 4.6 Result
Also complete and correct. Used the same OIDC approach. The key difference was in the smoke test — Opus created a more thorough test that checked not just the health endpoint but also verified the deployment was serving the correct version by checking a /version endpoint. The rollback included a Slack notification step. However, the workflow was notably more verbose — 40% more lines for similar functionality.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 10 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 7 |
| Total | 28 | 26 |
Winner: GPT-5.4
For DevOps scripting, GPT-5.4's conciseness is an advantage. The workflow is easier to maintain and modify. Opus's additions (Slack notification, version verification) are nice but were not requested and added complexity. GPT-5.4 leads on Terminal-bench (75.1% vs 65.4%), and this advantage shows in terminal-oriented tasks.
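For reference, the OIDC setup both models produced follows the standard GitHub Actions shape. A hedged fragment below; the role ARN, region, and job name are placeholders, not values from either output. The `id-token: write` permission is what lets the runner request an OIDC token for AWS to verify, removing the need for stored credentials.

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # placeholder
          aws-region: us-east-1
```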
The Final Scoreboard
| Task | GPT-5.4 | Opus 4.6 | Winner |
|---|---|---|---|
| 1. REST API endpoint | 28 | 27 | GPT-5.4 |
| 2. React component | 28 | 26 | GPT-5.4 |
| 3. SQL query | 26 | 27 | Opus 4.6 |
| 4. Debug race condition | 22 | 27 | Opus 4.6 |
| 5. Code review | 25 | 28 | Opus 4.6 |
| 6. Test suite | 28 | 25 | GPT-5.4 |
| 7. Refactor module | 22 | 27 | Opus 4.6 |
| 8. Documentation | 27 | 26 | Tie |
| 9. Architecture design | 23 | 27 | Opus 4.6 |
| 10. DevOps script | 28 | 26 | GPT-5.4 |
| Total | 257 | 266 | Opus 4.6 |
Final score: Claude Opus 4.6 wins 266 to 257.
But the aggregate score hides the real story.
The Pattern That Matters More Than the Score
Look at where each model wins:
GPT-5.4 wins on:
- API endpoints (well-defined, scoped tasks)
- React components (boilerplate with clear specs)
- Test writing (comprehensive coverage from a spec)
- DevOps scripts (terminal-oriented, concise output)
Claude Opus 4.6 wins on:
- SQL edge cases (catching subtle data bugs)
- Debugging (understanding root causes in complex systems)
- Code review (finding security and correctness issues)
- Refactoring (handling cross-file dependencies)
- Architecture (deep reasoning about tradeoffs)
The pattern is clear: GPT-5.4 is the faster, cheaper, better model for well-defined coding tasks. Claude Opus 4.6 is the deeper, more careful model for tasks requiring reasoning across complexity.
This matches what DataCamp's analysis found: GPT-5.4 is the best all-around model while Opus 4.6 excels specifically at agentic and deep-coding tasks.
The Cost Factor
The score gap (9 points) is relatively small. The cost gap is not.
| Metric | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|
| Input pricing | $2.50/MTok | $15/MTok |
| Output pricing | $15/MTok | $75/MTok |
| Speed | 73.4 tok/s | 40.5 tok/s |
| Context window | 1M (surcharge >272K) | 1M (flat pricing) |
| Tool search savings | ~47% token reduction | N/A |
For this 10-task test, the total API cost was approximately $4.20 for GPT-5.4 and $31.50 for Opus 4.6. That is a 7.5x cost difference for a 3.5% quality gap.
For a team running hundreds of AI-assisted coding tasks per day, the math strongly favors GPT-5.4 for the majority of work, with Opus reserved for the high-stakes 10-20% where its reasoning depth makes a material difference.
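That "reserve Opus for the high-stakes slice" math can be made concrete. A back-of-envelope sketch, assuming purely for illustration 20k input and 4k output tokens per task, 200 tasks a day, and an 85/15 split in the hybrid case; the per-MTok prices are the published ones quoted above.

```python
# ($/MTok input, $/MTok output) at published pricing.
GPT54 = (2.50, 15.00)
OPUS46 = (15.00, 75.00)

def task_cost(pricing, in_tok=20_000, out_tok=4_000):
    """Cost of one task at assumed (illustrative) token counts."""
    in_price, out_price = pricing
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

TASKS_PER_DAY = 200
all_gpt = TASKS_PER_DAY * task_cost(GPT54)                # $22.00/day
all_opus = TASKS_PER_DAY * task_cost(OPUS46)              # $120.00/day
hybrid = 170 * task_cost(GPT54) + 30 * task_cost(OPUS46)  # $36.70/day
```

Under these assumptions the hybrid split costs under a third of all-Opus while still routing the complex 15% of work to the stronger reasoner.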
The Smart Strategy: Use Both
Most working developers in 2026 are not choosing one model — they are choosing when to use each. The pattern that emerged from this test matches what we use at ZBuild:
Daily driver: GPT-5.4 (via Codex CLI or API)
- Writing new endpoints, components, and scripts
- Generating tests from specs
- Quick debugging on isolated issues
- DevOps and CI/CD automation
Heavy lifter: Claude Opus 4.6 (via Claude Code or API)
- Cross-file refactoring with complex dependencies
- Reviewing security-critical code
- Architectural design sessions
- Debugging non-obvious issues in large codebases
This two-model approach captures 95% of both models' strengths while keeping costs manageable. The Portkey guide to choosing between these models recommends the same hybrid approach.
What the Benchmarks Say (for Context)
The task-by-task results above align with the formal benchmarks:
| Benchmark | GPT-5.4 | Opus 4.6 | What It Measures |
|---|---|---|---|
| SWE-bench Verified | ~80% | 80.8% | Real GitHub issue resolution |
| SWE-bench Pro | 57.7% | ~46% | Harder, stricter coding tasks |
| Terminal-bench 2.0 | 75.1% | 65.4% | Terminal and system tasks |
| HumanEval | 93.1% | 90.4% | Function-level code generation |
| GPQA Diamond | 92.0-92.8% | 87.4-91.3% | Expert-level reasoning |
| ARC-AGI-2 | 73.3% | 68.8-69.2% | Novel reasoning |
Sources: MindStudio benchmarks, Evolink analysis, Anthropic
GPT-5.4 leads on most benchmarks. Opus 4.6 leads on SWE-bench Verified — the benchmark most closely tied to real-world bug fixing — which explains its advantage on debugging and refactoring in my tests.
The Verdict
If you can only choose one model: GPT-5.4. It handles 80% of coding tasks at equal or better quality, costs 6-7x less, and is 80% faster. The 20% of tasks where Opus is better (debugging, refactoring, architecture) can often be handled with more detailed prompting on GPT-5.4.
If you can use both: Do it. GPT-5.4 for daily coding, Opus 4.6 for complex work. This is not a compromise — it is the optimal strategy.
If cost does not matter and you want maximum quality on every task: Claude Opus 4.6. It won the overall score and its wins were on the tasks where quality matters most (bugs cost more than boilerplate).
The results were not what I expected because I assumed the more expensive model would dominate. It did not. The two models have genuinely different strengths, and the best strategy is knowing which strength you need for the task in front of you.
Sources
- OpenAI — Introducing GPT-5.4
- OpenAI — API Pricing
- Anthropic — Introducing Claude Opus 4.6
- Anthropic — Claude Pricing
- MindStudio — GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro Benchmarks
- MindStudio — Which AI Model Is Right for Your Workflow
- Portkey — GPT-5.4 vs Claude Opus 4.6 Guide
- DataCamp — GPT-5.4 vs Claude Opus 4.6 for Agentic Tasks
- Artificial Analysis — GPT-5.4 vs Claude Opus 4.6
- Bind AI — GPT-5.4 vs Claude Opus 4.6 for Coding
- Evolink — SWE-bench Verified 2026: Claude vs GPT
- DEV Community — ChatGPT vs Claude for Coding 2026
- Claude 5 — Opus 4.6 Benchmark Analysis