ZBuild News

I Gave the Same 10 Coding Tasks to GPT-5.4 and Claude Opus 4.6 — The Results Were Not What I Expected

A hands-on comparison where GPT-5.4 and Claude Opus 4.6 receive the same 10 real-world coding tasks — from API endpoints to architecture design. Each task is scored on correctness, code quality, and efficiency. The overall winner is revealed at the end.

Published
2026-03-27
Author
ZBuild Team
Reading Time
15 min read
Tags: gpt 5.4 vs claude opus 4.6, gpt 5.4 coding, claude opus 4.6 coding, best ai for coding 2026, gpt 5.4 benchmarks, claude opus 4.6 benchmarks
Disclosure: This article is published by ZBuild. Some products or services mentioned may include ZBuild's own offerings. We strive to provide accurate, objective analysis to help you make informed decisions. Pricing and features were accurate at the time of writing.

The Experiment

I took 10 real coding tasks — the kind developers actually do every day — and submitted the exact same prompt to both GPT-5.4 and Claude Opus 4.6. Same system prompt, same context, same evaluation criteria.

No synthetic benchmarks. No cherry-picked examples. Just real tasks scored on three dimensions:

  • Correctness (does it work without modifications?)
  • Code quality (readability, types, error handling, edge cases)
  • Efficiency (token usage, response time, number of follow-up prompts needed)

Each dimension is scored 1-10. Maximum possible score per task: 30.

The models were accessed via their respective APIs at standard pricing: GPT-5.4 at $2.50/$15 per million tokens and Claude Opus 4.6 at $15/$75 per million tokens.

Here are the 10 tasks and exactly what happened.


Task 1: Build a REST API Endpoint

Prompt: "Create a POST /api/users endpoint in Express.js with TypeScript. Validate email format and password strength (min 8 chars, 1 uppercase, 1 number). Hash the password with bcrypt. Store in PostgreSQL via Prisma. Return the user without the password field. Handle duplicate emails with a 409 status."

GPT-5.4 Result

Clean, production-ready code. The Zod validation schema was precise. The bcrypt hashing used a proper salt round constant. The Prisma query used select to exclude the password field at the database level rather than deleting it from the response object — a subtle but important security practice. TypeScript types were tight.
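The validation and response-shaping steps described here are easy to get subtly wrong. A dependency-free sketch of the prompt's rules (illustrative names, not either model's actual output):

```typescript
// Hypothetical sketch of the prompt's validation rules, without Zod.
// isStrongPassword: min 8 chars, at least 1 uppercase, at least 1 digit.
function isStrongPassword(pw: string): boolean {
  return pw.length >= 8 && /[A-Z]/.test(pw) && /[0-9]/.test(pw);
}

// Deliberately simple email shape check (real code would lean on a library).
function isEmailish(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Return a response body without the password field. This is the weaker
// fallback: excluding the field in the database query is still preferable.
function withoutPassword<T extends { password?: string }>(user: T): Omit<T, "password"> {
  const { password, ...rest } = user;
  return rest;
}
```

Excluding the field via Prisma's select, as GPT-5.4 did, means the hash never leaves the data layer; withoutPassword is only the fallback when you already hold the full record.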

Claude Opus 4.6 Result

Also clean and correct. Used a similar Zod validation approach but added rate limiting middleware for the endpoint and included a comment explaining why. The password exclusion used Prisma's omit feature. Added a try/catch with specific error types for Prisma unique constraint violations.

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    10        10
Code quality   9         9
Efficiency     9         8
Total          28        27

Winner: GPT-5.4 (marginally, on speed and conciseness)

Both outputs were excellent. GPT-5.4 was faster and used fewer tokens. Opus added the rate limiting middleware unprompted — useful but not requested. For well-defined API tasks, the models are essentially interchangeable.


Task 2: Build a React Component

Prompt: "Create a React component called DataTable that accepts generic typed data, supports sortable columns, pagination (client-side), a search filter, and row selection with checkboxes. Use TypeScript generics. No UI library — just HTML/CSS with CSS modules. Include proper ARIA attributes."

GPT-5.4 Result

Delivered a well-structured generic component. TypeScript generics were used correctly for the column definition and data types. Sorting logic was clean with a custom useSortable hook extracted. Pagination used useMemo for performance. ARIA attributes were correct — role="grid", aria-sort on sortable headers, aria-selected on checkboxes.

Claude Opus 4.6 Result

Similar structure but with a few differences. Opus created a useDataTable hook that encapsulated sorting, pagination, and filtering logic — cleaner separation but more abstraction. TypeScript generics were equally correct. Missing aria-sort on the header cells. The CSS module included a responsive layout that switched to card view on mobile, which was not requested but was a thoughtful addition.
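The sorting and client-side pagination both models encapsulated in hooks reduces to two pure helpers, shown here outside React so the logic is testable on its own (a sketch, not either model's output):

```typescript
type SortDir = "asc" | "desc";

// Sort by a selector rather than mutating state in place: React state
// must stay immutable, so we copy the array before sorting.
function sortRows<T>(rows: T[], get: (row: T) => number | string, dir: SortDir): T[] {
  const sign = dir === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x = get(a), y = get(b);
    return x < y ? -sign : x > y ? sign : 0;
  });
}

// Client-side pagination: page is zero-based.
function paginate<T>(rows: T[], page: number, pageSize: number): T[] {
  return rows.slice(page * pageSize, (page + 1) * pageSize);
}
```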

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    10        9
Code quality   9         9
Efficiency     9         8
Total          28        26

Winner: GPT-5.4

GPT-5.4's ARIA implementation was more complete, which matters for a component that will be used across an application. As noted by MindStudio's comparison, GPT-5.4 excels at boilerplate generation including React components and TypeScript interfaces.


Task 3: Write a Complex SQL Query

Prompt: "Write a PostgreSQL query that returns the top 10 customers by lifetime value (total order amount) who have placed at least 3 orders in the last 12 months, including their most recent order date, average order value, and the percentage change in their spending compared to the previous 12-month period. Use CTEs for readability."

GPT-5.4 Result

Three CTEs: one for current period aggregation, one for previous period aggregation, one for the percentage calculation. Clean, correct, well-formatted. Used COALESCE for handling customers with no previous period data. Added an index hint comment.

Claude Opus 4.6 Result

Four CTEs with a slightly different structure: separated the "last order date" calculation into its own CTE to avoid a correlated subquery. Added a NULLIF to prevent division by zero in the percentage calculation — a real edge case GPT-5.4 missed. Included a window function alternative in a comment block.
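To see why the guard matters, here is the same calculation expressed in TypeScript (illustrative, not from either model's answer), where an explicit zero check plays the role of NULLIF(previous, 0):

```typescript
// Percentage change in spending between two periods. A customer with no
// previous-period spend yields null rather than a division-by-zero error
// (or, worse in SQL without the guard, a silently failing row).
function pctChange(current: number, previous: number): number | null {
  if (previous === 0) return null; // mirrors NULLIF(previous, 0) in SQL
  return ((current - previous) / previous) * 100;
}
```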

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    9         10
Code quality   8         9
Efficiency     9         8
Total          26        27

Winner: Claude Opus 4.6

The division-by-zero edge case was the differentiator. In production SQL, that kind of bug causes silent data corruption. Opus consistently surfaces edge cases that matter in real-world data pipelines.


Task 4: Debug a Race Condition

Prompt: I provided 3 files (~200 lines total) from a Node.js application with an intermittent test failure. The bug was a race condition in a caching layer where concurrent cache misses could trigger duplicate database queries and inconsistent state. "Find the bug, explain why it only manifests intermittently, and provide a fix."

GPT-5.4 Result

Identified the correct cache miss code path. Suggested adding a mutex lock using async-mutex. The fix was correct but treated the symptom rather than the root cause — it serialized all cache accesses, which would hurt performance under load.

Claude Opus 4.6 Result

Identified the same code path but also traced the state inconsistency to a second issue: the cache update was not atomic — there was a window between the read check and the write where another request could interleave. Opus suggested a "single-flight" pattern (coalescing concurrent identical requests) rather than a global mutex. The fix was more surgical and preserved concurrency for non-conflicting cache keys.
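A minimal single-flight sketch, assuming an in-memory cache keyed by string (my illustration of the pattern, not Opus's actual fix):

```typescript
// Concurrent misses on the same key share one in-flight promise instead of
// each hitting the database; distinct keys remain fully concurrent, which is
// what a global mutex would have destroyed.
class SingleFlight<V> {
  private inFlight = new Map<string, Promise<V>>();

  run(key: string, loader: () => Promise<V>): Promise<V> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // coalesce: reuse the pending load

    const p = loader().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, p);
    return p;
  }
}
```

The deduplication check happens synchronously before any await, which is exactly the window the original race condition exploited.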

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    7         10
Code quality   7         9
Efficiency     8         8
Total          22        27

Winner: Claude Opus 4.6

A clear gap. Opus understood the concurrency model deeply enough to suggest a targeted fix. This aligns with Claude Opus 4.6's 80.8% score on SWE-bench Verified, which tests exactly this kind of real-world bug resolution.


Task 5: Code Review

Prompt: I provided a 350-line pull request adding a new payment processing module. "Review this PR for bugs, security issues, performance problems, and code quality. Prioritize findings by severity."

GPT-5.4 Result

Found 5 issues: a missing null check on the payment response, an unhandled promise rejection, a hardcoded timeout that should be configurable, a missing idempotency key, and a suggestion to extract magic numbers into constants. Organized by severity. Clear and actionable.

Claude Opus 4.6 Result

Found 8 issues: the same 5 GPT-5.4 found plus three more — a TOCTOU (time-of-check-time-of-use) vulnerability in the amount validation, a potential information leak in the error response that exposed internal stack traces, and a subtle issue where retry logic could cause double-charging if the first request succeeded but the response was lost. Each finding included the specific line number and a suggested fix.
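The double-charging finding is the classic argument for idempotency keys. A toy sketch of the guard (the in-memory store and names are hypothetical; real systems persist the key durably):

```typescript
type ChargeResult = { chargeId: string; amount: number };

// A retry that reuses the same idempotency key gets the stored result back
// instead of creating a second charge, even if the first response was lost.
class PaymentProcessor {
  private seen = new Map<string, ChargeResult>();
  private counter = 0;

  charge(idempotencyKey: string, amount: number): ChargeResult {
    const prior = this.seen.get(idempotencyKey);
    if (prior) return prior; // lost response + retry: no double charge

    const result = { chargeId: `ch_${++this.counter}`, amount };
    this.seen.set(idempotencyKey, result);
    return result;
  }
}
```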

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    8         10
Code quality   8         10
Efficiency     9         8
Total          25        28

Winner: Claude Opus 4.6

The three additional findings were all security-critical. The double-charging bug alone could cost a company significant money and reputation. Opus's 76% on MRCR v2 (multi-file reasoning) translates directly to better code review on complex modules.


Task 6: Write a Test Suite

Prompt: "Write comprehensive tests for this authentication middleware using Vitest. Cover: valid tokens, expired tokens, malformed tokens, missing authorization header, revoked tokens, rate limiting, and concurrent authentication requests." I provided the middleware source file (~120 lines).

GPT-5.4 Result

Generated 18 test cases organized in clean describe blocks. Every scenario from the prompt was covered. Added three extra edge cases: empty string token, token with wrong algorithm, and whitespace-only authorization header. Mocks were well-structured using vi.mock. Test descriptions were clear and followed the "should X when Y" pattern.
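Those extra edge cases all live in how the Authorization header is parsed; a hypothetical helper makes it clear why each one deserves a test:

```typescript
// Hypothetical parser covering the edge cases GPT-5.4 added tests for:
// missing header, empty token, and whitespace-only values all reject cleanly.
function parseBearerToken(header: string | undefined): string | null {
  if (!header) return null;                        // missing header
  const [scheme, ...rest] = header.trim().split(/\s+/);
  if (scheme.toLowerCase() !== "bearer") return null;
  const token = rest.join(" ").trim();
  return token.length > 0 ? token : null;          // empty / whitespace-only
}
```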

Claude Opus 4.6 Result

Generated 15 test cases. All prompted scenarios covered. The test structure used a helper factory for creating tokens with different properties — clever but added complexity. Missing the "concurrent authentication requests" test that was explicitly requested. The mocks were cleaner but the test count was lower.

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    10        8
Code quality   9         9
Efficiency     9         8
Total          28        25

Winner: GPT-5.4

GPT-5.4 followed the prompt more faithfully and added meaningful edge cases. As multiple comparisons note, GPT-5.4's test generation is among the best, writing comprehensive suites with strong edge case coverage.


Task 7: Refactor a Monolithic Module

Prompt: I provided a 500-line Python module that handled user management — registration, authentication, profile updates, password resets, and email notifications all in one file. "Refactor this into a clean module structure following SOLID principles. Maintain backward compatibility with the existing public API."

GPT-5.4 Result

Split into 5 modules: auth.py, registration.py, profile.py, password.py, notifications.py. Added an __init__.py that re-exported the original public functions for backward compatibility. Clean separation. Each module was self-contained.

However, it missed updating the circular dependency between registration.py and notifications.py — registration sends a welcome email, and the notification module needed a reference back to user data. The code would crash on import.

Claude Opus 4.6 Result

Split into 6 modules with the same breakdown plus a types.py for shared data classes. Crucially, it identified the circular dependency issue and resolved it by introducing an event-based pattern — registration emits a "user_created" event, and the notification module subscribes to it. The backward-compatible __init__.py was identical in approach.

Opus also added a brief comment at the top of each module explaining what belongs there and what does not — acting as a guide for future developers.
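The event pattern that breaks the cycle can be sketched in a few lines (TypeScript here for consistency with the other examples, although the refactored module was Python):

```typescript
type Handler<T> = (payload: T) => void;

// Tiny in-process event bus: registration emits, notifications subscribe,
// and neither module needs to import the other.
class EventBus {
  private handlers = new Map<string, Handler<any>[]>();

  on<T>(event: string, handler: Handler<T>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler);
    this.handlers.set(event, list);
  }

  emit<T>(event: string, payload: T): void {
    for (const h of this.handlers.get(event) ?? []) h(payload);
  }
}

// registration.py's role: create the user, then announce it.
function registerUser(bus: EventBus, email: string): void {
  // ... persist the user here ...
  bus.emit("user_created", { email });
}
```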

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    6         10
Code quality   8         10
Efficiency     8         7
Total          22        27

Winner: Claude Opus 4.6

The circular dependency bug would have caused a production failure. This is the type of multi-file reasoning where Opus excels — it understands cross-file dependencies and architectural implications before generating code.


Task 8: Write Technical Documentation

Prompt: "Write API documentation for this payment processing SDK. Include: overview, authentication, rate limits, error codes, 5 endpoint descriptions with request/response examples, a webhook section, and a migration guide from v1 to v2." I provided the SDK source code.

GPT-5.4 Result

Comprehensive documentation covering all requested sections. The endpoint descriptions were detailed with curl examples and response schemas. The error codes section was well-organized as a table. The migration guide was clear with before/after code examples. Clean markdown formatting.

Claude Opus 4.6 Result

Also comprehensive, with a slightly different structure — it led with a "Quick Start" section before the detailed docs, which is a good pattern for developer documentation. The webhook section was more detailed, including retry behavior, signature verification code, and testing guidance. The migration guide included a deprecation timeline that was not in the source code — it inferred this from versioning patterns.

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    9         9
Code quality   9         9
Efficiency     9         8
Total          27        26

Winner: Tie (GPT-5.4 by one point on efficiency)

Both produced excellent documentation. The quality difference is negligible. GPT-5.4 was slightly faster. For documentation tasks, either model works well — this aligns with developer reports that documentation quality is comparable across frontier models.


Task 9: Design a System Architecture

Prompt: "Design the architecture for a real-time collaborative document editor supporting 10,000 concurrent users. Cover: data model, conflict resolution strategy (CRDTs vs OT), WebSocket infrastructure, storage layer, presence system, and deployment topology. Provide a diagram in Mermaid syntax."

GPT-5.4 Result

Chose OT (Operational Transformation) with a central server. Reasonable architecture with Redis for presence, PostgreSQL for document storage, and a WebSocket gateway behind a load balancer. The Mermaid diagram was clean. The analysis was competent but followed a standard playbook — it did not deeply analyze the tradeoffs between CRDTs and OT for this specific scale.

Claude Opus 4.6 Result

Started by asking a clarifying question about the document model (rich text vs. plain text vs. structured data), which I answered as "rich text." Then recommended CRDTs (specifically Yjs) over OT, with a detailed explanation of why CRDTs are superior at this scale — eventual consistency without a central sequencer eliminates the single point of failure.

The architecture included a novel detail: a "document gateway" layer that handles CRDT merge operations and acts as both a WebSocket terminator and a state persistence layer. The Mermaid diagram included data flow arrows with protocol annotations. The deployment section recommended a specific partitioning strategy (shard by document ID) with reasoning about hot partitions.
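Shard-by-document-ID routing is straightforward to sketch (illustrative; the hash function and shard count are my assumptions, not details from the model's answer):

```typescript
// A stable hash of the document ID picks the gateway shard, so every editor
// of a given document lands on the same node and CRDT merges stay local.
function shardFor(docId: string, shardCount: number): number {
  let hash = 0;
  for (const ch of docId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple stable string hash
  }
  return hash % shardCount;
}
```

The hot-partition concern Opus raised is the weakness of this scheme: one viral document pins all its traffic to a single shard, which is why the answer had to reason about it explicitly.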

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    8         10
Code quality   7         10
Efficiency     8         7
Total          23        27

Winner: Claude Opus 4.6

Architecture is where the reasoning depth gap between these models is most visible. Opus reasons more explicitly about the problem before generating output, working through edge cases and asking clarifying questions when requirements are genuinely ambiguous.


Task 10: Write a DevOps Deployment Script

Prompt: "Write a GitHub Actions workflow that: builds a Docker image, runs tests, pushes to ECR, deploys to ECS Fargate with blue-green deployment, runs a smoke test against the new deployment, and rolls back automatically if the smoke test fails. Use OIDC for AWS authentication — no hardcoded credentials."

GPT-5.4 Result

A complete workflow file with all requested steps. OIDC configuration was correct using aws-actions/configure-aws-credentials with the role ARN. Blue-green deployment used ECS service update with CODE_DEPLOY deployment controller. The smoke test was a curl-based health check. Rollback was triggered by the smoke test exit code. Well-commented, production-ready.

Claude Opus 4.6 Result

Also complete and correct. Used the same OIDC approach. The key difference was in the smoke test — Opus created a more thorough test that checked not just the health endpoint but also verified the deployment was serving the correct version by checking a /version endpoint. The rollback included a Slack notification step. However, the workflow was notably more verbose — 40% more lines for similar functionality.

Scores

Dimension      GPT-5.4   Opus 4.6
Correctness    10        10
Code quality   9         9
Efficiency     9         7
Total          28        26

Winner: GPT-5.4

For DevOps scripting, GPT-5.4's conciseness is an advantage. The workflow is easier to maintain and modify. Opus's additions (Slack notification, version verification) are nice but were not requested and added complexity. GPT-5.4 leads on Terminal-bench (75.1% vs 65.4%), and this advantage shows in terminal-oriented tasks.


The Final Scoreboard

Task                      GPT-5.4   Opus 4.6   Winner
1. REST API endpoint      28        27         GPT-5.4
2. React component        28        26         GPT-5.4
3. SQL query              26        27         Opus 4.6
4. Debug race condition   22        27         Opus 4.6
5. Code review            25        28         Opus 4.6
6. Test suite             28        25         GPT-5.4
7. Refactor module        22        27         Opus 4.6
8. Documentation          27        26         Tie
9. Architecture design    23        27         Opus 4.6
10. DevOps script         28        26         GPT-5.4
Total                     257       266        Opus 4.6

Final score: Claude Opus 4.6 wins 266 to 257.

But the aggregate score hides the real story.


The Pattern That Matters More Than the Score

Look at where each model wins:

GPT-5.4 wins on:

  • API endpoints (well-defined, scoped tasks)
  • React components (boilerplate with clear specs)
  • Test writing (comprehensive coverage from a spec)
  • DevOps scripts (terminal-oriented, concise output)

Claude Opus 4.6 wins on:

  • SQL edge cases (catching subtle data bugs)
  • Debugging (understanding root causes in complex systems)
  • Code review (finding security and correctness issues)
  • Refactoring (handling cross-file dependencies)
  • Architecture (deep reasoning about tradeoffs)

The pattern is clear: GPT-5.4 is the faster, cheaper, better model for well-defined coding tasks. Claude Opus 4.6 is the deeper, more careful model for tasks that require sustained reasoning across complex systems.

This matches what DataCamp's analysis found: GPT-5.4 is the best all-around model while Opus 4.6 excels specifically at agentic and deep-coding tasks.


The Cost Factor

The score gap (9 points) is relatively small. The cost gap is not.

Metric                GPT-5.4                 Claude Opus 4.6
Input pricing         $2.50/MTok              $15/MTok
Output pricing        $15/MTok                $75/MTok
Speed                 73.4 tok/s              40.5 tok/s
Context window        1M (surcharge >272K)    1M (flat pricing)
Tool search savings   ~47% token reduction    N/A

For this 10-task test, the total API cost was approximately $4.20 for GPT-5.4 and $31.50 for Opus 4.6. That is a 7.5x cost difference for a 3.5% quality gap.

For a team running hundreds of AI-assisted coding tasks per day, the math strongly favors GPT-5.4 for the majority of work, with Opus reserved for the high-stakes 10-20% where its reasoning depth makes a material difference.
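The per-task math is just a linear cost model over token counts (prices from the table above; any token counts you plug in are your own estimates):

```typescript
// Back-of-envelope API cost in USD. Prices are quoted per million tokens,
// as in the pricing table; token counts are hypothetical inputs.
function apiCostUSD(
  inputTokens: number,
  outputTokens: number,
  inPerM: number,
  outPerM: number
): number {
  return (inputTokens / 1_000_000) * inPerM + (outputTokens / 1_000_000) * outPerM;
}
```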


The Smart Strategy: Use Both

Most working developers in 2026 are not choosing one model — they are choosing when to use each. The pattern that emerged from this test matches what we use at ZBuild:

Daily driver: GPT-5.4 (via Codex CLI or API)

  • Writing new endpoints, components, and scripts
  • Generating tests from specs
  • Quick debugging on isolated issues
  • DevOps and CI/CD automation

Heavy lifter: Claude Opus 4.6 (via Claude Code or API)

  • Cross-file refactoring with complex dependencies
  • Reviewing security-critical code
  • Architectural design sessions
  • Debugging non-obvious issues in large codebases

This two-model approach captures 95% of both models' strengths while keeping costs manageable. The Portkey guide to choosing between these models recommends the same hybrid approach.
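The routing rule above can be made explicit; this sketch encodes the article's task taxonomy, not any vendor's API:

```typescript
// Task categories are this article's taxonomy; model names are strings, not
// real endpoint identifiers. A team would tune this table to its own work.
type TaskKind =
  | "endpoint" | "component" | "tests" | "devops"         // well-defined work
  | "debugging" | "refactor" | "review" | "architecture"; // deep-reasoning work

function pickModel(task: TaskKind): "gpt-5.4" | "claude-opus-4.6" {
  const deepWork: TaskKind[] = ["debugging", "refactor", "review", "architecture"];
  return deepWork.includes(task) ? "claude-opus-4.6" : "gpt-5.4";
}
```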


What the Benchmarks Say (for Context)

The task-by-task results above align with the formal benchmarks:

Benchmark            GPT-5.4      Opus 4.6      What It Measures
SWE-bench Verified   ~80%         80.8%         Real GitHub issue resolution
SWE-bench Pro        57.7%        ~46%          Harder, stricter coding tasks
Terminal-bench 2.0   75.1%        65.4%         Terminal and system tasks
HumanEval            93.1%        90.4%         Function-level code generation
GPQA Diamond         92.0-92.8%   87.4-91.3%    Expert-level reasoning
ARC-AGI-2            73.3%        68.8-69.2%    Novel reasoning
Sources: MindStudio benchmarks, Evolink analysis, Anthropic

GPT-5.4 leads on most benchmarks. Opus 4.6 leads on SWE-bench Verified — the benchmark most closely tied to real-world bug fixing — which explains its advantage on debugging and refactoring in my tests.


The Verdict

If you can only choose one model: GPT-5.4. It handles 80% of coding tasks at equal or better quality, costs 6-7x less, and is 80% faster. The 20% of tasks where Opus is better (debugging, refactoring, architecture) can often be handled with more detailed prompting on GPT-5.4.

If you can use both: Do it. GPT-5.4 for daily coding, Opus 4.6 for complex work. This is not a compromise — it is the optimal strategy.

If cost does not matter and you want maximum quality on every task: Claude Opus 4.6. It won the overall score and its wins were on the tasks where quality matters most (bugs cost more than boilerplate).

The results were not what I expected because I assumed the more expensive model would dominate. It did not. The two models have genuinely different strengths, and the best strategy is knowing which strength you need for the task in front of you.


FAQ

Common questions

Which model won more coding tasks overall?
Claude Opus 4.6 won 5 out of 10 tasks, GPT-5.4 won 4, and 1 was a tie. However, GPT-5.4's wins were on higher-frequency everyday tasks (API endpoints, React components, test writing, DevOps scripts), while Opus dominated on complex, high-stakes work (debugging, refactoring, architecture, code review).
Which model is more cost-effective for coding?
GPT-5.4 is significantly cheaper. At $2.50/$15 per million tokens versus Claude Opus 4.6's $15/$75, GPT-5.4 costs roughly 6x less per token. Combined with its faster speed (73.4 vs 40.5 tokens/sec) and tool search saving 47% on tokens, GPT-5.4 is the clear winner on cost-effectiveness for routine coding work.
Is Claude Opus 4.6 better for debugging than GPT-5.4?
Yes, in our testing. Opus found root causes faster on complex multi-file bugs and identified secondary issues that GPT-5.4 missed. Opus's 80.8% score on SWE-bench Verified (real GitHub issue resolution) reflects this — it excels at understanding how bugs propagate across codebases.
Which model writes better React components?
GPT-5.4 produced slightly cleaner React components in our tests — better TypeScript types, more concise JSX, and correct accessibility attributes out of the box. The difference was small but consistent across multiple component generation tasks.
Can I use both models together?
Yes, and many developers do. A common pattern is using GPT-5.4 (via Codex CLI) for rapid prototyping and daily coding, then switching to Claude Opus 4.6 (via Claude Code) for deep refactoring and architectural work. This hybrid approach captures each model's strengths.
Which model has a larger context window?
Both support up to 1M tokens. GPT-5.4 has a default 272K context with 1M available at a surcharge (2x input, 1.5x output above 272K). Claude Opus 4.6 offers the full 1M context at standard pricing with no long-context surcharge.