The Experiment
I took 10 real coding tasks — the kind developers actually do every day — and submitted the exact same prompt to both GPT-5.4 and Claude Opus 4.6. Same system prompt, same context, same evaluation criteria.
No synthetic benchmarks. No cherry-picked examples. Just real tasks scored on three dimensions:
- Correctness (does it work without modifications?)
- Code quality (readability, types, error handling, edge cases)
- Efficiency (token usage, response time, number of follow-up prompts needed)
Each dimension is scored 1-10. Maximum possible score per task: 30.
The models were accessed via their respective APIs at standard pricing: GPT-5.4 at $2.50/$15 per million tokens and Claude Opus 4.6 at $15/$75 per million tokens.
Here are the 10 tasks and exactly what happened.
Task 1: Build a REST API Endpoint
Prompt: "Create a POST /api/users endpoint in Express.js with TypeScript. Validate email format and password strength (min 8 chars, 1 uppercase, 1 number). Hash the password with bcrypt. Store in PostgreSQL via Prisma. Return the user without the password field. Handle duplicate emails with a 409 status."
GPT-5.4 Result
Clean, production-ready code. The Zod validation schema was precise. The bcrypt hashing used a proper salt round constant. The Prisma query used select to exclude the password field at the database level rather than deleting it from the response object — a subtle but important security practice. TypeScript types were tight.
Claude Opus 4.6 Result
Also clean and correct. Used a similar Zod validation approach but added rate limiting middleware for the endpoint and included a comment explaining why. The password exclusion used Prisma's omit feature. Added a try/catch with specific error types for Prisma unique constraint violations.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 10 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 27 |
Winner: GPT-5.4 (marginally, on speed and conciseness)
Both outputs were excellent. GPT-5.4 was faster and used fewer tokens. Opus added the rate limiting middleware unprompted — useful but not requested. For well-defined API tasks, the models are essentially interchangeable.
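For reference, the prompt's validation rules are simple enough to sketch in a few lines. This is an illustrative Python version, not either model's output (both submissions expressed the same rules as Zod schemas in TypeScript):

```python
import re

# Sketch of the prompt's rules: valid email shape, password with
# 8+ chars, at least one uppercase letter and one digit.
PASSWORD_RE = re.compile(r"^(?=.*[A-Z])(?=.*\d).{8,}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose

def validate_signup(email: str, password: str) -> list[str]:
    errors = []
    if not EMAIL_RE.match(email):
        errors.append("invalid email format")
    if not PASSWORD_RE.match(password):
        errors.append("password must be 8+ chars with an uppercase letter and a number")
    return errors

assert validate_signup("dev@example.com", "Passw0rd") == []
assert validate_signup("dev@example.com", "password") != []
```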
Task 2: Build a React Component
Prompt: "Create a React component called DataTable that accepts generic typed data, supports sortable columns, pagination (client-side), a search filter, and row selection with checkboxes. Use TypeScript generics. No UI library — just HTML/CSS with CSS modules. Include proper ARIA attributes."
GPT-5.4 Result
Delivered a well-structured generic component. TypeScript generics were used correctly for the column definition and data types. Sorting logic was clean with a custom useSortable hook extracted. Pagination used useMemo for performance. ARIA attributes were correct — role="grid", aria-sort on sortable headers, aria-selected on checkboxes.
Claude Opus 4.6 Result
Similar structure but with a few differences. Opus created a useDataTable hook that encapsulated sorting, pagination, and filtering logic — cleaner separation but more abstraction. TypeScript generics were equally correct. Missing aria-sort on the header cells. The CSS module included a responsive layout that switched to card view on mobile, which was not requested but was a thoughtful addition.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 9 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 26 |
Winner: GPT-5.4
GPT-5.4's ARIA implementation was more complete, which matters for a component that will be used across an application. As noted by MindStudio's comparison, GPT-5.4 excels at boilerplate generation including React components and TypeScript interfaces.
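The generic column definition both components were built around carries over to any typed language. A minimal Python sketch of the idea, illustrative only (the actual deliverables were TypeScript components with hooks):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")  # the row type the table is generic over

@dataclass
class Column(Generic[T]):
    """One column: a header label plus a typed accessor into a row."""
    header: str
    accessor: Callable[[T], str]
    sortable: bool = True

def sort_rows(rows: list[T], col: Column[T], ascending: bool = True) -> list[T]:
    # The core of a useSortable-style hook, minus the React state.
    return sorted(rows, key=col.accessor, reverse=not ascending)

people = [{"name": "Bob"}, {"name": "Alice"}]
by_name: Column[dict] = Column(header="Name", accessor=lambda row: row["name"])
assert [r["name"] for r in sort_rows(people, by_name)] == ["Alice", "Bob"]
```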
Task 3: Write a Complex SQL Query
Prompt: "Write a PostgreSQL query that returns the top 10 customers by lifetime value (total order amount) who have placed at least 3 orders in the last 12 months, including their most recent order date, average order value, and the percentage change in their spending compared to the previous 12-month period. Use CTEs for readability."
GPT-5.4 Result
Three CTEs: one for current period aggregation, one for previous period aggregation, one for the percentage calculation. Clean, correct, well-formatted. Used COALESCE for handling customers with no previous period data. Added an index hint comment.
Claude Opus 4.6 Result
Four CTEs with a slightly different structure: separated the "last order date" calculation into its own CTE to avoid a correlated subquery. Added a NULLIF to prevent division by zero in the percentage calculation — a real edge case GPT-5.4 missed. Included a window function alternative in a comment block.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 9 | 10 |
| Code quality | 8 | 9 |
| Efficiency | 9 | 8 |
| Total | 26 | 27 |
Winner: Claude Opus 4.6
The division-by-zero edge case was the differentiator. In production PostgreSQL, a single customer with no prior-period spending would make the entire query fail at runtime. Opus consistently surfaces edge cases that matter in real-world data pipelines.
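The NULLIF guard is easy to verify. This sketch uses SQLite via Python's stdlib rather than PostgreSQL, but NULLIF has the same semantics in both: with a zero previous-period spend, the divisor becomes NULL and the row yields NULL instead of breaking the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Percent-change expression from the task, guarded with NULLIF so a
# customer with zero previous-period spend yields NULL, not an error.
pct_change = "SELECT 100.0 * (? - ?) / NULLIF(?, 0)"

(grew,) = conn.execute(pct_change, (75, 50, 50)).fetchone()
(fresh,) = conn.execute(pct_change, (75, 0, 0)).fetchone()

assert grew == 50.0   # spending up 50% vs the previous period
assert fresh is None  # no previous period: NULL, query still succeeds
```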
Task 4: Debug a Race Condition
Prompt: I provided 3 files (~200 lines total) from a Node.js application with an intermittent test failure. The bug was a race condition in a caching layer where concurrent cache misses could trigger duplicate database queries and inconsistent state. "Find the bug, explain why it only manifests intermittently, and provide a fix."
GPT-5.4 Result
Identified the correct cache miss code path. Suggested adding a mutex lock using async-mutex. The fix was correct but treated the symptom rather than the root cause — it serialized all cache accesses, which would hurt performance under load.
Claude Opus 4.6 Result
Identified the same code path but also traced the state inconsistency to a second issue: the cache update was not atomic — there was a window between the read check and the write where another request could interleave. Opus suggested a "single-flight" pattern (coalescing concurrent identical requests) rather than a global mutex. The fix was more surgical and preserved concurrency for non-conflicting cache keys.
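The single-flight pattern is worth sketching. This is a hedged asyncio illustration of the idea, not Opus's actual fix: the first miss for a key starts the load, later misses for the same key await that in-flight task, and loads for different keys still run concurrently.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent loads for the same key into one call."""

    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str, loader):
        if key in self._inflight:
            # shield() so one waiter being cancelled doesn't cancel
            # the shared load for everyone else.
            return await asyncio.shield(self._inflight[key])
        task = asyncio.ensure_future(loader(key))
        self._inflight[key] = task
        try:
            return await task
        finally:
            # Clear the slot so later misses trigger a fresh load.
            self._inflight.pop(key, None)

# Demo: five concurrent misses for one key trigger exactly one load.
load_count = 0

async def slow_db_load(key: str) -> str:
    global load_count
    load_count += 1
    await asyncio.sleep(0.01)  # simulated query latency
    return f"row-for-{key}"

async def main() -> None:
    sf = SingleFlight()
    results = await asyncio.gather(*(sf.do("user:1", slow_db_load) for _ in range(5)))
    assert results == ["row-for-user:1"] * 5
    assert load_count == 1

asyncio.run(main())
```

Unlike a global mutex, nothing here serializes loads for distinct keys, which is why the fix preserves throughput under load.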
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 7 | 10 |
| Code quality | 7 | 9 |
| Efficiency | 8 | 8 |
| Total | 22 | 27 |
Winner: Claude Opus 4.6
A clear gap. Opus understood the concurrency model deeply enough to suggest a targeted fix. This aligns with Claude Opus 4.6's 80.8% score on SWE-bench Verified, which tests exactly this kind of real-world bug resolution.
Task 5: Code Review
Prompt: I provided a 350-line pull request adding a new payment processing module. "Review this PR for bugs, security issues, performance problems, and code quality. Prioritize findings by severity."
GPT-5.4 Result
Found 5 issues: a missing null check on the payment response, an unhandled promise rejection, a hardcoded timeout that should be configurable, a missing idempotency key, and a suggestion to extract magic numbers into constants. Organized by severity. Clear and actionable.
Claude Opus 4.6 Result
Found 8 issues: the same 5 GPT-5.4 found plus three more — a TOCTOU (time-of-check-time-of-use) vulnerability in the amount validation, a potential information leak in the error response that exposed internal stack traces, and a subtle issue where retry logic could cause double-charging if the first request succeeded but the response was lost. Each finding included the specific line number and a suggested fix.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 8 | 10 |
| Code quality | 8 | 10 |
| Efficiency | 9 | 8 |
| Total | 25 | 28 |
Winner: Claude Opus 4.6
The three additional findings were all security-critical. The double-charging bug alone could cost a company significant money and reputation. Opus's 76% on MRCR v2, a long-context retrieval benchmark, translates directly to better review of large, multi-file changes.
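The double-charging finding pairs with the missing idempotency key both models flagged, and the standard defense is the same: store each request's result under its idempotency key so a retry replays the stored response instead of charging again. A minimal in-memory sketch with hypothetical names, not the PR's code:

```python
charges: list[int] = []            # simulated payment processor ledger
_responses: dict[str, dict] = {}   # idempotency key -> stored response

def charge(idempotency_key: str, amount: int) -> dict:
    """Charge once per key; a retry after a lost response replays the
    stored result rather than hitting the processor a second time."""
    if idempotency_key in _responses:
        return _responses[idempotency_key]
    charges.append(amount)  # the only place money actually moves
    response = {"status": "succeeded", "amount": amount}
    _responses[idempotency_key] = response
    return response

first = charge("order-42", 100)
retried = charge("order-42", 100)  # client never saw `first`, so it retries
assert retried == first
assert charges == [100]            # charged exactly once
```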
Task 6: Write a Test Suite
Prompt: "Write comprehensive tests for this authentication middleware using Vitest. Cover: valid tokens, expired tokens, malformed tokens, missing authorization header, revoked tokens, rate limiting, and concurrent authentication requests." I provided the middleware source file (~120 lines).
GPT-5.4 Result
Generated 18 test cases organized in clean describe blocks. Every scenario from the prompt was covered. Added three extra edge cases: empty string token, token with wrong algorithm, and whitespace-only authorization header. Mocks were well-structured using vi.mock. Test descriptions were clear and followed the "should X when Y" pattern.
Claude Opus 4.6 Result
Generated 15 test cases. All prompted scenarios covered. The test structure used a helper factory for creating tokens with different properties — clever but added complexity. Missing the "concurrent authentication requests" test that was explicitly requested. The mocks were cleaner but the test count was lower.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 8 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 28 | 25 |
Winner: GPT-5.4
GPT-5.4 followed the prompt more faithfully and added meaningful edge cases. As multiple comparisons note, GPT-5.4's test generation is among the best, writing comprehensive suites with strong edge case coverage.
Task 7: Refactor a Monolithic Module
Prompt: I provided a 500-line Python module that handled user management — registration, authentication, profile updates, password resets, and email notifications all in one file. "Refactor this into a clean module structure following SOLID principles. Maintain backward compatibility with the existing public API."
GPT-5.4 Result
Split into 5 modules: auth.py, registration.py, profile.py, password.py, notifications.py. Added an __init__.py that re-exported the original public functions for backward compatibility. Clean separation. Each module was self-contained.
However, it introduced a circular import between registration.py and notifications.py and never resolved it: registration sends a welcome email, and the notifications module needed a reference back to user data. The code would crash on import.
Claude Opus 4.6 Result
Split into 6 modules with the same breakdown plus a types.py for shared data classes. Crucially, it identified the circular dependency issue and resolved it by introducing an event-based pattern — registration emits a "user_created" event, and the notification module subscribes to it. The backward-compatible __init__.py was identical in approach.
Opus also added a brief comment at the top of each module explaining what belongs there and what does not — acting as a guide for future developers.
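The event-based decoupling is worth sketching, since it is the standard cure for this kind of circular import. The helper names below are hypothetical, but the shape matches the description: registration emits, notifications subscribes, and neither module imports the other.

```python
from collections import defaultdict
from typing import Callable

# Minimal in-process event bus.
_subscribers: dict[str, list[Callable]] = defaultdict(list)

def subscribe(event: str, handler: Callable) -> None:
    _subscribers[event].append(handler)

def emit(event: str, payload) -> None:
    for handler in _subscribers[event]:
        handler(payload)

# notifications side: reacts to the event, never imports registration.
welcome_emails: list[str] = []
subscribe("user_created", lambda user: welcome_emails.append(user["email"]))

# registration side: emits the event, never imports notifications.
def register(email: str) -> dict:
    user = {"email": email}
    emit("user_created", user)
    return user

register("dev@example.com")
assert welcome_emails == ["dev@example.com"]
```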
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 6 | 10 |
| Code quality | 8 | 10 |
| Efficiency | 8 | 7 |
| Total | 22 | 27 |
Winner: Claude Opus 4.6
The circular dependency bug would have caused a production failure. This is the type of multi-file reasoning where Opus excels — it understands cross-file dependencies and architectural implications before generating code.
Task 8: Write Technical Documentation
Prompt: "Write API documentation for this payment processing SDK. Include: overview, authentication, rate limits, error codes, 5 endpoint descriptions with request/response examples, a webhook section, and a migration guide from v1 to v2." I provided the SDK source code.
GPT-5.4 Result
Comprehensive documentation covering all requested sections. The endpoint descriptions were detailed with curl examples and response schemas. The error codes section was well-organized as a table. The migration guide was clear with before/after code examples. Clean markdown formatting.
Claude Opus 4.6 Result
Also comprehensive, with a slightly different structure — it led with a "Quick Start" section before the detailed docs, which is a good pattern for developer documentation. The webhook section was more detailed, including retry behavior, signature verification code, and testing guidance. The migration guide included a deprecation timeline that was not in the source code — it inferred this from versioning patterns.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 9 | 9 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 8 |
| Total | 27 | 26 |
Winner: Effectively a tie (GPT-5.4 edges ahead by a single efficiency point)
Both produced excellent documentation. The quality difference is negligible. GPT-5.4 was slightly faster. For documentation tasks, either model works well — this aligns with developer reports that documentation quality is comparable across frontier models.
Task 9: Design a System Architecture
Prompt: "Design the architecture for a real-time collaborative document editor supporting 10,000 concurrent users. Cover: data model, conflict resolution strategy (CRDTs vs OT), WebSocket infrastructure, storage layer, presence system, and deployment topology. Provide a diagram in Mermaid syntax."
GPT-5.4 Result
Chose OT (Operational Transformation) with a central server. Reasonable architecture with Redis for presence, PostgreSQL for document storage, and a WebSocket gateway behind a load balancer. The Mermaid diagram was clean. The analysis was competent but followed a standard playbook — it did not deeply analyze the tradeoffs between CRDTs and OT for this specific scale.
Claude Opus 4.6 Result
Started by asking a clarifying question about the document model (rich text vs. plain text vs. structured data), which I answered as "rich text." Then recommended CRDTs (specifically Yjs) over OT, with a detailed explanation of why CRDTs are superior at this scale — eventual consistency without a central sequencer eliminates the single point of failure.
The architecture included a novel detail: a "document gateway" layer that handles CRDT merge operations and acts as both a WebSocket terminator and a state persistence layer. The Mermaid diagram included data flow arrows with protocol annotations. The deployment section recommended a specific partitioning strategy (shard by document ID) with reasoning about hot partitions.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 8 | 10 |
| Code quality | 7 | 10 |
| Efficiency | 8 | 7 |
| Total | 23 | 27 |
Winner: Claude Opus 4.6
Architecture is where the reasoning depth gap between these models is most visible. Opus reasons more explicitly about the problem before generating output, working through edge cases and asking clarifying questions when requirements are genuinely ambiguous.
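The CRDT property Opus leaned on is that replicas converge by merging state, with no central sequencer. It can be illustrated with the simplest CRDT, a grow-only counter. This is a toy stand-in; the actual recommendation was Yjs, which implements far richer text CRDTs.

```python
class GCounter:
    """Grow-only counter CRDT. Each replica increments only its own
    slot; merge takes the per-replica max, so merges are commutative,
    associative, and idempotent, and replicas converge in any order."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self) -> int:
        return sum(self.counts.values())

# Two replicas edit independently, then sync in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```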
Task 10: Write a DevOps Deployment Script
Prompt: "Write a GitHub Actions workflow that: builds a Docker image, runs tests, pushes to ECR, deploys to ECS Fargate with blue-green deployment, runs a smoke test against the new deployment, and rolls back automatically if the smoke test fails. Use OIDC for AWS authentication — no hardcoded credentials."
GPT-5.4 Result
A complete workflow file with all requested steps. OIDC configuration was correct using aws-actions/configure-aws-credentials with the role ARN. Blue-green deployment used ECS service update with CODE_DEPLOY deployment controller. The smoke test was a curl-based health check. Rollback was triggered by the smoke test exit code. Well-commented, production-ready.
Claude Opus 4.6 Result
Also complete and correct. Used the same OIDC approach. The key difference was in the smoke test — Opus created a more thorough test that checked not just the health endpoint but also verified the deployment was serving the correct version by checking a /version endpoint. The rollback included a Slack notification step. However, the workflow was notably more verbose — 40% more lines for similar functionality.
Scores
| Dimension | GPT-5.4 | Opus 4.6 |
|---|---|---|
| Correctness | 10 | 10 |
| Code quality | 9 | 9 |
| Efficiency | 9 | 7 |
| Total | 28 | 26 |
Winner: GPT-5.4
For DevOps scripting, GPT-5.4's conciseness is an advantage. The workflow is easier to maintain and modify. Opus's additions (Slack notification, version verification) are nice but were not requested and added complexity. GPT-5.4 leads on Terminal-bench (75.1% vs 65.4%), and this advantage shows in terminal-oriented tasks.
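For reference, the OIDC setup both models produced follows the standard GitHub Actions shape. A hedged fragment below; the role ARN, region, and job name are placeholders, not values from either output. The `id-token: write` permission is what lets the runner request an OIDC token for AWS to verify, removing the need for stored credentials.

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # placeholder
          aws-region: us-east-1
```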
The Final Scoreboard
| Task | GPT-5.4 | Opus 4.6 | Winner |
|---|---|---|---|
| 1. REST API endpoint | 28 | 27 | GPT-5.4 |
| 2. React component | 28 | 26 | GPT-5.4 |
| 3. SQL query | 26 | 27 | Opus 4.6 |
| 4. Debug race condition | 22 | 27 | Opus 4.6 |
| 5. Code review | 25 | 28 | Opus 4.6 |
| 6. Test suite | 28 | 25 | GPT-5.4 |
| 7. Refactor module | 22 | 27 | Opus 4.6 |
| 8. Documentation | 27 | 26 | Tie |
| 9. Architecture design | 23 | 27 | Opus 4.6 |
| 10. DevOps script | 28 | 26 | GPT-5.4 |
| Total | 257 | 266 | Opus 4.6 |
Final score: Claude Opus 4.6 wins 266 to 257.
But the aggregate score hides the real story.
The Pattern That Matters More Than the Score
Look at where each model wins:
GPT-5.4 wins on:
- API endpoints (well-defined, scoped tasks)
- React components (boilerplate with clear specs)
- Test writing (comprehensive coverage from a spec)
- DevOps scripts (terminal-oriented, concise output)
Claude Opus 4.6 wins on:
- SQL edge cases (catching subtle data bugs)
- Debugging (understanding root causes in complex systems)
- Code review (finding security and correctness issues)
- Refactoring (handling cross-file dependencies)
- Architecture (deep reasoning about tradeoffs)
The pattern is clear: GPT-5.4 is the faster, cheaper, better model for well-defined coding tasks. Claude Opus 4.6 is the deeper, more careful model for tasks requiring reasoning across complexity.
This matches what DataCamp's analysis found: GPT-5.4 is the best all-around model while Opus 4.6 excels specifically at agentic and deep-coding tasks.
The Cost Factor
The score gap (9 points) is relatively small. The cost gap is not.
| Metric | GPT-5.4 | Claude Opus 4.6 |
|---|---|---|
| Input pricing | $2.50/MTok | $15/MTok |
| Output pricing | $15/MTok | $75/MTok |
| Speed | 73.4 tok/s | 40.5 tok/s |
| Context window | 1M (surcharge >272K) | 1M (flat pricing) |
| Tool search savings | ~47% token reduction | N/A |
For this 10-task test, the total API cost was approximately $4.20 for GPT-5.4 and $31.50 for Opus 4.6. That is a 7.5x cost difference for a 3.5% quality gap.
For a team running hundreds of AI-assisted coding tasks per day, the math strongly favors GPT-5.4 for the majority of work, with Opus reserved for the high-stakes 10-20% where its reasoning depth makes a material difference.
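That "reserve Opus for the high-stakes slice" math can be made concrete. A back-of-envelope sketch, assuming purely for illustration 20k input and 4k output tokens per task, 200 tasks a day, and an 85/15 split in the hybrid case; the per-MTok prices are the published ones quoted above.

```python
# ($/MTok input, $/MTok output) at published pricing.
GPT54 = (2.50, 15.00)
OPUS46 = (15.00, 75.00)

def task_cost(pricing, in_tok=20_000, out_tok=4_000):
    """Cost of one task at assumed (illustrative) token counts."""
    in_price, out_price = pricing
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

TASKS_PER_DAY = 200
all_gpt = TASKS_PER_DAY * task_cost(GPT54)                # $22.00/day
all_opus = TASKS_PER_DAY * task_cost(OPUS46)              # $120.00/day
hybrid = 170 * task_cost(GPT54) + 30 * task_cost(OPUS46)  # $36.70/day
```

Under these assumptions the hybrid split costs under a third of all-Opus while still routing the complex 15% of work to the stronger reasoner.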
The Smart Strategy: Use Both
Most working developers in 2026 are not choosing one model — they are choosing when to use each. The pattern that emerged from this test matches what we use at ZBuild:
Daily driver: GPT-5.4 (via Codex CLI or API)
- Writing new endpoints, components, and scripts
- Generating tests from specs
- Quick debugging on isolated issues
- DevOps and CI/CD automation
Heavy lifter: Claude Opus 4.6 (via Claude Code or API)
- Cross-file refactoring with complex dependencies
- Reviewing security-critical code
- Architectural design sessions
- Debugging non-obvious issues in large codebases
This two-model approach captures 95% of both models' strengths while keeping costs manageable. The Portkey guide to choosing between these models recommends the same hybrid approach.
What the Benchmarks Say (for Context)
The task-by-task results above align with the formal benchmarks:
| Benchmark | GPT-5.4 | Opus 4.6 | What It Measures |
|---|---|---|---|
| SWE-bench Verified | ~80% | 80.8% | Real GitHub issue resolution |
| SWE-bench Pro | 57.7% | ~46% | Harder, stricter coding tasks |
| Terminal-bench 2.0 | 75.1% | 65.4% | Terminal and system tasks |
| HumanEval | 93.1% | 90.4% | Function-level code generation |
| GPQA Diamond | 92.0-92.8% | 87.4-91.3% | Expert-level reasoning |
| ARC-AGI-2 | 73.3% | 68.8-69.2% | Novel reasoning |
Sources: MindStudio benchmarks, Evolink analysis, Anthropic
GPT-5.4 leads on most benchmarks. Opus 4.6 leads on SWE-bench Verified — the benchmark most closely tied to real-world bug fixing — which explains its advantage on debugging and refactoring in my tests.
The Verdict
If you can only choose one model: GPT-5.4. It handles 80% of coding tasks at equal or better quality, costs 6-7x less, and is 80% faster. The 20% of tasks where Opus is better (debugging, refactoring, architecture) can often be handled with more detailed prompting on GPT-5.4.
If you can use both: Do it. GPT-5.4 for daily coding, Opus 4.6 for complex work. This is not a compromise — it is the optimal strategy.
If cost does not matter and you want maximum quality on every task: Claude Opus 4.6. It won the overall score and its wins were on the tasks where quality matters most (bugs cost more than boilerplate).
The results were not what I expected because I assumed the more expensive model would dominate. It did not. The two models have genuinely different strengths, and the best strategy is knowing which strength you need for the task in front of you.
Sources
- OpenAI — Introducing GPT-5.4
- OpenAI — API Pricing
- Anthropic — Introducing Claude Opus 4.6
- Anthropic — Claude Pricing
- MindStudio — GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro Benchmarks
- MindStudio — Which AI Model Is Right for Your Workflow
- Portkey — GPT-5.4 vs Claude Opus 4.6 Guide
- DataCamp — GPT-5.4 vs Claude Opus 4.6 for Agentic Tasks
- Artificial Analysis — GPT-5.4 vs Claude Opus 4.6
- Bind AI — GPT-5.4 vs Claude Opus 4.6 for Coding
- Evolink — SWE-bench Verified 2026: Claude vs GPT
- DEV Community — ChatGPT vs Claude for Coding 2026
- Claude 5 — Opus 4.6 Benchmark Analysis