
GPT-5.4 Migration Diary: What Broke, What Got Better, and What I Didn't Expect

A developer's week-by-week diary of migrating from GPT-5.3 Codex to GPT-5.4. Covers first impressions, what broke during the switch, unexpected improvements, cost impact, and practical migration advice — based on real-world production usage.

Published: 2026-03-27 · Author: ZBuild Team · Reading time: 14 min
Disclosure: This article is published by ZBuild. Some products or services mentioned may include ZBuild's own offerings. We strive to provide accurate, objective analysis to help you make informed decisions. Pricing and features were accurate at the time of writing.

Before We Start: Why I Wrote This as a Diary

Most GPT-5.4 vs GPT-5.3 articles give you a benchmark table and call it a day. That is useful for deciding whether to upgrade but completely useless for understanding what actually happens during the upgrade.

I migrated a production system — an internal developer tooling platform — from GPT-5.3 Codex to GPT-5.4 over the course of March 2026. This article documents what happened day by day, what surprised me, what broke, and what the monthly bill looks like on the other side.

If you are planning your own migration, this is the guide I wish I had.


Pre-Migration: What We Were Running on GPT-5.3 Codex

Our setup before the switch:

  • Application: An internal code review and refactoring assistant used by a 14-person engineering team
  • API integration: Direct OpenAI API calls, function calling for tool use, structured JSON outputs
  • Average daily volume: ~800 API calls, averaging 12K input tokens and 4K output tokens each
  • Monthly API cost: Approximately $1,400 on GPT-5.3 Codex pricing ($1.75 input / $14 output per MTok)
  • Context window usage: Regularly hitting 200-350K tokens; occasionally truncating at the 400K limit

We chose GPT-5.3 Codex originally because of its strong coding-specific performance and lower input token costs. It served us well for six months.


Day 1: The Swap (March 8, 2026)

The mechanical part of the migration was trivial: change `model: "gpt-5.3-codex"` to `model: "gpt-5.4"` in our API configuration. Deploy. Done.

First impression: Responses felt qualitatively different. Not necessarily better or worse, but different. GPT-5.4 was more verbose in its reasoning — providing more explanation of its choices before delivering code. For our code review tool, this was actually an improvement because reviewers wanted to understand the "why" behind suggestions.

Response speed: Noticeably faster on shorter prompts, about the same on longer ones. The official throughput figures put GPT-5.4 at 73.4 tokens per second, only modestly ahead of GPT-5.3 Codex, so the speed difference is real but not dramatic.

First problem: Within the first hour, our JSON parser broke. GPT-5.3 Codex had been returning raw JSON when asked for structured output. GPT-5.4 occasionally wrapped the JSON in a markdown code block (```json ... ```). This broke our parsing pipeline.

Fix: Added a preprocessing step to strip markdown code fences before parsing. A 10-minute fix, but it would have caused production errors if we had not been monitoring closely.
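The fix fits in a few lines. This is our own helper, not part of any SDK, and the regex assumes that when a fence appears it wraps the entire response:

```typescript
// Strip an optional markdown code fence before parsing.
// parseModelJson is our own helper, not an OpenAI SDK function.
function parseModelJson(raw: string): unknown {
  const trimmed = raw.trim();
  // Match a ```json ... ``` or bare ``` ... ``` wrapper around the whole payload.
  const fenced = trimmed.match(/^```(?:json)?\s*([\s\S]*?)\s*```$/);
  return JSON.parse(fenced ? fenced[1] : trimmed);
}
```

Raw JSON passes through untouched, so the helper is safe to leave in place even if a future model version stops emitting fences.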


Day 2-3: Function Calling Differences

Our tool used OpenAI's function calling feature to let the model invoke code analysis tools — a linter, a test runner, a dependency checker. On GPT-5.3 Codex, this worked flawlessly.

On GPT-5.4, we hit two issues:

Issue 1: Optional parameter handling. When a function parameter was an optional nested object, GPT-5.3 Codex would omit it if unnecessary. GPT-5.4 sometimes sent an empty object {} instead, which caused our validation to reject the call.

Issue 2: Tool search behavior. GPT-5.4 introduces Tool Search, which dynamically discovers available tools rather than requiring all tool definitions upfront. This is a powerful feature — OpenAI reports it reduces token usage by 47% — but it changed the timing of tool invocations. Our logging system expected tools to be called in a specific order, and GPT-5.4 sometimes reordered them.

Fix for Issue 1: Updated our Zod validation schemas to accept empty objects for optional parameters. Two hours of work.

Fix for Issue 2: Rewrote our logging to be order-agnostic. Half a day of work. Worth it, because the new approach is more robust regardless of model.
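Our production fix for Issue 1 lived inside Zod schemas, but the idea is easy to show in plain TypeScript: before validation, treat an empty object on an optional key the same as an omitted key. Function and parameter names here are illustrative, not from our codebase:

```typescript
// GPT-5.4 sometimes emits {} for optional nested parameters that
// GPT-5.3 Codex simply omitted. Normalize arguments before validation:
// an empty plain object on an optional key becomes "not provided".
function dropEmptyOptionalObjects(
  args: Record<string, unknown>,
  optionalKeys: string[],
): Record<string, unknown> {
  const out: Record<string, unknown> = { ...args };
  for (const key of optionalKeys) {
    const v = out[key];
    if (
      v !== null &&
      typeof v === "object" &&
      !Array.isArray(v) &&
      Object.keys(v as object).length === 0
    ) {
      delete out[key]; // treat {} the same as an omitted parameter
    }
  }
  return out;
}
```

Normalizing before validation keeps the schemas strict for required parameters while tolerating the new model's habit.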


Day 4-5: The Context Window Changes Everything

This was the first genuinely exciting moment. GPT-5.3 Codex had a 400K token limit. For our largest repositories, we had built an elaborate chunking system — splitting codebases into segments, running analysis on each segment, then stitching results together.

GPT-5.4 supports up to 1,050,000 tokens via the API. For Codex users, the full 1M context is available.

What this meant in practice: Our largest repository — a 280-file TypeScript monorepo — could now be loaded entirely in one context. No more chunking. No more stitched analysis with seam artifacts. The code review quality on this repository improved dramatically because the model could see cross-module dependencies that were invisible when the context was split.

The catch: Prompts exceeding 272K tokens are priced at 2x input and 1.5x output. So sending our full 280-file repo as context meant significantly higher per-call costs. We ended up building a smart context selection system that loads the full repo for cross-module tasks but uses targeted context for single-file tasks.
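The routing decision behind that system can be sketched briefly. All names are ours, the token counts would come from a real tokenizer in practice, and the rates are the list prices quoted above:

```typescript
// Pick context size per task and estimate input cost, applying
// GPT-5.4's 2x input surcharge above the 272K-token threshold.
const SURCHARGE_THRESHOLD = 272_000; // tokens
const INPUT_RATE = 2.5;              // $/MTok, GPT-5.4 standard input
const SURCHARGE_MULTIPLIER = 2;      // 2x input pricing above the threshold

function estimatedInputCostUSD(inputTokens: number): number {
  const rate =
    inputTokens > SURCHARGE_THRESHOLD ? INPUT_RATE * SURCHARGE_MULTIPLIER : INPUT_RATE;
  return (inputTokens / 1_000_000) * rate;
}

interface ReviewTask {
  crossModule: boolean;   // does the analysis need to see module boundaries?
  fullRepoTokens: number; // token count of the whole repository
  targetedTokens: number; // token count of just the files in question
}

function selectContextTokens(task: ReviewTask): number {
  // Pay for full-repo context only when the task genuinely spans modules.
  return task.crossModule ? task.fullRepoTokens : task.targetedTokens;
}
```

The point of the split is that single-file tasks, which dominate our volume, never cross the surcharge threshold.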


Week 1 Summary: The Things That Broke

By end of week one, here is a complete list of what broke or needed adjustment:

  1. JSON output formatting — Markdown code block wrapping (10-minute fix)
  2. Function calling validation — Empty objects for optional params (2-hour fix)
  3. Tool invocation ordering — Logging assumed sequential calls (half-day fix)
  4. Token counting — Our cost estimation was off because GPT-5.4 uses fewer tokens per response (updated formulas)
  5. Rate limiting — Our rate limiter was configured for GPT-5.3 Codex's limits; GPT-5.4 has different tier thresholds (configuration change)

None of these were catastrophic. All were fixable in under a day. But if you are migrating a production system, budget a full week for testing and patching.


Week 2: The Improvements Start Showing

Once the migration friction settled, the improvements became clear.

Computer Use Opened New Workflows

GPT-5.4 is the first general-purpose model with native computer-use capabilities. It can interact with desktop applications, browsers, and system tools directly.

For our use case, this enabled something we could not do with GPT-5.3 Codex: the model could now run our test suite, observe the output, and adjust its code review suggestions based on actual test results rather than static analysis alone. Previously, we had to pipe test output manually into the context. Now the model can execute and observe.

We built a new "test-aware review" mode in about three days, and it immediately caught two bugs that pure static analysis had missed.

Token Efficiency Was Real

OpenAI claims GPT-5.4 uses fewer output tokens per task. After two weeks of production data, we confirmed this: GPT-5.4 averaged 3.1K output tokens per task compared to GPT-5.3 Codex's 4.0K for equivalent tasks. That is a 22.5% reduction in output tokens.

Combined with tool search reducing input tokens, the total token consumption per task dropped by roughly 30%.

Error Reduction Was Noticeable

GPT-5.4 produces 33% fewer factual errors according to OpenAI. In our code review context, this translated to fewer false positive suggestions — the model was less likely to flag correct code as problematic. Our team's "dismiss suggestion" rate dropped from 18% to 11%.


Week 3: The Cost Picture Becomes Clear

Here is the part everyone wants to know about. After three full weeks of running GPT-5.4 in production alongside our historical GPT-5.3 Codex data, here is the cost comparison:

Daily API Costs (Average)

| Metric | GPT-5.3 Codex | GPT-5.4 |
| --- | --- | --- |
| Daily calls | ~800 | ~800 |
| Avg input tokens/call | 12,000 | 11,200 |
| Avg output tokens/call | 4,000 | 3,100 |
| Input cost rate | $1.75/MTok | $2.50/MTok |
| Output cost rate | $14.00/MTok | $15.00/MTok |
| Daily input cost | $16.80 | $22.40 |
| Daily output cost | $44.80 | $37.20 |
| Daily total | $61.60 | $59.60 |

Monthly projection (30 days of average volume): GPT-5.3 Codex was ~$1,848. GPT-5.4 projects to ~$1,788. A savings of about $60/month (3.2%) — modest but notable because GPT-5.4's nominal pricing is higher.

The savings come entirely from token efficiency. GPT-5.4 uses fewer tokens to accomplish the same tasks, which more than offsets its higher per-token prices for our workload.
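The table arithmetic is simple enough to verify directly. A quick sketch using our observed volumes and the list prices (the helper name is ours):

```typescript
// Daily API cost from call volume, per-call token averages, and $/MTok rates.
function dailyCostUSD(
  calls: number,
  inTokPerCall: number,
  outTokPerCall: number,
  inRatePerMTok: number,
  outRatePerMTok: number,
): number {
  const inputCost = ((calls * inTokPerCall) / 1_000_000) * inRatePerMTok;
  const outputCost = ((calls * outTokPerCall) / 1_000_000) * outRatePerMTok;
  return inputCost + outputCost;
}

const gpt53 = dailyCostUSD(800, 12_000, 4_000, 1.75, 14.0); // ≈ $61.60
const gpt54 = dailyCostUSD(800, 11_200, 3_100, 2.5, 15.0);  // ≈ $59.60
```

Plug in your own volumes before deciding; the crossover depends entirely on how many output tokens the new model saves you.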

Where Costs Went Up

Long-context tasks — the ones exceeding 272K tokens — cost significantly more on GPT-5.4 due to the long-context surcharge. We run about 15 of these per day (full-repo reviews). For those specific calls, costs increased by about 40%.

Where Costs Went Down

Standard tasks under 100K tokens — which make up 95% of our volume — were cheaper due to lower output token counts. This more than compensated for the long-context surcharge on the remaining 5%.


Things I Did Not Expect

1. GPT-5.4 Is More Opinionated About Code Style

GPT-5.3 Codex was relatively neutral on style — it followed whatever patterns existed in your codebase. GPT-5.4 has stronger opinions. It will suggest renaming variables for clarity, restructuring conditionals, and extracting functions — even when you only asked for a bug fix.

This is both good and annoying. Good because the suggestions are usually valid. Annoying because it adds noise to code reviews when the team just wants targeted feedback.

Our fix: Added a system prompt instruction: "Focus exclusively on correctness and security issues. Do not suggest style changes unless they impact readability enough to cause bugs."

2. The Deprecation Timeline Creates Urgency

GPT-5.2 Thinking retires June 5, 2026. If you are still on 5.2, you have three months. GPT-5.3 Codex has LTS support through February 2027, so there is less urgency there — but the writing is on the wall.

3. Tool Search Is the Sleeper Feature

I initially dismissed Tool Search as an optimization detail. It turned out to be the most impactful feature for our workflow. Instead of sending all 12 tool definitions in every API call (consuming ~3K tokens each time), GPT-5.4 dynamically discovers tools as needed. The token savings compound at our volume.

OpenAI's documentation says tool search reduced token usage by 47% in their testing. For our tool-heavy workflow, we saw about 35% — still significant.

4. The "Vibe" Changed

This is subjective and hard to quantify, but the team noticed it. GPT-5.4 feels more like working with a senior engineer — it questions assumptions, suggests alternatives, and sometimes pushes back on approaches it considers suboptimal. GPT-5.3 Codex was more compliant. Whether you consider this an improvement depends on your team's workflow. Zvi Mowshowitz's analysis calls it "a substantial upgrade" in reasoning and general capability, and we agree.


The Migration Checklist

Based on our experience, here is what I would do if I were migrating again:

Before You Switch

  • Audit your JSON parsing — check for markdown code fence handling
  • Review function calling schemas — test optional and nested parameters
  • Check your token counting and cost estimation logic
  • Verify rate limiting configuration against GPT-5.4 tier limits
  • Identify any workflows that assume tool call ordering

During the Switch

  • Deploy to a staging environment first
  • Run both models in parallel for at least 48 hours
  • Monitor for JSON formatting differences
  • Check function calling success rates
  • Compare output quality on your specific tasks

After the Switch

  • Enable tool search and measure token savings
  • Evaluate long-context tasks for the 272K pricing threshold
  • Adjust system prompts if GPT-5.4 is too opinionated for your workflow
  • Explore computer use capabilities for new workflows
  • Update cost projections with actual usage data

Should You Migrate Now?

Here is my framework:

Migrate immediately if:

  • You are on GPT-5.2 (it retires June 5)
  • You regularly hit the 400K context limit
  • You need computer use capabilities
  • You use heavy tool calling and want token savings

Migrate soon (within a month) if:

  • You want the quality improvements and can tolerate a week of integration work
  • You are building new features that benefit from 1M context
  • You want to future-proof before GPT-5.3 eventually reaches end-of-life

Stay on GPT-5.3 Codex if:

  • Your workflows are stable and cost-optimized
  • You rely on its lower input token pricing for prompt-heavy workloads
  • You want the stability of LTS support through February 2027
  • You are in a regulated environment where model changes require formal review

For our internal tools at ZBuild, the migration was worth the week of work. The 1M context window alone changed what our tool could do. But if your GPT-5.3 Codex integration is working well and you are not hitting its limits, there is no fire — plan the migration on your timeline, not OpenAI's.


Lessons for Teams Considering the Switch

If I could distill the entire migration into advice for other engineering teams, it would be these five points.

1. Budget a Full Week for Integration, Not Just the Model Swap

The model swap takes five minutes. Discovering every edge case in your integration takes a week. Our JSON formatting issue, function calling differences, and logging assumptions all surfaced under real traffic, not during unit tests. Run both models in parallel for at least 48 hours before cutting over.
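One way to run that parallel trial without touching user-facing behavior: mirror each request to both models, serve the incumbent's answer, and log divergences for offline review. `callModel` here is a hypothetical wrapper around your API client, not a real SDK function:

```typescript
// Shadow-compare two models: serve the old one, log any divergence.
type ModelCall = (model: string, prompt: string) => Promise<string>;

async function shadowCompare(
  callModel: ModelCall,
  prompt: string,
  log: (entry: { prompt: string; old: string; next: string }) => void,
): Promise<string> {
  const [oldOut, nextOut] = await Promise.all([
    callModel("gpt-5.3-codex", prompt),
    callModel("gpt-5.4", prompt),
  ]);
  if (oldOut !== nextOut) {
    // Divergences are expected; the point is to review them before cutover.
    log({ prompt, old: oldOut, next: nextOut });
  }
  return oldOut; // keep serving the incumbent model during the trial
}
```

Exact string comparison is deliberately crude; in practice you would compare parsed structures or score outputs, but even this naive version surfaced our JSON fencing issue within the first hour.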

2. Token Efficiency Offsets Higher Pricing — But Not Always

For standard tasks under 100K tokens, GPT-5.4 is genuinely cheaper despite higher per-token pricing. But if your workload is heavily skewed toward long-context tasks (above 272K tokens), you will pay more. Model the cost for your specific usage pattern before committing. The Apiyi pricing threshold guide has a useful calculator.

3. Tool Search Is Not Optional — Enable It Immediately

If you use function calling with more than 5 tools, enable tool search on day one. The token savings compound at scale. For our 12-tool setup, it saved roughly 3K tokens per call — over 800 calls per day, that is 2.4 million tokens daily, or about $6 per day in input costs.
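The back-of-envelope behind that figure, using GPT-5.4's standard input rate:

```typescript
// Tool-definition savings: ~3K tokens per call, 800 calls/day.
const tokensSavedPerCall = 3_000;
const callsPerDay = 800;
const inputRatePerMTok = 2.5; // $/MTok, GPT-5.4 standard input

const tokensSavedPerDay = tokensSavedPerCall * callsPerDay;                    // 2,400,000
const dollarsSavedPerDay = (tokensSavedPerDay / 1_000_000) * inputRatePerMTok; // ~$6
```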

4. Adjust Your Prompts for GPT-5.4's Personality

GPT-5.4 is more opinionated than GPT-5.3 Codex. If your application relies on the model following instructions precisely without editorial commentary, add explicit constraints to your system prompt. Something like "Focus on the requested task only. Do not suggest improvements or alternatives unless asked." This saved our team significant noise in code review output.

5. Plan Your GPT-5.2 Migration Now

If you have any systems still running on GPT-5.2 Thinking, the June 5, 2026 retirement is not negotiable. Do not wait until May to start migration. The integration surface between GPT-5.2 and GPT-5.4 is larger than the GPT-5.3 to GPT-5.4 gap, so expect more breakage.


GPT-5.4 vs GPT-5.3 Codex: Quick Reference Table

For teams that want the summary without the narrative, here is the key data in one place:

| Feature | GPT-5.3 Codex | GPT-5.4 |
| --- | --- | --- |
| Release date | October 2025 | March 5, 2026 |
| Context window | 400K tokens | 1,050,000 tokens |
| Input pricing | $1.75/MTok | $2.50/MTok |
| Output pricing | $14.00/MTok | $15.00/MTok |
| Long-context surcharge | None | 2x input, 1.5x output above 272K |
| Computer use | No | Yes, native |
| Tool search | No | Yes (saves ~47% tokens) |
| Error reduction | Baseline | 33% fewer factual errors |
| LTS support | Through Feb 2027 | Current model |
| Best for | Terminal-heavy, cost-sensitive work | General-purpose + agentic workflows |

One Month Later: Final Verdict

It has now been a full month on GPT-5.4. The integration issues are resolved, the team is adjusted, and the numbers are stable.

Quality: Better. Fewer false positives in code review, better cross-module analysis, and the computer use integration added a workflow that was not possible before.

Cost: Roughly equivalent for standard tasks, slightly higher for long-context tasks, but the overall monthly bill came in 3-4% lower thanks to token efficiency.

Speed: Comparable. No meaningful difference for our workload.

Stability: After the initial week of fixes, zero production issues.

The upgrade was not transformative — it was incremental but positive. GPT-5.4 is the better model for most developers in March 2026. The question is just whether the migration effort is worth it for your specific situation.

If you are building developer tools — as we do at ZBuild — staying on the current flagship model matters for keeping your product competitive. For internal tooling where stability is the priority, GPT-5.3 Codex on LTS is a perfectly valid choice through early 2027.


FAQ: Common Questions

How long does migrating from GPT-5.3 Codex to GPT-5.4 take?
The model swap itself takes minutes — just change the model parameter in your API calls. However, testing and validating your workflows takes one to two weeks. The biggest time sink is adjusting prompts that relied on GPT-5.3 Codex's behavior and verifying that tool-use integrations work correctly with GPT-5.4's new tool search feature.
Did anything break when switching from GPT-5.3 to GPT-5.4?
Yes, three things broke in our case. First, structured output formatting changed subtly — GPT-5.4 sometimes wraps JSON in markdown code blocks when GPT-5.3 returned raw JSON. Second, function calling parameter handling differed in edge cases with optional nested objects. Third, token counting estimates needed updating because GPT-5.4 uses fewer output tokens per task.
Is GPT-5.4 cheaper or more expensive than GPT-5.3 Codex?
On paper, GPT-5.4 is 43% more expensive on input tokens ($2.50 vs $1.75 per MTok) and slightly more on output ($15 vs $14 per MTok). But in practice, tool search and leaner outputs cut our total token consumption per task by roughly 30%, making the effective cost lower for most workflows. Our monthly bill came in about 3% lower after switching.
What is the biggest improvement in GPT-5.4 over GPT-5.3 Codex?
The 1M-token context window (up from 400K) is the most impactful upgrade for developers working with large codebases. Being able to load an entire repository into context eliminates the chunking and retrieval workarounds that were necessary with GPT-5.3 Codex. Native computer use is the second biggest improvement.
Should I wait to upgrade or switch immediately?
Switch now if you rely on context windows larger than 400K tokens, need computer use capabilities, or want better tool integration. Stay on GPT-5.3 Codex if your workflows are stable, cost-optimized around its pricing, and you want long-term support — GitHub has confirmed GPT-5.3 Codex LTS through February 2027.
When will GPT-5.3 Codex be deprecated?
GPT-5.3 Codex is not being deprecated soon. It is the first model in OpenAI's Long-Term Support (LTS) program and will remain available through February 4, 2027 for GitHub Copilot Business and Enterprise users. GPT-5.2 Thinking, however, retires on June 5, 2026.