🤿 DEEP DIVE: Not AGI, But Useful — A Measured Look at GPT-5
The livestream was low theatre; the expectations weren't. Here's a practical look at GPT-5's reliability, deference, and day-to-day performance.
TL;DR: GPT-5 arrived with high expectations, but the meaningful gains are enterprise-centric: fewer hallucinations, stronger instruction-following and more reliable tool use. Those move the needle for AI agent builders; for casual ChatGPT use, the impact is noticeable but may feel a little more muted. For those who care about reliability and consistency, however, it's a substantial upgrade.
📣 Hype or not
Was there hype around GPT-5? I'd call it anticipation rather than hype. There were no billboards, pop stars, or pyrotechnics. OpenAI's pre-launch signalling was relatively light—a handful of Sam Altman tweets and a restrained livestream with a small panel huddled around a laptop. The spectacle was minimal, so very definitely not the Hollywood version of hype.
The intensity came from elsewhere: brand gravity, the broader AI moment, a slow-burn build-up to the announcement, and the semantics of versioning. The jump to x.0 suggests a step change, signalling audiences to expect a leap before a single demo had run. That context acted as a force multiplier: even one tweet carried outsized weight.
When reality falls short of myth, backlash follows: "muted", "overhyped", "disappoints". Some persistent critics—for example, Gary Marcus—cast GPT-5 as evidence of an AI stall. The reality is messier than either end of the debate would suggest. This post offers a (hopefully) measured, practitioner's view of what changed, what's significant and what's not.
🚫 Not AGI
First things first: GPT-5 is not AGI. It's an evolution of the large language models we've been using—smarter, more accurate, more polished—but not a leap to an autonomous, self-aware entity. By itself it has no goals, no persistence, and no agency; it generates text in response to inputs. In products, it can be wired to tools or workflows, but those actions are designed and constrained by humans. The risk profile comes from deployments and incentives, not from a model that "decides" to act on its own.
The term AGI is so loaded that it often obscures more than it clarifies. Even industry leaders are cooling on it: Sam Altman called AGI "not a super useful term" in a CNBC interview, a sentiment echoed by others in the field. Dario Amodei of Anthropic likewise dislikes the term, preferring "powerful AI" instead.
So rather than drawing a binary line in the sand—"AGI: yes/no"—it's more honest to view progress as a gradual broadening of capabilities and reliability. Those stepwise gains will likely shape society long before any universally agreed "AGI moment" arrives. That's the lens I'll use in this piece: GPT-5 is an important step along this path. In fact, no model release has ever represented a massive leap; gains are always incremental. To expect anything else from a single release was always unrealistic.
🔀 Smart picker or black box? The GPT-5 router
Despite the name, GPT-5 isn't a single model—it's a family of models with different capabilities, reasoning capacities, and price points.
Before GPT-5, the lineup was already a tangle: GPT-4o, o4-mini, o3, and more, each with quirks and trade-offs. Add adjustable reasoning effort and you had a recipe for confusion. Unless you already knew the difference between GPT-4o and o4-mini—names that mean nothing to most people—picking the right one was guesswork.
OpenAI's answer is the GPT-5 model router: an automatic selector that decides which variant to use based on your prompt's content, complexity and explicit hints (adding "think hard about this" to your prompt encourages use of a more capable GPT-5 variant). In ChatGPT, you can now just pick "GPT-5" and let the router choose the variant for you. For most people, that's a lot simpler and the right answer.
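To make the routing idea concrete, here's a toy sketch. The real selector is proprietary and certainly far more sophisticated; the heuristics, variant names and length threshold below are invented purely to illustrate the shape of the decision.

```python
def route(prompt: str) -> str:
    """Pick a GPT-5 variant for a prompt.

    A toy approximation of the routing idea. The real router is
    proprietary; these heuristics are invented for illustration only.
    """
    wants_reasoning = any(
        hint in prompt.lower()
        for hint in ("think hard", "step by step", "prove", "analyse")
    )
    if wants_reasoning or len(prompt) > 2000:
        return "gpt-5-thinking"  # slower, more capable variant
    return "gpt-5-fast"          # quick, lighter variant

print(route("please think hard about this contract clause"))  # gpt-5-thinking
print(route("proofread this sentence"))                       # gpt-5-fast
```

The point of the sketch is simply that routing is a trade-off made on your behalf: a cheap classification step decides how much compute your prompt deserves before the model ever sees it.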
ChatGPT is OpenAI's consumer product, and some simplification is reasonable and indeed required—it's a balance between power and approachability.
However, for experienced users the pain point was never choosing a model, and becoming reliant on a routing engine I don't always agree with is frustrating. I already know which jobs call for which engine, and typing "please think hard about this" feels clunky when I could just select the model directly.
OpenAI has been continuously responding to feedback and has tweaked how the router works and the level of manual control they provide.
At the time of writing, ChatGPT provides the following options:
Auto — GPT-5 with the model router deciding for you
Fast — a quicker, lighter GPT-5 (roughly GPT-4o)
Thinking — slower but more capable (roughly o4-mini/o3)
Pro (Pro subscribers only) — the most advanced GPT-5 variant, with extra reasoning effort (roughly o3-pro)
Legacy (for paying users) — access to the old GPT-4o model.
If you're on the ChatGPT free tier, you get Auto only. That's sensible: most casual queries are simple and can be routed to the cheapest model, containing costs without forcing people to pick between obscure variants. By the way, if you're not paying, access to anything is a luxury!
As a Plus subscriber, I manually select Thinking nearly all the time. It's a lot better than Fast and, unlike earlier "thinking" models, still acceptably quick (typically responding in anywhere from a few seconds to 30). If I'm in a hurry, Fast is dramatically quicker and responds virtually instantly.
The current rate limit for Thinking is 3,000 messages per week, which equates to roughly 430 a day—a big jump from the old 100/day o3 cap. For most people that's ample, which means you can use a better-than-o3 model for essentially everything, every day. That's quite something.
Interestingly, the router sends simple editorial requests—"please proofread this," "make this paragraph flow better"—to the Fast variant. It's competent, but Thinking is noticeably better even on these: cleaner prose and multiple options (e.g., an analytical tone alongside a punchier one).
In fact, there's almost no reason not to use Thinking—which is perhaps another reason OpenAI favours the router. Left to my own devices, I'm almost certainly consuming more compute for my $20/month than I would if the router chose for me. As ever, commercial motives and consumer experience are intertwined. Overall, I find the inevitable compromises here to be reasonable. OpenAI reacted quickly to user comments and refined the initial router experience, which I view as positive and evidence of a company that's listening to its users.
✨ Reliability over razzle-dazzle
The sharpest criticism of today's models, and the biggest brake on high-stakes use, is their tendency to produce confident falsehoods. We call these "hallucinations": when the model fills gaps rather than admitting uncertainty. At the trivial end they're embarrassing; at the serious end they create legal exposure and erode brand trust.
Mitigations exist, but none are perfect. That's why hallucination risk dominates production design, especially in regulated sectors. Adoption in finance and healthcare is real, but it comes with tighter scopes, heavier governance, and explicit human controls.
Reliability also hinges on instruction-following. Real applications use layered prompts—policies, schemas, examples, edge cases. The more complex the prompt, the more likely the model is to omit steps, drift from a required format, or prioritise one instruction over another.
For casual ChatGPT use, these issues are a nuisance: you can quickly rephrase your prompt and move on. For enterprise builders calling the model's API, a hallucinated claim or a missed instruction is a hard failure that can't easily be self-corrected. Together, hallucination and instruction-following drift remain the main brakes on production deployments: they sap trust, raise QA costs, and force teams to narrow use cases.
"GPT-5's responses are ~45% less likely to contain a factual error than GPT-4o, and when thinking, GPT-5's responses are ~80% less likely to contain a factual error than OpenAI o3." — OpenAI
OpenAI says GPT-5, whilst not resolving these issues outright, makes meaningful inroads—fewer fabrications and tighter adherence to prompt instructions. 45% and 80% are significant numbers. In fact, this alone is absolutely H-U-G-E! As an enterprise AI guy, I'd willingly take this feature alone, as it has the potential to open up use cases that were until now impractical.
💬 Conversational style
Writing style is personal. Some people want warmth, colour, a touch of cheer and to feel flattered; others prefer crisp, straight-to-the-point prose. My comments on GPT-5's writing style therefore reveal my personal preferences, which are not necessarily yours.
Models are often tuned to be agreeable—sometimes too much so. That's where sycophancy creeps in: the model flatters or agrees even when you're wrong, because it's optimising for "agreeable".
I see this across systems. Claude Code, for instance, will happily open with "You're absolutely right…" even when I'm absolutely not. In my (decidedly unscientific) side-by-sides, GPT-5 sits lower on the sycophancy scale than Claude, though it could still be more willing to challenge me when I'm wrong. Agreeable models might feel warm and fuzzy, but they're less useful if the unvarnished truth is what I need. This does raise an interesting question: would vendors win the social-media popularity contest if their models were completely honest?
On tone, some say GPT-5 feels less "warm" than earlier GPTs. I do see a shift: many fewer flowery fillers, less chirpy praise, and no random emoji. It's more concise, a bit drier—and, mercifully, it no longer worships the ground I walk on. It does still love an em-dash more than this British writer would, but that's a quirk I can live with.
Net: reduced sycophancy plus a toned-down style can read as colder to some, but just as easily as "a bit more normal" to others. On identical prompts, Claude tends to be a lot wordier and less specific; GPT-5 gets to the point quickly. Too clinical? If you value efficiency, probably not.
If the default isn't your vibe, ChatGPT now offers Personalities in Settings (e.g., Default, Cynic, Robot, Listener, Nerd). It's labelled experimental, but it fits the reality that tastes differ: I prefer a straight-to-the-point persona; others may want a warmer, emoji-rich one. The upside is a shift toward models that can tune their behaviour to different user preferences.
These style changes do make ChatGPT-generated text a lot more difficult to detect. Anecdotally, I've heard that "AI detectors" are really struggling with GPT-5. That might be a challenge for some, but possibly a signal of a less robotic style.
Bottom line: I like GPT-5's voice. It's shed much of the tell-tale "ChatGPT style" that made earlier outputs easy to spot. But style is personal—only you can decide if it fits how you want it to sound.
📊 Model performance
Benchmarks are a good place to start when examining performance—so long as you treat them as directional, not definitive, and validate on your own stack.
Coding. On SWE-bench Verified, GPT-5 is reported at 74.9%, versus 69.1% for o3 and 74.5% for Claude Opus 4.1—firmly in the leading pack rather than miles ahead. Early hands-on impressions match that: GPT-5 is very good, but broadly comparable to many developers' current favourite (Claude).
Science. In domains where models push toward PhD-level reasoning, gains look incremental but real. On GPQA Diamond, GPT-5 scores 87.3% (o3: 83.3%), with GPT-5 Pro nudging that to 89.4%. Better numbers, yes—just not a step-change.
Agentic tool use. The standout is GPT-5's 96.7% on a tool-calling benchmark that stresses multi-step planning, reliable tool execution, and end-to-end completion (τ²-bench, telecom). OpenAI attributes this to improved "tool intelligence"—chaining dozens of calls, in sequence and in parallel, without losing state. If this translates to production workloads, it's a big deal for agent developers.
Enterprise summary. Because so much enterprise AI now hinges on agents—planning, tool use, retries, error handling—τ²-bench is the more telling signal. If your workloads involve long-running, tool-heavy flows, GPT-5 may offer a meaningful bump over prior models. Still, measure it on your own tools, data, and guardrails; benchmarks don't always tell the whole story.
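The loop that agent builders care about looks roughly like this: the model proposes a tool call, the harness executes it, and the result feeds back into shared state for the next step. A toy sketch with a scripted plan standing in for the model (tool names and the telecom-flavoured scenario are invented; a real harness would let the model choose each step):

```python
# Each tool takes the running state plus an argument and returns updates.
TOOLS = {
    "lookup_account": lambda state, arg: {"plan": "roaming-basic"},
    "upgrade_plan":   lambda state, arg: {"plan": arg, "status": "done"},
}

# Scripted stand-in for model output: a sequence of (tool, argument) calls.
PLAN = [("lookup_account", "cust-42"), ("upgrade_plan", "roaming-plus")]

def run_agent(plan):
    state = {}                # persistent state carried across calls
    for tool_name, arg in plan:
        result = TOOLS[tool_name](state, arg)
        state.update(result)  # dropping this is how agents "lose state"
    return state

print(run_agent(PLAN))  # {'plan': 'roaming-plus', 'status': 'done'}
```

"Losing state" in the sense the benchmark measures is any step where the model forgets an earlier result, calls the wrong tool, or abandons the chain partway; at dozens of chained calls, even a small per-step error rate compounds quickly, which is why a 96.7% end-to-end figure is notable.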
💸 Cost: it's cheaper
GPT-5 is priced at $1.25 per million input tokens and $10.00 per million output tokens, the same as Gemini 2.5 Pro. GPT-5 mini is priced at $0.25 input and $2.00 output, which undercuts Gemini 2.5 Flash by a small amount.
Until now, Google had carved out a notable niche as the industry's price/performance value leader. Now it's not only no cheaper, but at the lighter tier slightly more expensive than GPT-5's equivalent.
GPT-5 models are good models and they don't break the bank. You can make a case for their use either technically or financially.
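A quick back-of-envelope estimator makes the per-million-token arithmetic concrete. The prices below are those quoted at launch and may change, so verify them against OpenAI's pricing page before budgeting:

```python
PRICES = {  # (input, output) in USD per 1M tokens, assumed launch pricing
    "gpt-5":      (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate API spend for a given token volume."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 10k requests, each roughly 2k tokens in and 500 tokens out:
print(round(cost_usd("gpt-5", 10_000 * 2_000, 10_000 * 500), 2))       # 75.0
print(round(cost_usd("gpt-5-mini", 10_000 * 2_000, 10_000 * 500), 2))  # 15.0
```

Note how output tokens dominate the bill at an 8:1 price ratio, which is one reason concise models are cheaper to operate, not just nicer to read.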
🛠️ What am I doing?
If you're curious about my own setup: over the past six months I'd moved almost entirely from ChatGPT to Claude—I found its writing style more personable and it's good at coding. But in the week since GPT-5, I've shifted back. In head-to-head comparisons on the same prompts, GPT-5 wins for me every single time, and it's not even close: Claude is markedly more verbose, while GPT-5 is concise and specific. For my workflow, that "less fluff, more signal" balance is faster and more useful.
For coding, however, I'm full-on committed to Claude Code, which has been a revelation. Nothing else I've tried touches it. And, yes, that does mean I subscribe to both ChatGPT and Claude—but given my role in the AI world, I can justify that. I'm not your average user. Most people will be forced to make a hard choice, and I'd find that a difficult one to make.
🎯 The takeaway
Here's my 5-point summary:
It's a good upgrade for ChatGPT subscribers. Plus subscribers can default to GPT-5 Thinking for virtually all work. That model's clearly ahead of GPT-4o and even a step up from o3. A PhD-level assistant for $20/month is a remarkable deal.
It's much better at enterprise things. Fewer hallucinations, better instruction-following and a leap in agentic tool-calling accuracy are the big meaningful upgrades—they are things the average ChatGPT user will mostly not notice, but which enterprise builders do care deeply about.
Pricing is very competitive. GPT-5 now sits on a par with Google's Gemini tiers, erasing Google's former "value" edge.
Less "GPT-y," which may help. Writing style is personal, but GPT-5 is less flowery and emoji-heavy, more direct. To some that's more business-like; to others, less warm.
Ignore the noise. Early confusion about the model router and legacy model access dominated online chatter, but OpenAI quickly tweaked ChatGPT to address it. Move along.
One kicker: GPT-5 is great, but mostly beside the point. Model gains are almost always iterative. The real leaps come from what we build around them: products that bind models to context, tools, and guardrails. Use whatever works—GPT-5, Claude, Gemini—the opportunity is in solutions, not model names. Case in point: Claude Code has transformed how I code without a "new model", thanks to clever innovation around one.
To give context, cast your mind back to the GPT-4 launch: a clear step up from 3.5, but my reaction at the time was "meh". Its gains showed up mainly in complex, composite prompts that few people were writing back then. Two years on, those once-obscure prompts are now mainstream. Incremental model capability nudges unlocked new patterns, products, and habits. Expect the same with GPT-5: steady model progress, but outsized gains in how we put it to work.
One final thought to leave you with: will GPT-5 make a dent in the universe? We won't know for a few years; its impact will be defined by what people build with it, not clickbait headlines or benchmark scores.
🚨 Shameless plug alert
Barnacle Labs builds AI solutions for ambitious organisations tackling complex challenges. We're the team behind the National Cancer Institute's NanCI AI-powered appâwe help you identify the right opportunities and ship solutions that deliver results.
Reply to this email with your biggest AI challenge if you'd like to talk!