🤿 DEEP DIVE: Not AGI, But Useful — A Measured Look at GPT-5
The livestream was low theatre; the expectations weren't. Here's a practical look at GPT-5's reliability, deference, and day-to-day performance.
TL;DR: GPT-5 arrived with high expectations, but the meaningful gains are enterprise-centric: fewer hallucinations, stronger instruction-following and more reliable tool use. Those move the needle for AI agent builders; for casual ChatGPT use, the impact is noticeable but may feel a little more muted. For those who care about reliability and consistency, however, it's a substantial upgrade.
📣 Hype or not
Was there hype around GPT-5? I'd call it anticipation rather than hype. There were no billboards, pop stars, or pyrotechnics. OpenAI's pre-launch signalling was relatively light—a handful of Sam Altman tweets and a restrained livestream with a small panel huddled around a laptop. The spectacle was minimal, so very definitely not the Hollywood version of hype.
The intensity came from elsewhere: brand gravity, the broader AI moment, a slow-burn build-up to the announcement, and the semantics of versioning. The jump to x.0 suggests a step change, signalling audiences to expect a leap before a single demo had run. That context acted as a force multiplier: even one tweet carried outsized weight.
When reality falls short of myth, backlash follows: "muted", "overhyped", "disappoints". Some persistent critics—for example, Gary Marcus—cast GPT-5 as evidence of an AI stall. The reality is messier than either end of the debate would suggest. This post offers a (hopefully) measured, practitioner's view of what changed, what's significant and what's not.
🚫 Not AGI
First things first: GPT-5 is not AGI. It's an evolution of the large language models we've been using—smarter, more accurate, more polished—but not a leap to an autonomous, self-aware entity. By itself it has no goals, no persistence, and no agency; it generates text in response to inputs. In products, it can be wired to tools or workflows, but those actions are designed and constrained by humans. The risk profile comes from deployments and incentives, not from a model that "decides" to act on its own.
The term AGI is so loaded that it often obscures more than it clarifies. Even industry leaders are cooling on it: Sam Altman called AGI "not a super useful term" in a CNBC interview, a sentiment echoed by others in the field. Dario Amodei of Anthropic likewise dislikes the term, preferring "powerful AI" instead.
So rather than drawing a binary line in the sand—"AGI: yes/no"—it's more honest to view progress as a gradual broadening of capabilities and reliability. Those stepwise gains will likely shape society long before any universally agreed "AGI moment" arrives. That's the lens I'll use in this piece: GPT-5 is an important step along this path. In fact, no model release has ever represented a massive leap; gains are always incremental. To expect anything else from a single release was always unrealistic.
🔀 Smart picker or black box? The GPT-5 router
Despite the name, GPT-5 isn't a single model—it's a family of models with different capabilities, reasoning capacities, and price points.
Before GPT-5, the lineup was already a tangle: GPT-4o, o4-mini, o3, and more, each with quirks and trade-offs. Add adjustable reasoning effort and you had a recipe for confusion. Unless you already knew the difference between GPT-4o and o4-mini—names that mean nothing to most people—picking the right one was guesswork.
OpenAI's answer is the GPT-5 model router: an automatic selector that decides which variant to use based on your prompt's content, complexity and explicit hints (adding "think hard about this" to your prompt encourages use of a more capable GPT-5 variant). In ChatGPT, you can now just pick "GPT-5" and let the router choose the variant for you. For most people, that's a lot simpler and the right answer.
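To make the routing idea concrete, here's a toy sketch. The real selector is proprietary and certainly far more sophisticated; the heuristics, variant names and length threshold below are invented purely to illustrate the shape of the decision.

```python
def route(prompt: str) -> str:
    """Pick a GPT-5 variant for a prompt.

    A toy approximation of the routing idea. The real router is
    proprietary; these heuristics are invented for illustration only.
    """
    wants_reasoning = any(
        hint in prompt.lower()
        for hint in ("think hard", "step by step", "prove", "analyse")
    )
    if wants_reasoning or len(prompt) > 2000:
        return "gpt-5-thinking"  # slower, more capable variant
    return "gpt-5-fast"          # quick, lighter variant

print(route("please think hard about this contract clause"))  # gpt-5-thinking
print(route("proofread this sentence"))                       # gpt-5-fast
```

The point of the sketch is simply that routing is a trade-off made on your behalf: a cheap classification step decides how much compute your prompt deserves before the model ever sees it.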
ChatGPT is OpenAI's consumer product, and some simplification is reasonable and indeed required—it's a balance between power and approachability.
However, for experienced users the pain point was never choosing a model, and becoming reliant on a routing engine I don't always agree with is frustrating. I already know which jobs call for which engine, and typing "please think hard about this" feels clunky when I could just select the model directly.
OpenAI has been continuously responding to feedback and has tweaked how the router works and the level of manual control they provide.
At the time of writing, ChatGPT provides the following options:
Auto — GPT-5 with the model router deciding for you
Fast — a quicker, lighter GPT-5 (roughly GPT-4o)
Thinking — slower but more capable (roughly o4-mini/o3)
Pro (Pro subscribers only) — the most advanced GPT-5 variant, with extra reasoning effort (roughly o3-pro)
Legacy (for paying users) — access to the old GPT-4o model.
If you're on the ChatGPT free tier, you get Auto only. That's sensible: most casual queries are simple and can be routed to the cheapest model, containing costs without forcing people to pick between obscure variants. By the way, if you're not paying, access to anything is a luxury!
As a Plus subscriber, I manually select Thinking nearly all the time. It's a lot better than Fast and, unlike earlier "thinking" models, still acceptably quick (typically responding in anywhere from a few seconds to 30). If I'm in a hurry, Fast is dramatically quicker and responds virtually instantly.
The current rate limit for Thinking is 3,000 messages per week, which equates to roughly 430 a day—a big jump from the old 100/day o3 cap. For most people that's ample, which means you can use a better-than-o3 model for essentially everything, every day. That's quite something.
Interestingly, the router sends simple editorial requests—"please proofread this," "make this paragraph flow better"—to the Fast variant. It's competent, but Thinking is noticeably better even on these: cleaner prose and multiple options (e.g., an analytical tone alongside a punchier one).
In fact, there's almost no reason not to use Thinking—which is perhaps another reason OpenAI favours the router. Left to my own devices, I'm almost certainly consuming more compute for my $20/month than I would if the router chose for me. As ever, commercial motives and consumer experience are intertwined. Overall, I find the inevitable compromises here to be reasonable. OpenAI reacted quickly to user comments and refined the initial router experience, which I view as positive and evidence of a company that's listening to its users.
✨ Reliability over razzle-dazzle
The sharpest criticism of today's models, and the biggest brake on high-stakes use, is their tendency to produce confident falsehoods. We call these "hallucinations": when the model fills gaps rather than admitting uncertainty. At the trivial end they're embarrassing; at the serious end they create legal exposure and erode brand trust.
Mitigations exist, but none are perfect. That's why hallucination risk dominates production design, especially in regulated sectors. Adoption in finance and healthcare is real, but it comes with tighter scopes, heavier governance, and explicit human controls.
Reliability also hinges on instruction-following. Real applications use layered prompts—policies, schemas, examples, edge cases. The more complex the prompt, the more likely the model is to omit steps, drift from a required format, or prioritise one instruction over another.
For casual ChatGPT use, these issues are a nuisance: you can quickly rephrase your prompt and move on. For enterprise builders calling the model's API, a hallucinated claim or a missed instruction is a hard failure that can't easily be self-corrected. Together, hallucination and instruction-following drift remain the main brakes on production deployments: they sap trust, raise QA costs, and force teams to narrow use cases.
"GPT-5's responses are ~45% less likely to contain a factual error than GPT-4o, and when thinking, GPT-5's responses are ~80% less likely to contain a factual error than OpenAI o3." — OpenAI
OpenAI says GPT-5, whilst not resolving these issues outright, makes meaningful inroads—fewer fabrications and tighter adherence to prompt instructions. 45% and 80% are significant numbers. In fact, this alone is absolutely H-U-G-E! As an enterprise AI guy, I'd willingly take this feature alone, as it has the potential to open up use cases that were until now impractical.
💬 Conversational style
Writing style is personal. Some people want warmth, colour, a touch of cheer and to feel flattered; others prefer crisp, straight-to-the-point prose. My comments on GPT-5's writing style therefore reveal my personal preferences, which are not necessarily yours.
Models are often tuned to be agreeable—sometimes too much so. That's where sycophancy creeps in: the model flatters or agrees even when you're wrong, because it's optimising for "agreeable".
I see this across systems. Claude Code, for instance, will happily open with "You're absolutely right…" even when I'm absolutely not. In my (decidedly unscientific) side-by-sides, GPT-5 sits lower on the sycophancy scale than Claude, though it could still be more willing to challenge me when I'm wrong. Agreeable models might feel warm and fuzzy, but they're less useful if the unvarnished truth is what I need. This does raise an interesting question: would vendors win the social-media popularity contest if their models were completely honest?
On tone, some say GPT-5 feels less "warm" than earlier GPTs. I do see a shift: many fewer flowery fillers, less chirpy praise, and no random emoji. It's more concise, a bit drier—and, mercifully, it no longer worships the ground I walk on. It does still love an em-dash more than this British writer would, but that's a quirk I can live with.
Net: reduced sycophancy plus a toned-down style can read as colder to some, but just as easily as "a bit more normal" to others. On identical prompts, Claude tends to be a lot wordier and less specific; GPT-5 gets to the point quickly. Too clinical? If you value efficiency, probably not.
If the default isn't your vibe, ChatGPT now offers Personalities in Settings (e.g., Default, Cynic, Robot, Listener, Nerd). It's labelled experimental, but it fits the reality that tastes differ: I prefer a straight-to-the-point persona; others may want a warmer, emoji-rich one. The upside is a shift toward models that can tune their behaviour to different user preferences.
These style changes do make ChatGPT-generated text a lot more difficult to detect. Anecdotally, I've heard that "AI detectors" are really struggling with GPT-5. That might be a challenge for some, but possibly a signal of a less robotic style.
Bottom line: I like GPT-5's voice. It's shed much of the tell-tale "ChatGPT style" that made earlier outputs easy to spot. But style is personal—only you can decide if it fits how you want it to sound.
📊 Model performance
Benchmarks are a good place to start when examining performance—so long as you treat them as directional, not definitive, and validate on your own stack.
Coding. On SWE-bench Verified, GPT-5 is reported at 74.9%, versus 69.1% for o3 and 74.5% for Claude Opus 4.1—firmly in the leading pack rather than miles ahead. Early hands-on impressions match that: GPT-5 is very good, but broadly comparable to many developers' current favourite (Claude).
Science. In domains where models push toward PhD-level reasoning, gains look incremental but real. On GPQA Diamond, GPT-5 scores 87.3% (o3: 83.3%), with GPT-5 Pro nudging that to 89.4%. Better numbers, yes—just not a step-change.
Agentic tool use. The standout is GPT-5's 96.7% on a tool-calling benchmark that stresses multi-step planning, reliable tool execution, and end-to-end completion (τ²-bench, telecom). OpenAI attributes this to improved "tool intelligence"—chaining dozens of calls, in sequence and in parallel, without losing state. If this translates to production workloads, it's a big deal for agent developers.
Enterprise summary. Because so much enterprise AI now hinges on agents—planning, tool use, retries, error handling—τ²-bench is the more telling signal. If your workloads involve long-running, tool-heavy flows, GPT-5 may offer a meaningful bump over prior models. Still, measure it on your own tools, data, and guardrails; benchmarks don't always tell the whole story.
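The loop that agent builders care about looks roughly like this: the model proposes a tool call, the harness executes it, and the result feeds back into shared state for the next step. A toy sketch with a scripted plan standing in for the model (tool names and the telecom-flavoured scenario are invented; a real harness would let the model choose each step):

```python
# Each tool takes the running state plus an argument and returns updates.
TOOLS = {
    "lookup_account": lambda state, arg: {"plan": "roaming-basic"},
    "upgrade_plan":   lambda state, arg: {"plan": arg, "status": "done"},
}

# Scripted stand-in for model output: a sequence of (tool, argument) calls.
PLAN = [("lookup_account", "cust-42"), ("upgrade_plan", "roaming-plus")]

def run_agent(plan):
    state = {}                # persistent state carried across calls
    for tool_name, arg in plan:
        result = TOOLS[tool_name](state, arg)
        state.update(result)  # dropping this is how agents "lose state"
    return state

print(run_agent(PLAN))  # {'plan': 'roaming-plus', 'status': 'done'}
```

"Losing state" in the sense the benchmark measures is any step where the model forgets an earlier result, calls the wrong tool, or abandons the chain partway; at dozens of chained calls, even a small per-step error rate compounds quickly, which is why a 96.7% end-to-end figure is notable.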
💸 Cost: it's cheaper
GPT-5 is priced at $1.25 per million input tokens and $10.00 per million output tokens, the same as Gemini 2.5 Pro. GPT-5 mini is priced at $0.25 input and $2.00 output, which undercuts Gemini 2.5 Flash by a small amount.
Until now, Google had carved out a notable niche as the industry's price/performance value leader. Now it's not only no cheaper, but at the lighter tier slightly more expensive than GPT-5's equivalent.
GPT-5 models are good models and they don't break the bank. You can make a case for their use either technically or financially.
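A quick back-of-envelope estimator makes the per-million-token arithmetic concrete. The prices below are those quoted at launch and may change, so verify them against OpenAI's pricing page before budgeting:

```python
PRICES = {  # (input, output) in USD per 1M tokens, assumed launch pricing
    "gpt-5":      (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate API spend for a given token volume."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# 10k requests, each roughly 2k tokens in and 500 tokens out:
print(round(cost_usd("gpt-5", 10_000 * 2_000, 10_000 * 500), 2))       # 75.0
print(round(cost_usd("gpt-5-mini", 10_000 * 2_000, 10_000 * 500), 2))  # 15.0
```

Note how output tokens dominate the bill at an 8:1 price ratio, which is one reason concise models are cheaper to operate, not just nicer to read.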
🛠️ What am I doing?
If you're curious about my own setup: over the past six months I'd moved almost entirely from ChatGPT to Claude—I found its writing style more personable and it's good at coding. But in the week since GPT-5, I've shifted back. In head-to-head comparisons on the same prompts, GPT-5 wins for me every single time, and it's not even close: Claude is markedly more verbose, while GPT-5 is concise and specific. For my workflow, that "less fluff, more signal" balance is faster and more useful.
For coding, however, I'm full-on committed to Claude Code, which has been a revelation. Nothing else I've tried touches it. And, yes, that does mean I subscribe to both ChatGPT and Claude—but given my role in the AI world, I can justify that. I'm not your average user. Most people will be forced to make a hard choice, and I'd find that a difficult one to make.
🎯 The takeaway
Here's my 5-point summary:
It's a good upgrade for ChatGPT subscribers. Plus subscribers can default to GPT-5 Thinking for virtually all work. That model's clearly ahead of GPT-4o and even a step up from o3. A PhD-level assistant for $20/month is a remarkable deal.
It's much better at enterprise things. Fewer hallucinations, better instruction-following and a leap in agentic tool-calling accuracy are the big meaningful upgrades—they are things the average ChatGPT user will mostly not notice, but which enterprise builders do care deeply about.
Pricing is very competitive. GPT-5 now sits on a par with Google's Gemini tiers, erasing Google's former "value" edge.
Less "GPT-y," which may help. Writing style is personal, but GPT-5 is less flowery and emoji-heavy, more direct. To some that's more business-like; to others, less warm.
Ignore the noise. Early confusion about the model router and legacy model access dominated online chatter, but OpenAI quickly tweaked ChatGPT to address it. Move along.
One kicker: GPT-5 is great, but mostly beside the point. Model gains are almost always iterative. The real leaps come from what we build around them: products that bind models to context, tools, and guardrails. Use whatever works—GPT-5, Claude, Gemini—the opportunity is in solutions, not model names. Case in point: Claude Code has transformed how I code without a "new model", thanks to clever innovation around one.
To give context, cast your mind back to the GPT-4 launch: a clear step up from 3.5, but my reaction at the time was "meh". Its gains showed up mainly in complex, composite prompts that few people were writing back then. Two years on, those once-obscure prompts are now mainstream. Incremental model capability nudges unlocked new patterns, products, and habits. Expect the same with GPT-5: steady model progress, but outsized gains in how we put it to work.
One final thought to leave you with: will GPT-5 make a dent in the universe? We won't know for a few years; its impact will be defined by what people build with it, not clickbait headlines or benchmark scores.
🚨 Shameless plug alert
Barnacle Labs builds AI solutions for ambitious organisations tackling complex challenges. We're the team behind the National Cancer Institute's NanCI AI-powered appâwe help you identify the right opportunities and ship solutions that deliver results.
Reply to this email with your biggest AI challenge if you'd like to talk!