🏭 Google’s Full-Stack Offensive
Field notes from the AI trenches - what actually matters this week
Welcome back to The Prompt Factory. This week, Google made its move - not just one product launch, but a coordinated offensive across the entire AI stack. From frontier models to edge devices to developer tools to chip wars, they’re positioning for 2026.
Meanwhile, the industry is choosing collaboration over lock-in, Anthropic is being refreshingly honest about browser automation risks, and Karpathy reminds us that agents are a decade-long project, not a quarterly deliverable.
We’re skipping next week for the holidays. We return on Sunday 4th January for a New Year Special Edition - a deep dive into the year that was 2025 and the year ahead for 2026.
Let’s dive in for the last Prompt Factory of 2025.
🚀 Google’s Gemini 3 Ecosystem Goes Live
What happened
Google released Gemini 3 Flash, a frontier-level model that delivers Pro-grade reasoning at Flash-level speed. It outperforms Gemini 2.5 Pro on most benchmarks while running 3x faster. Priced at $0.50/1M input tokens and $3/1M output tokens - roughly two-thirds more than 2.5 Flash ($0.30/$2.50) on input, but a fraction of 3 Pro pricing ($2/$12) for comparable performance.
What it does
Hits 90.4% on GPQA Diamond (PhD-level reasoning questions)
Processes video, images, and audio in real-time
Handles 128K token context windows
Now the default model in the Gemini app and rolling out across Google Search
Why you should care
This is Google’s bid to make frontier intelligence a commodity. When Pro-level reasoning comes at Flash-level prices and speeds, the entire AI economics shift. Companies like JetBrains, Figma, Cursor, and Replit are already processing over 1 trillion tokens per day on the API.
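The pricing arithmetic is worth sanity-checking on your own workload mix, because the premium over 2.5 Flash shrinks as your jobs get more output-heavy. A quick sketch using the per-million-token rates quoted above (the model names here are just labels for this comparison, not official API identifiers):

```python
# Per-million-token prices quoted above: (input, output) in USD.
PRICES = {
    "gemini-2.5-flash": (0.30, 2.50),
    "gemini-3-flash": (0.50, 3.00),
    "gemini-3-pro": (2.00, 12.00),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for one job, given its token counts."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# Example: a summarisation-style job, 800K input tokens, 100K output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 800_000, 100_000):.2f}")
```

On that mix, 3 Flash comes out around 43% dearer than 2.5 Flash ($0.70 vs $0.49) but a quarter of the 3 Pro cost ($2.80) - which is the whole commoditisation argument in three numbers.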
The pattern
Google also released FunctionGemma (270M parameters optimized for edge function calling), T5Gemma 2 (first multimodal encoder-decoder models), and Conductor (context-driven development for their CLI). This isn’t a product launch - it’s a full-stack offensive. From edge devices to enterprise workflows, Google is positioning itself across every tier of the AI stack.
🤝 The Great AI Interoperability Push
What happened
Google announced comprehensive support for the Model Context Protocol (MCP), an open standard originally created by Anthropic for connecting AI systems to tools and data sources. Meanwhile, Anthropic transformed their skills mechanism into an open Agent Skills standard, now adopted by Cursor, Amp, Letta, Goose, and others.
Why you should care
This is the industry choosing ecosystem growth over vendor lock-in. In March 2025, OpenAI adopted MCP across ChatGPT. In December, Anthropic donated MCP to the Agentic AI Foundation under the Linux Foundation, with support from Google, Microsoft, AWS, and Bloomberg. Some estimates suggest 90% of organizations will use MCP by year’s end.
The Agent Skills standard is even more elegant - Simon Willison calls it “deliciously tiny,” a specification you can read in minutes that lets you package organizational knowledge into portable, version-controlled folders that work across multiple AI platforms.
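Concretely, a skill is just a folder containing a SKILL.md file: a short YAML frontmatter block naming and describing the skill, followed by plain markdown instructions the agent pulls in when the skill is relevant. A minimal sketch (the release-notes task is our invented example, not something from the spec):

```markdown
---
name: release-notes
description: Drafts release notes from merged PR titles in our house style.
---

# Release notes

When asked to draft release notes:

1. Group changes under "Features", "Fixes", and "Internal".
2. Write one plain-English line per change and link the PR.
3. Lead with user-facing impact, not implementation detail.
```

Because it is just files in a folder, the skill can live in your repo, go through code review, and travel unchanged between any tools that support the standard.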
The stakes
When major competitors agree on standards, it signals market maturity. The battle is shifting from “lock users into our platform” to “make the best tools for an open ecosystem.” That’s good for everyone building with AI.
🛠️ Simular AI: When Computer Agents Beat Humans
What happened
Simular AI, founded by former Google DeepMind researchers, achieved a 72.6% success rate on OSWorld, surpassing the ~72% human baseline. Their Agent S3 also scored 90.1% on WebVoyager and 71.6% on AndroidWorld.
What it does
Agent S3 can autonomously control browsers, computers, and smartphones. The platform offers both production-grade enterprise agents (Simular Pro) and web-based access (Simular Cloud, starting at $50/month). Over 12,000 users have joined.
Why you should care
This is what computer use agents look like when they actually work. The team open sources their tools, builds in public, and focuses on production-ready reliability for workflows with thousands to millions of steps. Featured in Wired, MIT Tech Review, and IBM Think, they’re not hyping the future - they’re shipping it now.
Why to be cautious
Competitors like “The AGI Company” claim their OSAgent achieves 76.26% on OSWorld, exceeding Simular’s performance. Benchmarks matter, but deployment reality matters more. Watch for real-world case studies, not just leaderboard positions.
🌐 Claude for Chrome: Browser Automation with Brutal Honesty
What happened
Anthropic released Claude for Chrome as a beta browser extension, enabling Claude to navigate websites, click buttons, and fill forms directly in your browser. Available to all paid subscribers.
What it does
Pulls metrics from analytics dashboards without manual exports
Organizes files in Google Drive (sort, create folders, flag duplicates)
Compares products across sites and creates comparison tables
Logs sales calls to CRM by matching calendar to Salesforce
Runs scheduled workflows (daily/weekly reports) in the background
Why to be cautious
Anthropic’s safety documentation is refreshingly blunt: “Malicious sites may hide instructions overriding user commands. Never use for financial transactions, password management, or sensitive data without supervision. Attack vectors are constantly evolving and Claude may hallucinate.”
This is prompt injection territory. Start with trusted sites only. Review sensitive actions. Pause if Claude acts unexpectedly.
Translation
Most companies would bury these warnings in legal disclaimers. Anthropic puts them front and center. That’s how you build trust when shipping powerful, risky tools. The feature is useful - just treat it like you’d treat giving someone remote access to your computer, because that’s essentially what it is.
🧠 Andrej Karpathy’s Reality Check: The Decade of Agents
What happened
Andrej Karpathy, OpenAI co-founder and former Tesla AI director, delivered a sobering assessment: current AI agents have insufficient intelligence, and it will take “about a decade to work through all of those issues.” He’s not saying today’s agents aren’t useful; he’s saying there’s still work to do before reality catches up with the “autonomous workforce” hype.
Key quotes from his Dwarkesh podcast
“I don’t want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don’t feel ready to supervise a team of 10 of them.”
He fears “mountains of slop accumulating across software, and an increase in vulnerabilities and security breaches” if we deploy agents before they’re ready.
Why you should care
Karpathy isn’t a skeptic - he’s a builder. His December year-in-review highlighted how the application layer around large language models matured in 2025, with tools like Cursor showing what “LLM apps” should look like: orchestrated systems for specific vertical tasks where AI “begins to feel practical, not just impressive.”
His discussion of Reinforcement Learning with Verifiable Rewards (RLVR) points to where the field needs to go: proven correctness, not just plausible output.
The lesson
This is the counterweight to aggressive agent marketing from companies claiming their AI employees are production-ready. The gap between current capabilities and reliable deployment is measured in years, not quarters. Build accordingly.
🔧 Google vs. NVIDIA: The TorchTPU Gambit
What happened
Google is developing TorchTPU, a major initiative to make its TPU chips fully compatible with PyTorch - the world’s most widely used AI framework - in partnership with Meta. The goal: erode NVIDIA’s software moat.
Why you should care
Hardware alone doesn’t win. NVIDIA’s dominance comes from CUDA, the software ecosystem deeply embedded in PyTorch. Google’s TPUs require developers to switch to JAX, creating friction that limits adoption.
If TorchTPU succeeds, it dramatically reduces switching costs for companies seeking NVIDIA alternatives. Meta gains leverage in chip negotiations. Developers get more choices.
Google’s new AI infrastructure head, Amin Vahdat, reports directly to CEO Sundar Pichai - reflecting how critical this initiative is to both internal products (Gemini, AI Search) and cloud customers.
The pattern
This is the classic platform war playbook: make your hardware work with the software everyone already uses. Google started selling TPUs externally in 2022, began placing them directly in customer data centers this year, and is now tackling the last barrier - software compatibility. Watch this closely.
🎨 ChatGPT Images Gets a Major Upgrade
What happened
OpenAI released GPT Image 1.5, a new flagship image generation model that’s up to 4x faster and significantly better at precise edits. Rolling out to all ChatGPT users and available in the API.
What it does
Precise editing: Changes only what you ask for while keeping lighting, composition, and faces consistent
Better instruction following: Handles complex multi-element compositions and dense text rendering
Creative transformations: Try-ons, style transfers, and conceptual reimaginings that preserve the essence of originals
20% cheaper in the API compared to the previous model
Why you should care
This is image generation moving from “impressive demos” to “practical tool.” The focus on editing - not just generation - means you can iterate on real images: product photography variations, branded content adjustments, design mockups. Wix is already using it for concept-to-production workflows.
The new dedicated Images experience in ChatGPT’s sidebar includes preset filters and trending prompts, plus one-time likeness upload so you can reuse your appearance across creations. It’s designed to make experimentation frictionless.
The honest take
OpenAI admits “results remain imperfect” and “there is still significant room for improvement.” But that’s the right framing: useful now, getting better. Try it for practical editing tasks rather than expecting magic.
🎯 OpenAI’s FrontierScience Benchmark
OpenAI launched FrontierScience, a new benchmark for PhD-level scientific reasoning across physics, chemistry, and biology. GPT-5.2 achieved 77% on Olympiad-style questions and 25% on research tasks.
Why you should care
This is how the industry is trying to measure whether AI can contribute to real science. But the caveats matter: theoretical physicist Carlo Rovelli’s blunt assessment is that submissions to his journal have doubled, “most of it just people who think they’re doing great science by having conversations with LLMs - and it’s horrible.”
OpenAI researcher Miles Wang acknowledged the benchmark “does not measure all the important capabilities in science” - no hypothesis generation, no experimental execution, no multimodal research involving real-world lab systems.
The reality
High benchmark scores don’t equal scientific reliability. We need evaluation frameworks measuring actual utility, not just controlled performance.
🧪 Your Weekend Project
Pick one:
Test the Agent Skills standard: Visit agentskills.io and read the specification (takes minutes). If you use Claude Code or Cursor, explore creating a simple skill for a repetitive task in your workflow.
Explore Google’s MCP servers: Check out Google’s open-source MCP implementations for Workspace, Firebase, or Cloud Run. If you’re building AI integrations, evaluate whether MCP reduces your vendor lock-in risk.
Try Gemini 3 Flash: At $0.50 per million input tokens, there’s no excuse not to experiment. Compare it to your current model on a real task. Is it actually 3x faster? Does the quality hold?
Read the Claude for Chrome safety documentation: Even if you don’t use it, Anthropic’s transparent approach to prompt injection risks is a masterclass in honest product communication. Study it before deploying any AI automation.
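If you pick the Gemini 3 Flash option, the “is it actually 3x faster?” question is easy to answer empirically. A minimal timing harness, as a sketch - the model-calling functions named in the comments are placeholders you would wire up to your own SDK clients, not real API calls:

```python
import time
from statistics import median

def time_model(call, prompt: str, runs: int = 5) -> float:
    """Median wall-clock seconds for `call(prompt)` over several runs.

    The median damps one-off network spikes better than the mean.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call(prompt)  # e.g. a thin wrapper around your SDK's generate call
        samples.append(time.perf_counter() - start)
    return median(samples)

def speedup(baseline_secs: float, candidate_secs: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_secs / candidate_secs

# Usage sketch with hypothetical wrappers (not real function names):
# flash_secs = time_model(call_gemini_3_flash, my_prompt)
# pro_secs = time_model(call_gemini_25_pro, my_prompt)
# print(f"Flash is {speedup(pro_secs, flash_secs):.1f}x faster")
```

Run it on your real prompts, not toy ones - latency ratios on a ten-token prompt tell you little about a 100K-token workload.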
📅 Coming Next: New Year Special Edition
We’re skipping next week for the holidays and return in the New Year with something different. We’re kicking 2026 off with a special edition looking back at 2025 and ahead to 2026, using two guides to help us separate the noise from the real signals of progress. Their overarching theme: “The era of AI evangelism is giving way to an era of AI evaluation.”
It’s the perfect way to start the new year: stepping back from the weekly news cycle to ask where this is all heading.
See you in 2026 for the deep dive.
🏝️ About Barnacle Labs
At Barnacle Labs we build AI systems that actually ship. From the National Cancer Institute’s NanCI app to AI systems deployed across biotech and enterprise clients, we’re the “breakthroughs, not buzzwords” team.
Got an AI challenge that’s stuck? Reply to this email - let’s talk.
The voices worth listening to in AI are the ones building, not just talking.

