📣 Claude can now edit your PowerPoints
Field notes from the AI team at Barnacle Labs: what's caught our attention this week?
This week's newsletter touches on why even setting temperature to zero doesn't make LLMs deterministic (spoiler: it's not just floating-point maths). Meanwhile, Yann LeCun thinks we need a completely different set of AI capabilities, so perhaps the question of reproducibility doesn't actually matter.
Elsewhere in AI land, BlackRock is betting £500m that UK companies really do care about data residency, and someone built a terminal game where you pretend to be an AI to avoid getting voted off by actual AIs.
We're also seeing the inevitable office suite takeover continue: Claude can now edit your PowerPoints directly, which is either incredibly useful or the beginning of death by a thousand AI-generated slide decks. Meanwhile, the UAE just dropped a reasoning model that's supposedly hitting 2,000 tokens per second, trained entirely on open data, proving that the AI race isn't just a US-China duopoly anymore.
And yes, there are two new mystery models on the stealth leaderboards. The community thinks they're either Gemini 3.0 or Grok 4.0, but honestly, at this point the model release cycle is so relentless that by the time you read this, there will probably be three more.
Let's dig in…
✨ Creative Inspiration Corner
Your job in this LLM game is to destroy the bots!
Link: GitHub
Our Take: This one is fun: "Among LLMs" turns your terminal into a chaotic chatroom playground where you're the only human among a bunch of eccentric AI agents, dropped into a creative scenario. Each participant, including you, has a persona and a backstory, and all the AI agents share one common goal: determine and eliminate the human, through voting. Your mission: stay hidden, manipulate conversations, and turn the bots against each other with edits, whispers, impersonations, and clever gaslighting. Outlast everyone, turn chaos to your advantage, and make it to the final two.
📚 Papers that change strategic thinking
Why do LLMs not produce the same reply every time?
Link: Thinking Machines
Our Take: You've probably noticed that ChatGPT and other AI assistants give slightly different responses each time you ask the same question. While many experts blame this inconsistency on parallel floating-point arithmetic alone, this research shows the real story is more nuanced: the kernels used in inference aren't batch-invariant, so the result of your request depends on how many other requests the server happens to batch alongside it. Even when developers try to force these systems to be completely predictable by setting temperature to zero, they still produce varying results, creating major challenges for scientists and businesses who need reliable, reproducible outputs. The post digs into the surprising technical causes and explores solutions that could make these tools more consistent and trustworthy for critical applications.
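Why does the grouping of work matter at all? Floating-point addition isn't associative, so a reduction that splits the same numbers into different-sized chunks (as GPU kernels do when the batch size changes) can produce different totals. A toy, stdlib-only sketch of that effect; the numbers are contrived purely to make the discrepancy obvious:

```python
def chunked_sum(values, chunk_size):
    """Sum `values` by first summing fixed-size chunks, then summing the
    partial results - mimicking how a kernel's reduction order depends on
    how the work happens to be split up."""
    partials = [sum(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return sum(partials)

# Same numbers, different grouping, different answer: the huge pair
# cancels exactly in one grouping but absorbs the small terms in another.
values = [0.1] * 7 + [1e16, -1e16]
print(chunked_sum(values, chunk_size=3))
print(chunked_sum(values, chunk_size=9))
```

If every intermediate grouping were fixed (a "batch-invariant" kernel), the result would be bit-identical on every run, which is the fix the post pursues.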
🧑‍💻 Developer Tools
Generate and Edit Images with Google's Nano Banana in AI SDK
Link: Vercel
Our Take: Vercel's AI SDK is extremely elegant and easy to use. It's a great way to bootstrap a lot of AI work, abstract some of the vendor-specific API differences, and simplify model interactions. It's nice to see the SDK expand past just text generation and incorporate images.
Ship directly to Cloud Run from Gemini CLI
Link: Google
Our Take: This is kind of neat: just type /deploy into Gemini CLI and it'll push your code to git and publish to Cloud Run. Or enter /security:analyze and it'll check your code for security exposures using the new Gemini CLI Security Extension.
ChatGPT supports MCP
Link: OpenAI
Our Take: ChatGPT developer mode is a beta feature that provides full Model Context Protocol (MCP) client support for tools. At last! It's interesting that OpenAI caveats this with "It's powerful but dangerous, and is intended for developers who understand how to safely configure and test connectors. When using developer mode, watch for prompt injections and other risks, model mistakes on write actions that could destroy data, and malicious MCPs that attempt to steal information."
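Under the hood, MCP is JSON-RPC 2.0: a client like ChatGPT's developer mode discovers a server's tools with tools/list and invokes them with tools/call. A minimal sketch of the message shapes (method and field names follow the MCP spec; the tool name and arguments here are invented for illustration):

```python
import json

def list_tools_request(request_id):
    """JSON-RPC request an MCP client sends to discover a server's tools."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}

def call_tool_request(request_id, name, arguments):
    """JSON-RPC request to invoke one of those tools with arguments."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

# A hypothetical tool call, serialised as it would travel over the wire:
wire = json.dumps(call_tool_request(1, "search_tickets", {"query": "refund"}))
```

OpenAI's safety caveat makes sense once you see the shape: anything the model puts in `arguments` is executed by the server, which is exactly where prompt-injected write actions become dangerous.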
🤖 Agents
Writing effective tools for agents – with agents
Link: Anthropic
Our Take: Despite many claiming that any piece of code that uses an LLM is an agent, real agents are the ones that are given a set of tools and allowed to dynamically plan when and how to use those tools. When the AI has the autonomy to decide how to use a tool, the design of that tool becomes critical: it has to be obvious for the AI to understand and use. That means just wrapping existing APIs is almost always the wrong approach, because those APIs are often too complicated for the AI to work out how to use reliably. Anthropic opens this post with the statement that "agents are only as effective as the tools we give them", something I strongly agree with. Read it to understand how to write high-quality tools and evals, and how you can boost performance by using Claude to optimize its own tools.
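To make the contrast concrete, here's a sketch using the Anthropic-style tool definition shape (name, description, input_schema): the first is a thin wrapper over a hypothetical ticketing API, the second is the kind of task-oriented redesign the post advocates. Both tools and all their fields are invented for illustration:

```python
# A thin API wrapper: terse description, leaky parameters the model
# has to guess at. Agents routinely misuse tools shaped like this.
api_wrapper_tool = {
    "name": "tickets_query_v2",
    "description": "Query endpoint.",
    "input_schema": {
        "type": "object",
        "properties": {
            "q": {"type": "string"},
            "flags": {"type": "integer"},
        },
        "required": ["q"],
    },
}

# A task-oriented tool: one clear job, self-explanatory parameter names,
# and a description that tells the model when (not just how) to use it.
task_tool = {
    "name": "find_open_support_tickets",
    "description": (
        "Search open support tickets by keyword. Use this when the user "
        "asks about the status of a customer issue. Returns at most "
        "`limit` tickets, newest first."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "keyword": {"type": "string",
                        "description": "Word or phrase to search for."},
            "limit": {"type": "integer",
                      "description": "Maximum results to return (default 5)."},
        },
        "required": ["keyword"],
    },
}
```

Same backend capability, but the second definition gives the model enough context to pick the tool and fill in its arguments without guessing.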
Custom Tools launch for Claude Code
Link: Anthropic
Our Take: Everyone thinks of Claude Code as a coding agent, but it's much more. There's an SDK available for both Python and TypeScript, which makes it a more general-purpose agent framework that can be used for anything. We've been using this in our work at Barnacle Labs; it's currently our favourite agent solution. This week Anthropic released "custom tools", which allow you to extend Claude Code's capabilities with your own functionality through in-process MCP servers, enabling Claude to interact with external services and APIs or perform specialised operations. This further cements Claude Code as a very powerful and extensible general-purpose agent framework.
Replit Agent 3 – Agents for Dummies?
Link: Replit
Our Take: For those who like the idea of AI agents, Replit's Agent 3 might be the answer. You can generate automation workflows by just prompting. It's getting some good press from what I see and appears very capable.
🔮 Model Watch
K2Think: The fastest reasoning model
Link: K2Think
Our Take: It's nice to see a model that's not from the USA or China. This one's from the UAE. A very compact and fast reasoning model: it's been benchmarked at 2,000 tokens/second, which is insanely fast. It was trained on 100% open datasets, no proprietary data at all. The paper has a host of interesting details about their technical innovations.
2 new stealth models!
Link: X
Our Take: Stealth models are now common; it's how providers release their models for real-world testing to detect any issues. Platforms like OpenRouter and AI SDK that abstract over multiple AI providers make perfect sense if you want to make a model available without telling people what it is! In AI SDK we have stealth/sonoma-sky-alpha and stealth/sonoma-dusk-alpha. Either Gemini 3.0 or Grok 4.0 seems to be the community consensus.
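Because these gateways expose an OpenAI-compatible HTTP API, trying a stealth model is just a model-id swap. A stdlib-only sketch that builds (but doesn't send) such a request; the endpoint shape follows OpenRouter's OpenAI-compatible API, the model id is the one mentioned above from AI SDK (exact ids vary by gateway), and the API key is a placeholder:

```python
import json
import urllib.request

def chat_request(model, prompt, api_key,
                 base_url="https://openrouter.ai/api/v1"):
    """Build an OpenAI-compatible chat completion request for a gateway
    like OpenRouter; swapping `model` is all it takes to try a new model."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

req = chat_request("stealth/sonoma-sky-alpha", "Who are you, really?", "PLACEHOLDER_KEY")
# urllib.request.urlopen(req) would actually send it; omitted here.
```

This indirection is exactly why stealth releases work: the gateway knows which provider is behind the id, and you don't.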
📊 Industry Intelligence
A16Z benchmarks AI-powered office tools
Link: A16Z
Our Take: AI isn't just a feature anymore; it's becoming embedded into every tool we use. From drafting emails to designing slides, researching markets, or building financial models, a new layer of "agentic" tools is emerging that resembles an AI-native Office suite. Rather than ask ChatGPT a question and copy its results into Word, we can generate that text directly. But are these tools any good? This post tries to answer that question: is it worth jumping in, or should we stick with ChatGPT and copy/paste?
BlackRock to invest £500m in UK datacentres
Link: BBC
Our Take: A lot of UK enterprises need compute infrastructure located in the UK in order to meet their data residency commitments. I've worked with companies whose discriminator between model providers was "who has a UK data centre and will guarantee me that no data will be sent outside UK borders". The truth is that not every provider can meet those requirements, and those that do are often capacity constrained. So it's good to see something being done about it.
Claude can now edit your documents
Link: Anthropic
Our Take: This seems inevitable; indeed, a couple of people asked for my help because they thought it was already possible (it wasn't). Claude can now edit and generate Word docs, PowerPoints, and Excel sheets. Now you can create a presentation with a single prompt!
🧠 AGI – are we there yet?
The Shape of AI to Come – Yann LeCun at AI Action Summit 2025
Link: YouTube
Our Take: Yann LeCun is one of AI's so-called "godfathers". He's very opinionated and critical of current LLMs' ability to reach AGI. In this talk he lays out his vision for how we'll get to AGI and the things that still need to be invented.
Amol Rajan interviews Anthropic's Dario Amodei
Link: BBC
Our Take: This is a nice interview. I particularly liked (and agree with) his comments on the impact of AI on low-level white-collar work: "If we look at entry-level, white-collar work. I think of people who work at law firms like first-year associates. There's a lot of document review. It's very repetitive, but every example is different. That's something that AI is quite good at. If you think of first year of entry-level work at a consulting company... If you think of entry-level work at a finance company, right? Doing routine analysis of financial documents. These are kind of the workhorses of entry-level white-collar labour. And yet they are things that AI is already pretty good at and AI is rapidly getting better at."
🎯 Week Ahead Priorities
If you want to do something, rather than just read…
Play "Among LLMs" - Install this fun terminal game where you're the only human trying to survive among AI agents who are hunting for you. It's a clever way to understand how AI thinks and communicates.
Create a presentation with one prompt - Try Claude's new document editing feature by asking it to create a PowerPoint presentation on any topic you're interested in.
Test drive a mystery model - If you're a developer using Vercel's AI SDK, try one of the new stealth models. See if you can guess whether it's Gemini 3.0 or Grok 4.0!
Watch Yann LeCun's AGI talk - Get a glimpse of AI's future from one of its godfathers. His skeptical take on current AI limitations is refreshingly grounded and thought-provoking.
🚨 Shameless plug alert
Barnacle Labs builds AI solutions for ambitious organisations tackling complex challenges. We're the team behind the National Cancer Institute's NanCI AI-powered app, and we help you identify the right opportunities and ship solutions that deliver results.
Reply to this email with your biggest AI challenge if you'd like to talk!

