🏭 The Prompt Factory
Field notes from the AI team at Barnacle Labs—what's caught our attention this week?
This week in AI, a study revealed that experienced developers are 19% slower with AI tools, China's "autonomous" robot footballers stumbled around like toddlers on a sugar crash, and the world's richest man just released an AI that parrots his tweets back at you.
Meanwhile, your smartwatch might soon diagnose you better than your doctor, everyone wants to build an AI-powered internet browser, and someone built an AI fact-checker with "no guarantee that it's any good."
Welcome to 2025, where every AI breakthrough comes with a plot twist.
✨ Creative Inspiration Corner
China hosts first fully autonomous AI robot football match
Link: Guardian
Our Take: Cheng Hao, the founder and chief executive of Booster Robotics, the company that supplied the robot players, said sports competitions offered the ideal testing ground for humanoid robots. He said humans could play robots in the future, although judging by Saturday’s evidence the humanoids have some way to go before they can hold their own on a football pitch.
Hugging Face launches the Raspberry Pi of Robotics
Link: Hugging Face
Our Take: Reachy Mini is an expressive, open-source robot designed for human-robot interaction and AI experimentation. It’s $299 for a tethered version, or $449 for one with battery and inbuilt Raspberry Pi. (NOTE: I just ordered one whilst writing this, I couldn’t resist!)
📜 Papers That Change Strategic Thinking
Your Smartwatch Just Got a Lot Smarter
Link: arXiv
Our Take: Remember when fitness trackers could barely count your steps accurately? Well, this research has seriously impressive results that might change how we think about wearable health tech.
The team built an AI foundation model using data from people's smartwatches and fitness trackers. But here's the clever part: instead of just looking at raw sensor readings (heart rate blips, accelerometer wiggles), they focused on behavioural patterns – things like sleep cycles, activity levels, and daily rhythms. It turns out this approach works far better: when tested on 57 different health prediction tasks, it crushed most of them. The sketch below gives a flavour of the raw-versus-behavioural distinction.
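A minimal sketch of the idea, assuming nothing about the paper's actual pipeline – the synthetic data and feature names here are our own, purely to show raw minute-level readings being distilled into daily behavioural summaries:

```python
# Illustrative only - not the paper's code. The point: raw minute-level
# sensor readings versus the daily behavioural summaries a model consumes.
import numpy as np
import pandas as pd

minutes = pd.date_range("2025-07-01", periods=7 * 24 * 60, freq="min")
raw = pd.DataFrame({
    "heart_rate": np.random.normal(70, 10, len(minutes)),  # stand-in sensor stream
    "step_count": np.random.poisson(3, len(minutes)),
}, index=minutes)

# Behavioural features: one summary row per day instead of 1,440 raw readings.
daily = pd.DataFrame({
    "resting_hr": raw["heart_rate"].resample("D").min(),
    "mean_hr": raw["heart_rate"].resample("D").mean(),
    "total_steps": raw["step_count"].resample("D").sum(),
})
# A crude circadian-rhythm feature: how much activity happens in daytime hours.
daytime_steps = raw.between_time("08:00", "20:00")["step_count"].resample("D").sum()
daily["daytime_activity_ratio"] = daytime_steps / daily["total_steps"]

print(daily)  # these behavioural rows, not the raw stream, feed the model
```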
A machine learning model using clinical notes to identify physician fatigue
Link: Nature
Our Take: This is pretty cool - they trained a model on patient notes and were able to correctly identify which doctors were fatigued when they wrote them. The signal is in the data; you just need AI to spot it.
🧑‍💻 Developer Tools
Is this Google’s version of LangChain?
Link: Google
Our Take: LangChain has become ubiquitous for building LLM apps, but many (including us at Barnacle Labs) find it overly complex. This new Google library looks like a more modern take on composability and workflows in LLM apps, with a particular focus on realtime applications - for example, those with audio/video input. One to watch; a rough sketch of the composability idea follows below.
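To be clear about what "composability" means here, a toy sketch of the pattern - small processors chained over a stream of parts. This is our own illustrative code, NOT the actual API of Google's library:

```python
# Hypothetical sketch of the composability pattern: each processor transforms
# an async stream of parts (text, audio chunks, etc.) and can be chained.
import asyncio
from typing import AsyncIterator, Callable

Processor = Callable[[AsyncIterator[str]], AsyncIterator[str]]

def chain(*processors: Processor) -> Processor:
    """Compose processors left-to-right into a single pipeline."""
    def pipeline(stream: AsyncIterator[str]) -> AsyncIterator[str]:
        for p in processors:
            stream = p(stream)
        return stream
    return pipeline

async def uppercase(stream):   # a trivial transform stage
    async for part in stream:
        yield part.upper()

async def add_prefix(stream):  # imagine an LLM or audio-model call here
    async for part in stream:
        yield f"model> {part}"

async def main():
    async def source():        # stand-in for a realtime input stream
        for part in ["hello", "world"]:
            yield part
    async for out in chain(uppercase, add_prefix)(source()):
        print(out)

asyncio.run(main())
```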
🤖 Agents & Enterprise Integration
AI Fact-checker
Link: YesNo
Our Take: This truth checker validates social media posts, research papers and the like, using an AI agent to seek out the supposed truth. It's offered with no guarantee that it's any good, and I haven't had time to validate its approach - it's quite possible there are issues. Nevertheless, I found it super interesting, because misinformation is such a big challenge these days. Can AI be used to put a warning flag on things that might not be true? Of particular interest is HOW this truth checker does its work - there may well be demons lurking underneath; a guess at the general shape follows below. I was watching a TV interview last night where someone was explaining that we all have our own version of the truth… which I noisily disagreed with - there are not multiple versions of the truth, it's just that some people are deluded!
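For illustration only, here's one plausible shape for such an agent - we have no idea whether YesNo works this way; llm() and search_web() are stubs standing in for real model and search-API calls:

```python
# Hypothetical outline of an agentic fact-check loop (not YesNo's actual code).
def llm(prompt: str) -> str:
    return "unverifiable"        # stub: call your model of choice here

def search_web(query: str, top_k: int = 5) -> list[str]:
    return []                    # stub: call a search API here

def fact_check(post: str) -> list[dict]:
    # 1. Pull out the individual checkable claims.
    claims = llm(f"List the checkable factual claims in:\n{post}").splitlines()
    results = []
    for claim in claims:
        # 2. Gather candidate evidence for each claim.
        evidence = search_web(claim)
        # 3. Ask the model for a verdict grounded in that evidence.
        verdict = llm(
            "Is this claim supported, refuted, or unverifiable "
            f"given the evidence?\nClaim: {claim}\nEvidence: {evidence}"
        )
        results.append({"claim": claim, "verdict": verdict, "sources": evidence})
    return results

print(fact_check("The moon is made of cheese."))
```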
Measuring the Impact of AI on Experienced Open-Source Developer Productivity
Link: Metr
Our Take: Surprisingly, this study finds that when developers use AI tools, they take 19% longer than without - AI makes them slower. But who were they measuring? Experienced open source developers - people who know their stuff and (because they are working on open source) obsess about code quality. I’d bet money that another study aimed at less experienced developers building closed source would show the opposite. As ever, there’s nuance behind the headlines.
🔮 Model Watch
Grok 4, the first AI that parrots the views of a single person
Link: x.ai
Our Take: Grok 4 launched this week; it's technically very impressive and tops the benchmarks. Grok 4 Heavy saturates most academic benchmarks and is the first model to score 50% on Humanity's Last Exam, a benchmark “designed to be the final closed-ended academic benchmark of its kind”. But raw performance is only part of the picture, and Grok 4 is about twice the price of OpenAI's o3.
Grok has realtime access to X data, and it's been spotted searching for its maker's political opinions when asked tricky questions. If that aligns with your world view, you'll like it. If not, you'll probably dislike it intensely. But the ability to answer ‘hot topic’ questions, regardless of ideological alignment, is actually a liability for most regular business users, who much prefer “that's not something I've been trained to answer” responses. Nobody wants their banking bot to start an ideological debate about global politics. Sometimes, boring is best.
T5Gemma: encoder-decoder Gemma models
Link: Google
Our Take: This one is super interesting to the techies amongst us. The original transformer architecture included both encoder and decoder components. I won't go into the differences here because it gets super technical very quickly, but suffice it to say that encoders and decoders have different benefits and the combination was initially seen as attractive. However, in the quest for simplicity and efficiency, most modern LLMs have become decoder-only - a solution that's proven remarkably flexible. Google's T5Gemma brings back the combination of encoder and decoder - whether this proves a useful innovation or just a distracting deviation from the mainstream is hard to tell. What it does show, however, is that model builders are actively exploring innovations in the core architecture of how models are constructed - this is just a single example from many variations. The snippet below shows the two shapes side by side.
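A minimal sketch via Hugging Face transformers, using classic T5 and GPT-2 checkpoints as stand-ins (T5Gemma's own model classes may differ):

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-decoder: the encoder reads the whole input bidirectionally, and the
# decoder generates conditioned on that representation (T5 style).
t5_tok = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = t5_tok("translate English to German: Hello world", return_tensors="pt")
print(t5_tok.decode(t5.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))

# Decoder-only: one stack does everything left-to-right (GPT-2 style), which
# is the architecture most modern LLMs, including the original Gemma, use.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
ids = gpt_tok("Hello world", return_tensors="pt")
print(gpt_tok.decode(gpt.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
```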
New MedGemma variants
Link: Google
Our Take: Medically tuned LLMs are a big area of research, and Google is one of the more active organisations in this space. Most interesting in this announcement is MedGemma 27B Multimodal, a model tuned for medical diagnosis that can read both text and images such as scans and X-rays. There's also a stand-alone model for image analysis, MedSigLIP. Interesting stuff, and it's all open source, so it can be used in a variety of contexts.
Kimi K2 from MoonshotAI
Link: MoonshotAI
Our Take: Another Chinese AI lab you've never heard of! But K2 is a super interesting open source model that I think must be the first to tip over 1T parameters in size. Mainly I'm intrigued because it looks excellent at tool calling, and therefore at agent use cases. Much as human capabilities expanded massively once we evolved to use tools, the same is true of AI. So if K2 proves to be outstanding at tool calling, it might become important; a sketch of what tool calling looks like in practice follows below.
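For context, tool calling typically follows the OpenAI-style schema below. The Moonshot base URL and model id here are our assumptions, not verified values - check their docs before using:

```python
# A hedged sketch of OpenAI-style tool calling. ASSUMPTIONS: the base_url and
# model id are illustrative guesses; get_weather is a hypothetical tool.
from openai import OpenAI

client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool we expose to the model
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="kimi-k2",  # assumed model id
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=tools,
)
# A strong tool-calling model responds with a structured call, not prose:
print(resp.choices[0].message.tool_calls)
```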
OpenAI delays its open source model
Link: x
Our Take: Sam Altman just tweeted “we planned to launch our open-weight model next week. we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us.”
Chatter on X is that this is cover for “we can't launch now because Kimi K2 is much better”. Pure speculation, but plausible.
FlexOlmo allows collaborative model developments without contributors giving up control of their data
Link: AllenAI
Our Take: AllenAI always seems to do something interesting and this is definitely that. With FlexOlmo, data owners can contribute to the development of a language model without giving up control of their data. There's no need to share raw data directly, and data contributors can decide when their data is active in the model (i.e., who can make use of it and when), deactivate data at any time, and receive attributions whenever data is used for inference. Maybe this will encourage new collaborations that wouldn't otherwise exist? The toy sketch below shows the core idea.
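A conceptual sketch only - ours, not AllenAI's implementation. The idea: each data owner contributes an expert module and keeps a switch to withdraw it at inference time:

```python
# Toy illustration of owner-controlled expert modules (not FlexOlmo's code).
import numpy as np

class Expert:
    def __init__(self, owner: str, dim: int):
        self.owner = owner
        self.weights = np.random.randn(dim, dim) * 0.01  # stand-in parameters
        self.active = True                               # owner-controlled switch

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weights

def mixture_forward(x: np.ndarray, experts: list[Expert]) -> np.ndarray:
    live = [e for e in experts if e.active]  # deactivated experts are skipped
    if not live:
        return x
    return x + sum(e.forward(x) for e in live) / len(live)

experts = [Expert("hospital_a", 8), Expert("publisher_b", 8)]
x = np.random.randn(8)
experts[1].active = False                    # publisher_b withdraws its data
print(mixture_forward(x, experts))
```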
Turn images into videos with Veo 3
Link: Google
Our Take: Google added image input to Veo 3, enabling you to upload an image and turn it into a video.
Crazy price/performance Mistral coding models
Link: Mistral
Our Take: Devstral Medium has nearly Claude Sonnet level performance on coding tasks, but at just 1/8th of the cost.
📊 Industry Intelligence
The browser goes AI
Link: Perplexity
Our Take: Perplexity has launched their AI-powered Comet browser. OpenAI are heavily rumoured to be about to launch their own AI browser. And The Browser Company have Dia. Does this make any sense, and why the sudden rush to merge AI with browsers? There are a lot of synergies here - using AI to reinterpret what you read (summarisation at the simple end, much more at the complex end). But also, the browser is the user interface to the world - if you own the browser, you can integrate AI to scrape, automate and control.
🧠 AGI - are we there yet?
Grok gets tuned to be more right wing, starts praising Hitler
Link: Guardian
Our Take: “We have improved Grok significantly. You should notice a difference when you ask Grok questions,” Musk posted on X last Friday.
It only took a few days for “new Grok” to be praising Hitler and making gross insults about Donald Tusk, the Polish PM. Multiple posts had to be deleted and Grok handcuffed.
Remember the Pause AI letter (a proposal championed by Musk)? “AI labs and independent experts should use this pause to jointly develop and implement a set of shared safety protocols for advanced AI design and development that are rigorously audited and overseen by independent outside experts.” Things seem to have changed somewhat - the only oversight of Grok's responses seems to be one man.
🎯 Week Ahead Priorities
If you want to do something, rather than just read…
Test the AI fact-checker - Try YesNo.ai on a recent social media post or news article you're skeptical about to see how it works and whether it's actually useful. Can AI agents solve the truth problem?
Decide - is your team made up of top experts obsessing about code quality, or slightly more average folk focused on delivering functionality? The answer might dictate how much AI coding tools will help them.
Consider ordering Reachy Mini - If you're into AI experimentation, grab one now (this newsletter author couldn't resist, so why should you?).
🚨 Shameless plug alert
Barnacle Labs builds AI solutions for ambitious organisations tackling complex challenges. We're the team behind the National Cancer Institute's NanCI AI-powered app—we help you identify the right opportunities and ship solutions that deliver results.
Reply to this email with your biggest AI challenge if you'd like to talk!