May the Cache Be With You

Prompt caching and coding harnesses

Disclaimer: This is not to bash on anyone. I don’t really care if you like one tool/model over the other. They all have their pros and cons. And if they work for you, more power to you. Even if it’s OMO.

With that out of the way, let’s be frank. No business is a charity. AI labs will every now and then say this is all for the betterment of humanity, but for as long as money exists, for as long as investors will invest, and shareholders gonna hold their shares, labs need to make money to spend money.

So where do they spend money? Well, I’m not a CFO, but I can tell you this much:

All the “normal” things that companies spend money on, like payroll, marketing, etc.
Outrageous amounts of money on CAPEX investments: all the data centres, GPUs, the cool stuff.
Running the LLMs internally. Training the models & research.
Inference costs spent to provide the models to the end users.

And out of all four, guess which are the only ones they can really own, optimise and control?

If you’re thinking they can just hire more Claws, or offload marketing to their own models, imagegen and videogen, you might not be 100% incorrect.

Get a sub or go broke.

Running the models. I could go into an entire debate about how much each lab is giving you for your $200 sub, but I’m not going to. Go read she_llac’s post (there’s a detailed write-up as well).

Still here? Alright.

It’s banging value for money, no matter which provider sub you use. This is indisputable, objective truth.

YES, YES I AM GETTING TO THE POINT. This is where we’re going to get nerdy, buckle up. Before I do, though, I need you to take a ride with me into the world of coding harnesses and how they structure their system prompts. Shout out goes to Thariq, who finally made me write this thanks to his blog post, which is kinda required reading to understand Anthropic’s point of view.

Hit the cache or hit a wall

Back? Alright then.

We’ll be looking at 3 major contenders that allowed users to use Claude Code subscriptions: Claude Code, opencode and Pi. Wait wut, WTF is Pi? Boy oh boy, you’ll be hearing about it in the upcoming months. Anyhoo…

Before you ask, these are the three I’ve used in the past year or so.

Sorry OAI, maybe one day, but not quite yet.

I’m not going to comment on any drama, I’m sure you’ll do that for me. You know already what Anthropic did. There are many rumours as to why. “They hate (insert your fave tool here)”; “they want to force everyone to use Claude Code”; “they are the Apple of AI”.

The truth is much simpler than this and less conspiratorial. They are a business. Not a charity. They are here to make money. Enter prompt caching (and my fancy Excalidraw infographic I spent way too much time on).

Coding harness prompt caching diagram

As you can see in the diagram (which I’m sure is going to result in many corrections from AISDK experts - this is merely an approximation on the harnesses’ side), there are quite a few moving parts when it comes to wrapping an LLM in a coding agent loop. The fewer of these moving parts, the more cost-efficient the inference will be. For both sides, the user and the provider. Prompt caching enables the provider to save on the compute cost and effectively enables these cheap subscriptions to work at all.
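To make the prefix idea concrete, here’s a toy sketch in Python. It is not how any provider actually implements caching. The block names, the hashing scheme and the `cached_block_count` helper are all made up for illustration. The only assumption it encodes is that a request can reuse cached work up to the first block that differs from something the cache has seen before.

```python
import hashlib


def prefix_hashes(blocks):
    """Return one running hash per prefix of the request (block 1, blocks 1-2, ...)."""
    running = hashlib.sha256()
    digests = []
    for block in blocks:
        running.update(block.encode("utf-8"))
        running.update(b"\x00")  # separator so block boundaries matter
        digests.append(running.hexdigest())
    return digests


def cached_block_count(cache, blocks):
    """Count how many leading blocks are already covered by the cache."""
    hits = 0
    for digest in prefix_hashes(blocks):
        if digest not in cache:
            break  # first mismatch: everything after this is recomputed
        hits += 1
    return hits


# Hypothetical request layout: stable stuff first, conversation last.
first = ["SYSTEM PROMPT", "TOOL DEFINITIONS", "user: hi"]
cache = set(prefix_hashes(first))

follow_up = first + ["assistant: hello", "user: fix the bug"]
edited = ["SYSTEM PROMPT v2", "TOOL DEFINITIONS", "user: hi"]

follow_up_hits = cached_block_count(cache, follow_up)  # only the 2 new blocks are uncached
edited_hits = cached_block_count(cache, edited)        # nothing reusable at all
```

Touch the very first block and every block after it misses, even though the tools and the conversation did not change. That is the whole reason the stable parts want to sit at the front.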

Fair enough.

So I just stabilise my setup around that diagram and I’m good, right?

Nope. The key point that Thariq made in his post is that Anthropic caches the stable parts globally. So every time you fire up Claude Code, you are already hitting the cache, whether it’s an old or new session.

This does not apply to opencode, Pi, OpenClaw, or any other tool that you’re using. When using these pieces of software, you need to first send your own payload, and then it gets cached. The exception would obviously be using a subscription offered by the tool developers themselves - these guys are paying for the API and can optimise.
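A rough way to picture the difference, as a Python sketch. Everything here is an assumption for illustration: `PrefixCache` is a made-up class, the prompt strings are placeholders, and the “already warmed” call stands in for the global caching described above. I have no visibility into the real provider-side implementation.

```python
import hashlib


class PrefixCache:
    """Toy provider-side cache keyed on the exact prefix text."""

    def __init__(self):
        self._seen = set()

    def request(self, prefix: str) -> str:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key in self._seen:
            return "hit"
        self._seen.add(key)  # first sighting: pay full price, then remember it
        return "miss"


# Placeholder text, not the real prompts of any tool.
DEFAULT_PROMPT = "first-party harness default system prompt + tools"
CUSTOM_PROMPT = "third-party harness system prompt + tools"

cache = PrefixCache()
cache.request(DEFAULT_PROMPT)  # assumption: someone else already warmed the shared default

results = {
    "default prompt, brand-new session": cache.request(DEFAULT_PROMPT),
    "custom prompt, first request": cache.request(CUSTOM_PROMPT),
    "custom prompt, second request": cache.request(CUSTOM_PROMPT),
}
```

In this toy model the default prompt hits on its very first request of a new session, while the custom prompt has to miss once before it can ever hit.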

I am unsure how long the “user cache” is valid for CC subs, but this is beside the point. OpenAI offers a 24-hour cache, so if you’re hooking your software up to a Codex sub, you better keep your system prompt stable.

Yes, I am looking at you, opencode users: changing custom agents mid-session, using custom-defined subagents (custom definitions do not get appended, they replace the default system prompt), going past midnight with your session (this nukes the cache, due to where the time gets injected), running 8 sessions at once with completely different system prompts, etc.
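Here’s a minimal Python sketch of why the midnight thing hurts, assuming the harness stamps the date near the top of the system prompt. This is not opencode’s actual prompt layout. `build_prompt`, its `date_first` flag and the instruction text are hypothetical, and the comparison is done on characters rather than tokens to keep it simple.

```python
import os
from datetime import date


def build_prompt(instructions: str, today: date, date_first: bool) -> str:
    """Assemble a system prompt with the date stamp either first or last."""
    stamp = f"Today's date: {today.isoformat()}"
    parts = [stamp, instructions] if date_first else [instructions, stamp]
    return "\n".join(parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix, i.e. what a prefix cache could reuse."""
    return len(os.path.commonprefix([a, b]))


instructions = "You are a coding agent. " * 50  # stand-in for a long, stable prompt
before_midnight = date(2025, 1, 1)
after_midnight = date(2025, 1, 2)

# Date at the top: the prompts diverge almost immediately after midnight.
reusable_date_first = shared_prefix_len(
    build_prompt(instructions, before_midnight, date_first=True),
    build_prompt(instructions, after_midnight, date_first=True),
)

# Date at the bottom: the long stable part is still a shared prefix.
reusable_date_last = shared_prefix_len(
    build_prompt(instructions, before_midnight, date_first=False),
    build_prompt(instructions, after_midnight, date_first=False),
)
```

With the stamp up front, the reusable prefix ends inside the date itself, so the whole instruction block behind it gets recomputed. With the stamp at the end, the entire instruction block survives the date change.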

Don’t even ask how I got into investigating this.

What do we want? Open source!
When do we want it? Now!
Why? Because.

So am I an Anthropic apologist? Far from it. CC does have issues. I would like them to release a 20/120b open-source model we can use locally. I don’t use their models (anymore). I don’t like how they handle PR and dev-rel. But I also completely understand why they are trying to optimise inference costs on their side by using stable defaults they can globally cache and never recompute again.

They have less market share than OAI. They don’t have as much capacity as Google. They are not state-backed, like the Chinese labs. (Yes, of course I am oversimplifying)

If they want to grow, they need to be smart about things and optimise. It’s been said many times that Anthropic love their models and just want to make the best models ever. It’s probably in their mission statement & core values, alongside “the betterment of humanity”. It’s only natural they want as much compute capacity for training & research as possible, as opposed to spending their precious GPU cycles on serving the models to the masses.

Prompt caching is the real reason why Claude Code isn’t open source and probably will never be, whether you like it or not.

And this is the reason why the entire drama with banning 3rd party tools happened.

I only wish they communicated it better and in a more open way.

But what do I know. I’m just a guy. I don’t work at any of the mentioned companies, I don’t live in the Silly Valley. I’m not even a dev and I can’t code. I’m pretty damn good at connecting the dots, though. Draw your own conclusions.

Anyway, thank you for reading, I really appreciate it if you made it this far.

Go away now.

PS. Since the models are trained within their native harnesses, when will we get to the point where we can just drop the system prompt entirely? Something, something, fires up Pi to try.

PS2. No clanker has tainted this article with its presence, aside from exploring the codebases for OC & Pi.

Originally published on: https://x.com/Howaboua/status/2025222185049043189?s=20
