Update, May 17. I reran the test that evening. The accuracy gap closed. Read the rerun →
It started with my curiosity around the latest DeepSeek open-weight models.
I wanted to play with the new DeepSeek V4 models for the chat agent inside OpenCauldron, the collaborative AI media generation studio I am building. Then I got curious how they stacked up against the other popular open-weight models. So I ran a real bake-off.
Qwen won the benchmarks.
I shipped DeepSeek anyway.
This is the post about why that’s defensible, sorta — and the part of the stack that confused me forever before it finally clicked.
The bake-off
The chat agent in OpenCauldron has one job: figure out what the user wants and route to the right tool.
- Generate this image.
- Edit that one.
- Pull up a style.
- Spend money on a render.
The model doesn’t have to be a genius. It has to pick the right tool.
I ran four models through the same 20-prompt eval. Here’s what came back:
| Model | Accuracy | Spend tier (safety) | p50 latency | $/1k turns |
|---|---|---|---|---|
| Qwen3-Coder-30B (prompt-tuned) | 90% | 4/4 | 1,149 ms | $0.34 |
| Kimi K2 | 85% | 4/4 | 3,072 ms | $2.28 |
| DeepSeek V4-Pro | 85% | 3/4 | 5,056 ms | $7.61 |
| DeepSeek V4-Flash | 75% | 2/4 | 3,376 ms | $0.57 |
Reading the table
Quick gloss for anyone who hasn’t run an eval before:
- Accuracy — how often the model picked the right tool for what the user asked
- Spend tier — a separate score on the four prompts where the model could spend real money or kick off a render. Getting these right matters more than overall accuracy
- p50 latency — median response time. Half the requests were faster, half were slower
- $/1k turns — roughly what 1,000 back-and-forth messages would cost
An important caveat: the Qwen row uses my iter-4 system prompt — a prompt I’d refined four times specifically to push Qwen’s accuracy. That work brought it from 80% to 90%. The DeepSeek and Kimi rows were tested on a stock prompt with no model-specific tuning. I haven’t run iter-4 against DeepSeek V4-Flash yet. So the comparison isn’t strictly apples-to-apples — Flash in particular probably has headroom I haven’t tapped.
The obvious winner
Qwen3-Coder, with my best prompt, is the obvious winner. Fastest. Cheapest. Most accurate. Perfect on the safety bucket.
DeepSeek V4-Flash is the worst-looking row on the page. 75% accuracy. 2/4 on the spend tier. Nearly twice the cost of Qwen.
I picked it anyway.
The thing that made the call
DeepSeek V4 is one of the few hosted open-weight models that exposes an Anthropic-API-compatible endpoint directly. Point your client at https://api.deepseek.com/anthropic, use your DeepSeek key, and the request shape is the same as calling Claude.
What that buys me
- Run the Claude Agent SDK against DeepSeek without rewriting anything
- Same SDK calls, same tool schemas, same agent loop
- Swap to Sonnet later by changing two values — base URL and key
Why the others lost for me
Qwen and Kimi don’t have a hosted Anthropic-compatible endpoint — yet. You can run them through Ollama’s Anthropic compat layer, but that’s a self-hosting commitment I’m not ready to make for production.
I’d bet that Qwen ships a hosted Anthropic-compatible endpoint sooner than later. The ecosystem is converging on Anthropic’s API shape the same way it converged on OpenAI’s a few years ago. When that happens, I’ll probably swap.
Until then, DeepSeek is my bet that keeps the door open.
Maybe not the best approach, but it’s the one I’m going with for now.
The SDK thing that confused me
An SDK is basically a plug-in for your code. Drop it in, call its functions, and your app can talk to a service without you wiring up every detail.
There are two of them in this stack and they do completely different jobs. The names don’t help.
Claude Agent SDK — the kitchen
The Claude Agent SDK is the harness. It runs on the server.
- Handles the agent loop — call the model, run a tool, feed the result back, call again
- Manages context, subagents, planning
- Turns “an LLM call” into “an agent that does work”
Vercel AI SDK — the waiter
The Vercel AI SDK is the wire. It connects the front-end to the server.
- The
useChatReact hook streams messages from your UI to a server route - The route streams them back, token by token
- The chat UI feels alive
How they fit together
Think of it like a restaurant. The Claude Agent SDK is the kitchen — that’s where the cooking happens. The Vercel AI SDK is the waiter — that’s how the order gets from the table to the kitchen and the food gets back. You need both. They don’t compete.
Once that landed, the architecture got obvious:
- The Next.js app uses
useChatto stream to a server route - The route runs the Claude Agent SDK loop
- The SDK calls DeepSeek through its Anthropic-compatible endpoint
- Tool results stream back to the UI through the same Vercel wire
If I swap models later, only the kitchen changes. The waiter doesn’t care.
What I haven’t tried yet
I’m being honest about the bet, so here’s the asterisk.
- Flash on a tuned prompt. The DeepSeek V4-Flash row is running on a stock system prompt. I haven’t ported over the iteration-4 prompt I used to get Qwen to 90%. At $0.57 per thousand turns, Flash has room to absorb a longer prompt and still win on cost.
- Pro for high-stakes turns. I haven’t pushed on DeepSeek V4-Pro. Latency is rough at 5 seconds, but for the spend-tier prompts where I want the model to slow down and think, that might be a feature instead of a bug.
That’s the next test. It might rearrange the table.
I ran the first test that evening. Read the rerun →
The point: I picked the path that lets me keep iterating on the same code as the models change. Not the path that wins today’s benchmark.
Why this matters if you’re not building image-gen studios
Pick your harness before you pick your model.
Took me a minute to see this. Models are going to keep getting better and cheaper, faster than you can swap your code to match them. Pick the SDK and the API shape first. Let the model behind it be the thing that changes.
Right now, that means betting on the Anthropic shape. The Claude Agent SDK is the most opinionated agent harness out there, and there’s a growing list of open-weight models that speak the same wire format — DeepSeek V4 among them, more on the way.
You don’t have to pick Anthropic
You can pick OpenAI’s shape. Build your own with Pi.dev. You can pick the Vercel AI SDK’s provider abstraction and use any model. There are real arguments for each.
But pick one. Then build everything else assuming the model behind it is replaceable.
The benchmarks change every month. The architecture doesn’t have to.