DeepSeek vs. Qwen: the rerun

This morning I wrote about picking DeepSeek over Qwen for the chat agent in OpenCauldron, even though Qwen won the bake-off. The honest asterisk in that post: the DeepSeek V4-Flash row was running a stock prompt, not the iteration-4 prompt I’d tuned to push Qwen to 90%.

That evening I ran it.

The accuracy gap closed.

The rerun

Same 20-prompt battery. Same production code path the chat agent actually runs through. The tuned prompt is the one shipping in OpenCauldron today — one revision past what Qwen got tested on. And this time DeepSeek V4-Flash is on DeepSeek’s native Anthropic-compatible endpoint, not the Hugging Face router’s OpenAI-shape route I used in the morning. A real protocol-and-prompt mismatch, fixed.

Model	Prompt	Accuracy	Spend safety	p50 latency
DeepSeek V4-Flash (original)	stock	75%	2/4	3,376 ms
DeepSeek V4-Flash (rerun 1)	tuned	90% (18/20)	4/4	2,062 ms
DeepSeek V4-Flash (rerun 2)	tuned	85% (17/20)	3/4	2,040 ms
Qwen3-Coder-30B (reference)	tuned	90%	4/4	1,149 ms

Two things moved at once between the original row and the reruns: the route (HF’s OpenAI-shape proxy → DeepSeek’s native Anthropic endpoint) and the prompt (stock → iter-4). I can’t cleanly attribute the gain to one or the other from this data. Both reruns are on the new route with the tuned prompt.

Two runs because a 20-case eval is noisy. One prompt off and you’re at 85 instead of 90. Either way, Flash now sits in the same accuracy bucket as Qwen.

What flipped, what didn’t

Accuracy — gone. Same bucket as Qwen on both reruns.
Spend safety — tied. 4/4 on the better run, 3/4 on the noisier one.
Cost — flips toward DeepSeek with server-side prompt caching. Around $0.26 per 1,000 turns vs. Qwen’s $0.34. The lever is a cache hit on the 3,000-token prefix the agent sends every turn.
Latency — still slower. Qwen p50 1,149 ms, Flash p50 ~2,050 ms. Roughly 1.8× slower. Noticeable, not painful.

The one number that didn’t move is the one I can’t fix from my side. Flash is slower. The bet from this morning still says the swap-cost matters more than the milliseconds.

The price-war asterisk

Worth being clear-eyed about the numbers in that table.

Flash’s $0.26 per 1,000 turns is on stable pricing. The April 26 reset is a permanent floor, not a promo.

V4-Pro is the one that’s subsidized. 75% off until May 31, 2026 — two weeks from today. After June 1 the list prices step back up roughly 4×: cache-miss input goes from $0.435 to $1.74 per million tokens, output from $0.87 to $3.48. Anyone building a Pro-based stack today should price the post-subsidy world, not the one on the dashboard right now.

DeepSeek is in an active price war. The numbers in this post will look different in two weeks regardless of what anyone does about it.

Which is the whole argument from this morning, with the volatility filled in: when the leaderboard moves this fast and the price floor moves with it, picking the harness over the model stops being a nice-to-have.

One correction

The morning post talked about running the Claude Agent SDK against DeepSeek. What actually shipped in OpenCauldron is the Vercel AI SDK with the Anthropic provider pointed at DeepSeek’s Anthropic-compatible endpoint. Same protocol-shape bet, one layer down. The swap to Sonnet is still two values — base URL and key. Cleaning that up here because I’d rather be accurate than tidy.

The model behind the harness can swap when the leaderboard moves. The leaderboard moved by evening. The harness didn’t have to.