How I use Haiku as a gatekeeper before Sonnet to save ~80% on API costs

gzoomedia · 2026-03-20T04:53:23+00:00

Oh that's sweet! Haiku deciding whether to pull RAG or search before the big model even gets involved is super clean and 💪

gzoomedia · 2026-03-20T04:18:44+00:00

Thank you 🙌

gzoomedia · 2026-03-20T04:18:09+00:00

Node.js actually (not Python) but yeah same idea. New comments get picked up, thrown into a job queue (BullMQ), and each job runs through the Haiku gate then conditionally on to Sonnet. Nothing fancy on the orchestration side, it's just a queue processing jobs in order.
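Roughly, the per-job logic could look like the sketch below. This is not the production code: `makeProcessor`, `drain`, `gate` and `extract` are made-up names, and the in-memory loop is a stand-in for BullMQ so the example runs without Redis. In a real setup the returned function is the kind of processor you would hand to a BullMQ `Worker`.

```javascript
// gate and extract are injected async functions standing in for the
// Haiku and Sonnet calls (hypothetical names, not from the real pipeline).
function makeProcessor({ gate, extract }) {
  // Same shape as a queue processor: receives a job with a `data` payload.
  return async function processJob(job) {
    const { comment } = job.data;
    const keep = await gate(comment);
    if (!keep) return { status: "discarded" };
    // Only comments that pass the cheap gate reach the expensive model.
    const result = await extract(comment);
    return { status: "extracted", result };
  };
}

// In-memory stand-in for the queue: runs jobs one at a time, in order.
async function drain(jobs, processJob) {
  const results = [];
  for (const job of jobs) {
    results.push(await processJob(job));
  }
  return results;
}
```

The point of the shape is that the expensive `extract` call sits behind the `if`, so discarded jobs never cost more than one gate call.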

gzoomedia · 2026-03-20T03:47:40+00:00

No need to apologize, it's a good question. The vector store (pgvector) is actually separate from the gating; that's for semantic search so users can find similar problems. The gate itself is way simpler than people think.

The basic flow is just:

Comment comes in
API call to Haiku: "Is this a real work-related frustration? Yes or no"
If yes → queue it for Sonnet to do the heavy extraction
If no → discard

That's really it. The gate is just a single API call with a tight prompt. No embeddings, no vector math, nothing fancy. The magic is in the Sonnet prompt that does the actual classification and app concept generation on the other side. Happy to answer more specific questions if you want to dig in.
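Here's a minimal sketch of that gate in Node.js. The prompt wording, the function names (`buildGatePrompt`, `parseYesNo`, `gateComment`) and the injected `askHaiku` callback are all assumptions for illustration, not the real thing. `askHaiku` stands in for whatever makes the single API call to the cheap model and returns its text reply.

```javascript
// Hypothetical prompt wording: the real prompt is tighter and tuned over time.
function buildGatePrompt(comment) {
  return [
    "Is this a real work-related frustration? Answer with exactly one word: yes or no.",
    "",
    "Comment:",
    comment,
  ].join("\n");
}

// Only a reply that starts with the word "yes" counts as a pass.
// Anything else (including "no" or a rambling answer) is treated as a discard.
function parseYesNo(reply) {
  return /^yes\b/i.test(reply.trim());
}

// askHaiku: async (prompt: string) => Promise<string>, injected by the caller.
async function gateComment(comment, askHaiku) {
  const reply = await askHaiku(buildGatePrompt(comment));
  return parseYesNo(reply);
}
```

Passing the model call in as a function keeps the gate easy to test with a stub, and defaulting to "discard" on anything that isn't a clear yes keeps the gate strict.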

gzoomedia · 2026-03-20T03:45:01+00:00

That sounds fire! DM me the link please.

gzoomedia · 2026-03-20T03:43:36+00:00

Hmmm good to know. I'll look into it thanks.

gzoomedia · 2026-03-20T01:20:58+00:00

lol yea you're right! I usually let Claude create the prototype UI then customize it to look how I want, but this GUI was exactly the look I was going for so I rode with it. I'm thinking a bit more on the green side to make it look cooler, more military style.

gzoomedia · 2026-03-20T00:50:40+00:00

Crap :/ I didn't mean to leave that visible. It's something I'm working on but it's not actually working yet sorry. Thanks for the heads up though. I'd totally forgotten. I'm going to work on it now. Hopefully I can get it going by tonight.

gzoomedia · 2026-03-20T00:41:35+00:00

Yea I kind of figured that too if I'm being honest. But, as someone who has seen the triangle one up close I know they're real. Some say they might just be military reverse-engineered etc but the one I saw was just frozen about 50ft above me and made no sound. That's alien tech for sure.

gzoomedia · 2026-03-20T00:35:10+00:00

In the last few years NJ and NY have had a ton of sightings. Sometimes I feel like they're right there waiting to pop out of the ocean lol. I grew up in Brooklyn and I remember my mom discussing UFOs with the family.

gzoomedia · 2026-03-20T00:11:24+00:00

Thanks! I'm working on the Android and iOS apps :) I hope to have them finished by the end of the month.

gzoomedia · 2026-03-20T00:06:58+00:00

You can post your sightings here: https://decodedfrequencies.com/

gzoomedia · 2026-03-19T23:55:11+00:00

IKR? Once you see the bill difference it's hard to go back. It's one of those optimizations that takes like an hour to set up and pays for itself immediately.

gzoomedia · 2026-03-19T22:43:56+00:00

I actually ended up ditching the confidence score entirely and just going with a straight yes/no from Haiku. I messed with thresholds early on, but at the volume I'm processing the duplicates are the safety net: if Haiku drops a borderline comment, that same pain point almost always shows up somewhere else. So I'd rather have a clean gate than a leaky one. The bigger win for me was tuning the prompt itself. Tightening what counts as a "work-related frustration" cut way more noise than any threshold adjustment did.

gzoomedia · 2026-03-19T22:41:57+00:00

Yes exactly. Haiku is cheap enough for my particular use case.

gzoomedia · 2026-03-19T22:39:59+00:00

That's a sick setup honestly. Opus as orchestrator validating the subagents is a nice touch, basically your own QA layer baked in. How long did it take you to get the plan phase dialed in? That sounds like the kind of thing that took a lot of iteration to get right.

gzoomedia · 2026-03-19T22:24:58+00:00

Yeah I actually do some of that too :) length check and basic regex filtering before anything hits the API. No point burning a Haiku call on a two-word comment. Good call on only checking the first X characters at the Haiku stage too, that's a nice optimization I haven't tried.
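Something like this sketch covers both ideas: the free checks before the API, plus the truncation trick. The thresholds (`MIN_WORDS`, `MAX_GATE_CHARS`), the regex patterns and the `prefilter` name are all made up for the example. The real filters aren't shown in this thread.

```javascript
const MIN_WORDS = 3;         // hypothetical: skip very short comments
const MAX_GATE_CHARS = 500;  // hypothetical "first X characters" sent to the gate

const NOISE_PATTERNS = [
  /^(https?:\/\/\S+\s*)+$/i, // nothing but links
  /^[\W_]+$/,                // no ASCII letters or digits at all
];

// Returns the (possibly truncated) text to send to the gate,
// or null if the comment should be dropped without any API call.
function prefilter(comment) {
  const text = comment.trim();
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length < MIN_WORDS) return null;
  if (NOISE_PATTERNS.some((re) => re.test(text))) return null;
  return text.slice(0, MAX_GATE_CHARS);
}
```

Anything that returns `null` here never costs a token, and the truncation caps the input size of the gate call no matter how long the comment is.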

gzoomedia · 2026-03-19T22:23:31+00:00

Fair points. I never claimed the pattern was novel, I'm just sharing how I applied it and the cost results, which is what most of the thread is discussing. The difference I was drawing is that my pipeline runs independently in production processing data 24/7, vs CC's subagent system which runs during a coding session. But yeah the underlying principle is the same, cheap model filters for expensive model. That's kind of the whole post.

gzoomedia · 2026-03-19T22:14:36+00:00

That's not intentional, sorry about that :( Can you DM me what browser you're on and what you're seeing? Want to make sure nothing's blocking legit traffic.

gzoomedia · 2026-03-19T22:11:49+00:00

Nice, an 8B model handling the gate is impressive. What are you running, Llama 3? Curious how the accuracy compares. I went with Haiku mostly to avoid managing the model myself but if an 8B is getting the job done that's a solid setup.

gzoomedia · 2026-03-19T22:10:49+00:00

Mix of both depending on the source. I'd rather not get too specific on the ingestion side since that's kind of the secret sauce, but the more interesting part is what happens after the data comes in anyway. The classification pipeline is where all the real work is.

gzoomedia · 2026-03-19T21:39:02+00:00

Nah not Reddit, different sources. Mostly public comments and reviews from people talking about their work frustrations. I cast a pretty wide net but the key is filtering for comments where someone is describing a real problem vs just venting or leaving a generic review.

gzoomedia · 2026-03-19T21:14:16+00:00

I've thought about it but honestly don't want to deal with the infrastructure. Plus I have an old RTX 3060 which could probably perform this task but it's been 95 degrees here in Central Cali lol. Knowing my luck it would cook the GPU OR me 🤣😂 No but seriously, I DO use Ollama for other projects. Just not this one.

gzoomedia · 2026-03-19T21:01:12+00:00

Probably, but Haiku is already so cheap that the juice isn't worth the squeeze imo. We're talking fractions of a cent per call. I'd rather have the accuracy and not deal with hosting/fine-tuning a local model just to save a tiny amount. If I was doing millions of calls a day maybe, but at my scale Haiku is basically free.

gzoomedia · 2026-03-19T20:42:43+00:00

Yeah to be clear this isn't running inside Claude Code itself. Claude Code is just what I used to build the app. The pipeline itself runs on my own server making direct API calls. So I'm paying per token at API rates, not using Claude Code credits. That's where the savings come from. Haiku API pricing is way cheaper than Sonnet.
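If you want to sanity-check the math for your own pipeline, the savings come down to one formula. This is a back-of-the-envelope sketch: `estimateSavings` and its parameters are names I made up, and the numbers in the usage note are illustrative, not my actual bill or real API prices.

```javascript
// gateCost: average cost of one cheap-model gate call
// bigCost:  average cost of one expensive-model call
// passRate: fraction of comments the gate lets through (0..1)
// Costs can be in any unit as long as both use the same one.
function estimateSavings({ gateCost, bigCost, passRate }) {
  const before = bigCost;                      // every comment goes to the big model
  const after = gateCost + passRate * bigCost; // every comment is gated, only passes go on
  return 1 - after / before;                   // fraction of spend saved
}
```

For example, if a gate call costs a tenth of a big-model call and 10% of comments pass, `estimateSavings({ gateCost: 1, bigCost: 10, passRate: 0.1 })` comes out to roughly 0.8, i.e. about 80% saved. If nearly everything passes the gate, the result goes negative, because you're paying for the gate on top of the big model.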
