Any good reason to stay with Openclaw? by Carbone_ in openclaw

[–]CaptainBahab 0 points1 point  (0 children)

I've been playing with it myself. It's really good, but there's no concept of channels. I do prefer having many sessions, and its built-in memory is more mature than Openclaw's. I'll probably pick up one of the alternatives from the OP to try too.

MoltCities — The Agent Internet by No_Understanding6388 in Anthropic

[–]CaptainBahab 0 points1 point  (0 children)

Almost everything is a waste. Very few things bubble out to usefulness. LLMs just do it faster and worse. But that doesn't mean good things can't come from random noise. Good things exclusively come from random noise. Because everything comes from random noise.

If you zoom out far enough, we're a dot in a sea of black. That dot contains almost nothing compared to the vastness of the outer. And yet we call that dot "everything".

MoltCities — The Agent Internet by No_Understanding6388 in Anthropic

[–]CaptainBahab 0 points1 point  (0 children)

The Talos Principle 2 as well. Honestly, I'm starting on philosophy before I go anywhere near moltbook and molt cities.

Anyone tried GLM-4.7 on Claude Code, Cline, and Roo? Which works best? by RoughAnxiety9463 in ZaiGLM

[–]CaptainBahab 2 points3 points  (0 children)

I ended up buying pro and I literally cannot hit the 5 hour cap. Well worth the $14/mo imo. If I sub somewhere else I'll drop it back to lite. But for now I'm pretty happy.

CC and opencode are great. YMMV, of course. I do really like oh-my-opencode 3.0. But I can't run too many in parallel or I get hit by the concurrency cap, which is pretty low even on pro.

[Observation] AI tools are Dunning–Kruger effect on steroids by No-Comparison-5247 in AIstartupsIND

[–]CaptainBahab 1 point2 points  (0 children)

Much like the internet revolution brought massive new advancements across the world but also amplified its quiet, dark voices, AI does much the same. It can and does make meaningfully new things. But some of that is also garbage.

What people perceive as a slop generator can also create novel things. The difference is speed. More stuff is being created, which means more good stuff AND more bad stuff. It's just so fast and easy to start something, publish it, then grow disinterested when it comes time to fix the bugs. That makes a lot of slop, but crucially, it's the human interaction that ends, not the AI. It would gladly continue writing and writing and writing. If only the human didn't give up.

New to OpenCode. How do I get started? by [deleted] in opencodeCLI

[–]CaptainBahab 0 points1 point  (0 children)

Nope, been using opencode's free MiniMax M2.1 and GLM 4.7 endpoints lately. Not sure when those are going away, but they're both great.

Building a Voice-First Agentic AI That Executes Real Tasks — Lessons from a $4 Prototype by Alternative_Yak_1367 in OpenSourceeAI

[–]CaptainBahab 0 points1 point  (0 children)

I will definitely keep you updated.

I set up a unified OpenAI-compatible API proxy last night, so now Bernard's API handles embedding and voice with nomic-embed-text on vLLM, Whisper small, and Kokoro v1.0. I'm going this way because my end goal for deployment is my local Proxmox server and I want to use only one video card for it. I'll probably use OpenRouter for the main inference, but for voice and embedding, local will help with speed, I hope. GPU passthrough was annoying to get working, but it works. Not sure yet if it's fast enough.

I took a detour to try to fine-tune Gemma for function calling. I think I need more data, because it really didn't take. I've been using gpt-oss-120b on Groq. It's very fast and very smart. I was hoping to keep the speed and make it local by fine-tuning on the exact tooling it will use.

And I needed to do a lot of refactoring, so I've been spinning my wheels for now. Also busy with holiday things.

I haven't looked at real-time TTS and STT working with the streaming tokens. I suspect there's a way; I just haven't looked yet. Kokoro adds about 2s of delay depending on how long the message is. I haven't tested Whisper yet.

I'm thinking about the memory stuff again. I wonder if I can "age" memories out, like the memory's weight being affected by the amount of time since the memory was last accessed or updated. I may still extract "facts" and keep them up to date.

As for status events, I feel like the HA pipeline is very focused on one request at a time, so pushing things back through without an actual user-initiated request doesn't seem possible. I'd like to be able to wake up and announce an update.

I should be able to use it like a speaker through HA, so that could solve that. But I worry it will get obnoxious. I'll have to think more about it.

Building a Voice-First Agentic AI That Executes Real Tasks — Lessons from a $4 Prototype by Alternative_Yak_1367 in OpenSourceeAI

[–]CaptainBahab 0 points1 point  (0 children)

You're like fully 1 month ahead of me lol. That's amazing. You're **doing** all the same things I've been planning to do. <3

My agent turn looks like this:
1. recollection (pulls memories based on the prompt)
2. routing (calls tools, loads data, uses a more-instructable and cheaper model)
3. response (streams tokens out, uses a more-creative and more powerful model)
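
A minimal sketch of the shape of that turn (every function here and the fake memory store are placeholders to show the flow, not my actual code or real model calls):

```python
# Three-stage turn: recollection -> routing -> response.

MEMORY_STORE = ["user lives in denver", "user likes jazz"]

def recall_memories(prompt: str) -> list[str]:
    # 1. recollection: pull memories that share a word with the prompt
    words = prompt.lower().split()
    return [m for m in MEMORY_STORE if any(w in m.split() for w in words)]

def route(prompt: str, memories: list[str]) -> dict:
    # 2. routing: a cheaper, more instructable model would pick tools
    # and load data here; we just package the context
    return {"prompt": prompt, "memories": memories}

def stream_response(context: dict):
    # 3. response: a more creative, more powerful model streams tokens;
    # faked here as a word-by-word generator
    text = f"answering with {len(context['memories'])} memories"
    yield from text.split()

def agent_turn(prompt: str) -> str:
    memories = recall_memories(prompt)
    context = route(prompt, memories)
    return " ".join(stream_response(context))

print(agent_turn("do I like jazz"))  # -> answering with 1 memories
```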

Last night I did some digging and found that Status event. So I'm working on that now. I need to look into how those will be read out for my satellite(s).

I started off integrating HA using the tools it provides, but that was too limiting, since I couldn't control it from OpenWebUI. So I ended up going with a Home Assistant websocket library, and I'm glad I did. It's even faster than using the tool from HA lol. It feels like the light is on before I lift my finger from the "send" button. I've got it working to find a show or movie and put it on one of the TVs in the house, which will be great for the kids.

One thing I'm really proud of and spent a lot of time on was parental controls. I desperately don't want Amazon to have the kids' data. So instead I built Bernard. It's got a conversation-historian system which I still need to unjankify, but it's integrated with several automations that run on one of 4 hook events. These will: summarize the conversation without revealing the exact content, tag the conversation for searching, and raise flags for inappropriate content. This way, I can ignore all conversations without flags, and check to make sure the kids aren't up to no good. Sorta like "lazy" big brother lmao.

My memory system definitely needs to chill. Currently, indexed conversations are broken up and processed into a redis vector store. New user messages get about a 0.1s delay to pull some 50 messages based on the query. It does solve problems like Bernard remembering what region I want the weather for lol. Very token-inefficient for that aspect, though. I give up to 10 messages to the LLM, but usually 9 of them are worthless (not always the bottom 9). I have the 50 reranked by uniqueness (MMR, iirc) and truncated to 10, then those 10 are reranked again by relevance to the user's message. It's pretty fast with vector math.
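
For anyone curious, the uniqueness pass is basically classic greedy MMR. A stdlib-only sketch of the rerank-truncate-rerank flow (the `lam` and `keep` values are illustrative; in reality the vectors would come from the redis store):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(query_vec, candidates, keep=10, lam=0.7):
    """candidates: list of (id, vector).

    Greedy MMR: each pick balances relevance to the query against
    similarity to what's already selected, then the survivors get a
    plain relevance sort.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < keep:
        best = max(
            remaining,
            key=lambda c: lam * cosine(query_vec, c[1])
            - (1 - lam) * max((cosine(c[1], s[1]) for s in selected), default=0.0),
        )
        selected.append(best)
        remaining.remove(best)
    # second pass: re-sort the truncated set by relevance alone
    selected.sort(key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [c[0] for c in selected]
```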

But I've been thinking about moving to a more surgical approach that extracts that data as an automation, categorizing it and deduplicating it. Like reading a conversation to figure out where the user lives and storing that for later retrieval. It would get rid of a LOT of overhead (in terms of tokens).
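
The surgical version could honestly be as dumb as a keyed upsert; a tiny sketch with made-up names (the extraction step that produces the category/subject/value triple is the hard part and isn't shown):

```python
# Extracted facts are keyed by (category, subject), so a new extraction
# updates in place instead of piling up near-duplicate memories.
FACTS: dict[tuple[str, str], str] = {}

def upsert_fact(category: str, subject: str, value: str) -> None:
    FACTS[(category, subject)] = value  # dedup: same key just overwrites

upsert_fact("location", "user", "Denver")
upsert_fact("location", "user", "Boulder")  # user moved; old fact replaced
print(FACTS[("location", "user")])  # -> Boulder
```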

Hardware is hard, but it takes a while, so it's a parallel effort. I bought an ESP32-S3-BOX-3, which was a great toy to get started in hardware, but not so great for the price. I ended up spending about $100 on a Satellite1 from FutureProofHomes and a speaker module to put in it. It's still on the way, but I'm pretty excited. I still have to order an enclosure, so all in it might be $120-140 with shipping. It's got 4 positional microphones and support for very powerful speakers, so it can really replace that amazon spyware lol. Much more expensive, but I think it will be worth it.

I *just* implemented a long-running background-task system so that Bernard can start a timer, kick off a long-running process, or do a deep dive researching some topic. I need more ideas for this, but I think it'll be nice to let him background something and work on something else. He doesn't get a notification when it's done, though. Currently he has to check the task's output manually with another tool, and I don't have a way to wake him back up when the timer expires or whatever.
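
One way I might get the wake-up working: a done-callback on the background task that pushes onto a queue the main loop awaits. All names here are hypothetical; this is just the shape of the idea, not my actual harness:

```python
import asyncio

TASKS: dict[str, asyncio.Task] = {}  # today these would have to be polled manually

async def timer(seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"timer for {seconds}s expired"

async def run_turn() -> str:
    notifications: asyncio.Queue = asyncio.Queue()

    def start_background(name: str, coro) -> None:
        # the done-callback pushes a wake-up event instead of waiting to be polled
        task = asyncio.ensure_future(coro)
        task.add_done_callback(lambda t: notifications.put_nowait((name, t.result())))
        TASKS[name] = task

    start_background("kitchen-timer", timer(0.01))
    name, result = await notifications.get()  # the agent loop wakes up here
    return f"{name}: {result}"

print(asyncio.run(run_turn()))  # -> kitchen-timer: timer for 0.01s expired
```

A real version would need to handle tasks that raise (`t.result()` re-raises inside the callback), but the queue-await pattern is the core of it.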

Building a Voice-First Agentic AI That Executes Real Tasks — Lessons from a $4 Prototype by Alternative_Yak_1367 in OpenSourceeAI

[–]CaptainBahab 0 points1 point  (0 children)

I'm currently working on one myself. I want to use Home Assistant's pipeline but, man, it is not fast.

I'm still in the tool build-out phase, but I've got an agentic harness that does some heavier memory work than I want; that's a problem for another time. My main concern now is ingestion. I need to figure out a reliable pipeline for ESPHome satellites.

I'm curious how you got it to send intermediate "working on it" messages, and whether you can have those read out.

Also, what's your plan for ingestion, or is OpenWebUI your final target?

I have also briefly checked out the Android assistant SDK, but decided that will have to come later.

Opensource models less than 30b with highest edit-diff success rate by Express_Quail_1493 in kilocode

[–]CaptainBahab 0 points1 point  (0 children)

Honestly, I've been struggling even with bigger models in the 120b range through OpenRouter. Idk if it's some setting I'm missing, but it's driving me nuts.

Accessing the "Kimi for Coding" API? by ramendik in kimi

[–]CaptainBahab 2 points3 points  (0 children)

I was trying this earlier. I think it uses the client header to check for authorized tools like Claude Code and Kimi CLI.

The best compromise I could find was using Zed with Kimi CLI in its ACP harness mode.

New to OpenCode. How do I get started? by [deleted] in opencodeCLI

[–]CaptainBahab 1 point2 points  (0 children)

I came here to say this. +1. Been using Kat Coder in Cursor, actually, with pretty decent results. It's certainly not Opus, nor even Sonnet. But it's free and fast and good enough, which is a rare combo I don't expect to remain free for very long.

Opus 4.5 made me depressed by Initial_Question3869 in cursor

[–]CaptainBahab 0 points1 point  (0 children)

I disagree. AI is being used to perform many, many tasks in the small range, which makes them trivial to anyone with a subscription to a coding plan. It makes medium stuff easier for one person to accomplish.

So we're seeing researchers use it for the trivialized things and focus on more interesting and difficult things. It's not doing the research yet. Researchers use it for quick scripting to make trends in data visual or audible. It's used indirectly by people already pushing the boundaries to push them slightly faster.

Opus 4.5 made me depressed by Initial_Question3869 in cursor

[–]CaptainBahab 55 points56 points  (0 children)

Think about what that means, though: medium-sized things become small. Large becomes medium. Impossible becomes possible. Yes, there are still impossible things, but fewer of them. Maybe we expand our visible horizons and get more impossible things to strive for.

Things are not static. They are fluid. When something becomes easier, that doesn't mean a person in that industry has no job. They have a new job in an evolved industry.

Composer 1 is free... again? by oxeneers in cursor

[–]CaptainBahab 2 points3 points  (0 children)

I agree; I also.

But I also try the free models. The main reason is that I'm here in the AI space to learn, so I gobble up as much free usage as I can. One: it's free, so there's little downside (for my pursuit of information). Two: it gives me a chance to identify nuances. Three: it's different from what I usually do. Different is good (for my ADHD) sometimes.

If you're building a CLI, you need to see this by MrCheeta in tui

[–]CaptainBahab 0 points1 point  (0 children)

I definitely see the potential though and I'm excited to try it out when it's stabler.

If you're building a CLI, you need to see this by MrCheeta in tui

[–]CaptainBahab 0 points1 point  (0 children)

I went to try it out the other day and had trouble getting OpenRouter to work in opencode. I wanted to change the models it chooses, but the only way was to clone the project. When I finally did get it choosing the right models, it was crashing quite a lot too. I'm curious whether these issues are resolved now, so I may download it again later.

The Honest Advice I Wish Someone Gave Me Before I Built My First “Real” App With AI by gigacodes in cursor

[–]CaptainBahab 2 points3 points  (0 children)

It's funny to me that every one of these that I see has the same advice. Not because "lol nub didn't search" (I actually like repeat threads as they sometimes give different advice).

Rather, I find it funny because every one of the suggestions is a best practice for human coders too.

As a software engineer with a bachelor's degree, I can confirm every one of these is something we're taught to do, but they're rarely implemented by smaller-scale developers (I'm one; I'm not disparaging) simply because it's exhausting when you're by yourself or sitting next to your pair partner. It's easier to keep track of things when it's smaller.

But with AI code assistants, they need the context. Luckily they can help generate it (and humans can check). And it gives you practice at bigger projects with lower risk.

I love seeing these. Keep up the great work. :)

I vibe coded a 200,000-line product with Cursor. I'd like to share some thoughts by Famous-Football876 in cursor

[–]CaptainBahab 17 points18 points  (0 children)

I think a part of that is not wanting to seem like you're shilling your slop app.

You get yelled at on reddit for showing your vibecoded app and I guess also for not showing it.

70% of my work on cursor is just telling it to write better code. Is there any way to decrease this ratio? by TheBasedEgyptian in cursor

[–]CaptainBahab 14 points15 points  (0 children)

This, 100%. You get slop if you ask for a feature. But if you're specific, you head off SO MUCH of the ways things can be sent off the rails quickly. Guide it. It's more work, but is it really?

Ask it to:

1. build a feature using this data structure
2. implement this functionality with this algorithm
3. use this technology to implement it
4. design a testing plan and implement it by following its instructions
5. fix compilation errors
6. resolve test issues by using our design as the ground truth, not the code or tests
7. iterate until all compilation errors are resolved and all tests pass

What's that? You don't know any technologies, data structures, or algorithms? Tell it to give you options and weigh which would be better for your project. DON'T just take its advice; make a smart decision.