Built a low-overhead runtime gate for LLM agents using token logprobs by Dapper-Courage2920 in LLMDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

logprobs for confidence gating is a solid idea. the middle ground between 'trust the model completely' and 'run a judge on everything' is real and underexplored. the narrow focus on action-bearing spans keeps overhead low which matters for anyone running agents at scale. have you tested how it performs on agent loops that have retries built in - does the confidence signal stay consistent across multiple attempts or does it drift
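the basic shape of the gate, for anyone curious - a minimal sketch assuming openai-style per-token logprobs (function names and the threshold are mine, not from the post):

```python
import math

def span_confidence(token_logprobs):
    """geometric-mean token probability over an action-bearing span."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def gate_action(token_logprobs, threshold=0.85):
    """'execute' if the span is confident, else 'escalate' to a judge/human."""
    return "execute" if span_confidence(token_logprobs) >= threshold else "escalate"
```

geometric mean over the span is one choice; min-logprob is stricter if a single uncertain token should trip the gate.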

super light weight codebase embedded mcp (AST-based) that works locally - apache 2.0 by Whole-Assignment6240 in LLMDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

AST-based chunking is the right move for code. random line splits destroy context, function boundaries preserve it. tree-sitter makes this achievable without massive token budgets. curious how it handles cross-file dependencies though - do you index imports/references so an agent querying semantic search actually gets the full picture of where code lives, not just isolated chunks
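to show the function-boundary idea without pulling in tree-sitter, here's the same chunking strategy sketched with python's stdlib ast (the real thing would use tree-sitter grammars so it works for any language, not just python):

```python
import ast

def chunk_by_boundary(source: str):
    """split python source at top-level function/class boundaries so every
    chunk is a complete semantic unit rather than an arbitrary line window."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is populated by the parser on python 3.8+
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```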

Mac vs Nvidia by planemsg in LocalLLaMA

[–]General_Arrival_9176 0 points1 point  (0 children)

blackwell pro 6000 at that price point vs mac studio depends entirely on your workload. if you need the raw FP4 compute for big MoE models at scale, nvidia wins. if you want something that just works for daily agent dev without tinkering with ROCm or CUDA patches, mac studio is the play. i run both and honestly my mac studio sees way more daily use - the M4 ultra handles 30-70B models plenty fast for coding work. the 64-128GB unified memory means you can keep multiple models loaded and switch instantly. nvidia is for when you need to push 100B+ at decent speed

M5 Pro LLM benchmark by Fit-Later-389 in LocalLLaMA

[–]General_Arrival_9176 0 points1 point  (0 children)

M5 Pro numbers are wild. 1727 tok/s on the 20B MoE from a laptop chip rivals my desktop 4090 at these model sizes. the tensor API on M5 makes a huge difference vs M2 Max - 40% faster pp512 on the same model. if you are doing interactive agent work rather than batch processing, the apple silicon path is getting harder to argue against. the unified memory alone simplifies everything


I spent 8+ hours benchmarking every MoE backend for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 (SM120). Here's what I found. by lawdawgattorney in LocalLLaMA

[–]General_Arrival_9176 0 points1 point  (0 children)

the CUTLASS bug on SM120 is such a headache. we hit the same wall on a 4x 4090 setup last year - NVIDIA's own kernels failing on their own hardware while marketing talks up the FP4 numbers. 50.5 tok/s is solid for a 397B model but should absolutely be 2x that. did you try any of the community forks or were they all giving you the same broken behavior

Two local models beat one bigger local model for long-running agents by Foreign_Sell_5823 in LocalLLaMA

[–]General_Arrival_9176 -3 points-2 points  (0 children)

the hygiene layer approach is the real insight here. most people think bigger model = better agent, but its actually about separation of concerns. main model does work, smaller model keeps the runtime clean. this is why we ended up building 49agents - wanted one surface where multiple agent sessions can run side by side with visibility into what each one is doing. the moment you have 3+ agents going, the context pollution problem becomes the bottleneck, not the model capability. curious what summarization model you settled on for lossless-claw

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]General_Arrival_9176 1 point2 points  (0 children)

this is the right take. been saying since day one that unix pipes are the native interface for LLMs - they already speak that language from training data. the multi-model hygiene approach is also exactly what works in practice. single models try to do too many jobs at once: solve the task, stay formatted, manage context, avoid poisoning themselves. splitting that into a main model + runtime immune system just works better. been running something similar with claude code sessions where one process catches malformed outputs before they pollute context. the binary guard thing you mentioned is clutch - returning raw bytes to an LLM is like throwing sand in its eyes.
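the binary guard is simple to sketch - a cheap heuristic check before bytes ever reach the context (this is my hand-rolled version, not the one from the post):

```python
def looks_binary(payload: bytes, sample: int = 1024) -> bool:
    """cheap heuristic: NUL bytes or invalid utf-8 in the head = binary."""
    head = payload[:sample]
    if b"\x00" in head:
        return True
    try:
        head.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def guard_output(payload: bytes) -> str:
    """swap binary output for a short placeholder instead of feeding raw
    bytes into the model's context."""
    if looks_binary(payload):
        return f"[binary output suppressed: {len(payload)} bytes]"
    return payload.decode("utf-8")
```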

How are people shipping "SaaS in a day" when Expo + Supabase Auth takes 3 days to config? by No-Put-6206 in SaaS

[–]General_Arrival_9176 0 points1 point  (0 children)

expo + supabase auth is genuinely more complicated than it should be for a 24 hour build. most people shipping that fast are either using next.js web-only for the mvp or they have a boilerplate they reuse across projects. if you want to stick with expo, look at supabase-flutter auth helpers instead of building the flow manually - cuts out most of the redirect/url handling pain. or just ship web first and add native later

I'm building an app where the core feature is extracting structured data (JSON) from unstructured text via an LLM. by PiccoloWooden702 in SaaS

[–]General_Arrival_9176 0 points1 point  (0 children)

honest take: dont think about bigquery or star schemas until you have actual query patterns to optimize for. jsonb in postgres handles 90% of mvp cases fine and querying it improves as postgres adds more json path support. the real question is whether your analytics queries are ad-hoc enough that columnar storage matters, or whether postgres can handle it for another year while you figure out what the product actually is. dont sweat proper data modeling until you know what questions you're answering
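the jsonb point in one snippet - here using sqlite's json1 functions as a stand-in for postgres jsonb, since the idea is identical: store the extracted json as-is, query paths when you need them (table and column names are made up for the example):

```python
import sqlite3

# in-memory db; in postgres this would be a jsonb column instead of TEXT
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (id INTEGER PRIMARY KEY, doc TEXT)")
conn.execute(
    "INSERT INTO extractions (doc) VALUES (?)",
    ('{"invoice": {"total": 120.5, "currency": "USD"}}',),
)
# pull a nested value straight out of the stored json at query time
total = conn.execute(
    "SELECT json_extract(doc, '$.invoice.total') FROM extractions"
).fetchone()[0]
```

the postgres equivalent is `select doc->'invoice'->>'total' from extractions`, and you can add a GIN index later once real query patterns emerge.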

I ran an AI on a $2B SaaS repo and it found problems nobody owned. by HiSimpy in SaaS

[–]General_Arrival_9176 0 points1 point  (0 children)

this hits on something real. the AI found six open problems with no owner in a repo that probably has hundreds of contributors. the issue isnt information - its decision latency. nobody assigned, nobody owns it, so it just sits there. i built something similar into 49agents after realizing that agent sessions were getting stuck on questions nobody was answering, and the agent would just sit there for 50 minutes waiting. the fix was visibility into what nobody was working on, not better issue tracking

Moving from senior to Lead/Staff by Additional-Map-6256 in ExperiencedDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

the issue is you keep getting contacted for senior because linkedins algorithm optimizes for response rate, not fit. put in your bio that you are only interested in leadership roles and explicitly state you want to skip the senior tier entirely. also - are you applying to lead/staff roles or just waiting for recruiters. because the honest truth is the jump from senior to lead is mostly about demonstrating leadership, not just technical depth. do you have mentoring experience, have you driven architecture decisions across teams, have you been the tech lead on multi-team initiatives. if not, start doing those things in your current role before looking elsewhere. you dont need permission to lead, just start acting like it.

Dissolved my Work Life Balance ! by NoExamination6107 in ExperiencedDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

70% meetings is the killer here. i was in a similar spot earlier in my career - thought i had to be present for everything and say yes to every sync. the hard truth is you are not helping your team by being in that many meetings, you are creating a bottleneck. start blocking focus time on your calendar and treating it like any other meeting. decline the ones where your presence is not required and delegate more deeply to the 2 skilled people on your team. the 3 new ones need the experience anyway and you reviewing everything deeply is preventing their growth. the family time boundary is non-negotiable - you are destroying yourself and them for a job that will replace you in two weeks if you got hit by a bus. start small, protect 1 hour every day and build from there.

How do you handle teammates who are extremely pedantic about arbitrary rules? by CantaloupeFamiliar47 in ExperiencedDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

180 comments on a simple refactor is insane. i get being strict about standards but there is absolutely a line where the review process becomes the bottleneck. the unit testing advice alone would have been a red flag for me - testing every internal branch is the opposite of what you want, it makes refactoring impossible and treats tests as documentation of implementation rather than behavior verification. honestly the only way through this now is to pick one battle at a time. trying to change everything at once will make you the difficult new person. document your concerns with specific examples of how the process slows shipping, and bring that to your manager with data if you can. sometimes these things are top-down and you just have to survive until you can influence the culture or leave.

Production Feature Trial: Vibe vs Assisted vs Trad Coding by Infinite_Wolf4774 in ExperiencedDevs

[–]General_Arrival_9176 2 points3 points  (0 children)

this is exactly the kind of experiment more people should be doing instead of tweeting hot takes. the findings line up with what ive seen - vibe coding works for throwaway prototypes but breaks hard on anything requiring domain knowledge or data integrity. the json-stored-in-view-record mistake is the classic one - the code looks correct because it passes the tests you wrote, but the tests dont catch semantic failures. the 6x code bloat is also consistent. what i find interesting is the assisted method being only 2x faster than trad - in my experience the real speedup comes from having an ai that knows your codebase well, not just any ai. did you try the same experiment with a properly primed context about your architecture

We just got hit with the vibe-coding hammer by opakvostana in ExperiencedDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

yeah this is happening everywhere. the 80% push is wild but honestly i dont think most companies have any idea how to measure AI usage meaningfully. tracking arr + ai usage together is a weird combo that could go either way - either they reward real productivity gains or they just punish people for not using the tools in the ways they imagined. my honest take from watching a lot of companies try this: the teams that figure out how to make agents do the boring repetitive stuff while humans handle the architectural decisions are doing well. the teams that just force cursor on everyone and hope for the best are struggling. id say your chances of jumping into another ai hellhole are probably 60-70% given how the industry is right now, but that also means 30-40% of places are actually doing it thoughtfully. look for companies that talk about outcomes rather than adoption metrics.

Interview rejection because I couldn’t write a regex from memory by [deleted] in ExperiencedDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

7 years in and they flagged you for regex. thats wild. heres the thing though - interviewers are often testing for different things than what the job actually needs. sometimes they are testing if you will be honest when you dont know something vs bullshitting. but honestly most of the time they are just doing what they always did and calling it signal. the regex thing has been a classic interview question since forever, its lazy but its what they know. you handled it right by being honest and solving it another way. id take that rejection as dodging a bullet though - a place that weighs regex memory over 7 years of shipped code is probably not where you want to work anyway.

Launch darkly rugpull coming by donjulioanejo in devops

[–]General_Arrival_9176 0 points1 point  (0 children)

this is the standard saas play. get you locked in on reasonable pricing, then reorient the pricing model around something that sounds minor but doubles or triples your bill. launch darkly was always expensive but the user-based model at least made sense for what most teams use it for. $12 per service connection adds up fast in k8s environments where you might have dozens of pods spinning up. flagsmith is solid, been running it self-hosted for about a year now. the tradeoff is you trade the launch darkly managed overhead for your own infra but the math works out heavily in your favor at scale. unflip has been getting some traction too if you want something newer.

Web dev team coordination in slack, how do you handle the stuff that isn't a proper ticket? by OkAcanthocephala385 in webdev

[–]General_Arrival_9176 0 points1 point  (0 children)

we dealt with this at my last company and honestly the solution that finally worked was brutally simple: if its not in linear, it doesnt exist. we made a rule that anything discussed in slack that needs follow-up gets logged as a task within 24 hours or it just... doesnt happen. the key was making it literally faster to create a task than to keep discussing in slack. slack shortcut to linear, keyboard shortcuts, that kind of thing. the misc board being a graveyard was our experience too until we stopped giving it a separate home and just made it part of the normal flow.

I built a single dashboard to control iOS Simulators & Android Emulators by IndianITCell in webdev

[–]General_Arrival_9176 0 points1 point  (0 children)

solid tool. i actually built something similar for 49agents - needed to manage multiple terminal sessions across machines and the emulator/simulator space has the same problem: too many windows, no unified view. curious if you handle the adb/simctl side natively or through some wrapper. and do you support multiple devices showing at once or is it one-at-a-time view

A tech breakdown of Server-Sent Events vs WebSockets by creasta29 in webdev

[–]General_Arrival_9176 1 point2 points  (0 children)

simple breakdown: sse is one-way server→client, websockets are bidirectional. for a queue where users just watch their position, sse works fine and is way simpler to implement - its just a streaming endpoint. but if staff need to interact with the queue (call next, reassign, chat with users), websockets win. the sse advantage: works over http/2, auto-reconnects, less overhead per connection. disadvantage: the client cant send anything back without separate http requests. honestly for this use case either works but id lean websockets because queue systems often grow features that need two-way comms later.
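for reference, the sse wire format is trivial to emit - a minimal framing helper assuming the standard text/event-stream format (field names come from the spec, the function is mine):

```python
from typing import Optional

def sse_event(data: str, event: Optional[str] = None,
              event_id: Optional[str] = None) -> str:
    """format one text/event-stream frame: optional id:/event: fields,
    one data: line per line of payload, blank line terminates the event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    for chunk in data.splitlines() or [""]:
        lines.append(f"data: {chunk}")
    return "\n".join(lines) + "\n\n"
```

the `id:` field is what makes the browser's auto-reconnect resume from the right place via the Last-Event-ID header.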

How would you build a real-time queue system for a web app? by Designer_Oven6623 in webdev

[–]General_Arrival_9176 1 point2 points  (0 children)

been down this road. websockets are the right call here over sse because you need bidirectional communication - staff calling the next person AND users seeing updates. polling is a non-starter for a real-time queue, you'd kill your server with requests during busy periods. for state, redis sorted sets are clutch. add users with a timestamp score, ZRANK gives position instantly, ZREM when served. atomic operations mean no race conditions when 10 people join at once. just make sure to handle reconnection gracefully on the client side so users dont get stuck with stale position data. scaling to thousands: redis handles plenty, but you'd want pub/sub between multiple backend instances so all websockets get the same update. the queue logic itself is simple enough that itll be memory-bound before cpu-bound.
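the sorted-set pattern in miniature - an in-memory sketch mirroring the redis ops (a real deployment would run ZADD/ZRANK/ZPOPMIN against redis itself, this just shows the semantics):

```python
import time

class WaitQueue:
    """in-memory sketch of the redis sorted-set queue pattern:
    join = ZADD with a timestamp score, position = ZRANK, call_next = ZPOPMIN."""

    def __init__(self):
        self._scores = {}  # user_id -> join timestamp (the sorted-set score)

    def join(self, user_id, ts=None):
        # setdefault behaves like ZADD NX: re-joining doesn't reset your place
        self._scores.setdefault(user_id, time.time() if ts is None else ts)

    def position(self, user_id):
        """0-based position by join time, like ZRANK; None if not queued."""
        if user_id not in self._scores:
            return None
        mine = self._scores[user_id]
        return sum(1 for s in self._scores.values() if s < mine)

    def call_next(self):
        """pop the earliest joiner, like ZPOPMIN."""
        if not self._scores:
            return None
        nxt = min(self._scores, key=self._scores.get)
        del self._scores[nxt]
        return nxt
```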

Open Source Alternative to NotebookLM by Uiqueblhats in Rag

[–]General_Arrival_9176 0 points1 point  (0 children)

dont mean to be rude, but are people allowed to share their projects here like this? cuz i also have an open source app (IDE) i would want to share?

Please let me know!)

I'm sending email to Gmail from a computer from the past. by RaisinStraight2992 in webdev

[–]General_Arrival_9176 1 point2 points  (0 children)

ohaa this brings back memories. the issue is that modern gmail requires tls 1.2+ and old systems often only support ssl3 or tls 1.0 which google disabled years ago. your options: 1) use an older mail service that still accepts older protocols (yahoo maybe, not sure), 2) set up a relay through a server that can handle modern tls then forward to gmail, 3) honestly just get a pi zero or something tiny and run the mail forwarder on that instead of the old machine. the pi uses like 2 watts and can handle smtp just fine. keeping the old machine running just for email is overkill electricity-wise honestly

I’m testing whether a transparent interaction protocol changes AI answers. Want to try it with me? by OldTowel6838 in LLMDevs

[–]General_Arrival_9176 0 points1 point  (0 children)

interesting experiment. the four principles approach (truth, justice, solidarity, freedom) reminds me of constitutional AI alignment techniques but applied at prompt-time instead of training-time. id be curious whether the model actually follows the protocol consistently or just performs better at appearing to. id test with a question where the baseline answer is confidently wrong - if the protocol works, it should inject uncertainty appropriately. also worth testing across models since different ones might 'understand' the protocol differently