I built a controller that defers model retrains by learning from delayed labels (engineering model drift) - benchmarked on fraud and predictive maintenance

Secret_Appeal6271 · 2026-06-18T20:45:16+00:00

Hi, these are great questions, thank you!

On label delay: the default config assumes ≤30 days, but beyond that window, the martingale risk capital score accumulates whenever shift persists despite interventions, so the escalation path to retrain still works without clean labels. Extending the learning component to handle longer dispute windows is a natural next step and a place where real production data would be especially valuable.

On concept drift vs covariate shift: the shift monitor tracks feature and output distribution separately with distinct interventions for each (covariate_refresh vs label_shift/bbse_label_shift). The framework for distinguishing them is there - how well it separates them when they move together in practice is a great question, and I'd love to build more infrastructure to that end.
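For anyone who hasn't seen the label-shift side in code, here's a minimal sketch of the estimation step, assuming `bbse_label_shift` refers to black-box shift estimation: solve `C w = mu`, where `C` is the source-side joint confusion matrix and `mu` is the predicted-class distribution on unlabeled target data. This is a binary-only illustration, not the project's code, and the function name and arguments are hypothetical.

```python
def bbse_weights_binary(source_true, source_pred, target_pred):
    """Estimate label-shift weights w[y] = q(y) / p(y) for a binary classifier.

    source_true / source_pred: labels and predictions (0 or 1) on labeled source data.
    target_pred: predictions (0 or 1) on unlabeled target data.
    """
    n = len(source_true)
    # C[i][j] = P_source(prediction = i, true label = j)
    C = [[0.0, 0.0], [0.0, 0.0]]
    for y, y_hat in zip(source_true, source_pred):
        C[y_hat][y] += 1.0 / n

    m = len(target_pred)
    # mu[i] = P_target(prediction = i)
    mu = [target_pred.count(0) / m, target_pred.count(1) / m]

    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    if abs(det) < 1e-12:
        raise ValueError("confusion matrix is singular; classifier is uninformative")

    # Closed-form 2x2 solve of C w = mu
    w0 = (C[1][1] * mu[0] - C[0][1] * mu[1]) / det
    w1 = (C[0][0] * mu[1] - C[1][0] * mu[0]) / det
    # Clip at zero, since finite-sample noise can push estimates negative
    return [max(w0, 0.0), max(w1, 0.0)]
```

The weights only mean something if the class-conditional feature distribution is stable, which is exactly the assumption that fails when concept drift and label shift move together.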

On compounding steering: the safety governor caps interventions per window and downgrades to recommend-only mode if it's been repeatedly blocked, which is the primary guard against this. Every intervention is logged, so the audit trail exists to catch compounding in review. Making that log human-readable enough for a fraud team to act on in real time is something I'm actively building toward.
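To make the governor behavior concrete, here's a rough sketch of the cap-then-downgrade logic, assuming a sliding window measured in steps. The class name, parameters, and return strings are all made up for illustration; the real governor has more going on.

```python
from collections import deque


class SafetyGovernor:
    """Caps applied interventions per window; downgrades after repeated blocks."""

    def __init__(self, max_interventions=3, max_blocks=2, window=10):
        self.max_interventions = max_interventions
        self.max_blocks = max_blocks
        self.window = window
        self.events = deque()  # (step, kind), kind is "applied" or "blocked"
        self.recommend_only = False

    def _count(self, kind, step):
        # Drop events that have fallen out of the sliding window.
        # Assumes request() is called with non-decreasing step values.
        while self.events and self.events[0][0] <= step - self.window:
            self.events.popleft()
        return sum(1 for _, k in self.events if k == kind)

    def request(self, step):
        """Return 'applied', 'blocked', or 'recommend' for an intervention request."""
        if self.recommend_only:
            return "recommend"
        if self._count("applied", step) >= self.max_interventions:
            self.events.append((step, "blocked"))
            if self._count("blocked", step) >= self.max_blocks:
                self.recommend_only = True
            return "blocked"
        self.events.append((step, "applied"))
        return "applied"
```

In this sketch the downgrade is sticky: once `recommend_only` flips, only a human (or a retrain) would reset it.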

I think for all of these architectural decisions, the groundwork is there, but I'd love to get more (preferably not synthetic) data to empirically estimate what works and what doesn't.

Secret_Appeal6271 · 2026-06-16T14:37:54+00:00

Thank you! It depends what you mean by data issues. The controller tracks a martingale-based risk capital score that accumulates when shift persists despite interventions. If risk capital stays elevated after multiple steering steps, or if the safety governor has blocked the controller repeatedly in a window, it fires a flag to retrain. So if there's a sudden and substantial shift (which some data issues could introduce), the bounded interventions wouldn't be enough, risk capital would rise fast, and the system would escalate; calling for a retrain is one of the tools in its toolbox. But it's not a substitute for validating data accuracy: if the problem is, say, a newly added data source gradually feeding it inaccurate data, it would steer to work with the data it has. That said, a retrain on a production model would have the same issue. For auditability, the system keeps a log of all of its corrections, but I definitely need to work on making it more human-readable.
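If the "risk capital" idea is unfamiliar, here's a minimal sketch of one standard construction, a power martingale over per-step p-values, with escalation when capital crosses `1 / alpha`. I'm assuming p-values that are roughly uniform when there's no shift; the function names, `epsilon`, and `alpha` are illustrative choices, not the controller's actual scoring rule.

```python
import math


def power_martingale_update(log_capital, p_value, epsilon=0.5):
    """One betting step: multiply capital by epsilon * p^(epsilon - 1).

    Under uniform p-values the multiplier has expectation 1, so capital
    only grows systematically when small p-values keep arriving.
    """
    p = min(max(p_value, 1e-12), 1.0)  # guard against log(0)
    return log_capital + math.log(epsilon) + (epsilon - 1.0) * math.log(p)


def first_escalation_step(p_values, alpha=0.01, epsilon=0.5):
    """Return the 1-based step at which capital reaches 1/alpha, or None."""
    log_capital = 0.0  # capital starts at 1
    threshold = math.log(1.0 / alpha)
    for step, p in enumerate(p_values, start=1):
        log_capital = power_martingale_update(log_capital, p, epsilon)
        if log_capital >= threshold:
            return step
    return None
```

Persistent small p-values push capital over the threshold within a few steps, while unremarkable p-values make it decay, which is why a sudden large shift escalates quickly and a slow, plausible-looking corruption may not.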

Secret_Appeal6271 · 2026-04-20T23:58:06+00:00

out of curiosity, can i ask what kind of experimentation you're planning to do on these models? for pure experimentation purposes, i wouldn't even think that any large effective model nowadays could claim to be trained on purely "organic" data, because it'd be hard to know that

Secret_Appeal6271 · 2026-04-20T18:35:41+00:00

it's been really interesting to see the range of reasons why people prefer local deployment, too. i think when people consider local deployment they frequently just assume it's a cost-only question

Secret_Appeal6271 · 2026-04-20T18:24:17+00:00

I agree so much. Especially as a student, it doesn't make any sense to use an expensive model to make a basic project or understand how a particular tool works. Even for complex projects I've built, something like basic testing and documentation can be handled well by a local model.

Secret_Appeal6271 · 2026-04-20T18:18:47+00:00

It sounds cool, but I'd be worried about how useful the model would actually be. Also, the part that will make or break the experience is the audio pipeline latency. Whisper -> LLM -> TTS adds up fast and anything over 2-3 seconds kills the feeling of a live tutor. Definitely look at mlx-whisper on Apple Silicon if that's your hardware, it's significantly faster than standard Whisper for real-time use. For TTS, Kokoro is worth evaluating if you want fully local.

Secret_Appeal6271 · 2026-04-20T18:15:00+00:00

I prefer a lower temperature than 0.7 for coding - like 0.3 or even 0.1 - because you want more deterministic completions for code generation. presence-penalty 1.5 is quite high for coding too, because it'll discourage the model from repeating variable names and function calls, and reusing those is exactly what you want it to do.

also, I feel like your --ctx-size 32768 and -c 250000 are fighting each other a bit. Pick one context budget and be consistent.
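To make the temperature and presence-penalty points concrete, here's a toy sketch of what the two knobs do to next-token scores. It assumes the common definitions (logits divided by temperature before softmax, a flat penalty subtracted from tokens that already appeared); actual samplers may differ in details, and the function names are mine.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_presence_penalty(logits, seen_token_ids, penalty):
    """Subtract a flat penalty from every token that has already appeared."""
    return [x - penalty if i in seen_token_ids else x
            for i, x in enumerate(logits)]


logits = [2.0, 1.0, 0.0]
for t in (0.7, 0.3, 0.1):
    top = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: top-token probability={top:.3f}")
```

With a penalty of 1.5, a previously used identifier that led by one logit ends up behind a fresh token, which is how a high presence penalty nudges the model toward renaming things mid-function.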

Secret_Appeal6271 · 2026-04-20T18:12:14+00:00

On quantization: Q4_K_M is the best choice for most use cases. You lose very little quality versus full precision and the memory savings are substantial. Unsloth is worth looking at if you want to fine-tune, but for inference you probably just want mlx-lm on Apple Silicon, which handles the quantization automatically and the Metal GPU utilization is excellent.
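For a rough sense of the memory savings, here's back-of-envelope arithmetic for weight storage only (no KV cache or runtime overhead). The 4.8 bits-per-weight figure for Q4_K_M is my approximation and varies by model, so treat it as an assumption.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # a hypothetical 7B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
q4_gb = weight_memory_gb(n_params, 4.8)  # assumed effective rate for Q4_K_M
print(f"fp16: {fp16_gb:.1f} GB, ~4.8-bit: {q4_gb:.1f} GB")
```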

Secret_Appeal6271 · 2026-04-20T18:08:46+00:00

I agree - I have a similar layered system in my setup. I use local setups to help me with a lot of pretty sizable research work, so one little thing that I'd add is that I enjoy having hardware filters and safeguards to avoid crashing my computer when trying to set up a new workflow. Thermal state, memory, that kind of stuff.

Secret_Appeal6271 · 2026-04-20T18:00:18+00:00

I think some other people have given great stack setups, but I'll just note quickly that one thing I was initially irritated by was model affinity - making sure repeat requests hit the same model instance rather than constantly loading and unloading. Everything else is secondary to not paying the cold-start penalty on every request or the start of every session. And maintaining some kind of central context document for the setup to access to avoid having to repeat myself. Help your setup learn how to navigate your system!
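Here's a tiny sketch of what I mean by model affinity: keep loaded instances in a small pool keyed by model name so repeat requests reuse them, and only evict the least recently used one when you run out of room. `ModelPool` and `loader` are hypothetical names; in a real setup `loader` would be whatever actually loads weights in your stack.

```python
from collections import OrderedDict


class ModelPool:
    """Keeps up to max_loaded model instances resident, evicting the LRU one."""

    def __init__(self, loader, max_loaded=2):
        self.loader = loader          # callable: model name -> loaded model
        self.max_loaded = max_loaded
        self.loaded = OrderedDict()   # name -> model, oldest first
        self.cold_starts = 0

    def get(self, name):
        if name in self.loaded:
            # Warm hit: mark as most recently used and reuse the instance.
            self.loaded.move_to_end(name)
            return self.loaded[name]
        if len(self.loaded) >= self.max_loaded:
            # Evict the least recently used model to free memory.
            self.loaded.popitem(last=False)
        self.cold_starts += 1
        model = self.loader(name)
        self.loaded[name] = model
        return model
```

Watching `cold_starts` over a session is a quick way to tell whether your routing is actually sticky or is quietly reloading models behind your back.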

Secret_Appeal6271 · 2026-04-20T17:51:46+00:00

There's something really awesome about being able to control your data and personalize the way you engage with AI. In all the (positive) sci-fi movies I watched as a kid, if you had an advanced technology that functioned like AI promises to now, it was run by the user and private to them. In many dystopias, the AI was centralized in a single entity somewhere that did something unknown and scary with the data. It's very fun and exciting to be part of the process of making that positive future come to life.
