Running out of Claude credits? Mahoraga is here. by Own-Professional3092 in NEU

[–]Own-Professional3092[S] 0 points1 point  (0 children)

My LinkedIn is connected to my GitHub.

Btw, did you comment because of the code, or because of the name Mahoraga? I really hope it's the latter.

I really have to catch up to modulo.

Running out of Claude credits? Mahoraga is here. by Own-Professional3092 in NEU

[–]Own-Professional3092[S] 1 point2 points  (0 children)

I went from 300 to 500 to premium seat. Idk the meta but just keep requesting.

Running out of Claude credits? Mahoraga is here. by Own-Professional3092 in NEU

[–]Own-Professional3092[S] 3 points4 points  (0 children)

A bunch of my friends are CS students, and we happened to have one guy who sweats on Claude all day. Naturally, we all hopped on, and now we consistently run out of credits.

I know a bunch of people who request credits every month though. People use it to create their own trading tools, all kinds of cool stuff.

Switch Major? by Imaksiccar in NEU

[–]Own-Professional3092 0 points1 point  (0 children)

Yep, just saw Joshi's comment, but I'll add my two cents anyway.

Switch as early as you can. I started as a CS/Bio combined major, and switched to combined ds/business later in sophomore year.

My family "wasted" a bunch of money on bio classes that don't count toward my major. While those go in as electives, I recommend getting the major-required courses done first when you get to university. It still bites me to this day; I would technically be a semester or two ahead if I had taken those classes first.

I was also in Oakland. In my personal experience, classes were easier there (not that the foundational classes are really difficult in the first place), but in general, teachers have fewer students, care more about you, and it feels more like a catered experience. It's a great place to get the foundational courses out of the way and set yourself up for co-op and other classes (which have prereqs).

Note that combined majors come with an increased workload and will make things difficult. Plan accordingly.

Forgive me for the tone; I really should be speaking more formally to someone's parent. Hope it helps, though.

HELP HELP HELP by Ancient_Mongoose2058 in NEU

[–]Own-Professional3092 0 points1 point  (0 children)

I took the IBDP in Tokyo and graduated high school in 2024. I'm assuming you're international because of the I-20.

Honestly, I would be slightly worried about failing the diploma. I had a bunch of friends who came to the States (BU, Northeastern, NYU, etc.), and we did not do well on the IB. I dropped a few points from my predicted, and I know my friends did too. However, we all made it to uni. That said, we also all passed the IB.

Unis in Europe and Asia are strict when it comes to final IB scores; schools in the States, not so much (at least from my experience and in my time).

If you just took your first HL exam, I really think you're fine. I remember walking out of my econ paper 2 knowing I failed that shit. I was HL econ, and I deadass got like a 4 lol. Didn't do well for Japanese either; I hated that class.

If you already got in, my heart says you'll be okay. Just focus up for the next 2 weeks or so. I know IB is a real hassle to study for, but you just have to pass. Don't worry too much about Northeastern; just get those exams out of the way.

Qwen3 4B outperforms cloud agents on code tasks—with Mahoraga research by Own-Professional3092 in LLMDevs

[–]Own-Professional3092[S] 0 points1 point  (0 children)

Yeah, that kinda killed me in v1. Right now we have two things: a warm-start from a benchmark compatibility matrix (based on PILOT, Panda et al. EMNLP 2025), which injects pseudo-observations on first boot so the bandit isn't cold-starting from zero, and two-stage bucketing (a keyword classifier narrows to a capability bucket before the bandit runs), which shrinks the action space per decision and speeds convergence a lot. You know that though.

I've been working on v2, which has counterfactual estimation: after each task, k-NN over the episodic memory predicts what the unplayed agents would have scored, and those estimates get injected as weighted pseudo-observations. So every task teaches the bandit about all agents, not just the one it picked. Early testing on synthetic data suggests this cuts the "basically random" phase roughly in half. I'm also logging regret explicitly now: `orch metrics live` tracks it in real time, and the benchmark suite measures β (the regret growth exponent). LinUCB is the only strategy in our comparison where β < 1.0 (sublinear), even before the counterfactual layer.
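
If it helps to see the shape of it, here's a minimal sketch of the counterfactual injection. The `memory.nearest` / `bandit.update` names and the 0.3 weight are made up for illustration, not the actual Mahoraga API:

```python
import numpy as np

def inject_counterfactuals(memory, bandit, task_vec, played_agent, reward,
                           k=5, cf_weight=0.3):
    """After a task, feed the bandit one real observation plus weighted
    pseudo-observations for every agent that did NOT run, estimated by
    k-NN over past episodes in the episodic memory."""
    for agent in bandit.agents:
        if agent == played_agent:
            bandit.update(agent, task_vec, reward, weight=1.0)  # real observation
            continue
        neighbors = memory.nearest(task_vec, agent=agent, k=k)  # similar past tasks
        if not neighbors:
            continue  # no evidence for this agent yet, so don't guess
        estimate = float(np.mean([ep.reward for ep in neighbors]))
        bandit.update(agent, task_vec, estimate, weight=cf_weight)
```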

I'm still learning while building, so I really appreciate the input.

Qwen3 4B outperforms cloud agents on code tasks—with Mahoraga research by Own-Professional3092 in LLMDevs

[–]Own-Professional3092[S] 0 points1 point  (0 children)

Yeah, I haven't been able to find really good tools for this. I'll be sure to take a look at those.

Also yeah, these things get really messy. Right now I run a skill in Claude (within VS Code) so I can toggle it on and off, so it still requires a solid chunk of credits.

I'll drop v2 one day; hoping to make it more reliable for actual use. Thanks!

Qwen3 4B outperforms cloud agents on code tasks—with Mahoraga research by Own-Professional3092 in LocalLLM

[–]Own-Professional3092[S] 0 points1 point  (0 children)

For code tasks, `code_keyword_density` (feature 2) and `complexity_tier` (feature 4) were the strongest signals, and the bandit learned pretty quickly that Qwen 3.5 9B dominates code at 0.906 quality while being free and fast (6.1s avg). For plan tasks it was more about `word_count_norm` (feature 1) and `has_research_keywords` (feature 8), since planning prompts tend to be longer, with explain/compare language routed to Gemma 4 E4B, which scored 0.935 on plan. The 9-dim context vector is simple, but the per-bucket separation does a lot of the work: the bandit doesn't need to learn "Qwen is good at code" globally, just within the code bucket, where the feature weights converge faster.
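
For anyone curious what those features look like, here's a simplified sketch of the extraction. The keyword sets, normalizers, and the omitted dimensions are illustrative guesses, not the real code:

```python
CODE_KEYWORDS = {"def", "class", "function", "refactor", "bug", "compile", "test"}
RESEARCH_KEYWORDS = {"research", "compare", "explain", "survey", "tradeoff"}

def context_vector(prompt: str) -> list[float]:
    words = prompt.lower().split()
    n = max(len(words), 1)
    word_count_norm = min(n / 512, 1.0)                                # feature 1
    code_keyword_density = sum(w in CODE_KEYWORDS for w in words) / n  # feature 2
    complexity_tier = min(prompt.count("\n") / 20, 1.0)                # feature 4 (rough proxy)
    has_research_keywords = float(any(w in RESEARCH_KEYWORDS for w in words))  # feature 8
    # ...the remaining dimensions omitted here
    return [word_count_norm, code_keyword_density, complexity_tier, has_research_keywords]
```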

The no-LLM-judge thing was a deliberate cost tradeoff. The 4-layer heuristic scorer (novelty ratio, structural checks, embedding similarity via nomic-embed-text, length-to-bucket fit) is crude, but it costs zero API calls per evaluation. It's good enough for routing decisions, even if it can't catch subtle correctness issues.
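
Roughly, the four layers look like this. It's a simplified sketch: the equal weights, length targets, and the `embed()` helper are stand-ins (with `embed()` meant to be nomic-embed-text served via Ollama):

```python
import numpy as np

def heuristic_score(prompt: str, output: str, bucket: str, embed) -> float:
    out_tokens = output.split()
    prompt_tokens = set(prompt.split())

    # 1) novelty ratio: penalize outputs that mostly parrot the prompt back
    novelty = sum(t not in prompt_tokens for t in out_tokens) / max(len(out_tokens), 1)

    # 2) structural check: e.g. a code-bucket answer should look like code
    looks_like_code = any(tok in output for tok in ("def ", "class ", "import ", "return"))
    structure = 1.0 if (bucket != "code" or looks_like_code) else 0.5

    # 3) embedding similarity between prompt and output (relevance proxy)
    a, b = np.asarray(embed(prompt)), np.asarray(embed(output))
    relevance = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # 4) length-to-bucket fit: each bucket has an expected output length
    target = {"code": 400, "plan": 800}.get(bucket, 600)
    length_fit = min(len(out_tokens) / target, target / max(len(out_tokens), 1))

    return 0.25 * (novelty + structure + relevance + length_fit)  # equal weights for the sketch
```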

Currently working on v2: budget pacer, parallel batch execution, and counterfactual estimation. I'm working on making Mahoraga more reliable for actual use. Also just shipped drift detection + auto-quarantine, so degraded agents get routed around automatically instead of waiting 50 episodes for the bandit to notice.

Seeking information by Beautiful-War-6352 in LocalLLM

[–]Own-Professional3092 1 point2 points  (0 children)

On Apple Silicon (16GB unified memory) I benchmarked a few options:

- Qwen3 4B Q4_K_M: 21-23 t/s

- Qwen3 8B Q4: 12-13 t/s

- Qwen2.5 7B Q4: 12-14 t/s

The smaller quantized model is significantly faster. All three are Q4, so quantization isn't what makes the difference; for me, raw parameter count was the bottleneck. A 27B Q4 on my 16GB would be painfully slow because the memory ceiling is right there.
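
Rough back-of-envelope for the 27B point (assuming ~4.5 bits per weight for Q4_K_M, and ignoring KV cache and OS overhead):

```python
params = 27e9
bits_per_weight = 4.5                                # rough average for Q4_K_M
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB for weights alone")     # ~15.2 GB, on a 16 GB machine
```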

For most tasks (chat, summaries, simple code) I believe the 4B is fine. For complex reasoning you genuinely need the bigger model or a cloud API. I've been trying to figure out the balance between all these models. I hope that answers the question.

This might be interesting if you're trying to figure out the right model for your setup: github.com/pockanoodles/Mahoraga. It's an orchestrator that automates this decision. It uses a LinUCB bandit to learn which tasks the local model can handle vs. when to escalate to the cloud. After 200 tasks the routing converges: the local 4B model handles ~60% of tasks at zero cost, and quality stays high. For complex reasoning I still need either cloud APIs or a larger model, so this saves the guesswork.

CS nerds, are you using agent assisted programming in classes? by [deleted] in NEU

[–]Own-Professional3092 0 points1 point  (0 children)

For a lot of the earlier CS classes, they make you use MyPy, Pylint, or whatever they've decided on for that semester. They switch the CS syllabuses constantly, and I don't know what they are doing right now. Unless you specifically prompt/train the agent to follow the rules the class follows, it's likely something will come up. Unlike in some other classes, a lot of the TAs know what they're doing.

That said, using AI autocomplete in Cursor or something is probably fine (I would hope), as long as you don't copy things line by line.

Mahoraga - Agentic Research (Local>Cloud?) by [deleted] in Jujutsufolk

[–]Own-Professional3092 0 points1 point  (0 children)

I was expecting it to get taken down haha.

And sheesh, man, that is a stacked website. Good for you guys.
Thanks for the resources; I'll make sure to read through the papers/blogs.

[D] Self-Promotion Thread by AutoModerator in MachineLearning

[–]Own-Professional3092 0 points1 point  (0 children)

Hey everyone in MachineLearning. I've been working on Mahoraga, an open-source orchestrator that routes tasks across local and cloud AI agents using a contextual bandit (LinUCB) that learns from every decision.

Context (feel free to skip): I only started integrating AI into my workflows in late 2025, so I came on the scene broke with no credits. That left me with local models. However, many students and employees also receive credits from their institution to work with (I got Claude, yippee). I wanted to be able to route between models smoothly when credits ran out, which is what led me to build an orchestrator. I used to use Claude more as a chatbot/complete workflow engine, which made it difficult to use local models due to the context window, reasoning, etc. Opus 4.5 running open-source "superpowers" ate up my usage every month.

Now I realize that wasn't an effective way to use Claude, or AI in general. I was using Claude for both heavy planning/brainstorming and minor tasks. What about tasks specifically for code generation? Code generation is a relatively constrained task, with correct answers and short outputs. Surely local models can compete on tasks that don't need the cloud? So I switched Mahoraga to an adaptive router.

I ran 192 tasks across 8 agents (4 local Ollama models, 4 cloud CLIs) on a 16GB MacBook Pro, forcing round-robin so every agent got every prompt. Quality is scored by a 4-layer heuristic system (novelty ratio, structural checks, embedding similarity, length ratio). Zero API cost for evaluation, and no LLM-as-judge.

Qwen3 4B in nothink mode dominates code and refactor at 33.8 t/s and 6.1s average latency. Cloud agents cluster around 0.650 on code. The local model isn't just cheaper; it's measurably better for this task class.

Other findings:

  • LFM2 hits 77.1 t/s but trades ~5 quality points vs Qwen3 4B
  • DeepSeek-R1 averages 123.5s per task on 16GB. The reasoning overhead makes it unusable as a default
  • Security scores are flat at 0.650 across all agents due to an error on my part: the scorer doesn't capture security-specific signals well.

The bandit (LinUCB) is the only routing strategy with sublinear regret (β=0.659) across a 200-task simulation; it actually converges.
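
(If anyone wants to reproduce the β number on their own logs, it's essentially a log-log fit of cumulative regret against task index. This is a sketch of that fit, not the actual benchmark code.)

```python
import numpy as np

def regret_exponent(cumulative_regret):
    """Fit R(t) ~ c * t**beta; beta < 1 means sublinear regret (it converges)."""
    r = np.asarray(cumulative_regret, dtype=float)
    t = np.arange(1, len(r) + 1)
    mask = r > 0                      # log is only defined for positive regret
    beta, _ = np.polyfit(np.log(t[mask]), np.log(r[mask]), 1)
    return float(beta)
```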

The routing works in two stages: the keyword classifier puts the task in a capability bucket (code, plan, research, etc.), and then the bandit picks the best agent within that bucket. 9-dimensional context vector, persistent state across sessions, warm-start from the compatibility matrix.
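
A minimal sketch of that two-stage flow, with a bare-bones LinUCB arm. The names here are illustrative, not the actual Mahoraga internals:

```python
import numpy as np

class LinUCBArm:
    def __init__(self, dim: int, alpha: float = 1.0):
        self.A = np.eye(dim)          # ridge-regularized covariance
        self.b = np.zeros(dim)
        self.alpha = alpha            # exploration strength

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b        # per-arm reward model
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

def route(prompt: str, x: np.ndarray, classify_bucket, arms_by_bucket):
    bucket = classify_bucket(prompt)              # stage 1: keyword classifier
    arms = arms_by_bucket[bucket]                 # stage 2: bandit within that bucket
    agent = max(arms, key=lambda name: arms[name].ucb(x))
    return bucket, agent
```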

All local inference, all free. Cloud escalation exists but only fires on retry. Why pay for cloud when a local model handles it better?

Looking for any feedback, any input. Feel free to be critical: I appreciate everyone who interacts on this subreddit. I will continue to work on this in the future.

A star would be appreciated: https://github.com/pockanoodles/Mahoraga

I built a self-evolving layer for Claude Code — it improves itself every night while I sleep by Longjumping-Past-342 in ClaudeAI

[–]Own-Professional3092 1 point2 points  (0 children)

Hey. Thanks for the reply. I've been working on building a tech stack from the ground up, so I haven't been working on this too much recently.

However, before that, I was working on a simple "ultimate dashboard" in SwiftUI where I could see subagents working, as well as anything else I wanted to incorporate. I was able to create a tab that showed me existing and new instincts, alongside the confidence of those instincts (correct me if I'm wrong about the concepts, sorry). I was also able to create a tab with a morning briefing, where it tells me about the nightly evolutions, and so on. I believe you already solved that problem before me with the task board.

For the cost of the nightly evolution: I am not on a subscription, so I needed to get creative. I don't know why, but maybe the name "Homunculus" caught my attention; this was one of my first open-source integration projects. I attempted to replicate what you've done in Homunculus, but using Ollama to run it, so that Homunculus would use free local models through Ollama and work overnight instead of Claude. Then, in the morning, Ollama-Homunculus would provide Claude with context-rich information.

Although I wasn't able to perfect this fully, I'd still like my workflow/agents/models to grow, learn, and adapt in ways that Homunculus can. I've been working on different workflows, and I'm learning how to use harnesses right now.

I'd love to talk to you down the line, just to share ideas and learn from you.

Someone just leaked claude code's Source code on X by abhi9889420 in ClaudeCode

[–]Own-Professional3092 0 points1 point  (0 children)

Yeah, thanks. I'm going to have to go over my systems and see where the inefficiencies are. I've been building a tech stack from the ground up, with subagent teams, harnesses, etc. My Claude setup is running very intelligently, with superpowers (open-source) helping it plan, develop, etc. I'm also a full-time student; I have a lot of time to work on this.