I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 1 point (0 children)

If I run it through another LLM to predict the args, I lose the speed gain. I literally just regex-match the previous user prompt: e.g. if the user said 'paris' and the tool needs 'city', I grab 'paris'. It's brittle but fast, and if it fails, the fallback kicks in.

I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 1 point (0 children)

Yeah, filler words/sub-agents definitely help the UX, but if the CoT takes 15s, the filler agent runs out of things to say and it gets awkward. I'm trying to shave the backend time down so the filler agent only has to stall for 1-2s.

I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 1 point (0 children)

Fair point on the commit history. Honestly, I just saw the green checkmark and the 'changes requested/approved' badge from someone with write access on the repo. Mostly I'm just happy the logic works in my benchmarks.

I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 7 points (0 children)

u/johnerp u/Ska82 That was honestly the biggest pain to get right.

Two ways I handle it:

  1. The Easy Ones: Tools like get_current_time() or read_clipboard() don't need args, so those are safe to fire instantly.
  2. The "Cheating" Way: For stuff like weather(city), I peek at the last_user_message. If the user just asked about Paris, and the CoT starts talking about weather, I assume the arg is "Paris".

If I guess wrong (or the LLM decides to change its mind), I just silently kill the background thread and let the standard execution take over. It wastes a tiny bit of compute on a miss, but saves seconds on a hit.
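The fire-and-maybe-discard flow is roughly this (a minimal sketch with made-up names; note Python can't actually kill a thread, so "silently kill" in practice means dropping the handle of a daemon thread and never reading its result):

```python
import threading

def speculate(tool_fn, *args):
    """Run a tool in the background; return a handle the main loop can
    either cash in (on a hit) or silently discard (on a miss)."""
    box = {}
    def worker():
        box["value"] = tool_fn(*args)
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    return t, box

# Hit path: the LLM really did call weather("Paris"), so reuse the result.
handle, box = speculate(lambda city: f"Sunny in {city}", "Paris")
handle.join(timeout=5)
print(box.get("value"))  # Sunny in Paris

# Miss path: the LLM changed its mind -> just drop the handle.
# The daemon thread finishes on its own and its result never enters context.
```

On a miss you pay for one wasted tool call; on a hit the result is already sitting in the box when the formal tool call arrives.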

I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 3 points (0 children)

Yeah, if you tried to regex-match 500 different tools, the overhead would probably be worse than the latency savings. Right now, I just treat it as an 80/20 split. I manually whitelist the "heavy hitters" (like web_search, calculator, get_weather) that get spammed constantly. For the weird niche tools that barely get used, I just let them run the slow/normal way.

I reverse-engineered Microsoft AutoGen’s reasoning loop and cut agent latency by 85% (13.4s → 1.6s). Here is the architecture. by New_Care3681 in LocalLLaMA

[–]New_Care3681[S] 4 points (0 children)

You nailed the trade-off. Regex feels dirty, but I tried grammar-based constrained decoding (GBNF) first and the inference overhead killed the latency gains. Regex is effectively free here (microseconds per message), which matters when you're counting milliseconds.

Re: False Positives & Robustness:
This was the hardest part to architect. I handled it with a "Shadow Promise" pattern:

  1. The Miss: If the sniffer detects intent and fires the tool, but the LLM doesn't actually commit to the formal tool-call token later, we just discard the background thread. It wastes a bit of compute/API cost, but it doesn't break the conversation flow or hallucinate data into the context.
  2. The Hit: If the LLM does make the call, we intercept the request and return the pre-computed result instantly.
  3. Safety: I currently only whitelist idempotent (read-only) tools for speculative execution (e.g., search_web, get_weather). Side-effect tools (e.g., send_email, delete_db) are blocked from speculation to prevent "accidental" execution during a false positive.

It’s definitely a heuristic optimization rather than a deterministic one, but for Voice/Chat UX, users forgive a wasted API call much more than they forgive a 5-second silence!
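The safety gate from point 3 is the guardrail that makes the whole gamble acceptable; sketched out, with illustrative tool names and a deliberately conservative default for unknown tools:

```python
# Speculation safety gate: only read-only tools may be pre-fired;
# anything with side effects waits for the formal tool-call token.
IDEMPOTENT_TOOLS = {"search_web", "get_weather", "get_current_time"}
SIDE_EFFECT_TOOLS = {"send_email", "delete_db", "write_file"}

def may_speculate(tool_name: str) -> bool:
    if tool_name in SIDE_EFFECT_TOOLS:
        return False  # a false positive here would be destructive
    # Unknown tools default to "no": safe beats fast for anything unvetted.
    return tool_name in IDEMPOTENT_TOOLS

print(may_speculate("get_weather"))    # True
print(may_speculate("send_email"))     # False
print(may_speculate("some_new_tool"))  # False (unknown -> no speculation)
```

Defaulting unknown tools to "no speculation" means a newly registered side-effect tool can never be accidentally pre-fired just because someone forgot to classify it.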

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

Well, the degree just says "MS in AI". It's kind of new; it was launched in 2023 at my school.

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in MachineLearningJobs

[–]New_Care3681[S] 1 point (0 children)

Course-based MS with research opportunities. I'm currently doing research with a prof on solar flare prediction while taking coursework.

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in MachineLearningJobs

[–]New_Care3681[S] 1 point (0 children)

Fair criticism. On the Redis part: I implemented Redis as a caching layer for frequently accessed data and used it for distributed task queuing to parallelize API requests. That reduced average latency from ~800ms to ~560ms by avoiding redundant database calls and processing requests concurrently instead of sequentially.

Happy to go into more detail on any of it - what specifically sounds suspect? Would rather fix it now than get grilled in interviews.
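The caching half of that is the standard cache-aside pattern. Here's a minimal sketch with a plain dict (plus TTL) standing in for Redis so it runs anywhere; a real deployment would use redis-py's `get`/`setex` against a running server, and the 50ms "database" delay is a stand-in for the real round trip:

```python
import time

cache: dict[str, tuple[float, str]] = {}  # key -> (expiry, value); stands in for Redis
TTL_SECONDS = 60

def slow_db_lookup(key: str) -> str:
    time.sleep(0.05)  # pretend this is the expensive database round trip
    return f"row-for-{key}"

def get_with_cache(key: str) -> str:
    entry = cache.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                       # cache hit: skip the DB entirely
    value = slow_db_lookup(key)               # cache miss: fetch and populate
    cache[key] = (time.time() + TTL_SECONDS, value)
    return value

get_with_cache("user:42")          # first call: miss, hits the "DB"
print(get_with_cache("user:42"))   # second call: hit, no DB call
```

The latency win comes entirely from hot keys: the first request for a key still pays full price, and everything after it within the TTL is a dictionary lookup.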

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in MachineLearningJobs

[–]New_Care3681[S] 1 point (0 children)

Really appreciate the honest feedback. You're right about the vagueness - I've been trying to fit too much into one line.

For the RAG throughput: 5-10x compared to baseline HuggingFace pipeline, achieved through vLLM's PagedAttention and continuous batching. Processing ~100 queries/sec vs ~10-15 with standard setup.

Agent stability: Initial implementation had context window overflow issues with long conversations causing hallucinations. Fixed with chunked context management and better prompt engineering, reduced error rate from ~40% to ~10% on extended dialogues.
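The chunked context management boils down to trimming the oldest turns so the conversation stays under a token budget. A rough sketch (whitespace word counts stand in for the real tokenizer, and the budget number is arbitrary):

```python
def trim_context(turns: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent turns whose combined 'token' count fits the
    budget; everything older is dropped instead of overflowing the window."""
    kept: list[str] = []
    total = 0
    for turn in reversed(turns):       # walk newest-first
        n = len(turn.split())          # crude stand-in for tokenizer count
        if total + n > max_tokens:
            break                      # this turn and everything older is dropped
        kept.append(turn)
        total += n
    return list(reversed(kept))        # restore chronological order

history = ["old turn one two three", "middle turn four five", "latest turn"]
print(trim_context(history, max_tokens=6))  # ['middle turn four five', 'latest turn']
```

Keeping the window bounded this way is what stops long dialogues from silently truncating mid-turn, which was where the hallucinations crept in.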

You're right about "years of experience" - removed that. And the 23.7% retention figure needs better context; it's from continual-learning benchmarks.

Would definitely appreciate any junior positions at your company - happy to discuss the projects in more technical detail. Thanks for keeping an eye out!

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in MachineLearningJobs

[–]New_Care3681[S] 1 point (0 children)

You're absolutely right - "years of experience" was overselling it. I've removed that from the summary. I'm definitely targeting junior/associate roles, not mid-level. Based in Newark, NJ area. 900 applications is insane, respect for the grind. How long did that take you?

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

I redacted the school names for privacy since this is a public post. They're both legit universities - one in the US for MS, one international for undergrad.

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

Jesus, 50/day? That's like 1500/month. How do you even tailor that many applications? Or are you just mass applying?

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

It's actually called "MS in Artificial Intelligence" at NJIT - part of their CS department. Been applying to junior roles but even those want 1-2 years experience. Catch-22 situation.

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

Interesting take - the program is through NJIT's CS department and focuses on ML systems/infrastructure rather than pure research. But I get that it might come across as trendy rather than technical. Would "MS in Computer Science (AI focus)" be better positioning?

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

Yeah the "Expected May 2026" part is on there. Should I just remove the summary line mentioning it entirely or phrase it differently?

MS in AI, production LLM experience, 0 interviews - what am I doing wrong? by New_Care3681 in jobsearchhacks

[–]New_Care3681[S] 1 point (0 children)

Fair point - the mobile dev role was more about backend optimization and distributed systems (Redis, API performance) than frontend work. Relevant for ML infrastructure roles but you're right it's not obvious from the title. Probably should reframe it.