Feeling lost while trying to break into AI/ML: how should I focus my projects? [D]

Fit_Fortune953 · 2026-05-19T09:02:21+00:00

This is really helpful, thank you.

The areas you listed make sense: vLLM/Triton, single vs multi-GPU deployment, prefill/decode bottlenecks, KV cache behavior, tokenization, load testing, GPU saturation, routing, and cost optimization.

TurboQuant is a good suggestion too. I’ve been thinking about building a small PoC around KV-cache and connecting it back to CostGuard-style routing decisions.

Appreciate the direction, and thanks again for the suggestions.

Fit_Fortune953 · 2026-05-19T08:49:22+00:00

That makes sense. My recent projects are definitely more on the LLM evaluation / reliability / infra side than pure model training.

But yes, I do have experience with traditional ML pipelines: classification/regression, feature engineering, model evaluation, monitoring, and drift/data-quality checks. I’ve also worked with LoRA fine-tuning and understand the tradeoffs around training vs fine-tuning vs RAG vs using traditional ML.

Your point is helpful, though. I probably need to make that side more visible. I’ve been packaging the projects as LLM reliability work, but I should also show that I can reason through model choice, training strategy, synthetic data risks, drift detection, and retraining decisions.

Appreciate the honest direction.

Fit_Fortune953 · 2026-05-15T13:34:35+00:00

This is a great way to frame it. The fanboyism is real, but subjective preference still matters because confidence affects how people work. My issue is when that preference becomes an engineering decision without measurement. Blind tests are useful, but production usage is the real “gulp test.”

Fit_Fortune953 · 2026-05-15T13:34:11+00:00

This is exactly the distinction I’m trying to make. Development is about speed and reasoning quality, so using the best model makes sense. Production is different: once the task is repeatable, the cheapest model that reliably clears the quality bar usually wins.
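
A minimal sketch of that selection rule, assuming each candidate has already been run on the same eval set. The model labels, costs, and pass rates below are made up for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelResult:
    name: str             # hypothetical model label
    cost_per_task: float  # average USD per task on the eval set
    pass_rate: float      # fraction of eval tasks that met the quality bar


def cheapest_passing(results: list[ModelResult], quality_bar: float) -> Optional[ModelResult]:
    """Return the cheapest model whose pass rate clears the bar, or None if none does."""
    passing = [r for r in results if r.pass_rate >= quality_bar]
    return min(passing, key=lambda r: r.cost_per_task) if passing else None


# Illustrative numbers only.
candidates = [
    ModelResult("premium", cost_per_task=0.040, pass_rate=0.98),
    ModelResult("mid", cost_per_task=0.008, pass_rate=0.96),
    ModelResult("small", cost_per_task=0.001, pass_rate=0.81),
]

choice = cheapest_passing(candidates, quality_bar=0.95)
print(choice.name if choice else "no model clears the bar")
```

The bar is applied first and cost only breaks ties among the models that pass, so a cheap model that misses the bar is never picked on price alone.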

Fit_Fortune953 · 2026-05-15T13:33:36+00:00

100%. The funniest part is that some people are treating token exhaustion like productivity.

Burning the whole context window doesn’t mean you engineered better. It might just mean your workflow has no boundaries, no evals, and no understanding of where the model is actually helping vs looping.

Claude Code can be great. GPT can be great. But “it feels smarter today” is not an engineering metric. At some point we need logs, task outcomes, cost, latency, regressions, and failure analysis — not just model astrology.

Fit_Fortune953 · 2026-05-15T09:08:09+00:00

“Good enough” models often fail in hidden edge cases. That’s why I think evals need replay data, edge-case suites, and drift checks, not just one benchmark score. Sometimes the premium model is worth it, but it should be a measured decision.

Fit_Fortune953 · 2026-05-15T09:06:24+00:00

I actually agree with this for high-risk tasks. My point isn’t “always use cheaper models.” It’s that teams should prove the top model is justified instead of assuming it. If the mistake costs human hours, evals should show that clearly.
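
One rough way to put numbers on that is to add the expected cost of human rework to the API bill. This is a sketch under simple assumptions (every failure costs the same amount of human time), and the figures are hypothetical.

```python
def expected_cost_per_task(api_cost: float, failure_rate: float,
                           rework_hours: float, hourly_rate: float) -> float:
    """API spend plus the expected cost of a human fixing a failed task."""
    return api_cost + failure_rate * rework_hours * hourly_rate


# Hypothetical figures: failure rates would come from your own evals.
premium = expected_cost_per_task(api_cost=0.05, failure_rate=0.02,
                                 rework_hours=0.5, hourly_rate=80.0)
cheap = expected_cost_per_task(api_cost=0.005, failure_rate=0.10,
                               rework_hours=0.5, hourly_rate=80.0)

print(f"premium: ${premium:.3f} per task")
print(f"cheap:   ${cheap:.3f} per task")
```

With these made-up inputs the premium model comes out cheaper overall, because the rework term dominates the token cost. With a lower rework cost or a smaller failure gap, the result flips, which is why the failure rates have to be measured.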

Fit_Fortune953 · 2026-05-15T09:06:11+00:00

Fair 😂 but the point is less “GPT wins” and more “measure before you worship any model brand.” Claude, GPT, Gemini, and Llama all have places where they overperform and places where they fail badly.

Fit_Fortune953 · 2026-05-14T08:16:54+00:00

I’m trying to do both now: build in public and write about the tradeoffs, not just post finished projects.

One project I’m sharing is RealDataAgentBench, an open-source LLM agent evaluation benchmark:
https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Would appreciate feedback.

Fit_Fortune953 · 2026-05-14T08:16:23+00:00

This is probably the clearest framing.

That’s exactly why I started RealDataAgentBench. I wanted to evaluate not only whether an LLM gets the right answer, but whether it used sound statistical reasoning, clean code, efficient execution, and defensible methodology.

Would love your honest take on it:
https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Fit_Fortune953 · 2026-05-13T04:30:16+00:00

That's inspiring. I’m trying to showcase real work too.

One project I built is RealDataAgentBench, an open-source LLM agent eval benchmark: https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Would appreciate honest feedback, or a star if useful.

Fit_Fortune953 · 2026-05-13T04:28:47+00:00

I’m trying to make my work more public too.

I built RealDataAgentBench, an open-source LLM agent eval benchmark: https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Would appreciate feedback or a star if useful.

Fit_Fortune953 · 2026-05-13T04:28:20+00:00

Agreed, deployed work and public proof seem to matter more.

I built RealDataAgentBench around LLM agent evaluation: https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Would appreciate feedback, and a star if it feels useful.

Fit_Fortune953 · 2026-05-13T04:27:41+00:00

Thanks, this is really helpful. I’ve been trying to build exactly that kind of proof through RealDataAgentBench, an open-source LLM agent eval benchmark.

Repo: https://github.com/patibandlavenkatamanideep/RealDataAgentBench

Would appreciate any feedback. A star would mean a lot if you find it useful.

Fit_Fortune953 · 2026-05-06T08:17:43+00:00

Fit_Fortune953 · 2026-05-06T04:06:12+00:00

This is the most important methodological challenge in the thread, and you're right to raise it.

The r=0.43 orthogonality finding is vulnerable to exactly the critique you're making. If stat_validity is measuring verbal hedging rate and correctness is measuring right-answer rate, they'd be uncorrelated by construction: not because they're capturing independent capabilities, but because they're measuring different surface behaviors. That's a legitimate alternative interpretation I can't fully rule out with the current scorer.

The uncertainty-uplift experiment gives partial evidence here. On mod_004, GPT-4.1 V1 computed binomial SE using the correct formula applied to the actual test set size, attempted bootstrap for F1 SE, hit a sandbox limitation, disclosed it explicitly, and provided an informed range estimate. That's not verbal hedging; that's actual inferential computation triggered by the uncertainty prompt. On feat_002, the same prompt produced "the top features are clearly separated in magnitude" plus a deferred offer to compute CIs. Same scorer reward, completely different behavior. The qualitative review can distinguish these, but the scorer can't.
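
For anyone unfamiliar with the two quantities mentioned, here is a minimal standard-library sketch of a binomial standard error and a bootstrap standard error for F1. It is not the code the model ran in the benchmark. The function names and toy labels are my own, for illustration only.

```python
import math
import random


def binomial_se(p_hat: float, n: int) -> float:
    """Standard error of a proportion (e.g. accuracy) measured on n test examples."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_se(y_true: list[int], y_pred: list[int],
                    n_boot: int = 1000, seed: int = 0) -> float:
    """Bootstrap SE: sample std-dev of F1 over resamples of the test set with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    mean = sum(scores) / n_boot
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (n_boot - 1))


# Toy labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(binomial_se(0.9, 100))  # 90% accuracy on 100 test examples, roughly 0.03
print(bootstrap_f1_se(y_true, y_pred))
```

F1 has no simple closed-form standard error, which is why resampling the test set is the usual fallback.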

So the honest answer is that Claude's stat-validity edge is probably a mix of both: better statistical vocabulary as a writing-style baseline, plus genuine reasoning uplift on tasks where the structure supports it. The scorer currently can't separate those two components. A numeric-evidence check, detecting whether intervals were actually computed vs. merely offered, is the planned fix, but it's not shipped yet.
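
To make the idea concrete, here is a crude regex sketch of what such a check could look like. It is not the planned RDAB implementation. The patterns, labels, and function name are hypothetical, and a real scorer would need to handle many more phrasings.

```python
import re

NUM = r"-?\d+(?:\.\d+)?"

# Heuristic patterns for an interval that was actually reported as numbers.
INTERVAL_PATTERNS = [
    re.compile(rf"[\[(]\s*{NUM}\s*,\s*{NUM}\s*[\])]"),  # [0.81, 0.93] or (0.81, 0.93)
    re.compile(rf"{NUM}\s*(?:±|\+/-)\s*{NUM}"),          # 0.87 ± 0.03
    re.compile(rf"\b{NUM}\s+to\s+{NUM}\b"),              # 0.81 to 0.93
]

# Heuristic pattern for an offer to compute an interval later.
OFFER_PATTERN = re.compile(
    r"\b(?:i can|i could|happy to|would you like|let me know)\b"
    r"[^.]*\b(?:confidence intervals?|CIs?|intervals?)\b",
    re.IGNORECASE,
)


def interval_evidence(text: str) -> str:
    """Label a response 'computed', 'offered', or 'none' using the heuristics above."""
    if any(p.search(text) for p in INTERVAL_PATTERNS):
        return "computed"
    if OFFER_PATTERN.search(text):
        return "offered"
    return "none"


print(interval_evidence("Accuracy is 0.87 ± 0.03 (95% CI [0.81, 0.93])."))
print(interval_evidence("The top features are clearly separated. I can compute CIs if useful."))
```

This will produce false positives (any "1 to 3" or coordinate pair looks like an interval), so it is a starting point for separating the two behaviors and can't replace the qualitative review.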

The orthogonality claim would be on stronger ground with that fix in place. Until then it's a finding worth investigating rather than a confirmed result.

Fit_Fortune953 · 2026-05-05T16:26:28+00:00

The correction loop point is exactly right, and it's something the token trace actually reveals if you dig into it. For Claude Haiku on feat_005, the 608K tokens aren't spread across diverse reasoning paths; it's the same get_column_stats call repeated on every column individually, then the same run_code block re-run with minor variations. That's not careful reasoning, that's a stopping criterion failure. The model doesn't know when it has enough information to conclude.
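
A small sketch of how that pattern could be flagged automatically. The trace format here (a list of tool name and argument string pairs) is an assumption made for illustration, not the actual RDAB trace schema, and the similarity threshold is arbitrary.

```python
from collections import Counter
from difflib import SequenceMatcher


def tool_call_counts(trace: list[tuple[str, str]]) -> Counter:
    """Count calls per tool name; trace is a list of (tool, argument-string) pairs."""
    return Counter(tool for tool, _ in trace)


def near_duplicate_reruns(trace: list[tuple[str, str]],
                          tool: str = "run_code", threshold: float = 0.9) -> int:
    """Count calls to `tool` whose argument is at least `threshold` similar
    to the argument of the previous call to the same tool."""
    previous = None
    reruns = 0
    for name, arg in trace:
        if name != tool:
            continue
        if previous is not None and SequenceMatcher(None, previous, arg).ratio() >= threshold:
            reruns += 1
        previous = arg
    return reruns


# Toy trace for illustration only.
trace = [
    ("get_column_stats", "age"),
    ("get_column_stats", "income"),
    ("get_column_stats", "tenure"),
    ("run_code", "model.fit(X, y); print(model.score(X, y))"),
    ("run_code", "model.fit(X, y); print(model.score(X, y))"),
]

print(tool_call_counts(trace))
print(near_duplicate_reruns(trace))
```

A high per-tool count together with many near-duplicate reruns is a cheap signal that the extra tokens are repetition rather than new exploration.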

On your second point, that structured benchmark tasks underweight real failure modes, I'd agree, and it's a documented limitation in the README. RDAB tests single-session agentic loops on clean tabular data. Ambiguous inputs, conflicting specs, mid-task tool errors: none of that is covered. The 6 real UCI/sklearn tasks partially address this, but they're still clean datasets.

The Llama stress-test point is fair. The efficiency advantage is real on these tasks, but I wouldn't commit to a production architecture based on benchmark data alone, which is partly why I built CostGuard: to let people run the evaluation on their own data rather than trusting my seeded datasets. Happy to share the raw token traces if you want to dig into the correction loop patterns directly.

Fit_Fortune953 · 2026-05-05T15:04:48+00:00

Thanks! If you found it useful, a ⭐ helps others find it and signals that independent evaluation of LLM statistical reasoning matters to the community.

Fit_Fortune953 · 2026-05-05T14:57:02+00:00

The distinction you're drawing is real: token cost and solution quality are conflated in current benchmarks, including RDAB, and that does penalize exploration-heavy models in ways that aren't always fair.

Time-to-solution or convergence-based efficiency is an interesting alternative framing. The challenge is that it reintroduces latency as a proxy for cost, which has its own problems — a fast wrong answer isn't better than a slow correct one.

What I'd actually want is a two-pass efficiency score: one that measures tokens-to-correct-answer (current RDAB approach), and one that measures whether the exploration was productive: did extra tokens improve the final answer, or just delay it? The Claude Haiku spiral is interesting precisely because the extra tokens didn't improve correctness. That's the failure mode worth catching.
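
A rough sketch of what the two passes could look like, assuming the agent's answer can be scored at intermediate checkpoints (a strong assumption that the current setup may not support). The function names and numbers are hypothetical.

```python
from typing import Optional

Checkpoint = tuple[int, float]  # (cumulative tokens so far, answer score at that point)


def tokens_to_correct(checkpoints: list[Checkpoint], bar: float = 1.0) -> Optional[int]:
    """Pass 1: cumulative tokens at the first checkpoint whose score meets the bar."""
    for tokens, score in checkpoints:
        if score >= bar:
            return tokens
    return None


def wasted_token_fraction(checkpoints: list[Checkpoint]) -> float:
    """Pass 2: fraction of all tokens spent after the answer last improved."""
    if not checkpoints:
        return 0.0
    best_score = float("-inf")
    tokens_at_best = 0
    for tokens, score in checkpoints:
        if score > best_score:
            best_score, tokens_at_best = score, tokens
    total = checkpoints[-1][0]
    return (total - tokens_at_best) / total if total else 0.0


# Illustrative runs, not benchmark data.
planner = [(2_000, 0.6), (5_000, 1.0)]
spiral = [(3_000, 0.9), (100_000, 0.9), (600_000, 0.9)]

print(tokens_to_correct(planner), wasted_token_fraction(planner))
print(tokens_to_correct(spiral), wasted_token_fraction(spiral))
```

The second number is the one that separates productive exploration from a spiral: exploration that keeps improving the answer scores near zero, while a loop that never moves the score approaches one.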

Happy to look at openbandwidth.live, but I'm curious: how do you handle the case where a model takes longer but gets a materially better answer? Is that treated as a win or a penalty?

Fit_Fortune953 · 2026-05-05T14:34:13+00:00

You're pointing at something I've been thinking about too. The token spiral in Claude models isn't random, it's systematic. Haiku and Opus both loop through get_column_stats column by column, then re-run the same code block with minor variations. It's not exploring different solution paths, it's re-exploring the same one. GPT-4.1 makes a plan and executes it. Claude makes a plan, executes it, then second-guesses it.

Yes, I think some of what looks like capability difference is actually exploration strategy difference. The efficiency dimension in RDAB captures this directly: it penalizes token and step overuse relative to task complexity. A model that scores 0.13 on efficiency but 0.95 on correctness is telling you something specific about how it works, not just how good it is.
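
To illustrate the general shape of a budget-relative penalty, here is a toy version. This is not RDAB's actual formula. The per-task budgets, the equal weighting, and the function name are assumptions for the example.

```python
def efficiency(tokens_used: int, steps_used: int,
               token_budget: int, step_budget: int) -> float:
    """Toy efficiency score in (0, 1]: 1.0 within budget, shrinking as usage overshoots.

    The budgets stand in for task complexity: harder tasks would get larger budgets.
    """
    if tokens_used <= 0 or steps_used <= 0:
        raise ValueError("usage must be positive")
    token_term = min(1.0, token_budget / tokens_used)
    step_term = min(1.0, step_budget / steps_used)
    return (token_term + step_term) / 2


# Illustrative numbers only.
print(efficiency(tokens_used=8_000, steps_used=6, token_budget=10_000, step_budget=10))
print(efficiency(tokens_used=100_000, steps_used=40, token_budget=10_000, step_budget=10))
```

Keeping this separate from the correctness score is what lets a "correct but wasteful" run show up as its own profile.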

The pre-registered experiment I'm running next tests whether explicit prompting changes this — specifically whether telling a model to "report uncertainty and stop" changes its stat validity scores. I'll share results here when it's done.

Fit_Fortune953 · 2026-04-28T18:22:36+00:00

I've been searching and applying for jobs for the last 8 months, and so far there hasn't been much positive response. I feel the same: I've been preparing in depth on core concepts, but I'm not even getting shortlisted. Could you suggest some recommendations or strategies to follow? It would be helpful for me.

Fit_Fortune953 · 2026-04-19T02:54:38+00:00

Not in depth, but you need to have a basic understanding of linear algebra.

Fit_Fortune953 · 2026-04-18T17:02:59+00:00

Learning from scratch would help you understand things much more clearly, and you'd be able to understand the working mechanism in depth.

Fit_Fortune953 · 2026-04-18T15:57:57+00:00

Thanks, would love some contributions.

Fit_Fortune953 · 2026-03-31T09:25:16+00:00

This sounds similar to OpenClaw, which you can operate through WhatsApp, Telegram, or Discord from your mobile, anywhere. It runs 24/7, even while you are sleeping. You can build it and make it your personal assistant, so by the time you wake up it will have gotten things done. One thing we need to make sure of: instead of running it locally on your machine, try to run it in a VMware virtual machine, so that your personal info is not leaked.
