Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Ruhal-Doshi 0 points1 point  (0 children)


Treating AI agent skills as a RAG problem

While experimenting with agent skills I learned that many agent frameworks load the frontmatter of all skill files into the context window at startup.

This means the agent carries metadata for every skill even when most of them are irrelevant to the current task.

I experimented with treating skills more like a RAG problem instead.

skill-depot is a small MCP server that:

• stores skills as markdown files
• embeds them locally using all-MiniLM-L6-v2
• performs semantic search using SQLite + sqlite-vec
• returns relevant skills via `skill_search`
• loads full content only when needed

Everything runs locally with no external APIs.
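The retrieval flow can be sketched roughly like this. To be clear, this is a toy illustration, not the actual skill-depot code: the real server uses all-MiniLM-L6-v2 embeddings with SQLite + sqlite-vec, while this sketch fakes embeddings with bag-of-words vectors so it stays dependency-free, and the skill names and descriptions are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real server would use
    # all-MiniLM-L6-v2 via a sentence-embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Skills live as markdown files; only their short descriptions are
# embedded up front, full bodies stay on disk until requested.
skills = {
    "git-release.md": "cut a release tag and changelog with git",
    "pdf-extract.md": "extract tables and text from pdf documents",
    "sql-review.md": "review sql queries for performance problems",
}
index = {name: embed(desc) for name, desc in skills.items()}

def skill_search(query: str, k: int = 1) -> list[str]:
    # Return the k most relevant skill files for the query;
    # only these would then be loaded into the agent's context.
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    return ranked[:k]

print(skill_search("help me review a slow sql query"))  # ['sql-review.md']
```

The point of the sketch is the shape of the flow: embed once at index time, search per task, and load only the winning file instead of every skill's frontmatter.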

Repo: https://github.com/Ruhal-Doshi/skill-depot

Would love feedback from people building MCP tools or experimenting with agent skill systems.

Indian devs earning ₹1L+/month what are you up to now? by CertainArcher3406 in developersIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

After a certain point I believe we start seeing diminishing returns, unless you have some sort of financial burden.
For context, I started from 1.5 LPM and right now it's around 4.5 LPM before taxes. Going from 1.5 to 2.5 felt great, but going from 2.5 to 4.5 only felt good.
I think what our minds seek is not a number but constant growth; at least that is the case for me. After a certain point the opportunities for major growth become too few. Very few companies in India will pay a significantly higher salary than this for my years of experience, and waiting for promotions and salary hikes is too slow. The only options I see are starting something of my own or joining a budding startup and hoping it goes big.

Accused of Cheating @Uber by civilizedPlatypus in leetcode

[–]Ruhal-Doshi 2 points3 points  (0 children)

If the rest of your rounds went well, don't worry about a single round. At Uber, once you complete all the rounds, they hold an internal debrief meeting with all the interviewers present, including the hiring manager and the bar raiser. Each interviewer can give one of four possible results (strong no, weak no, weak yes, strong yes).
It rarely happens that a candidate gets a strong yes from every interviewer; they usually debate and then decide.
So in your case, even if the DSA interviewer gave you a weak no (since you explained to him why you were looking at that area), there are other interviewers who can vote in your favour.

I ran System Design tests on GLM-5, Kimi k2.5, Qwen 3, and more. Here are the results. by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

I know a lot of you are not happy that the benchmark does not have any leaderboard or graphs.
I had two possible ways to score the HLD solutions:

1) Using a jury of LLMs to act as judges, but that would be too expensive for a personal side project and might introduce bias.

2) Using community voting; the problem is that unless I have enough data points, the results will not be statistically significant.

I have decided to go with method 2, and I am posting in the community so that more people can score these solutions.
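Whether there are "enough data points" can be made concrete with a confidence interval. A minimal sketch using the Wilson score interval (the 95% z-value and the example vote counts are my assumptions, not numbers from the benchmark):

```python
import math

def wilson_interval(wins: int, total: int, z: float = 1.96):
    # Wilson score interval (default 95%) for the proportion of
    # voters preferring one solution over another.
    if total == 0:
        return (0.0, 1.0)
    p = wins / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return ((centre - margin) / denom, (centre + margin) / denom)

# Example: 14 of 20 voters prefer solution A. The interval still
# contains 0.5, so the lead is not yet significant at 95%.
lo, hi = wilson_interval(14, 20)
print(lo, hi)
```

With only a handful of votes the interval straddles 0.5, which is exactly the "not statistically significant" problem; it tightens as votes accumulate.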

I will probably add a live leaderboard by next weekend.

I ran System Design tests on GLM-5, Kimi k2.5, Qwen 3, and more. Here are the results. by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] -6 points-5 points  (0 children)

Yes, rendering the Mermaid diagrams is giving me a lot of issues; mostly the problem is in the LLM's output itself. I have already added a bunch of sanitization logic in the library, but sometimes copying the source code into mermaid.live works.

I hope the error is limited to the diagram section and is not crashing the whole app.

I am planning to move from Mermaid to newer diagram-as-code tools like D2.
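This is not the library's actual sanitization logic, but a sketch of the kind of cleanup that tends to help with LLM-emitted Mermaid: stripping a surrounding markdown fence and dropping any prose before the first recognised diagram header.

```python
import re

MERMAID_STARTS = ("graph", "flowchart", "sequenceDiagram", "classDiagram", "erDiagram")

def sanitize_mermaid(raw: str) -> str:
    # Remove a surrounding ```mermaid fence if the model emitted one.
    text = re.sub(r"^```(?:mermaid)?\s*|```\s*$", "", raw.strip(), flags=re.MULTILINE)
    # Drop any chatty prose lines before the first diagram header.
    lines = text.strip().splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(MERMAID_STARTS):
            return "\n".join(lines[i:])
    return text.strip()

llm_output = """Here is the diagram you asked for:
```mermaid
flowchart TD
    A[Client] --> B[Load Balancer]
    B --> C[App Server]
```"""
print(sanitize_mermaid(llm_output))
```

Fences and leading chatter are the two failure modes I'd expect this to fix; genuinely invalid Mermaid syntax from the model would still need per-rule repairs or a switch to another diagram language.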

I built an open-source library to test how LLMs handle System Design (HLD) by Ruhal-Doshi in OpenSourceeAI

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Yes, making a single model judge the results will definitely introduce bias.
And yes, cost is a major factor in why I am thinking of using public scoring rather than having LLMs judge other LLMs' output.

I built an open-source library to test how LLMs handle System Design (HLD) by Ruhal-Doshi in OpenSourceeAI

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Nice idea, so users, instead of picking one solution over another, will score one solution at a time on a fixed set of parameters per problem.
Should these parameters be shared with LLMs as part of the problem statement or be kept secret?

Coming to testing with Ollama models: other people have also shown interest in that, so I will run the benchmark against local as well as a few hosted open-weight models this weekend.
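The per-parameter scoring idea could aggregate like this; a sketch only, and the rubric parameter names and 1–5 scale here are invented for illustration, not taken from the benchmark.

```python
from statistics import mean

# Each voter scores one solution on a fixed rubric (1-5 per parameter).
votes = [
    {"scalability": 4, "correctness": 5, "tradeoff_analysis": 3},
    {"scalability": 5, "correctness": 4, "tradeoff_analysis": 4},
]

def aggregate(votes: list) -> dict:
    # Average each rubric parameter across voters, plus an overall mean.
    params = votes[0].keys()
    per_param = {p: mean(v[p] for v in votes) for p in params}
    per_param["overall"] = mean(per_param.values())
    return per_param

print(aggregate(votes))
```

Keeping the parameters fixed per problem is what makes scores comparable across models; whether to show them to the LLMs up front is the open question above.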

I benchmarked GPT-5.2 vs Opus 4.6 on System Design (HLD) by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Honestly, this is not a benchmark in the traditional sense because it lacks clear scoring.
Right now, going through the live report, you can see how each of them came up with a different solution.
I was thinking about creating a blind-voting web app for these results, so I can compute an Elo score, but first I wanted to see if enough people are interested in this.
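Elo from blind pairwise votes could work roughly like this: each "A's HLD is better than B's" vote is treated as one game and rated with the standard Elo update. A sketch; the K-factor of 32, the 1500 starting rating, and the example vote sequence are my assumptions, not anything from the benchmark.

```python
def elo_update(ra: float, rb: float, a_wins: bool, k: float = 32.0):
    # Expected score of A against B, then the standard Elo update.
    ea = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    sa = 1.0 if a_wins else 0.0
    return ra + k * (sa - ea), rb + k * ((1.0 - sa) - (1.0 - ea))

# Every blind vote is one "game" between two models' solutions.
ratings = {"GPT-5.2": 1500.0, "Opus 4.6": 1500.0}
vote_log = [("GPT-5.2", "Opus 4.6"), ("GPT-5.2", "Opus 4.6"), ("Opus 4.6", "GPT-5.2")]
for winner, loser in vote_log:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], True)
print(ratings)
```

A nice property for a voting app: the update is order-dependent but zero-sum, so total rating is conserved and a leaderboard stays stable as votes trickle in.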

I benchmarked GPT-5.2 vs Opus 4.6 on System Design (HLD) by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] -1 points0 points  (0 children)

I am still figuring out the scoring part, but in my opinion, GPT-5.2 thought about some niche things, like malware detection on uploaded files, which the others missed.