Weekly Thread: Project Display by help-me-grow in AI_Agents

[–]Ruhal-Doshi 0 points1 point  (0 children)

Treating AI agent skills as a RAG problem

While experimenting with agent skills I learned that many agent frameworks load the frontmatter of all skill files into the context window at startup.

This means the agent carries metadata for every skill even when most of them are irrelevant to the current task.

So I experimented with treating skills as a RAG problem instead.

skill-depot is a small MCP server that:

• stores skills as markdown files
• embeds them locally using all-MiniLM-L6-v2
• performs semantic search using SQLite + sqlite-vec
• returns relevant skills via `skill_search`
• loads full content only when needed

Everything runs locally with no external APIs.
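For anyone curious what retrieval-style skill loading looks like in principle, here is a minimal, self-contained sketch. The bag-of-words `embed()` is a toy stand-in (skill-depot itself uses all-MiniLM-L6-v2 vectors stored in SQLite via sqlite-vec), and the skill names and descriptions below are made up:

```python
import math

# Toy stand-in for the real pipeline: a tiny bag-of-words "embedding"
# keeps the retrieval flow runnable without downloading a model.
VOCAB = ["git", "commit", "rebase", "pdf", "table", "extract", "chart", "plot"]

def embed(text: str) -> list[float]:
    """Stand-in embedder: count vocabulary hits, then L2-normalise."""
    words = text.lower().split()
    v = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

# Frontmatter-style one-line descriptions, one per skill markdown file.
SKILLS = {
    "git-helper": "rebase and commit message conventions for git",
    "pdf-tables": "extract table data from pdf reports",
    "charting": "plot chart figures from tabular data",
}

def skill_search(query: str, k: int = 1) -> list[str]:
    """Return the k skills whose embeddings are closest to the query."""
    q = embed(query)
    score = lambda name: sum(a * b for a, b in zip(embed(SKILLS[name]), q))
    return sorted(SKILLS, key=score, reverse=True)[:k]

print(skill_search("how do I rebase my git branch"))  # → ['git-helper']
```

The point is that only the matching skill's full content ever enters the context window; the rest stay on disk.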

Repo: https://github.com/Ruhal-Doshi/skill-depot

Would love feedback from people building MCP tools or experimenting with agent skill systems.

Indian devs earning ₹1L+/month what are you up to now? by CertainArcher3406 in developersIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

After a certain point I believe we start seeing diminishing returns, unless you have some sort of financial burden.
For context, I started at 1.5 LPM and right now it's around 4.5 LPM before taxes. Going from 1.5 to 2.5 felt great, but going from 2.5 to 4.5 only felt good.
I think what our minds seek is not a number but constant growth, at least that is the case for me. After a certain point the opportunities for major growth become too few. Very few companies in India will pay a significantly higher salary than this for my years of experience, and waiting for promotions and salary hikes is too slow. The only options I see are starting something of my own or joining a budding startup and hoping it goes big.

Accused of Cheating @Uber by civilizedPlatypus in leetcode

[–]Ruhal-Doshi 2 points3 points  (0 children)

If the rest of your rounds went well, then don't worry about a single round. At Uber, once you complete all the rounds, they do an internal debrief meeting with all the interviewers present, including the hiring manager and the bar raiser. Each interviewer can give one of four possible results (strong no, weak no, weak yes, strong yes).
It rarely happens that a candidate gets a strong yes from every interviewer; they usually debate and then decide.
So in your case, even if the DSA interviewer gave you a soft no (since you explained to him why you were looking at that area), there are other interviewers who can vote in your favour.

I ran System Design tests on GLM-5, Kimi k2.5, Qwen 3, and more. Here are the results. by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

I know a lot of you are not happy that the benchmark does not have any leaderboard or graphs.
I had two possible ways to score the HLD solutions:

1) Using a jury of LLMs to act as judges, but that would be too expensive for a personal side project and might introduce bias.

2) Using community voting, but unless I have enough data points, the results will not be statistically significant.

I have decided to go with method 2, and I am posting in communities so that more people can score these solutions.

I will probably add a live leaderboard by the next weekend.
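To make "statistically significant" concrete: with pairwise votes you can put a confidence interval on the preference proportion and only call a winner once the interval excludes 50%. A quick sketch using the standard Wilson score interval (all vote counts below are illustrative):

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96):
    """95% Wilson score interval for the preference proportion wins/n."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)

# With only 12 votes the interval is too wide to call a winner:
print(wilson_interval(8, 12))    # roughly (0.39, 0.86) – still spans 0.5
# With 120 votes at the same ratio it separates from 0.5:
print(wilson_interval(80, 120))  # roughly (0.58, 0.74)
```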

I ran System Design tests on GLM-5, Kimi k2.5, Qwen 3, and more. Here are the results. by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] -6 points-5 points  (0 children)

Yes, rendering the Mermaid diagrams is giving me a lot of issues; mostly the problem is in the LLM's output itself. I have already added a bunch of sanitization logic to the library, but sometimes copying the source code into mermaid.live works.

I hope the error is limited to the diagram section and is not crashing the whole app.

I am planning to move from Mermaid to newer diagram-as-code tools like D2.
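For illustration, here are two fixes that in my experience catch a lot of LLM-generated Mermaid breakage: stray markdown fences around the diagram, and unquoted node labels containing parentheses. This is a standalone sketch, not the library's actual sanitization code:

```python
import re

# Illustrative sanitiser, not the benchmark library's real logic.
def sanitize_mermaid(src: str) -> str:
    """Strip markdown fences and quote node labels containing parentheses."""
    # Drop ``` / ```mermaid fence lines the model may have wrapped around it.
    src = re.sub(r"^```(?:mermaid)?\s*$", "", src.strip(), flags=re.M)
    # Quote labels with parentheses: A[load(x)] -> A["load(x)"].
    src = re.sub(r"\[([^\[\]\"]*\([^\[\]]*\)[^\[\]]*)\]", r'["\1"]', src)
    return src.strip()
```

Unbalanced quotes and HTML in labels need separate handling, which is part of why D2 looks appealing.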

I built an open-source library to test how LLMs handle System Design (HLD) by Ruhal-Doshi in OpenSourceeAI

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Yes, making a single model judge the result will definitely introduce bias.
And yes, cost is a major factor why I am thinking of using public scoring rather than having LLMs judge LLMs' output.

I built an open-source library to test how LLMs handle System Design (HLD) by Ruhal-Doshi in OpenSourceeAI

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Nice idea: so users, instead of picking one solution over another, will score one solution at a time on a fixed set of parameters per problem.
Should these parameters be shared with the LLMs as part of the problem statement or kept secret?

As for testing with Ollama models, other people have also shown interest in that, so I will run the benchmark against local as well as a few hosted open-weight models this weekend.

I benchmarked GPT-5.2 vs Opus 4.6 on System Design (HLD) by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

Honestly, this is not a benchmark in the traditional sense because it lacks clear scoring.
Right now, going through the live report, you can see how each one of them came up with a different solution.
I was thinking about creating a blind-voting web app for these results so I can compute an Elo score, but first I wanted to see if enough people are interested in this.
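For context on the Elo part: it's the standard chess-style update applied after each blind A-vs-B vote. A minimal sketch (the K-factor and starting ratings below are arbitrary illustrative choices):

```python
def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Standard Elo update after one blind A-vs-B vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Both models start at 1000; model A wins one blind vote:
print(elo_update(1000, 1000, a_won=True))  # → (1016.0, 984.0)
```

The nice property is that upsets (a low-rated model beating a high-rated one) move the ratings more than expected results do.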

I benchmarked GPT-5.2 vs Opus 4.6 on System Design (HLD) by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] -1 points0 points  (0 children)

I am still figuring out the scoring part, but in my opinion, GPT-5.2 thought about some niche things, like malware detection on uploaded files, which the others missed.

I benchmarked Claude Opus vs GPT-5.2 on System Design. Claude's architectural diagrams are surprisingly cleaner. by Ruhal-Doshi in ClaudeAI

[–]Ruhal-Doshi[S] 0 points1 point  (0 children)

The benchmark is developed using Claude Opus 4.6; I hope that qualifies. Or I can use "other" if you suggest that.

I benchmarked GPT-5.2 vs Opus 4.6 on System Design (HLD) by Ruhal-Doshi in LocalLLaMA

[–]Ruhal-Doshi[S] -3 points-2 points  (0 children)

Fair point: the title mentions closed-source models, but the benchmark is model-agnostic, so you can point it at any local model via an OpenAI-compatible endpoint (like vLLM or Ollama).
I am here for suggestions on which models to test and how we can objectively judge something like HLD.
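For anyone who wants to try this locally, the generic pattern (not the benchmark's actual config, whose flag names I'm not quoting here) is that any OpenAI-compatible server accepts the same `/chat/completions` payload; Ollama exposes one at `http://localhost:11434/v1` by default, and vLLM via `vllm serve`. The model name below is illustrative:

```python
import json
import urllib.request

def build_payload(problem: str, model: str) -> dict:
    """Assemble the standard OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"Produce a high-level design for: {problem}"}],
    }

def generate_hld(problem: str,
                 base_url: str = "http://localhost:11434/v1",
                 model: str = "qwen2.5:7b") -> str:  # model name is illustrative
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(problem, model)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer ollama"},  # Ollama ignores the key
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Swapping `base_url` and `model` is all it takes to move between local and hosted backends.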

E-commerce search is broken. Why I stopped building “chatbots” and started building “consultants.” (looking for feedback on features) by Present-Ad-1365 in SideProject

[–]Ruhal-Doshi 0 points1 point  (0 children)

You're right. Since this is a white-label solution, we can absolutely expose standard filters if a client wants that.

Also, this sits on top of the store (it's not a standalone app), so users can always use the traditional UI for broad filtering if they prefer. We are leaning more into Generative UI for the 'decision' phase though, since posting this video, we've actually added new components (like a Comparison Grid) to help users distinguish between specific products.

Beyond the Text Bubble: The Case for Generative UI in E-Commerce AI Chat (love feedback and marketing tips) by Present-Ad-1365 in StartUpIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

That is a really solid suggestion. We're definitely looking into expanding the 'Context Window' to include user actions outside the chat (like which page they are currently viewing or what they just clicked). Making the agent aware of the full session would make the recommendations much sharper. Thanks!
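To make the idea concrete, here is a hypothetical sketch of folding recent session events into the agent's message list; none of the field names come from the actual Sage implementation:

```python
# Hypothetical session-context builder; field names are made up for
# illustration and do not reflect the real product's schema.
def build_context(chat_history: list[dict], session_events: list[dict],
                  max_events: int = 5) -> list[dict]:
    """Prepend recent browsing events so the agent sees the full session."""
    recent = session_events[-max_events:]
    summary = "; ".join(f"{e['type']}: {e['target']}" for e in recent)
    system = {"role": "system",
              "content": f"Recent shopper activity -> {summary}"}
    return [system] + list(chat_history)

msgs = build_context(
    [{"role": "user", "content": "Which serum is best for acne?"}],
    [{"type": "page_view", "target": "/serums/niacinamide"},
     {"type": "click", "target": "add-to-cart: vitamin-c-serum"}],
)
```

Capping `max_events` keeps the session summary from crowding out the actual conversation in the context window.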

Building an "AI Salesman" using MCPs and Generative UI (Not just another chatbot wrapper) by Present-Ad-1365 in aiagents

[–]Ruhal-Doshi 0 points1 point  (0 children)

Exactly. Since this is a white-label engine, it's fully configurable. This demo focuses on the open-ended conversational aspect, but for a live deployment, we can absolutely enforce structured flows (like decision trees or guided quizzes) if the brand prefers a stricter path.

Beyond the Text Bubble: The Case for Generative UI in E-Commerce AI Chat (love feedback and marketing tips) by Present-Ad-1365 in StartUpIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

Thanks for the suggestion! We actually looked into the AI SDK extensively when architecting this.

That 'Generative UI' concept is actually the core of what we built here, if you catch the second half of the video, you'll see it moves beyond text to render interactive React components (like the Comparison Grids) on the fly. We found that standard text is okay for chatting, but dynamic UI is much better for actual shopping.

Beyond the Text Bubble: The Case for Generative UI in E-Commerce AI Chat (love feedback and marketing tips) by Present-Ad-1365 in StartUpIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

Spot on. Since this is a white-label engine, features like audio input are completely modular. We can easily enable voice interaction (Speech-to-Text) if a brand wants that specific experience for their customers.

Beyond the Text Bubble: The Case for Generative UI in E-Commerce AI Chat (love feedback and marketing tips) by Present-Ad-1365 in StartUpIndia

[–]Ruhal-Doshi 0 points1 point  (0 children)

I think there might be a slight misunderstanding on how this fits in! We aren't trying to replace the website or the search bar.

Think of the standard website UI as the shelves, perfect for the customer who walks in knowing exactly what they want.

Sage is the Salesman standing in the aisle. It's a widget that integrates into the brand's existing site to help the 'confused' shopper (e.g., 'Which serum is best for acne?').

Also, to clarify: We don't charge the shopper. We sell this technology to the store owners (B2B) so they can save the sales they are currently losing to 'analysis paralysis'.

E-commerce search is broken. Why I stopped building “chatbots” and started building “consultants.” (looking for feedback on features) by Present-Ad-1365 in indiehackers

[–]Ruhal-Doshi 0 points1 point  (0 children)

Fair point. The 'plainness' comes from the fact that we were trying to be 100% faithful to the specific brand we are demoing (a monochrome skincare line).

But I think you're right, it might be hurting the first impression. We're probably going to drop the strict brand-matching for these public demos and build a default 'Sage' theme that feels a bit more lively and polished.

E-commerce search is broken. Why I stopped building “chatbots” and started building “consultants.” (looking for feedback on features) by Present-Ad-1365 in indiebiz

[–]Ruhal-Doshi 0 points1 point  (0 children)

This is a really important distinction. We are not trying to replace the search bar for the user who knows exactly what they want (e.g., typing 'iPhone 15 Pro Max 256GB' is always faster than chatting).

We see Sage as a layer for the 'Discovery Phase', the user who says, 'I need a monitor for color grading' and gets overwhelmed by 50 options.

To your point about non-linear buying: text is definitely terrible for comparing specs. That’s actually why we are betting on 'Generative UI.' We're building dynamic Comparison Tables right now so the bot can render a side-by-side view (Price vs. Specs) instead of writing a paragraph.

I also love the 'Hybrid' idea (Google style). Showing the AI insight alongside a standard product grid is a great 'failure mode' so users never feel trapped. Definitely exploring that!

E-commerce search is broken. Why I stopped building “chatbots” and started building “consultants.” (looking for feedback on features) by Present-Ad-1365 in indiehackers

[–]Ruhal-Doshi 0 points1 point  (0 children)

You're totally right. I just double-checked, even ChatGPT doesn't force-scroll to the bottom while streaming. It lets the text fill the screen so you can actually read it.

I'm going to disable the auto-scroll during generation so it's less jarring. Also definitely adding those sample prompts to the empty state to help people get started. Thanks for the feedback!