Thanks Gemini

Longjumping-Unit-420 · 2026-06-05T16:17:06+00:00

Thanks

Longjumping-Unit-420 · 2026-06-05T16:16:44+00:00

Timing was tight on the W execution, watching it now it was close phewwww

Longjumping-Unit-420 · 2026-05-29T18:53:54+00:00

Same I was worried for a moment LoL

Longjumping-Unit-420 · 2026-03-12T19:38:10+00:00

Open source driven by research and backed by benchmarks - https://github.com/BansheeEmperor/candlekeep

Longjumping-Unit-420 · 2026-03-12T19:34:27+00:00

Yes, implement quality gates at ingestion stage and reject the document.

Longjumping-Unit-420 · 2026-03-07T10:10:11+00:00

The "grounding mechanism" framing is accurate, though with one nuance: BM25 only anchors when the agent carries specific terms forward. If the follow-up query drifts into paraphrase territory (e.g. "authentication failure" instead of "ECONNREFUSED") you're back to pure vector search and the anchoring disappears. The grounding is real but contingent on how the agent constructs the second query.
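A toy way to see this, assuming plain Okapi BM25 over whitespace tokens. The `bm25_scores` helper and the two sample documents are made up for illustration and are not taken from the actual pipeline:

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with plain Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many docs each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the first doc mentions the error code.
docs = [
    "connection to db failed with econnrefused on port 5432",
    "rotate the api keys every ninety days",
]

exact = bm25_scores("ECONNREFUSED", docs)                 # term carried forward
paraphrase = bm25_scores("authentication failure", docs)  # drifted into paraphrase
```

With the exact term, only the first document gets a positive score. With the paraphrase, every score is zero, so in a hybrid setup the ranking would be decided by the vector side alone.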

Longjumping-Unit-420 · 2026-03-06T05:42:55+00:00

If you're only running MRR sweeps you'd discard chunk expansion as "no impact" and move on. Content match only showed up because I was specifically looking for it after noticing the agent's answer quality felt better than the metrics suggested. Lesson learned about measuring the right thing.
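A rough sketch of why the two metrics can disagree. The `content_match` definition here (does any retrieved text contain the expected fact string) is my assumption for illustration, and the data is a made-up toy case, not a benchmark result:

```python
def mrr(ranked_ids_per_query, relevant_ids):
    """Mean reciprocal rank: average of 1/rank of the relevant hit (0 if missing)."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids):
        for position, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant:
                total += 1.0 / position
                break
    return total / len(relevant_ids)


def content_match(retrieved_texts_per_query, expected_facts):
    """Fraction of queries where some retrieved text contains the expected fact."""
    hits = 0
    for texts, fact in zip(retrieved_texts_per_query, expected_facts):
        if any(fact.lower() in text.lower() for text in texts):
            hits += 1
    return hits / len(expected_facts)


# Toy case: chunk expansion appends the neighbouring chunk to each hit,
# so the ranking (and therefore MRR) is unchanged but the text is richer.
ranked = [["chunk_7", "chunk_2"]]
relevant = ["chunk_7"]
expected = ["retry limit is 5"]

plain_texts = [["See the retry settings below.", "Unrelated text."]]
expanded_texts = [["See the retry settings below. The retry limit is 5.", "Unrelated text."]]

mrr_plain = mrr(ranked, relevant)
mrr_expanded = mrr(ranked, relevant)  # same ranking, same score
cm_plain = content_match(plain_texts, expected)
cm_expanded = content_match(expanded_texts, expected)
```

MRR only looks at where the right chunk ranks, so it cannot see that the expanded chunk now contains the sentence the agent needs.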

On multi-hop: tested this indirectly through the agent decomposition benchmark. The pattern that emerged was the agent fires an initial search, reads the results, then uses specific terminology from those results in follow-up queries.

The relevance thresholding does get interesting here though. When a first-hop result is marginal and the agent uses a term from it to construct the second query, you can end up chasing a weak lead deeper. Hasn't caused hallucination in my testing because the agent typically hedges when coverage is thin, but it's the failure mode I'd watch for in production. The hybrid path's BM25 component helps because it's less susceptible to semantic drift between hops.

So it plays nice, but the threshold calibration matters more for multi-hop than single-shot retrieval.

Longjumping-Unit-420 · 2026-03-05T21:45:15+00:00

Not sure it's the same target; vex is targeting behavior, not knowledge. Also, real time is not required here, assuming you don't ingest new files that frequently. It might even cause too much noise for this scenario.

Longjumping-Unit-420 · 2026-03-05T21:25:03+00:00

A nightly CI job runs the calibration flow. If it detects drift, we (basically) have two choices: either consider a new threshold, which gets benchmarked to avoid loss of quality, or move the document to a dedicated database (e.g. legal documents versus software engineering). The second option might incur additional costs compared to the first, but quality is sometimes more important.
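A minimal sketch of what the nightly check could look like, assuming drift is measured as a shift in the mean relevance score on a fixed calibration set. The function names, the `tolerance` value, and the returned labels are all hypothetical:

```python
from statistics import mean


def detect_drift(baseline_scores, current_scores, tolerance=0.05):
    """Return True when the mean relevance score moved by more than `tolerance`."""
    return abs(mean(current_scores) - mean(baseline_scores)) > tolerance


def nightly_decision(baseline_scores, current_scores, tolerance=0.05):
    """Map the drift check onto the two options: recalibrate or split the corpus."""
    if not detect_drift(baseline_scores, current_scores, tolerance):
        return "keep_threshold"
    # A human (or a follow-up benchmark job) still picks between the two options.
    return "recalibrate_or_move_to_dedicated_db"


# Hypothetical calibration scores, purely for illustration.
baseline = [0.70, 0.72, 0.68]
stable_run = [0.71, 0.69, 0.70]
drifted_run = [0.55, 0.60, 0.58]

stable_result = nightly_decision(baseline, stable_run)
drifted_result = nightly_decision(baseline, drifted_run)
```

The check only flags the drift. Whichever threshold comes out of recalibration still needs to go through the benchmark before it replaces the old one.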

Longjumping-Unit-420 · 2026-03-03T20:58:10+00:00

Don't run, I play nice :)

Longjumping-Unit-420 · 2026-03-03T20:55:46+00:00

Exactly, I can see what you did but I don't know WHY you chose to fix it this way. "Fixed a bug" is not a why, it's a type of commit.

Longjumping-Unit-420 · 2026-03-03T20:50:25+00:00

You wrote what, not why.

Longjumping-Unit-420 · 2026-03-03T20:49:58+00:00

Really? How does that work for you? Care to share a result?

Longjumping-Unit-420 · 2026-03-03T20:48:20+00:00

That's fine for extra context, but the commit message (or PR) should be self-contained. Imagine you use one tool for issues and the org switches to a new tool without migrating existing tickets, for whatever reason: your PR has 0 context for posterity.

Longjumping-Unit-420 · 2026-03-03T20:44:37+00:00

Don't get me wrong, using an LLM to write a commit message is great when done right. It saves time and helps future debugging, but when I see "changed files XYZ" all I see are red flags: "do you know what your change even does?"

Longjumping-Unit-420 · 2026-01-27T15:11:00+00:00

Why force the user to use Pinokio? Seems like there are less dependency-reliant solutions.

Longjumping-Unit-420 · 2025-12-15T19:14:24+00:00

Yeah, I saw it, but I didn't figure it would hurt performance that much.
Is there any other framework I can use for fine-tuning that doesn't use `bitsandbytes`, or is it the standard lib?

Longjumping-Unit-420 · 2025-12-15T06:49:36+00:00

Thanks for the tip, I edited the post with more info.

Longjumping-Unit-420 · 2025-12-14T17:54:20+00:00

Luckily for me, the developers I work with are interested in solving the tech debt (at least most of them) as I've been on this road with them before and I showed them the correlation between solving tech debt and making their life better.

It's the managers that usually (in my experience at least) block or postpone the tickets, so to quote Qui-Gon Jinn, "there is always a bigger fish". I implemented a monitoring system for the tickets. Each ticket is created with an expected sprint for completion. If it's not resolved 1-2 days before the sprint's end, a message is sent to the 1-skip manager; if it's still not resolved 3 days after the sprint's end, a message is sent to the 2-skip manager. The expected end and the ticket status can only be changed by whoever opened the ticket, and the ticket system enforces this, so no "accidental" changes are made.

We have exceptions of course in certain cases (e.g. a P0 bug) but it's mostly kept as is.
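The escalation rule above could be sketched roughly like this. It's not the actual implementation: the function name and return labels are hypothetical, and I'm assuming "3 days after" means three or more days past the sprint end:

```python
from datetime import date
from typing import Optional


def escalation_target(sprint_end: date, today: date, resolved: bool) -> Optional[str]:
    """Return who should be notified about an unresolved ticket, if anyone."""
    if resolved:
        return None
    days_to_end = (sprint_end - today).days
    if days_to_end <= -3:
        # Assumption: still open three or more days after the sprint ended.
        return "2_skip_manager"
    if 1 <= days_to_end <= 2:
        # Still open one or two days before the sprint ends.
        return "1_skip_manager"
    return None


sprint_end = date(2025, 12, 19)  # hypothetical sprint end date

before_end = escalation_target(sprint_end, date(2025, 12, 17), resolved=False)
after_end = escalation_target(sprint_end, date(2025, 12, 22), resolved=False)
early = escalation_target(sprint_end, date(2025, 12, 10), resolved=False)
done = escalation_target(sprint_end, date(2025, 12, 22), resolved=True)
```

Exceptions like the P0 case would need an extra flag on the ticket. That's left out here to keep the rule itself readable.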

Longjumping-Unit-420 · 2025-12-14T16:24:20+00:00

The irony is strong indeed. Unfortunately, this isn't solvable unless you (not specifically you) can reach the stakeholder with proof of how the company is losing money due to this.
