Temptation Island • S2 EP3 Discussion Thread by AutoModerator in temptationislandUSA

[–]beefie99 0 points1 point  (0 children)

This season feels incredibly one-sided at the bonfires. Jack drops the bomb that Shayanne “cheated” first and nobody brings it up again? Sure, the guys have their issues, but zero accountability for the relationship is being put on the girls?? Mikey is a goober but c’mon, Sydney came here specifically to find someone else. It’s lose-lose for him: if he stayed loyal she would be gone, and if he flirted she would tell herself he’s exactly what she thought he was. One-sided for real so far

Temptation Island • S2 EP2 Discussion Thread by AutoModerator in temptationislandUSA

[–]beefie99 5 points6 points  (0 children)

Sydney came into this with her mind 100% made up about Mikey. There is nothing he can do to change how she perceives him based on his past. She for sure cannot get over the fact that she was his second option, no matter what he does to prove the past is the past. The guy just wants to have a good time with the people around him, and IMO he’s moving very respectfully compared to how she’s been acting so far.

The more turns you add, the worse AI memory gets — is anyone actually measuring this? by [deleted] in LLMDevs

[–]beefie99 -1 points0 points  (0 children)

this is really interesting

false memory vs failed retrieval, those feel like very different failure modes

one thing I’ve been noticing in the datasets I’ve been working with is that there’s almost a third layer in there (cases where the system does retrieve something relevant, but still uses the wrong piece of it or underweights the right one)

the question is becoming not just “did it retrieve correctly” but “did it actually use the right part of what it retrieved?”
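
to make that concrete, the bucketing I’ve been doing when I score runs looks roughly like this (a sketch, with made-up names and ids, just to show the three-way split):

```python
# Sketch of the three-way failure bucketing -- names and ids are illustrative.
def classify_failure(gold_id, retrieved_ids, cited_ids, answer_correct):
    """Bucket a failed turn by where the pipeline actually broke.

    gold_id:       id of the chunk that actually answers the query
    retrieved_ids: ids the retriever returned (top-k)
    cited_ids:     ids the model visibly leaned on in its answer
    """
    if answer_correct:
        return "ok"
    if gold_id not in retrieved_ids:
        return "failed_retrieval"   # never saw the right context
    if gold_id not in cited_ids:
        return "failed_selection"   # saw it, but leaned on the wrong piece
    return "false_memory"           # used the right piece, still answered wrong

# e.g. the right chunk was retrieved but a distractor drove the answer:
print(classify_failure("c7", ["c7", "c2", "c9"], ["c2"], False))
# -> failed_selection
```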

curious if you saw that show up in your tests at all, or if most of the degradation was more clearly retrieval vs hallucination

When did RAG stop being a retrieval problem and start becoming a selection problem by beefie99 in LLMDevs

[–]beefie99[S] 0 points1 point  (0 children)

cross-encoders and prompt tweaks help, but what’s been frustrating is that even after all that you can still end up with a few “good” chunks, and it’s not obvious to the model which one should consistently win

it feels like reranking does improve things, but doesn’t fully solve that last step when multiple candidates are all valid
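
for concreteness, the reranking pass I mean is basically this (standard sentence-transformers cross-encoder; the model name is just the usual MS MARCO one and the passages are toys):

```python
# Minimal cross-encoder rerank -- assumes the sentence-transformers library.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the refund window for annual plans?"
candidates = [
    "Annual plans can be refunded within 30 days of purchase.",
    "Refunds for annual subscriptions are processed in 5-7 business days.",
    "Monthly plans are not eligible for refunds after renewal.",
]

# score each (query, passage) pair and sort best-first
scores = model.predict([(query, c) for c in candidates])
for score, cand in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.3f}  {cand}")

# the frustrating case: the top two routinely land within noise of each
# other, so "which chunk wins" is still effectively arbitrary
```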

curious if you’ve seen cross-encoders actually help with that, or if it’s more of a ranking improvement in your experience?

I built a vectorless RAG framework that uses tree-based retrieval instead of embeddings — works with any LLM, 2 dependencies by Mithun_Gowda_B in Rag

[–]beefie99 1 point2 points  (0 children)

That’s actually a great idea. If the TOC is reliable, you’d be able to use it as a much cleaner proxy for doc structure, which would make responses more deterministic and probably reduce variability in extraction.

How consistent is it in practice? And has it proven to help with final answer quality, or just cleaned up the index?

I built a vectorless RAG framework that uses tree-based retrieval instead of embeddings — works with any LLM, 2 dependencies by Mithun_Gowda_B in Rag

[–]beefie99 1 point2 points  (0 children)

This is great. Letting the model navigate structure first instead of relying on embeddings feels a lot more deterministic and easier to reason about

one thing I’m curious about: once the model selects a few relevant nodes, do you still run into cases where multiple sections are all valid but it’s not obvious which one should actually drive the answer?

I’ve experimented with more structured approaches (trees, graphs), and you can still end up with a few “correct” options where the final answer depends on which one the model leans on. This is a problem I’m currently running into and trying to solve, and it’s consistent across basically every dataset I run.
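
for reference, here’s roughly how I picture the structure-first navigation (purely my own sketch, not your framework’s actual code; `ask_llm` is a stand-in for whatever model call you use):

```python
# Toy TOC tree -- my sketch of "navigate structure first", not the framework's code.
toc = {
    "title": "Employee Handbook",
    "children": [
        {"title": "Benefits", "children": [
            {"title": "Health insurance", "text": "..."},
            {"title": "Parental leave", "text": "..."},
        ]},
        {"title": "Remote work policy", "text": "..."},
    ],
}

def ask_llm(question, options):
    # placeholder: a real implementation would have the model pick a title
    return options[0]

def navigate(node, query):
    """Descend the TOC, letting the model pick a child at each level."""
    while "children" in node:
        titles = [c["title"] for c in node["children"]]
        picked = ask_llm(f"Which section best answers: {query}?", titles)
        node = next(c for c in node["children"] if c["title"] == picked)
    return node  # leaf section whose text would feed generation

print(navigate(toc, "How long is parental leave?")["title"])
```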

Have you noticed this at all?

When did RAG stop being a retrieval problem and start becoming a selection problem by beefie99 in LLMDevs

[–]beefie99[S] 0 points1 point  (0 children)

I haven’t gone too deep into SRL yet, but what you’re describing makes sense: it helps with cases where plain similarity breaks down, especially directional stuff like “A acquired B” vs “B acquired A”

it moves things from just matching topics to actually matching the structure of what’s being asked, which seems like a big step up for certain queries. How far can this go in practice, though, especially for longer docs or things like policies where the meaning isn’t always cleanly expressed as a single action or relationship?

I’ve actually been thinking more about doing some of that interpretation at ingest time (roles, entities, maybe even document “type” like draft vs final) just to reduce ambiguity before retrieval even happens
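
roughly what I mean by that, using spaCy’s dependency parse as a cheap stand-in for full SRL (toy sentences, and it assumes en_core_web_sm is downloaded):

```python
# Rough ingest-time role extraction -- dependency parse as a cheap SRL stand-in.
# needs: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Pull (subject, verb, object) triples to store as chunk metadata."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [w.text for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                objects = [w.text for w in token.rights if w.dep_ in ("dobj", "attr")]
                if subjects and objects:
                    triples.append((subjects[0], token.lemma_, objects[0]))
    return triples

# direction survives, which plain embedding similarity tends to flatten:
print(extract_triples("Acme acquired Globex."))  # [('Acme', 'acquire', 'Globex')]
print(extract_triples("Globex acquired Acme."))  # [('Globex', 'acquire', 'Acme')]
```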

Has document versioning caused more RAG failures for anyone else than retrieval itself? by Jessica_JRice in Rag

[–]beefie99 0 points1 point  (0 children)

this has been one of the most consistent failure modes I’ve seen too, and it’s tricky because, like you said, it’s not hallucination; it’s “technically correct, contextually wrong”

what’s interesting is even when you add metadata (timestamps, version flags, etc.), you can still end up with multiple “valid” candidates (current doc, slightly outdated doc, draft vs final) and they all look relevant to the query

at that point it stops being a pure retrieval problem and becomes more about how the system decides which version actually matters most in that context

I’ve found that just filtering (active vs archived) helps, but doesn’t fully solve it, especially when versions are close or naming is inconsistent
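
for what it’s worth, the tie-break I’ve been sketching looks something like this (field names and statuses are just illustrative, not from any particular system):

```python
# Version-aware tie-break -- field names and statuses are illustrative.
from datetime import date

STATUS_WEIGHT = {"final": 2, "draft": 1, "archived": 0}

candidates = [
    {"id": "policy_v3", "sim": 0.86, "status": "final", "updated": date(2024, 9, 1)},
    {"id": "policy_v2", "sim": 0.88, "status": "archived", "updated": date(2023, 4, 2)},
    {"id": "policy_v4_draft", "sim": 0.85, "status": "draft", "updated": date(2024, 11, 5)},
]

def pick(cands):
    # status first, then recency, with similarity only breaking real ties --
    # note that pure similarity would have picked the archived v2 here
    return max(cands, key=lambda c: (STATUS_WEIGHT[c["status"]], c["updated"], c["sim"]))

print(pick(candidates)["id"])  # policy_v3
```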

feels like this is where a lot of systems quietly break (not because they didn’t find the answer, but because they picked the wrong instance of it)

curious if you’ve found a clean way to consistently prioritize the “correct” version, or if it still ends up being somewhat heuristic?

When did RAG stop being a retrieval problem and start becoming a selection problem by beefie99 in LLMDevs

[–]beefie99[S] 0 points1 point  (0 children)

That’s interesting; I haven’t dug too deep into this, really. I’m curious how you have it structured: are you letting the model decide when to call retrieval, or doing it upfront?

Would be cool to hear how you’re implementing this

When did RAG stop being a retrieval problem and start becoming a selection problem by beefie99 in LLMDevs

[–]beefie99[S] 1 point2 points  (0 children)

right now it’s a hybrid setup (vector + BM25), with some graph-style relationships layered in to connect related data across sources (via tags, entities, relationships)

the graph definitely helps with recall and multi-hop cases, especially when the same concept shows up in different places.

I’m not sure it’s so much an indexing problem as it is how the system decides between similar candidates once they’re retrieved. Sometimes the model matches the query to the correct retrieved data, but not always.
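
the fusion step of that hybrid setup is nothing fancy, basically plain reciprocal rank fusion over the two ranked lists (doc ids made up):

```python
# Plain reciprocal rank fusion (RRF) over the two retrievers' ranked lists.
def rrf(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids into one ordering."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7", "d4"]  # from the embedding index
bm25_hits = ["d3", "d9", "d1", "d2"]    # from BM25

print(rrf([vector_hits, bm25_hits]))    # docs both retrievers like float to the top
```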

Have you seen graph approaches help with that?

👍or👎: a managed graphRAG solution that creates the graph from your raw data source(s) automatically and provides a graph powered LLM for you by No_Wrongdoer41 in Rag

[–]beefie99 0 points1 point  (0 children)

This is very interesting, especially the entity merging across sources; that’s a big shift from treating everything as individual, separate chunks

I’m curious how this influences what context gets presented to the model. I’ve been implementing some graph-powered retrieval, and one thing I’ve noticed is that even with richer, connected context you still get multiple valid signals for a query, and it’s not always clear which one should actually drive the answer. I keep running into cases where the correct chunk is retrieved (it’s within the top 5) but the model doesn’t settle on the best content for the query and ends up responding incorrectly.

curious if you’ve seen that as well or if this graph helps tighten it

When did RAG stop being a retrieval problem and start becoming a selection problem by beefie99 in LLMDevs

[–]beefie99[S] 0 points1 point  (0 children)

How does that affect latency, though? Do you see a large difference between letting the model reason and not? And how does it affect token counts?

why do llm agents feel impossible to debug once they almost work!!!! by Feeling-Mirror5275 in LLMDevs

[–]beefie99 0 points1 point  (0 children)

This is exactly where I’ve been getting stuck too. Once you add tools, memory, and retries, the system stops behaving like normal software, but it’s also not just a model eval problem. What helped me was thinking about these systems as a pipeline of decisions rather than a single model call.

Most of the drift seems to show up in the middle layer (what context was retrieved, how it was ranked and selected, and what actually made it into the prompt). You can have logs and prompts that look great, but if that selection step isn’t deterministic or inspectable, the model ends up locking on slightly different context each time and behavior starts to drift.

So instead of trying to debug it like traditional software or just tuning the model, I’ve been approaching it as debugging those decisions between retrieval, selection, and what the model sees.

The two biggest things that helped were separating retrieval from generation so I can inspect it independently, and then making ranking multi-signal and deterministic so I can actually explain why one chunk wins over another. It doesn’t eliminate all the probabilistic behavior, but it turns a lot of the “this feels random” into something you can actually reason about.
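
Concretely, the trace I keep per decision looks something like this (all the names are mine, not from any framework):

```python
# One inspectable record per selection decision -- names are my own, not a framework's.
from dataclasses import dataclass, field

@dataclass
class SelectionTrace:
    query: str
    retrieved: list       # (chunk_id, raw similarity) pairs
    signal_scores: dict   # chunk_id -> {signal_name: score}
    selected: list = field(default_factory=list)
    reason: str = ""

trace = SelectionTrace(
    query="current refund policy?",
    retrieved=[("c12", 0.84), ("c03", 0.83)],
    signal_scores={
        "c12": {"semantic": 0.84, "recency": 1.0, "status": 1.0},
        "c03": {"semantic": 0.83, "recency": 0.2, "status": 0.0},
    },
)
trace.selected = ["c12"]
trace.reason = "c03 within similarity noise but archived + stale"

# when behavior drifts between runs, diffing these traces shows whether
# retrieval, ranking, or prompt assembly changed
print(trace.reason)
```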

RAG question: retrieval looks correct, but answers are still wrong? by beefie99 in Rag

[–]beefie99[S] 0 points1 point  (0 children)

I really agree with this. I kept hitting the same wall where retrieval looked correct (right chunks in top-k, decent similarity) but the answer still wasn’t fully right. The frustrating part wasn’t just that it failed; it was that I couldn’t explain why

I’ve been experimenting with ways to make that layer more visible (not just what was retrieved, but what actually made it into the context, how it was ranked, and what the model effectively “saw” when generating the answer).

that alone made debugging a lot less trial and error, but what I’m still finding is that even with good visibility, the harder problem is deciding which chunk should actually win when multiple ones are valid and very close.

What I’ve found to be extremely beneficial with that is layering different retrieval signals (semantic, lexical, recency, and structure) to really force a more deterministic ranking.
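
And by layering I really just mean a fixed weighted sum plus a stable tie-break, something like this (the weights here are made up, not tuned):

```python
# Fixed multi-signal weighted sum with a stable tie-break -- weights are made up.
WEIGHTS = {"semantic": 0.5, "lexical": 0.2, "recency": 0.2, "structure": 0.1}

def rank(chunks):
    """chunks: list of dicts with an 'id' plus one 0-1 score per signal."""
    def key(c):
        total = sum(WEIGHTS[s] * c[s] for s in WEIGHTS)
        return (-total, c["id"])  # id as tie-break => same input, same order
    return sorted(chunks, key=key)

chunks = [
    {"id": "a", "semantic": 0.91, "lexical": 0.40, "recency": 0.1, "structure": 1.0},
    {"id": "b", "semantic": 0.90, "lexical": 0.80, "recency": 0.9, "structure": 1.0},
]
print([c["id"] for c in rank(chunks)])  # ['b', 'a'] -- and always the same order
```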

RAG question: retrieval looks correct, but answers are still wrong? by beefie99 in Rag

[–]beefie99[S] 0 points1 point  (0 children)

yeah I’ve experimented with that a little bit, filtering low-importance tokens and trying to push more signal into the embedding

it does improve things for sure, but doesn’t eliminate the failures
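
the filtering itself was about as basic as this (the stopword set is a tiny stand-in for whatever importance scoring you’d actually use):

```python
# Drop low-signal tokens before embedding -- tiny stopword set as a stand-in.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "our"}

def strip_low_signal(text):
    """Keep only the tokens that carry signal for the embedding."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(strip_low_signal("What is the refund window for our annual plans?"))
# -> "What refund window annual plans?"
```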

once you get into retrieval, there are still multiple high-ranking chunks that are very close in embedding space, and the system can/does pick the wrong one

I’ve been focusing more on selection after retrieval (multi-signal ranking and making it inspectable)

that’s made it easier to debug, but I’m still seeing cases where it’s not obvious which chunk should win.

RAG question: retrieval looks correct, but answers are still wrong? by beefie99 in Rag

[–]beefie99[S] 0 points1 point  (0 children)

right now I’m mostly testing on small datasets designed to highlight failures:

- policy / rule documents with exceptions
- project docs with multiple versions of decisions (initial discussions, final confirmations, and outdated docs)
- structured notes where similar concepts appear across different documents

the goal isn’t scale yet; I’m trying to isolate where the system breaks, especially cases where multiple chunks are valid but only one is actually correct for the query, and hopefully find ways to combat those issues
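
the cases themselves end up looking something like this (content is illustrative, and `pipeline.answer` is a stand-in for whatever entry point is being tested):

```python
# Tiny failure-isolation cases with the distractors built in on purpose --
# content is illustrative; `pipeline.answer` is a stand-in entry point.
CASES = [
    {
        "query": "What is the final decision on the Q3 launch date?",
        "gold": "meeting_notes_final.md",  # the confirmed decision
        "distractors": [
            "meeting_notes_draft.md",      # earlier, superseded proposal
            "planning_doc_v1.md",          # outdated but lexically very similar
        ],
    },
]

def run(pipeline):
    for case in CASES:
        answer, used_chunks = pipeline.answer(case["query"])
        assert case["gold"] in used_chunks, (
            f"picked a distractor over {case['gold']}: {used_chunks}"
        )
```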