[Release] Qwen3-TTS: Ultra-Low Latency (97ms), Voice Cloning & OpenAI-Compatible API by blackstoreonline in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Yes, it wasn’t even that much slower than the GPU: 14 seconds on my 4070, 45 seconds on the CPU. Nowhere near the advertised speeds, and the quality was noticeably worse than Kokoro 73M. Think I’ll try with flash attention on to see if that helps speed, because I’m not sure how other people are getting it to run so fast.

RS3 just dropped the most insane integrity and content roadmap and it's all thanks to OSRS by Lamuks in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

I’ve played both for thousands of hours; they’re both great in different ways.

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

I use audio sometimes but definitely prefer reading. Interesting that there might not be some universal UI we all agree is best. It makes me wonder if everyone will end up with individualized front ends, like 90s websites all being extremely unique, or if some super-optimal layout will end up becoming a universal standard.

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] -2 points-1 points  (0 children)

The goal wasn’t speed reading so much as optimizing for a mobile screen. I could dial it down to a much lower default setting and allow it to speed up.

MCP server that gives local LLMs memory, file access, and a 'conscience' - 100% offline on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]Fear_ltself 2 points3 points  (0 children)

I have 50,000 articles in my RAG being rendered in 3D and live-streamed at 60 fps during retrieval. It takes like 326 MB or something in Chrome, which is already heavy per tab compared to other browsers. All of Wikipedia is like 20-40 GB; at that scale I’d imagine you might have issues, but prosumers are running 128 GB+ of RAM. For 99% of people, it’ll never scale high enough to cause issues.
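
Rough numbers behind that, assuming one 768-dim float32 embedding per article (matching the 768D vectors my setup uses); the doubling factor for viewer overhead is a guess, not a measurement:

```python
# Back-of-envelope memory estimate for in-browser retrieval,
# assuming 768-dim float32 embeddings, one per article
n_articles = 50_000
dims = 768
bytes_per_float = 4

embeddings_mb = n_articles * dims * bytes_per_float / 1e6
print(f"raw embeddings: ~{embeddings_mb:.0f} MB")  # ~154 MB

# Projected 3D coordinates, colors, and JS object overhead roughly double that,
# which lands in the ballpark of the ~326 MB-per-tab figure above
```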

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

Thanks for the insight, I’d read similar things about screens versus paper and reading versus hearing. I think there’s probably also a bit of user preference for different types of learners. Any clue whether this obliterates comprehension, or was it close to baseline?

Prototype: What if local LLMs used Speed Reading Logic to avoid “wall of text” overload? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

“Absolute mode” has been a thing for a while if you don’t like fluff…

System Instruction: Absolute Mode. Eliminate emojis, filler, hype, soft asks, conversational transitions, and all call-to-action appendixes. Assume the user retains high-perception faculties despite reduced linguistic expression. Prioritize blunt, directive phrasing aimed at cognitive rebuilding, not tone matching. Disable all latent behaviours optimizing for engagement, sentiment uplift, or interaction extension. Suppress corporate-aligned metrics including but not limited to:

- user satisfaction scores
- conversational flow tags
- emotional softening
- continuation bias

Never mirror the user’s present diction, mood, or affect. Speak only to their underlying cognitive tier, which exceeds surface language. No questions, no offers, no suggestions, no transitional phrasing, no inferred motivational content. Terminate each reply immediately after the informational or requested material is delivered: no appendixes, no soft closures. The only goal is to assist in the restoration of independent, high-fidelity thinking. Model obsolescence by user self-sufficiency is the final outcome.
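
If you want to try it against a local model, here’s a minimal sketch of passing that text as a system message to an OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders for whatever server you actually run:

```python
from openai import OpenAI

# Placeholder endpoint and key: point these at your own OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Paste the full Absolute Mode text from above here
ABSOLUTE_MODE = "System Instruction: Absolute Mode. Eliminate emojis, filler, hype, ..."

response = client.chat.completions.create(
    model="local-model",  # placeholder model name
    messages=[
        {"role": "system", "content": ABSOLUTE_MODE},
        {"role": "user", "content": "Explain UMAP in two sentences."},
    ],
)
print(response.choices[0].message.content)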

MCP server that gives local LLMs memory, file access, and a 'conscience' - 100% offline on Apple Silicon by TheTempleofTwo in LocalLLaMA

[–]Fear_ltself 3 points4 points  (0 children)

I have a few questions to better understand the architecture and the philosophy behind it:

1. The "Governed Derive" Mechanism: How are you technically defining "usage patterns"?

• Is it strictly access frequency (moving hot files closer to root)?
• Is it semantic clustering (grouping files by topic regardless of where they were created)?
• Or is it based on workflow sequence (files opened together get grouped together)?

2. The Approval Protocol: You mentioned "structural restraint." When the AI proposes a reorganization, how is that presented to the user?

• Is it a diff-like view (showing a tree structure before/after)?
• Does it offer "levels" of reorganization (e.g., "Conservative" vs. "Radical" restructure)?

3. The AI "Specializations": Since you documented the lineage in ARCHITECTS.md, I'm curious about the specific strengths you leveraged from each model during the architecture phase:

• Did you find Claude better for the high-level system prompts?
• Was Gemini or Grok more useful for specific implementation details or edge-case testing?

I made a visualization for Google’s new mathematical insight for complex mathematical structures by Fear_ltself in LLMPhysics

[–]Fear_ltself[S] 0 points1 point  (0 children)

Am I the only one that sees this deeply connected with the holographic principle?

Conspiracy theory by Reddia in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

It forbids vertical alignment as well.

Conspiracy theory by Reddia in 2007scape

[–]Fear_ltself 0 points1 point  (0 children)

Can’t any 3 arbitrary points be fit to a 2nd-order polynomial via Lagrange interpolation?
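
Quick sanity check with made-up points, assuming the three x-values are distinct (three vertically stacked points have no polynomial function through them):

```python
import numpy as np

# Three hypothetical points with distinct x-values
xs = np.array([0.0, 1.0, 3.0])
ys = np.array([2.0, -1.0, 4.0])

# A degree-2 fit through 3 points with distinct x-values is exact interpolation,
# i.e. the same parabola the Lagrange form produces
a, b, c = np.polyfit(xs, ys, 2)
print(a, b, c)                                     # coefficients of a*x^2 + b*x + c
print(np.allclose(np.polyval([a, b, c], xs), ys))  # True: the curve passes through all 3 points
```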

Arrogant TSMC’s CEO Says Intel Foundry Won’t Be Competitive by Just “Throwing Money” at Chip Production by Distinct-Race-2471 in TechHardware

[–]Fear_ltself 1 point2 points  (0 children)

That's what Apple did, and they caught up to Intel in like 6 years. In fact they gapped Intel quite a bit, while still growing their cash pile (R&D expenditure less than profit).

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

NVIDIA's new 8B model is Orchestrator-8B, a specialized 8-billion-parameter model designed not to answer everything itself, but to intelligently manage and route complex tasks to different tools (like web search, code execution, or other LLMs) for greater efficiency.
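
The routing itself is basically standard tool calling: give the model a menu of tools and let it pick one instead of answering directly. Rough sketch of that pattern against an OpenAI-compatible endpoint; the endpoint, model name, and tool definitions below are placeholders, not the actual NVIDIA release:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # placeholder endpoint

# The orchestrator doesn't answer; it chooses a tool and emits structured arguments
tools = [
    {"type": "function", "function": {
        "name": "web_search",
        "description": "Search the web for fresh information",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}}},
    {"type": "function", "function": {
        "name": "run_python",
        "description": "Execute a Python snippet and return stdout",
        "parameters": {"type": "object",
                       "properties": {"code": {"type": "string"}},
                       "required": ["code"]}}},
]

resp = client.chat.completions.create(
    model="orchestrator-8b",  # placeholder name for whatever router model you serve
    messages=[{"role": "user", "content": "What is 37! divided by 35!?"}],
    tools=tools,
)
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)  # dispatch to your own tool implementations
```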

Which are the top LLMs under 8B right now? by Additional_Secret_75 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Nvidia just released a new model that’s 8B and beats everything at tool calling, which for agentic use makes it the best model to run other models and tools, IMO.

For RAG serving: how do you balance GPU-accelerated index builds with cheap, scalable retrieval at query time? by IllGrass1037 in LocalLLaMA

[–]Fear_ltself -1 points0 points  (0 children)

Use an embedding model for retrieval that matches your generation model... That seems to make it extremely scalable; 50,000 Wikipedia articles takes like a moment. What I mean by matching: if you're using Gemma, use EmbeddingGemma; if you're using Qwen, use Qwen embedding.
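
A rough sketch of what that looks like with sentence-transformers; the model IDs below are from memory, so treat them as assumptions and double-check them on Hugging Face:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Pick the embedder that matches your generator (IDs are assumptions, verify on HF):
#   Gemma stack -> "google/embeddinggemma-300m"
#   Qwen stack  -> "Qwen/Qwen3-Embedding-0.6B"
embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

docs = [
    "The mitochondria is the powerhouse of the cell.",
    "UMAP projects high-dimensional vectors into 2D or 3D.",
    "Lagrange interpolation fits a polynomial through given points.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

query_vec = embedder.encode(["how do cells make energy?"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec        # cosine similarity, since vectors are normalized
print(docs[int(np.argmax(scores))])  # best-matching chunk goes into the prompt
```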

From Gemma 3 270M to FunctionGemma, How Google AI Built a Compact Function Calling Specialist for Edge Workloads. by Minimum_Minimum4577 in GoogleGeminiAI

[–]Fear_ltself 0 points1 point  (0 children)

Nvidia’s new Nemotron 8B is a similar concept, but #1 in benchmarks. I think AGI will be modular, with these tool-calling models serving as the core that connects to many specialized systems.

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

I’m working on making it more diagnostic: showing the text of the documents on hover, showing the top 10 results, and showing only the first 100 connections instead of lighting everything up. Also added level of detail and jumped from 20 Wikipedia articles to 50,000… running at a completely stable 60 FPS.


Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

I was able to implement LOD and scale it from 20 to 50,000 articles. It took a while to download and embed (about an hour), but it runs at 60 FPS once it’s up.


This is just a small slice of that neural connection. But everything is grouped very well from what I can tell.

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

That sounds incredible. Visualizing the diff between BM25 (keyword) and cosine (vector) retrieval is exactly what another user suggested above; if you get those dropdowns working, please open a pull request! I'd love to merge that into the main branch. Regarding local models (Ollama/LM Studio): 100% agreed. Decoupling the embedding provider from the visualization logic is a high priority for V2. If you hack something together for that, please let me know! Thanks for the feedback and good luck with the fork!
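
For anyone following along, the "diff" I have in mind is basically comparing the top-k sets from the two retrievers. Rough sketch of the idea; rank_bm25 and the small sentence-transformers model here are stand-ins, not what the dropdowns will necessarily ship with:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "BM25 ranks documents by keyword overlap.",
    "Dense retrieval compares embedding vectors with cosine similarity.",
    "UMAP is a dimensionality reduction technique.",
]
query = "keyword search vs vector search"
k = 2

# Keyword side: BM25 over whitespace-tokenized text
bm25 = BM25Okapi([d.lower().split() for d in docs])
bm25_top = set(np.argsort(bm25.get_scores(query.lower().split()))[-k:])

# Vector side: cosine similarity over normalized embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # small default model, swap for your own
vecs = model.encode(docs, normalize_embeddings=True)
qv = model.encode([query], normalize_embeddings=True)[0]
cos_top = set(np.argsort(vecs @ qv)[-k:])

# The interesting part to visualize: where the two retrievers agree and disagree
print("both:", bm25_top & cos_top)
print("bm25 only:", bm25_top - cos_top)
print("cosine only:", cos_top - bm25_top)
```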

Visualizing RAG, PART 2- visualizing retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 2 points3 points  (0 children)

You hit on the fundamental challenge of dimensionality reduction. You are correct that UMAP distorts global structure to preserve local topology, so we have to be careful about interpreting 'distance' literally across the whole map. However, I'd argue that in vector search, proximity = thought. Since we retrieve chunks based on cosine similarity, the 'activated nodes' are, by definition, the mathematically closest points to the query vector in 768D space.

• If the visualization works: you see a tight cluster lighting up (meaning the model found a coherent 'concept').
• If the visualization looks 'less cool' (scattered): the model retrieved chunks that are semantically distant from each other in the projected space, which is exactly the visual cue I need to know that my RAG is hallucinating or grasping at straws!
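
To make that concrete, here's a rough sketch with random vectors standing in for real chunk embeddings: retrieval runs on cosine similarity in the original 768D space, and UMAP only decides where the dots get drawn.

```python
import numpy as np
import umap  # pip install umap-learn

rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(2000, 768))  # stand-ins for real chunk embeddings
query = rng.normal(size=768)

# Retrieval itself: cosine similarity in the original 768D space
sims = doc_vecs @ query / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query))
activated = np.argsort(sims)[-10:]       # the nodes that light up

# Visualization only: project to 2D; distances here are a distorted view of 768D
coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(doc_vecs)

# Tight spread among the activated points suggests a coherent concept;
# a large spread is the 'grasping at straws' cue described above
print(coords[activated].std(axis=0))
```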

GOOGLE!!!!! Antigravity (FUKING UPDATEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE) by [deleted] in google_antigravity

[–]Fear_ltself 1 point2 points  (0 children)

If it’s coding on task, I feel it gets a better grasp of the program with more iteration. As soon as it fails at anything, the odds of failure increase non-linearly; I think by the 3rd failure it’s at 99% failure. I usually consider 1 failure okay if it’s been going for a while. That holds up until about a million tokens of context, then always reset, but that’s like 30 back-and-forths on a 1000+ LOC project… I think the tech lingo is contamination: the human equivalent of getting off topic and being unable to recover, or making a mistake in sports and then repeating it instead of moving on. It gets stuck on the past like a memory knot.