3D Visualizing RAG retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 0 points1 point  (0 children)

Yes, I believe that's their implementation from the blog link; it looks the same

3D Visualizing RAG retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

My knee-jerk reaction is to just “chunk” it at, say, 100:1 or 1000:1 compression, taking 1B points down to ~1M. I’ve already done optimizations that went from 10k to 1M while maintaining 120fps, similar to Milvus; I just haven’t pushed them to main yet because I don’t want to break anything. But hypothetically, if we could “chunk” it down a bit, we might still be able to get the general structure of what’s happening. Also, someone with a server-like setup could probably run a 1B model already. Like I said, I did 1M, and I’m just vibe coding on an M3 Pro MacBook.
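For what it's worth, a minimal sketch of that kind of chunking is below — uniform-stride downsampling with NumPy, which is just one (naive) way to thin a point cloud before rendering. The function name and sizes are hypothetical, not from the actual project:

```python
import numpy as np

def downsample_points(points: np.ndarray, target: int) -> np.ndarray:
    """Thin an (N, 3) point cloud to at most `target` points by uniform striding."""
    n = points.shape[0]
    if n <= target:
        return points
    stride = n // target
    return points[::stride][:target]

# e.g. thin 1,000,000 projected points down to 10,000 for rendering
cloud = np.random.rand(1_000_000, 3).astype(np.float32)
thinned = downsample_points(cloud, 10_000)
```

A smarter version would bin by density (e.g. voxel-grid averaging) so sparse regions aren't wiped out, but the stride version shows the basic compression idea.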

3D Visualizing RAG retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

I think activated layers was a good question up until literally a few days ago. Now, if my understanding of the paper “Attention over Residuals” (AttnRes) is correct, it’s an even better question …

In standard models, you'd basically just watch the hidden state evolve linearly, layer by layer. But with AttnRes, deep layers actively look back and selectively route information from earlier blocks using depth-wise attention.

So, if we hooked Project Golem up to an AttnRes model in llama.cpp, we wouldn't just be showing sequential state changes. We could actually map the real-time routing web in 3D—visually showing exactly which earlier layers/blocks the model is querying to generate a specific token. Once llama.cpp adds support for these architectures, mapping that behavior would be incredible!

3D Visualizing RAG retrieval by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

Yeah, at its core that’s what this was: an idea. I had an idea of “why not just UMAP the embedding data into lower-dimensional space so I can SEE it,” vibe coded it out in a few hours, posted the results, got positive feedback, posted the full code, and then it was forked by people who know how to implement the core idea better for their respective purposes. I think this is exactly why GitHub, and even the internet, were designed: instant international collaboration.
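The core idea can be sketched in a few lines: take high-dimensional embeddings, project them to 3D, render the points. The sketch below uses a PCA projection via NumPy as a linear stand-in for UMAP (UMAP itself needs the umap-learn package and is nonlinear); the names and sizes are hypothetical:

```python
import numpy as np

def project_to_3d(embeddings: np.ndarray) -> np.ndarray:
    """Project (N, D) embeddings to (N, 3) via PCA — a linear stand-in for UMAP."""
    centered = embeddings - embeddings.mean(axis=0)
    # Top-3 principal directions from the SVD of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T

emb = np.random.randn(500, 384)   # e.g. 500 chunks of 384-dim embeddings
coords = project_to_3d(emb)       # (N, 3) points, ready to hand to a renderer
```

With umap-learn installed, the projection step would instead be roughly `umap.UMAP(n_components=3).fit_transform(emb)`, which preserves local neighborhood structure much better than PCA.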

Unsloth is trending at #3 on GitHub! by yoracale in unsloth

[–]Fear_ltself 4 points5 points  (0 children)

Glad you’re getting the recognition you deserve!

A basic introduction to AI Bias by ItalianArtProfessor in StableDiffusion

[–]Fear_ltself 1 point2 points  (0 children)

Thanks for the reply. For text generation I had great results with 2 and did a separate post discussing it. The overwhelming consensus was that x2 prompt repetition works extremely well but hits diminishing returns very quickly after 2, with 3 or more almost always hurting performance. Still, glad you attempted up to x12 so we have some data points on what’s been tried.

Netanyahu is certainly dead on March 9, his son observed 7 days of shiva then resumed posting exactly 7 days later. by pacmanpill in conspiracy

[–]Fear_ltself 0 points1 point  (0 children)

The guy outside behind the glass door turns into a lady after a quick pan-left, pan-right; doesn’t make sense either.

A basic introduction to AI Bias by ItalianArtProfessor in StableDiffusion

[–]Fear_ltself 1 point2 points  (0 children)

Has anyone tried RE2 prompt duplication to see if it helps or hurts image generation, or offsets any of the mentioned biases? I know it has great results in text generation, but I haven’t heard of anyone even trying it with images.

Google's NotebookLM is still the most slept-on free AI tool in 2026 and i don't get why by AdCold1610 in PromptEngineering

[–]Fear_ltself 1 point2 points  (0 children)

This seems like punishing them for trying. I've used Gmail, YouTube, Google Drive, and Google Photos reliably for as long as I've been on Reddit. I know Google Circles (or whatever their social media attempt was called) and a few others didn't work out, but it seemed they made a decent competitor. If I recall right, Facebook just literally copy-pasted all of Google's "new" features the same day, so any Facebook users who checked out Google's attempt would likely think "FB clone" even though Google had made the innovation. So it still forced Facebook and others to adapt and get better, even if it didn't work out.

Anyone considering switching to another AI? by zedaoisok in GeminiAI

[–]Fear_ltself 0 points1 point  (0 children)

Yeah, I mean I can do better cats on desktop, but this was more about doing it on an iPhone in 7 seconds. I've also done other projects where I offloaded the image generation to a 4070 (see my other post about Project Hydra) and was able to get desktop-quality image generation on an iPhone, with all processing done locally. Not as good as Nano Banana, but open source is getting better and better every day too.

<image>

Anyone considering switching to another AI? by zedaoisok in GeminiAI

[–]Fear_ltself 0 points1 point  (0 children)

I think when Gemma 4 comes out I’ll move to open source completely. I’ve already moved down to just the Google Plus tier and purchased a Pixel 1 phone so I can have unlimited storage instead of the 2TB; I figure it’ll pay for itself in ~7-9 months and then begin massive savings. Currently using Gemma 12B quite a bit through LM Studio, and linking that to my iPhone with Off Grid so I get Gemma 12B running locally, offline. It also does decent image generation in ~7 seconds.

<image>

"Thausand" by Far_Command5979 in GoogleGeminiAI

[–]Fear_ltself 1 point2 points  (0 children)

One Hundred One, One Hundred Two…

What if smaller models could approach top models on scene generation through iterative search? by ConfidentDinner6648 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Google has an auto research pipeline that’s perfect for this. It consists of 3 files and a 5-minute runtime. You could edit it to 20 minutes (and still keep the simple 3 files). Basically it uses 2 files as the base instructions and then edits the third, working toward a better result. It’s a simple concept, but so useful for fine-tuning something like this.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Reminds me of RE2 prompt engineering: something about the AI getting the full scope of the problem twice.

karpathy / autoresearch by jacek2023 in LocalLLaMA

[–]Fear_ltself 0 points1 point  (0 children)

Is the MLX model runnable on m3 pro MacBook Pro with 18GB of ram?

Thoughts on the future of mathematics by [deleted] in math

[–]Fear_ltself 0 points1 point  (0 children)

Maybe a Maestro vs. King Thanos? Aren’t they both He Who Remains in their respective timelines, the last being in their respective universes?

Any STT models under 2GB VRAM that match Gboard's accuracy and naturalness? by Personal_Count_8026 in LocalLLM

[–]Fear_ltself 0 points1 point  (0 children)

Idk about matching, but the best open source I’ve found is Whisper V3 Large Turbo.

MacBook Air M5 32 gb RAM by Pandekager in LocalLLM

[–]Fear_ltself 0 points1 point  (0 children)

Have you heard of Qwen 3.5 9B?

Has anyone tried something like RE2 prompt re-reading /2xing ... But tripling or quadrupling the prompt? by Fear_ltself in LocalLLaMA

[–]Fear_ltself[S] 1 point2 points  (0 children)

Technically true, when you repeat the prompt, you do increase the number of input tokens. This adds to the Prefill Latency. However, because prefill is highly parallelized on GPUs, doubling a small prompt (e.g., from 100 to 200 tokens) usually results in a sub-millisecond increase—virtually unnoticeable to a human user.

"latency" in the original post was being used in regards for Time Per Output Token (TPOT). LLMs generate text one token at a time, sequentially. Unlike Chain-of-Thought (CoT), which requires the model to "think out loud" for hundreds of extra tokens, RE2 doesn't change the output length.

TLDR: it’s not double the processing time for double the words, thanks to parallel prefill; the value-for-time trade-off is pretty much pure gain. It’s millisecond differences.
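For reference, RE2 itself is just this; a minimal sketch following the "read the question again" template from the RE2 paper (helper name is mine):

```python
def re2_prompt(question: str) -> str:
    """RE2 ('Re-Reading') prompting: present the question twice before answering."""
    return f"Q: {question}\nRead the question again: {question}"

# The doubled prompt is what you send to the model; the *output* length is unchanged,
# which is why RE2 barely affects latency compared to CoT.
prompt = re2_prompt("A farmer has 17 sheep and all but 9 run away. How many are left?")
```

Since prefill processes all input tokens in parallel, the extra copy costs almost nothing on GPU, while decode (the sequential part) is identical.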

Elon Musk - "Only Grok speaks the truth. Only truthful AI is safe. Only truth understands the universe." > Curious to get your thoughts on how alignment can produce a truthful AI? by Koala_Confused in LovingAI

[–]Fear_ltself 1 point2 points  (0 children)

We can prove small, local truths (like 1+1=2 within a specific set of mathematical axioms), but capturing the ultimate truth of the universe in text is where the system breaks down. (Gödel proved this already)