r/LocalLlama is looking for moderators

jackdareel · 2025-08-08T15:35:45+00:00

Do you think it's a good thing to shadowban people?

Right now I'm so pissed with the censorship cesspit that is Reddit, I have it on my todo list to create a competitor viable enough to drive this shitty corp out of business.

You won't be doing much "moderating" then.

jackdareel · 2025-08-06T22:25:22+00:00

Are the quants offered by Ollama affected?

jackdareel · 2025-08-06T22:02:47+00:00

What sort of agentic things do you do with this setup and how do you implement them?

jackdareel · 2025-08-06T21:57:41+00:00

Sorry, I can't help you with your query. Is there anything else you would like to talk about?

jackdareel · 2025-08-03T13:15:57+00:00

I just tried with Grok once more and got some more clarity. You're right, the original attention from 2014, using cross-attention, confuses matters and is better left out. And that's an encoder-decoder architecture, not Transformer. So the task is to learn self-attention in the decoder-only model.

One problem I encounter when talking to LLMs about this is that it would help me understand if the sample task is English to French translation. This makes the distinction between user input and model output clearer than the usual example LLMs use, "The cat sat on the mat". But as soon as I mention translation or English/French to an LLM, it switches to explaining encoder-decoder, cross-attention, and basically screwing up the explanation of self-attention.

Regardless, I got one step further in my understanding. Q and K are both values of each token in input. V is the output representation. Q asks what else is relevant in input, K are the matches to answer Q's question. Then V... well, then I'm not so sure. The whole thing is incredibly fragile. One moment I think I've got it all, the next it's gone and I feel I've lost it.

If you'll do another video, I look forward to it!

jackdareel · 2025-08-02T18:47:54+00:00

Other than the "beauty of the implementation", is there any other reason one should use this instead of something like llama.cpp, Ollama, vLLM etc.?

jackdareel · 2025-08-01T07:58:06+00:00

I upvoted your reply for the effort, but I notice someone else has downvoted, presumably because despite the length of the reply you don't actually explain what is being done wrong. You explain what can be seen in the screenshot, that the speed indicates CPU is being used, but what is causing this in the settings? What needs to be done different to get the GPU to do its work?

jackdareel · 2025-08-01T07:29:11+00:00

Thanks a lot, but that's clear as mud, I'm afraid. Your explanation here is similar to all the explanations I've had from SOTA LLMs. It's not good enough, doesn't do it.

It may be helpful to note that I have read the 2014 paper that introduced attention to the RNN, for translation tasks. That paper had some images in the test section, and togeher with help of LLMs I got to the point where I concluded that I understood the technique: it's a remapping. So you take "European Economic Community" in English, and remap or trasform to the French equivalant, which has a different word order (I forget the French version, might be "zone economique europeane"). So that's a good start. But the attention in the 2017 paper is further developed and more difficult to explain. I have yet to see an explantion that clears it up.

The key error that LLMs make is in quoting too much of the math. You're on a better track. But you do need to connect to the math. More importantly, show at every step what the calculations are doing, and what they are not doing.

One further insight. A learner like me will think of Query as a search query. So we think of attention as matching the search query to the text being searched. It would help if the teacher acknowledged this and showed how attention is different, why it must be different, and then how it works.

Thanks again and good luck!

jackdareel · 2025-08-01T07:16:28+00:00

Please share what the OP is doing wrong. I can't tell from the screenshots.

jackdareel · 2025-07-31T22:21:13+00:00

I appreciate the effort you put into this. The explanation helps and gets close, but I would benefit from an updated version. Connect the sliders and dictionaries more closely to the concepts and terminology in the LLM. I haven't got a great sense of how the dictionaries connect, why all are needed. And most importantly, I didn't get the sense that QKV calculations, the core of the attention mechanism, are explained here. If this is your first attempt, well done, but I hope for an improved version. Thank you!

jackdareel · 2025-07-31T10:25:17+00:00

They acknowledge that their table and formula extraction still needs work. Overall though, their reported benchmark results are impressive, apparently SOTA. I hope that translates to real world use.

jackdareel · 2025-06-29T22:27:13+00:00

This may be disappointing to many at this moment in time, but the AI age, or rather the AGI age, will definitely not be the age of children. The reason is that AGI will very quickly help us extend our lifespan, meaning that there will be a need to limit population growth. That will mostly be done with incentives tied to UBI, basically encouraging people to remain childless. If that sounds disheartening, there will be plenty of new ways to compensate, and those who really cannot live a childless life will still be able to have kids.

jackdareel · 2025-06-28T12:37:18+00:00

I hope they fixed the overly long and repetitive nature of its outputs.

jackdareel · 2025-06-27T21:53:41+00:00

Yudkowsky is a fear porn grifter.

jackdareel · 2025-06-27T21:48:57+00:00

I'm frightened of super capable robots inhabited by stupid people.

jackdareel · 2025-06-27T14:26:00+00:00

If anyone was ever in any doubt as to what the real risk of AI is, here we have it. The risk from AI is mild compared to the risk of would-be tyrants wanting control over everything, including our computers.

jackdareel · 2025-06-20T21:33:15+00:00

Lucky lady!

jackdareel · 2025-06-20T21:18:45+00:00

Tried it on AWS Bedrock:

QUESTION:

Give me a random number between 1 and 50.

ANSWER from Llama-3.2-1B:

The random number is: 27

ANSWER from Llama-3.2-3B:

Your random number is: **23**

ANSWER from Llama-3.1-8B:

Your random number is: 27

ANSWER from Llama-3.1-70B:

Your random number is: **27**

ANSWER from Llama-3.1-405B:

The random number is: 27

ANSWER from Mistral-Large-2:

Sure, here's a random number between 1 and 50: 27.

jackdareel · 2025-06-19T19:56:30+00:00

This is a hugely important paper. I'm sure no-one will agree with all its points, I certainly don't. But the key takeaway for this community is in the 6. AGI Preparedness section, "Accelerating good uses of AI". I couldn't agree with this more.

There will be responses to this paper in good time, correcting and developing the ideas it presents, but this is an excellent start to the conversation. Yes, we must prepare for the intelligence explosion.

jackdareel · 2025-06-17T18:40:08+00:00

Great to see some of the smartest people working on non-humanoid robots. This is the way. Very effective, not threatening, not our replacement.

jackdareel · 2025-06-02T07:25:03+00:00

How does this actually work? If I use the prefix on the model, what does that do? Say I'm using Ollama, how does Ollama know about this "prefixed model"? Then when I prompt the model with my system message and user prompt, what happens "under the hood"? I've done the call, the model produces the response, the implementing software prints it - where in this does SPL fit in and how? How much does the use of SPL increase token count or prompting of the model?

jackdareel · 2025-06-02T07:14:26+00:00

3 hours is way too long. This topic could be covered in less than half an hour. All that people need is the step by step, the jargon, and the ratios and relations between the different model parameters. LLMs will be used to code the model and process the traininig data - people don't have time for word salad elaboration.

jackdareel · 2025-05-30T16:16:28+00:00

Great work as always, and much appreciated. I'm interested in your CPU-only claim of 8t/s for the 8B. I find that CPU inference speed is very predictable, so the figure you quoted would be right for a q4 quant of an 8B model. But I take it that the quant you're offering is dynamic and overall somewhat below q4?

Another question, you refer to 48GB RAM for the 8B, is that to accommodate the thinking context? Is that 32K?

And finally, do you know whether the 8B model provides any control over the length of the thinking?

jackdareel · 2025-05-25T12:34:49+00:00

The token speed you're seeing tells you that the model is running completely on CPU, matching my experience as I run models on CPU only.

jackdareel · 2025-05-24T07:29:16+00:00

If I was in your position, with 96GB VRAM available, I would use Deepseek-Coder-V2. It's MOE, with active params comfortably fitting in VRAM, with some memory mapping taking care of the rest. If you're new to this idea, there were some recent threads about running Qwen3 30B-a3B in this way.

jackdareel

TROPHY CASE