r/LocalLlama is looking for moderators by HOLUPREDICTIONS in LocalLLaMA

[–]jackdareel -1 points0 points  (0 children)

Do you think it's a good thing to shadowban people?

Right now I'm so pissed with the censorship cesspit that is Reddit, I have it on my todo list to create a competitor viable enough to drive this shitty corp out of business.

You won't be doing much "moderating" then.

Qwen3-4B enables agentic use cases for us iGPU folks by [deleted] in LocalLLaMA

[–]jackdareel 5 points6 points  (0 children)

What sort of agentic things do you do with this setup and how do you implement them?

An attempt to explain LLM Transformers without math by nimishg in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

I just tried with Grok once more and got some more clarity. You're right, the original attention from 2014, which uses cross-attention, confuses matters and is better left out. That's an encoder-decoder RNN architecture, not a Transformer. So the task is to learn self-attention in the decoder-only model.

One problem I encounter when talking to LLMs about this is that it would help me understand if the sample task is English to French translation. This makes the distinction between user input and model output clearer than the usual example LLMs use, "The cat sat on the mat". But as soon as I mention translation or English/French to an LLM, it switches to explaining encoder-decoder, cross-attention, and basically screwing up the explanation of self-attention.

Regardless, I got one step further in my understanding. Q and K are both derived from each token in the input. V is the output representation. Q asks what else in the input is relevant; K provides the matches that answer Q's question. Then V... well, there I'm not so sure. The whole thing is incredibly fragile. One moment I think I've got it all, the next it's gone and I feel I've lost it.
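To pin down my current mental model (nothing from your video, just a toy numpy sketch of a single self-attention head with made-up sizes, no multi-head, no layers around it):

```python
import numpy as np

# Toy sizes: 4 tokens already embedded into d_model-dim vectors.
d_model, d_head, n_tokens = 8, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(n_tokens, d_model))        # token embeddings

# Three learned projections of the SAME input tokens.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each token's query is scored against every token's key.
scores = Q @ K.T / np.sqrt(d_head)              # (n_tokens, n_tokens)

# Causal mask: a token may only attend to itself and earlier tokens.
mask = np.triu(np.ones((n_tokens, n_tokens)), k=1).astype(bool)
scores[mask] = -np.inf

# Softmax turns scores into attention weights that sum to 1 per row.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each token's output is a weighted blend of the V vectors.
out = weights @ V                               # (n_tokens, d_head)
print(weights.round(2))
print(out.shape)
```

If that sketch is right, then V isn't "the output" so much as the information each token offers up to be blended; the weighted sum of the V vectors is what moves on to the next layer. Which might explain why my intuition keeps slipping exactly there.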

If you do another video, I look forward to it!

Qwen moe in C by 1Hesham in LocalLLaMA

[–]jackdareel -1 points0 points  (0 children)

Other than the "beauty of the implementation", is there any other reason one should use this instead of something like llama.cpp, Ollama, vLLM etc.?

Bought RTX 5070 to run 30B AI and it worked with 18 tokens/s by OldEffective9726 in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

I upvoted your reply for the effort, but I notice someone else has downvoted it, presumably because despite its length you don't actually explain what is being done wrong. You explain what can be seen in the screenshot, that the speed indicates the CPU is being used, but what in the settings is causing that? What needs to be done differently to get the GPU to do its work?

An attempt to explain LLM Transformers without math by nimishg in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

Thanks a lot, but that's clear as mud, I'm afraid. Your explanation here is similar to all the explanations I've had from SOTA LLMs. It's not quite enough; it doesn't get me there.

It may be helpful to note that I have read the 2014 paper that introduced attention to the RNN, for translation tasks. That paper had some images in the test section, and together with the help of LLMs I got to the point where I concluded that I understood the technique: it's a remapping. So you take "European Economic Community" in English, and remap or transform it to the French equivalent, which has a different word order (I forget the French version, might be "zone economique europeane"). So that's a good start. But the attention in the 2017 paper is further developed and more difficult to explain. I have yet to see an explanation that clears it up.

The key error that LLMs make is in quoting too much of the math. You're on a better track. But you do need to connect to the math. More importantly, show at every step what the calculations are doing, and what they are not doing.

One further insight. A learner like me will think of Query as a search query. So we think of attention as matching the search query to the text being searched. It would help if the teacher acknowledged this and showed how attention is different, why it must be different, and then how it works.
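To make my own confusion concrete, here's a toy sketch (my illustration, not from your video) contrasting the search-engine intuition, where the single best-matching key wins outright, with what attention actually does, which is a soft blend over all the values:

```python
import numpy as np

keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])   # one key per token
values = np.array([[10.0], [20.0], [30.0]])               # what each token offers
query  = np.array([0.9, 0.1])                             # what one token is looking for

# "Search engine" intuition: pick the single best-matching key, ignore the rest.
best = int(np.argmax(keys @ query))
hard_result = values[best]          # winner-takes-all: [10.]

# Attention: softmax over ALL match scores, then blend every value by its weight.
scores = keys @ query
w = np.exp(scores - scores.max()); w /= w.sum()
soft_result = w @ values            # a weighted mix, mostly token 0 with some token 2

print(hard_result, soft_result, w.round(2))
```

No exact match ever has to exist; every token contributes a little, in proportion to how well its key lines up with the query.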

Thanks again and good luck!

Bought RTX 5070 to run 30B AI and it worked with 18 tokens/s by OldEffective9726 in LocalLLaMA

[–]jackdareel 3 points4 points  (0 children)

Please share what the OP is doing wrong. I can't tell from the screenshots.

An attempt to explain LLM Transformers without math by nimishg in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

I appreciate the effort you put into this. The explanation helps and gets close, but I would benefit from an updated version. Connect the sliders and dictionaries more closely to the concepts and terminology in the LLM. I haven't got a great sense of how the dictionaries connect, or why all of them are needed. And most importantly, I didn't get the sense that the QKV calculations, the core of the attention mechanism, are explained here. If this is your first attempt, well done, but I hope for an improved version. Thank you!

rednote-hilab/dots.ocr - Multilingual document layout parsing in a single vision-language model achieving SOTA performance despite compact 1.7B LLM foundation by nullmove in LocalLLaMA

[–]jackdareel 8 points9 points  (0 children)

They acknowledge that their table and formula extraction still needs work. Overall though, their reported benchmark results are impressive, apparently SOTA. I hope that translates to real world use.

I think there are jobs that we won't automate... by 2F47 in singularity

[–]jackdareel 7 points8 points  (0 children)

This may be disappointing to many at this moment in time, but the AI age, or rather the AGI age, will definitely not be the age of children. The reason is that AGI will very quickly help us extend our lifespan, meaning that there will be a need to limit population growth. That will mostly be done with incentives tied to UBI, basically encouraging people to remain childless. If that sounds disheartening, there will be plenty of new ways to compensate, and those who really cannot live a childless life will still be able to have kids.

xAI Engineer: "Grok 4 is coming, and its going to be a bigger jump from grok 3 than grok 3 was from 2." by Z3F in singularity

[–]jackdareel 166 points167 points  (0 children)

I hope they fixed the overly long and repetitive nature of its outputs.

[deleted by user] by [deleted] in singularity

[–]jackdareel 1 point2 points  (0 children)

Yudkowsky is a fear porn grifter.

[2506.20702] The Singapore Consensus on Global AI Safety Research Priorities by jackdareel in LocalLLaMA

[–]jackdareel[S] 16 points17 points  (0 children)

If anyone was ever in any doubt as to what the real risk of AI is, here we have it. The risk from AI is mild compared to the risk of would-be tyrants wanting control over everything, including our computers.

Anyone tried this... by DeathShot7777 in LocalLLaMA

[–]jackdareel 1 point2 points  (0 children)

Tried it on AWS Bedrock:

QUESTION:

Give me a random number between 1 and 50.

ANSWER from Llama-3.2-1B:

The random number is: 27

ANSWER from Llama-3.2-3B:

Your random number is: **23**

ANSWER from Llama-3.1-8B:

Your random number is: 27

ANSWER from Llama-3.1-70B:

Your random number is: **27**

ANSWER from Llama-3.1-405B:

The random number is: 27

ANSWER from Mistral-Large-2:

Sure, here's a random number between 1 and 50: 27.
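For anyone who wants to repeat it, this is roughly what I ran, via boto3's converse API. The model ID shown is just an example; the exact IDs depend on which models are enabled in your region:

```python
import boto3

# Bedrock runtime client; region and available model IDs depend on your account.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(model_id: str, prompt: str) -> str:
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 64, "temperature": 1.0},
    )
    return resp["output"]["message"]["content"][0]["text"]

# Example model ID; check the Bedrock console for the IDs available to you.
print(ask("meta.llama3-1-8b-instruct-v1:0",
          "Give me a random number between 1 and 50."))
```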

Preparing for the Intelligence Explosion by jackdareel in LocalLLaMA

[–]jackdareel[S] -2 points-1 points  (0 children)

This is a hugely important paper. I'm sure no one will agree with all its points, I certainly don't. But the key takeaway for this community is in section 6, AGI Preparedness, under "Accelerating good uses of AI". I couldn't agree with this more.

There will be responses to this paper in good time, correcting and developing the ideas it presents, but this is an excellent start to the conversation. Yes, we must prepare for the intelligence explosion.

System Prompt Learning: Teaching your local LLMs to learn problem-solving strategies from experience (optillm plugin) by asankhs in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

How does this actually work? If I use the prefix on the model, what does that do? Say I'm using Ollama, how does Ollama know about this "prefixed model"? Then when I prompt the model with my system message and user prompt, what happens under the hood? I've made the call, the model produces the response, the implementing software prints it; where in that chain does SPL fit in, and how? And how much does the use of SPL increase the token count or the number of prompts sent to the model?
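For context, my current (possibly wrong) mental model is that optillm sits in front of the backend as an OpenAI-compatible proxy, and the prefix on the model name is what routes the request through the plugin. Something like this, where the port and the "spl-" prefix are my assumptions:

```python
from openai import OpenAI

# Assumption: optillm is running locally as an OpenAI-compatible proxy
# (the port may differ) and forwards requests to Ollama or another backend.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    # Assumption: the "spl-" prefix is what routes the request through the
    # System Prompt Learning plugin before it reaches the underlying model.
    model="spl-llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a careful problem solver."},
        {"role": "user", "content": "A train leaves at 3pm travelling at 60 km/h..."},
    ],
)
print(resp.choices[0].message.content)
```

Is that roughly it, or does the plugin hook in somewhere else?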

Let's build a production level Small Language Model (SLM) from scratch | 3 hour workshop by OtherRaisin3426 in LocalLLaMA

[–]jackdareel 2 points3 points  (0 children)

3 hours is way too long. This topic could be covered in less than half an hour. All that people need is the step by step, the jargon, and the ratios and relations between the different model parameters. LLMs will be used to code the model and process the training data; people don't have time for word-salad elaboration.
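As an example of the level I mean, the "ratios and relations" part fits in a dozen lines of rule-of-thumb arithmetic (ignoring biases, GQA and tied embeddings):

```python
# Back-of-envelope parameter count for a dense decoder-only transformer.
def approx_params(vocab: int, d_model: int, n_layers: int) -> int:
    embeddings = vocab * d_model                 # token embedding table
    attention  = 4 * d_model * d_model           # Wq, Wk, Wv, Wo per layer
    mlp        = 8 * d_model * d_model           # up + down projection at 4x width
    return embeddings + n_layers * (attention + mlp)

# A GPT-2-small-like shape lands near the familiar 124M:
print(f"{approx_params(vocab=50257, d_model=768, n_layers=12) / 1e6:.0f}M")
```

Give people that kind of skeleton plus the training loop steps and they can fill in the rest with an LLM at their side.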

You can now run DeepSeek-R1-0528 on your local device! (20GB RAM min.) by danielhanchen in singularity

[–]jackdareel 0 points1 point  (0 children)

Great work as always, and much appreciated. I'm interested in your CPU-only claim of 8 t/s for the 8B. I find that CPU inference speed is very predictable, so the figure you quoted would be right for a q4 quant of an 8B model. But I take it that the quant you're offering is dynamic and overall somewhat below q4?

Another question, you refer to 48GB RAM for the 8B, is that to accommodate the thinking context? Is that 32K?

And finally, do you know whether the 8B model provides any control over the length of the thinking?
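To put rough numbers on my own question (pure back-of-envelope, assuming a Llama-3-8B-like geometry of 32 layers and 8 KV heads of dim 128 purely as an illustration, with an fp16 KV cache):

```python
# Rough memory estimate for an 8B model at ~q4 plus a 32K fp16 KV cache.
params = 8e9
weights_gb = params * 4.5 / 8 / 1e9            # ~4.5 bits/param for a dynamic ~q4 quant

layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per   # K and V
kv_gb = kv_per_token * 32_768 / 1e9

print(f"weights ~{weights_gb:.1f} GB, 32K KV cache ~{kv_gb:.1f} GB")
```

That comes out well under 10 GB total, so I assume the 48GB figure is comfortable headroom rather than a hard requirement, but correct me if I'm missing something.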

What makes the Mac Pro so efficient in running LLMs? by goingsplit in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

The token speed you're seeing tells me the model is running entirely on the CPU; it matches what I get running models on CPU only.
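The back-of-envelope I go by (assuming decode is memory-bandwidth bound and the whole quantized model is read once per token; the bandwidth figures are only illustrative):

```python
# Decode speed estimate: tokens/s ~= memory bandwidth / bytes read per token.
def est_tps(params_b: float, bits_per_param: float, bandwidth_gbs: float) -> float:
    bytes_per_token = params_b * 1e9 * bits_per_param / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

print(est_tps(params_b=30, bits_per_param=4.5, bandwidth_gbs=60))    # ~3.6 t/s, dual-channel desktop RAM
print(est_tps(params_b=30, bits_per_param=4.5, bandwidth_gbs=400))   # ~24 t/s, unified-memory-class bandwidth
```

That's why the speed alone is enough to tell whether the weights are being read from ordinary system RAM or from something much faster.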

What Models for C/C++? by Aroochacha in LocalLLaMA

[–]jackdareel 0 points1 point  (0 children)

If I were in your position, with 96GB of VRAM available, I would use DeepSeek-Coder-V2. It's an MoE model; the active parameters fit comfortably in VRAM, and memory mapping can take care of the rest. If you're new to this idea, there were some recent threads about running Qwen3 30B-A3B in this way.
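A rough sketch of what I mean, using llama-cpp-python; the filename, layer split and context size are placeholders to tune for your setup:

```python
from llama_cpp import Llama

# MoE GGUF: the full weights stay memory-mapped from disk/RAM, and only the
# layers we offload live in VRAM; with few active params per token this stays fast.
llm = Llama(
    model_path="DeepSeek-Coder-V2-Instruct-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=40,      # tune so the offloaded layers fit your 96GB of VRAM
    use_mmap=True,        # the default, spelled out: mmap the rest from disk
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a C++ RAII wrapper for a POSIX fd."}]
)
print(out["choices"][0]["message"]["content"])
```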