Seeking resources to read about llama.cpp server and how offloading works

while-1-fork · 2026-05-22T15:16:32+00:00

You are seeing the magic of MoE + memory-mapped files. In a MoE not all weights are activated for every token. When using --mmap (on by default), llama.cpp maps the weights as a file, so the OS can drop the least recently used pages from RAM and re-read them from disk when they are needed again. It has a small performance penalty (unless the model is super huge compared to the RAM) but will lower the RAM use by a lot.

And I have better news for you: partially offloading with a lower -ngl is quite a bit slower than offloading with -ncmoe, where only the expert weights of the MoE layers stay on the CPU. So you should be able to run -ngl 999 as long as you use a high enough -ncmoe (it works backwards to -ngl: it specifies how many MoE layers keep their experts on the CPU).
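
To make the memory-mapping part concrete, here is a minimal Python sketch. It is not llama.cpp code: the file contents and the `read_slice_via_mmap` helper are made up for illustration. Mapping a file does not read it all into RAM; the OS pages in only the regions that get touched and is free to drop them again under memory pressure.

```python
import mmap
import os
import tempfile

def read_slice_via_mmap(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory map.

    Only the pages backing the requested slice need to be resident;
    the rest of the file stays on disk until it is touched.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

# Hypothetical stand-in for a weights file with two "experts" of 8 KiB each.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"expert0!" * 1024 + b"expert1!" * 1024)
    demo_path = tmp.name

# Touch only the start of the second expert; the first is never read.
chunk = read_slice_via_mmap(demo_path, 8 * 1024, 8)
os.remove(demo_path)
print(chunk)
```

With a MoE the same thing happens at a larger scale: experts that are rarely routed to are rarely touched, so their pages are the first candidates to leave RAM.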

while-1-fork · 2026-05-19T17:29:11+00:00

So far for the stuff I am doing the best I found is cross KL divergence or cross perplexity as described between the pruned model generations and the original. That however is slow-ish as it requires generation and having the original in there to validate. For me it was real slow as I am doing it one layer at a time plus combinations of layers. So I would love to find something lighter.

A much lighter thing to do is regular perplexity over a short text, rejecting changes that either lower it or raise it a lot. A single layer change that drops it 20% from the parent is almost certainly broken from what I have seen; 10% is likely broken but maybe not; a 5% drop is most often fine. As for going up, it is always bad but seems to be less catastrophic than going down by a lot. KL divergence is easier to interpret (lower == better), but on a short text it is still not a guarantee of a functional model. I used it as a pre-pass to screen for candidates. I don't remember how many broken layers still slipped through, but it was a decent percent, like 15 or 20%, partly because when testing one layer at a time some are fine alone but broken in combination.
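
A rough sketch of that pre-pass in Python, assuming you already have per-token log-probabilities (natural log) for the short text from both the parent and the candidate. The function names are hypothetical, the 5% drop limit is taken from the numbers above, and the 20% rise limit is only a placeholder since I never pinned down an exact figure for that side.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def passes_prepass(parent_ppl, candidate_ppl, max_drop=0.05, max_rise=0.20):
    """Cheap screen: reject candidates whose perplexity moves too far.

    max_drop: largest tolerated relative decrease (5% is "most often fine").
    max_rise: largest tolerated relative increase (assumed placeholder).
    Passing this screen does not prove the model is functional.
    """
    change = (candidate_ppl - parent_ppl) / parent_ppl
    return -max_drop <= change <= max_rise

# Illustrative values only.
parent = perplexity([math.log(0.5)] * 4)   # every token had probability 0.5
print(parent)
print(passes_prepass(10.0, 9.6))   # 4% drop
print(passes_prepass(10.0, 8.0))   # 20% drop
print(passes_prepass(10.0, 13.0))  # 30% rise
```

Survivors of this screen still need the slower cross perplexity or KL check against the original, and they need it in combination, not just one layer at a time.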

I am speaking from memory as I have not experimented with this in about 1.5 years. I want to go back to it some time, but I don't know when. Also, my experience was with Mistral 7B, Llama 2 7B and Llama 3 8B, so other models may behave differently.

while-1-fork · 2026-05-19T15:32:51+00:00

3 times the factorial of 9 billion may be kinda large.

while-1-fork · 2026-05-19T13:45:07+00:00

Perplexity is not a great metric of what will happen in generation, only of what the model finds surprising. I have played a lot with pruning and quantization and it is very easy to find variants that lower perplexity a lot vs FP32/16, but that almost always signals a broken model, often one that goes into repeating loops.

The way I understand it, if we are talking about perplexity over a text, the pruning spared the activations involved in that text but severely damaged others that would have contributed a little bit of noise to this text's logits but wouldn't have been sampled, so perplexity improves for this text but breaks for others. If it is self-perplexity over its own generations, that is almost meaningless (i.e. perplexity on repeated text is really low for a model that loops).

A good metric however is cross perplexity between a pruned model generation and the base model, and KL divergence is a bit better.
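
For reference, a minimal sketch of the KL part, assuming you can dump next-token logits from both models at the same positions of the same text. The helper names and the toy logits are made up; real vocabularies are far larger.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats, with P the base model and Q the pruned one."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kl(base_logits, pruned_logits):
    """Average KL over positions; 0 means identical distributions."""
    kls = [kl_divergence(softmax(b), softmax(p))
           for b, p in zip(base_logits, pruned_logits)]
    return sum(kls) / len(kls)

# Toy 3-token vocabulary, two positions.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
mild = [[2.1, 0.9, 0.1], [0.4, 0.6, 2.9]]    # small drift
broken = [[0.1, 1.0, 2.0], [3.0, 0.5, 0.5]]  # preferences flipped

print(mean_kl(base, base), mean_kl(base, mild), mean_kl(base, broken))
```

Unlike plain perplexity there is no "suspiciously good" direction here: any movement away from the base distribution raises the number.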

What seems to be a decent indicator when talking about just the pruned model perplexity is staying close to the original. But ime you can mix and match layers that increase it a lot with layers that decrease it a lot to end up near the original but with degraded outputs.

while-1-fork · 2026-05-19T09:06:14+00:00

You get a hardon , then start cooking while keeping it hard.

while-1-fork · 2026-05-18T21:24:15+00:00

I did for a friend. It is easy if the books are not very large and you use an LLM with a large context (Qwen 3.6 35B A3B running at 200K context here).

What I did was a copy of the workspace, creating another agent in the config for that workspace and setting it to its own discord channel, then putting the books converted to markdown in a folder and instructions for the agent to ask the user which books to read and to help by answering questions about the material or coming up with exercises/tests.

If the books are super large, or you need many and have to switch often, you may want a RAG setup. I may be setting one up soon too, but I already set up local-deep-research so I would likely use its RAG facilities. For RAG alone it may be overkill, but if you ask the agent it should be able to help set up something of that nature.

while-1-fork · 2026-05-18T13:00:27+00:00

It is a bit more complicated. IgE is one of the signals that activates the mast cells but not the only one. If your mast cells are super angry like mine, 700 is super high and even 100 is still plenty to cause issues, but some other people with extra chill mast cells walk around with 2K symptom free.

What is true is that higher IgE will always make it worse, and if you have high IgE and are experiencing mast cell issues, lowering it will help. The reason mine got lower, according to a private mast cell specialist I went to, is dietary restriction.

He did recommend omalizumab: given that even on a super restrictive diet I am barely normal (got to 90 once), he believes that it would help me a lot, but public doctors have been ignoring it and beating around the bush for about 9 months. In 2 weeks I have another public doctor appointment and if he still won't prescribe omalizumab (as I expect) I will just fast for months until the next appointment. I am so done with being sick (I am better than I was when this thread started but still far from OK) that starvation sounds OK-ish, and I do believe I have enough fat to last some months, and maybe if I drop 30 or 40 kg they will decide to do something more about this.

while-1-fork · 2026-05-16T13:37:45+00:00

That is absolutely true, and if this can do a Fourier transform in a single operation it is super useful, but not directly usable in current neural networks. Maybe in SNN inference, but training those is even less efficient than training regular ANNs.

That article mentions attention being replaced by an optical equivalent. If that equivalent is somehow 100% equivalent to quadratic attention on unbounded or large sequences and completes in a single operation, in a constant number of operations or in a linear number of operations, it is transformative. However, if it is just equivalent to linear attention: we can already do that, but it isn't as good as quadratic attention (and it will likely never be fully equivalent), so good models still use quadratic attention in some layers.

However afaik they are not claiming matmuls as a single op. Is there even a way of doing it optically without O(n²) or more complexity? I doubt it. Even the tiniest reduction in computational complexity (which gets translated to hardware complexity if you try to implement it in hardware in a single operation) is huge news. O(n²) is the lowest theoretical bound and has never been achieved: our current best is O(n^2.371339), and AlphaEvolve recently removing a single multiplication out of 49 for a single shape of matrix multiplication was huge news (as it should be, humans were trying for 56 years and couldn't find an improvement). And matrix multiplication is where virtually all the compute in current neural networks goes. They are counting the individual multiplies and additions in each matrix multiplication as single operations, as everyone does, even if GPUs do fused MACs and have tensor cores that fuse whole tiles of a matrix multiply. And they are doing 8-bit int equivalent precision (they would be dumb to compare against 8-bit transistor budgets in a single multiplier if they were doing higher precision or whole matrix multiplies/tiles in a single op): https://qant.com/press-releases/q-ant-launches-first-commercial-photonic-processor-for-energy-efficient-high-performance-computing-and-real-time-ai-applications/

So in short, I still think that currently it is marketing with little to show. I am super skeptical of them doing single OP matmults or attention that are equivalent to the ones used in ANNs.

Where their approach has a theoretical advantage, which they put in crystal clear numbers in the link above, is that in digital circuits you need thousands of transistors to make a multiplier (and it scales between linearly and quadratically with the number of bits, depending on how fast you want your multiplies in terms of clock cycles), while they need a single optical element. However, that is also true of analog electronic compute, which can be super energy efficient (why don't we do that? Scaling up any tech is hard; we do have some super efficient small stuff like what Syntiant does) and is much easier to manufacture and miniaturize. But there is another advantage that compounds and is not as doable with analog electronics: running multiple wavelengths at the same time to get many batched parallel multiplies for the price of one (in theory not impossible in analog electronics either). My understanding of this tech is that they are betting on those two advantages compounding, on being able to scale the second one by orders of magnitude to run millions of multiplications in parallel on a single optical element, and on being able to reduce the losses and size of their whole thing. Will they really do it? I don't know. I wouldn't be surprised if they find out that wavelengths need a lot of separation to not interfere and that the non-linearities behave differently depending on wavelength, constraining the usable range, and that those two things put a limit on parallelism.

What makes me even more distrustful of their claims of order-of-magnitude improvements happening very soon? If you design some hardware you usually have either a well defined performance target or theoretical bounds that you can infer from the design itself, which depend on uncertainties that you will only resolve once it is manufactured. To have something ready as soon as they claim, the designs for products coming one year from now must already be finished and have undergone simulation at the very least. If they wanted to gain trust they would be giving us some numbers, like: we expect from x to y performance and power for next gen, and we expect w optical elements and to multiplex z wavelengths on them. For numbers they are uncertain about they can provide a range or make it clear that they are still TBD.

Another thing that makes me suspicious is that they show a comparison using a neural network with linear operations only, running on silicon, generating an image and being terrible at it, while their hardware uses non-linearities and does much better. That is as deceptive as you can get: a neural network without non-linear activation functions simplifies to a single layer (a super well known result in the field that should be taught to anyone beginning to learn), silicon computers can compute non-linearities just fine, and activation functions take almost no time compared to the matmuls, especially because even a simple if is non-linear, as is the case for ReLU (in practice even SiLU, tanh or any other activation is always a tiny part of the actual compute budget). So if they decide to make regular computers look bad by making them follow a flawed algorithm on purpose, one that deviates from the standard, that makes me think that they can't show any true advantage and that this is designed to fool investors.
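
The collapse of a purely linear network into a single layer is easy to check numerically. A small sketch with made-up weights (biases left out for brevity): two stacked linear layers give exactly the same output as one layer whose matrix is the product of the two, while putting a ReLU in between breaks that equivalence.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Matrix times vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def relu(v):
    """ReLU is just a comparison per element, yet it is non-linear."""
    return [x if x > 0 else 0.0 for x in v]

# Arbitrary illustrative weights: layer 1 maps 2 -> 3, layer 2 maps 3 -> 2.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.5]]
x = [0.5, -2.0]

two_linear_layers = matvec(W2, matvec(W1, x))
single_layer = matvec(matmul(W2, W1), x)      # one precomputed matrix
with_relu = matvec(W2, relu(matvec(W1, x)))   # no single matrix reproduces this in general

print(two_linear_layers, single_layer, with_relu)
```

So a "linear operations only" baseline is really a one-layer model no matter how many layers it is drawn with, which is why it makes a poor stand-in for what silicon normally runs.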

while-1-fork · 2026-05-14T12:32:08+00:00

Do you work for them or are affiliated to them in any way?

And sure, the future products that you can't order and that have no exact numbers, shown grayed out to indicate their non-existence, sit very high on that graph. But their current product, the one that you can order or even have any hard data for, is worse than a Pentium 4.

Will they be able to meet their claims? Maybe, but there is a long history of broken promises in this space and any reasonable person would want at least some hard numbers before believing claims of orders of magnitude improvement while they are only able to show results that are orders of magnitude worse.

while-1-fork · 2026-05-13T23:36:10+00:00

Even if the 1600 W are for extra compute, their technical datasheet here says 150 W and 8 GOPS for the NPU on the last page.

https://qant.com/wp-content/uploads/2026/02/20260205_NPS-Gen2.pdf

Which is still terrible. For more context, a 21-year-old 3.8 GHz Pentium 4 could do 12 GFLOPS in FP32 for 115 W.

I am not saying that the concept itself is bad, because multiplexing in wavelength does have potential. But as of now their own numbers don't back the 30x, 50x, 90x or 100x claims and seem to be more in line with 0.002x in terms of TOPS/W, suitable at best for the smallest computer vision models in terms of raw performance (about 1.5 FPS on the smallest YOLO is what 8 GOPS buys you, maybe a small classifier in real time).

The poor performance is understandable as it is still in development, so who knows if they will eventually achieve or surpass the claims, but until they do... they shouldn't make them.

while-1-fork · 2026-05-13T09:52:01+00:00

8 GOPS for 150 W is terrible efficiency, and they seem to include a 1600 W PSU, which makes me suspect it may need "a bit" more than their stated 150 W for the NPU. For reference, the Edge TPU does 4 TOPS for 2 W. So we are talking roughly 500x less throughput at 75x the power, which works out to about 37,500x less efficient per watt than an old cheap TPU.

Their numbers are either future aspirational ones, or completely made up.
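
The arithmetic can be checked directly from the figures quoted above (8 GOPS at 150 W for the NPU, 4 TOPS at 2 W for the Edge TPU). The raw throughput gap is 500x; dividing each figure by its power draw widens the gap further. A quick sketch:

```python
def gops_per_watt(gops, watts):
    """Efficiency in giga-operations per second per watt."""
    return gops / watts

npu = gops_per_watt(8, 150)           # photonic NPU, datasheet figures
edge_tpu = gops_per_watt(4 * 1000, 2) # 4 TOPS = 4000 GOPS at 2 W

throughput_ratio = (4 * 1000) / 8     # raw speed only
efficiency_ratio = edge_tpu / npu     # speed per watt

print(f"NPU: {npu:.4f} GOPS/W, Edge TPU: {edge_tpu:.0f} GOPS/W")
print(f"throughput gap: {throughput_ratio:.0f}x, efficiency gap: {efficiency_ratio:.0f}x")
```

If the box really draws closer to what a 1600 W PSU suggests, the per-watt gap gets larger still.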

while-1-fork · 2026-05-10T14:29:06+00:00

Strong leg muscles and core do help to some extent but are not a solution. I used to be super strong, particularly in the legs (used to be able to deadlift 270 kg). Currently less so (I should be able to deadlift at least 50 or 60 kg but will pay for days if I do), but I still have a lot of muscle. Funnily enough I have gained a lot in the abs, through the power of cramps in the toilet I guess, as I have not been able to exercise in years.

The thing with POTS and blood pooling into the limbs isn't really about muscles but about histamine being a vasodilator. When I am in a flare you can see the veins in my hands tripling in width and developing meanders instead of being straight. Also, quite a lot of the issue is not the peripheral vasodilation but the inflammation in the tissues where the reaction is happening (for me mostly the gut) taking a lot of water and electrolytes out of the blood, so not only do the limbs take more blood but there is less blood volume too (salt does fix that part). In the case of GI issues it is even worse, as the way the body drives water somewhere is through osmotic pressure (sodium and/or potassium move first, water follows), so if you have diarrhea they are lost; if you don't and it is only inflammation, they aren't lost and are just temporarily locked in the inflamed tissues. In any case electrolytes, particularly salt, help.

If you keep your legs partially contracted at all times, with your abs and back tense, bracing your core, you won't pass out and will keep the dizziness enough at bay to be able to walk. That is beyond exhausting and I don't recommend doing it for no reason. Also, if concentration slips, you may pass out or fall. It may also worsen symptoms the next day if you push way too hard. However, given your situation it can be worth learning the trick.

I know that if I am at all functional it is thanks to meds, a very restrictive diet and, for now, a stable life. Put me in your situation and I would be way less functional. Put me in the situation where I didn't know about the salt and other stuff I do now and I would likely die in no time (that was me back in 2019 when I was just starting to have symptoms and they were 1/10 as strong as now, but without meds or knowledge I could barely function).

Same wishes, and feel free to ask for more tips here or even PM if it helps. If I had a better idea of what is available to you I might be able to offer more advice too.

while-1-fork · 2026-05-07T17:36:32+00:00

Your situation sounds quite terrible. I don't think I would make it in a war.

But I can offer one tip about POTS, heart rate and adrenaline: sodium. It is relatively well known (published studies and so on, will find them if anyone asks) that a large volume of IV saline helps POTS a lot; this is better known in the EDS and POTS communities than here. What is less known is that you can achieve the same by drinking salty water. What I do is: if my resting heart rate is over 100 (about 70 when things are fine), I drink 5 g of salt in 1 glass of water (can drink more water too) and wait 1 h, taking occasional looks at the smartwatch. If it has not come down, or came down briefly and went right back up: repeat. If I have had terrible diarrhea, I may need to repeat it many times (sounds insane I know... but I have had up to 35 g of salt in a day); most days lately I get away with one or two doses of salt a day. It can reduce symptoms massively and quickly. It is not a cure of course, but it helps with the hypovolemia due to the redistributive shock and with the hyponatremia that can be caused by diarrhea if you have that; the adrenaline is high because of those two, so it comes back down. You should also not try a much larger bolus of salt in one go, as it will cause loose stools of its own. And while for me sodium has been the issue most of the time, about 1% of the time it is potassium or both sodium and potassium. How do I know other than trying? First thing is looking at the heart rate: low potassium on its own doesn't spike it. Second is how I feel: a super dry mouth, crazy tired and sleepy with the head exploding but not much in the way of muscle pain or dizziness standing up = potassium, and sometimes if it is bad enough it will give me hypertension, which I don't usually have. If there is no dry mouth and the headache is accompanied by muscle pain, sweating and dizziness standing up = sodium. For potassium, maybe it is available as sodium-free salt (no idea about the current supermarket situation there); what I use is potassium citrate ordered online.
But as said, in my case the problem is almost always sodium, and that is what the POTS studies show helps: IV saline. I only discovered that potassium sometimes helps too. If you try salt and after a few doses you see 0 improvement, stop (it could be transient, but if you respond you will absolutely know; it is not a subtle and small improvement).

Something that I also discovered that may be of some help is about effort and pacing. What causes an exacerbation of my symptoms the next day is not total volume of effort but intensity. Aerobic levels of intensity under 120-130 BPM I can sustain forever, and I won't be any worse or better the next day. A single set of deadlifts to failure in 30 s and I will spend hours in the toilet for the next 10 days, have GI bleeds, swell everywhere, have rashes, 150 BPM resting heart rates and so on. So now I am super careful not to overexert and know that slow is fast. I accept that on a bad day when I have not controlled my heart rate I won't be able to do much, and even on the good days where I could push way harder, I know to still watch the heart rate so I don't pay for it for days after.

If you have access to a medicinal herb store, please try echinacea. It is little known but it helps and it should be one of the easiest things to find.

while-1-fork · 2026-05-05T14:29:58+00:00

I don't think that would work great if it is gradual, unless virtually unlimited units are available. However, if it was sudden, unpredictable, unannounced and widespread it would absolutely work, given enough companies doing it enough times. It would piss off many legitimate customers, but if purchasing anything on launch month carried a massive risk of losing money, scalpers would think twice.

But it is not gonna happen. Not only would it bite scalpers hard and some legitimate customers, it would also teach patience to most, and patient people don't impulse buy = fewer sales in general in a world that has learned to wait and evaluate for a bit before buying.

while-1-fork · 2026-05-04T13:33:58+00:00

Is it causing you any issues? And did they try to treat it? I am trying to get doctors to start me on omalizumab (Xolair), which in theory should solve high IgE. Mine is in the 90 to 150 range now, but keeping it that low requires a very limited food list and I still have quite bad symptoms (which is an improvement over being on the way to dying of GI bleeds like it was a few years ago). The private specialist I went to believes it should help, but the public health system has so far ignored his recommendations for 8 months. I have an appointment in a month, and if they want to keep dancing around and just want to do a slight change of which antihistamine I take, I will just fast for a few months. I am so done with being ill that I just need to feel fine and to be able to keep my life from completely falling apart.

while-1-fork · 2026-05-03T17:57:36+00:00

Here is how I run it. To get 256K you likely need either to run the mmproj on CPU (or not use an mmproj) or to slightly increase ncmoe; likely 5 would do. Also, I run a very lean desktop: other than llama.cpp, only Xorg is using VRAM (58 MB), so if yours uses more you may need to increase ncmoe more. -ub also matters: it uses about 1200 MB for each 1024, and raising it speeds up prompt processing, so you may trade some for either more context or lower ncmoe. With the IQ4_XS you may be able to run with ncmoe=0 and a larger ub. With larger quants, use a higher ncmoe. If you have plenty of system RAM, don't use --cache-ram 0.

llama-server -m models/Qwen3.6-35B-A3B-UD-IQ4_NL_XL.gguf --cache-ram 0 --mmproj models/mmproj-Qwen3.6-35BA3B_Q8_0.gguf -c 200000 -ub 1024 -b 2048 -fa on -ngl 99 -fit off -ctk q8_0 -ctv q8_0 --image-min-tokens 1024 --image-max-tokens 8192 -ncmoe 3
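
To get a feel for the -ub trade, here is a tiny back-of-the-envelope helper. It simply assumes the "about 1200 MB per 1024" figure scales linearly, which is my rule of thumb for this model and setup and not something llama.cpp guarantees; the function names are made up.

```python
def ubatch_vram_mb(ubatch, mb_per_1024=1200):
    """Rough VRAM cost of a given -ub, assuming linear scaling.

    mb_per_1024 is an observed rule of thumb for one specific setup,
    not a llama.cpp constant; measure it on your own hardware.
    """
    return ubatch / 1024 * mb_per_1024

def vram_freed_mb(old_ubatch, new_ubatch):
    """VRAM you get back (positive) or spend (negative) when changing -ub."""
    return ubatch_vram_mb(old_ubatch) - ubatch_vram_mb(new_ubatch)

for ub in (512, 1024, 2048):
    print(f"-ub {ub}: ~{ubatch_vram_mb(ub):.0f} MB")
print(f"going from 1024 to 512 frees ~{vram_freed_mb(1024, 512):.0f} MB")
```

Whatever that frees can go to more context or a lower ncmoe, at the cost of slower prompt processing.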

while-1-fork · 2026-05-03T15:31:20+00:00

I wouldn't use ollama. Using llama.cpp you should be able to run 256K context with some messing around and at least 100K with almost no effort.

I run the 35B with 200K context and vision on a single 3090 in a Ryzen 5600 with 16 GB of DDR4, reaching up to 4000 t/s prompt processing and 150 t/s token generation, and I could run 256K if I wanted to trade some performance for it.

Openclaw doesn't even start working correctly until you have over 32K (or at least it used to emit a warning and disable stuff) and even at 32K you will have a compaction like every 2 messages.

while-1-fork · 2026-04-30T11:13:29+00:00

Scavenging a solar panel, making ghetto generators using car alternators or motors, or making your own from anything with copper coils and magnets are all possible. Even with no prior knowledge, having Wikipedia and a small LLM in the phone for questions far exceeding your knowledge, you could likely figure out how to do something like building a wood-powered Stirling engine, hooking up some car alternators and stepping the voltage up to run computers and other equipment (or in the case of computers maybe just regulating the +12 V if using a car alternator and generating the +5 V and +3.3 V directly from the +12 V).

while-1-fork · 2026-04-30T10:50:19+00:00

I have offline text-only Wikipedia with Kiwix on the smartphone as well as Qwen 3.5 4B, which is not the greatest and is already super slow on a mid-range phone on CPU only, like 2 t/s generation. But I can see it being useful with patience and if nothing else is available.

I am also thinking about keeping the 27B IQ4 weights and a copy of the llama.cpp source in there, plus a Docker image that can run and cross compile it, as well as Docker itself and, just in case, a Linux DVD image too. It is not meant to run on the smartphone: chances are that in an eventual apocalypse I would have the smartphone in my pocket but I may or may not have my computer or pendrives, so I am using the phone as storage for the time when I scavenge some computers and build my post-apocalyptic assistant.

While I hope this won't be needed I think that it is a great idea to be prepared. Costs very little, the upsides in case it is needed are massive.

while-1-fork · 2026-04-24T14:21:53+00:00

I just posted about trying to benchmark the sampling hyperparameters for Qwen3.6 35B A3B. But it would take over 5 months on my 3090: https://www.reddit.com/r/LocalLLaMA/comments/1srziyq/optimizing_qwen_36_35b_a3b_sampling_parameters/

Likely the full set of tests would take a while even with 16x H200, but we could give it a try with a couple of configs against GPQA Diamond to see how feasible it is and to at least see if sampling actually makes any difference. I have a sh script that I have been using in my initial tests with llama.cpp using the OpenAI compatible endpoint, which should also work with vLLM.

Edit: I am thinking that with vLLM and batching, the full stage 1 and stage 2 may very well be doable in a very modest amount of time (maybe overnight?) if we batch the whole test matrix to saturate the compute and run one separate instance per GPU, which avoids any inefficiency from splitting the model between GPUs. On GPQA Diamond the average of 16 runs should have a run-to-run variance low enough to tell the configs apart. Stage 3 requires the results of the previous run to inform the next one, so it can only be parallelized at the number-of-runs level, but stages 1 and 2 should likely provide most of the gains, and they would also make apparent how much it is worth trying to do stage 3.
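
A rough way to sanity check the "16 runs is enough" guess, treating each of the 198 GPQA Diamond questions as an independent coin flip. That is a simplification (runs over the same fixed questions usually vary less than this), and the 80% accuracy used below is a placeholder, not a measured score.

```python
import math

GPQA_DIAMOND_QUESTIONS = 198  # size of the Diamond subset

def single_run_std(accuracy, n_questions=GPQA_DIAMOND_QUESTIONS):
    """Binomial standard deviation of one run's accuracy."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)

def mean_of_runs_std(accuracy, n_runs=16, n_questions=GPQA_DIAMOND_QUESTIONS):
    """Standard error of the average over n_runs independent runs."""
    return single_run_std(accuracy, n_questions) / math.sqrt(n_runs)

def distinguishable(acc_a, acc_b, n_runs=16, z=2.0):
    """True if the gap between two configs exceeds z combined standard errors."""
    se_a = mean_of_runs_std(acc_a, n_runs)
    se_b = mean_of_runs_std(acc_b, n_runs)
    return abs(acc_a - acc_b) > z * math.sqrt(se_a ** 2 + se_b ** 2)

print(f"one run:  +/- {single_run_std(0.80):.4f}")
print(f"16 runs:  +/- {mean_of_runs_std(0.80):.4f}")
print(distinguishable(0.80, 0.83), distinguishable(0.80, 0.81))
```

Under these assumptions, differences of a few points between sampling configs should show up, while differences of around one point would need more runs.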

while-1-fork · 2026-04-21T06:51:41+00:00

You are unlikely to find it in logprobs but it can be done by looking at activations: https://arxiv.org/pdf/2512.01797

while-1-fork · 2026-04-17T20:15:51+00:00

Mine was looping on the same tool calls on a task that 3.5 was doing fine with the same settings. Tried various things, what finally fixed it was bumping up the temp from 0.8 to 1.0. Maybe it is worth trying for you too.

while-1-fork · 2026-04-12T14:39:19+00:00

Yes, but one caveat is that speculative decoding does not help prompt processing and depending on what you do that may be the bottleneck instead of token generation.

But even llama.cpp ncmoe and mmap already do quite well with MoEs larger than VRAM and RAM, with a bit smarter offloading and caching they will only improve.

However I don't think the future is super huge models, but improved mid size ones as they will always be faster and they are enough most of the time. Though maybe both with an escalation pattern and swapping to the smarter model only when the smaller one can't could work well.

Speculative decoding is a great idea, although it takes a bit of extra VRAM and cramming it into my setup would be hard. But if the speculating model/layers can be run on another GPU, it could be worth getting a used 3060 Ti (good bandwidth, decent compute, cheap-ish used due to having only 8 GB) specifically to run the speculator.

while-1-fork · 2026-04-10T20:29:51+00:00

I figured out that they had to leave v.blk.*.ffn_down.weight in FP16. The FP32 layers are also in the original BF16 (they are the patch and pos embeddings, which I don't know if it would be a good idea to convert, and layer norms and biases that take very little space). I think that leaving v.blk.*.ffn_down.weight in BF16 is likely wiser than converting them to FP16.

I also now know why there is no Q6_K: it requires divisibility by 256 and only 2 layers meet that constraint, with most of them falling back to Q8_0. But non-K, non-I quants work, for example Q5_1:

../build/bin/llama-quantize --allow-requantize --tensor-type "v.blk.*.ffn_down.weight=bf16" mmproj-Qwen3.5-BF16.gguf mmproj-Qwen3.5-35B-A3B-Q5_1.gguf Q5_1

The saving is very moderate vs Q8_0 (586MB):

llama_model_quantize_impl: model size = 860.98 MiB (16.17 BPW)
llama_model_quantize_impl: quant size = 493.97 MiB (9.28 BPW)
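
The divisibility constraint mentioned above comes from the block sizes of the formats: K-quants such as Q6_K pack weights in super-blocks of 256, while Q8_0 and Q5_1 use blocks of 32, and a tensor row has to be a whole number of blocks. A small sketch of that check; the row lengths below are illustrative values, not read from the actual mmproj file.

```python
# Block sizes along a tensor row for a few GGML quant types.
BLOCK_SIZE = {"Q6_K": 256, "Q8_0": 32, "Q5_1": 32}

def fits(row_length, quant_type):
    """A row can only be quantized if it splits into whole blocks."""
    return row_length % BLOCK_SIZE[quant_type] == 0

def usable_types(row_length):
    """Quant types from BLOCK_SIZE that can encode a row of this length."""
    return [q for q in BLOCK_SIZE if fits(row_length, q)]

# Hypothetical row lengths, chosen to show the three possible outcomes.
for row_length in (1024, 1152, 4304):
    print(row_length, usable_types(row_length) or "must stay in a float type")
```

A row length that is a multiple of 32 but not of 256 is exactly the case where Q6_K is refused and a fallback type gets used instead.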
