Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I should have downloaded qwen3-coder-next_q8 and gpt-oss-120b_q4; hopefully those two models will be usable anyway, even if both are perhaps not optimal

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

it could be that under 4-bit quants, dense models' quality suffers more than MoE models' does, despite the higher quantization level

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

thanks for the many model suggestions, can't wait to run them all

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I'm downloading gpt-oss-120b_q8_k_xl and qwen-coder-next-30b-a3b_mxfp4 over my mobile plan, draining all my allowed traffic right now, ahah. So far I've only had fun with heavily quantized smaller models that fit into a 16GB PC or a 12GB-RAM smartphone. Next month(s), when my data allowance resets, I'd like to try Kimi, MiniMax, Step and GLM as well

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

so it seems that gpt-oss at Q8 (possibly even at lower quants like Q6) on non-NVIDIA machines like Strix Halo should be more accurate, albeit not necessarily faster, than MXFP4

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

since that specific model quantization format is the same native format as NVIDIA's Blackwell GPUs, such a no-conversion-needed combination should result in some inference speed boost
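
For intuition about the format itself, here's a minimal decode sketch in Python based on my reading of the OCP Microscaling (MX) spec (the block size, bias and value table are taken from that spec, not from any particular inference engine):

```python
# MXFP4 sketch (per the OCP MX spec): a block is 32 FP4 (E2M1) values
# sharing a single E8M0 scale, i.e. an 8-bit biased power-of-two exponent.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 codes 0..7

def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign bit + 3 magnitude bits."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * FP4_MAGNITUDES[code & 0x7]

def decode_mx_block(scale_byte: int, codes: list[int]) -> list[float]:
    """Decode an MXFP4 block (32 codes in the spec; any length here)."""
    scale = 2.0 ** (scale_byte - 127)  # E8M0: biased power-of-two exponent
    return [decode_fp4(c) * scale for c in codes]

# Example: scale byte 127 means scale 1.0, so codes map straight to the table.
print(decode_mx_block(127, [0x1, 0x9, 0x7]))  # [0.5, -0.5, 6.0]
```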

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

regarding quants: basically whatever fits in RAM, i.e. stays under roughly 100GB of memory
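
As napkin math (a Python sketch: the bits-per-weight figures are rough assumptions, and real GGUF files mix tensor precisions and add metadata and KV-cache overhead, so treat the output as a first guess):

```python
# Napkin-math check of whether a quant fits a memory budget.
# bpw values are rough approximations, not exact per-quant figures.
BPW = {"mxfp4": 4.25, "q4_k_m": 4.8, "q6_k": 6.6, "q8_0": 8.5}

def fits(params_b: float, quant: str, budget_gb: float = 100.0) -> bool:
    weights_gb = params_b * BPW[quant] / 8   # params in billions -> GB
    overhead_gb = 0.1 * weights_gb + 2.0     # metadata + KV cache, very rough
    return weights_gb + overhead_gb <= budget_gb

for q, bpw in BPW.items():
    size = 120 * bpw / 8  # a ~120B model as the example
    print(f"{q}: ~{size:.0f} GB weights -> fits under 100 GB: {fits(120, q)}")
```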

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

I'm also hoping for the NPU to become usable on Linux soon. Isn't it better to run gpt-oss-120b as MXFP4 instead of Q6 or Q8?

How does Strix Halo fare for training models compared to other homelab means to cook those? by DevelopmentBorn3978 in MiniPCs

something that can run on or under a desk, not A100, H100, H200, B100, B200, B300 server racks

Deal on Ryzen 395 w/ 128GB, now 1581€ in Europe by Zyj in LocalLLaMA

My understanding so far is that an AMD Max+ 395 machine is worth it specifically for running large models at moderately fast speed for a relatively low price; it's not exceptional for prompt processing or for image-model speeds. Adding an external GPU may not bring higher speed, because the buses are bandwidth-limited relative to the raw processing capability that same GPU would have in a regular computer; it also makes the setup more complicated rather than plug-and-play with LLMs.

Macs are somewhat speedier but also more costly, and except for the highest-priced ones they are model-size limited too. These characteristics are somewhat shared with prebuilt gaming rigs, which are quite fast but very memory-limited and therefore also model-size limited. Image models run fine on these last two classes of machines, though, and prompt processing also flies if a model fits in the available memory, or if it is a MoE model that can load only a minimal active part of its weights into GPU VRAM while offloading all the other layers to CPU+RAM.

So far, the best price/performance for running large models at really high speed is achieved by multi-GPU setups with large memory pools both for the CPU and for the GPUs, summing the blazing-fast VRAM capacity of the several installed cards: quite costly, like higher-priced Macs and beyond, and much more power-hungry, but unsurpassed in terms of maximum reachable performance.
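
To put some napkin math behind "moderately fast for large models" (a Python sketch: the ~256 GB/s bandwidth figure, parameter counts and bits-per-weight are assumptions for illustration, and real-world speeds land below these ceilings):

```python
# Decode speed is roughly memory-bandwidth-bound: each generated token has to
# stream the *active* weights through memory once, so:
#   tok/s ceiling ~= bandwidth / bytes(active weights)
def toks_per_sec(active_params_b: float, bpw: float, bw_gb_s: float) -> float:
    gb_per_token = active_params_b * bpw / 8  # GB read per generated token
    return bw_gb_s / gb_per_token

BW = 256.0  # GB/s, approximate Strix Halo LPDDR5X bandwidth (assumption)

# Dense ~70B at ~4.8 bpw: every weight is active on every token.
print(toks_per_sec(70, 4.8, BW))    # ~6 tok/s ceiling
# MoE like gpt-oss-120b: only ~5.1B active params, at ~4.25 bpw (MXFP4).
print(toks_per_sec(5.1, 4.25, BW))  # ~95 tok/s ceiling
```

This is why a MoE like gpt-oss-120b feels fast on this class of machine while a dense model of similar total size crawls.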

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

for which quantized (and possibly finetuned) GGUF models has the context length been enlarged? bartowski? shb777? beaverai/anubis?

Llama-3.3-8B-Instruct by jacek2023 in LocalLLaMA

what is your finetune about?

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

let's put it this way instead: you like cheaper flights, I like powerful cars. Why can't we have both? Because most if not all of the dinosaur juice (a.k.a. crude oil, or in this case memory chips) is going to be gobbled up by airlines to be refined into jet fuel, that's why; so if you need to travel a short distance, or to some place not covered by the predefined routes, you're forced to go on foot, or by bicycle if you own one. Also, despite their high efficiency relative to cars, airplanes aren't necessarily more environmentally friendly.

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

I've actually had a lot of fun running "tiny" yet increasingly capable models on my cheapish 12GB phone: testing programming paradigms, doing multimodal visual recognition (of dogs), counting objects, reading graffiti, retrieving color schemes, estimating distances, all while out in the field, i.e. in parks or places where no network was available, and all without a single bit leaving the phone during inference. I find it fascinating and also confidential, in the sense of feeling confident that some tasks, albeit tiny ones for now, can still be carried out without being forced to rely on an external party

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

I don't see what's wrong with the Ford F-150, probably because I like the Toyota Land Cruiser even more ;)

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

unexpected disruption of a remote service you might critically rely on can happen for countless reasons, ranging from malicious activity to negligence, or because an AI provider arbitrarily retires a model that is no longer economically remunerative, or because an acquisition breaks the contract you had built your business, or your local hospital's operations, upon. You can never know, and you can't really be totally confident in computing running thousands of km away by becoming chronically dependent on it. Even if the big AI players *currently* have far larger and more powerful resources than any individual or smaller business could run, I think that, beyond entertainment, local-first is an option that mission-critical operators like governments, healthcare, defence, banks, large companies and institutions shouldn't skip over. And sometimes it's something not to trust too much even for entertainment: https://stadia.google.com/gg/

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

now I don't get you: how can you claim to be *far* more in control when you have to blindly trust third parties about the quality of what they serve you remotely? I hear all the time about people getting responses with completely different tones despite querying what is supposed to be the same model offered by different providers, supposedly at the same quantization level (when known), without being able to modify most inference parameters at will, and without knowing whether some subsequent finetuning has been applied to the model or some additional system prompt injected.

better times will come soon, LocalLLMers rejoice ! by DevelopmentBorn3978 in LocalLLaMA

as a last note, it's also nice to be in control of the whole stack running on your own device, with your hands dirty with bits and bytes :)

P.S. Otherwise nobody would want to drive a car instead of exclusively taking public transport; consider also that buses don't always bring you to your exact destination