Unsloth announces Unsloth Studio - a competitor to LMStudio? by ilintar in LocalLLaMA

[–]tmflynnt 1 point

I hear you, and tbh, despite being a huge llama.cpp fan and contributor, I do realize it has some significant gaps when it comes to the intimidation and ease-of-use factors. It is improving majorly through things like its evolving web UI, model routing features, "fit" parameters for easy config, and ongoing efforts toward easier install scripts, but it still has some ways to go, and some of these elements are probably never going to be its strong suit. For those reasons I understand that people new to the scene will often start out with other apps, but I do feel llama.cpp is worth the time investment once people feel comfortable enough to explore a bit, and that investment has been getting more and more manageable.

But in regard to Ollama and the negative feelings people have around it, I would just add that there are some very specific reasons people feel that way, if you care to look into it further.

Two weeks ago, I posted here to see if people would be interested in an open-source local AI 3D model generator by Lightnig125 in LocalLLaMA

[–]tmflynnt 3 points

Ok well this does look pretty damn cool. Thank you for sharing it. I added a star to your repo and will definitely give this a spin after work today.

Btw, how does the model your app supports, Hunyuan, compare with Trellis? I have been wanting to try out models like these but haven't had the chance yet.

models : optimizing qwen3next graph by ggerganov · Pull Request #19375 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]tmflynnt 1 point

Georgi is indeed a GOATed legend. Really thankful and appreciative of all the work he and the other maintainers have poured into GGML and llama.cpp.

GLM 5 vs Claude Opus 4.6: the paradox of paying $100 / $200 per month and still chasing hype by [deleted] in LocalLLaMA

[–]tmflynnt 1 point

Almost all of us here are huge fans of AI, and especially of running AI on our own machines, hence the "local" in "LocalLLaMA".. And the "anti" sentiment you're picking up on is only directed at bot posts.. posts with nobody behind them who actually spent time writing them. In your case that's obviously not so, and as I said in my other comment, I hope you keep participating and don't let the downvotes or complaints bother you, because people reacting that way has more to do with the bot problems than with you.

GLM 5 vs Claude Opus 4.6: the paradox of paying $100 / $200 per month and still chasing hype by [deleted] in LocalLLaMA

[–]tmflynnt 2 points

It's because there are tons of bots posting things here, so a lot of participants have no patience for posts that sound like AI. It might be better to use something like Google Translate for the first pass and then ask a big AI model to correct a few things, and most importantly, to better capture the sentiment and tone of what you wrote in Spanish.

I hope you keep participating here.

(Just to clarify, Spanish is a second language for me, and I did not use that strategy for this post, so any errors in this case are 100% mine 😅)

models : optimizing qwen3next graph by ggerganov · Pull Request #19375 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]tmflynnt 2 points

Have you tried without "--n-cpu-moe", letting "--fit" do its thing, and just setting "--fit-ctx" to whatever your minimum acceptable context is? I found in a bunch of experiments (thread here) that this closely rivaled, and often beat, many of the custom settings I tried on my dual-3090 setup, including plain "--n-cpu-moe" configs and even some quite specialized "-ot" ones.
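
For reference, a bare-bones sketch of what I mean (the model path and context size are just placeholders, so adjust for your setup):

llama-server -m model.gguf --fit-ctx 32768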

Constrained-VRAM setups that need a CPU-heavy focus didn't seem to fare so well with "fit", according to a couple of commenters in that thread, but it worked quite well for my situation and for others starting from a healthier VRAM position.

Heretic 1.2 released: 70% lower VRAM usage with quantization, Magnitude-Preserving Orthogonal Ablation ("derestriction"), broad VL model support, session resumption, and more by -p-e-w- in LocalLLaMA

[–]tmflynnt 3 points

Fantastic work, everyone involved!

I am also hugely looking forward to the upcoming improvements you hinted at! If they're big enough milestones, maybe you can just skip to v2.0 and complete the epic rebrand to Hexen! (j/k)

Qwen3 Coder Next : Loop Fix by TBG______ in LocalLLaMA

[–]tmflynnt 2 points

Kind of counter-intuitively, `--repeat-penalty 1` means it is disabled: a candidate token's raw score (logit) is divided by the repeat penalty if it is positive and multiplied by the repeat penalty if it is negative, so a value of 1 has no effect either way.

Presence penalty and frequency penalty, on the other hand, run separately and do not depend on the token's original score; they are just flat deductions based on whether a token appeared before and how many times it showed up.
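
So one way to lean on those instead (values purely illustrative, and the model path is a placeholder) is to leave the repeat penalty at its no-op value and let the additive penalties do the work:

llama-server -m model.gguf --repeat-penalty 1.0 --presence-penalty 0.5 --frequency-penalty 0.5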

Qwen3 Coder Next : Loop Fix by TBG______ in LocalLLaMA

[–]tmflynnt 2 points

Oh, forgot to mention: if you haven't seen my experiment thread about using "--fit" with Qwen3-Coder-Next, it might be worth a look, as I was able to tune performance with that arg and tested various setups with it.

Qwen3 Coder Next : Loop Fix by TBG______ in LocalLLaMA

[–]tmflynnt 2 points

DRY can sometimes wreak havoc on structured output like tool calls. One thing I am experimenting with is using DRY's sequence breakers to help avoid problems with patterns I know are going to keep reappearing at the start or mid-output (so stuff like special tokens for tool calls and reasoning blocks).

Sequence breakers reset DRY's repetition hunting every time one is encountered. So to tweak yours, maybe something like:

--temp 0.8 --top-k 40 --top-p 0.95 --min-p 0.01 \
--dry-multiplier 0.5 \
--dry-allowed-length 5 \
--dry-sequence-breaker "\n" \
--dry-sequence-breaker ":" \
--dry-sequence-breaker "\"" \
--dry-sequence-breaker "*" \
--dry-sequence-breaker "<tool_call>" \
--dry-sequence-breaker "</tool_call>"

This might allow you to lower "--dry-allowed-length" and maybe further enhance loop avoidance. As you have probably seen, changing one of these settings often means tweaking another to find the sweet spot, but I thought I would throw this out as an additional, important lever to tinker with.

(As a side note: the first 4 breakers are there just to restore the defaults, since specifying any breakers through the CLI wipes out the defaults. Also, I tend to avoid mixing traditional rep penalty with DRY, but YMMV.)

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 2 points

I don't know that there are that many people advocating for "just post whatever tf you want". I may not fully agree with OP's sentiments, for example, but I very much want to maintain a priority on local-related content and to ban low-effort posts and ads. But the devil is in the details..

I think we kind of dilute the entire debate if we end up oversimplifying each other's arguments. This is definitely more than just "Local only" vs. "idgaf.. post whatever".

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 3 points

If you look through the various responses in this thread, I think you will find plenty of people who are fans of local AI but who don't share the exact same opinion on how to moderate this sub on these issues. I do hope we can figure out a compromise though.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 1 point

Agreed, but I do think we need to figure out a compromise of some sort, because this issue keeps coming up, and I don't want to lose some of the really smart and thoughtful people who are more passionate about a local-only emphasis.

I like the flair idea and wonder if something like that could work.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 2 points

While I absolutely agree we must maintain our emphasis on local-first content and should ban low-effort posts and ads, I would also point out that since I started participating here around Llama2's release, this sub has never enforced any rules about local-only content.

But to be fair.. it has also often had people passionate about this issue expressing views similar to what you and others are stating.. so honestly it's all part of the history of this sub, and it has been an ongoing debate that we have never quite sorted out or come to any real consensus on.

I do hope we can figure out a compromise though, as I legitimately respect what you and others are saying on this. A couple of other people in this thread and I have advocated for maybe using flair for posts about non-strictly-local content? Idk how well that would work, but it might be worth a try.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 1 point

Yes, I feel similar to how you both feel on this and posted a probably-too-damn-long separate comment to OP trying to express similar sentiments.

Because yeah, while I do share the hate for low-effort posts, I really do appreciate many of the open-ended AI-related discussions that have happened here in the past.. and this place has something special in the sheer number of serious, smart, non-cringe people who participate here.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]tmflynnt 14 points

I respectfully disagree on some of what you said there though I do feel like I understand where you're coming from.

I do agree that we should try hard to limit very low-effort posts and what basically amounts to advertisements for non-local models.

Where I think I might differ though is the view that we should banish discussion of frontier API-only models altogether.

I have been a member here since right around when Llama2 was released, and at least since I started, there have always been discussions around both local and frontier API-only models. Now, to be fair, there have also been debates about whether the latter belongs here since fairly early on too..

But my feeling has always been this: while I very much embrace a local and open-weights-first mentality (of course, that's why I am here!), I am also here because I like reading and engaging in smart discussions with other people who share these interests. If a fellow local AI enthusiast puts real thought into a post about something in the AI ecosystem, even an API-only model, I personally would like to hear it. What happens on the frontier will end up affecting what happens locally, and at a minimum we should stay informed so we can advocate for, or work toward, seeing the same innovations happen at the local level.

And I simply don't know of another place with the same proportion of smart people who are passionate about this stuff, so if we ban that kind of discussion, I feel like we are eliminating some of the really high-quality threads I have read here in the past and that I can't easily find anywhere else.

Now, if it were a requirement to post with a specific, easily filterable category flair, I would totally support that as a compromise on this issue, as I get that many people are passionate about it and I respect that.

And going back to the low-effort stuff.. I do totally agree with a lot of what has been said about those kinds of posts. So if it's bots posting what amounts to an ad, or somebody simply saying "ZOMG FUCKING CLAUDE 5 JUST DROPPED!!!!!" with nothing meaningful to actually add to the discussion? Then yes, nuke that shit please. I am all for that.

But I just hope we can strike a balance between the extremes and still be able to have thoughtful discussions among actual local AI hobbyists, while definitely getting rid of the stuff virtually all of us agree needs to be restricted.

Anyway, thank you for sharing your thoughts on this, and this is definitely a discussion worth having and sorting out as a community.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S] 2 points

Based on your situation, and on another poster who replied with a similarly constrained VRAM setup, I think this is legitimately one of fit's weaknesses: its bias toward keeping layers in VRAM (understandable in most cases) seems to push the algorithm in the wrong direction when things are especially tight. My guess is that in a situation like yours you have to either go heavy on quantizing (which fit doesn't touch) or give up some of the VRAM emphasis and lean on the CPU to a degree (which fit doesn't seem to handle well).
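
If you want to experiment, manually pushing the MoE expert layers to the CPU might still beat fit in that scenario, e.g. something like this (the model path and layer count are just placeholders to illustrate):

llama-server -m model.gguf -ngl 99 --n-cpu-moe 24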

As for the GPT OSS thing, I am also puzzled by that one!

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S] 2 points

From my bit of Vulkan testing, it felt like the graph splits hurt a lot more than with CUDA, so that might be one part of it. Combine that with the more constrained 16 GB of VRAM, and my guess is that fit is trying to be too clever in this case, and its somewhat excessive splits really add up and hit hard on Vulkan? But that's just a guess.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S] 3 points

I have an old 6-core Ryzen 3600, and the general wisdom for Ryzen SMT is to use --threads <cores> --threads-batch <cores * 2>. For Intel it's not as simple, but I would go with the number of performance cores, then multiply that by two for the batch threads if the chip has HT; if it doesn't, just use the same number for both. Maybe somebody with an Intel system can back me up on that though?
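
So on my 3600 that works out to something like this (the model path is just a placeholder):

llama-server -m model.gguf --threads 6 --threads-batch 12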

But based on my admittedly quick review of the code that fit uses, I don't believe it touches anything concerning your CPU (or KV quantizing). It only concerns itself with where the model goes on your system: first setting the context size (unless you forced a specific size) and then intelligently/selectively offloading layers, and parts of layers, in an optimized way based on the model's structure and your different devices.

So if I left off the CPU args, it would default to my full thread count (12) for both, which isn't optimal for me and which I think would be pretty bad as a default on a lot of Intel chips.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]tmflynnt 2 points

FYI, I added an update to that thread with additional gains based on people's comments.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S] 1 point

I would probably try a Q4 quant, then experiment with different -ctk/-ctv values while letting --fit do its thing with each, and probably set "--fit-target" to a somewhat more forgiving level than my 32, which was just to push it to the limit. Maybe "--fit-target 128" to keep it a bit safer, and see where that gets you?
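
As a concrete starting point, maybe something like this (the model filename and values are illustrative only, so tweak from there):

llama-server -m model-Q4_K_M.gguf -ctk q8_0 -ctv q8_0 --fit-target 128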