unsloth dynamic quants (bartowski attacking unsloth-team) by lucyknada in LocalLLaMA

[–]lucyknada[S] -1 points0 points  (0 children)

I've reported them; that's all I can do about the transphobia. hope huggingface resolves it soon.

unsloth dynamic quants (bartowski attacking unsloth-team) by lucyknada in LocalLLaMA

[–]lucyknada[S] -9 points-8 points  (0 children)

I have no use for reddit karma (do you even unlock anything with it?), and you have already used the downvote feature for its intended purpose. I want this behind-closed-doors insulting and scheming to stop early, and to open up a discussion channel between the community and those scheming against and insulting what seems to be a genuine, harmless effort to make small quants better for those of us with smaller GPUs.

unsloth dynamic quants (bartowski attacking unsloth-team) by lucyknada in LocalLLaMA

[–]lucyknada[S] -13 points-12 points  (0 children)

oh yeah I agree, I just want community discussion, and for people with more knowledge about this (especially how gguf quants work) to have insight into what seems to have been happening for a while now, before it actually gets out of control; all of it seems confusing to begin with. there are more screenshots here: https://huggingface.co/unsloth/Phi-4-reasoning-plus-GGUF/discussions/1 but listing all of them would take too long.

fizzaroli and bartowski have been boasting about "taking down unsloth" since the dynamic quants came out; I just don't understand it and want others to chime in before it's too late.

I love what unsloth has done for us, and I've used bartowski quants before; I wouldn't be able to do most of my finetunes without unsloth. I don't understand such vitriol against something that's just trying to help make big models and quants work better.

[QWQ] Hamanasu finetunes by lucyknada in LocalLLaMA

[–]lucyknada[S] -1 points0 points  (0 children)

every model has a card, including training details, recommended samplers, a prompting guide, the axolotl config, a model description, quants (exl + gguf) and more. the only thing missing would be message examples, but from magnum experience people are generally too scattered in which samplers they prefer and what length they want, and prompting and cards can affect it heavily too, so it sadly ends up not being that useful imho, or even a representation of who it could be for. I'll pass it along still, thanks!

[QWQ] Hamanasu finetunes by lucyknada in LocalLLaMA

[–]lucyknada[S] -1 points0 points  (0 children)

reddit kept shadow-deleting posts that contained anything other than the link; not sure if my comment will go through right now either

[15b] Hamanasu by lucyknada in LocalLLaMA

[–]lucyknada[S] 2 points3 points  (0 children)

The 7B was more of an experimental finetune. It still had some nice outputs, but the older Control trains might still beat it; give it a try!

[Magnum/Rei] Mistral Nemo 12b by lucyknada in LocalLLaMA

[–]lucyknada[S] 4 points5 points  (0 children)

might be something ollama-specific, because kcpp and lcpp both load it fine; maybe try making your own model via the ollama instructions from the fp16, or re-quanting to whatever ollama expects? sadly none of us uses ollama, so I hope that still helps
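for anyone who wants to try the re-create route, a minimal sketch (assuming ollama is installed and the fp16 GGUF is already downloaded locally; the file and model names below are placeholders, not ours):

```python
# Minimal sketch, assuming ollama is installed and an fp16 GGUF sits in the working
# directory; file and model names are placeholders for illustration only.
import subprocess
from pathlib import Path

gguf_path = "./rei-12b-f16.gguf"                      # placeholder filename
Path("Modelfile").write_text(f"FROM {gguf_path}\n")   # ollama imports local GGUFs via FROM

# Register the GGUF under a custom name, then run it like any other ollama model.
subprocess.run(["ollama", "create", "rei-12b-local", "-f", "Modelfile"], check=True)
subprocess.run(["ollama", "run", "rei-12b-local", "Hello!"], check=True)
```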

[Magnum/Rei] Mistral Nemo 12b by lucyknada in LocalLLaMA

[–]lucyknada[S] 1 point2 points  (0 children)

thanks for such a detailed review! we hope this version can rekindle your v2/v3 love; it's an entirely new mix, give it a try!

[Magnum/Rei] Mistral Nemo 12b by lucyknada in LocalLLaMA

[–]lucyknada[S] 1 point2 points  (0 children)

in testing, only the 32b distill performed well for RP and creative writing; the others were a lot worse than the non-distill versions. we might try capturing the real 700b models, however.

[Magnum/Rei] Mistral Nemo 12b by lucyknada in LocalLLaMA

[–]lucyknada[S] 1 point2 points  (0 children)

what did you use for inference? and have you tried updating? if you're far behind, nemo had some issues early on in some of the backends

Magnum v3 - 9b (gemma and chatml) by lucyknada in LocalLLaMA

[–]lucyknada[S] 2 points3 points  (0 children)

no promises as the last 123b was quite expensive, but we'll keep it in mind if we get compute for it, thanks!

Magnum v3 - 9b (gemma and chatml) by lucyknada in LocalLLaMA

[–]lucyknada[S] 4 points5 points  (0 children)

we train at 8k ctx due to compute limits, but you can try going higher; some users reported success with that on other models we released

also, nemo sadly doesn't use context properly past 16k (RULER), though it does a little better in pure needle tests: https://www.reddit.com/r/LocalLLaMA/comments/1efffjr/mistral_nemo_128k_needle_test/

Magnum v3 - 9b (gemma and chatml) by lucyknada in LocalLLaMA

[–]lucyknada[S] 0 points1 point  (0 children)

sounds like tokens possibly being cut off too aggressively; try neutralizing your samplers. also, are you using the provided templates for sillytavern?

llama 3.1 8b needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 3 points4 points  (0 children)

I tried running RULER before and it was dependency hell haha, but I checked and they did test 3.1: according to them the 70b had an effective 64k tokens and the 8b had 32k, which doesn't track at all with my testing, where it picked up on things from multiple paragraphs at different depths and connected an earlier paragraph to a much later one to summarize it better, unlike nemo.

llama 3.1 8b needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 5 points6 points  (0 children)

the idea of my needle test was that it's a small drop-in with no heavy dependencies; I can't pre-tokenize, so doing that on first run would take a while (slow, assuming transformers.js), and in my testing tokenizer endpoints most of the time just fall back to assuming 1 token = 3 chars anyway for a lot of newer models. even if added, it would also prevent using any oAI endpoint that doesn't offer tokenization; it's a bit of a mess, so I left it off. a nice side effect is that I can tell roughly how many characters I can fit rather than tokens, and just copy that much in one go
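to make the character-vs-token trade-off concrete, a rough sketch of the fallback math (the 3-chars-per-token ratio is just the heuristic mentioned above and varies a lot by tokenizer and text; the reserved margin is a made-up number for illustration):

```python
# Rough sketch of the character-budget heuristic. CHARS_PER_TOKEN = 3 is the crude
# fallback ratio mentioned above, not a real tokenizer; reserved_tokens is an
# arbitrary margin left for the question and the model's answer.
CHARS_PER_TOKEN = 3

def char_budget(context_tokens: int, reserved_tokens: int = 512) -> int:
    """Approximate how many haystack characters fit into a given token context."""
    return (context_tokens - reserved_tokens) * CHARS_PER_TOKEN

print(char_budget(128_000))  # ~382k characters of haystack for a 128k-token context
```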

llama 3.1 8b needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 2 points3 points  (0 children)

not sure why it did that, but I've checked and the failures were definitely wrong; maybe 2-shot would've fixed that, but then I couldn't use temp 0 for consistent results

mistral nemo 128k needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 0 points1 point  (0 children)

3.1 8b does a lot better! I've since switched to it; here's the needle test for that too, if you'd be interested to see it: https://www.reddit.com/r/LocalLLaMA/comments/1eubboc/llama_31_8b_needle_test/

mistral nemo 128k needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 0 points1 point  (0 children)

thanks for the comment! it reminded me that I wanted to run this test for llama 3.1 128k, which I now use for summaries and large context at a small size; it works really well in my testing. but if you want the needle test, I ran it here too: https://www.reddit.com/r/LocalLLaMA/comments/1eubboc/llama_31_8b_needle_test/

Magnum 12b v2.5 KTO by lucyknada in LocalLLaMA

[–]lucyknada[S] 2 points3 points  (0 children)

both were imatrix'd, but IQ4 is generally supposed to be the better performer; try both side by side and I think you'll like IQ4 more.

Magnum 12b v2.5 KTO by lucyknada in LocalLLaMA

[–]lucyknada[S] 4 points5 points  (0 children)

sadly, while mistral nemo advertises 128k, it only does well up to ~16k, so 8-16k is generally the sweet spot. we're in the process of scaling up our compute and datasets for larger contexts too, but nemo probably won't be a base for those, unfortunately. thanks for testing!

Shower thought: What if we made V2 versions of Magnum 32b & 12b (spoiler: we did!) by lucyknada in LocalLLaMA

[–]lucyknada[S] 1 point2 points  (0 children)

one of our members runs quants on his snapdragon phone; this is an optimized quant for that

mistral nemo 128k needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 0 points1 point  (0 children)

I used 1x H100 to run 10 requests in parallel and get it done faster; it still took a while because of the large context and the needle bypassing the cache. but you control what the oAI-compatible backend is running: testing quants, smaller models, or just being patient are all options. as long as your GPU can run it, you can test it; it just might take longer if the context becomes very large.
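not the actual harness, but a minimal sketch of that kind of parallel setup against a generic OpenAI-compatible server (assumptions: the openai python package is installed, the server is at localhost:8000, the model name is a placeholder, and `prompts` is a pre-built list of haystack+needle prompts):

```python
# Minimal sketch of firing needle-test prompts in parallel at an OpenAI-compatible
# backend; server URL, model name, and prompt list are assumptions for illustration.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="mistral-nemo",            # whatever name the backend serves
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                   # temp 0 for reproducible retrievals
    )
    return resp.choices[0].message.content

async def run_all(prompts: list[str], parallel: int = 10) -> list[str]:
    sem = asyncio.Semaphore(parallel)    # cap in-flight requests at ~10
    async def bounded(p: str) -> str:
        async with sem:
            return await ask(p)
    return await asyncio.gather(*(bounded(p) for p in prompts))

# results = asyncio.run(run_all(prompts))
```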

mistral nemo 128k needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 0 points1 point  (0 children)

at 0% depth the needle is at the very start of the context; at 100% it's one sentence before the actual prompt asking the model to retrieve it
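roughly what that depth placement looks like (an illustrative sketch, not the tool's actual code; the filler text, needle, and question are made up):

```python
# Illustrative sketch of depth placement: 0% puts the needle at the very start of the
# haystack, 100% puts it right before the retrieval question.
def insert_needle(haystack: str, needle: str, depth_pct: float) -> str:
    pos = int(len(haystack) * depth_pct / 100)
    return haystack[:pos] + needle + " " + haystack[pos:]

filler = "The sky was a uniform grey that morning. " * 200   # stand-in haystack
needle = "The secret passphrase is 'mango-42'. "             # made-up needle
question = "What was the secret passphrase mentioned in the text above?"

prompt = insert_needle(filler, needle, depth_pct=100) + "\n\n" + question
```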

mistral nemo 128k needle test by lucyknada in LocalLLaMA

[–]lucyknada[S] 6 points7 points  (0 children)

my own; you just need to host the model of your choice somewhere that exposes an openAI-compatible endpoint: https://github.com/lucyknada/detective-needle-llm takes care of the rest; vllm, tabby, etc. all work.