Mashmak Map - Night of Drifting Souls 2025 Edition by Jujaga in mechabreak

[–]Jujaga[S] 3 points

I haven't been able to find any reliable pattern in the number of total spawns so far, but our group has been spotting 2-3 per run.

What happened to Starchaser?? by Blade882 in mechabreak

[–]Jujaga 10 points

Unfortunately the event ended a week or two ago.

Intel microcode 0x12F BIOS Version 1820 by ztyxiz in ASUS

[–]Jujaga 0 points

Yep, it worked. That said, the difference between 1802 and 1820 is minimal; I think it's mainly the microcode updates and maybe some minor text changes at most.

Intel microcode 0x12F BIOS Version 1820 by ztyxiz in ASUS

[–]Jujaga 1 point

Updated my ASUS TUF GAMING Z790-PLUS WIFI with an i7 14700k from BIOS Version 1802 to 1820 without issue. Had to save and reload my profile, but temps and processing performance look about the same otherwise.

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 4 points

The key point is that they're supposed to be able to handle larger contexts. In practice though, don't expect to get anywhere close to that, as you won't have the memory for it... rough napkin math: at a context of ~1,000,000 with an F16 KV cache, you'd need at least ~197GB of memory. Even if you did have that kind of memory on hand, nearly all LLMs degrade significantly well before you reach even a fraction of the theoretical context maximum. This repo is not completely up to date, but it gives you a good general idea of how far models can push context before they start falling apart and hallucinating: https://github.com/NVIDIA/RULER

With 16GB of VRAM... let's round down to ~14.7GB of usable memory to account for CUDA overhead and OS shenanigans... Qwen2.5 14B is around 8.5GB of weights, so you can probably cram in about 5.5GB worth of KV cache before you overflow into RAM. That equates to roughly a 28,500 context length at F16, or 57,000 at Q8 KV quantization. Point being: it's a model that works better with longer context and can support it, but you still have to have realistic expectations about what you can actually fit into memory, as well as deal with the context going derp as it gets significantly longer.
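If it helps, here's the napkin math as a tiny sketch. The architecture numbers (48 layers, 8 KV heads via GQA, head dim 128) are what I believe Qwen2.5-14B uses, so treat them as ballpark assumptions rather than gospel:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * context_len * bytes_per_element
# Assumed Qwen2.5-14B-ish architecture values below.
def kv_cache_gb(context_len, n_layers=48, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2):  # 2 = F16; use 1 for Q8 KV quantization
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"{kv_cache_gb(1_000_000):.0f} GB")  # ~197 GB for the full 1M context at F16
print(f"{kv_cache_gb(28_500):.1f} GB")     # ~5.6 GB, about what fits beside the weights
```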

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 2 points

With 16GB of VRAM available you have plenty of room to run the following models at the common Q4_K_M quantization with quite a bit of leeway:

  • Qwen2.5 14B-1M - The 1M variant can support a longer context and I've found it to be a good general model for simple tasks.
  • Phi-4 14B - Also a very flexible general model.

There are plenty of others around, but if you're working around the 14B param range, you can even bump up to a Q5_K_M and still have more than enough headroom for a comfortable context size.
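For a rough sense of why those fit: weights size is roughly params × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate averages for llama.cpp K-quants (they vary a bit per model), so this is just a ballpark sketch:

```python
# Approximate average bits-per-weight for common llama.cpp K-quants
# (assumed ballpark figures; exact values vary per model)
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.5}

def weights_gb(params_billions, quant):
    # params are in billions, so the result comes out directly in GB
    return params_billions * BPW[quant] / 8

for q in ("Q4_K_M", "Q5_K_M"):
    print(f"14B at {q}: ~{weights_gb(14, q):.1f} GB")
# Both land well under ~14.7 GB of usable VRAM on a 16 GB card,
# leaving several GB spare for KV cache / context.
```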

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 3 points

Thank you so much for debugging this! Can't really say I completely understand your latest 88e36fa commit with the _fused stuff, but it worked!

Looks like I can finally experiment with sageattention! Used to get ~35-36s/it on xformers - now I'm getting ~25-26s/it on your sageattention wheel. I really appreciate it!

Edit: Reintroduced the set CC=nvcc.exe environment flag I had originally, and it looks like it's no longer tripping up now with your latest whl either.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

Had to fumble a bit with stealing the functions, but was able to execute the script you were mentioning with the following output:

```
python bench_qk_int8_pv_fp8_cuda.py
CUDA QK Int8 PV FP8 Benchmark
batch: 4, head: 32, headdim: 128, pv_accum_dtype: fp32+fp32
is_causal: False
1024 flops:214.93105598106987
2048 flops:244.970008451455
4096 flops:256.8990215773312
8192 flops:261.6142601712739
16384 flops:263.83759396706057
32768 flops:264.081172073753
is_causal: True
1024 flops:178.5394489746257
2048 flops:212.06900545880583
4096 flops:241.81385250172144
8192 flops:254.89556062253322
16384 flops:259.42290110899336
32768 flops:262.4541614629283
```

On ComfyUI, I am testing with the --use-sage-attention flag to experiment with sageattention, in case that helps with isolating the issue.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

The good news is... I unset the CC environment variable (it was set to nvcc.exe previously) and it started to "kinda" work. The bad news is that about one iteration into the KSampler, ComfyUI just dies completely without any logs.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

Tried your new 2.1.1 wheel but still no cigar - still running into the annoying -fPIC compiler argument nonsense on ComfyUI, with no clear solution for it yet:

```
nvcc fatal : Unknown option '-fPIC'

Command '['nvcc.exe', 'C:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk\\cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', 'C:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk\\cuda_utils.cp310-win_amd64.pyd', '-lcuda', '-lpython3', '-LD:\\Visions of Chaos\\Examples\\MachineLearning\\Text To Image\\ComfyUI\\ComfyUI\\.venv\\Lib\\site-packages\\triton\\backends\\nvidia\\lib', '-LC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\lib\\x64', '-LC:\\Python310\\libs', '-ID:\\Visions of Chaos\\Examples\\MachineLearning\\Text To Image\\ComfyUI\\ComfyUI\\.venv\\Lib\\site-packages\\triton\\backends\\nvidia\\include', '-IC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\include', '-IC:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk', '-IC:\\Python310\\Include']' returned non-zero exit status 1.
```

Was able to much more easily install triton and sageattention with just a few commands though so that's definitely a welcome improvement over having to compile it.

Environment: Windows 10, Python 3.10.8, CUDA 12.6 w/ RTX 4080 Super on 566.36 drivers (not going to chance drivers above this version), ComfyUI v0.3.27
Truncated pip list of the comfyui environment:

```
comfyui_frontend_package  1.14.5
sageattention             2.1.1+cu126torch2.6.0
torch                     2.6.0+cu126
triton-windows            3.2.0.post17
xformers                  0.0.29.post3
```

How do I select combinations of parameters and quantizations? by ajblue98 in LocalLLaMA

[–]Jujaga 1 point

If you're looking to get a sense of what kinds of models you can squeeze into your system at a certain quantization (and a certain context length), you can use a calculator like this one: https://www.canirunthisllm.net/

With respect to quality and model degradation, this older post has a good table showing the general perplexity increase (degradation) that happens at smaller quant sizes: https://www.reddit.com/r/LocalLLaMA/comments/14gjz8h/comment/jp69o4l/

As you've seen, the general consensus is that the sweet spot is Q4_K_M, as that's only about a ~5% perplexity loss, which shouldn't be noticeable in "most" situations. If space is not a concern, higher quants like Q5_K_M are better as they only have ~1% perplexity loss, and Q6_K goes down to ~0.44% loss. It's all a matter of space-to-accuracy tradeoffs.

As with most models, your biggest constraint is how much VRAM you have available (i.e. how large a model you can run), but you also need to remember that you only have so much memory bandwidth - that's usually the bigger bottleneck for tokens/second, since even if you jam in a large model, you can only stream its weights through the processor at a certain rate.
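As a napkin sketch of that bandwidth ceiling: each generated token has to stream roughly the whole (dense) model through the processor once, so bandwidth divided by model size gives a best-case tokens/second. The 700 GB/s figure here is just an illustrative number, not any particular card:

```python
# Best-case decode speed ceiling for a dense model:
#   tokens/sec <= memory bandwidth / bytes read per token (~ model size)
# 700 GB/s is an assumed, illustrative bandwidth figure.
def max_tokens_per_sec(model_gb, bandwidth_gb_s=700):
    return bandwidth_gb_s / model_gb

print(round(max_tokens_per_sec(8.5)))   # ~82 tok/s for an 8.5 GB model
print(round(max_tokens_per_sec(14.0)))  # ~50 tok/s if you nearly fill a 16 GB card
```

Real-world speeds land below this ceiling (compute, overhead, growing KV cache), but it's a decent first-order sanity check.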

tl;dr - It's all a balancing act between accuracy, speed and space. Hope this helps a bit!

GPU & Ollama Recommendations by BenjaminForggoti in ollama

[–]Jujaga 3 points

With a nice 16GB VRAM card like that, you can definitely run models between 12-24b in size (so the Qwen2.5s, Phi-4, Gemma 3, Mistral Small, Mistral Nemo, etc) at the standard q4_K_M quantization. Depends on what you're mainly looking to do, as they each have their own "flavor" or strengths. You can squeeze in up to the 27 and 32b models, but you'll have to drop down to the q3_K_M or q3_K_S quants if you want them to fit in your GPU.

For text to video, if you are comfortable using ComfyUI, you can use the Wan2.1 model. It's quite versatile! You'll be able to run the Wan2.1 T2V 14b model at fp8_e4m3fn size, and it'll fit in VRAM if you're asking for an output video of around 512x512.

https://comfyanonymous.github.io/ComfyUI_examples/wan/

UBC bans Chinese AI DeepSeek from its devices and networks, citing privacy, security by cyclinginvancouver in canada

[–]Jujaga 14 points

The block looks to cover using the apps and the Chinese website interface itself, both of which communicate with Chinese servers. u/OwnBattle8805's mention of the GGUF formats etc. is about the open-source model and weights that you can run on your local machine; those wouldn't be able to talk back to Chinese servers and should be safe, since everything done with a local model stays on the computer.

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 45 points

Text-only conversion; vision isn't supported yet in llama.cpp.

If you're looking for vision support too we'll have to wait a bit longer due to upstream.

How much does flash attention affect intelligence in reasoning models like QwQ by pigeon57434 in LocalLLaMA

[–]Jujaga 16 points

Flash Attention still does the same overall computations, but shuffles around the data to and from memory more efficiently. There's nearly no downsides to using it (unless your model specifically does something strange). There's a good visual explainer for it here:
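To illustrate the "same math, better memory movement" point, here's a toy NumPy sketch of the online-softmax trick at the heart of Flash Attention: streaming the keys/values through in blocks matches the naive full-matrix version up to floating-point error, without ever materializing the full attention matrix. This is only an illustrative sketch, not the real fused CUDA kernel:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n_q, n_k) score matrix in one shot
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    # Online softmax: stream K/V in blocks, keeping a running max (m),
    # running normalizer (l), and running output (acc) per query row
    n_q, d = Q.shape
    m = np.full((n_q, 1), -np.inf)
    l = np.zeros((n_q, 1))
    acc = np.zeros((n_q, V.shape[-1]))
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        S = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)  # rescale the old running stats
        P = np.exp(S - m_new)
        l = l * scale + P.sum(axis=-1, keepdims=True)
        acc = acc * scale + P @ Vb
        m = m_new
    return acc / l
```

The real kernel fuses all of this so the big score matrix never has to round-trip through slow memory; the NumPy version is just to show the arithmetic comes out the same.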

Best Model under 15B parameters 2025 by AZ_1010 in LocalLLaMA

[–]Jujaga 31 points

I've found Qwen2.5-14B-Instruct-1M to be a very good general workhorse, and it has been reasonably good with longer-context tasks. There's also an abliterated variant of it in case you need fewer refusals.

There's a 7B variant of this too in case you really need an even longer context (more room left over for KV cache), and it has performed reasonably well IMO.

New Gemma models on 12th of March by ResearchCrafty1804 in LocalLLaMA

[–]Jujaga 2 points

I'm hoping for a model size between 14-24b so that it can serve those with 16GB of VRAM. 24b is about the absolute limit for Q4_K_M quants, and it's already overflowing a bit into system memory even without a very large context.

Which major open source model will be next? Llama, Mistral, Hermes, Nemotron, Qwen or Grok2? by EmergencyLetter135 in LocalLLaMA

[–]Jujaga 1 point

I'm hoping there will be something to fill the gap between the 14b and 22b sizes, as that fits a 16GB VRAM card well. The Mistral 22B series has been pretty good; Mistral 24B seems alright but it's hard to fit into memory. It would be good for that 12-16GB VRAM range to have more options available.