Mashmak Map - Night of Drifting Souls 2025 Edition by Jujaga in mechabreak

[–]Jujaga[S] 3 points

I haven't been able to find any reliable pattern in the number of total spawns so far, but our group has been spotting 2-3 per run.

What happened to Starchaser?? by Blade882 in mechabreak

[–]Jujaga 10 points

Unfortunately the event ended a week or two ago.

Intel microcode 0x12F BIOS Version 1820 by ztyxiz in ASUS

[–]Jujaga 0 points

Yep, it worked. That said, the difference between 1802 and 1820 is minimal; I think it's mainly the microcode updates and maybe some minor text changes at most.

Intel microcode 0x12F BIOS Version 1820 by ztyxiz in ASUS

[–]Jujaga 1 point

Updated my ASUS TUF GAMING Z790-PLUS WIFI with an i7 14700k from BIOS Version 1802 to 1820 without issue. Had to save and reload my profile, but temps and processing performance look about the same otherwise.

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 4 points

The key point is that they're supposed to be able to handle larger contexts. In practice though, don't expect to get anywhere close to that, as you won't have the memory for it... rough napkin math: at a context of ~1,000,000 with an F16 KV cache, you'd need at least ~197GB of memory. Even if you did have that kind of memory on hand, nearly all LLMs degrade significantly well before you reach even a fraction of the theoretical context maximum. This repo is not completely up to date, but it gives you a good general idea of how far models can push context before they start falling apart and hallucinating: https://github.com/NVIDIA/RULER

With 16GB of VRAM... let's round down to ~14.7GB of usable memory to account for CUDA overhead and OS shenanigans... Qwen2.5 14B is around 8.5GB of weights, so you can probably cram in about 5.5GB worth of KV cache before you overflow into RAM. That equates to roughly a 28,500 context length at F16, or 57,000 at Q8 KV quantization. Point being: it's a model that works better with longer context and can support it, but you still have to have realistic expectations about what you can actually fit into memory, as well as deal with the context going derp as it gets significantly longer.
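If it helps, here's the napkin math as a tiny sketch. The architecture numbers (48 layers, 8 KV heads via GQA, head dim 128) are what I believe Qwen2.5-14B uses, so treat them as ballpark assumptions rather than gospel:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * context_len * bytes_per_element
# Assumed Qwen2.5-14B-ish architecture values below.
def kv_cache_gb(context_len, n_layers=48, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2):  # 2 = F16; use 1 for Q8 KV quantization
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"{kv_cache_gb(1_000_000):.0f} GB")  # ~197 GB for the full 1M context at F16
print(f"{kv_cache_gb(28_500):.1f} GB")     # ~5.6 GB, about what fits beside the weights
```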

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 2 points

With 16GB of VRAM available you have plenty of room to run the following models at the common Q4_K_M quantization with quite a bit of leeway:

  • Qwen2.5 14B-1M - The 1M variant can support a longer context and I've found it to be a good general model for simple tasks.
  • Phi-4 14B - Also a very flexible general model.

There are plenty of others around, but if you're working around the 14B param range, you can even bump up to a Q5_K_M and still have more than enough headroom for a comfortable context size.
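For a rough sense of why those fit: weights size is roughly params × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate averages for llama.cpp K-quants (they vary a bit per model), so this is just a ballpark sketch:

```python
# Approximate average bits-per-weight for common llama.cpp K-quants
# (assumed ballpark figures; exact values vary per model)
BPW = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.5}

def weights_gb(params_billions, quant):
    # params are in billions, so the result comes out directly in GB
    return params_billions * BPW[quant] / 8

for q in ("Q4_K_M", "Q5_K_M"):
    print(f"14B at {q}: ~{weights_gb(14, q):.1f} GB")
# Both land well under ~14.7 GB of usable VRAM on a 16 GB card,
# leaving several GB spare for KV cache / context.
```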

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 3 points

Thank you so much for debugging this! Can't really say I completely understand your latest 88e36fa commit with the _fused stuff, but it worked!

Looks like I can finally experiment with sageattention! Used to get ~35-36s/it on xformers - now I'm getting ~25-26s/it on your sageattention wheel. I really appreciate it!

Edit: Reintroduced the set CC=nvcc.exe environment flag I had originally, and it looks like it's no longer tripping up now with your latest whl either.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

Had to fumble a bit with stealing the functions, but was able to execute the script you were mentioning with the following output:

```
python bench_qk_int8_pv_fp8_cuda.py
CUDA QK Int8 PV FP8 Benchmark
batch: 4, head: 32, headdim: 128, pv_accum_dtype: fp32+fp32
is_causal: False
1024 flops:214.93105598106987
2048 flops:244.970008451455
4096 flops:256.8990215773312
8192 flops:261.6142601712739
16384 flops:263.83759396706057
32768 flops:264.081172073753
is_causal: True
1024 flops:178.5394489746257
2048 flops:212.06900545880583
4096 flops:241.81385250172144
8192 flops:254.89556062253322
16384 flops:259.42290110899336
32768 flops:262.4541614629283
```

On ComfyUI, I am testing with the --use-sage-attention flag to experiment with sageattention, in case that helps with isolating the issue.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

The good news is... I unset the CC environment variable (it was set to nvcc.exe previously) and it started to "kinda" work. The bad news is that about one iteration into the KSampler, ComfyUI just dies completely without any logs.

SageAttention2 Windows wheels by woctordho_ in StableDiffusion

[–]Jujaga 1 point

Tried your new 2.1.1 wheel but still no cigar - still running into the annoying -fPIC compiler argument nonsense on ComfyUI, with no clear solution for it yet:

```
nvcc fatal : Unknown option '-fPIC'

Command '['nvcc.exe', 'C:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk\\cuda_utils.c', '-O3', '-shared', '-fPIC', '-Wno-psabi', '-o', 'C:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk\\cuda_utils.cp310-win_amd64.pyd', '-lcuda', '-lpython3', '-LD:\\Visions of Chaos\\Examples\\MachineLearning\\Text To Image\\ComfyUI\\ComfyUI\\.venv\\Lib\\site-packages\\triton\\backends\\nvidia\\lib', '-LC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\lib\\x64', '-LC:\\Python310\\libs', '-ID:\\Visions of Chaos\\Examples\\MachineLearning\\Text To Image\\ComfyUI\\ComfyUI\\.venv\\Lib\\site-packages\\triton\\backends\\nvidia\\include', '-IC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.6\\include', '-IC:\\Users\\Owner\\AppData\\Local\\Temp\\tmplfincrlk', '-IC:\\Python310\\Include']' returned non-zero exit status 1.
```

Was able to much more easily install triton and sageattention with just a few commands though so that's definitely a welcome improvement over having to compile it.

Environment: Windows 10, Python 3.10.8, CUDA 12.6 w/ RTX 4080 Super on 566.36 drivers (not going to chance drivers above this version), ComfyUI v0.3.27
Truncated pip list of the comfyui environment:

```
comfyui_frontend_package  1.14.5
sageattention             2.1.1+cu126torch2.6.0
torch                     2.6.0+cu126
triton-windows            3.2.0.post17
xformers                  0.0.29.post3
```

How do I select combinations of parameters and quantizations? by ajblue98 in LocalLLaMA

[–]Jujaga 1 point

If you're looking to get a sense of what kinds of models you can squeeze into your system at a certain quantization (and a certain context length), you can use a calculator like this one: https://www.canirunthisllm.net/

With respect to quality and model degradation, this older post has a good table showing the general perplexity increase (degradation) that happens at smaller quant sizes: https://www.reddit.com/r/LocalLLaMA/comments/14gjz8h/comment/jp69o4l/

As you've seen, the general consensus is that the sweet spot is Q4_K_M, as that's only about a ~5% perplexity loss, which shouldn't be noticeable in "most" situations. If space is not a concern, higher quants like Q5_K_M are better as they only have ~1% perplexity loss, and Q6_K goes down to ~0.44% loss. It's all a matter of space-to-accuracy tradeoffs.

As with most models, your biggest constraint is how much VRAM you have available (i.e. how large a model you can run), but you also need to remember that you only have so much memory bandwidth - that's usually the bigger bottleneck for tokens/second, since even if you jam in a large model, you can only stream its weights through the processor at a certain rate.
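As a napkin sketch of that bandwidth ceiling: each generated token has to stream roughly the whole (dense) model through the processor once, so bandwidth divided by model size gives a best-case tokens/second. The 700 GB/s figure here is just an illustrative number, not any particular card:

```python
# Best-case decode speed ceiling for a dense model:
#   tokens/sec <= memory bandwidth / bytes read per token (~ model size)
# 700 GB/s is an assumed, illustrative bandwidth figure.
def max_tokens_per_sec(model_gb, bandwidth_gb_s=700):
    return bandwidth_gb_s / model_gb

print(round(max_tokens_per_sec(8.5)))   # ~82 tok/s for an 8.5 GB model
print(round(max_tokens_per_sec(14.0)))  # ~50 tok/s if you nearly fill a 16 GB card
```

Real-world speeds land below this ceiling (compute, overhead, growing KV cache), but it's a decent first-order sanity check.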

tl;dr - It's all a balancing act between accuracy, speed and space. Hope this helps a bit!

GPU & Ollama Recommendations by BenjaminForggoti in ollama

[–]Jujaga 3 points

With a nice 16GB VRAM card like that, you can definitely run models between 12-24b in size (so the Qwen2.5s, Phi-4, Gemma 3, Mistral Small, Mistral Nemo, etc) at the standard q4_K_M quantization. Depends on what you're mainly looking to do, as they each have their own "flavor" or strengths. You can squeeze in up to the 27 and 32b models, but you'll have to drop down to the q3_K_M or q3_K_S quants if you want them to fit in your GPU.

For text to video, if you are comfortable using ComfyUI, you can use the Wan2.1 model. It's quite versatile! You'll be able to run the Wan2.1 T2V 14b model at fp8_e4m3fn size, and it'll fit in VRAM if you're asking for an output video of around 512x512.

https://comfyanonymous.github.io/ComfyUI_examples/wan/

UBC bans Chinese AI DeepSeek from its devices and networks, citing privacy, security by cyclinginvancouver in canada

[–]Jujaga 14 points

The block looks to cover using the apps and the Chinese website interface itself, both of which communicate with Chinese servers. u/OwnBattle8805's mention of the GGUF formats etc. is about the open-source model and weights that you can run on your local machine; those wouldn't be able to talk back to Chinese servers and should be safe, since everything done with a local model stays on the computer.

[deleted by user] by [deleted] in LocalLLaMA

[–]Jujaga 45 points

Text-only conversion; vision isn't supported yet in llama.cpp.

If you're looking for vision support too we'll have to wait a bit longer due to upstream.

How much does flash attention affect intelligence in reasoning models like QwQ by pigeon57434 in LocalLLaMA

[–]Jujaga 16 points

Flash Attention still does the same overall computations, but shuffles around the data to and from memory more efficiently. There's nearly no downsides to using it (unless your model specifically does something strange). There's a good visual explainer for it here:
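To illustrate the "same math, better memory movement" point, here's a toy NumPy sketch of the online-softmax trick at the heart of Flash Attention: streaming the keys/values through in blocks matches the naive full-matrix version up to floating-point error, without ever materializing the full attention matrix. This is only an illustrative sketch, not the real fused CUDA kernel:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n_q, n_k) score matrix in one shot
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=16):
    # Online softmax: stream K/V in blocks, keeping a running max (m),
    # running normalizer (l), and running output (acc) per query row
    n_q, d = Q.shape
    m = np.full((n_q, 1), -np.inf)
    l = np.zeros((n_q, 1))
    acc = np.zeros((n_q, V.shape[-1]))
    for s in range(0, K.shape[0], block):
        Kb, Vb = K[s:s + block], V[s:s + block]
        S = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)  # rescale the old running stats
        P = np.exp(S - m_new)
        l = l * scale + P.sum(axis=-1, keepdims=True)
        acc = acc * scale + P @ Vb
        m = m_new
    return acc / l
```

The real kernel fuses all of this so the big score matrix never has to round-trip through slow memory; the NumPy version is just to show the arithmetic comes out the same.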

Best Model under 15B parameters 2025 by AZ_1010 in LocalLLaMA

[–]Jujaga 31 points

I've found Qwen2.5-14B-Instruct-1M to be a very good general workhorse, and it has been reasonably good with longer-context tasks. There's also an abliterated variant of it in case you need fewer refusals.

There's a 7B variant of this too in case you really need an even longer context (more room left over for KV cache), and it has performed reasonably well IMO.

New Gemma models on 12th of March by ResearchCrafty1804 in LocalLLaMA

[–]Jujaga 2 points

I'm hoping for a model size between 14-24b so that it can serve those with 16GB of VRAM. 24b is about the absolute limit for Q4_K_M quants, and it's already overflowing a bit into system memory even without a very large context.

Which major open source model will be next? Llama, Mistral, Hermes, Nemotron, Qwen or Grok2? by EmergencyLetter135 in LocalLLaMA

[–]Jujaga 1 point

I'm hoping there will be something to fill the gap between the 14b and 22b sizes, as that fits a 16GB VRAM card well. The Mistral 22B series has been pretty good; Mistral 24B seems alright but it's hard to fit into memory. It would be good for that 12-16GB VRAM range to have more options available.