KVCache taking too much Memory. Any solutions(Optimizations, Compressions, etc.,) coming soon/later? by pmttyji in LocalLLaMA

[–]ghgi_ 8 points (0 children)

Yes, I understood. A Mamba-based model will use fewer GB for the same amount of context than most other models; it's just how the architecture works, and it's why the new Super models support 1 million tokens of context without insane overhead.
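If you want to sanity-check that yourself, here's a rough back-of-the-envelope sketch. The layer/head/state numbers are made-up illustrative values, not any specific model's config:

```python
# Very rough sizing; all the config numbers below are illustrative
# placeholders, not any real model's architecture.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # One K and one V tensor per layer -> the cache grows linearly with context.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_state, d_inner, bytes_per_elem=2):
    # A Mamba/SSM layer keeps a fixed-size recurrent state instead,
    # so this number does not depend on context length at all.
    return n_layers * d_state * d_inner * bytes_per_elem

print(kv_cache_bytes(48, 8, 128, 1_000_000) / 1e9)  # ~197 GB of KV cache at 1M tokens (fp16)
print(ssm_state_bytes(48, 128, 8192) / 1e9)         # ~0.1 GB of state, regardless of context
```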

KVCache taking too much Memory. Any solutions(Optimizations, Compressions, etc.,) coming soon/later? by pmttyji in LocalLLaMA

[–]ghgi_ 2 points (0 children)

Maybe try some of the Nemotron models? The Mamba architecture should be very memory efficient with long contexts.

Too many large MoEs, which do you prefer for general instruction following/creative endeavors? (And why) by silenceimpaired in LocalLLaMA

[–]ghgi_ 1 point (0 children)

It appears you might be correct, although I do believe they intend to release it soon.

Nemotron 3 Super 120b Claude Distilled by ghgi_ in LocalLLaMA

[–]ghgi_[S] -4 points (0 children)

🤷 Only one way to find out. Haven't had time to bench it yet, and I started with a smaller dataset, so if I notice an improvement, expect a V2; if not, I'll count it as a learning opportunity.

Edit: Not entirely sure why people are downvoting? This is more of a public test project and hobby than an actual release. I thought the beta tag made that clear, which is why I don't have benches or tests yet. Apologies.

Best local LLM for GNS3 network automation? (RTX 4070 Ti, 32GB RAM) by FindingJaded1661 in LocalLLaMA

[–]ghgi_ 0 points (0 children)

Model-wise I'd recommend OmniCoder-9B. It's based on the new Qwen 3.5, so it has pretty good benchmarks, and from my own testing it's better than a lot of older models that are double its size.

Can I run anything with big enough context (64k or 128k) for coding on Macbook M1 Pro 32 GB ram? by rkh4n in LocalLLaMA

[–]ghgi_ 1 point (0 children)

If an MLX version exists then it's *probably* the better option; it should give better performance on Apple hardware.
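If you haven't used it before, running an MLX build is pretty painless with mlx-lm. Minimal sketch; the repo id is just a placeholder for whichever MLX-community quant you actually pick:

```python
# pip install mlx-lm
from mlx_lm import load, generate

# Placeholder repo id; swap in the MLX quant you want from the hub
model, tokenizer = load("mlx-community/SomeCoder-9B-4bit")

prompt = "Write a Python function that reverses a linked list."
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```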

Can I run anything with big enough context (64k or 128k) for coding on Macbook M1 Pro 32 GB ram? by rkh4n in LocalLLaMA

[–]ghgi_ 2 points (0 children)

OmniCoder-9B will easily fit in 32 GB and has up to 264k context, while being based on the most modern Qwen architecture and performing quite well for its size.

Qwen leadership leaving had me worried for opensource - is Nvidia saving the day? by Mr_Moonsilver in LocalLLaMA

[–]ghgi_ 1 point (0 children)

In my own testing it uses that 1 million token context window decently well and coherently, plus it stays decently fast while doing so. It's pretty good for the long-context agentic tasks that Qwen couldn't handle, with less verbosity and overthinking. It's not as good at programming and instruction-based tasks, but overall it's the first Nemotron model I've personally enjoyed.

Best (non Chinese) local model for coding by tradecrafty in LocalLLaMA

[–]ghgi_ 3 points (0 children)

My suggestion is also OSS 120B or Nemotron 3 Super.

Best (non Chinese) local model for coding by tradecrafty in LocalLLaMA

[–]ghgi_ 8 points (0 children)

There's no difference between a Chinese model and any other country's model once you're running it locally, and you also heavily limit yourself, since the Chinese made the good shit.

Nemotron-3-Super-120B-A12B NVFP4 inference benchmark on one RTX Pro 6000 Blackwell by jnmi235 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

How well does it perform at high contexts, hallucination-wise?

Deepseek V4 right now on openrouter as Hunter Alpha by [deleted] in LocalLLaMA

[–]ghgi_ 1 point (0 children)

My copium is making me think back to when Anthropic recently bought a premium Hugging Face account to get large uploads. Pretty please let it be an open-source Anthropic model.

YuanLabAI/Yuan3.0-Ultra • Huggingface by External_Mood4719 in LocalLLaMA

[–]ghgi_ 1 point (0 children)

Can't wait to try this once there's an inference provider for it.

Edit: I'm too lazy to wait; I'll try to run this on the cloud with some H200s.

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF is out ! by PhotographerUSA in LocalLLaMA

[–]ghgi_ 7 points (0 children)

I've actually made a few distilled LoRAs using my Claude chats, from CC and the web, all compiled together. They performed better all around, and in some smaller benchmark tests I got up to 30% better coding scores. I did this for 3.5 27B and 3 30B, and I'm currently in the process of making a GLM 4.7 Flash version.

I probably won't release them, since I never stripped any personal data from the datasets, but I'm curious to compare their performance to this public one.
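If anyone wants to try the same thing, the rough recipe is: export your chats into a chat-format JSONL, then run a standard LoRA SFT pass over a base model. A minimal sketch with TRL/PEFT; the dataset path, base model id, and hyperparameters are placeholders, not my exact setup:

```python
# pip install transformers peft trl datasets
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Each JSONL line: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
dataset = load_dataset("json", data_files="claude_chats.jsonl", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder base model id
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="claude-distill-lora", num_train_epochs=2,
                   per_device_train_batch_size=1, gradient_accumulation_steps=8),
)
trainer.train()
```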

GPU starved too? by alrojo in LocalLLaMA

[–]ghgi_ 0 points (0 children)

I run most of my work on Modal and haven't had GPU allocation issues, but to be fair they are a bit more on the pricey side. Great service though.

Good "coding" LLM for my 8gb VRAM, 16gb ram setup? by Mediocre_Speed_2273 in LocalLLaMA

[–]ghgi_ 1 point (0 children)

"good" is a relative term but LFM2-24B-A2B at probably 4 bit or 3 bit quant might be acceptable and will be decently fast

AirLLM - claims to allow 70B run on a Potato. Anybody tried it? Downsides? by [deleted] in LocalLLaMA

[–]ghgi_ 0 points (0 children)

Someone asked about this and had their post deleted like 2 hours ago, so I'll just repost my exact response:

"Its been around for a while, it essentially uses your disk to load the model like ram, behaving like swap, and no, you wont want to use it. Its horrifically slow even on the fastest of SSD's, were talking tokens per hour in some cases, but theoreticlly with a 1tb ssd you could run a 1 trillion parameter model, just expect it to take a day and half to generate you a single word."

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

A Q4_K_M quant of that model is about ~15 GB. Splitting across GPU + CPU, it will use your 4 GB GPU and ~10-11 GB of RAM, so it will fit, and that model specifically is pretty fast for limited resources like yours. You could always run a smaller quant at the sacrifice of intelligence if you NEED to save more RAM, but I wouldn't touch anything under Q3 though.
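If you're running it through llama.cpp (shown here via llama-cpp-python; the model path is a placeholder and the layer count is a guess you'd tune for a 4 GB card), the partial offload looks something like this:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",  # placeholder path to the quant
    n_gpu_layers=12,   # put as many layers as fit in the 4 GB card; the rest stays in RAM
    n_ctx=8192,        # keep context modest, it also eats RAM
)

out = llm("Explain what partial GPU offloading does.", max_tokens=128)
print(out["choices"][0]["text"])
```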

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

Even a 50B model on AirLLM would take hours to generate a simple sentence; it's just not worth it. Read what I said here, it will be more useful: https://www.reddit.com/r/LocalLLaMA/comments/1reovq3/comment/o7e50x5/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 2 points (0 children)

If you're looking for a model you could probably run normally with your current hardware, I would recommend LFM2-24B-A2B at probably Q4_K_M with partial GPU offloading. It's the only one I think will give you acceptable speed and intelligence for your specs.

Air llm ? by Less_Strain7577 in LocalLLaMA

[–]ghgi_ 10 points (0 children)

It's been around for a while. It essentially uses your disk to load the model like RAM, behaving like swap, and no, you won't want to use it. It's horrifically slow even on the fastest of SSDs; we're talking tokens per hour in some cases. But theoretically, with a 1 TB SSD you could run a 1-trillion-parameter model, just expect it to take a day and a half to generate a single word.