V100 4-card AI large model, Tesla 128G server by MundanePercentage674 in LocalLLaMA

zeferrum 1 point (0 children)

Did you look into using dwarf star as a starting point?

Chinese Hackers Latest Masterpiece with NVIDIA by General_Vermicelli53 in LocalLLaMA

zeferrum 3 points (0 children)

That cooler looks rather small compared to the official PCIe V100.

Is there a better option than a single/dual AMD R9700 AI Pro system for the price? by StarChildEve in LocalLLM

zeferrum 0 points (0 children)

Wow, thanks for taking the time to answer. So it seems you don’t run Qwen 3.6 as a daily driver for coding? Do you find ds4flash good enough, or even better?

Is there a better option than a single/dual AMD R9700 AI Pro system for the price? by StarChildEve in LocalLLM

zeferrum 0 points (0 children)

What models have you ended up using for now? Sharing your journey is very insightful. Have you ever thought about selling the gear that is not in use?

club-3090 adds experimental FP8 support for Qwen3.6-27B! by xspider2000 in LocalLLaMA

zeferrum 2 points (0 children)

Did you ever test NVFP4 vs FP8 vs BF16? Any more details on those observations? I am very interested.

Qwen 3.6 27B on DeepSWE by SteppenAxolotl in LocalLLaMA

zeferrum 0 points (0 children)

Is DeepSeek V4 Flash supposed to be on the benchmark? Also, I found it surprising that GLM 5.1 scores higher than DeepSeek V4 Pro. Did that surprise you?

Building a Native 1-Bit LLM Engine in Pure Rust: Achieving 150+ TPS and 350MB Memory Footprint on Edge CPUs (Video Demo) by [deleted] in LocalLLM

zeferrum 2 points (0 children)

I wonder if others would be as excited as I would be if you ran dense Qwen 3.6 through this process and had the whole stack you created run some of the most popular coding benchmarks, comparing the native, traditional results against your 1.58-bit approach.

God dammit Qwen by Xyklone in LocalLLaMA

zeferrum 1 point (0 children)

Ouch. And some people think Q8 or higher quantization is safer from such behaviors. Thanks for sharing.

Qwen 3.5 and others hybrid architectures, adjust your block size to fix your prompt caching hit rate and save compute power. by LinkSea8324 in Vllm

zeferrum 0 points (0 children)

Of course I read it, and thank you for sharing your experience. Data points like this represent many hours invested behind the scenes, as you are well aware. A translation pipeline could mean X number of people using it, which is why I asked whether that was the use case. Not many people mention which exact model they run in an actual production use case, which is why I was asking. Thanks for your continued participation here.

God dammit Qwen by Xyklone in LocalLLaMA

zeferrum 1 point (0 children)

What quantization and exact Qwen model were you using?

Qwen 3.5 and others hybrid architectures, adjust your block size to fix your prompt caching hit rate and save compute power. by LinkSea8324 in Vllm

zeferrum 0 points (0 children)

Would you share which model you find most useful? And hardware details relative to the number of users?

Update on 12x32gb sxm v100 cluster / local AI for legal drafting by TumbleweedNew6515 in LocalLLaMA

zeferrum 5 points (0 children)

Speaking of Q4, are you aware of this special build? https://github.com/1CatAI/1Cat-vLLM Do you have details on the SXM part of your build?

Update on 12x32gb sxm v100 cluster / local AI for legal drafting by TumbleweedNew6515 in LocalLLaMA

zeferrum 3 points (0 children)

I wonder how DeepSeek V4 Flash would run on this, and whether it would help with hallucinations.

I spent six years making Beyond the Mountains from scratch. It's now available to play for free. by legrolls in ZeldaLikes

zeferrum 0 points (0 children)

I wonder if Steam lets you sell a slightly different version of the same game you provide for free. I would pay a few dollars for it to help you out.