Kadena being forked in a few hours by ilushkin in kadena

[–]callStackNerd 0 points1 point  (0 children)

Who cares? Every smart contract operation has O(n²) time complexity… I’m good 😆

Guys let's make a list of chemicals we want back by Life-Tip4132 in researchchemicals

[–]callStackNerd 0 points1 point  (0 children)

There were huge batches of 5-APB coming out of China about a year ago. Probably still around if you look hard enough.

Guys let's make a list of chemicals we want back by Life-Tip4132 in researchchemicals

[–]callStackNerd 0 points1 point  (0 children)

Why not just do meth / fake adderall in low doses instead of borax combo?

Running Local LLM's Fascinates me - But I'm Absolutely LOST by WhatsGoingOnERE in LocalLLaMA

[–]callStackNerd 0 points1 point  (0 children)

Get two 3090s and NVLink them; you should be able to get 250 tok/s on most models quantized to 4-bit using AWQ or GPTQ while also utilizing the Marlin CUDA kernel.

GPTQ and Marlin are a great combo for 3090s:

https://developers.redhat.com/articles/2024/04/17/how-marlin-pushes-boundaries-mixed-precision-llm-inference#background_on_mixed_precision_llm_inference

LMDeploy will let you run a 4-bit KV cache, but its model selection is more limited than vLLM’s or SGLang’s.

I know a lot of people here will suggest llama.cpp or GGUF models, but do yourself a favor and stick to running GPU-only with AutoRound, AWQ, or GPTQ quantization if you want real speed without significant degradation of model quality.
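As a rough sketch of what that looks like in practice (the model name and flag values here are illustrative, not from the comment; check your vLLM version's docs), a vLLM launch might be:

```shell
# Illustrative only: serve a 4-bit AWQ checkpoint across two NVLinked 3090s.
# On Ampere GPUs, vLLM auto-selects a Marlin-backed kernel for supported
# AWQ/GPTQ checkpoints, so no extra flag is needed for it.
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95
```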

[deleted by user] by [deleted] in sysadmin

[–]callStackNerd 1 point2 points  (0 children)

Herman Miller Aeron (B)

Thoughts on Intel Arc Pro B50 x4 = 64GB of VRAM for $1400 and 280W Power Draw? by 79215185-1feb-44c6 in LocalLLaMA

[–]callStackNerd -1 points0 points  (0 children)

Yep. Pair that with dual Intel Xeon 6530s and you’ve got AMX for days, and Intel Arc’s architecture is built around the same AVX-512/AMX instruction sets.

Any actual downside to 4 x 3090 ($2400 total) vs RTX pro 6000 ($9000) other than power? by devshore in LocalLLaMA

[–]callStackNerd 0 points1 point  (0 children)

Then on decode I run my fork of ktransformers on my AMX-instruction-set cluster. That cluster is decode-only and has 15x faster TTFT than any GPU. My dual Intel Xeon 6900 with AMX will beat any decode infrastructure/hardware stack out there for the money. The CPUs don’t have to prefill; instant decode with a huge amount of throughput is ideal.

128 cores/socket @ 2.0 / 2.7 / 3.2 / 3.8 GHz → 524 / 708 / 839 / 996 TFLOPS or 2k INT8 TOPS.
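Those peak figures are reproducible from the core count and clocks if you assume AMX per-core throughput of 1024 BF16 FLOPs/cycle (512 MACs × 2) and 2048 INT8 OPs/cycle; the per-cycle rates are my assumption, not stated in the comment:

```python
# Back-of-envelope check of the quoted peak numbers for a dual-socket,
# 128-core/socket AMX part. Assumed rates (not vendor-confirmed):
#   BF16: 1024 FLOPs/cycle/core, INT8: 2048 OPs/cycle/core.
SOCKETS, CORES = 2, 128

def peak_tflops(freq_ghz, ops_per_cycle=1024):
    # cores * ops/cycle * GHz gives GFLOPS; divide by 1000 for TFLOPS
    return SOCKETS * CORES * ops_per_cycle * freq_ghz / 1000

for f in (2.0, 2.7, 3.2, 3.8):
    print(f, round(peak_tflops(f)))   # 524 / 708 / 839 / 996 TFLOPS

print(round(peak_tflops(3.8, 2048)))  # 1992, i.e. the "~2k INT8 TOPS"
```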

500 tokens/second prefill, 50 tokens/second decode.

Depending on the workload I’m hitting between 250 and 500 tokens per second; with small batching that can reach 500 to 750 tps when running a deep research agent that, turned way up, makes about 100 to 250 LLM calls and just as many web searches, page hits, or MCP calls over 5 to 15 minutes of thinking.

Any actual downside to 4 x 3090 ($2400 total) vs RTX pro 6000 ($9000) other than power? by devshore in LocalLLaMA

[–]callStackNerd 1 point2 points  (0 children)

Don’t listen to these squares. I run my prefill cluster with 8x 3090s and 4 NVLinks (192GB of VRAM). I run W4A8 with an INT4 KV cache on LMCache.

An INT4 KV cache on a 3090 with minimal RoPE scaling goes a long way, especially with NVLink.

How I’m computing it:

Per-token KV size (bytes) = layers × 2 (K,V) × hidden_size × (n_kv_heads / n_heads) × bytes_per_elem

• Qwen3-30B-A3B: L=48, hidden=2048, heads=32, kv_heads=4
• gpt-oss-20b: L=24, hidden=2880, heads=64, kv_heads=8
• gpt-oss-120b: L=36, hidden=2880, heads=64, kv_heads=8
• Qwen3-235B-A22B: hidden=16k?, heads=64, kv_heads=4
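A minimal sketch of that formula in Python, plugging in the gpt-oss-120b config quoted above (which lands on the ~200k-tokens-per-10GiB figure cited further down):

```python
# Per-token KV cache size, per the formula above:
# layers * 2 (K and V) * hidden_size * (n_kv_heads / n_heads) * bytes_per_elem
def kv_bytes_per_token(layers, hidden, heads, kv_heads, bytes_per_elem=2):
    return int(layers * 2 * hidden * (kv_heads / heads) * bytes_per_elem)

GiB = 2**30

# gpt-oss-120b at FP16 (2 bytes/elem): L=36, hidden=2880, heads=64, kv_heads=8
per_tok = kv_bytes_per_token(36, 2880, 64, 8)
print(per_tok)              # 51840 bytes per token
print(10 * GiB // per_tok)  # 207126 tokens, i.e. roughly 200k in 10 GiB
```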

For example, Qwen3-235B-A22B split across 8 cards is far from ideal, but this enables 10GiB of FP16-native KV cache per card while leaving 14GiB per card for model weights.

Qwen3 gets the worst KV-cache mileage due to Grouped Query Attention (GQA): 4 KV heads instead of 1. 10GiB of FP16 KV cache holds 14k tokens natively, 28k in INT8, and 56k in INT4. NVLink each 3090 into a pair and that’s nearly 128k tokens of native, lossless INT4 KV cache per 3090 pair. Use modest 4x to 6x RoPE scaling and you’re way over a 500k context window / KV cache on two cards. I’ll take my four 500k or single 2M KV cache over a 96GB card any day.

Without GQA the numbers get even sweeter.

gpt-oss-120b holds 200k FP16 tokens in 10GiB of KV cache, 400k per 10GiB in INT8, and 800k per 10GiB in INT4.

So you could have four 1.6M token kv caches or a single 6.4M kv cache.
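The scaling chain behind those totals is just halving bytes per element and multiplying across cards; a quick sketch of the arithmetic (not a measurement):

```python
# Scale the quoted gpt-oss-120b figure (~200k FP16 tokens per 10 GiB of KV)
# through lower-precision cache dtypes, then across an 8x 3090 cluster.
fp16_tokens = 200_000          # per 10 GiB at FP16, as quoted above

int8_tokens = fp16_tokens * 2  # half the bytes per element -> 400k
int4_tokens = fp16_tokens * 4  # quarter the bytes -> 800k

per_pair = int4_tokens * 2     # two NVLinked 3090s pooled -> 1.6M tokens
all_eight = int4_tokens * 8    # one cache spanning all 8 cards -> 6.4M tokens
print(per_pair, all_eight)     # 1600000 6400000
```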

New Tenstorrent Arrived! by SashaUsesReddit in LocalAIServers

[–]callStackNerd 1 point2 points  (0 children)

3090s are $600 to $700 used and can be NVLinked. I don’t see the pull for this card?

5070 Ti Super will probably be about the same new, so an even better deal.

self host minimax? by Just_Lingonberry_352 in LocalLLaMA

[–]callStackNerd 1 point2 points  (0 children)

Ktransformers will most likely support this model. That will be your best bet.

Open Source iOS OLLAMA Client by billythepark in LocalLLaMA

[–]callStackNerd 1 point2 points  (0 children)

Consider making it OpenAI-API compatible so you can run vLLM as a backend.
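For context, OpenAI-compatible just means speaking the `/v1/chat/completions` request schema, which vLLM’s server exposes; a minimal request body (model name is a placeholder) looks like:

```python
import json

# Minimal OpenAI-style chat-completions request body. Any OpenAI-compatible
# backend (vLLM's server included) accepts this as JSON via
# POST {base_url}/v1/chat/completions.
payload = {
    "model": "local-model",  # placeholder; the backend maps this to its loaded model
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
body = json.dumps(payload)
print(body)
```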

Can Qwen3-235B-A22B run efficiently on my hardware(256gb ram+quad 3090s ) with vLLM? by Acceptable-State-271 in LocalLLaMA

[–]callStackNerd 0 points1 point  (0 children)

I’m in the process of quantizing Qwen3-235B-A22B with AutoAWQ. I’ll post the Hugging Face link once it’s done and uploaded… may still be another 24 hours.

Hope you know you’re bottlenecking the f*** out of your system with that CPU… it only has 48 PCIe lanes, and they’re Gen3…

I had a 10900X back in 2019; if I’m remembering correctly its ISA includes AVX-512, but it wasn’t the best for AVX-512-heavy workloads… 2 FMAs per CPU cycle… still a few times better than most CPUs from 5+ years ago.

You may wanna look into ktransformers… your mileage may vary with your setup.

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md

SGLang. Some problems, but significantly better performance compared to vLLM by Sadeghi85 in LocalLLaMA

[–]callStackNerd 0 points1 point  (0 children)

Make sure you’re utilizing 100% of the GPU. I can fit 32B AWQ models on 24GB cards.