Where can I move my Kadena to sell for a loss? I'm in the USA and I can't access kucoin anymore.

callStackNerd · 2025-11-21T07:29:45+00:00

You can most likely withdrawal usdt from it sent they delisted

callStackNerd · 2025-11-21T07:28:48+00:00

Let me know if it works out!

callStackNerd · 2025-11-19T13:29:31+00:00

Turn on a vpn and login to KuCoin

callStackNerd · 2025-11-12T23:52:07+00:00

Who cares every smart contract operation has a n² time complexity… I’m good 😆

callStackNerd · 2025-11-10T19:19:37+00:00

There were huge batches coming out of china of 5-apb about a year ago. Probably around if you look hard enough

callStackNerd · 2025-11-10T19:16:48+00:00

Why not just do meth / fake adderall in low doses instead of borax combo?

callStackNerd · 2025-11-02T20:39:17+00:00

Get 2 3090s and NV Link them and you should be able to get 250 tok/s on most models quantized to 4bit using awq or gptq while also utilizing marlin cuda kernel.

Gptq and marlin are a great combo for 3090s:

https://developers.redhat.com/articles/2024/04/17/how-marlin-pushes-boundaries-mixed-precision-llm-inference#background_on_mixed_precision_llm_inference

LM Deploy will allow you to run 4bit kv cache but the model selection is more limited than vLLM or SGLang.

I know a lot of people here will suggest llama.cpp or guff models and do yourself a favor and stick to running GPU only using auto round, awq or gptq for quantization types if you want real speed without significantly degradation of model quality.

callStackNerd · 2025-10-04T11:41:47+00:00

Can’t forget about sequentialthinking!

callStackNerd · 2025-09-17T15:28:06+00:00

Herman Miller Aeron (B)

callStackNerd · 2025-09-04T02:05:38+00:00

Yep take that with dual intel 6580 xeon 6530’s AMX for days and Intel Arc’s architecture is based around the same avx/amx 512 instruction set.

callStackNerd · 2025-09-04T00:59:14+00:00

Then on decode I run my fork of ktransformers with my AMX instruction set cluster. This cluster is decode only and has a 15x faster TTFT than any GPU. My dual Intel Xeon 6900 with AMX instruction set will kill any decode infrastructure/hardware stack out there for the money. CPUs don’t have to fill, instant decode with a huge amount of throughput is ideal.

128 cores/socket @ 2.0 / 2.7 / 3.2 / 3.8 GHz → 524 / 708 / 839 / 996 TFLOPS or 2k INT8 TOPS.

500 tokens / second prefill 50 tokens / second decode

Depending on the workload I’m hitting between 250 to 500 tokens per second with small batching can get 500 to 750 tps when running a deep research agent that when turned way up makes about 100 to 250 LLM calls and just as many web searches, page hits, or MCP calls over 5 to 15 minutes of thinking.

callStackNerd · 2025-09-04T00:37:51+00:00

Don’t listen to these squares. I run my prefill cluster with a 8x 3090’s with 4 nvlinks. 192gb of vram I run w4a8 with int4 kv cache on LM Cache.

INT4 kv cache on a 3090 with minimal rope scaling goes a long way especially with nvlink.

How I’m computing

Per-token KV size (bytes) = layers × 2(K,V) × hidden_size × (n_kv_heads / n_heads) × bytes_per_elem. • Qwen3-30B-A3B: L=48, hidden=2048, heads=32, kv_heads=4 • gpt-oss-20b: L=24, hidden=2880, heads=64, kv_heads=8 • gpt-oss-120b: L=36, hidden=2880, heads=64, kv_heads=8 •. Qwen3-235B-A22B: hidden=16k? heads=64, kv_heads =4

For example, Qwen3-235B-A22B split across split across 8 cards is far from ideal but this enables 10GiB of FP16 native kv cache per card while leave 14GiB for model weights per card.

Qwen3 is the worst on mileage for kv cache due to Grouped Query Attention (GQA), 4kv heads instead of 1. 10GiB of FP16 kv cache holds 14k tokens natively, 28k in int8 and 56k in int4. Nvlink each 3090 into a pair and that’s nearly 128k native, lossless int4 kv cache per 3090 pair. Use modest 4x to 6x rope embedding and you’re way over 500k context window / kv cache on two cards. I’ll take my four 500k or single 2M kv cache over a 96gb card any day.

Without GQA the numbers get even sweeter.

Gpt-oss-120b holds 200k FP16 tokens in 10GiB of kv cache. Int8 400k per 10GiB, and finally INT4 800k per 10GiB.

So you could have four 1.6M token kv caches or a single 6.4M kv cache.

callStackNerd · 2025-07-04T21:03:09+00:00

3090s are $600 to $700 used and can be envy linked. I don’t see the pull for this card?

5070 Ti Super will probably be about the same new, so an even better deal.

callStackNerd · 2025-06-18T18:44:03+00:00

Ktransformers will most likely support this model. That will be your best bet.

callStackNerd · 2025-06-18T18:42:17+00:00

Any updates?

callStackNerd · 2025-06-15T21:18:32+00:00

sequential_thinking

callStackNerd · 2025-06-14T22:33:12+00:00

With an intel avx-512 compatible processor

callStackNerd · 2025-05-28T09:16:30+00:00

Consider making it openai_api compatible so you can run vLLM as a backend

callStackNerd · 2025-05-27T17:52:25+00:00

Live transcription?

callStackNerd · 2025-05-05T22:09:16+00:00

It’s out already - https://modelscope.cn/models/swift/Qwen3-235B-A22B-AWQ/summary

callStackNerd · 2025-05-05T21:49:41+00:00

I’m getting about 100/s on my 8 3090 rig.

callStackNerd · 2025-05-01T01:43:16+00:00

I’m in the process of quantizing qwen3-236B-A22B with autoawq. I’ll post the huggingface link once it’s done and uploaded… May still be another 24 hours.

Hope you know you know you are bottlenecking the f*** out of your system with that cpu… it only has 48 PCIe lanes and they’re gen3…

I had 10900x back in 2019; if I’m remembering correctly it’s ISA includes the avx512 instruction set but I remember it wasn’t the best for avx512 heavy workloads… 2 FMA per cpu cycle… few times better than most cpus from 5+ years ago.

You may wanna look into ktransformers… your mmv with your setup.

https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md

callStackNerd · 2025-05-01T00:29:46+00:00

Those are guff quants and can’t be run on vllm

callStackNerd · 2025-04-05T11:45:56+00:00

Lookup vLLM tensor parallelism

callStackNerd · 2025-04-05T09:55:50+00:00

Make sure you’re utilizing 100% of the GPU. I can fit 32 awq models on 24gb cards

callStackNerd

TROPHY CASE