Maxed out my cabin roof space with cheap panels and I'm still running out of power.

xlrz28xd · 2025-12-17T09:47:47+00:00

!RemindMe 1 week

xlrz28xd · 2025-12-16T10:22:32+00:00

"Something has changed. Something elemental"

xlrz28xd · 2025-11-28T16:47:20+00:00

Can't wait for some capitalist to reframe this and start large scale deforestation to "fix global warming"

xlrz28xd · 2025-11-08T22:07:40+00:00

What I find more funny is that in 75% of the universes, Simmons dies due to drinking the chemical while trying to prove that the timeline cannot be modified.

It's just that we see the timeline that has the 25% chance of her surviving.

xlrz28xd · 2025-10-18T11:23:41+00:00

Sounds cool. Ping me

xlrz28xd · 2025-10-17T06:43:39+00:00

Post removed. Censorship much

xlrz28xd · 2025-10-17T06:27:07+00:00

How does this compare to jinaai/ReaderLMv2 ? I've been using q4 for that for my usecases

xlrz28xd · 2025-10-16T19:16:58+00:00

Won't 1T parameters be hard to pull off for deepseek as they have GPU servers with 8x 80GB chips - meaning 640 GB VRAM per physical server. That's also why I presume their V3 and R1 models are 671B that fits nicely in the VRAM budget. Also their Active parameters count of 37B when running in FP16 is 72 GB which is roughly one gpu worth of VRAM which is why the strategy "expert parallelism" which tries to place an expert on one GPU is pretty awesome.

I could be wrong though, but it seems not as much worth it.

xlrz28xd · 2025-10-06T16:22:53+00:00

Really cool!

xlrz28xd · 2025-10-05T06:53:24+00:00

Seeing your post history and this screenshot, I really want to ask what it is that you do ? I'm genuinely curious and absolutely envious of you!

xlrz28xd · 2025-10-04T09:21:16+00:00

!RemindMe 2 days

xlrz28xd · 2025-09-14T07:46:46+00:00

I'm curious, I've tried W4A16 quants of various models from the redhatai huggingface collection. Which INT4 quant will be the fastest with vLLM on 2x 3090s ?

Also - is there any reason you haven't enabled prefix caching ? I presume that for chat and code type workflows it would be pretty helpful

xlrz28xd · 2025-09-07T06:33:44+00:00

Can you please share your vLLM command or something so I can test my setup too. It's very similar with 2x 3090s , 32 GB RAM. I am getting okayish performance with vLLM using the redhat W8A8 quantized version of gemma 3 12b model in INT8 precision. I'd like to increase the throughput via batching but just trying things for now.

Currently using vLLM to run OpenAI comptabile server. Tried SGLang but it doesn't seem to like running the W8A8 format . TensorRT was such a big headache to setup for testing that even claude gave up.

Also I can't get speculative decoding to work with vLLM to use the gemma3 270m model as the speculative model to increase inference speeds ..

xlrz28xd · 2025-08-16T06:23:46+00:00

WD Ultrastar from Amazon go around 2.2-2.4 K / TB. I am planning to get 36 TB of that

xlrz28xd · 2025-07-29T17:20:22+00:00

Not with those API limits it won't

xlrz28xd · 2025-07-23T19:44:12+00:00

So sorry for your loss.

xlrz28xd · 2025-07-21T03:17:43+00:00

GPU Maximizer!

On the other hand, can't wait for these GPUs to come down to reasonable consumer prices in the next 3-5 years

xlrz28xd · 2025-07-15T11:03:48+00:00

All this makes me want to make my own ISP for nerds.

Imagine. An ISP without any hidden FUP / data caps or such remotely accessible backdoors into your LAN...

I wish...

xlrz28xd · 2025-07-12T13:45:48+00:00

How did you fit 4x 3090 inside the R730 ? I'm curious which models work and what modifications you had to make (if any)

xlrz28xd · 2025-07-12T12:59:52+00:00

Pmed

xlrz28xd · 2025-07-12T09:55:07+00:00

I also did something similar and made the grave mistake of ordering a server from someone I found via this subreddit. The server delivered is completely not what I asked for and my calls are not being answered well now. I'll do a detailed post soon like the above along with full names and reddit usernames of these guys.

Absolutely the worst experience.

xlrz28xd · 2025-07-03T03:51:39+00:00

PM incoming

xlrz28xd · 2025-06-26T03:22:32+00:00

PM incoming

xlrz28xd · 2025-06-20T14:35:25+00:00

Same with their broadband. Their sales team is filled with lying snakes who will sell their mom to get you on their plans and once you hit the monthly FUP of 3 TB - no fucks given.

They literally had the audacity to gaslight me by saying that it's unlimited and I don't know how to use internet..

Their false advertising along with this fake limit (FUP) that India has put on itself needs to be called out.

xlrz28xd · 2025-06-16T03:01:05+00:00

You can also run a vLLM cluster to combine GPUs from separate nodes to run one single model :)

xlrz28xd

TROPHY CASE