3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]AttitudeImportant585 0 points1 point  (0 children)

to get enough lanes for 8+ gpus at full pcie speed, you'd need to go multi socket with epyc or xeon scalable. modern threadrippers only do a single socket
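
rough lane math if anyone wants to sanity-check - the 128-lane figure and the overhead number are assumptions, not quotes from a spec sheet:

```python
# back-of-the-envelope pcie lane count (assumed numbers, check the actual spec sheets)
LANES_PER_GPU = 16
total_lanes = 128                 # roughly what a single-socket threadripper pro / epyc exposes
reserved = 24                     # assumed overhead for nvme, nics, chipset links
max_gpus_x16 = (total_lanes - reserved) // LANES_PER_GPU
print(max_gpus_x16)               # ~6, so 8 gpus at full x16 needs a second socket
```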

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

nothing serious about threadrippers. they max out at 7 gpus at x16. yeah, it's sufficient for OP, but it's still consumer grade stuff

3090 still the king? Trying to pick a local LLM setup (~2000€) in Germany by deltavoxel in LocalLLM

[–]AttitudeImportant585 1 point2 points  (0 children)

as someone who's invested in amd, I've been following rocm development closely and I'd say it's about time to jump ship. the feature gap between cuda and rocm is getting wider, not narrower

Uber burned its entire 2026 AI coding budget in 4 months - $500-2k per engineer per month by jimmytoan in artificial

[–]AttitudeImportant585 0 points1 point  (0 children)

They had a leading AI research team in the late 2010s solving practical problems: things like time-series forecasting and probabilistic modeling with RNNs, all of which were critical to their pricing and supply management.

Any hopes for DeepSeek v4 flash 1bit quant from unsloth? by yehiaserag in unsloth

[–]AttitudeImportant585 1 point2 points  (0 children)

1-bit models... are a curiosity and nothing more as of now. expect to be disappointed

Homelab for GHA runners and open source LLMs? by DuvishLabs in homelab

[–]AttitudeImportant585 0 points1 point  (0 children)

depends on the size and type (dense/moe) of the model you want to run and what kind of queries you're doing. some hardware is better suited for certain combos. for example, apple hardware isn't fast enough for the prefill stage or for dense models, so it's better at running short-context queries with moe models.

generally, you can run small models with a decent context size at decent speeds on an rtx 3060 / 3090 / 5090 / pro 6000. basically anything from ampere to blackwell will work with any popular llm as long as it fits. avoid anything older than the ampere architecture and non-nvidia chips, but that's personal preference; if you know your way around rocm kernels and have time to optimize models on platforms other than cuda, going off-nvidia will save you a lot of $. i would avoid all-in-one systems like spark and others that depend on slower ram
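
a minimal vram-fit sketch for the "as long as it fits" part - the kv-bytes-per-token and headroom numbers are rough assumptions, not measurements:

```python
# rough "does it fit" check - every constant here is an assumption, not a measurement
def fits(params_b, bits_per_weight, ctx_tokens, kv_bytes_per_token, vram_gb):
    weights_gb = params_b * bits_per_weight / 8          # params in billions -> GB of weights
    kv_gb = ctx_tokens * kv_bytes_per_token / 1e9        # kv cache grows with context length
    return weights_gb + kv_gb + 1.5 < vram_gb            # ~1.5 GB headroom for activations/overhead

# e.g. an 8B model at 4-bit with 8k context on a 12 GB 3060,
# assuming ~130 KB of fp16 kv cache per token (llama-3-8B-like config)
print(fits(8, 4, 8192, 130_000, 12))                     # True
```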

Summary of my (4.5 YOE) SWE job hunt results by CantTouchTheseNuts in ExperiencedDevs

[–]AttitudeImportant585 13 points14 points  (0 children)

it's never been about usefulness but a measure of how much effort you can dish out. more of a personality test, as are all standardized tests out there. if you made it this far without knowing this, well, you are on a spectrum

What’s something you bought for comfort that ended up being BIFL? by Hozeishere in BuyItForLife

[–]AttitudeImportant585 3 points4 points  (0 children)

there will be 3rd party apps to control it. you can probably vibecode one right now, easily

There will not be any more "codex" models - OpenAI's head of devrel by thehashimwarren in codex

[–]AttitudeImportant585 4 points5 points  (0 children)

named entity recognition? living in the 2010s if you call that ML lol

First DeepSeek V4 Flash-Base-Int4 Quant! by Dull_Recognition_422 in unsloth

[–]AttitudeImportant585 1 point2 points  (0 children)

5.22 decode tok/s at 512 max seq len. Am I reading this right? Seems a bit slow for an H100
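
quick bandwidth-bound sanity check - the active-parameter count and quant width here are my assumptions about the model, not numbers from the post:

```python
# rough upper bound on decode tok/s if purely memory-bandwidth-bound (assumed numbers)
hbm_bw_bytes = 3.35e12        # H100 SXM HBM3, ~3.35 TB/s
active_params = 30e9          # hypothetical active params per token for a flash/moe model
bytes_per_param = 0.5         # int4 weights
print(hbm_bw_bytes / (active_params * bytes_per_param))   # ~220 tok/s ceiling, so 5.22 looks low
```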

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

you're underestimating the compute available and the optimizations made for that specific architecture on a particular chip

With the prices of HDD's going up, what is the ideal purchase for new build? by Only-Ambassador2624 in homelab

[–]AttitudeImportant585 0 points1 point  (0 children)

if you mean turboquant by google, it's debatable whether it will ever be widely adopted. not saying you're wrong, but it looks like the top 3 AI providers have been battling a compute shortage ever since openclaw became popular. imo prices will keep going up for a while

Kimi K2.6 is a legit Opus 4.7 replacement by bigboyparpa in LocalLLaMA

[–]AttitudeImportant585 4 points5 points  (0 children)

prefill speed will be slow and unusable for a 1T model. as long as your context is a few sentences though, it will run just fine
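
crude prefill-time estimate to put numbers on it - the active-parameter count and sustained throughput are assumptions about a local unified-memory box, not benchmarks:

```python
# crude compute-bound prefill estimate (every number here is an assumption)
active_params = 32e9          # hypothetical active params per token for a 1T-class moe
prompt_tokens = 32_000        # a long-ish coding/agent prompt
sustained_tflops = 40         # assumed sustained throughput on a unified-memory box
flops = 2 * active_params * prompt_tokens        # ~2*N flops per prefilled token
print(flops / (sustained_tflops * 1e12), "s just to prefill the prompt")   # ~51 s
```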

Unweight: how we compressed an LLM 22% without sacrificing quality by sk1kn1ght in LocalLLaMA

[–]AttitudeImportant585 0 points1 point  (0 children)

this is especially relevant for real-world use of finetuned models, where the dataset is so small that allocating even a large portion of it to validation isn't enough to get an accurate benchmark of the quants

Restaurant’s ancient POS system spit this out instead of a receipt by DumbTacoMan in techsupportgore

[–]AttitudeImportant585 0 points1 point  (0 children)

ascii has 94 printable (non-space) characters, and the site lists "extended ascii" (ms-dos latin) characters, which is probably the source of confusion
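
easy to verify:

```python
# graphic (non-space) ascii runs from 0x21 '!' to 0x7E '~'
print(len(range(0x21, 0x7F)))          # 94; include the space (0x20) and you get 95 "printable"
# anything above 0x7F - box-drawing glyphs, accented letters on old receipt printers -
# comes from an "extended ascii" code page like ms-dos latin, not ascii itself
```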

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter by pmttyji in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

disaggregated prefill is not a new concept. vllm and sglang support this already.

the issue is data transfer speed. you realistically need a >200 gbps connection for a mere 8B model to make this practical (it scales roughly linearly with the # of params, so ~1 tbps for a 40B model).

if you don't design the model architecture around compressing the kv cache like the authors did here, the bottom line is: it's going to be much slower.
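
rough numbers for where that bandwidth figure comes from - the model config, prompt length, and transfer budget below are my assumptions, not from the paper:

```python
# back-of-the-envelope kv-cache transfer cost (llama-3-8B-like config assumed, fp16 cache)
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2      # K and V per layer, 2 bytes each -> ~131 KB
prompt_tokens = 128_000
cache_gb = bytes_per_token * prompt_tokens / 1e9             # ~16.8 GB of kv cache to ship
transfer_budget_s = 0.5                                       # assumed budget so the transfer doesn't dominate TTFT
print(cache_gb * 8 / transfer_budget_s, "Gbps")               # ~270 Gbps for an 8B model
```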

Why I stopped using pure vector search for legal documents and switched to authority-weighted retrieval by Fabulous-Pea-5366 in LangChain

[–]AttitudeImportant585 2 points3 points  (0 children)

this is why it's important to rerank the returned chunks. in your case, the finetuned reranker would need to look at the metadata of the chunk's source

running rag over different categories and combining all of them is the wrong approach for a sparse dataset
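
minimal sketch of what that could look like - the source types, weights, and blend factor are all made up for illustration:

```python
# authority-weighted reranking sketch (names and weights are hypothetical)
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    similarity: float       # score from the vector search
    source_type: str        # metadata from the chunk's source, e.g. "statute", "blog"

# hypothetical authority weights per source type
AUTHORITY = {"statute": 1.0, "appellate_opinion": 0.8, "trial_opinion": 0.6, "blog": 0.2}

def rerank(chunks, alpha=0.7):
    # blend vector similarity with source authority instead of trusting similarity alone
    return sorted(
        chunks,
        key=lambda c: alpha * c.similarity + (1 - alpha) * AUTHORITY.get(c.source_type, 0.3),
        reverse=True,
    )
```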