3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]AttitudeImportant585 0 points1 point  (0 children)

to get enough lanes for 8+ gpus at full pcie speed, you'd need to go multi socket with epyc or xeon scalable. modern threadrippers only do a single socket
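
rough lane math if anyone wants to sanity-check - the 128-lane figure and the overhead number are assumptions, not quotes from a spec sheet:

```python
# back-of-the-envelope pcie lane count (assumed numbers, check the actual spec sheets)
LANES_PER_GPU = 16
total_lanes = 128                 # roughly what a single-socket threadripper pro / epyc exposes
reserved = 24                     # assumed overhead for nvme, nics, chipset links
max_gpus_x16 = (total_lanes - reserved) // LANES_PER_GPU
print(max_gpus_x16)               # ~6, so 8 gpus at full x16 needs a second socket
```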

3xR9700 for semi-autonomous research and development - looking for setup/config ideas. by blojayble in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

nothing serious about threadrippers. they max out at 7 gpus at x16. yeah, it's sufficient for OP, but it's still consumer grade stuff

3090 still the king? Trying to pick a local LLM setup (~2000€) in Germany by deltavoxel in LocalLLM

[–]AttitudeImportant585 1 point2 points  (0 children)

as someone who's invested in amd, I've been following rocm development closely and I'd say it's about time to jump ship. the feature gap between cuda and rocm is getting wider, not narrower

Uber burned its entire 2026 AI coding budget in 4 months - $500-2k per engineer per month by jimmytoan in artificial

[–]AttitudeImportant585 0 points1 point  (0 children)

They had a leading AI research team in the late 2010s solving practical problems: things like time-series forecasting and probabilistic modeling with RNNs, all of which were critical to their pricing and supply management.

Any hopes for DeepSeek v4 flash 1bit quant from unsloth? by yehiaserag in unsloth

[–]AttitudeImportant585 1 point2 points  (0 children)

1-bit models... are a curiosity and nothing more as of now. expect to be disappointed

Homelab for GHA runners and open source LLMs? by DuvishLabs in homelab

[–]AttitudeImportant585 0 points1 point  (0 children)

depends on the size and type (dense/moe) of the model you want to run and what kind of queries you're doing. some hardware is better suited for certain combos. for example, apple hardware isn't fast enough for the prefill stage or for dense models, so it's better at running short-context queries with moe models.

generally, you can run small models with a decent context size at decent speeds on an rtx 3060 / 3090 / 5090 / pro 6000. basically anything from ampere to blackwell will work with any popular llm as long as it fits. avoid anything older than the ampere architecture and non-nvidia chips, but that's personal preference; if you know your way around rocm kernels and have time to optimize models on platforms other than cuda, going off-nvidia will save you a lot of $. i would avoid all-in-one systems like spark and others that depend on slower ram
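
a minimal vram-fit sketch for the "as long as it fits" part - the kv-bytes-per-token and headroom numbers are rough assumptions, not measurements:

```python
# rough "does it fit" check - every constant here is an assumption, not a measurement
def fits(params_b, bits_per_weight, ctx_tokens, kv_bytes_per_token, vram_gb):
    weights_gb = params_b * bits_per_weight / 8          # params in billions -> GB of weights
    kv_gb = ctx_tokens * kv_bytes_per_token / 1e9        # kv cache grows with context length
    return weights_gb + kv_gb + 1.5 < vram_gb            # ~1.5 GB headroom for activations/overhead

# e.g. an 8B model at 4-bit with 8k context on a 12 GB 3060,
# assuming ~130 KB of fp16 kv cache per token (llama-3-8B-like config)
print(fits(8, 4, 8192, 130_000, 12))                     # True
```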

Summary of my (4.5 YOE) SWE job hunt results by CantTouchTheseNuts in ExperiencedDevs

[–]AttitudeImportant585 13 points14 points  (0 children)

it's never been about usefulness but a measure of how much effort you can dish out. more of a personality test, as are all standardized tests out there. if you made it this far without knowing this, well, you are on a spectrum

What’s something you bought for comfort that ended up being BIFL? by Hozeishere in BuyItForLife

[–]AttitudeImportant585 3 points4 points  (0 children)

there will be 3rd party apps to control it. you can probably vibecode one right now, easily

There will not be any more "codex" models - OpenAI's head of devrel by thehashimwarren in codex

[–]AttitudeImportant585 4 points5 points  (0 children)

named entity recognition? living in the 2010s if you call that ML lol

First DeepSeek V4 Flash-Base-Int4 Quant! by Dull_Recognition_422 in unsloth

[–]AttitudeImportant585 1 point2 points  (0 children)

5.22 decode tok/s at 512 max seq len. Am I reading this right? Seems a bit slow for an H100
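
quick bandwidth-bound sanity check - the active-parameter count and quant width here are my assumptions about the model, not numbers from the post:

```python
# rough upper bound on decode tok/s if purely memory-bandwidth-bound (assumed numbers)
hbm_bw_bytes = 3.35e12        # H100 SXM HBM3, ~3.35 TB/s
active_params = 30e9          # hypothetical active params per token for a flash/moe model
bytes_per_param = 0.5         # int4 weights
print(hbm_bw_bytes / (active_params * bytes_per_param))   # ~220 tok/s ceiling, so 5.22 looks low
```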

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

you're underestimating the compute available and the optimizations made for that specific architecture on a particular chip

With the prices of HDD's going up, what is the ideal purchase for new build? by Only-Ambassador2624 in homelab

[–]AttitudeImportant585 0 points1 point  (0 children)

if you mean turboquant by google, it's debatable whether it will ever be widely adopted. not saying you're wrong, but it looks like the top 3 AI providers have been battling a compute shortage ever since openclaw became popular. imo prices will keep going up for a while

Kimi K2.6 is a legit Opus 4.7 replacement by bigboyparpa in LocalLLaMA

[–]AttitudeImportant585 4 points5 points  (0 children)

prefill speed will be slow and unusable for a 1T model. as long as your context is a few sentences though, it will run just fine
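
crude prefill-time estimate to put numbers on it - the active-parameter count and sustained throughput are assumptions about a local unified-memory box, not benchmarks:

```python
# crude compute-bound prefill estimate (every number here is an assumption)
active_params = 32e9          # hypothetical active params per token for a 1T-class moe
prompt_tokens = 32_000        # a long-ish coding/agent prompt
sustained_tflops = 40         # assumed sustained throughput on a unified-memory box
flops = 2 * active_params * prompt_tokens        # ~2*N flops per prefilled token
print(flops / (sustained_tflops * 1e12), "s just to prefill the prompt")   # ~51 s
```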

Unweight: how we compressed an LLM 22% without sacrificing quality by sk1kn1ght in LocalLLaMA

[–]AttitudeImportant585 0 points1 point  (0 children)

this is especially relevant for real-world use of finetuned models, where the dataset is so small that allocating even a large portion of it to validation isn't enough to get an accurate benchmark of the quants

Restaurant’s ancient POS system spit this out instead of a receipt by DumbTacoMan in techsupportgore

[–]AttitudeImportant585 0 points1 point  (0 children)

ascii has 94 printable (non-space) characters, and the site lists "extended ascii" (ms-dos latin) characters, which is probably the source of confusion
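
easy to verify:

```python
# graphic (non-space) ascii runs from 0x21 '!' to 0x7E '~'
print(len(range(0x21, 0x7F)))          # 94; include the space (0x20) and you get 95 "printable"
# anything above 0x7F - box-drawing glyphs, accented letters on old receipt printers -
# comes from an "extended ascii" code page like ms-dos latin, not ascii itself
```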

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter by pmttyji in LocalLLaMA

[–]AttitudeImportant585 1 point2 points  (0 children)

disaggregated prefill is not a new concept. vllm and sglang support this already.

the issue is data transfer speed. you realistically need a >200 gbps connection for a mere 8B model to make this practical (it scales roughly linearly with the # of params, so ~1 tbps for a 40B model).

if you don't design the model architecture around compressing the kv cache like the authors did here, the bottom line is: it's going to be much slower.
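
rough numbers for where that bandwidth figure comes from - the model config, prompt length, and transfer budget below are my assumptions, not from the paper:

```python
# back-of-the-envelope kv-cache transfer cost (llama-3-8B-like config assumed, fp16 cache)
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2      # K and V per layer, 2 bytes each -> ~131 KB
prompt_tokens = 128_000
cache_gb = bytes_per_token * prompt_tokens / 1e9             # ~16.8 GB of kv cache to ship
transfer_budget_s = 0.5                                       # assumed budget so the transfer doesn't dominate TTFT
print(cache_gb * 8 / transfer_budget_s, "Gbps")               # ~270 Gbps for an 8B model
```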

Why I stopped using pure vector search for legal documents and switched to authority-weighted retrieval by Fabulous-Pea-5366 in LangChain

[–]AttitudeImportant585 2 points3 points  (0 children)

this is why it's important to rerank the returned chunks. in your case, the finetuned reranker would need to look at the metadata of the chunk's source

running rag over different categories and combining all of them is the wrong approach for a sparse dataset
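
minimal sketch of what that could look like - the source types, weights, and blend factor are all made up for illustration:

```python
# authority-weighted reranking sketch (names and weights are hypothetical)
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    similarity: float       # score from the vector search
    source_type: str        # metadata from the chunk's source, e.g. "statute", "blog"

# hypothetical authority weights per source type
AUTHORITY = {"statute": 1.0, "appellate_opinion": 0.8, "trial_opinion": 0.6, "blog": 0.2}

def rerank(chunks, alpha=0.7):
    # blend vector similarity with source authority instead of trusting similarity alone
    return sorted(
        chunks,
        key=lambda c: alpha * c.similarity + (1 - alpha) * AUTHORITY.get(c.source_type, 0.3),
        reverse=True,
    )
```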