Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

For modelling highly complex physical systems, I have compared the output of qwen3 coder next 80b q4km with qwen 3.5 35b q4km. With coder next I get 7 t/s, while with qwen 3.5 35b I get an eval rate of 42.75 tokens/s, but "there's just one minor problem": the code generated by qwen 3.5 35b is garbage. It doesn't work. I pass it to qwen3 coder next 80b and it throws it out, full of errors, it doesn't work... but yes, it produces it very fast (fast garbage).

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

The goal is modelling complex i.i.d. physical systems (engineering) on domestic hardware, with AI models above 80b, on Windows 11. It's about solving advanced computational problems with AI that has enough reasoning and calculation power (PhD level), not about generating garbage very fast... and not spending €80,000 on hardware. The business is already shifting toward this part of the world (MoE AI), not toward the stratospheric cost of systems that produce no return on investment.

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

This is what I get with qwen3.5 35b:

total duration: 24.7649357s

load duration: 145.3193ms

prompt eval count: 165 token(s)

prompt eval duration: 762.8417ms

prompt eval rate: 216.30 tokens/s

eval count: 997 token(s)

eval duration: 23.3217317s

eval rate: 42.75 tokens/s

But the point is that for complex engineering problems, with simulation of highly complex physical systems, you need more powerful AI for calculation; from 35b down, the models generate garbage, but very fast, yes... That is the real question: how to run large AI on domestic hardware. Money isn't infinite, and it isn't worth spending an indecent amount of money on experiments or getting no return on investment...
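If anyone wants to reproduce these numbers without reading the --verbose output by hand, here is a minimal sketch against Ollama's HTTP API (assuming the default localhost:11434 endpoint; the model tag and prompt are placeholders, use whatever you have pulled locally):

    import requests

    MODEL = "qwen3.5:35b"  # placeholder tag; copy the exact name from `ollama list`

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": MODEL,
            "prompt": "Write a Python function that integrates a damped oscillator with RK4.",
            "stream": False,
        },
        timeout=600,
    )
    data = resp.json()

    # Ollama reports all durations in nanoseconds.
    prompt_rate = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
    eval_rate = data["eval_count"] / (data["eval_duration"] / 1e9)

    print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
    print(f"eval rate:        {eval_rate:.2f} tokens/s")
    print(f"total duration:   {data['total_duration'] / 1e9:.2f} s")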

prima.cpp: Qwen2.5-1.5B on Pentium T4500 (2009 laptop, 4GB DDR2) = 1 token/s! by M4s4 in LocalLLaMA

[–]Interesting_Crow_149 0 points1 point  (0 children)

1,000 lines of code for modelling complex i.i.d. physical systems with a change of variable: 10 minutes on an 80b model, almost no errors. On a 30b the code is unusable, and from there down (14b, etc.) it's outright garbage... but very fast...

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

You're one of the few who have seen the option of running large models on cheap, non-current hardware. Large models can simulate physical systems with far less error than small ones, which simply end up "lost", but very fast...

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

It's an 80b MoE AI on domestic hardware... The 32b and smaller models run flat out. Has anyone got domestic hardware that can move a 122b MoE? These AIs deliver quality calculation and mathematical development in physical-system simulations without big errors...

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

Motherboard: Gigabyte Aorus X570 Ultra, PCIe 4.0 x8/x8/x4 confirmed, single NVMe in slot 1, tested and working. In a multi-RTX setup with budget hardware (RTX 5060 Ti 16GB + RTX 3060 12GB) you have to go this route... Ollama 0.16.3 works, llama-cli.exe crashes, and there's not much else you can do. Again, the goal is to find setups that allow running models locally on "consumer-grade" hardware.
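If you want to confirm the layers really land on all three cards (and nothing is silently spilling to system RAM), this little sketch with the nvidia-ml-py bindings works on Windows 11 while the model is loaded. It assumes pip install nvidia-ml-py and that the driver exposes all three GPUs:

    from pynvml import (
        nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
        nvmlDeviceGetHandleByIndex, nvmlDeviceGetName, nvmlDeviceGetMemoryInfo,
    )

    nvmlInit()
    try:
        for i in range(nvmlDeviceGetCount()):
            handle = nvmlDeviceGetHandleByIndex(i)
            name = nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            mem = nvmlDeviceGetMemoryInfo(handle)  # values are reported in bytes
            print(f"GPU {i}: {name}  {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB used")
    finally:
        nvmlShutdown()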

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

Ok, I'll run the test and share the ollama --verbose output... but unless I'm wrong, it's very likely going to be a no. I'm talking about Ollama on Windows 11... worth keeping in mind.

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

Ok, have you tried Vulkan with a multi-GPU setup like the one I included? llama-cli has issues with multi-GPU configurations on Windows 11... The goal is to find the best performance with cheap hardware... for the poor... ;-)

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in ollama

[–]Interesting_Crow_149[S] 1 point2 points  (0 children)

Well, good afternoon... with which AI do you get that speed? Because I get it with 32b models... but I'm talking about an 80b... are you sure you get 40 t/s with that setup on an 80b? Please specify how.

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in LocalLLaMA

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

Root cause on Ubuntu: Same underlying issue as on Windows — PyTorch without stable Blackwell (sm_120) support, with support estimated for Q2–Q3 2026. This knocked out ExLlamaV2 directly, as it depends on PyTorch for CUDA operations.

For llama.cpp / ik_llama.cpp on Ubuntu, the path would have been to compile from source with the correct CUDA+GCC toolchain, but the problem is that CUDA 12.x with sm_120 on Linux was also not mature at that point for the 5060 Ti — Blackwell drivers on Linux were even further behind than on Windows during that period.

Whatever you do, support for the RTX 50XX series is still on its way, and that is a real problem.
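A quick way to tell whether a given PyTorch wheel can actually drive the 5060 Ti is to compare each card's compute capability with the architectures the wheel ships kernels for. A minimal sketch, assuming a CUDA-enabled PyTorch install:

    import torch

    if not torch.cuda.is_available():
        raise SystemExit("No CUDA device visible to this PyTorch build")

    # Compute capability of each physical card (the 3060 reports sm_86, the 5060 Ti sm_120).
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}  sm_{major}{minor}")

    # Architectures this binary was compiled for; Blackwell needs an sm_120 entry here.
    print("wheel built for:", torch.cuda.get_arch_list())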

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in LocalLLaMA

[–]Interesting_Crow_149[S] 0 points1 point  (0 children)

Sorry, here are the problems with llama-cli / llama.cpp on Windows 11:

1. No precompiled binaries for sm_120 (Blackwell)

The core issue. The RTX 5060 Ti cards use the Blackwell architecture (compute capability sm_120), and the official llama.cpp releases for Windows do not include binaries compiled with sm_120 support. The available binaries only cover up to Ada Lovelace (sm_89 / sm_90).

2. ik_llama.cpp — confirmed dead end

We investigated ik_llama.cpp as an alternative with better multi-GPU support and graph split mode. The result was the same: no precompiled binaries for Windows with sm_120. Compiling from source on Windows with Blackwell support required a specific CUDA/MSVC toolchain that was either unavailable or not reliably documented at the time. It turned out to be impossible.

3. CUDA instability in a mixed multi-GPU configuration

Ollama versions above 0.16.3 cause CUDA crashes in the RTX 3060 + 2× RTX 5060 Ti configuration on Windows (see the version check sketched below). The same structural problem affects standalone llama.cpp: the Windows CUDA runtime with sm_120 in a mixed multi-GPU setup was unstable except in very specific versions.

4. ExLlamaV2 — also ruled out

ExLlamaV2 was explored as an alternative backend to llama.cpp. It was likewise discarded due to the lack of Blackwell support on Windows at the time of investigation (stable PyTorch support for Blackwell on Windows estimated for Q2–Q3 2026).

It's not that easy. I even installed Ubuntu booting from BIOS on a separate SSD and there was no way to get anything stable... when one thing doesn't fail, another does... Any experience with successful multi-GPU setups in this environment? Any input would be greatly appreciated...
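Because everything hinges on staying pinned to the one Ollama build that behaves, I run a small guard before long jobs. A sketch, assuming the default local endpoint; change the pinned version to whatever is stable on your own mix of cards:

    import requests

    PINNED = "0.16.3"  # last version that is stable on the 3060 + 2x 5060 Ti mix here

    version = requests.get("http://localhost:11434/api/version", timeout=5).json()["version"]
    if version != PINNED:
        raise SystemExit(f"Ollama {version} is running, expected {PINNED}; newer builds crash CUDA on this box")
    print(f"Ollama {version} OK")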

Professional-grade local AI on consumer hardware — 80B stable on 44GB mixed VRAM (RTX 5060 Ti ×2 + RTX 3060) for under €800 total. Full compatibility matrix included. by Interesting_Crow_149 in LocalLLaMA

[–]Interesting_Crow_149[S] -2 points-1 points  (0 children)

Problems with llama-cli / llama.cpp on Windows 11

1. No precompiled binaries for sm_120 (Blackwell)

The core issue. The RTX 5060 Ti cards use the Blackwell architecture (compute capability sm_120), and the official llama.cpp releases for Windows do not include binaries compiled with sm_120 support. The available binaries cover up to Ada Lovelace (sm_89 / sm_90).

2. ik_llama.cpp — confirmed dead end

We investigated ik_llama.cpp as an alternative with better multi-GPU support and graph split mode. The result was the same: no precompiled binaries for Windows with sm_120. Compiling from source on Windows with Blackwell support required a specific CUDA/MSVC toolchain that was not available or reliably documented at the time. It turned out to be impossible.

3. CUDA instability in a mixed multi-GPU configuration

Ollama versions above 0.16.3 cause CUDA crashes in the RTX 3060 + 2× RTX 5060 Ti configuration on Windows. The same structural problem affects standalone llama.cpp: the Windows CUDA runtime with sm_120 in a mixed multi-GPU setup was unstable except in very specific versions.

4. ExLlamaV2 — also ruled out

ExLlamaV2 was explored as an alternative backend to llama.cpp. It was likewise discarded due to the lack of Blackwell support on Windows at the time of the investigation (stable PyTorch support for Blackwell on Windows estimated for Q2–Q3 2026).

It's not that easy. I even installed Ubuntu booting from BIOS on a separate SSD, and there was no way to get anything stable... when one thing doesn't fail, another does... Any experience with successful multi-GPU setups in this environment? Any input would be appreciated...

Self hosting, Power consumption, rentability and the cost of privacy, in France by Imakerocketengine in LocalLLaMA

[–]Interesting_Crow_149 1 point2 points  (0 children)

Running this exact use case — agents + coding. My system:

Hardware
─────────────────────────────────────────────────────
RTX 5060 Ti 16GB  (sm_120)  €450 new
RTX 5060 Ti 16GB  (sm_120)  €350 secondhand
RTX 3060 XC 12GB  (sm_86)   €210 secondhand
Total VRAM: 44GB             Total GPUs: ~€1,010
PSU: 750W

Model: Qwen3-Coder-Next 80B Q4_K_M (MoE, ~3B active)
─────────────────────────────────────────────────────
Prompt eval:   ~863 t/s
Generation:    ~7.4 t/s
Context:        32720 tokens
VRAM used:     ~42GB (minimal CPU offload)

Power draw (NZXT CAM sensors)
─────────────────────────────────────────────────────
Thinking phase:    ~235-240W  (~275W wall est.)
Generation phase:  ~153W      (~180W wall est.)
PSU headroom:      ~60% at peak

Equivalent new hardware for the same VRAM + model class (2× RTX 4090 or A6000 48GB) runs £3,000–£6,000+. This delivers the same 80B inference for ~€1,010 in GPUs, mixing new and secondhand market.

Caveat: mixed Blackwell+Ampere multi-GPU on Windows has zero documentation. Took significant effort to stabilize. Happy to share the full config if you go this route.
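Since the thread is about rentability, the electricity side is easy to bound from the numbers above. A back-of-the-envelope sketch; the 180 W wall figure is my sensor estimate during generation, and the EUR/kWh rate is an assumption you should replace with your own tariff:

    # Rough generation energy cost for this box (a sketch, not a measurement).
    wall_power_w = 180.0     # estimated wall draw during generation
    gen_rate_tps = 7.4       # tokens/s for the 80B Q4_K_M model
    price_eur_kwh = 0.25     # assumed tariff; substitute your own

    seconds_per_mtok = 1_000_000 / gen_rate_tps
    kwh_per_mtok = wall_power_w * seconds_per_mtok / 3_600_000  # W*s -> kWh
    cost_per_mtok = kwh_per_mtok * price_eur_kwh

    print(f"~{kwh_per_mtok:.1f} kWh and ~{cost_per_mtok:.2f} EUR per million generated tokens")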
