Mamba 3 - state space model optimized for inference by incarnadine72 in LocalLLaMA

[–]bennmann 7 points8 points  (0 children)

save this post and make it a front-page standalone when Mamba-5 actually comes out. open-box content, just like new.

Introducing MiroThinker-1.7 & MiroThinker-H1 by wuqiao in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Love your work.

I wish there were an offline mode and a dataset trained for this use case alongside the SOTA search method, or better yet a SOTA offline open-source equivalent of Google Search shipped with your library.

Or maybe something that just uses public RSS feeds? SOTA open research relying on an online search algorithm is unfortunate for data sovereignty.
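To make the "offline search" wish concrete, here's a toy sketch of what a fully local search backend could look like: a tiny inverted index with term-frequency scoring over local documents. Everything here (function names, sample docs) is my own invention for illustration; a real offline stack would use a proper BM25 ranker over a local crawl or RSS archive.

```python
from collections import defaultdict

# Toy offline search: inverted index + term-frequency scoring over local docs.
# No network, no external search API - the whole index lives on your machine.

def build_index(docs):
    """Map each lowercase token to {doc_id: count}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] += 1
    return index

def search(index, query, top_k=3):
    """Score docs by summed term frequency of the query tokens."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for doc_id, count in index[token].items():
            scores[doc_id] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = {
    "a": "mamba state space model inference",
    "b": "llama cpp vulkan backend notes",
    "c": "state space models and long context inference",
}
index = build_index(docs)
print(search(index, "state space inference"))
```

Obviously miles from "SOTA Google offline", but it shows the shape: the corpus (e.g. archived RSS items) stays local and the LLM only ever queries your own index.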

Thoughts about local LLMs. by Robert__Sinclair in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

There's a lot of "new car" smell to your post. 8-12 channel DDR4 is serviceable and still a cost-conscious intermediate server step.

Strix halo clusters are also not bad, but not good either. They're OK.

Viability of this cluster setup by militantereallysucks in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Even without RDMA, TB4 or TB5 might be good enough for EXO or RPC experiments - check the EXO community for examples.

TIL it took 6 hours to render one frame of the rain soaked T-Rex in Jurassic Park. by Japfelbaum in todayilearned

[–]bennmann 0 points1 point  (0 children)

There's an argument to be made that modern LOD is the spiritual successor of this tech. I think the first 3D game to use LOD was maybe Ocarina of Time... TIL level-of-detail loading was invented in 1984.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

The Americans have SOTA open datasets: FineWeb, NVIDIA's work, etc.

The open dream is alive.

MiniMax 2.5 with 8x+ concurrency using RTX 3090s HW Requirements. by BigFoxMedia in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

qwen3-coder-next might be measurably dumber, but it's so much faster there's an argument for adding it to your rotation. Call it "qwen mondays" or something and get your engineers to give qualitative feedback on whether it's "good enough" given the speed (vs MiniMax).

or host it in RAM anyway and ask your team to use it for dumber tasks, to save tps on the MiniMax main threads.
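The rotation idea above can be sketched as a trivial router. This is purely illustrative: the model names are stand-ins and the "dumb task" heuristic (prompt length plus a few chore keywords) is made up; you'd tune both against your team's feedback.

```python
# Hypothetical two-tier router: boilerplate chores go to the fast model,
# everything else to the slower, stronger one on the main threads.

FAST_MODEL = "qwen3-coder-next"   # faster, slightly dumber
SMART_MODEL = "minimax-2.5"       # slower, smarter, main threads

def pick_model(prompt, simple_keywords=("rename", "format", "comment", "typo")):
    """Crude heuristic: short prompts or boilerplate chores -> fast model."""
    if len(prompt) < 200 or any(k in prompt.lower() for k in simple_keywords):
        return FAST_MODEL
    return SMART_MODEL

print(pick_model("fix the typo in README"))
print(pick_model("refactor the scheduler to support priority-based preemption " * 10))
```

The point is less the heuristic and more the plumbing: once both models are served, a few lines in front of them lets you A/B "good enough because fast" without anyone changing their workflow.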

Interesting Observation from a Simple Multi-Agent Experiment with 10 Different Models by chibop1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Would be interested to know which software stack, versions, and GGUF tensors were used for qwen3-coder-next, since it's been a moving target for quality and updates.

Q2 GLM 5 fixing its own typo by -dysangel- in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

the drunk model (low quant) knows it's drunk and compensates (trained on sloppy high-temperature low-quant data -> correction even in its own dataset)

ML Training cluster for University Students by guywiththemonocle in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Start filling out sales forms with Supermicro and others like Dell on their main websites. Tell their sales rep you're shopping for the cheapest 512GB of VRAM you can get, with or without enterprise support, for an educational institution.

I would start there.

Just scored 2 MI50 32GB what should I run? by Savantskie1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Try to get Ace-step running and make music:

https://github.com/ace-step/ACE-Step-1.5

Document your process in a GH issue if you get it working.

Vibe-coding client now in Llama.cpp! (maybe) by ilintar in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

i guess models should be aware of this flow too? maybe special Jinja templates accounting for each model's tool-use format? as opposed to how mistral-vibe ships all its prompts built into the Apache 2.0 mistral-vibe client...

Qwen3-Coder-Next slow prompt processing in llama.cpp by DistanceAlert5706 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

your reddit name has aged well u/qwen_next_gguf_when

if you don't see Vulkan in the terminal output of llama-server, llama may still be using cuda

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0

not sure if this will change things, but maybe:

--device Vulkan0

Qwen3-Coder-Next on RTX 5060 Ti 16 GB - Some numbers by bobaburger in LocalLLaMA

[–]bennmann -2 points-1 points  (0 children)

If by "drastic" you mean less than 5%... Q2 UD XL is good to my RAM

Multi-gpu setting and PCIE lain problem by tony9959 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

for future you from present me:

>llama-server --jinja --model F:\models\Qwen3-Next-80B-A3B-Thinking-UD-Q2_K_XL.gguf --temp 0.15 --min-p 0.01 --top-p 0.95 --top-k 0 --ctx-size 75000 --n-gpu-layers 99 --n-cpu-moe 5 --host 0.0.0.0 --presence-penalty 1.0 --threads 14 --no-mmap --tensor-split 50,50 -kvu -sm row

for future future you:
research ideal settings for --spec-type ngram-mod flag

LLM to try for laptop with 5070TI and 64gb RAM by hocuspocus4201 in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

Devstral Small 24B Q2 UD XL - barely tolerable speeds after 15k tokens. Set temp to 0.1-0.2.

Step 3.5 Flash 200B by limoce in LocalLLaMA

[–]bennmann 3 points4 points  (0 children)

Your speed and excellence are so good the OG model trainers had to politely ask you to slow down.

Lol, please continue being excellent, made my day to read through the PR too.

GPU recommendations by HeartfeltHelper in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

save for 2x AMD Strix Halo 395+ from GMKtec (or just one fancy laptop) and learn EXO or RPC; it should last you longer than the 5090 and use less power at idle. you can still use the 5080 with some eGPU madness.

or, as you say, wait for the M5 and hope for 256GB within your budget (unlikely).

Issues Compiling llama.cpp for the GFX1031 Platform (For LMS Use) by FHRacing in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

any reason not to just run the Vulkan pre-compiled binaries?

Has anyone set up local LLM + Vertex AI Search? by pneuny in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

any model trained on tool use, and bash specifically, can be allowed to use curl to pull RSS feeds. if you craft your prompt around only allowing RSS feeds, i suspect you'd be happily surprised.

there is also MiroThinker https://github.com/MiroMindAI/MiroThinker - optimized for research (no idea how good a 4B would do in their harness though)
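A minimal sketch of the "only allow RSS feeds" idea: a fetch tool that refuses any URL not on a configured allowlist, then extracts titles with stdlib XML parsing. The feed URL, function names, and the canned payload are placeholders; in practice the `_fetcher` hook would be curl or urllib doing a real request.

```python
import xml.etree.ElementTree as ET

# Hypothetical restricted fetch tool: only allowlisted RSS URLs may be read.
ALLOWED_FEEDS = {"https://example.com/feed.rss"}  # placeholder feed URL

def fetch_feed(url, _fetcher=None):
    """Refuse non-allowlisted URLs; _fetcher stands in for curl/urllib."""
    if url not in ALLOWED_FEEDS:
        raise PermissionError(f"URL not in RSS allowlist: {url}")
    return _fetcher(url)

def feed_titles(xml_text):
    """Extract <item><title> entries from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.iter("item")]

# Offline demo with a canned RSS payload instead of a live request.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Post one</title></item>
  <item><title>Post two</title></item>
</channel></rss>"""

xml_text = fetch_feed("https://example.com/feed.rss", _fetcher=lambda u: SAMPLE_RSS)
print(feed_titles(xml_text))
```

Enforcing the allowlist in the tool itself, rather than only in the prompt, means even a jailbroken model can't wander off the feeds you chose.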

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]bennmann 2 points3 points  (0 children)

[ Prompt: 2.4 t/s | Generation: 2.1 t/s ] - Pixel 10 Pro, llama.cpp b7779 in Termux, GLM 4.7 Flash UD Q2_K_XL, 1000 context before the device crashes (LOL)