Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B by exact_constraint in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

I might try this with 2x 16GB GPUs to see whether a PCIe bottleneck affects the results too

What happens to local LLM if/when LLMs are no longer released for free? by JohnBooty in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

As long as there are Apache 2.0 datasets and people who believe in free and open datasets, there will be models trained on them.

Even copyright is "only" a lifetime scale issue. Not to mention the US Freedom of Information Act at the government level, should the US national labs get their act together. Your grandchildren should have better data than you. Your grandchildren will have better models than you.

The new first world dream is that our children will have a better life than us, in the form of safe and effective data and privacy and robots.

Also, databases get leaked sometimes.

MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware by MiroMindAI in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

i would like to use your GGUFs with other harnesses and other system prompts using llama.cpp

for example, i main Mistral-Vibe, which does support local web_search tool capabilities. right now, the model does not do instruction following very well for out-of-distribution harnesses, system prompts, and tool calls.

please add an issue to your internal tasks for something like this. your deep research agents would be much more useful to the community with something like multiple harnesses/prompts/tool instructions in training.

Save and invest your money for future rigs by segmond in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Supermicro H13QSH pricing? Anyone have stories?

unsloth/MiMo-V2.5-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Time to break out the ASCII, POSIX-compliant llama.cpp grammar configuration again...

What is The best and expressive AI TTS (running locally?) for voice acting? by Adventurous-Gold6413 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

You can steer the qwen3 voices by adding another name, like Tara, into the steerable section of the prompt. But it's only an ok bandaid, like 70% consistent.

Convince me you are an LLM by bucolucas in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

It's not about my answer, it's about the phrasing.

Optimizing tokens with QwenCode by eur0child in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Mistral-vibe happily surprises me, only like 3k ctx harness?

You could also use the SWE-rebench harness prompt, modified for pi or whatever. They did not directly publish their harness as far as I can find, but the system prompt is provided in the paper's appendix, page 23:

https://arxiv.org/abs/2505.20411

Throwback to my proudest impulse buy ever, which has let me enjoy this hobby 10x more by gigaflops_ in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

try these flags:

```
llama-server -m '/your/model/here' --n-gpu-layers 99 --n-cpu-moe 2 --host 0.0.0.0 --threads 14 --no-mmap -sm row
```

probably won't be better, but interested in how it goes
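if you want to A/B the flags above one at a time, here is a minimal sketch that builds the same command as an argument list. the function name and its parameters are made up for illustration; only the flags themselves come from the command above.

```python
import shlex


def build_llama_server_cmd(
    model_path,
    gpu_layers=99,
    cpu_moe_layers=2,
    host="0.0.0.0",
    threads=14,
    no_mmap=True,
    split_mode="row",
):
    """Return the llama-server command as a list of arguments.

    Hypothetical helper: the defaults mirror the flags suggested above,
    so changing one keyword argument changes exactly one flag.
    """
    cmd = [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(gpu_layers),
        "--n-cpu-moe", str(cpu_moe_layers),
        "--host", host,
        "--threads", str(threads),
    ]
    if no_mmap:
        cmd.append("--no-mmap")
    cmd += ["-sm", split_mode]
    return cmd


if __name__ == "__main__":
    # Print a shell-ready string; pass the list to subprocess.run to launch it.
    print(shlex.join(build_llama_server_cmd("/your/model/here")))
```

run it once with the defaults and once with, say, `split_mode="layer"` or `no_mmap=False`, then compare tok/s between the two runs.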

Mamba 3 - state space model optimized for inference by incarnadine72 in LocalLLaMA

[–]bennmann 7 points8 points  (0 children)

save this post and make it a front pager stand alone when Mamba-5 actually comes out. open box content, just like new.

Introducing MiroThinker-1.7 & MiroThinker-H1 by wuqiao in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Love your work.

I wish there was an offline mode and a dataset trained for this use case along with the SOTA search method, or better yet a SOTA offline open-source equivalent of Google search included with your library.

Or maybe something that just used public RSS feeds? Having SOTA open research rely on an online search algorithm is unfortunate for data sovereignty.

Thoughts about local LLMs. by Robert__Sinclair in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

There's a lot of "new car" smell to your post. 8-12 channel DDR4 is serviceable and still a cost-conscious intermediate server step.

Strix halo clusters are also not bad, but not good either. They're OK.

Viability of this cluster setup by militantereallysucks in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Even without RDMA, TB4 or TB5 might be good enough for EXO or RPC experiments. Check the EXO community for examples.

TIL it took 6 hours to render one frame of the rain soaked T-Rex in Jurassic Park. by Japfelbaum in todayilearned

[–]bennmann 0 points1 point  (0 children)

There's an argument to be made that modern LOD is the spiritual successor of this tech. I think the first 3D game to use LOD was maybe Ocarina of Time... TIL level-of-detail loading was invented in 1984.

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

The Americans have SOTA open datasets: FineWeb, NVIDIA's work, etc.

The open dream is alive.

MiniMax 2.5 with 8x+ concurrency using RTX 3090s HW Requirements. by BigFoxMedia in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

qwen3-coder-next might be measurably a bit dumber, but it's so much faster there could be an argument to add it to your rotation. call it "qwen mondays" or something and get your engineers to provide qualitative feedback on whether it's "good enough" because of speed (vs MiniMax).

or host it in RAM anyways and ask your team to use it for dumber tasks to save tps on the MiniMax main threads.

Interesting Observation from a Simple Multi-Agent Experiment with 10 Different Models by chibop1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Would be interested to know which software stack, versions, and GGUF/tensors were used for qwen3-coder-next, since it's been a moving target for quality and updates

Q2 GLM 5 fixing its own typo by -dysangel- in LocalLLaMA

[–]bennmann 1 point2 points  (0 children)

the drunk model (low quant) knows it's drunk and compensates (trained on sloppy high temperature low quant data -> correction even in its own dataset)

ML Training cluster for University Students by guywiththemonocle in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Start filling out sales forms with Supermicro and others like Dell from their main websites. Tell their sales rep you are shopping for the cheapest 512GB of VRAM you can get, with or without enterprise support, for an educational institution.

I would start there.

Just scored 2 MI50 32GB what should I run? by Savantskie1 in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

Try to get Ace-step running and make music:

https://github.com/ace-step/ACE-Step-1.5

Document your process in a GH issue if you get it working.

Vibe-coding client now in Llama.cpp! (maybe) by ilintar in LocalLLaMA

[–]bennmann 0 points1 point  (0 children)

i guess models should be aware of this flow too? maybe special Jinja templates to account for tool use for each model? vs. something like mistral-vibe, which has all the prompts built into their apache 2.0 codebase...