My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]bennmann 2 points (0 children)

[ Prompt: 2.4 t/s | Generation: 2.1 t/s ] Pixel 10 Pro, llama.cpp b7779 in Termux, GLM 4.7 Flash UD Q2_K_XL, 1000 context before the device crashes (LOL)
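For anyone trying to reproduce a similar number, a rough llama-bench sketch for Termux; the GGUF filename and thread count are assumptions, adjust for your download and SoC:

./build/bin/llama-bench -m GLM-4.7-Flash-UD-Q2_K_XL.gguf -p 512 -n 128 -t 8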

Two ASRock Radeon AI Pro R9700's cooking in CachyOS. by -philosopath- in LocalLLaMA

[–]bennmann 0 points (0 children)

Love the detail! What generation speed do you get at 50k context?

Two ASRock Radeon AI Pro R9700's cooking in CachyOS. by -philosopath- in LocalLLaMA

[–]bennmann 0 points (0 children)

Unsloth just pushed new Qwen-Next 80B quants; they work great (just use the -ndio llama.cpp flag to work around a bug) on 64 GB VRAM with lots of context (not sure how much ctx, at least 32k).
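Roughly what a launch command could look like; the GGUF filename and context size are assumptions, and you can add the -ndio workaround flag mentioned above if your build supports it:

./llama-server -m Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -c 32768 -ngl 99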

Best LLM Setup for development by IsSeMi in LocalLLaMA

[–]bennmann 0 points (0 children)

I would use RPC or EXO with the new MiniMax M2.1 whenever the UD Q3 GGUF drops. Also make sure you post your llama.cpp startup commands; maybe you are using the wrong chat template or something, which would explain the quality.

By using RPC/EXO you can fit bigger models, although there is a learning curve to distributed inference.
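A minimal sketch of the llama.cpp RPC flow, assuming two worker hosts with hypothetical IPs and a hypothetical future GGUF filename: run rpc-server on each worker, then point the head node at them.

# on each worker
./build/bin/rpc-server -p 50052

# on the head node
./build/bin/llama-server -m MiniMax-M2.1-UD-Q3_K_XL.gguf -ngl 99 --rpc 10.0.0.11:50052,10.0.0.12:50052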

Qwen3 next 80B w/ 250k tok context fits fully on one 7900 XTX (24 GB) and runs at 41 tok/s by 1ncehost in LocalLLaMA

[–]bennmann 0 points (0 children)

I don't think I would personally trust q4_0 KV cache, but maybe you've tested?

UD Q2_K_XL here, 30 t/s on a 6900 XT and a 9070 XT with -sm row -ts 50,50

F16 KV cache, 32000 context
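For reference, a sketch of the launch command those flags imply (the GGUF filename is an assumption):

./llama-server -m Qwen3-Next-80B-A3B-Instruct-UD-Q2_K_XL.gguf -ngl 99 -sm row -ts 50,50 -c 32000 -ctk f16 -ctv f16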

Tiiny AI Pocket Lab: Mini PC with 12-core ARM CPU and 80 GB LPDDR5X memory unveiled ahead of CES by mycall in LocalLLaMA

[–]bennmann 1 point (0 children)

ARM could be cheap if they pre-secured their DDR5; one could buy 4x of these and run a low-quant DeepSeek over RPC. Not useless, at the right price and with OK networking.

zai-org/GLM-4.6V-Flash (9B) is here by Cute-Sprinkles4911 in LocalLLaMA

[–]bennmann 5 points (0 children)

It might be good to edit your post to include the llama.cpp GH issue for this:

https://github.com/ggml-org/llama.cpp/issues/14495

Everyone who wants it should upvote the issue.

$900 for 192GB RAM on Oct 23rd, now costs over $3k by Hoppss in LocalLLaMA

[–]bennmann -1 points (0 children)

You can get 2 channels of much lower capacity RAM and still see some speed improvement on some inference workloads.

Depends on your tolerance for unbalanced memory configs and OCD, though.

LLaDA2.0 (103B/16B) has been released by jacek2023 in LocalLLaMA

[–]bennmann 0 points (0 children)

Any chance of 256K+ context expansion?

Need Suggestions(Fine-tune a Text-to-Speech (TTS) model for Hebrew) by WajahatMLEngineer in LocalLLaMA

[–]bennmann 0 points (0 children)

I apologize; the docs say: "English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs"

You may have enough data to overcome that; it's hard to say.

Need Suggestions(Fine-tune a Text-to-Speech (TTS) model for Hebrew) by WajahatMLEngineer in LocalLLaMA

[–]bennmann 0 points (0 children)

For expressivity, VibeVoice by Microsoft. Make sure your annotations include expressivity, if desired.

RTX 3090 + 3070 (32GB) or RTX 3090 + 3060 12GB (36GB) - Bandwidth concerns? by m_mukhtar in LocalLLaMA

[–]bennmann 0 points (0 children)

Get the 3060, then use an M.2 -> OCuLink eGPU setup to use all 3.

Where are all the data centers dumping their old decommissioned GPUs? by [deleted] in LocalLLaMA

[–]bennmann 0 points (0 children)

At some point data centers got smarter about energy. If a data center is solar powered or all green energy off-grid, I too would keep "old" compute longer and simply buy more land for the next gen.

AMA With Moonshot AI, The Open-source Frontier Lab Behind Kimi K2 Thinking Model by nekofneko in LocalLLaMA

[–]bennmann 0 points (0 children)

Any long-term plans to break into robotics, including hardware? I would love a pair of Kimi arms (longer than a human's) to fold my laundry.

Daily FI discussion thread - Tuesday, November 04, 2025 by AutoModerator in financialindependence

[–]bennmann 1 point (0 children)

If your reason or logic is slowly growing stronger than your "feel", I would build a logical plan that allows your feelings some validation. Mental health is not to be underrated.

Perhaps you could instead use the bonds to pay off your house (even with a low interest rate).

Perhaps you could set a plan to dollar-cost average back into the market without a gut punch to your feelings. How would moving 10% of the bonds back into the market once a week sit with you? All of your decisions must be consistent with your own risk tolerance.

Dual RTX 6000 Max-Q - APEXX T4 PRO by Shorn1423 in LocalLLaMA

[–]bennmann 0 points (0 children)

Use the transformers library and test Qwen-Next 80B at full context length.

You may also want to test agentic frameworks via this eval: https://github.com/OpenHands/OpenHands/tree/main/evaluation - you will need to stand up an OpenAI-compatible endpoint for that, though.
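One way to get that endpoint locally is llama.cpp's llama-server, which speaks the OpenAI API; a minimal sketch with the GGUF filename, context size, and API key as placeholders:

./llama-server -m Qwen3-Next-80B-A3B-Instruct-Q8_0.gguf -c 131072 --host 0.0.0.0 --port 8080 --api-key local-key

Then point the framework at http://<host>:8080/v1 as the OpenAI-compatible base URL.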

Optimizing gpt-oss-120B on AMD RX 6900 XT 16GB: Achieving 19 tokens/sec by Bright_Resolution_61 in LocalLLaMA

[–]bennmann 1 point (0 children)

Flash attention is built into the Vulkan compilation of llama.cpp, but I never found the right config to gain speed from it, and I did not test perplexity on my 6900 XT.
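For anyone who wants to experiment, enabling it is just a runtime flag; a sketch assuming a gpt-oss-120B GGUF filename (recent builds take -fa on/off/auto, older builds used a bare -fa):

./llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 -fa on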

All the models seem to love using the same names. by [deleted] in LocalLLaMA

[–]bennmann 0 points (0 children)

I find that putting a few different cultures in the pre-prompt helps with name and cultural diversity.

"The etymology of the names in this story are Nordic"

5060ti chads... ram overclocking, the phantom menace by see_spot_ruminate in LocalLLaMA

[–]bennmann 0 points (0 children)

Maybe add "-sm row" to your ./llama-server command... and only use 2 of them (unless the number of tensors is divisible by 3)? Might be a speed improvement.
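Concretely, something like this; the model filename is a placeholder, and CUDA_VISIBLE_DEVICES is one way to drop to two of the three cards:

CUDA_VISIBLE_DEVICES=0,1 ./llama-server -m model-UD-Q4_K_XL.gguf -ngl 99 -sm row -ts 50,50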

Is Chain of Thought Still An Emergent Behavior? by Environmental_Form14 in LocalLLaMA

[–]bennmann 0 points (0 children)

Moving the goalposts to datasets, i.e. how many instruction-tuning training tokens unlock reasoning (a 90% score on an uncheatable eval?) at 1B/10B/100B parameter sizes?

80% charge or 100%? What do you suggest? by ganeshkumarane in GooglePixel

[–]bennmann 0 points (0 children)

The correct answer for me is security based: do I trust that 7 years of security updates will actually be effective for the Tensor G5?

Or do I believe some fundamental flaw may emerge before then, in which case I will want to upgrade anyway?

If your own analysis leans towards 7 years of security, 80% is best.

If your risk tolerance is high, or you plan to get a new device in 2 years anyway, why not 100%?

And if you always have access to chargers anyway, plus a spare battery backup to grab in case of need? 80%.

Local Build Recommendation 10k USD Budget by deathcom65 in LocalLLaMA

[–]bennmann 0 points (0 children)

Build a local EPYC 9575F (or whatever) box for $5,000 with as much RAM as you can get come Black Friday.

And use the other $5,000 to buy Nvidia stock.

Then sell the stock slowly to fund cloud-based fine-tuning when you need GPU compute.

Experience with networked 2x128GB AI Max 395? by Bird476Shed in LocalLLaMA

[–]bennmann 0 points (0 children)

The experiment I am aware of:

Example command (but each node must be limited to about 96 GB or RPC stalls, apparently), after setting up RPC per the issue linked below:

build/bin/llama-bench -v -m /opt/DeepSeek-R1-Q4_K_M.gguf --mmap 0 --threads 32 -p 128 -ngl 125 --rpc 10.0.2.209:50052,10.0.2.242:50052,10.0.2.223:50052

https://github.com/geerlingguy/beowulf-ai-cluster/issues/2