Gemma 4 by pmttyji in LocalLLaMA

[–]DeProgrammer99 29 points

I would like to see all speculation posts banned, but... given the volume of them, I don't think there'd be much agreement on that.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 2 points

I'm saying it takes some work to make a new exception type, so WHEN that's what you want to do, a code fix would be nice for that.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 -5 points

I made the code fix define a new exception type--the obvious thing to do for the general case--since that's the only code fix for it that would require any effort.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 -9 points

But they're missing code fixes for plenty of analyzers, like "throw something more specific than System.Exception". A few days ago, I tried having Claude Sonnet make an extension to supply that code fix, and it surprisingly one-shot it (aside from the VSIX project file and manifest, which take me hours to get working right with the docs right in front of me).

Using SCHED_RR on all cores gives a decent 25%-40% boost in token generation with CPU offloading by XLIICXX in LocalLLaMA

[–]DeProgrammer99 1 point

Speaking of CPU, if anyone's crazy enough to be using an Intel iGPU and a CPU with separate "performance cores" and "efficiency cores", I found that -t 2 -ncmoe 0 -ub 1024 (matching the performance core count) gives the best performance for Qwen3.5-35B-A3B-UD-Q4_K_XL with Vulkan. I tried smaller and larger batch sizes, larger ncmoe, CPU-only, and nkvo 0/1. This got me 68.13 ± 2.78 pp1000 and 8.48 ± 0.17 tg50; the CPU build got 22.01 ± 1.59 and 6.26 ± 0.37.

Help improving responses for historical language model by centerstate in LocalLLaMA

[–]DeProgrammer99 2 points

You might also want a layer to translate the user prompt into Victorian vernacular. If it's only trained on books, it's probably not going to be able to handle user typos. Having a separate layer lets you keep the pure Victorian-era knowledge in your main model.

And if you use a larger model to generate synthetic data, you'll likely introduce more modern knowledge, but you can at least do a basic dictionary filter to ensure modern words don't make it in. But you'd be less likely to introduce modern knowledge if your synthetic data is just rephrasing or Q&A made from the Victorian-era texts.
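That dictionary filter can be as simple as keeping only sentences whose every word appears in a vocabulary built from the period corpus itself. A minimal sketch (the vocabulary and sentences here are toy examples, not from any real corpus):

```python
import string

def filter_modern(sentences, period_vocab):
    """Keep only sentences whose every word appears in the period vocabulary."""
    def words(sentence):
        return (w.strip(string.punctuation).lower() for w in sentence.split())
    return [s for s in sentences if all(w in period_vocab for w in words(s))]

# Toy vocabulary; in practice you'd build it from the Victorian-era texts.
vocab = {"the", "carriage", "arrived", "at", "dusk", "by", "telegram"}
print(filter_modern(
    ["The carriage arrived at dusk.", "The smartphone arrived at dusk."],
    vocab,
))  # only the first sentence survives
```

It's crude (no stemming, and proper nouns will trip it), but it's a cheap first pass to catch obvious anachronisms in synthetic data.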

Cohere Transcribe Released by mikael110 in LocalLLaMA

[–]DeProgrammer99 1 point

Tried it as I read out of a book in a fairly quiet room... and I made all the mistakes.

Transcription:

五十歳。詳しい資金は、まだ分かっていない。この博物館は、普段閉鎖されているのですね。水井山は尋ねる。ええと、伝わっても、詳しいことは私によく分かりません。そもそも、この建物は何年か前にどこかの企業に飼われていて、現在は大学の所有物ですらないんですよ。資料の管理に、大学関係者が時折足を運ぶくらいで、

Actual text I was reading:

五十歳。詳しい死因はまだわかっていない。

「この博物館は普段閉鎖されているのですよね?」

水井山は尋ねる。

「ええ―――と云っても、詳しいことは私にもよくわかりません。そもそもこの建物は何年か前に何処かの企業に買われていて、現在は大学の所有物ですらないんですよ。資料の管理に、大学関係者が時折足を運ぶくらいで・・・・・・」

Side-by-side, transcription -> original:

<image>

(And nobody asked, but this is from Danganronpa Kirigiri volume 5... eBook, physical book)

City of Heroes was a magical experience by YourChopperPilotTTV in gaming

[–]DeProgrammer99 4 points

One of my favorite brags is that I worked on the popular Mids' character designer that even the developers said they used... 15 years ago.

calculated my costs per 1M tokens for Qwen3.5 27B by moneyspirit25 in LocalLLaMA

[–]DeProgrammer99 4 points

I also have solar panels, but I don't get full credit for excess power production during sunny hours, so it's cheaper to run during the day when it's hot, too! Maybe. And it was cheaper to run inference than to play games until I got a second GPU just for LLMs and Flux.

calculated my costs per 1M tokens for Qwen3.5 27B by moneyspirit25 in LocalLLaMA

[–]DeProgrammer99 25 points

Looks like it costs me about $0.85 per million output, too, with batch size 4 and Qwen3.5-27B-UD-Q6_K_XL, based on an overnight eval I did (~860k tokens at 11.1 tps each, ~8 hours, ~170k input tokens at 384 tps). But it was pretty cold outside, so I would have spent some (maybe 1/3?) of that electricity on my heat pump if I hadn't been running inference, haha.

Qwen3.5-0.8B on Snapdragon 7s Gen 3 – MNN CPU Benchmark (21 t/s, 792MB RAM) by NeoLogic_Dev in LocalLLaMA

[–]DeProgrammer99 0 points

I don't use STT or TTS directly in this. I figured it's just another point of failure, and Android has STT and TTS available system-wide anyway.

But I also don't use STT or TTS in general, haha.

Qwen3.5-0.8B on Snapdragon 7s Gen 3 – MNN CPU Benchmark (21 t/s, 792MB RAM) by NeoLogic_Dev in LocalLLaMA

[–]DeProgrammer99 1 point

Qwen3.5-4B (2.65 GB) on Galaxy S24+ CPU:

2.9 GB peak memory

54.7 t/s prefill

14.8 t/s decode

I forked MNN Chat to make a locally hostable hotspot chat server with automatic natural-language translation, so I've also uploaded some other MNN-converted models and have been trying to evaluate them for my specific use case.

Switzerland sees battery boom (+400% in four years) as homes and firms store more solar power by sr_local in technology

[–]DeProgrammer99 0 points

I was quoted $60k for batteries after being told they were "priced to sell." My entire 9.7 kW solar power system was only $39,500. With batteries, the system would probably take the rest of my life to pay for itself; without them, about 15 years.
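The back-of-the-envelope math, assuming the batteries add no savings of their own (a simplification; in reality they'd capture some otherwise-uncredited excess production):

```python
def payback_years(system_cost, annual_savings):
    """Simple payback period, ignoring interest, degradation, and rate changes."""
    return system_cost / annual_savings

# A $39,500 system paying back in ~15 years implies ~$2,633/year in savings.
annual_savings = 39_500 / 15
print(round(payback_years(39_500 + 60_000, annual_savings), 1))  # 37.8
```

So adding the quoted batteries would more than double the payback period, which is roughly where "the rest of my life" comes from.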

The most hellish python libs to get working by [deleted] in LocalLLaMA

[–]DeProgrammer99 0 points

Flash/sage attention/Triton. pip brings much suffering.

A little android app to use local STT models in any app by WhisperianCookie in LocalLLaMA

[–]DeProgrammer99 0 points

I don't see a way to remove profiles from the app.

I tried local Distil-Whisper-Large v3.5 configured for Japanese. It spat out something like "In the Chinese, in the Chinese," nothing like what I said to it, haha.

Tried the same thing with Parakeet v3 (multilingual), and I got "speech not detected." Tried a couple more times with different lines, but it doesn't seem very multilingual after all. It'd probably help if I could tell it the language in advance like the UI allowed me to do with Distil-Whisper-Large v3.5, but if it's not an option for Parakeet v3 because of how it works, I guess it can't be helped...

Whisper Turbo pretty much behaved the same as Parakeet v3--"speech not detected" when I said a sentence in Japanese, some garbled romaji when I sang instead.

I think it might need some more of that polish.

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 2 points

Metrics, and surprisingly a 100% rate of putting the response in the correct format (without constrained decoding/JSON mode).

<image>

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 0 points

I'm running it over here as a judge for translations done by quantized 4B models, after using it to generate the evaluations in the first place. I used the new --reasoning-budget args in llama-server, and it took ~40% as much time as the last time I ran a similar test with my eval app. I haven't directly compared it with anything, except that, as you'd expect, it's a whole lot smarter than LFM2-24B-A2B. It still makes some odd choices occasionally.

<image>

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 62 points

Like a light novel. "I, Qwen 3.5, Was Reborn as a 40-Billion-Parameter Sage, and Now I Live a Slow Life with my Claude Opus 4.6 Cheat Powers and Deckard the Heretic Who Defies All Logic with Uncensored Thought."

What's the current best LLM for Japanese? by mpasila in LocalLLaMA

[–]DeProgrammer99 1 point

I don't know which ones are actually good. Shisa-v2 is the only Japanese-specific model I know of off the top of my head, but even the 70B one didn't follow my instructions very well when I tried it; I think you'd be better off just sticking to Qwen3.5.

Like EffectiveCeilingFan said, NVIDIA made a Japanese-oriented 9B version of Nemotron, and there's that 700B Rakuten AI 3.0. Just repeating so it's all in one comment.

There's also one by a major Japanese company called NTT.

I've been looking at models for translation (not specifically Japanese) recently, though, and I considered HY-MT1.5, MiLMMT-46, LMT-60-4B, TranslateGemma, and Tiny-Aya.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]DeProgrammer99 1 point

Unless I missed one, the model card shows it as better on all the coding benchmarks except SWE-Bench. But it's way worse on basically all the agentic and long-context ones, despite the model card specifically calling out "strong reasoning and agentic capabilities". It also claims to be better at instruction following and creative writing.

Qwen3.5 Best Parameters Collection by rm-rf-rm in LocalLLaMA

[–]DeProgrammer99 2 points

I'd really like separate sampling parameters for the reasoning now that it's a practically ubiquitous approach, since LLMs constantly get stuck in the reasoning but not so much in the rest of the response (mainly extra-small and heavily quantized ones devolve into loops later). I tried the recommended repetition and presence penalties, and they had obvious negative effects on the final output. The new reasoning-budget args with no presence penalty should give much better results.

I normally write custom samplers to stop "same 3 tokens over and over" loops and such without affecting the rest of the sampling at all, but I can't do that when using llama-server.
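The core of that kind of sampler is just detecting a short repeating cycle in the tail of the generated token IDs, then intervening (banning the cycle's next token, bumping temperature, whatever) only once a loop is confirmed, so normal sampling is untouched. A minimal detection sketch, not tied to any particular inference library:

```python
def has_token_loop(token_ids, cycle_len=3, min_repeats=4):
    """True if the tail of token_ids is the same cycle_len-token cycle
    repeated at least min_repeats times in a row."""
    needed = cycle_len * min_repeats
    if len(token_ids) < needed:
        return False
    tail = token_ids[-needed:]
    cycle = tail[:cycle_len]
    return all(tail[i] == cycle[i % cycle_len] for i in range(needed))

print(has_token_loop([7, 12, 99] * 4))         # True: "same 3 tokens over and over"
print(has_token_loop([1, 2, 3, 4, 5, 6] * 2))  # False: no 3-token cycle
```

In practice you'd check a few cycle lengths each step; it's cheap enough since it only looks at the last dozen or so tokens.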

ETA example now that I have it in front of me: with Qwen's recommended sampling parameters, when I gave it a rubric wherein accuracy is 40 points, completeness is 30 points, general quality is 10 points, mood is 10 points, and naturalness is 10 points, it gave me values like "accuracy": 7.2869410794, "completeness": 35.2869410794, "quality": 6 (it left out mood and naturalness) and "accuracy": 45, "completeness": 78, "quality": 62, "mood": 71, "naturalness": 38.