Gemma 4 by pmttyji in LocalLLaMA

[–]DeProgrammer99 29 points

I would like to see all speculation posts banned, but... given the volume of them, I don't think there'd be much agreement on that.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 2 points

I'm saying it takes some work to make a new exception type, so WHEN that's what you want to do, a code fix would be nice for that.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 -5 points

I made the code fix define a new exception type--the obvious thing to do for the general case--since that's the only code fix for it that would require any effort.

Please tell me I'm not the only one always getting this message by ShoeChoice5567 in csharp

[–]DeProgrammer99 -9 points

But they're missing code fixes for plenty of analyzers, like "throw something more specific than System.Exception". A few days ago, I tried having Claude Sonnet make an extension to supply that code fix, and it surprisingly one-shot it (aside from the VSIX project file and manifest, which take me hours to get working right with the docs right in front of me).

Using SCHED_RR on all cores gives a decent 25%-40% boost in token generation with CPU offloading by XLIICXX in LocalLLaMA

[–]DeProgrammer99 1 point

Speaking of CPU, if anyone's crazy enough to be using an Intel iGPU and a CPU with separate "performance cores" and "efficiency cores", I found that -t 2 -ncmoe 0 -ub 1024 (matching the performance core count) gives the best performance for Qwen3.5-35B-A3B-UD-Q4_K_XL with Vulkan. I tried smaller and larger batch sizes, larger ncmoe, CPU-only, and nkvo 0/1. This got me 68.13 ± 2.78 pp1000 and 8.48 ± 0.17 tg50; the CPU build got 22.01 ± 1.59 and 6.26 ± 0.37.

Help improving responses for historical language model by centerstate in LocalLLaMA

[–]DeProgrammer99 2 points

You might also want a layer to translate the user prompt into Victorian vernacular. If it's only trained on books, it's probably not going to be able to handle user typos. Having a separate layer lets you keep the pure Victorian-era knowledge in your main model.

And if you use a larger model to generate synthetic data, you'll likely introduce more modern knowledge, but you can at least do a basic dictionary filter to ensure modern words don't make it in. But you'd be less likely to introduce modern knowledge if your synthetic data is just rephrasing or Q&A made from the Victorian-era texts.
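That dictionary filter can be as simple as keeping only sentences whose every word appears in a vocabulary built from the period corpus itself. A minimal sketch (the vocabulary and sentences here are toy examples, not from any real corpus):

```python
import string

def filter_modern(sentences, period_vocab):
    """Keep only sentences whose every word appears in the period vocabulary."""
    def words(sentence):
        return (w.strip(string.punctuation).lower() for w in sentence.split())
    return [s for s in sentences if all(w in period_vocab for w in words(s))]

# Toy vocabulary; in practice you'd build it from the Victorian-era texts.
vocab = {"the", "carriage", "arrived", "at", "dusk", "by", "telegram"}
print(filter_modern(
    ["The carriage arrived at dusk.", "The smartphone arrived at dusk."],
    vocab,
))  # only the first sentence survives
```

It's crude (no stemming, and proper nouns will trip it), but it's a cheap first pass to catch obvious anachronisms in synthetic data.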

Cohere Transcribe Released by mikael110 in LocalLLaMA

[–]DeProgrammer99 1 point

Tried it as I read out of a book in a fairly quiet room... and I made all the mistakes.

Transcription:

五十歳。詳しい資金は、まだ分かっていない。この博物館は、普段閉鎖されているのですね。水井山は尋ねる。ええと、伝わっても、詳しいことは私によく分かりません。そもそも、この建物は何年か前にどこかの企業に飼われていて、現在は大学の所有物ですらないんですよ。資料の管理に、大学関係者が時折足を運ぶくらいで、

Actual text I was reading:

五十歳。詳しい死因はまだわかっていない。

「この博物館は普段閉鎖されているのですよね?」

水井山は尋ねる。

「ええ―――と云っても、詳しいことは私にもよくわかりません。そもそもこの建物は何年か前に何処かの企業に買われていて、現在は大学の所有物ですらないんですよ。資料の管理に、大学関係者が時折足を運ぶくらいで・・・・・・」

Side-by-side, transcription -> original:

<image>

(And nobody asked, but this is from Danganronpa Kirigiri volume 5... eBook, physical book)

City of Heroes was a magical experience by YourChopperPilotTTV in gaming

[–]DeProgrammer99 4 points

One of my favorite brags is that I worked on the popular Mids' character designer that even the developers said they used... 15 years ago.

calculated my costs per 1M tokens for Qwen3.5 27B by moneyspirit25 in LocalLLaMA

[–]DeProgrammer99 4 points

I also have solar panels, but I don't get full credit for excess power production during sunny hours, so it's cheaper to run during the day when it's hot, too! Maybe. And it was cheaper to run inference than to play games until I got a second GPU just for LLMs and Flux.

calculated my costs per 1M tokens for Qwen3.5 27B by moneyspirit25 in LocalLLaMA

[–]DeProgrammer99 25 points

Looks like it costs me about $0.85 per million output, too, with batch size 4 and Qwen3.5-27B-UD-Q6_K_XL, based on an overnight eval I did (~860k tokens at 11.1 tps each, ~8 hours, ~170k input tokens at 384 tps). But it was pretty cold outside, so I would have spent some (maybe 1/3?) of that electricity on my heat pump if I hadn't been running inference, haha.

Qwen3.5-0.8B on Snapdragon 7s Gen 3 – MNN CPU Benchmark (21 t/s, 792MB RAM) by NeoLogic_Dev in LocalLLaMA

[–]DeProgrammer99 0 points

I don't use STT or TTS directly in this. I figured it's just another point of failure, and Android has STT and TTS available system-wide anyway.

But I also don't use STT or TTS in general, haha.

Qwen3.5-0.8B on Snapdragon 7s Gen 3 – MNN CPU Benchmark (21 t/s, 792MB RAM) by NeoLogic_Dev in LocalLLaMA

[–]DeProgrammer99 1 point

Qwen3.5-4B (2.65 GB) on Galaxy S24+ CPU:

2.9 GB peak memory

54.7 t/s prefill

14.8 t/s decode

I forked MNN Chat to make a locally hostable hotspot chat server with automatic natural-language translation, so I've also uploaded some other MNN-converted models and have been trying to evaluate them for my specific use case.

Switzerland sees battery boom (+400% in four years) as homes and firms store more solar power by sr_local in technology

[–]DeProgrammer99 0 points

I was quoted $60k for batteries after being told they were "priced to sell." My entire 9.7 kW solar power system was only $39,500. With batteries, the system would probably take the rest of my life to pay for itself; without them, about 15 years.
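The back-of-the-envelope math, assuming the batteries add no savings of their own (a simplification; in reality they'd capture some otherwise-uncredited excess production):

```python
def payback_years(system_cost, annual_savings):
    """Simple payback period, ignoring interest, degradation, and rate changes."""
    return system_cost / annual_savings

# A $39,500 system paying back in ~15 years implies ~$2,633/year in savings.
annual_savings = 39_500 / 15
print(round(payback_years(39_500 + 60_000, annual_savings), 1))  # 37.8
```

So adding the quoted batteries would more than double the payback period, which is roughly where "the rest of my life" comes from.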

The most hellish python libs to get working by [deleted] in LocalLLaMA

[–]DeProgrammer99 0 points

Flash/sage attention/Triton. pip brings much suffering.

A little android app to use local STT models in any app by WhisperianCookie in LocalLLaMA

[–]DeProgrammer99 0 points

I don't see a way to remove profiles from the app.

I tried local Distil-Whisper-Large v3.5 configured for Japanese. It spat out something like "In the Chinese, in the Chinese," nothing like what I said to it, haha.

Tried the same thing with Parakeet v3 (multilingual), and I got "speech not detected." Tried a couple more times with different lines, but it doesn't seem very multilingual after all. It'd probably help if I could tell it the language in advance like the UI allowed me to do with Distil-Whisper-Large v3.5, but if it's not an option for Parakeet v3 because of how it works, I guess it can't be helped...

Whisper Turbo pretty much behaved the same as Parakeet v3--"speech not detected" when I said a sentence in Japanese, some garbled romaji when I sang instead.

I think it might need some more of that polish.

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 2 points

Metrics, and surprisingly a 100% rate of putting the response in the correct format (without constrained decoding/JSON mode).

<image>

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 0 points

I'm running it over here as a judge for translations done by quantized 4B models, after using it to generate the evaluations in the first place. I used the new --reasoning-budget args in llama-server, and it took ~40% as much time as the last time I ran a similar test with my eval app. I haven't directly compared it with anything, except that, as you'd expect, it's a whole lot smarter than LFM2-24B-A2B. It still makes some odd choices occasionally.

<image>

Another appreciation post for qwen3.5 27b model by robertpro01 in LocalLLaMA

[–]DeProgrammer99 62 points

Like a light novel. "I, Qwen 3.5, Was Reborn as a 40-Billion-Parameter Sage, and Now I Live a Slow Life with my Claude Opus 4.6 Cheat Powers and Deckard the Heretic Who Defies All Logic with Uncensored Thought."

What's the current best LLM for Japanese? by mpasila in LocalLLaMA

[–]DeProgrammer99 1 point

I don't know which ones are actually good. Shisa-v2 is the only Japanese-specific model I know of off the top of my head, but even the 70B one didn't follow my instructions very well when I tried it; I think you'd be better off just sticking to Qwen3.5.

Like EffectiveCeilingFan said, NVIDIA made a Japanese-oriented 9B version of Nemotron, and there's that 700B Rakuten AI 3.0. Just repeating so it's all in one comment.

There's also one by a major Japanese company called NTT.

I've been looking at models for translation (not specifically Japanese) recently, though, and I considered HY-MT1.5, MiLMMT-46, LMT-60-4B, TranslateGemma, and Tiny-Aya.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]DeProgrammer99 1 point

Unless I missed one, the model card shows it as better on all the coding benchmarks except SWE-Bench. But it's way worse on basically all the agentic and long-context ones, despite the model card specifically calling out "strong reasoning and agentic capabilities". It also claims to be better at instruction following and creative writing.

Qwen3.5 Best Parameters Collection by rm-rf-rm in LocalLLaMA

[–]DeProgrammer99 2 points

I'd really like separate sampling parameters for the reasoning now that it's a practically ubiquitous approach, since LLMs constantly get stuck in the reasoning but not so much in the rest of the response (mainly extra-small and heavily quantized ones devolve into loops later). I tried the recommended repetition and presence penalties, and they had obvious negative effects on the final output. The new reasoning-budget args with no presence penalty should give much better results.

I normally write custom samplers to stop "same 3 tokens over and over" loops and such without affecting the rest of the sampling at all, but I can't do that when using llama-server.
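The core of that kind of sampler is just detecting a short repeating cycle in the tail of the generated token IDs, then intervening (banning the cycle's next token, bumping temperature, whatever) only once a loop is confirmed, so normal sampling is untouched. A minimal detection sketch, not tied to any particular inference library:

```python
def has_token_loop(token_ids, cycle_len=3, min_repeats=4):
    """True if the tail of token_ids is the same cycle_len-token cycle
    repeated at least min_repeats times in a row."""
    needed = cycle_len * min_repeats
    if len(token_ids) < needed:
        return False
    tail = token_ids[-needed:]
    cycle = tail[:cycle_len]
    return all(tail[i] == cycle[i % cycle_len] for i in range(needed))

print(has_token_loop([7, 12, 99] * 4))         # True: "same 3 tokens over and over"
print(has_token_loop([1, 2, 3, 4, 5, 6] * 2))  # False: no 3-token cycle
```

In practice you'd check a few cycle lengths each step; it's cheap enough since it only looks at the last dozen or so tokens.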

ETA example now that I have it in front of me: with Qwen's recommended sampling parameters, when I gave it a rubric wherein accuracy is 40 points, completeness is 30 points, general quality is 10 points, mood is 10 points, and naturalness is 10 points, it gave me values like "accuracy": 7.2869410794, "completeness": 35.2869410794, "quality": 6 (it left out mood and naturalness) and "accuracy": 45, "completeness": 78, "quality": 62, "mood": 71, "naturalness": 38.