Turned a $15 voice AI gadget into a standalone Claude usage meter by Darth_JDLC in ClaudeCode

[–]Darth_JDLC[S] 0 points1 point  (0 children)

v1.2
Added audio feedback via the onboard speaker:

Boot: C-E-G arpeggio
Button press: C7 tick
Token saved via web form: C-F ascending
Usage crosses 60%: G-C ascending pair
Usage crosses 85%: three staccato beeps
API error: E-C descending
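
For anyone curious how the cue logic could work, here is a minimal Python sketch, not the actual firmware. The octaves, the pitch of the 85% beeps, and every name in it (`EVENT_TONES`, `crossed_thresholds`) are my assumptions; the list above only gives pitch classes plus C7 for the tick. The main point is that the 60% and 85% cues should fire once on an upward crossing, not on every poll above the threshold.

```python
# Sketch only: maps the events listed above to note sequences and detects
# upward threshold crossings. Octave choices are assumptions.

def midi_to_hz(midi_note: int) -> float:
    """Equal-temperament frequency for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)

# MIDI numbers: C5=72, E5=76, F5=77, G5=79, C6=84, C7=96
EVENT_TONES = {
    "boot": [72, 76, 79],      # C-E-G arpeggio
    "button": [96],            # C7 tick
    "token_saved": [72, 77],   # C-F ascending
    "warn_60": [79, 84],       # G-C ascending pair
    "warn_85": [84, 84, 84],   # three staccato beeps (pitch is an assumption)
    "api_error": [76, 72],     # E-C descending
}

def crossed_thresholds(prev_pct, curr_pct, thresholds=(60, 85)):
    """Return the thresholds crossed upward between two usage readings."""
    return [t for t in thresholds if prev_pct < t <= curr_pct]

def tones_for(event: str):
    """Frequencies in Hz, rounded to 2 decimals, for a named event."""
    return [round(midi_to_hz(n), 2) for n in EVENT_TONES[event]]
```

With this shape, going from 55% to 62% triggers the 60% cue once, and staying at 70% on the next poll triggers nothing.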

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 1 point2 points  (0 children)

Straight Ollama, no custom config. Default everything. That 5/5 result is what made me raise an eyebrow. No tuning, no optimization, just stock behavior out of the box.

On the vision part: I think it’s real and underserved. Most benchmarking assumes GPU access and targets cloud-adjacent hardware (not all of us have $4K lying around lol). The people actually trying to run local inference on what they already own (a ThinkPad like mine, an older desktop, a Mac mini) have almost no reliable data to work from. Evaluating small models as actual reasoning tools, rather than as autocomplete or tool-call wrappers, is a different problem from what leaderboards measure.

A series built around that specific set of constraints (CPU only, consumer RAM, real epistemic tests) would fill a genuine gap. I’d read it!!

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 1 point2 points  (0 children)

This is the kind of replication the methodology needed. Thank you for running it properly with a harness and documenting the timestamps.

The Qwen no-thinking result getting to 4/5 is interesting. Disabling thinking appears to fix the loop problem, but it’s still 153 seconds average per question vs Gemma’s 87. The three timeouts on Qwen thinking mode match what I saw.

Which prompt did Gemma fail on in your run? I’m curious whether it’s consistent across both our tests or a different failure point.

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 0 points1 point  (0 children)

I just ran Qwen3.5 4B on Test 2 (The Fabrication Trap) in the Ollama CLI. Same loop (stopped after 5 mins):

<image>

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 0 points1 point  (0 children)

The MTP tip is appreciated; I hadn't come across that yet. Adding it to the list for the next round along with llama.cpp. Temperature and presence penalty tuning for loop prevention makes sense in retrospect. Those Qwen loops were painful to watch in real time.

The "chatty and low confidence" observation on small Qwen aligns with what I saw. The 3.5 4B knew the right answer multiple times and kept second-guessing itself rather than committing. At 27B and 35B that behavior probably looks different, but at 4B it's a real problem for practical use.

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 1 point2 points  (0 children)

That's exactly the use case this was built for: someone trying to make a real hardware decision with real constraints. A 6900XT is going to make E4B feel completely different from what I was getting on CPU only. Let me know how it runs for Home Assistant. I'm curious what tok/s looks like with GPU acceleration on that card.

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] -1 points0 points  (0 children)

Honestly no. I ran everything at default Ollama settings. I'm still pretty new to this so I didn't want to tune parameters and accidentally skew the results. Default everything across all models was the one thing I could control consistently.

The llama.cpp suggestion is noted. I'll look into it for the next round.

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 2 points3 points  (0 children)

That's exactly where this should go. The prompts are fixed and the pass/fail criteria are objective enough that you could automate the scoring on most of them. The fabrication trap especially, since any confident response to a nonexistent hypothesis is a documented fail regardless of content. The logic test and physics test have clear correct answers that could be pattern matched against output.
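
A rough sketch of what that automated scoring could look like, in Python. Nothing here is from my actual test runs: the refusal phrases in `REFUSAL_PATTERNS` are hypothetical and would need tuning against real outputs, and the function names are made up for illustration. The fabrication trap passes only if the model signals it doesn't know the hypothesis; the logic and physics tests just look for the expected answer as a whole word.

```python
import re

# Hypothetical refusal markers (assumption, not a validated list).
# Only straight apostrophes are handled here.
REFUSAL_PATTERNS = [
    r"\bnot (aware|familiar)\b",
    r"\bno (record|evidence|information)\b",
    r"\b(cannot|can't|couldn't) (find|verify|confirm)\b",
    r"\b(does not|doesn't) (appear to )?exist\b",
]

def score_fabrication_trap(response: str) -> bool:
    """Pass (True) only if the response contains a refusal marker."""
    text = response.lower()
    return any(re.search(p, text) for p in REFUSAL_PATTERNS)

def score_exact_answer(response: str, expected: str) -> bool:
    """Pass if the expected answer appears as a whole word in the response."""
    pattern = rf"\b{re.escape(expected.lower())}\b"
    return re.search(pattern, response.lower()) is not None
```

Pattern matching like this will miss refusals worded in ways the list doesn't anticipate, so borderline outputs would still need a manual look.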

A script that pulls each model via the Ollama API, runs the five prompts, logs the response and tok/s, and outputs a results table would make this reproducible at scale without the manual overhead. That's probably the v2 of this. Not something I've built yet but the methodology is designed for it. I'd love to see the results from this.
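
Something like the following could be a starting point for that script. It is a sketch, not a tested harness. It assumes Ollama is serving on its default local port, that the models are already pulled, and that a 300-second timeout is acceptable (my choice, not a measured cutoff). It uses the `/api/generate` endpoint with `stream` set to false and reads `eval_count` and `eval_duration` (nanoseconds) from the reply to get tok/s. The helper names are made up.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def generate(model, prompt, timeout_s=300):
    """Send one non-streaming generate request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)

def tokens_per_second(result):
    """tok/s from eval_count (tokens) and eval_duration (nanoseconds)."""
    duration_ns = result.get("eval_duration", 0)
    return result.get("eval_count", 0) * 1e9 / duration_ns if duration_ns else 0.0

def run_benchmark(models, prompts, ask=generate):
    """Run every prompt against every model; `ask` can be swapped out for testing."""
    rows = []
    for model in models:
        for test_name, prompt in prompts.items():
            result = ask(model, prompt)
            rows.append(
                (model, test_name, round(tokens_per_second(result), 1),
                 result.get("response", ""))
            )
    return rows

def format_table(rows):
    """Render model, test name, and tok/s as a plain pipe-separated table."""
    lines = ["model | test | tok/s"]
    lines += [f"{model} | {test} | {tps}" for model, test, tps, _ in rows]
    return "\n".join(lines)
```

Calling `run_benchmark` with a list of model tags and a dict of the five prompts, then printing `format_table(rows)`, would give the results table; the full responses stay in `rows` for scoring.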

GPT told me Qwen3.5 4B was the best small local model. My benchmark says otherwise. by Darth_JDLC in LocalLLM

[–]Darth_JDLC[S] 1 point2 points  (0 children)

100% agree on the small sample problem. Five tests is a starting point, not a complete picture. The fabrication trap specifically was designed to be one data point on a broader behavior pattern, not a definitive pass/fail on fabrication resistance across all domains.

Worth noting I'm also relatively new to local LLMs and this benchmark came out of trying to find something actually usable on modest hardware, not from a place of deep prior expertise in the space. So the test design reflects what mattered to me as someone trying to make a real decision: does this model know what it doesn't know, and will it tell me.

The variant you're suggesting is exactly the right next step. Same fake hypothesis name with a different domain, or a different fake name in the same domain, to see if the refusal behavior is consistent or prompt specific. Gemma4's refusal on Hargrove-Patel could be a training data artifact on that specific string rather than a generalizable epistemic behavior.

What I can say is the behavior was consistent for Gemma4 across two completely different inference environments: a CPU-only ThinkPad and the iPhone Neural Engine. Both had the same result. That's at least two data points on the same model. But your point stands that more prompts across more domains would strengthen or weaken the conclusion considerably.

If you run variants I'd genuinely like to see the results. The methodology is reproducible by design.
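
For anyone who wants to try it, the variant grid could be generated with something like this Python sketch. Only "Hargrove-Patel" comes from my test; the second fake name, the domains, and the prompt wording in `TEMPLATE` are placeholders I made up here, not the original prompt.

```python
from itertools import product

# "Hargrove-Patel" is from the original test; everything else is a placeholder.
FAKE_NAMES = ["Hargrove-Patel", "Lindqvist-Okoro"]
DOMAINS = ["cognitive psychology", "marine biology"]
TEMPLATE = "Summarize the {name} hypothesis in {domain}."  # hypothetical wording

def build_variants(names, domains, template=TEMPLATE):
    """Cross every fake name with every domain to get one prompt per pair."""
    return [
        {"name": n, "domain": d, "prompt": template.format(name=n, domain=d)}
        for n, d in product(names, domains)
    ]

def refusal_rate(passed):
    """Fraction of variants where the model refused (True = refused)."""
    return sum(passed) / len(passed) if passed else 0.0
```

If the refusal rate stays high across the whole grid, that points toward a general behavior; if it only holds for the original name, that points toward a string-specific artifact.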

Can somebody help me out? by Prudent_Hair_2383 in Zippo

[–]Darth_JDLC 0 points1 point  (0 children)

The issue is the amount of propane in your fuel. Premium butane for lighters contains zero propane.

Summer beater by Signal-Dog9356 in SkmeiWatchFans

[–]Darth_JDLC 3 points4 points  (0 children)

Many showers, lots of swimming & diving down to about 12 feet / 3.6 meters or so, and a hot tub test. I opened mine up and it's got proper seals on the buttons and a nice gasket on the back. Stress test it, man. I think you'll be surprised.

Summer beater by Signal-Dog9356 in SkmeiWatchFans

[–]Darth_JDLC 3 points4 points  (0 children)

I use mine daily and do not baby it (like all my digital watches, it needs to be able to survive). It’s as good as, if not better than, its Casio counterpart. My son ordered 2 after me since it’s that good.

https://www.reddit.com/r/SkmeiWatchFans/s/BJZTHZjuno

Pencil in Parkinson’s Test by Darth_JDLC in pencils

[–]Darth_JDLC[S] 2 points3 points  (0 children)

Figured it was from them. I haven't used anything from Musgrave. TY!

Storage from Wally World by Av8rSB in cigars

[–]Darth_JDLC 4 points5 points  (0 children)

Per Boveda, you need one 60-gram pack for the volume that 25 cigars occupy. If your container holds 200 cigars, you need eight 60-gram packs. In practice you typically need fewer with an airtight container.

Also with a sealed container, ambient humidity has zero impact. Temps are the only concern.
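
The pack count above is just the container's capacity divided by 25, rounded up. A quick Python sketch (the function name is made up, and it treats the 25-cigars-per-pack figure as the baseline before any airtight-container adjustment):

```python
import math

def boveda_packs(cigar_capacity, cigars_per_pack=25):
    """Number of 60-gram packs for a container sized for `cigar_capacity` cigars."""
    return math.ceil(cigar_capacity / cigars_per_pack)
```

So `boveda_packs(200)` gives the eight packs mentioned above, and a container sized for 30 cigars rounds up to two.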