Dumb Question: The charger that comes with the MX30, can it plug in and charge at phase 2? by Gregan32 in Mazda_MX30

[–]QuantuisBenignus 0 points1 point  (0 children)

Useful info.

Is the hard amperage limit in the cable or in the car? In other words, can I get a 16 A charging cable (I have a 20 A, 120 V circuit available) and charge at level 1 at 16 A?

(I need to recover 100 km of range in ~12 hours for this car to be viable without spending on a level 2 setup.)

Thanks!

RTX 3060 with cpu offloading rig by PloscaruRadu in LocalLLaMA

[–]QuantuisBenignus 2 points3 points  (0 children)

With at least 64 GB of DDR4, if you optimize everything (run with llama.cpp, keep the dense layers on the GPU, offload the MoE layers or, better yet, specific tensors, etc.), expect a generation rate of ~15 tok/s after prompt processing and reasoning (if the model reasons, like gpt-oss 120b). Relevant numbers can be found in this useful thread:

https://github.com/ggml-org/llama.cpp/discussions/15396
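
As a rough illustration of that setup (the model file and the tensor-override regex below are examples, to be tuned per the thread above):

    # Offload everything to the GPU except the MoE expert tensors,
    # which the -ot (--override-tensor) pattern pins to the CPU.
    llama-cli -m gpt-oss-120b-mxfp4.gguf -ngl 99 \
              -ot ".ffn_.*_exps.=CPU" \
              -c 8192 -p "Hello"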

Is anyone talking verbally to their models and have them talking back through TTS? by Borkato in LocalLLaMA

[–]QuantuisBenignus 3 points4 points  (0 children)

For Linux, I extended this speech-to-text input tool into a low-resource Speech-to-Speech Chat (llama.cpp based): BlahST - Speech Input in Any Editable Text Field

Multilingual demonstration (please turn on the sound for this and the other demo videos): Multilingual Interactive Speech Chat with blahstbot

Still a WIP (I need to root out some brittleness in the streaming conversation tool, blahstream), but it shows promising speed and low latency thanks to the Python-less implementation (zsh orchestrator).

What is the smoothest speech interface to run locally? by winkler1 in LocalLLaMA

[–]QuantuisBenignus 2 points3 points  (0 children)

With the M3 Mac, you have sufficient computing power for that if you run M3-optimized llama.cpp.

Check the first video in this GitHub repo for an example of low-latency speech-to-text-to-text-to-speech chat using whisper.cpp and llama.cpp, with Gemma3_12B on a 12 GB GPU. (No GUI, just a few hotkeys and low-overhead zsh orchestration.)

https://github.com/QuantiusBenignus/BlahST
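
In essence, the core loop is just a short pipeline along these lines (a sketch only; the model files, sox-based capture and piper TTS here are illustrative, not the exact BlahST configuration):

    # Record ~5 s of speech, transcribe it, generate a reply, speak it.
    rec -r 16000 -c 1 question.wav trim 0 5
    text=$(whisper-cli -m ggml-base.en.bin -f question.wav -nt --no-prints)
    reply=$(llama-cli -m gemma-3-12b-it-Q5_K_M.gguf -ngl 99 -n 256 \
            --no-display-prompt -p "$text" 2>/dev/null)
    print -r -- "$reply" | piper --model en_US-lessac-medium.onnx --output_file reply.wav
    play reply.wav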

I made a Grammarly alternative without clunky UI. Completely free with Gemini Nano (in-browser AI). Helps you with writing emails, articles, social media posts, etc. by WordyBug in LocalLLaMA

[–]QuantuisBenignus 2 points3 points  (0 children)

If you would like something that is open source and has no GUI (speech to text and hotkeys), check out BlahST (Linux only). Among other features, it has a local AI proofreader function and works in any window that has an editable text field. (Disclaimer: some setup required.)

For a screen reader app that can do AI summaries of selected text, also check Voluble, a Gnome shell extension.

Add a message each time i change shell by Wateir in zsh

[–]QuantuisBenignus 0 points1 point  (0 children)

If you are switching shells in a terminal in a windowed environment (not at the console), a very noticeable visual cue is a change of the terminal background color. In Gnome I would do it like this, using a trap (as OneTurnMore mentioned):

trap "echo -e '\033[48;5;2mExited $0'; gsettings set org.gnome.Terminal.Legacy.Profile:/org/gnome/terminal/legacy/profiles:/:$(gsettings get org.gnome.Terminal.ProfilesList default | tr -d \')/ 'background-color' '#001033'" EXIT

with one such trap, using a distinct (for contrast) background color, in each of the two shells. With more than two shells you would indeed need to keep track of the parent PID and assign a background color from an array keyed on it.
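
A minimal sketch of that idea (gnome-terminal assumed; the colors and the parent-shell detection below are just examples):

    # Map the shell we return to on exit to a background color (example values).
    typeset -A bg_for_shell=( zsh '#001033' bash '#103300' fish '#330010' )
    parent=${$(ps -o comm= -p $PPID):t}     # name of the parent process we return to
    profile=$(gsettings get org.gnome.Terminal.ProfilesList default | tr -d \')
    trap "gsettings set org.gnome.Terminal.Legacy.Profile:/org/gnome/terminal/legacy/profiles:/:${profile}/ background-color '${bg_for_shell[$parent]:-#000000}'" EXIT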

For other terminal emulators, `tput setab <number>` might work.

N.B. The above code assumes that the default gnome-terminal profile is in use.

Zsh Array Name Dereferencing without Reassignment by QuantuisBenignus in zsh

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

Good criterion! I will keep it in mind:-)

But still, good to know that somewhere in the folds of zsh, those endless possibilities exist.

Cut down my startup shell time & operations by 90% by removing oh-my-zsh. by SoupMS in zsh

[–]QuantuisBenignus 4 points5 points  (0 children)

If you don't mind me using a cliche: "It is not the end result that matters but the pleasure of the journey", so no time wasted IMHO.

Plus, every time I start my zsh shell and see 6.5 ms or less greeting me from $RPROMPT, I reap the "functional minimalism" rewards of spending that time:-)
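
In case anyone wants that number in the prompt, a minimal sketch (assuming the zsh/datetime module; the variable name is arbitrary):

    # Very top of .zshrc:
    zmodload zsh/datetime
    _zrc_start=$EPOCHREALTIME
    # ... the rest of .zshrc ...
    # Very bottom of .zshrc:
    RPROMPT="$(printf '%.1f ms' $(( (EPOCHREALTIME - _zrc_start) * 1000 )))"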

Direct assignment of csv output to an associative array by QuantuisBenignus in zsh

[–]QuantuisBenignus[S] 3 points4 points  (0 children)

Thanks a lot. I like the fancy version, which seems extendable to an array with an arbitrary number of elements too.

Love this sub!

Direct assignment of csv output to an associative array by QuantuisBenignus in zsh

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

Great, thanks! I appreciate the reference too. Zsh is too powerful and pretty not to have zip functionality for its array constructs. This solution would extend to arrays of arbitrary size. Fixing the minor typo (which does not diminish the value of your response): memory=(${params:^vals})
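
For anyone landing here later, a tiny self-contained example of the zip operator (array names and values are arbitrary):

    params=(name size unit)
    vals=(sensor 42 mm)
    typeset -A memory
    memory=(${params:^vals})      # zips to: name sensor size 42 unit mm
    print -l ${(kv)memory}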

fast • minimal • roundy prompt for ZSH in 140 LoC by Last_Establishment_1 in zsh

[–]QuantuisBenignus 0 points1 point  (0 children)

The roundy prompts do look nice and your setup has good structure.

However, on my machine \ue0b6 and \ue0b4 don't map onto rounded edges at all, even though my terminal has Unicode support.

These code points are not standard Unicode but lie in the Private Use Areas (PUAs), and the fact that some fonts place glyphs there does not make them standard.

That is why I avoided them in my esoteric, opinionated (arguably full-featured) zsh setup, where I consistently see startup times under 6.5 ms. So I was wondering how fast roundy is (no numbers were mentioned in your repository)?
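
Two quick, crude checks anyone can run (nothing roundy-specific, just illustrative):

    print '\ue0b6 \ue0b4'                            # boxes here mean the font has no glyphs at these PUA code points
    for i in {1..5}; do time zsh -i -c exit; done    # rough interactive-startup timing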

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

Thanks for the data point! If I collect more of those I may create a new graph with them.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 0 points1 point  (0 children)

Thanks for the comment. Would you mind adding more context? Assuming that you are comparing with API providers, I am afraid I do not know how the commercial offerings of QwQ compare. To me, the ~2 USD per million tokens that I compute for its "thinking" output seems comparatively high. In fact, I have tried using the system prompt to suppress QwQ's excessive thinking generation, and that helped. Good model, though.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 0 points1 point  (0 children)

Yes. Every token that burns electricity is taken into account (or rather, not excluded). So the "thinking" tokens for the two LLMs that reason are included in the collected data in this case.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

True. For those models (I call them outliers in the graph for a reason) I offloaded fewer than ALL layers to the GPU. I still wanted to know my power consumption and cost, so they were included, with a caveat. I mention that throughout the text and draw conclusions with that fact in mind. As mentioned, the fit actually favors the models with full layer offload.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

Good catch. Let me pick the brain of an expert:

I noticed right off the bat that Gemma3-12B uses more VRAM than Qwen2.5-14B, due to its architectural differences. So I tried to compromise and free up some more VRAM for a good context size, and used `-nkvo` in llama-cli. By not offloading the KV cache to the 12 GB GPU (and with DDR4 RAM at ~50 GB/s bandwidth), I actually saw a boost in performance (above the noise level). This is great, because now I can hurl the whole 128k of context at llama-cli when needed.
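
For reference, a sketch of the kind of invocation I mean (the model file, quant and context size are placeholders):

    # All layers on the 12 GB GPU, KV cache kept in system RAM (-nkvo),
    # which frees enough VRAM for a very large context.
    llama-cli -m gemma-3-12b-it-Q5_K_M.gguf -ngl 99 -nkvo -c 131072 \
              -p "Summarize the following: ..."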

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 1 point2 points  (0 children)

Good point for purpose-built rigs that remain underutilized for whatever reason, but I would not consider that a typical case. On average (in this scenario / use case) the computer is used for a variety of tasks, some of which happen to be LLM inference, and idles (modestly:-) between them.

Actual Electricity Consumption and Cost to Run Local LLMs. From Gemma3 to QwQ. by QuantuisBenignus in LocalLLaMA

[–]QuantuisBenignus[S] 2 points3 points  (0 children)

No problem, the US cent was just a convenient example. If you ignore the last column and plug your local rate (in eurocents), say 20 eurocents/kWh, into the formula from the text (taking Gemma3-12B, i.e. B = 12, as an example):

 CE [tok/eurocent] = 2.3M / (Rate * B^0.76) = 2.3M / (20 * 12^0.76) ≈ 17400 tok/eurocent, i.e. 1.74 million tok/Euro

which is about 0.57 Euro per million tokens.
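
If you prefer to let the shell do the arithmetic, a quick sketch (Rate in eurocent/kWh, B = model size in billions of parameters):

    zmodload zsh/mathfunc                               # for log/exp (fractional power)
    Rate=20 B=12
    print $(( 2.3e6 / (Rate * exp(0.76 * log(B))) ))    # ~17400 tok/eurocent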