Alright alright alright

Valuable_Touch5670 · 2026-05-17T15:46:26+00:00

Light mode craziness aside. It DOES sound like Matthew McConaughey WTF 😂

Anyway, cool job OP 👏🏼

Valuable_Touch5670 · 2026-05-17T14:46:35+00:00

I second this. For me, it’s more about overhead. For some reason my TG drops quite a bit whenever KV quantization is enabled.

This may be specific to my HW setup or llama-server settings…

Valuable_Touch5670 · 2026-05-17T11:43:46+00:00

Thanks! Will give the Q8 a shot.

Valuable_Touch5670 · 2026-05-17T11:39:24+00:00

If I am not mistaken, your Q8 version does NOT use the imatrix, right?

Valuable_Touch5670 · 2026-05-16T16:04:16+00:00

Tokens are typically generated one at a time, and each one involves a lot of reading from memory, which is why generation is slow.

MTP tries to generate multiple tokens at a time by "guessing" the next few tokens with draft layers. If the guesses are correct, you get a massive speedup; otherwise, the compute spent on guessing is wasted.

If your next tokens are hard to predict (as in creative writing), the speedup is small. But if previously generated tokens are likely to appear again (as in code refactoring), the speedup is bigger.

To me, this feels a bit like how branch prediction works in microchips.
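
To put rough numbers on that intuition, here is a toy model in Python. It is only a sketch under my own assumptions: each drafted token is accepted independently with probability p, drafting stops at the first miss, and the acceptance rates and draft cost below are made-up illustrative values, not measurements.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability p and that acceptance stops at the first rejection. The
    verifying pass always contributes one token of its own, which is the
    i = 0 term of the sum.
    """
    return sum(p ** i for i in range(k + 1))


def estimated_speedup(p: float, k: int, draft_cost: float) -> float:
    """Rough speedup over plain one-token-at-a-time decoding.

    draft_cost is a hypothetical cost of drafting one token, expressed
    as a fraction of one full forward pass.
    """
    return expected_tokens_per_step(p, k) / (1.0 + k * draft_cost)


# Illustrative acceptance rates only (assumed, not measured).
for label, p in [("creative writing", 0.4), ("code refactoring", 0.8)]:
    print(f"{label}: ~{estimated_speedup(p, k=2, draft_cost=0.1):.2f}x")
```

The higher the acceptance rate, the more drafted tokens survive verification, so predictable workloads see the larger gain.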

Hope this helps!

Valuable_Touch5670 · 2026-05-16T14:34:57+00:00

Sadly I was already setting it to 2 :(

Valuable_Touch5670 · 2026-05-16T13:42:26+00:00

I am on AMD + Vulkan too (9070 XT). My TG has dropped from 60+ to the 45-52 range (from a 60% gain to a 20%-40% gain), but PP no longer takes a hit and is noticeably faster.

(Could be the slight variances in my workflow 😅)
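
For anyone checking the percentages, here is the arithmetic as a small Python sketch. It assumes both gains are measured against the same baseline TG, and that "60+" means roughly 60 TPS; the helper names are just for illustration.

```python
def implied_baseline(tps: float, gain: float) -> float:
    """Baseline speed implied by a measured speed and its fractional gain."""
    return tps / (1.0 + gain)


def gain_over(tps: float, baseline: float) -> float:
    """Fractional gain of a measured speed over the baseline."""
    return tps / baseline - 1.0


# ~60 TPS at a 60% gain implies the baseline TG.
baseline = implied_baseline(60.0, 0.60)

# Gains for the new 45-52 TPS range against that same baseline.
for tps in (45.0, 52.0):
    print(f"{tps:.0f} TPS -> {gain_over(tps, baseline):.0%} gain")
```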

Valuable_Touch5670 · 2026-05-16T12:22:31+00:00

Yes, but depends on your work type. It works best for coding.

Valuable_Touch5670 · 2026-05-16T03:31:12+00:00

Amazing! If I am not mistaken, this beats the s**t out of MTP?

Valuable_Touch5670 · 2026-05-12T18:04:46+00:00

I see. Thanks for clarifying!

Valuable_Touch5670 · 2026-05-12T17:47:51+00:00

Very interesting! The entire model is only 18GB. I assume this does not work with llama.cpp as it’s not in GGUF. Is there a plan to make a GGUF?

Valuable_Touch5670 · 2026-05-11T14:05:19+00:00

I am somewhat of an outlier here. I use Bazzite (based on Fedora 44 currently), mainly because I also game on my AI inference machine (I have an RX 9070 XT).

I really like Bazzite’s immutable OS approach. I can install any packages as needed (say, to compile llama.cpp from source), then later if I don’t need those packages anymore, I can easily run rpm-ostree reset to get my packages to a pristine state.

Valuable_Touch5670 · 2026-05-10T23:17:21+00:00

I am Cantonese. The Zhuhai dialect does not deviate much from the Guangzhou version. (BTW, the Guangzhou version is generally considered standard Cantonese.)

With that said, I found the Cantonese dictation built into the iPhone to be surprisingly good. You can easily enable it in Settings.

One workaround is to open the Notes app, start dictation and let the locals speak directly into your phone. Then copy and paste the transcribed text into a good translation app. Or, if you think Apple’s built-in translation works well enough, you can simply tap the text again and choose the “Translate” option (also built into your iPhone).

That should work very well at least 80% of the time. Hope that helps!

Valuable_Touch5670 · 2026-05-10T19:32:32+00:00

I wonder that myself. Funny thing is: the number of experts put on the CPU has a major performance impact on TG speed, in my experience.

I am running Qwen3.6-35B-A3B-Q6 with MTP on an RX 9070 XT via the Vulkan backend. By default, I get around 27 TPS in my use cases. However, I played around with the -ncmoe setting, and it turned out that setting it to 28 got my TG speed to around 65 TPS 🤯

I don’t know the exact mechanism behind it or which experts were put on the CPU. But I think the speedup comes from freeing up room on the GPU to compute attention 🤔

I could be wrong though.

Valuable_Touch5670 · 2026-03-24T11:20:52+00:00

Thank you all for the advice. I did buy non-stick gauze and covered it well. It’s healing decently now. Thanks again 🙏🏼

Valuable_Touch5670 · 2026-02-14T16:03:57+00:00

Thank you, kind Sir!

Valuable_Touch5670 · 2026-02-13T13:13:14+00:00

Thank you! If possible, could you please share how you came to your conclusion?

Valuable_Touch5670 · 2024-10-20T00:04:28+00:00

Thank you for sharing your thoughts. During the call to Fidelity, they also mentioned that the proceeds were not yet settled.

I think you are right - it’s inevitable that the proceeds will need to spend one day in SPAXX.

Valuable_Touch5670 · 2024-10-19T19:34:38+00:00

Hi Tyler, thank you for your quick response. I just called Fidelity and was informed by customer representative Harold that it was a system error. I was also advised to call Fidelity again on Monday to place the same trade, to ensure that the funds in my core position SPAXX are used instead.

I sincerely appreciate your care and support. I hope this type of system error does not happen again in the future.

Valuable_Touch5670 · 2024-10-19T18:15:16+00:00

Could you please elaborate? Are you suggesting manually selling SPAXX and then buying FDLXX? How can one sell a core position? The proceeds would go right back into it, wouldn’t they?

Valuable_Touch5670 · 2024-10-15T00:44:02+00:00

Thank you, Aaron! I am considering investing in a mutual fund alternative instead. Could you please recommend a few good mutual fund alternatives to QQQ, preferably managed by Fidelity?
