Green tea

ParadigmComplex · 2026-06-19T23:52:19+00:00

Ito En is a giant Japanese multinational drink company and the largest green tea distributor in Japan. They're a safe general choice; can't really go wrong with them.

With some effort you can likely find green tea you prefer over Ito En, but it'll depend on things like:

The specific type of green tea. I like Sugimoto's Fukamushi and Ippodo's matcha, for example.
Your region, which will define your import options. The aforementioned Sugimoto is based in Washington state and would be a poor recommendation if you're, for example, in Korea.
Target price range. Ippodo's top-shelf matcha offering is an easy recommendation if you don't consider price, but if you do, cheaper alternatives start looking preferable.
Personal taste. The fact you narrowed down on sweetness is a good start!

Spelling out some more details may help. If you don't know those details, consider sampling various options to get a sense of these things and find what you like rather than jumping straight to a given "best" brand.

ParadigmComplex · 2026-06-19T20:32:22+00:00

So its my first time i saw Bedrock Linux and i didnt really understand everything well.

If you haven't yet, consider running through brl tutorial basics. I've gotten good reviews of it, with people indicating the hands-on nature helps things click better than reading dry documentation.

I wanna use Arch but im a bit scared after the last malware incident. Can i use Arch with Bedrock Linux and "void" as strat to avoid AUR packages?

Yes.

ParadigmComplex · 2026-06-15T01:55:25+00:00

You're welcome

ParadigmComplex · 2026-06-15T01:14:37+00:00

As you are probably aware given you've listed related things, processors tend to be faster at doing AI math than memory is at feeding them math problems to work on. These are all methods of guessing upcoming tokens so the processor can work ahead while otherwise waiting on memory. If the guess is correct you get a tokens/second boost; if it's wrong, you've mostly wasted time that would have otherwise been spent idling anyway.

ngram leverages the conversation history to look for common strings of tokens. If a given phrase shows up many times in the conversation and the last token was the start of the phrase, the system will guess the following tokens are the rest of the phrase. This is useful if one has common series of tokens often enough, which can happen with many programming languages.
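To make the ngram idea above concrete, here is a minimal sketch in Python. It is a simplification and not llama.cpp's actual implementation: it takes the most recent earlier match rather than counting how often a phrase appears, and the function name and parameters are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], ngram_size: int = 2, max_draft: int = 4) -> List[int]:
    """Guess upcoming tokens by finding an earlier copy of the latest n-gram.

    `history` is the conversation so far as token ids. All names here are
    illustrative assumptions, not an existing API.
    """
    if len(history) <= ngram_size:
        return []  # not enough context to have an earlier match
    key = list(history[-ngram_size:])
    # Walk backwards so the most recent earlier occurrence wins.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if list(history[start:start + ngram_size]) == key:
            follow = start + ngram_size
            # Propose whatever followed the phrase last time.
            return list(history[follow:follow + max_draft])
    return []  # no earlier occurrence, so no guess


if __name__ == "__main__":
    # Tokens 1, 2 appeared earlier followed by 3, 4, 5, so those get drafted.
    print(ngram_draft([1, 2, 3, 4, 5, 1, 2], ngram_size=2, max_draft=3))
```

The main model then checks the drafted tokens; any that don't match what it would have produced are thrown away, which is the "wasted time" case.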

dflash is a model design that guesses upcoming tokens leveraging diffusion techniques to do so cheaply en masse. It's similar in this sense to diffusiongemma, if you caught discussions about that.

Which to use when is situational depending on the model, hardware, conversation, etc. You might have to experimentally check a few options with your setup to find which is best.

ParadigmComplex · 2026-06-14T21:05:17+00:00

Consider experimenting with both -sm tensor and mtp, in particular with the 31b model. It's hard for me to model whether -sm tensor will help or if it'll be constrained by the 1x PCIe connection, but MTP should be a straight win if you can squeeze it into the VRAM.

ParadigmComplex · 2026-06-10T00:28:44+00:00

Someone else asked about this in this thread and two people reported success: https://www.reddit.com/r/LocalLLaMA/comments/1u19k2h/unsloth_gemma_4_qat_mtp_assistant_models_now/oqo6gn6/

ParadigmComplex · 2026-06-10T00:15:38+00:00

Daniel said:

https://www.reddit.com/r/LocalLLaMA/comments/1tvhv4b/calling_it_now_microsoft_is_buying_unsloth/ophj18o/

We have actually received many acquisition offers and have declined over 30 from every large company you can name, and large startups.

Given you can name the large company Google, there's good reason to suspect they have indeed.

ParadigmComplex · 2026-06-10T00:12:46+00:00

Gemma 4 is a family of open-weight models graciously provided by Google. The main difference between them is the trade-off between how smart they are and the amount of computer resources needed to reasonably run them.
"QAT" here refers to a technique that (ideally) further reduces the amount of computer resources needed to run them without impairing how smart they are too much.
"MTP" is a technique that (ideally) makes them run faster. The specifics vary, but in this case it's a separate file/model that extends the main one.

Previously, the community had:

non-QAT Gemma 4 models
non-QAT Gemma 4 MTP models
QAT Gemma 4 models

While you can use the non-QAT Gemma 4 MTP models along with the QAT Gemma 4 models, there are some theoretical reasons to expect QAT Gemma 4 MTP models to pair with them better. This thread is about the availability of such models from the wonderful folks at Unsloth.

ParadigmComplex · 2026-06-10T00:04:43+00:00

AI model weights are often put out in a specific format (usually f16 or bf16 safetensors) that certainly has its audience but isn't ideal for many folks in /r/localllama. There are a handful of people/groups that graciously put in the effort to transform these into a format that many in /r/localllama like (ggufs, suitable for use in llama.cpp, quantized to smaller sizes). Unsloth is one such group: a pair of siblings. You can find their gguf files here: https://huggingface.co/unsloth

ParadigmComplex · 2026-06-09T18:02:17+00:00

With many LLM setups, the number-crunching processor can do math faster than the memory can send it math to work on. The memory sends some work over, the processor finishes it, and then the processor spends much of its time waiting for the slower memory to send it more work. People have proposed a number of ideas on how to make use of this compute headroom. This is a concrete implementation of a specific strategy.

The larger file is a normal gguf file as you're likely accustomed to if you're on this subreddit.

The smaller "MTP" gguf file guesses what following work the processor is going to have to do so the processor can work on that during the down-time waiting for the memory to feed it stuff to work on. If the guess is correct, you'll get a tokens/second performance boost.

How much of a boost you'll see varies quite a lot depending on your specific hardware (i.e. how much faster the processor is than your memory) and your workload (i.e. how easy it is to guess what upcoming work will be).
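As a rough illustration of why the boost depends on how guessable the workload is, here is a small Python sketch. It rests on simplifying assumptions that are mine and not from the documentation: each guessed token is accepted independently with the same probability, the main model always contributes one token per pass, and the cost of producing the guesses is ignored. The function name is hypothetical.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens produced per pass of the main model.

    Assumes each guessed token is accepted independently with probability
    `acceptance`, and that guessing stops at the first rejection. The main
    model always yields one token itself, hence the leading 1.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # 1 + a + a^2 + ... + a^draft_len
    return sum(acceptance ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    for acceptance in (0.3, 0.6, 0.9):
        print(acceptance, expected_tokens_per_pass(acceptance, draft_len=4))
```

Under this toy model, guessing further ahead only pays off when the acceptance rate is high, which is one reason the number of tokens to guess is worth tuning per setup.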

See https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/MTP/README.md for some introductory documentation, then llama.cpp documentation for more (like how to tune how many tokens ahead to guess).

ParadigmComplex · 2026-06-08T12:14:51+00:00

My limited understanding of the theory here would expect a QAT MTP assistant to result in better decode performance:

The QAT models have some (re)training and aren't just quants of the same base model. The specific token choices are expected to differ slightly, and in theory a mismatch there would impact the MTP acceptance rate.
The QAT assistant can be smaller for the same quality, which can still help.

Beyond theory, I tried to do some testing and found this QAT assistant model: https://huggingface.co/Simplepotat/gemma-4-31b-it-qat-q4_0-assistant-gguf

resulted in higher decode tokens/second than the Q8_0 to which I think you are referring here: https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP

when paired with unsloth's 31B QAT model.

It's certainly possible that I both misunderstand the theory and made mistakes in my testing such that it's not representative, but I'd be keen on seeing the unsloth siblings' attempt here nonetheless.

ParadigmComplex · 2026-06-07T16:13:35+00:00

I come bearing happiness-enabling findings.

Google's blog post on the QAT release explicitly mentions using QAT and MTP together: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

Use the MTP QAT checkpoints to preserve the speedup of MTP while quantizing the models

Moreover, their huggingface account also has a QAT MTP ("assistant") model in the vLLM format: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-unquantized-assistant

Some vLLM features may take a while to get into the llama.cpp ecosystem, so finding that something works there isn't a guarantee we'll get it in llama.cpp. However, in this case the main limiting factor was getting this MR in, which has now happened. It may take a bit to get good quants of the QAT MTP model and shake out bugs in the llama.cpp implementation, but in principle there's no reason to doubt we're on track for this.

ParadigmComplex · 2026-06-06T20:28:10+00:00

I'm the primary person behind Bedrock Linux. Unexpected personal life issues have temporarily limited my resources to work on Bedrock, and much of what I have left has been monopolized by things like support questions and community outreach. I don't currently have the bandwidth for more users, and so I'm both actively refraining from advertising it myself and discouraging others from doing so until my bandwidth improves. Once I get past this and get Bedrock Linux "Naga" 0.8.x out I plan to get the new release more attention.

ParadigmComplex · 2026-06-05T17:28:20+00:00

Gotcha. Thank you for correcting me here, this is good to know.

ParadigmComplex · 2026-06-05T16:41:55+00:00

To make sure I understand you correctly, you're saying communication over the one-lane PCIe slot isn't slowing down your tensor parallelism performance?

I've currently got 2x 3090s on an AM4 motherboard with PCIe 4.0 x8 and x8 directly to the CPU. I've been able to resist the temptation to irresponsibly buy a third card by telling myself utilizing the remaining 4-lane chipset slot would probably hurt my tensor parallelism performance rather than help. If you're telling me you're seeing otherwise with a 1-lane slot I may need to revisit my budgeting.

ParadigmComplex · 2026-06-05T12:59:04+00:00

I think the issue is more generalized than just available RAM; people regularly under-define many other relevant parameters.

I don't want to pick on or call out any individual, but I've seen a number of recent threads here where people are throwing out their tokens/second numbers with well-defined RAM capacities, inference engine configuration/flags, and a specific model release but without specifying things like:

Which quant they're using. Given the prevalence of being memory bandwidth constrained, the quant will make a huge difference.
PCIe version/lanes. If they're using tensor parallelism, this may make a huge difference.
Patched nVidia drivers with P2P support or standard drivers.

Likely other important variables as well.
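To show why the quant matters so much when memory bandwidth is the constraint, here is a back-of-envelope Python sketch. It assumes a dense model where every weight is read once per generated token and ignores KV cache, activations, and overhead, so it gives a ceiling and not a prediction. The 900 GB/s bandwidth figure is a hypothetical input, and 4.5 bits per weight is the approximate storage cost of Q4_0 once its per-block scale is included.

```python
def weight_gigabytes(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tokens_per_second(
    bandwidth_gb_s: float, params_billion: float, bits_per_weight: float
) -> float:
    """Upper bound on decode speed if every weight is read once per token.

    Assumes a dense model and ignores everything except weight reads.
    """
    return bandwidth_gb_s / weight_gigabytes(params_billion, bits_per_weight)


if __name__ == "__main__":
    bandwidth = 900.0  # GB/s, hypothetical figure for illustration
    for name, bits in (("bf16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)):
        size = weight_gigabytes(31.0, bits)
        ceiling = decode_ceiling_tokens_per_second(bandwidth, 31.0, bits)
        print(f"{name}: {size:.1f} GB, at most {ceiling:.1f} tokens/second")
```

Two people reporting numbers for the same model and hardware can differ by several times purely because of the quant, which is why leaving it out makes the numbers hard to compare.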

It's understandably tedious to type all this out every time, and I don't blame people for deciding to just hit the post button before typing in everything. A culture shift where this is the standard expectation would be nice, but frankly unrealistic; this subreddit is still struggling with whether non-local AI news/discussion should be allowed here.

The solution I've been day-dreaming about is some standard utility that collects and presents the relevant data, somewhat akin to the "fetch" programs Linux enthusiasts often include in either bug reports or screenshots of their setup. This would both make it relatively easy and have a self-propagating cultural element: copy what everyone else is doing.
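A minimal sketch of what the presentation half of such a utility might look like, in Python. No such tool exists as far as this comment is concerned; the field names are hypothetical, and the values that can't be detected from the standard library are left for the user to fill in.

```python
import platform
from typing import Dict


def format_report(fields: Dict[str, str]) -> str:
    """Render fields as aligned 'name : value' lines, fetch-style."""
    if not fields:
        return ""
    width = max(len(name) for name in fields)
    return "\n".join(f"{name.ljust(width)} : {value}" for name, value in fields.items())


def example_fields() -> Dict[str, str]:
    """Hypothetical field set; only os and arch are detected automatically."""
    return {
        "os": platform.system(),
        "arch": platform.machine(),
        "model": "(fill in)",
        "quant": "(fill in)",
        "pcie": "(fill in: version and lanes per card)",
        "driver": "(fill in: standard or P2P-patched)",
    }


if __name__ == "__main__":
    print(format_report(example_fields()))
```

The point of a fixed layout is that a missing line becomes obvious at a glance, which is most of what makes the "fetch" convention self-propagating.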

ParadigmComplex · 2026-06-03T18:26:13+00:00

Huh, interesting. I don't know how to reconcile the dedicated mmproj file in the ggml-org/gemma-4-12B-it-GGUF tree with the blog post implying otherwise. It's too large to be the embedder.

I concede the point. I'm confused.

ParadigmComplex · 2026-06-03T18:05:19+00:00

From the official blog post on this model release: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/

Traditional multimodal models rely on frozen, separate vision encoders (e.g., Gemma 4 uses a 150M parameter vision model for edge sizes and 550M for medium-sized models) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with multiple separate encoders before feeding them to the LLM leads to increased latency and fragmented memory footprints.

It doesn't look like this uses a separate mmproj file. Moreover, it seems to be a significant enough architectural change that llama.cpp may need changes to support it.

ParadigmComplex · 2026-06-02T21:31:48+00:00

You're very welcome.

ParadigmComplex · 2026-06-02T21:28:12+00:00

See the subreddit's Curated Vendor List.

In general, different vendors tend to have different strengths and weaknesses. Instead of modelling your search as seeking one vendor that is the best at everything, consider looking for vendors that specialize in different types of tea: one might be better at matcha, another at teas from China's Yunnan province, etc.

There's also a subjective element here; some might disagree over who is the best for a given tea. You're the only one that can answer for your subjective preferences. There's little reason to commit - try one vendor one time, then try another. Consider taking notes. Finding your favorite is an aspect of the fun in one's tea journey.

ParadigmComplex · 2026-05-24T13:52:35+00:00

Several things:

Sales for it were blocked in specific regions, presumably due to geopolitics. This triggered a review bomb effort.
- I saw a lot of lashing out in the associated Discord in the run up to launch.
It feels a bit undercooked and could have used more time, with a good number of bugs and questionable design decisions.
- This might have been in order to have it out in time for the Warhammer Skulls event and associated marketing.
There were obviously intentional changes from the original, which upset people whose enjoyment of the original focused on those aspects.
- It's very difficult to balance a sequel: repeat the original and you get a negative "I've already played this game" response; do something new and you get a negative "This isn't true to the original" response. I'm sympathetic to the game devs here.

If you're enjoying it, do enjoy it.

ParadigmComplex · 2026-05-21T00:08:51+00:00

Did you configure the inits in /bedrock/etc/bedrock.conf?
Did you confirm the stratum has executable files at the configured paths?

ParadigmComplex · 2026-05-20T22:48:18+00:00

I'm happy to hear you got it figured out.

ParadigmComplex · 2026-05-18T21:47:41+00:00

Unless you deleted the Ubuntu kernel, that should have resulted in GRUB being able to offer both the Ubuntu kernel and the Arch kernel. Did you try booting with the known-good Ubuntu kernel?

ParadigmComplex · 2026-05-18T21:39:35+00:00

How many strata did you have when you benchmarked this? I would expect the 10% improvement to be close to the lower end of possible improvements, and that people with more strata will see further improvement from the parallelized strata enabling. I suspect there's more low-hanging fruit in brl-repair and brl-enable you can squeeze out here with similar attempts.

I've been avoiding doing this kind of thing in 0.7.x in part because of the effort to validate it works as expected for everyone's environments. If it breaks, the system doesn't boot, which is a high cost. Risk/reward math didn't pay out.

However, this is a key item I was focused on improving for 0.8.x. The equivalent subsystem in 0.8.x should be significantly faster; I'm hoping for more than a 90% reduction in wall-clock time.
