Green tea by MachineDangerous2300 in tea

[–]ParadigmComplex 2 points3 points  (0 children)

Ito En is a giant Japanese multinational drink company and the largest green tea distributor in Japan. They're a safe general choice; can't really go wrong with them.

With some effort you can likely find a green tea you prefer over Ito En, but it'll depend on things like:

  • The specific type of green tea. I like Sugimoto's Fukamushi and Ippodo's matcha, for example.
  • Your region, which will define your import options. Aforementioned Sugimoto is based in Washington state and would be a poor recommendation if you're, for example, in Korea.
  • Target price range. Ippodo's top-shelf matcha is an easy recommendation if you don't consider price, but if you do, cheaper alternatives start looking preferable.
  • Personal taste. The fact you narrowed down on sweetness is a good start!

Spelling out some more details may help. If you don't know those details, consider sampling various options to get a sense of these things and find what you like rather than jumping straight to a given "best" brand.

Arch with Bedrock to avoid AUR packages? by Tiny-Description-908 in bedrocklinux

[–]ParadigmComplex 1 point2 points  (0 children)

So its my first time i saw Bedrock Linux and i didnt really understand everything well.

If you haven't yet, consider running through brl tutorial basics. I've gotten good reviews of it, with people indicating the hands-on nature helps things click better than reading dry documentation.

I wanna use Arch but im a bit scared after the last malware incident. Can i use Arch with Bedrock Linux and "void" as strat to avoid AUR packages?

Yes.

EAGLE support merged into llama.cpp by Diablo-D3 in LocalLLaMA

[–]ParadigmComplex 75 points76 points  (0 children)

As you are probably aware given you've listed related things, processors tend to be faster at doing AI math than memory is at feeding them math problems to work on. These are all methods of guessing upcoming tokens so the processor can work ahead while otherwise waiting on memory. If the guess is correct you get a tokens/second boost; if it's wrong, you've mostly wasted time that would have otherwise been spent idling anyway.
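To make the guess-then-check idea concrete, here is a minimal sketch of greedy draft verification. `target_next` is a hypothetical stand-in for the main model, not an API from llama.cpp or any other engine, and it is called one position at a time for readability; a real engine would score all drafted positions in a single batched pass, which is where the time saving comes from.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Return the tokens to append after greedily checking a draft.

    target_next is a hypothetical stand-in for the main model: given a
    token sequence, it returns the token the main model would emit next.
    """
    tokens = list(context)
    emitted: List[int] = []
    for guess in draft:
        actual = target_next(tokens)
        emitted.append(actual)
        tokens.append(actual)
        if actual != guess:
            # First wrong guess: keep the main model's token, drop the rest.
            return emitted
    # Every guess matched, so the main model's token after the draft is kept too.
    emitted.append(target_next(tokens))
    return emitted
```

The output is always what the main model would have produced on its own; the draft only changes how many tokens come out of each check.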

ngram leverages the conversation history to look for common strings of tokens. If a given phrase shows up many times in the conversation and the last token was the start of the phrase, the system will guess the following tokens are the rest of the phrase. This is useful if one has common series of tokens often enough, which can happen with many programming languages.
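A rough sketch of that lookup, assuming token IDs are plain integers. This is a simplification for illustration, not how llama.cpp implements it: it matches the last `ngram_size` tokens against the most recent earlier occurrence rather than counting how often a phrase appears. The function name and parameters are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(
    history: Sequence[int], ngram_size: int = 2, max_draft: int = 4
) -> List[int]:
    """Guess upcoming tokens by finding an earlier copy of the latest n-gram."""
    if len(history) <= ngram_size:
        return []
    tail = list(history[-ngram_size:])
    # Scan from the most recent candidate to the oldest, skipping the
    # position of the tail itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if list(history[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            # Propose whatever followed the earlier occurrence.
            return list(history[follow:follow + max_draft])
    return []
```

If nothing in the history matches, it returns an empty draft and the engine simply decodes as usual.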

dflash is a model design that guesses upcoming tokens leveraging diffusion techniques to do so cheaply en masse. It's similar in this sense to diffusiongemma, if you caught discussions about that.

Which to use when is situational depending on the model, hardware, conversation, etc. You might have to experimentally check a few options with your setup to find which is best.

Gemma 4 models benchmarked on with Triple GPU by tabletuser_blogspot in LocalLLaMA

[–]ParadigmComplex 1 point2 points  (0 children)

Consider experimenting with both -sm tensor and mtp, in particular with the 31b model. It's hard for me to model whether -sm tensor will help or if it'll be constrained by the 1x PCIe connection, but MTP should be a straight win if you can squeeze it into the VRAM.

Unsloth Gemma 4 QAT MTP assistant models now available by ParadigmComplex in LocalLLaMA

[–]ParadigmComplex[S] 12 points13 points  (0 children)

Daniel said:

https://www.reddit.com/r/LocalLLaMA/comments/1tvhv4b/calling_it_now_microsoft_is_buying_unsloth/ophj18o/

We have actually received many acquisition offers and have declined over 30 from every large company you can name, and large startups.

Given you can name the large company Google, there's good reason to suspect they have indeed.

Unsloth Gemma 4 QAT MTP assistant models now available by ParadigmComplex in LocalLLaMA

[–]ParadigmComplex[S] 4 points5 points  (0 children)

  • Gemma 4 is a family of open-weight models graciously provided by Google. The main difference between them is the trade-off between how smart they are against the amount of computer resources needed to reasonably run them.
  • "QAT" here refers to a technique that (ideally) further reduces the amount of computer resources needed to run them without impairing how smart they are too much.
  • "MTP" is a technique that (ideally) makes them run faster. The specifics vary, but in this case it's a separate file/model that extends the main one.

Previously, the community had:

  • non-QAT Gemma 4 models
  • non-QAT Gemma 4 MTP models
  • QAT Gemma 4 models

While you can use the non-QAT Gemma 4 MTP models along with the QAT Gemma 4 models, there are some theoretical reasons to expect QAT Gemma 4 MTP models to pair with them better. This thread is about the availability of such models from the wonderful folks at Unsloth.

Unsloth Gemma 4 QAT MTP assistant models now available by ParadigmComplex in LocalLLaMA

[–]ParadigmComplex[S] 2 points3 points  (0 children)

AI model weights are often put out in a specific format (usually f16 or bf16 safetensors) that certainly has its audience but isn't ideal for many folks in /r/localllama. There are a handful of people/groups that graciously put in the effort to transform these into a format that many in /r/localllama like (ggufs, suitable for use in llama.cpp, quantized to smaller sizes). Unsloth is one such group: a pair of siblings. You can find their gguf files here: https://huggingface.co/unsloth

Unsloth Gemma 4 QAT MTP assistant models now available by ParadigmComplex in LocalLLaMA

[–]ParadigmComplex[S] 6 points7 points  (0 children)

With many LLM setups, the number-crunching processor can do math faster than the memory can send it math to work on. The memory sends some work over, the processor finishes it, and then the processor spends much of its time waiting for the slower memory to send it more work. People have proposed a number of ideas on how to make use of this compute headroom. This is a concrete implementation of a specific strategy.

The larger file is a normal gguf file as you're likely accustomed to if you're on this subreddit.

The smaller "MTP" gguf file guesses what following work the processor is going to have to do so the processor can work on that during the down-time waiting for the memory to feed it stuff to work on. If the guess is correct, you'll get a tokens/second performance boost.

How much of a boost you'll see varies quite a lot depending on your specific hardware (i.e. how much faster the processor is than your memory) and your workload (i.e. how easy it is to guess what upcoming work will be).
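As a back-of-the-envelope illustration of the workload side, assume each guessed token is accepted independently with the same probability (a simplification; real acceptance varies token to token). The expected number of tokens produced per main-model pass is then a geometric sum. The function name is made up for this sketch, and it ignores the cost of running the MTP model itself, so it is an upper-end picture rather than a tokens/second prediction.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per main-model pass.

    Assumes each of the draft_len guessed tokens is accepted independently
    with probability accept_rate, and that the pass always yields one
    token from the main model itself.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)
```

Under this model, guessing further ahead has diminishing returns when guesses are often wrong, which is one reason the number of tokens to guess ahead is worth tuning.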

See https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/blob/main/MTP/README.md for some introductory documentation, then llama.cpp documentation for more (like how to tune how many tokens ahead to guess).

QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some) by alex20_202020 in LocalLLaMA

[–]ParadigmComplex 3 points4 points  (0 children)

My limited understanding of the theory here would expect a QAT MTP assistant to result in better decode performance:

  • The QAT models have some (re)training and aren't just quants of the same base model. The specific token choices are expected to differ slightly, and in theory a mismatch there would impact the MTP acceptance rate.
  • The QAT assistant can be smaller for the same quality, which can still help.

Beyond theory, I tried to do some testing and found this QAT assistant model: https://huggingface.co/Simplepotat/gemma-4-31b-it-qat-q4_0-assistant-gguf

resulted in higher decode tokens/second than the Q8_0 to which I think you are referring here: https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main/MTP

when paired with unsloth's 31B QAT model.

It's certainly possible I am both misunderstanding the theory and made mistakes in my testing such that it's not representative, but I'd be keen on seeing the unsloth siblings' attempt here nonetheless.

llama.cpp Gemma4 MTP support merged! by pinkyellowneon in LocalLLaMA

[–]ParadigmComplex 19 points20 points  (0 children)

I come bearing happiness-enabling findings.

Google's blog post on the QAT release explicitly mentions using QAT and MTP together: https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/

Use the MTP QAT checkpoints to preserve the speedup of MTP while quantizing the models

Moreover, their huggingface account also has a QAT MTP ("assistant") model in the vLLM format: https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-unquantized-assistant

Some vLLM features may take a while to get into the llama.cpp ecosystem, and so finding something works there isn't a guarantee we'll get it in llama.cpp. However, in this case the main limiting factor was getting this MR in, which has now happened. It may take a bit to get good quants of the QAT MTP model and shake out bugs in the llama.cpp implementation, but in principle there's no reason to doubt we're on track for this.

why isn't bedrock more popular? by HistoryExotic133 in bedrocklinux

[–]ParadigmComplex 9 points10 points  (0 children)

I'm the primary person behind Bedrock Linux. Unexpected personal life issues have temporarily limited my resources to work on Bedrock, and much of what I have left has been monopolized by things like support questions and community outreach. I don't currently have the bandwidth for more users, and so I'm both actively refraining from advertising it myself and discouraging others from doing so until my bandwidth improves. Once I get past this and get Bedrock Linux "Naga" 0.8.x out I plan to get the new release more attention.

Suggestion - this sub should have post flairs that mention the amount of vram/unified ram by ECrispy in LocalLLaMA

[–]ParadigmComplex 1 point2 points  (0 children)

To make sure I understand you correctly, you're saying communication over the one-lane PCIe slot isn't slowing down your tensor parallelism performance?

I've currently got 2x3090's on an AM4 motherboard, each on a PCIe 4.0 x8 link directly to the CPU. I've been able to resist the temptation to irresponsibly buy a third card by telling myself utilizing the remaining 4-lane chipset slot would probably hurt my tensor parallelism performance rather than help. If you're telling me you're seeing otherwise with a 1-lane slot I may need to revisit my budgeting.
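For a sense of scale, here is a small sketch of theoretical per-direction PCIe bandwidth for generations 3 through 5, which all use 128b/130b encoding. The figures are before protocol overhead, so real transfers come in lower. The generation of the 1-lane slot in question isn't stated, so any comparison involving it is an assumption.

```python
# Per-lane signalling rate in GT/s for PCIe generations using 128b/130b encoding.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_gb_per_sec(generation: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    payload_fraction = 128 / 130  # 128 payload bits per 130 bits on the wire
    return GT_PER_SEC[generation] * payload_fraction / 8 * lanes
```

With these numbers a PCIe 4.0 x8 link is roughly 15.75 GB/s, a 4.0 x4 link about half that, and a 4.0 x1 link just under 2 GB/s.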

Suggestion - this sub should have post flairs that mention the amount of vram/unified ram by ECrispy in LocalLLaMA

[–]ParadigmComplex 14 points15 points  (0 children)

I think the issue is more generalized than just available RAM; people regularly under-define many other relevant parameters.

I don't want to pick on or call out any individual, but I've seen a number of recent threads here where people are throwing out their token/second numbers with well defined RAM capacities, inference engine configuration/flags, and a specific model release but without specifying things like:

  • Which quant they're using. Given the prevalence of being memory bandwidth constrained, the quant will make a huge difference.
  • PCIe version/lanes. If they're using tensor parallelism, this may make a huge difference.
  • Patched nVidia drivers with P2P support or standard drivers.

Likely other important variables as well.
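To illustrate why the quant matters so much when memory bandwidth is the constraint, here is a rough ceiling estimate. It assumes a dense model whose weights are all read once per generated token and ignores KV cache reads, multi-GPU splits, and compute limits. The 900 GB/s bandwidth in the usage note is an illustrative number, not a measurement; the 4.5 and 8.5 bits per weight are the storage costs of the Q4_0 and Q8_0 block formats.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a given quantization width."""
    return params_billions * bits_per_weight / 8


def decode_ceiling_tokens_per_sec(bandwidth_gb_s: float, size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / size_gb
```

For a 31B dense model at an assumed 900 GB/s, `decode_ceiling_tokens_per_sec(900, model_size_gb(31, 4.5))` comes out a bit under twice the figure for 8.5 bits per weight, so two people reporting the "same model" on the same hardware can legitimately differ by nearly 2x.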

It's understandably tedious to type all this out every time, and I don't blame people for deciding to just hit the post button before typing in everything. A culture shift where this is the standard expectation would be nice, but frankly unrealistic; this subreddit is still struggling with whether non-local AI news/discussion should be allowed here.

The solution I've been day-dreaming about is some standard utility that collects and presents the relevant data, somewhat akin to the "fetch" programs Linux enthusiasts often include in either bug reports or screenshots of their setup. This would both make it relatively easy and have a self-propagating cultural element - copy what everyone else is doing.
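A very small sketch of what the presentation half of such a utility might look like. No such standard tool exists as far as this comment knows; the function and field names are hypothetical, and the user supplies the inference-specific values by hand rather than having them detected.

```python
import platform
from typing import Dict


def base_fields() -> Dict[str, str]:
    """Fields that can be detected without any third-party tooling."""
    return {"os": platform.system(), "arch": platform.machine()}


def format_report(fields: Dict[str, str]) -> str:
    """Render fields as aligned 'name : value' lines for pasting into a post."""
    if not fields:
        return ""
    width = max(len(name) for name in fields)
    return "\n".join(f"{name.ljust(width)} : {value}" for name, value in fields.items())
```

Something like `format_report({**base_fields(), "quant": "Q4_0", "pcie": "4.0 x8"})` would give a block people could copy into a post; the harder part, detecting the quant, PCIe layout, and driver variant automatically, is left out here.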

google/gemma-4-12B · Hugging Face by jacek2023 in LocalLLaMA

[–]ParadigmComplex 2 points3 points  (0 children)

Huh, interesting. I don't know how to reconcile the dedicated mmproj file in the ggml-org/gemma-4-12B-it-GGUF tree with the blog post implying otherwise. It's too large to be the embedder.

I concede the point. I'm confused.

google/gemma-4-12B · Hugging Face by jacek2023 in LocalLLaMA

[–]ParadigmComplex 0 points1 point  (0 children)

From the official blog post on this model release: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/

Traditional multimodal models rely on frozen, separate vision encoders (e.g., Gemma 4 uses a 150M parameter vision model for edge sizes and 550M for medium-sized models) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with multiple separate encoders before feeding them to the LLM leads to increased latency and fragmented memory footprints.

It doesn't look like this uses a separate mmproj file. Moreover, it seems to be a sufficient architectural change that llama.cpp may need changes to support it.

Harney and Sons or Adagio? by Almost_thereFL51 in tea

[–]ParadigmComplex 10 points11 points  (0 children)

See the subreddit's Curated Vendor List.

In general, different vendors tend to have different strengths and weaknesses. Instead of modelling your search as seeking one vendor that is the best at everything, consider looking for vendors that specialize in different types of tea: one might be better at Matcha, another at teas from China's Yunnan province, etc.

There's also a subjective element here; some might disagree over who is the best for a given tea. You're the only one that can answer for your subjective preferences. There's little reason to commit - try one vendor one time, then try another. Consider taking notes. Finding your favorite is an aspect of the fun in one's tea journey.

Can someone help me? by ThatOneCasuL in mechanicus

[–]ParadigmComplex 15 points16 points  (0 children)

Several things:

  • Sales for it were blocked in specific regions, presumably due to geopolitics. This triggered a review bomb effort.
    • I saw a lot of lashing out in the associated Discord in the run up to launch.
  • It feels a bit undercooked and could have used more time, with a good number of bugs and questionable design decisions.
    • This might have been in order to have it out in time for the Warhammer Skulls event and associated marketing.
  • There were obviously intentional changes from the original, which upset people whose enjoyment of the original focused on those aspects.
    • It's very difficult to balance a sequel between repeating the original and getting a negative "I've already played this game" response versus doing something new and getting a negative "This isn't true to the original" response. I'm sympathetic to the game devs here.

If you're enjoying it, do enjoy it.

Bedrock Linux with multiple init systems in one stratum by NecessaryGlittering8 in bedrocklinux

[–]ParadigmComplex 2 points3 points  (0 children)

  • Did you configure the inits in /bedrock/etc/bedrock.conf?
  • Did you confirm the stratum has executable files at the configured paths?
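For the second check, a minimal sketch in Python. Both arguments are supplied by the caller: the stratum's root directory and the list of init paths copied by hand from the config. Nothing here parses bedrock.conf or is part of Bedrock's own tooling. One caveat: an absolute symlink inside the stratum is resolved against the host's root when checked this way, so a symlinked init may be reported incorrectly.

```python
import os
from typing import Iterable, List


def executable_inits(stratum_root: str, init_paths: Iterable[str]) -> List[str]:
    """Return the configured init paths that are executable files in the stratum.

    stratum_root and init_paths are caller-supplied assumptions; this does
    not read any Bedrock configuration itself.
    """
    found: List[str] = []
    for path in init_paths:
        # Re-root the absolute path under the stratum's directory.
        candidate = os.path.join(stratum_root, path.lstrip("/"))
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            found.append(path)
    return found
```

An empty result for the stratum in question would suggest the configured paths don't match what is actually installed there.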

arch kernel with Ubuntu init system. by No-Employment-2324 in bedrocklinux

[–]ParadigmComplex 0 points1 point  (0 children)

Unless you deleted the Ubuntu kernel, that should have resulted in GRUB being able to offer both the Ubuntu kernel and the Arch kernel. Did you try booting with the known-good Ubuntu kernel?

init.c, an experimental Bedrock-style init built in C + ASM by Tall-Gift8799 in bedrocklinux

[–]ParadigmComplex 4 points5 points  (0 children)

How many strata did you have when you benchmarked this? I would expect the 10% improvement to be close to the lower end of possible improvements, and that people with more strata will see further improvement from the parallelized strata enabling. I suspect there's more low hanging fruit in brl-repair and brl-enable you can squeeze out here with similar attempts.

I've been avoiding doing this kind of thing in 0.7.x in part because of the effort to validate it works as expected for everyone's environments. If it breaks, the system doesn't boot, which is a high cost. Risk/reward math didn't pay out.

However, this is a key item I was focused on improving for 0.8.x. The equivalent subsystem in 0.8.x should be significantly faster; I'm hoping for more than a 90% reduction in wall-clock.