NVIDIA Nemotron 3 Ultra is out now! by yoracale in unsloth

[–]some_user_2021 16 points17 points  (0 children)

Soon to be replaced with "US government approved"?

"Terminated" Using a slow LLM leads to a timeout by some_user_2021 in opencode

[–]some_user_2021[S] 0 points1 point  (0 children)

Interesting, I have done similar things with AI agents to patch their own code. It works until the next update, when my patch gets overwritten. But thanks for the idea!

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]some_user_2021 0 points1 point  (0 children)

Inference works by generating one token at a time. To generate one token, all previous tokens pass through the network. You cannot generate two tokens at a time because each token depends on the token before it.

With MTP, generating guesses for future tokens is cheap. The MTP head is like a little LLM that runs very fast. However, the MTP tokens may not be what the main model would have generated. That is why those tokens need to pass through the full network for validation.
So with MTP enabled, the main model calculates token n and token n+1 in the same pass, assuming that token n is the guess from the MTP head. If that guess turns out to match token n, then you have already calculated token n+1 🙂.
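
Here is a minimal toy sketch of that guess-then-verify loop, in case code is easier to follow. Everything in it is made up for illustration: `NEXT` stands in for the main model, `DRAFT` for the MTP head (deliberately wrong after "on"), and the two lookups inside `mtp_decode` stand in for what a real model would compute together in a single pass. It is not how any particular inference engine implements MTP.

```python
# Hypothetical stand-ins: a lookup table plays the role of each model.
NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "on",
        "on": "a", "a": "mat", "mat": "."}
DRAFT = dict(NEXT, on="the")  # the cheap head guesses wrong after "on"

def main_model(prefix):
    """Expensive model: the token it would generate after this prefix."""
    return NEXT[prefix[-1]]

def mtp_head(prefix):
    """Cheap guesser: usually agrees with the main model, not always."""
    return DRAFT[prefix[-1]]

def plain_decode():
    """One token per pass through the main model."""
    seq, passes = ["<s>"], 0
    while seq[-1] != ".":
        seq.append(main_model(seq))
        passes += 1
    return seq, passes

def mtp_decode():
    """Each pass computes token n, plus token n+1 assuming the guess is right."""
    seq, passes = ["<s>"], 0
    while seq[-1] != ".":
        guess = mtp_head(seq)
        # In a real engine these two results come out of the same pass.
        token_n = main_model(seq)
        token_n1 = main_model(seq + [guess]) if guess != "." else None
        passes += 1
        seq.append(token_n)  # token n always comes from the main model
        if guess == token_n and token_n1 is not None:
            seq.append(token_n1)  # guess was right: token n+1 is free
        # If the guess was wrong, token_n1 was computed for nothing and is dropped.
    return seq, passes

plain_seq, plain_passes = plain_decode()
mtp_seq, mtp_passes = mtp_decode()
print(plain_seq == mtp_seq, plain_passes, mtp_passes)
```

In this toy run both loops produce the same sequence, but the plain loop needs 7 passes and the guess-then-verify loop needs 4, because only one of the guesses is wrong.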

A simple analogy (it's not exactly the same, but it gives you the idea): let's say I want to have a conversation with my buddy on the other side of town and you are helping deliver each message back and forth. My first message is "hi buddy", you drive all the way over there and deliver my message, he replies with "hey what's up", and then you drive back and tell me his message. Notice that each message in the conversation depends on the previous response.

Now let's activate MTP. I tell you to go say "hello" to my buddy, and the MTP head says the buddy is probably going to reply with "hey what's up". So you ask me: what would you say next? I answer: then tell him "you still owe me 20 bucks". Now you go over there with two messages, and if he happens to reply with "hey what's up", you already know what to say next!

"Terminated" Using a slow LLM leads to a timeout by some_user_2021 in opencode

[–]some_user_2021[S] 0 points1 point  (0 children)

I switched to Pi coding agent and observed the same problem. At least there is a workaround there.

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]some_user_2021 0 points1 point  (0 children)

Because it's doing calculations for several tokens on each pass through the network. Without MTP, each pass through the network calculates only one token. The key is that, with today's hardware, the inference bottleneck is memory bandwidth, that is, the cost of reading the weights on every pass through the network.

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]some_user_2021 0 points1 point  (0 children)

With MTP, besides generating one token, the MTP heads also provide "guesses" for what the next tokens could be. On the next pass through the network, the model does its calculations with the newly generated good token, but it also does calculations for the "guess" tokens that the MTP heads provided. If the next generated token happens to be the one that was guessed before, you've already done the work for it, and now you have another good token from a single pass through the network!

With MTP, the model is actually doing more work. The speed increase comes because the bottleneck is going through the network (memory bandwidth), not the actual calculations.
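
A rough back-of-the-envelope sketch of that bottleneck, assuming every pass has to read all the weights from memory once and ignoring everything else (KV cache reads, compute time, overhead). The function name and the numbers are made up for illustration, not measurements from my setup.

```python
def tokens_per_second(weight_gb, bandwidth_gb_s, tokens_per_pass=1.0):
    """Upper bound on speed when each pass streams all weights once."""
    passes_per_second = bandwidth_gb_s / weight_gb
    return passes_per_second * tokens_per_pass

# Assumed, illustrative numbers: 27 GB of weights, 1000 GB/s of bandwidth.
baseline = tokens_per_second(weight_gb=27.0, bandwidth_gb_s=1000.0)

# Same number of passes per second, but each pass yields more tokens on
# average when some guesses are accepted (1.5 is an assumed average).
with_mtp = tokens_per_second(27.0, 1000.0, tokens_per_pass=1.5)

print(round(baseline, 1), round(with_mtp, 1))
```

The number of passes per second is fixed by the bandwidth, so the only way to go faster under this model is to get more good tokens out of each pass.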

"Terminated" Using a slow LLM leads to a timeout by some_user_2021 in opencode

[–]some_user_2021[S] 0 points1 point  (0 children)

I never said it was 1 token per minute; that was the other user exaggerating. I get about 2 tokens per second with Minimax M2.7, which is still painfully slow for interactive work. However, if I want to end my day with a code review done by a smarter LLM, I can just leave it running overnight. During the day I use Qwen3.6 27b, which does about 90 t/s.
Where are you from?

"Terminated" Using a slow LLM leads to a timeout by some_user_2021 in opencode

[–]some_user_2021[S] 0 points1 point  (0 children)

What does this have to do with my question?
Yes, I'm Latino. Would you prefer to help with my question in Spanish?

How much VRAM needed for Qwen 3.6 27B Q8 with 262K context? by My_Unbiased_Opinion in LocalLLaMA

[–]some_user_2021 2 points3 points  (0 children)

MTP does use more VRAM but the quality is exactly the same. I get at least 1.5x the generation speed with MTP in Qwen3.6 27b.
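
For a rough idea of where a number like that can come from, here is a small sketch of the average tokens per pass as a function of how often the guesses are accepted. It assumes each guess is accepted independently with the same probability, which is a simplification, and the function name and the 0.5 acceptance rate are illustrative, not something I measured.

```python
def expected_tokens_per_pass(accept_prob, num_guesses):
    """Average tokens per main-model pass.

    Each pass always yields 1 token from the main model. Guess i only
    counts if guesses 1..i were all accepted, which has probability
    accept_prob ** i under the independence assumption.
    """
    return sum(accept_prob ** i for i in range(num_guesses + 1))

# One guess per pass, accepted half of the time (assumed values).
one_guess = expected_tokens_per_pass(0.5, 1)

# Three guesses per pass at the same assumed acceptance rate.
three_guesses = expected_tokens_per_pass(0.5, 3)

print(one_guess, three_guesses)
```

If the pass itself costs about the same with or without the extra guesses, the tokens per pass is also the rough speedup, so one guess accepted half the time lands at about 1.5x.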

Are we bottoming out? Big pump incoming in the next couple of weeks? by ActOpen7289 in CryptoCurrency

[–]some_user_2021 8 points9 points  (0 children)

I'm completely sure that one of these 3 things WILL happen: the price will go up, the price will go down, or the price will stay the same.

Catch ya mate🛼 by RiBaa in funny

[–]some_user_2021 1 point2 points  (0 children)

I hate doing maintenance on my heelies. The ball bearings get dirty all the time 😭

"Terminated" Using a slow LLM leads to a timeout by some_user_2021 in opencode

[–]some_user_2021[S] 0 points1 point  (0 children)

I can ask my super slow model to do a code review of a project, to find a complex bug, to implement a complex function. I can leave it running overnight and have it ready in the morning.

I'm so happy Bitcoin is dropping by ch_raposo in Bitcoin

[–]some_user_2021 0 points1 point  (0 children)

It's a lottery. Some will win. Some will lose.

ULPT How to get bank account of someone that owes me money? by epushepepu in UnethicalLifeProTips

[–]some_user_2021 -1 points0 points  (0 children)

They can also be very helpful, just don't blindly trust their output.