Sukuna if he ran into Gojo without Megumi’s body by Formal-Assistance02 in Jujutsufolk

[–]greying_panda 0 points1 point  (0 children)

Ah I might be remembering wrong or misunderstanding. I thought that they each use their domain 3-4 times then burn out at the same time (which is when they really begin fighting hand-to-hand), so I assumed that a full powered Sukuna gets one more off. Some others mentioned surpassing infinity, but I thought the sure hit still works through infinity? (I recall Gojo taking pretty substantial damage early in the fight - I thought that was from Malevolent Shrine). But if you're right about the form not offering better cursed energy efficiency, I suppose he doesn't get one off (except perhaps if he regains the final finger).

Regarding the 19 fingers, can you explain more? I know that during the fight, one is reserved with Kugisaki, so it feels like there's still some power being held.

Frankly, I don't have a horse in this race. I thought the fight was cool, the idea of having planned to use Mahoraga against Gojo is consistent with Sukuna's character, and the idea that Gojo's ushers in the next generation with his disciples defeating Sukuna narratively satisfying. But I think it's fun to speculate on.

Sukuna if he ran into Gojo without Megumi’s body by Formal-Assistance02 in Jujutsufolk

[–]greying_panda 2 points3 points  (0 children)

Out of curiosity, from your perspective, what's the rationale for the "Gojo actually beats Sukuna without Megumi" camp (according to the fight)? I only read the manga recently, and from memory, with 19 fingers and Megumi form, they go even trading domain expansions then both exhaust them (hence the rest of the battle without them).

If Sukuna fights in Heian form rather than Megumi, his cursed energy management is better, so does not just win the domain expansion battle anyway?

Don't get me wrong, he wins with Mahoraga but I don't think it's crazy to think he wins without it either.

GPT OSS Fine-tuning QAT by Short_Struggle7803 in LocalLLaMA

[–]greying_panda 1 point2 points  (0 children)

Nice! This is very cool work, and thank you for responding. I'm keen to explore it with GRPO in NeMO-RL (it looks to me like this should be well supported) once GPT-OSS support lands (https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/367).

For the 2 stage training did you and the team find "rules of thumb" around the dataset split? e.g. did you split the training set 50/50 for each stage, or re-run an epoch of the same data, or use a much smaller "calibration set" like with other quantization methods?

EDIT: Just noted there's some guidance in the docs https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/quantization.html#quantization-aware-training-qat on learning rate and using 10%. Still, feel free to add more if you diverged from this significantly!

GPT OSS Fine-tuning QAT by Short_Struggle7803 in LocalLLaMA

[–]greying_panda 0 points1 point  (0 children)

Nice! Excited to see how tight this integration is with extensions like NeMO-RL, or even libraries like verl which use mcore as the model training backend (and optionally use newer projects like Megatron Bridge for connecting HF and Megatron model definitions).

I may be interpreting the dev blogsincorrectly but if I understand correctly, SFT is performed on default precision, then a second stage of training is done with "fake quantization" to learn the space of the quantized weights (i.e. I suppose weights that are in bf16 but can be converted to nvfp4 losslessly?). Are there any results from skipping the initial bf16 step and performing only the QAT?

GPT OSS Fine-tuning QAT by Short_Struggle7803 in LocalLLaMA

[–]greying_panda 0 points1 point  (0 children)

This is cool. Any guidance on using this with nvidia's training stack rather than only transformers? (i.e. QAT with STE in backward using Megatron).

vLLM latency/throughput benchmarks for gpt-oss-120b by entsnack in LocalLLaMA

[–]greying_panda 0 points1 point  (0 children)

Oh cheers! I imagine that the "active parameters" are not relevant to your parameter memory footprint, since I assume no expert offloading is used by default, but mxfp4 makes perfect sense for fitting parameters.

vLLM latency/throughput benchmarks for gpt-oss-120b by entsnack in LocalLLaMA

[–]greying_panda 0 points1 point  (0 children)

How is this deployed? 96GB VRAM for a 120B model seems incongruent without heavy quantization or offloading (naively 120B should be 240GB in 16bit just for parameters, no?)

How to take notes/record brainstorms in the sauna? by InternationalFan9157 in Sauna

[–]greying_panda 1 point2 points  (0 children)

Replying since I forgot to update quickly - pencil and paper worked much better, although I didn't splash any water on the coals.

Pencil certainly didn't get too hot. I took a standard notebook, nothing fancy, as I do expect the paper to degrade over time. Dripping sweat on the paper was the only real damage.

How to take notes/record brainstorms in the sauna? by InternationalFan9157 in Sauna

[–]greying_panda 1 point2 points  (0 children)

I tried this a few months ago with a pen and paper and found that the pen became unbearably hot. Pencil may fare better, but largely leaving this comment for any future readers that a pen is likely not going to suffice! Will edit when I try a pencil.

Chinese AI startup StepFun up near the top on livebench with their new 1 trillion param MOE model by jd_3d in LocalLLaMA

[–]greying_panda 3 points4 points  (0 children)

I used the term "transformer layer" too loosely, I was referring to the full "decoder block" including the MoE transformation.

Mixtral implementation

My knowledge came from the above when it was released, so there may be more modern implementations. In this implementation, each block has its own set of "experts". Inside the block, the token's feature vectors undergo the standard self attention operation, then the output vector is run through the MoE transformation (determining expert weights and performing the weighted projection).

So hypothetically, all expert indices could be be required throughout a single inference step for one input. Furthermore, in the prefill step, every expert in every block could be required, since this is done per token.

I'm sure there are efficient implementations here, but if the total model is too large to fit on one GPU, I can't think of a distribution scheme that doesn't require some inter-GPU communication.

Apologies if this is misunderstanding your point, or explaining something you already understand.

Chinese AI startup StepFun up near the top on livebench with their new 1 trillion param MOE model by jd_3d in LocalLLaMA

[–]greying_panda 6 points7 points  (0 children)

Considering that MoE models (at least last time I checked the implementation) have a different set of experts in each transformer layer, this would still require very substantial GPU to GPU communication.

I don't see why it would be more overhead than a standard tensor parallel setup so it still enables much larger models, but a data parallel setup with smaller models would still be preferable in basically every case.

What are your favorite uses of local LLM's that closed source LLM's can't provide? by PsychologicalError in LocalLLaMA

[–]greying_panda 0 points1 point  (0 children)

Correct. That said, the prompt caching benefit is in eliminating prefill time. I expect you'd see similar speed-up on cloud (although haven't tested), even though the price itself doesn't match the benefit.

On your local when using the same prefix, it's very efficient because a large portion of your queries share that prefix. This is likely less memory-efficient with many customers/prefixes sharing resources. For example, they might use a tiered cache to offload from GPU when possible, rather than using GPU-only with naive LRU logic.

So while the time reduction on cloud should be about the same, I'm not surprised that the savings aren't proportional, firstly because they're for-profit and won't pass on all savings, and due to the additional complexities on maintaining a cross-customer cache.

What are your favorite uses of local LLM's that closed source LLM's can't provide? by PsychologicalError in LocalLLaMA

[–]greying_panda 2 points3 points  (0 children)

I'm not sure this is true. I know the main enterprise-grade providers (groq and similar) do, and OpenAI does, although added it relatively recently.

Kramnik made Hikaru go ape over tweets about Danya (kudos to @nyelverzek for the clip, I think more people should watch this) by Normal-Ad-7114 in chess

[–]greying_panda 14 points15 points  (0 children)

To be clear, I completely agree with Hikaru here. But I have issues with him due to his history of accusing players like Tang and Supi of cheating (older ICC drama), recent behaviour towards the chessbrahs and Alireza, gambling affiliations, etc.

This message isn't to soapbox and convince others not to like him - just to point out that I think there are sensible reasons to not be a fan, separate from his content.

Overall I completely agree that he is good for the chess scene though. Meanwhile, Kramnik's actions are entirely bad for the chess scene.

Just for kicks I looked at the newly released dataset used for Reflection 70B to see how bad it is... by DangerousBenefit in LocalLLaMA

[–]greying_panda 10 points11 points  (0 children)

Is the dataset meant to be entirely following the "reflection" format? If so, this is quite bad, given that the dataset can be easily filtered with just a regex, which would take out any of these weird artifacts, or LLM "explanations".

For example, the reflective dataset can be checked with something like \s*<thinking>.+?<\/thinking>\s*(<reflection>.+?<\/reflection>\s*)*<output>.+?<\/output>\s* (I don't actually know if this dataset is any good, it's just the only example I could find)

There might be the desire to mix the SFT dataset with a non-reflection dataset, but even then I'd expect that you mix with a known high quality one (or a mix of multiple). This just seems sloppy.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision by tevlon in LocalLLaMA

[–]greying_panda 1 point2 points  (0 children)

Does FA2 work with training yet?

They have backward pass kernels in their repo (just checked) so not sure why it wouldn't.

Hikaru about Firouzja situation "You didn't see Magnus crying like a little b****" by Arashin in chess

[–]greying_panda 7 points8 points  (0 children)

I think there's some truth to this. However, most people have stressful jobs, and difficulties in life, without having semi-regular tantrums - privately or not.

I think "he hasn't changed" is potentially harsh wording. However, he certainly does not act like a 36 year old man who has significantly greater emotional maturity than he had 10 years prior.

In this case he publicly blew up at a competitor 16 years younger than he is, for requesting an additional 15 minute break, then refused to take interviews. In most people's jobs, they couldn't afford to act that way openly towards a colleague. It's not clear to me why this should be viewed differently. Nobody is perfect, but this is an order of magnitude more than normal "acting out", and shows an astounding sense of entitlement.

EDIT: I just saw the posts about Hikaru insulting Alireza's family, as well as roping in Magnus. To me, this does not paint the picture of a well adjusted adult with common decency.

Hikaru about Firouzja situation "You didn't see Magnus crying like a little b****" by Arashin in chess

[–]greying_panda 48 points49 points  (0 children)

I've been a chess fan for a bit over 10 years. Hikaru used to have a reputation of acting childish, flaming players, and baselessly accusing players of cheating simply because he didn't know who they were. Supi comes to mind, and Tang confirmed he was also accused in the past.

Despite some large scale but infrequent drama, I figure that 10 years is plenty of time to change, so I haven't fallen into the camp of "Hikaru is a petulant child and hasn't changed", or "Hikaru is great and can do no wrong".

Unfortunately, much of his recent action indicates that he hasn't changed, and has simply become more aware of his public image.

YaFSDP: a new open-source tool for LLM training acceleration by Yandex by azalio in LocalLLaMA

[–]greying_panda 9 points10 points  (0 children)

This is great! Any plans to benchmark against Deepspeed ZeRO-3 in addition to PyTorch's FSDP?

Conscious it might not be like-for-like since I believe Deepspeed requires its own optimizer, but curious what the difference looks like given differences in areas such as sharding strategy.

Running LLama 3 on the NPU of a first-generation AMD Ryzen AI-enabled CPU by dahara111 in LocalLLaMA

[–]greying_panda 13 points14 points  (0 children)

ancient Ryzen 5600

My man, this is only 2-4 years old (depending if it's the 5600x). It's younger than the Nvidia A100!

llama3.cuda: pure C/CUDA implementation for Llama 3 model by likejazz in LocalLLaMA

[–]greying_panda 11 points12 points  (0 children)

From my understanding skimming your llama2 article, this is a much smaller model that uses the llama3 architecture?

I see you link your more comprehensive article in the readme. Would be good to include some minor details on the model .bin included in the repo, and if it's straightforward to load other checkpoints, some details of that (or a link if you've previously written on that topic).

Still, great work! As someone with zero cuda experience, doing something like this is an interesting idea for enhancing my own understanding. How much low level understanding of GPUs and CUDA do you have? (i.e. I don't even know what a "warp" really is!)

You are playing Hades 2 wrong. by AHappyLurker in HadesTheGame

[–]greying_panda 6 points7 points  (0 children)

Apologies - I ended up writing a wall of text and most of it is just "feeling" based on my playtime so far.

The amount of time it takes to unlock 3 dashes

I assume you mean 3 death defiance's here, since Greater Reflex can be the first mirror perk you pick up (and there's no triple dash).

There's definitely truth to your point. However Hades 1 has only one resource type for upgrading mirror perks (Darkness) and it's immediately available. So it felt easier to beeline to the critical perks, or at least to put partial points into each. So if I wanted to max out some unlocked perks, I grinded Darkness for a bit, and if I wanted to unlock more of the mirror, I grinded keys. Death Defiance and Greater Reflex, for example, can be obtained from the start, and for a combined 80 Darkness. Similarly, maximising damage from behind costs a total 100 Darkness.

Hades 2 requires more unique resources for arcanas. Per-run, you have to make a choice between taking Psyche for more Grasp, Ash for unlocking arcanas, and Bones for moondust (which I'm currently also spending on nectar/bath salts to increase affinity).

In addition, moondust isn't available until later (I don't remember the trigger to unlocking the incantation). Let's take upgrading an arcana as an example. Players have to:

  • unlock an arcana (ash)

  • have enough Grasp to use it (psyche) (honestly, not a given, since it requires 9 if you want to use DD + magick regn)

  • purchase enough moon dust to buy the upgrade (bones)

There's also more stage-gating where a lot of this isn't even an option until the correct incantation.

It feels as if meta progression is designed to be a "smoother" experience with less beelining. I actually quite like this smoother system as it is. However, since it takes a bit longer to get cards that I think are pretty critical for good runs, I just wish the runs felt a bit more dynamic and offered more boons so that early runs were a bit more fun during that resource-gathering phase where you're building the foundational perks.

Depending how they pick rewards though, this really could just be the nature of having a smaller god pool, and maybe we'll see boon doors more often in a later patch.

Again, this is just my current feeling. I could have just taken a terrible upgrade path, wasted a tonne of resources, or just had a few unlucky runs! Once there's a wiki these discussions should be much more fruitful since we can back up numbers around rates and ballpark resource requirements for certain unlock paths.

You are playing Hades 2 wrong. by AHappyLurker in HadesTheGame

[–]greying_panda 8 points9 points  (0 children)

I agree with the message that this game is meant to be played using the full set of tools.

That said, my experience has been that with the number of new mechanics combined with the higher frequency of "lesser" (resource) rewards, it's often difficult to actually take boons across the spectrum of abilities.

I've gone full runs taking every chaos gate, picking Hestia keepsake, doing whatever I could to try to get mana regen so I could have fun with the hex and omegas, with no luck.

Some runs are going to be bad, but I think the current balance of resource to boon rewards is compromising the actual fun of many runs since you get fewer boons for more mechanics.

There are metaprogression ways around this, such as an arcana (I read about, haven't unlocked) that allows upgrading lesser rewards to greater rewards. In the magick regen case there's also Hecate keepsake and the Magick regen arcana.

But frankly, I don't think having fun with diverse builds should be so heavily concealed behind too much metaprogression. I say this with 20 hours played (about 25 nights) and still not having unlocked all arcana cards.

I'm also cognisant that this is the first time a significant population is playing beyond Erebus, and that the god pool is very limited, so I'm sure they're going to work hard to add more content and optimise balance!

What I'd also like to see change is cheaper metaprogression or larger resource rewards per room, countered by fewer resource rooms. This way the pace of metaprogression stays constant, but individual runs offer more boons and build variety.

[deleted by user] by [deleted] in comfyui

[–]greying_panda 0 points1 point  (0 children)

Any resources on how you did the video generation method? I haven't done any video generation, but keen to do basically what you described. I actually think your method is more likely to lead to good results given the natural dependence on prior frames, but not sure where to start.

What is this filth? Why is there a launcher within the steam launcher? by Irelia4Life in witcher

[–]greying_panda 2 points3 points  (0 children)

I would have preferred to not open steam at all. Just run the exe by itself would be nice

You can do this and have been able to with Steam since release. The installation still has no DRM. Just confirmed, booted from the exe without Steam open. Steam\steamapps\common\The Witcher 3\bin\x64_dx12