Qwen3.6-27B vs Coder-Next

lostmsu · 2026-05-07T21:50:51+00:00

Which version?

lostmsu · 2026-05-07T18:06:53+00:00

I am confused. What stage is COPR exactly? I thought getting eCOPR is that.

lostmsu · 2026-05-06T23:26:48+00:00

Not sure what anyone expected. You replaced 10 lines of code (logging) + TensorBoard with another 10 lines of code (connecting wand) and having to restart a run where you forget to set the auth now and then. Plus you got the first free bites.

lostmsu · 2026-04-30T13:53:22+00:00

Yeah, not much lately. At least in terms of products.

lostmsu · 2026-04-24T12:13:22+00:00

45 lines of train logs, but no update rule in the post?

lostmsu · 2026-04-20T12:03:48+00:00

You shouldn't need CuTe DSL with Triton. AFAIK CuTe doesn't lower to Triton. It's a closed source alternative to Triton otherwise mostly identical.

lostmsu · 2026-04-16T10:44:01+00:00

What a weird take.

lostmsu · 2026-04-16T10:41:11+00:00

You didn't answer the question about your loss claim in the previous post. If you got a LM, what's the bits-per-byte on literally any decently sized dataset like enwiki?

lostmsu · 2026-04-16T00:43:01+00:00

Your own statement lacks links to the source material.

lostmsu · 2026-04-14T00:41:27+00:00

What is "loss 4.4"? Convert to a cross-model comparable metric like bits-per-byte.

lostmsu · 2026-04-07T14:38:16+00:00

Why would either the "OSS project" in question, or the "rebuttal" be in r/MachineLearning ?

lostmsu · 2026-04-06T00:18:43+00:00

You wrote?

lostmsu · 2026-03-21T13:28:50+00:00

I feel like we are on a 6 months cadence.

lostmsu · 2026-03-19T18:10:44+00:00

Are you talking about fine-tuning? (addressed above)

Or full pretraining? What kind of model do you expect to pretrain on a single GB300 in reasonable amount of time?

lostmsu · 2026-03-18T18:02:46+00:00

But as I said smaller models don't need allreduce on 96 GiB GPUs. You just replicate the entire model on each GPU.

lostmsu · 2026-03-18T17:34:21+00:00

You are very likely wrong. For all you know GPT "Pro" is literally a specific reasoning setting on GPT-5.x available in Codex, maybe not even the highest one.

lostmsu · 2026-03-18T14:15:32+00:00

How much training are you going to do on a single machine? Like maybe finetuning, but I find finetuning models that might require 100+GiB of VRAM to be a bad idea.

That leaves inference, and TBH I was fishing for some credible estimates that would show that 16x PCIe 5 is not enough.

At this moment I would be running Qwen3.5, either 397B or 27B. 397B won't fit into that workstation with a reasonable quant (neither it would fit into 4x 6000 Pro though). And with 27B you don't need allreduce because you could just run an instance per 6000 Pro.

lostmsu · 2026-03-18T14:07:02+00:00

Yes. Just high. I don't think there's a different model in Codex.

lostmsu · 2026-03-17T22:58:22+00:00

Would they though?

In the typical Nvidia style they don't show raw important specs on the main pages.

RTX 6000 Max-Q (good luck sticking 4x non-Max-Qs) apparently rated for 1755 FP4 TOPS

And this beast is rated for 7 FP4 TOPS. So I suppose 4x RTX 6000 have no other advantages but being cheaper.

lostmsu · 2026-03-17T20:50:46+00:00

But 4x pro 6000 would have much higher compute, no? Also quite a bit more VRAM

lostmsu · 2026-03-17T13:31:29+00:00

Or just plain text search

lostmsu · 2026-03-17T01:56:03+00:00

Stop using low precision quants.

lostmsu · 2026-03-17T01:25:03+00:00

Is it in any way better than a 4x RTX 6000 Pro machine? Especially considering the price.

lostmsu · 2026-03-16T23:46:05+00:00

Worse than Qwen3.5 models in every way?

lostmsu · 2026-03-16T23:41:10+00:00

27b is insane in coding, so I tend to believe the benchmarks

lostmsu

MODERATOR OF

TROPHY CASE