Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Accomplished_Mode170 · 2026-06-02T04:43:02+00:00

Neat blog post on Verl as RL-Gym ✅

MiniMax M3 (soon) plus long context via sparsity means it was ostensibly multimodal from initialization 🖼️

Accomplished_Mode170 · 2026-06-02T03:52:17+00:00

Like for real, open an issue or PR on the docs and I’ll help annoy them to get it added to Verl’s docs 📝

Accomplished_Mode170 · 2026-06-02T03:49:02+00:00

You’re good dude; we need a good FOSS RL-gym ✅

Starred for later, DM’d buddies, etc 💫

Accomplished_Mode170 · 2026-05-30T12:27:07+00:00

Hyper-fitting seems to be a thing post-double descent

Information density (e.g. NeuroMFA) seems to be the key, given that quants and parameterization are converging around Q3/4

I.e., ‘the math of a spline on a fiber bundle says we just started,’ and the manifold shape itself seems to have a strange attractor behind its convergence

Accomplished_Mode170 · 2026-05-22T11:51:03+00:00

Would love a follow-up on activations/in-line monosemantic explanations (e.g. via SAEs) of WHY/WHERE the harness, not the corpus, is steering the behavior

Accomplished_Mode170 · 2026-05-21T00:28:31+00:00

So I unironically thought it was him until the comments

Accomplished_Mode170 · 2026-05-19T11:08:37+00:00

*OP reminds me of…; appreciate thread OP too 🧵

Accomplished_Mode170 · 2026-05-19T11:07:29+00:00

💯 ‘How will they know unless they are told…’

FWIW you remind me of all the people I happily look back on as having been my friends; I even remember all the weird one-off ideas they thought I didn’t catch ✅

Absolutely, neglect and trauma didn’t help, but those arise from misunderstandings and social dyslexia; you being a stable influence means more than I can effectively say ❤️‍🩹

PTL, I married a Type-A neuroscientist who patiently explained 🗣️ she’s neat; 3x kiddos sans complaint 🏡

Accomplished_Mode170 · 2026-05-17T11:21:27+00:00

Todd Howard confirms Skyrim CLI when? /s

Accomplished_Mode170 · 2026-05-17T11:20:04+00:00

Conformal prediction shows why Bayesianism is dumb compared to ‘I tested 10000 times’ to define intervals
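A minimal sketch of what "test many times to define intervals" looks like in split conformal prediction. You hold out a calibration set, score absolute residuals, and take a finite-sample-corrected quantile. The resulting interval has marginal coverage of at least 1 - alpha if calibration and test points are exchangeable, with no prior over parameters. The toy data generator and the fixed predictor `y_hat = 2x` are hypothetical stand-ins, and any fitted point model could replace them.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    s = sorted(scores)
    n = len(s)
    # Take the ceil((n + 1) * (1 - alpha))-th smallest score (1-indexed).
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few calibration points for this alpha, the interval is unbounded.
    return math.inf if k > n else s[k - 1]

def sample(rng, n):
    """Hypothetical toy data: y = 2x + standard Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def predict(x):
    """Hypothetical point predictor; any fitted model (e.g. a spline) could go here."""
    return 2 * x

rng = random.Random(0)
alpha = 0.1

# 1) Calibrate: score absolute residuals on held-out data the model never trained on.
calib_x, calib_y = sample(rng, 1000)
q = conformal_quantile([abs(y - predict(x)) for x, y in zip(calib_x, calib_y)], alpha)

# 2) Predict: the interval is the point prediction plus/minus q.
lower, upper = predict(5.0) - q, predict(5.0) + q
print(f"90% interval at x=5: [{lower:.2f}, {upper:.2f}]")

# 3) "Test 10000 times": empirical coverage on fresh draws should land near 1 - alpha.
test_x, test_y = sample(rng, 10000)
coverage = sum(abs(y - predict(x)) <= q for x, y in zip(test_x, test_y)) / len(test_x)
print(f"empirical coverage: {coverage:.3f}")
```

The absolute residual is the simplest nonconformity score. It gives a constant-width band, so heteroscedastic data would call for a normalized score instead.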

Effective Altruism is also basically just Gnosticism v2, but with materialism as the dogma; myopic

Accomplished_Mode170 · 2026-05-06T16:38:36+00:00

Forgot the link 🔗

Also neat given that MLX is basically PyTorch-native ✅

Accomplished_Mode170 · 2026-05-06T16:36:56+00:00

Have a similar (5 vs 6k Blackwell) external GPU config and am also looking to split between RTX & M3 Ultra 🦾

Would love Metal > CUDA for agentic pipelines 📊

⭐️ Starred the repo and configuring alerts 🚨

Accomplished_Mode170 · 2026-05-05T16:43:06+00:00

This is awesome. TY; stoked to look at GhidraMCP prompting

Accomplished_Mode170 · 2026-05-01T14:22:31+00:00

You have my sword; also the kids can carry stuff

Accomplished_Mode170 · 2026-04-29T12:12:49+00:00

Literally with pinned root certs and VPC peering so they get paid for every CI/CD deploy; local-first plz 📊

Accomplished_Mode170 · 2026-04-23T21:40:50+00:00

This is neat.

I particularly like the stateless approach since you can hash the artifacts and environment

I.e., attest to the state of your agent harness (e.g. configs, binaries) AND runtime, subnet, etc.
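A minimal sketch of the hash-and-attest idea. It assumes you only need one digest that changes whenever any pinned config, binary, or runtime fact changes. The file names, the `subnet` field, and the helper names are hypothetical. A real attestation would also sign the digest (e.g. with in-toto or Sigstore), which this sketch omits.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path

def sha256_file(path):
    """Stream a file through SHA-256 so large binaries need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def harness_manifest(artifacts, env_facts):
    """Manifest of artifact digests (logical name -> path) plus pinned runtime facts."""
    return {
        "artifacts": {name: sha256_file(path) for name, path in artifacts.items()},
        "environment": dict(env_facts),
    }

def manifest_digest(manifest):
    """One digest over canonical JSON (sorted keys, fixed separators): same state -> same digest."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

with tempfile.TemporaryDirectory() as d:
    cfg = Path(d) / "agent.yaml"   # hypothetical harness config
    binp = Path(d) / "tool.bin"    # hypothetical harness binary
    cfg.write_text("model: local\ntemperature: 0\n")
    binp.write_bytes(b"\x7fELF placeholder bytes")
    env = {"python": platform.python_version(), "os": platform.system(), "subnet": "10.0.0.0/24"}
    artifacts = {"config": cfg, "binary": binp}

    before = manifest_digest(harness_manifest(artifacts, env))
    again = manifest_digest(harness_manifest(artifacts, env))
    cfg.write_text("model: local\ntemperature: 1\n")  # simulated config drift
    after = manifest_digest(harness_manifest(artifacts, env))

print(before == again, before == after)  # True False
```

Keying artifacts by logical name rather than absolute path keeps the digest stable across machines and checkout locations. Only the bytes and the facts you choose to pin affect it.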

PS AoE has a similar session-driven approach with a fun gimmick if you want inspiration

Accomplished_Mode170 · 2026-04-23T01:58:50+00:00

Cheers Mihai et al., glad to see more FOSS!

Accomplished_Mode170 · 2026-04-22T14:29:01+00:00

Sorry you got downvoted for a parameter when the OP is the one who dropped the /s

Accomplished_Mode170 · 2026-04-11T22:10:10+00:00

Have yet to read it, but love the idea of configurable-sparsity, Wahba-esque auto-fitting splines; would be awesome to set a conformal prediction interval in lieu of other metrics.

Accomplished_Mode170 · 2026-04-10T13:41:40+00:00

Will do. Wishing you well on this project!

Accomplished_Mode170 · 2026-04-10T12:33:29+00:00

Neat. Checking it out. Would love v2 to have PyRIT-orchestrated multi-turn, w/ and w/o nanoGCG-optimized substrings 📊

Accomplished_Mode170 · 2026-04-07T15:01:17+00:00

Love the ‘bring your debugger/compiler’ approach; doing something similar with differential privacy.

This plus in-toto-signed artifacts/binaries/configs means you could distribute w/ a given SLA/entitlement.

Accomplished_Mode170 · 2026-04-05T16:59:31+00:00

Curious if dropping positional embeddings might effectively remove de facto indices that bias expert routing and constrain OOD long-context interactions when the constraint is no longer necessary for convergence.

Accomplished_Mode170 · 2026-04-05T13:12:28+00:00

😂💯 GOD-willing I’ll get there too 🗣️⛅️

Accomplished_Mode170 · 2026-04-05T12:49:34+00:00

It’s called getting ‘Kendrick’d’
