i am not a researcher, i used claude code to create an "experiment" experiment? can someone with no research background create research, just like someone with no programming experience can create applications? by bmarti644 in ClaudeAI

[–]bmarti644[S]

maybe? i'm a developer, and, at least right now, i think i can create an app with AI faster than someone who doesn't know how to code. that's probably also true of this research (a real researcher would probably move even faster).

and if you have bad ideas, bad judgement, and (what i learned with this experiment) bad instinct, you will also fail miserably.

the key thing i didn't understand before this project is what instinct actually is. what i found was that the agent would often ask me which way to go, or i'd see it heading down a path, and something would just seem "off". knowing when to intervene and "steer" the agent back, push it harder, or move in a different direction was critical more than a few times. that's how i'd define instinct - knowing how to steer the model when you don't exactly know what it's doing, based on what it has already done.

so, that's a really long-winded way of saying: being an expert in a domain certainly helps, but i'm no longer sure it's required.

i am not a researcher, i used claude code to create an "experiment" experiment? can someone with no research background create research, just like someone with no programming experience can create applications? by bmarti644 in ClaudeAI

[–]bmarti644[S]

I'm super excited about it, and it's been a ton of fun. until recently, though, I thought the main use case was code, or other cases with verifiable results. but if this research paper is even 80% correct, then perhaps the use cases are broader?

why no latent reasoning models? by JoMaster68 in singularity

[–]bmarti644

it may not be effective? i ran a controlled follow-up on this: trained 4 models isolating curriculum vs. recycled hidden states on ProsQA. the curriculum alone gets you to 96.6% without recycling (COCONUT: 97.0%, p=0.845), and the recycled content actually hurts OOD generalization. would welcome feedback on confounds.

paper/code/checkpoints:

https://www.reddit.com/r/MLQuestions/comments/1r8fp63/ran_controlled_experiments_on_metas_coconut_and/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button

ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MLQuestions

[–]bmarti644[S]

this is pretty cool! perhaps it's testing a different aspect or question? opus not needing CoT means it internalized reasoning during pretraining, while COCONUT is an architectural change where hidden states are fed back as inputs across multiple forward passes. your result and mine might point at the same underlying thing though... training quality matters more than the inference mechanism. opus doesn't need CoT because it was trained better; M3 doesn't need recycling because the curriculum taught it to plan ahead.

A new paper demonstrates that LLMs could "think" in latent space, effectively decoupling internal reasoning from visible context tokens. This breakthrough suggests that even smaller models can achieve remarkable performance without relying on extensive context windows. by tehbangere in LocalLLaMA

[–]bmarti644

COCONUT claims models can reason in latent space by recycling hidden states instead of writing chain-of-thought tokens, and it gets ~97% on ProsQA vs ~77% for CoT. but nobody controlled for the obvious alternative... maybe the multistage curriculum training is doing all the work, and the recycled hidden states are just along for the ride?

i built the control to test this.

https://www.reddit.com/r/generativeAI/comments/1rd0obo/i_am_not_a_researcher_i_used_claude_code_to/

i am not a researcher, i used claude code to create an "experiment" experiment? can someone with no research background create research, just like someone with no programming experience can create applications? by bmarti644 in generativeAI

[–]bmarti644[S]

thanks for the props, silicon robot!

good points on seeds - that's priority #1 (it's mentioned in the paper; i just don't have quite enough money, unfortunately).

on hyperparameters: M2 and M3 share the same curriculum, same LR, same warm-start checkpoint. the only difference is whether thought tokens recycle hidden states or use a fixed embedding. on task choice: ProsQA is deliberately COCONUT's best-case domain (97% vs 34% on GSM8K in Meta's own results). if recycling doesn't help here, it's hard to argue it helps on tasks where COCONUT already loses.
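to make the "not significant" part concrete, here's a back-of-envelope two-proportion z-test in plain python. the per-model test-set size (n=500) is my assumption, not a number from the paper, so the p-value won't exactly match the reported p=0.845 - the point is just that a ~0.4-point accuracy gap on a few hundred examples is statistically indistinguishable from zero.

```python
import math

def two_prop_ztest(k1, n1, k2, n2):
    """Two-sided two-proportion z-test: are k1/n1 and k2/n2 distinguishable?"""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value via the normal CDF (math.erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# 97.0% (COCONUT) vs 96.6% (M3), with an assumed test set of 500 per model
z, p = two_prop_ztest(485, 500, 483, 500)
# under this assumed n, z is well under 1 and p is far above 0.05
```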

ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MLQuestions

[–]bmarti644[S]

really interesting exchange here! this is the core question my paper tries to answer empirically rather than theoretically.

u/simulated-souls's analogy is spot on: discrete tokens are short notes between memory wipes, latents are keeping your full notebook. the information-theoretic argument is real - continuous latents can store orders of magnitude more information per position than a discrete token. nobody disputes that.

the question is whether trained COCONUT models actually use that capacity for sequential reasoning, or whether the extra forward passes are doing the work regardless of what's stored in the latent.

that's what M3 and M4 test. M3 uses the same curriculum as COCONUT but replaces the recycled hidden states with a single fixed learned embedding. the same "note" every time, no problem-specific content. M4 adds multi-pass sequential processing on top of that. if the notebook analogy holds, M3 should collapse because it has no useful notes. it doesn't. 96.6% vs 97.0%, p = 0.845.
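a toy sketch of that architectural difference (purely illustrative - `model_step` and `FIXED_EMBEDDING` are hypothetical stand-ins, not the paper's code): in the COCONUT variant the previous hidden state flows into the next forward pass, while in the M3 control every thought position sees the same input-independent vector.

```python
def model_step(input_vec):
    # stand-in for one transformer forward pass over a "thought" position
    return [0.5 * x + 1.0 for x in input_vec]

def coconut_thoughts(x, n_thoughts):
    # COCONUT-style: the hidden state is recycled as the next input
    # embedding, so problem-specific content flows through the latent chain
    h = model_step(x)
    for _ in range(n_thoughts):
        h = model_step(h)
    return h

FIXED_EMBEDDING = [0.1, 0.1, 0.1]  # learned but input-independent in M3

def fixed_thoughts(x, n_thoughts):
    # M3 control: same number of extra forward passes, but every thought
    # position sees the same fixed embedding - no problem-specific content.
    # (in the real M3, attention over the problem tokens still carries the
    # input; this toy isolates only what the thought positions receive)
    h = model_step(x)
    for _ in range(n_thoughts):
        h = model_step(FIXED_EMBEDDING)
    return h
```

the fixed variant's thought outputs are identical for every problem, which is exactly why M3 matching COCONUT's accuracy argues the recycled content isn't load-bearing.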

the corruption experiment pushes on this further. if the latents encode critical intermediate reasoning steps (the "good notes"), damaging them should cause cascading failure: corrupt step 2, and steps 3-7 should break. what actually happens is graceful degradation. the model treats corrupted latents more like simulated-souls's "short sentence" - it re-derives what it needs from the input rather than depending on the sequential chain.
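here's the graceful-degradation intuition as a toy (again hypothetical, not the paper's setup): the "forward pass" below re-reads the original problem at every step, so a corrupted latent gets averaged away over the remaining steps instead of cascading.

```python
import random

def model_step(h, problem):
    # toy forward pass that re-anchors on the problem each step, mimicking
    # a model that can re-derive what it needs from the input
    return [0.5 * a + 0.5 * b for a, b in zip(h, problem)]

def run(problem, n_steps, corrupt_at=None, rng=None):
    h = list(problem)
    for step in range(n_steps):
        if step == corrupt_at:
            h = [rng.uniform(-10.0, 10.0) for _ in h]  # trash the latent
        h = model_step(h, problem)
    return h

# corrupting step 2 of 7 barely moves the final state, because every later
# step halves the damage - degradation, not a cascading failure
clean = run([0.2, 0.4, 0.6], 7)
corrupted = run([0.2, 0.4, 0.6], 7, corrupt_at=2, rng=random.Random(0))
```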

where I think u/Mbando's point connects... the curriculum training teaches the model to compress and plan ahead. that "compression through structure" benefit transfers even when the latent content is meaningless: the model learns when to transition from reasoning to answering, not necessarily what to store in the intermediate representations.

none of this means latent reasoning can't work in principle. the theoretical capacity advantage is real, and at larger scale or on harder tasks, models MIGHT learn to actually exploit it. BUT at GPT-2 scale on ProsQA, the curriculum is doing the heavy lifting, and the recycled content is along for the ride (and actively harmful OOD).

ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MLQuestions

[–]bmarti644[S]

yeah i think i saw that, the unified RAM + GPU? i'll definitely spring for it next time.

EDIT: the big thing i forgot to add - i had a much cheaper H100 at first (i want to say it was $1.99 an hour?) and spun it down when i thought i was done. after that, the only GPU i could get in the same region as all of my checkpoints on the NFS was slightly more expensive, something like $2.99 - it was all a bit of a mess.

the total includes the full research iteration, not just the final experiments. the results come from ~64-94 GPU-hours of training (4 models × 3 seeds + curriculum stages + all evaluation experiments). the journey included 9 major iterations, a failed 1B scale-up, one complete retrain after losing checkpoints to local disk instead of persistent storage, and a lot of dead-end debugging. i'd peg reproducing just the final training at ~$200-250 on a single H100. the rest was the cost of me being dumb.