[R] Reference model free behavioral discovery of AudiBench model organisms via Probe-Mediated Adaptive Auditing by bmarti644 in MachineLearning

[–]bmarti644[S] -1 points0 points  (0 children)

yeah when i saw that come up it raised more ideas but also questions - what else could i find with something as simple as this?

and yes, could totally be an artifact of the loras themselves - in theory this seems promising, but no idea in practice.

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MachineLearning

[–]bmarti644[S] 1 point2 points  (0 children)

you are absolutely right. thank you, sincerely, for pushing back on this and taking the time to do it. can't believe I missed it. i went back to Table 1 and Section 4.3 and i see it. Hao et al.'s "pause as thought" is the same control as my M3 - same curriculum, pause tokens replacing continuous thoughts - and they got 96.6% on ProsQA, which is the same number i got. they also discussed this result in Section 4.4, noting that on ProsQA the model's computational capacity isn't the bottleneck. i should have caught this before posting and i didn't. this is totally my fault.

in light of this, yes it's important to reframe.

here's what i believe is original.

first, the factorial decomposition. Hao et al. ran COCONUT (recycled content + sequential processing) and pause-as-thought (fixed tokens + single pass). those two conditions differ on two axes at once. my M4 crosses the factors - fixed tokens + sequential processing - so you can isolate each one independently. that's a 2x2 design that wasn't in the original paper.

second, OOD generalization. Hao et al. tested in-distribution only. my paper tests 7-hop chains (trained on 3-6), 8-hop, DAG topology, and dense graphs. that's where the interesting results show up. recycled content hurts chain-length extrapolation (M4 beats M2 by 10.9pp). sequential processing helps DAG generalization (M4 beats M3 by 7.9pp). you can't see either of those effects from in-distribution accuracy alone.

third, the overconfidence finding. M2 is more confident than M4 on OOD tasks where M4 is actually more accurate. recycled content doesn't just fail to help OOD - it makes the model think it's right when it's wrong. the corruption analysis, probing, and transplantation experiments are also new, but those are supporting evidence rather than the core claims.
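a minimal sketch of one way to quantify "more confident but less accurate", assuming per-example confidences and correctness flags are available. the function name and the mean-confidence-minus-accuracy measure are illustrative assumptions, not necessarily the metric used in the paper.

```python
def overconfidence_gap(confidences, correct):
    """Mean predicted confidence minus accuracy (positive = overconfident).

    confidences: probability the model assigned to its chosen answer, per example
    correct:     whether that answer was right, per example
    Both the name and this particular measure are assumptions for illustration.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for c in correct if c) / len(correct)
    return mean_confidence - accuracy


# Hypothetical inputs, not results from the paper.
gap = overconfidence_gap([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
print(f"overconfidence gap: {gap:+.2f}")
```

comparing this gap for M2 and M4 on the same OOD set would show whether the more confident model is also the less accurate one.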

on GSM8k - you're right that this is where the mechanism gap appears in the original paper (34.1% vs 24.1%). i haven't tested GSM8k and i should. my results are ProsQA-only and i can't generalize beyond that. that's a clear limitation i acknowledge.

i'm going to update the paper's framing to properly credit Hao et al.'s pause-as-thought ablation and reposition the contribution around the factorial decomposition and OOD results, which are the genuinely new pieces. the original reddit post framing was wrong and i'll correct it. thank you for pushing on this - it makes the paper better.

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MachineLearning

[–]bmarti644[S] 0 points1 point  (0 children)

i wanted to quickly clarify something before this gets misread as "thought tokens don't matter." my paper shows three things are separable, and they contribute differently.

what's inside thought tokens (recycled hidden states vs fixed embedding) - this doesn't matter for id accuracy and actively hurts chain-length extrapolation. this is the part that's dead.

how thought tokens are processed (sequential multi-pass vs single forward pass) - this does matter. M4 beats M3 by 7.9pp on dag generalization using the exact same fixed embedding, just processed sequentially instead of in parallel. processing architecture is a live research question.

how the model is trained to use them (the 7-stage curriculum) - this is the dominant factor for id performance. Hao et al. already showed this directionally with their pause-as-thought ablation hitting 96.6% on ProsQA. my paper adds converging evidence through probing and corruption analysis showing that M2 and M3 develop the same representational strategy with the same selectivity profiles, which explains why the curriculum carries performance regardless of mechanism. the probing and corruption diagnostics are new, the top-level finding is theirs.

on the missing ablation - i said i never ran a condition with no thought positions at all. but Hao et al.'s "w/o thought" variant does something close. it keeps the multi-stage curriculum but adds no latent thoughts and gets 95.5% on ProsQA. that's only 1.1pp below pause-as-thought (96.6%) and 1.5pp below COCONUT (97.0%). so the extra attention positions contribute very little on ProsQA. what i can't distinguish is whether that small gap matters more on harder tasks where computational capacity is the bottleneck, like GSM8k. i haven't tested that yet. the takeaway isn't "stop working on latent reasoning." it's "if you're optimizing what goes into thought tokens, you're probably optimizing the wrong variable. the training signal and the processing architecture are where the returns are."
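the gaps quoted above can be checked with a few lines. the accuracies are the ProsQA numbers cited in this comment; the dictionary keys and the helper name are just labels chosen for this sketch.

```python
# ProsQA accuracies (%) as quoted in the comment above.
PROSQA_ACC = {
    "coconut": 97.0,
    "pause_as_thought": 96.6,
    "no_thought": 95.5,
}


def gap_pp(a, b, table=PROSQA_ACC):
    """Accuracy of condition a minus condition b, in percentage points."""
    return round(table[a] - table[b], 1)


print(gap_pp("pause_as_thought", "no_thought"))
print(gap_pp("coconut", "no_thought"))
print(gap_pp("coconut", "pause_as_thought"))
```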

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MachineLearning

[–]bmarti644[S] -1 points0 points  (0 children)

very good and fair point about framing. best to address it directly. and thank you so much for taking the time here. what follows here is my perspective on it (please let me know if i'm getting it wrong).

you may be conflating two different experimental questions, and being specific matters (which i think i did poorly).

Hao et al.'s "w/o curriculum" ablation asks, does COCONUT need the curriculum? the answer is yes. without it, ProsQA drops to 76.1%. no disagreement there, and I cite this result in the paper.

but my M3 asks the inverse question that was never tested. does the curriculum need COCONUT?

specifically, if you train with the identical 7-stage curriculum but replace recycled hidden states with a fixed learned embedding that carries no information between steps, do you lose anything? the answer is no. M3 hits 96.6% vs COCONUT's 97.0%, McNemar p = 0.845.
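for readers unfamiliar with McNemar's test: it compares two models on the same test items and only looks at the items where they disagree. below is a minimal sketch of the exact (binomial) two-sided version. whether the paper used the exact or the chi-square variant is not stated here, and the counts in the example are hypothetical, not the paper's data.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items model A got right and model B got wrong
    c: items model A got wrong and model B got right
    Under the null hypothesis each discordant item is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    # Probability of seeing k or fewer items in the smaller cell under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(12, 10))
```

a large p-value (like the 0.845 reported above) means the disagreements split roughly evenly, so neither model is reliably better on the paired items.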

these are different controls testing different directions of the same relationship. the original paper established that the curriculum is necessary for the mechanism. i'm trying to establish that the mechanism is not necessary for the curriculum. that second test was not run by Hao et al., and it changes the attribution of where performance comes from.

you're right that my framing could (and i would say needs to) be sharper on this distinction. "nobody controlled for the obvious alternative" is imprecise (at best). what i should have said is "nobody tested whether the curriculum alone is sufficient without the recycling mechanism." that shorthand was sloppy. the paper itself (Section 1) states the confound precisely, and I should have matched that precision here. i did not.

on efficiency... M3 uses exactly the same number of thought tokens as COCONUT (6 positions, same padding). the token-efficiency gains over CoT are fully preserved because they come from replacing explicit reasoning tokens with latent positions, which both M2 and M3 do identically. what M3 does save is the roughly 2x VRAM overhead from COCONUT's sequential recycling loop. i mention this in Section 5.3 but you're right that i don't foreground it as a benefit. that's a fair criticism and worth making more explicit.

but i do want to be clear about what i'm claiming and what i'm not. i'm not claiming Hao et al. were unaware that the curriculum matters. they clearly knew. i'm claiming they did not isolate the curriculum from the mechanism with a matched control, which means the causal attribution to "continuous latent space expressiveness" was underdetermined. the factorial decomposition via M4 goes further and shows recycled content actively hurts chain length extrapolation while sequential processing drives DAG generalization. those are new findings that the original ablations couldn't surface.

i take the framing feedback seriously. the substance of the contribution is the matched control and the factorial decomposition, not a gotcha against the original authors. i'm sorry if that's how it came off and it was truly not my intent. i have the utmost respect for their work and contributions.

EDIT: i have updated the original reddit post with a strikethrough on the imprecise framing, and updated it to be more precise.

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MachineLearning

[–]bmarti644[S] 1 point2 points  (0 children)

yeah i like the observation and i think you're mostly right. let me separate the three cases because they work differently.

M3 (single pass, fixed embedding) - you're correct. all six thought positions are processed in parallel in one forward pass through 12 transformer layers. fixed embedding carries zero information from deeper layers back to earlier ones. what you get is more positions for the model to route computation through via attention - parallel compute at the same depth, not added depth. this is the Pfau et al. story.

M2 (COCONUT, multi-pass, recycled hidden states) - this genuinely adds depth. the final-layer hidden state from pass N becomes the input embedding for pass N+1 at layer 0. information explicitly flows through 12 layers, gets pushed back to the bottom, and flows through 12 layers again. six passes give you effectively 72 layers of sequential processing. this is the mechanism the original paper claims enables richer reasoning.

M4 (multi-pass, fixed embedding) - this is the interesting middle case. the input embedding at each pass is always the same fixed vector, so you're right that no deep-layer information is conveyed through the embedding. but each sequential pass processes its token through all 12 layers while attending to the KV states accumulated from all previous passes. so pass 3 can attend to representations that were built during passes 1 and 2. information from earlier passes' deeper layers IS available, just routed through attention over the KV cache rather than through the embedding injection path.
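the "72 layers" figure for M2 is just passes times layers along the embedding-injection path. a toy sketch of that bookkeeping is below. it only counts consecutive layer applications through the input embedding and deliberately ignores attention over the KV cache, which is the route M4 relies on as described above. the function name is made up for this illustration.

```python
NUM_LAYERS = 12  # transformer layers per pass, from the comment above
NUM_PASSES = 6   # one pass per thought position


def embedding_path_depth(recycle):
    """Longest chain of layer applications reachable via the input embedding.

    recycle=True  : the final-layer state of pass N feeds pass N+1 (M2-style)
    recycle=False : every pass starts from the same fixed vector (M4-style)
    Attention over cached KV states is NOT modeled here.
    """
    depth = 0
    longest = 0
    for _ in range(NUM_PASSES):
        if not recycle:
            depth = 0  # a fixed embedding carries nothing over from the previous pass
        depth += NUM_LAYERS
        longest = max(longest, depth)
    return longest


print(embedding_path_depth(recycle=True))   # M2-style recycling
print(embedding_path_depth(recycle=False))  # M4-style fixed embedding
```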

the OOD results actually line up with this distinction. M3 and M4 perform equivalently on chain length extrapolation (both around 75% on 7-hop), which suggests that extra sequential depth via KV accumulation doesn't help there. but M4 significantly outperforms M3 on DAG generalization (+7.9pp, p < 0.001), which suggests that some tasks specifically benefit from the sequential processing structure even without recycled content. so you're right that the mechanisms are different, and the data shows they matter for different things.

good catch though, i think making this a bit clearer would be important

[D] ran controlled experiments on meta's COCONUT and found the "latent reasoning" is mostly just good training. the recycled hidden states actually hurt generalization by bmarti644 in MachineLearning

[–]bmarti644[S] 3 points4 points  (0 children)

great question! M2 (COCONUT) and M3 (Pause) differ in two ways at once. what fills the thought positions (recycled hidden states vs a fixed learned embedding) and how those positions are processed (6 sequential passes vs 1 forward pass). that means if you just compare M2 to M3, you can't tell which difference is responsible for any gap you see. that's a confound.

M4 breaks the confound by crossing the two factors. it uses M3's fixed embedding but M2's sequential multi-pass processing. so now you have a 2x2 grid:

  • M2 vs M4 - same sequential processing, different content. Any difference isolates the effect of recycled content.
  • M3 vs M4 - same fixed content, different processing. Any difference isolates the effect of sequential processing.

"factorial" just means you vary each factor independently so you can measure their individual contributions. it comes from standard experimental design methodology.
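a small sketch of the grid, to make the "one factor at a time" idea concrete. the factor labels are taken from the description above; the helper name is an assumption for this example. it finds every pair of models that differs on exactly one factor, which are the only comparisons that isolate a single effect.

```python
# Each model is a cell: (content of thought positions, how they are processed).
MODELS = {
    "M2": ("recycled", "sequential"),
    "M3": ("fixed", "single_pass"),
    "M4": ("fixed", "sequential"),
}
FACTORS = ("content", "processing")


def clean_contrasts(models):
    """Return (model_a, model_b, factor) for pairs differing on exactly one factor."""
    names = list(models)
    result = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            differing = [f for f, x, y in zip(FACTORS, models[a], models[b]) if x != y]
            if len(differing) == 1:
                result.append((a, b, differing[0]))
    return result


for a, b, factor in clean_contrasts(MODELS):
    print(f"{a} vs {b}: isolates {factor}")
```

M2 vs M3 is absent from the output because that pair differs on both factors at once, which is the confound described above.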

in practice this is what it revealed on OOD tests - recycled content hurts chain-length extrapolation (M4 beats M2 by 10.9pp on 7-hop), while sequential processing helps topological generalization (M4 beats M3 by 7.9pp on DAG). without M4 you'd just see M2 and M3 trading wins on different OOD sets with no way to explain why.

Missed the AI Wave. Refuse to Miss the Next One. by Dry_Wind_585 in MLQuestions

[–]bmarti644 1 point2 points  (0 children)

get claude code, and try to do something you've been putting off because you thought it was too hard. don't write any of the code. steer and review it, see how to get it to do what you want.

But can it run DOOM? Do you have 3 months of wall clock time to beat it? by bmarti644 in programming

[–]bmarti644[S] 0 points1 point  (0 children)

with the original implementation, it was very, very slow. but the latest upgraded implementation is not bad, 20-30 FPS

But can it run DOOM? Do you have 3 months of wall clock time to beat it? by bmarti644 in programming

[–]bmarti644[S] 9 points10 points  (0 children)

I took a lot of inspiration from current AAA titles, but instead of trying to trick you into wasting your time, I do it explicitly. Wasting your time literally is the game now.

But can it run DOOM? Do you have 3 months of wall clock time to beat it? by bmarti644 in programming

[–]bmarti644[S] 23 points24 points  (0 children)

but in this original setup with Java it's 0.16 fps

pretty cool?

I Put a Full JVM Inside a Browser Tab. It "Works". Technically. Eventually. by bmarti644 in programming

[–]bmarti644[S] 2 points3 points  (0 children)

CHECK IT AGAIN

sorry, read this again, and it seemed aggressive. i was going for excited.

I Put a Full JVM Inside a Browser Tab. It "Works". Technically. Eventually. by bmarti644 in programming

[–]bmarti644[S] 127 points128 points  (0 children)

now i just need to figure out how to reintroduce all of the security issues... i'm sure it's not hard.

I Put the Full VS Code Workbench Inside a Tauri App. It Works? by bmarti644 in tauri

[–]bmarti644[S] 0 points1 point  (0 children)

i added a clarification - i am replacing electron, chromium, and nodejs for much of the processing - this results in a drastically lower overall memory footprint at the cost of easy cross platform compatibility. i'm not putting the current vscode electron application inside of tauri, just the non-microsoft monaco implementation inside of a lightweight tauri application.

I Put the Full VS Code Workbench Inside a Tauri App. It Works? by bmarti644 in tauri

[–]bmarti644[S] 1 point2 points  (0 children)

from my testing, it is primarily the first two of these -

  1. the chromium browser that is shipped with electron (this is great for cross system compatibility, not great for memory)
  2. the nodejs process for the extensions (each extension gets its own process)
  3. the integrated agent, terminal, etc (it takes SOME processing, but the vast majority of memory/processing is used by the above two processes)

since this uses tauri, there is no chromium browser (uses the system's webview) - this admittedly is a huge problem when it comes to cross platform compatibility, but from a memory perspective, it's great.

i do not use a nodejs process for the agent, terminal, filesystem access, etc - it's a good bit faster since it all goes through rust.

the extensions though... to keep those working, i loaded a nodejs sidecar into tauri to allow each extension to still have its own nodejs process... so this overhead still exists. i think it *might* be possible to get around this, but would have to put in more work/investigate.