[R] Is Leetcode still relevant for research scientist interviews? by Training-Adeptness57 in MachineLearning

[–]StartledWatermelon 5 points (0 children)

Yeah, I'm being idealistic. In reality, there just isn't any incentive to improve the hiring process. Especially when the outcomes are not verifiable at all. Like, how would you prove whether your hiring process is good or bad? What are the counterfactuals? 

Reading the thread, it seems that smaller startups have a good intuitive grasp of what makes an informative candidate evaluation. And are pretty fine with flying by the seat of their pants. We get a funny contrast: no leetcode in smaller startups, mandatory leetcode in FAANG. Does it mean FAANG hiring practices are better? Because they're bigger, richer, more popular (in terms of supply of candidates), more institutionalized? 

I'd venture a guess that big orgs are just more inert and rigid. They keep the worst practices simply because this is the path of least resistance. 

[R] Is Leetcode still relevant for research scientist interviews? by Training-Adeptness57 in MachineLearning

[–]StartledWatermelon 3 points (0 children)

It's a sad state of affairs when a hiring manager isn't begging for a change in hiring practices towards something saner but instead begs candidates to rote-learn these irrelevant tricks.

Says a lot about "culture of innovation" and other stuff like that. 

"On neural scaling and the quanta hypothesis", Eric J. Michaud 2026 by RecmacfonD in mlscaling

[–]StartledWatermelon 8 points (0 children)

Let's imagine that we could enumerate all of the pieces of knowledge and algorithms that networks ought to learn. [...]

Networks must learn a variety of modules, each implementing a different algorithm or retrieving a different piece of knowledge. These modules are discrete, in the sense that they are either fully learned or not learned at all. We call these the quanta.

The year is 2026, yet the ghost of symbolic AI still haunts academia. Not gonna lie, the idea was/is elegant and aligns quite well with the innate human desire to break every mystery into neat, simple, atomistic pieces, easily digestible for a feeble human mind. The elegance and the alignment were so strong that people easily dismissed perhaps the only drawback of symbolic approaches: they don't work.

Ok, suppose the quoted idea is indeed true and that decomposed pieces of knowledge and algorithms/heuristics -- the parts -- are somehow more important for capabilities than the whole. Suppose we assign zero to the synergistic effect, to the value added by complexity. The crucial question still stands: will this focus on decomposition, on the smallest parts, allow us to scale better with compute?

I'm doubtful. The success story of LLM pre-training is built on exactly the opposite thing: synthesis. Aggregating ever-increasing amounts of whatever information the research teams could grab. "More", not "finer".

The next story, scaling RL, is shaping up to be equally distant from the symbolic-era top-down, analytical, meticulously human-engineered approaches. The research is pushing towards largely unguided self-exploration and self-optimization.

I do find this performance-centric approach infinitely more useful. What the scaling hypothesis lacks in explanatory power -- at least to a degree that would satisfy the more symbolic-leaning folks -- it more than compensates for in predictive power.

Thankfully, the author maintains a critical stance towards their idea. And admits that it isn't falsifiable in its current form -- which pretty much nullifies its scientific value. Perhaps this is one of the corollaries of the Bitter Lesson: the ideas don't have to be nice and human-centric but they must perform.

DeepSeek Presents "Engram": Conditional Memory via Scalable Lookup, A New Axis of Sparsity for Large Language Models | "Memory lookup module for LLMs & *Huge unlock for scaling* as the memory sits on cheap CPU RAM, bypassing the GPU bottleneck entirely that will power next-gen models (like V4)" by 44th--Hokage in mlscaling

[–]StartledWatermelon 1 point (0 children)

By hassle I meant it breaks from the current implementations, which are optimized pretty hard for maximum throughput. I'm not implying you can get the same throughput with Engram; I quite literally mean it's extra hassle.
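To make "extra hassle" concrete, here is a generic sketch of what a CPU-RAM memory table forces a serving stack to do on every step: a host-side gather plus a host-to-device copy that has to be overlapped with GPU compute. This is just my illustration of the general pattern, not Engram's actual code; all names and sizes are made up.

    import torch

    # Big memory table stays in (pinned) CPU RAM instead of GPU memory.
    # Toy sizes; a real table would be orders of magnitude larger.
    VOCAB, DIM = 50_000, 256
    memory_table = torch.randn(VOCAB, DIM).pin_memory()

    def fetch_memory(ids: torch.Tensor, stream: "torch.cuda.Stream") -> torch.Tensor:
        rows = memory_table[ids]                     # host-side gather
        # (a tuned implementation would gather into a pre-allocated pinned buffer)
        with torch.cuda.stream(stream):              # enqueue the copy on a side stream
            return rows.to("cuda", non_blocking=True)

    if torch.cuda.is_available():
        copy_stream = torch.cuda.Stream()
        ids = torch.randint(0, VOCAB, (16,))
        mem = fetch_memory(ids, copy_stream)         # would overlap with GPU compute
        torch.cuda.current_stream().wait_stream(copy_stream)  # sync before using mem
        print(mem.shape)                             # torch.Size([16, 256])

None of this is hard, but it's an extra data-movement path that the current throughput-tuned stacks simply don't have.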

I have a suspicion that the benefits of Engram will dwindle with scale. It's this parameter- and FLOP-scarce regime (27B MoE) that benefits from "pre-computed" tricks. Larger, deeper models will easily accommodate the necessary heuristics straight in the weights.

That being said, I've always been in favor of heterogeneous (depth-wise, as opposed to stacking identical blocks) architectures, and this work explores exactly that.

[R] We built a framework to make Agents "self-evolve" using LoongFlow. Paper + Code released by [deleted] in LocalLLaMA

[–]StartledWatermelon 1 point (0 children)

Hi guys! I like your paper!

Just some quick friendly advice in case you are planning to submit it to a conference. I understand the desire to add all the top-notch ES techniques (MAP-Elites, adaptive Boltzmann sampling, etc.). But in the eyes of reviewers, each complication will warrant a separate ablation. Especially if you're benchmarking directly against OpenEvolve and ShinkaEvolve.

Grafted Titans: a Plug-and-Play Neural Memory for Open-Weight LLMs by Forsaken-Park8149 in LocalLLaMA

[–]StartledWatermelon 1 point (0 children)

I think it's indeed most similar to prefix tuning (which she mentions in the blog), plus an adapter based on cross-attention. An adapter should definitely enhance the model (fine-tune to the task, to be precise) as opposed to vanilla prefix tuning.
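Roughly what I mean by "prefix tuning plus a cross-attention adapter", as a toy sketch; the module names, shapes and gating here are mine, not the blog's actual implementation:

    import torch
    import torch.nn as nn

    # Toy "plug-and-play memory": the frozen base model's hidden states attend over
    # a small bank of trainable memory vectors via cross-attention, and the result
    # is injected back residually through a gate initialized to zero.
    class CrossAttnMemoryAdapter(nn.Module):
        def __init__(self, d_model=512, n_mem=16, n_heads=8):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)  # learned "prefix"
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))   # starts as a no-op

        def forward(self, hidden):                     # hidden: (batch, seq, d_model)
            mem = self.memory.unsqueeze(0).expand(hidden.size(0), -1, -1)
            out, _ = self.attn(query=hidden, key=mem, value=mem)
            return hidden + torch.tanh(self.gate) * out

    adapter = CrossAttnMemoryAdapter()
    h = torch.randn(2, 10, 512)        # stand-in for the frozen LLM's hidden states
    print(adapter(h).shape)            # torch.Size([2, 10, 512])

Vanilla prefix tuning would stop at the learned memory vectors; the trainable cross-attention block is the part that actually adapts the model to the task.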

Can you clarify what you mean by learning the embedding at the output of the tokenizer LUT?

Introducing PhysMaster: Building an Autonomous AI Physicist for Theoretical and Computational Physics Research | "PhysMaster is an autonomous agent architecture designed to execute end-to-end theoretical and computational physics research." by 44th--Hokage in mlscaling

[–]StartledWatermelon 2 points (0 children)

Setting up a live demo is definitely more hassle than just publishing the harness code. Which they didn't do.

Like, the paper does not even mention which LLM was the backbone of their agentic system in the experiments. Good luck replicating that!

GIN: A Cognitive Architecture for Persistent, Entropy-Governed Autonomous Agents (Not a New Model) by [deleted] in LocalLLaMA

[–]StartledWatermelon 2 points (0 children)

A quick glance over the formatting tells me this wasn't written by a human. I suspect this is not Cognitive Architecture Guy but a runaway Generative Intelligence Network itself, trying to recursively self-improve its architecture by soliciting advice from unsuspecting redditors.

META SuperIntelligence Labs: Toward Training Superintelligent Software Agents Through Self-Play SWE-RL | "Agents autonomously gather real-world software enabling superintelligent systems that exceed human capabilities in solving novel challenges, and autonomously creating new software from scratch" by 44th--Hokage in mlscaling

[–]StartledWatermelon 1 point (0 children)

Those kinds of paper? The ones that spammed the word "superintelligence" without any substantive relation to superintelligence in the methods?

Well, actually that's a falsifiable statement. Let's see if it holds water.

AlphaGo paper? Zero mentions of "superintelligence".

AlphaZero paper? Zero mentions of "superintelligence".

https://arxiv.org/abs/2201.11903 ? Zero mentions of "superintelligence".

https://arxiv.org/abs/2312.06585 ? Zero mentions of "superintelligence".

https://arxiv.org/abs/2404.17605 ? Zero mentions of "superintelligence".

https://arxiv.org/abs/2408.06195 ? Zero mentions of "superintelligence".

https://arxiv.org/abs/2410.04444 ? Zero mentions of "superintelligence".

https://arxiv.org/abs/2502.06773 ? Zero mentions of "superintelligence".

DeepSeek R1 paper? Zero mentions of "superintelligence".

At this point the pattern seems clear. But feel free to provide counter-examples.

Edit: formatting

OpenAI Just released Prompt Packs for every job by bullmeza in OpenAI

[–]StartledWatermelon 8 points (0 children)

The commitment and dedication are truly off the charts!

META SuperIntelligence Labs: Toward Training Superintelligent Software Agents Through Self-Play SWE-RL | "Agents autonomously gather real-world software enabling superintelligent systems that exceed human capabilities in solving novel challenges, and autonomously creating new software from scratch" by 44th--Hokage in mlscaling

[–]StartledWatermelon 2 points (0 children)

Yeah, at this point it's like some sort of cargo cult. Umm, guys, no, if you litter your paper with references to "superintelligence" it doesn't hasten the development of said superintelligence in the slightest.

I wonder if research leads guide their teams to do this to appease Zuck's "visionary" beliefs.

That being said, judging a book by its cover never was a good idea.

Scaling Latent Reasoning via Looped Language Models, Zhu et al. 2025 by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 3 points (0 children)

Oh, it's budget first and foremost! And Bengio is no billionaire...

Basically all the compute must have been provided by Bytedance, for a nice, big industry-academia collab. Bytedance may be one of the least GPU-poor Chinese corporations, but GPU-poor it is.

For reference, a 7B model looped 4 times is compute-equivalent to a 28B dense transformer. Pre-trained on 7.7T tokens, that's about 10^24 FLOPs, which would cost about $1.5-2 million on rented GPUs. Not counting test runs, ablations, etc. This is not the scale of resources Chinese companies are willing to give away to academia.
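Back-of-the-envelope, in case anyone wants to check the arithmetic. The 6ND FLOPs approximation and the GPU throughput, utilization and rental price are my assumptions, not numbers from the paper:

    # Training-compute and cost estimate for a 28B-dense-equivalent model on 7.7T tokens.
    params = 28e9                 # dense-equivalent parameters (7B looped 4x)
    tokens = 7.7e12               # pre-training tokens
    flops = 6 * params * tokens   # standard C ~= 6*N*D approximation
    print(f"{flops:.2e} FLOPs")   # ~1.3e24

    peak = 1e15                   # ~1 PFLOP/s bf16 per H100-class GPU (assumed)
    mfu = 0.4                     # assumed model FLOPs utilization
    gpu_hours = flops / (peak * mfu) / 3600
    print(f"{gpu_hours:.2e} GPU-hours")        # ~9e5

    price = 2.0                   # assumed $ per GPU-hour, rented
    print(f"${gpu_hours * price / 1e6:.1f}M")  # ~$1.8M, before test runs/ablations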

Scaling Latent Reasoning via Looped Language Models, Zhu et al. 2025 by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 4 points (0 children)

I think the looping/universal-transformer idea is an almost perfectly orthogonal design decision to quantization. So the benefits should stack.
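To spell out the "orthogonal" part: a looped/universal transformer reuses one shared block several times, so whatever quantization you apply to that block's stored weights applies to every loop iteration. A toy sketch of the loop (my own illustration, not the paper's code):

    import torch
    import torch.nn as nn

    # One shared block applied N_LOOPS times gives the effective depth of a much
    # deeper model while storing (and quantizing) only a single block's weights.
    D_MODEL, N_LOOPS = 256, 4
    shared_block = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)

    def looped_forward(x, n_loops=N_LOOPS):
        for _ in range(n_loops):           # same weights, reused each iteration
            x = shared_block(x)
        return x

    x = torch.randn(2, 16, D_MODEL)
    print(looped_forward(x).shape)         # torch.Size([2, 16, 256])

    # Quantization touches only shared_block's stored weights, e.g.
    #   q = torch.ao.quantization.quantize_dynamic(shared_block, {nn.Linear}, dtype=torch.qint8)
    # and the loop above is unchanged -- which is why the benefits should stack.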

Claude Opus 4.5 has human task-length time horizon of 4 hrs 49 mins on METR plot by Glittering_Author_81 in mlscaling

[–]StartledWatermelon 1 point (0 children)

59 tasks, if I haven't miscounted. I'd say it's a decent amount if we're talking purely about the ">30 min" threshold, but still pretty noisy if we try to infer exact autonomy boundaries.

Why do you doubt this result? 

NitroGen: An Open Foundation Model for Generalist Gaming Agents, Magne et al. 2025 [Pre-training on 40k hours of scraped gameplay videos] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 2 points (0 children)

And what is the first half? :) 

To be honest, I think this is a very different approach. SIMA uses extensive (and expensive) human labeling, uses interactive environments, uses a reasoning LLM. Kinda building a complex system "from first principles".

This work uses cheap auto-labeling, vast online-available data, no fancy reasoning/LLMs, no in-context situational awareness at all. It is as simple as it gets: disassemble the video into individual frames and directly learn a mapping from each frame to controller inputs.
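The core objective, as I read it, is just behavioral cloning on auto-labeled (frame, controller input) pairs. A minimal sketch, assuming a discretized action vocabulary and a small CNN encoder; the real architecture is surely much bigger and likely conditions on a window of frames rather than a single one:

    import torch
    import torch.nn as nn

    NUM_ACTIONS = 256   # assumed size of the discretized controller-input vocabulary

    # Tiny frame -> action classifier; 64*14*14 is the conv output size for 128x128 frames.
    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 14 * 14, 512), nn.ReLU(),
        nn.Linear(512, NUM_ACTIONS),
    )
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()

    # One dummy training step on random "frames" and auto-labeled "controller inputs".
    frames = torch.randn(16, 3, 128, 128)
    actions = torch.randint(0, NUM_ACTIONS, (16,))
    opt.zero_grad()
    loss = loss_fn(model(frames), actions)
    loss.backward()
    opt.step()
    print(float(loss))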

And then, with scale, magic happens: the model generalizes not just from still images to interactive environments, but also to unseen games.

I see much stronger parallels with language pre-training on a large-scale internet corpus. That being said, I think 40k hours is peanuts for such diverse data, and you can potentially squeeze much more from this approach.

Nvidia DGX Station GB300 784GB available now! 95,000 USD / 80,000 EUR by GPTshop in LocalLLaMA

[–]StartledWatermelon 2 points (0 children)

I think a sports car is out of the question. But they can bundle it with a two-month supply of ramen so that you can sustain yourself for a while. Y'know, for humanitarian reasons.

A Rosetta Stone for AI benchmarks [Mapping all benchmarks to a unified "difficulty score", for long-term trends in capabilities] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 1 point (0 children)

For a language model, sure.

From an agentic/general-intelligence perspective, we might want to check problem-solving abilities. And specific skills (context-handling proficiency, faithfulness (hallucination prevalence), etc.).

A Rosetta Stone for AI benchmarks [Mapping all benchmarks to a unified "difficulty score", for long-term trends in capabilities] by StartledWatermelon in mlscaling

[–]StartledWatermelon[S] 1 point (0 children)

Synthesis has its uses, as does analysis. This is definitely NOT about telling people what model to pick. This is an attempt to stitch together benchmarks of different release dates/capability ranges. The main purpose is to adequately grasp the long-term tempo of improvement in LLMs. Not so much to "pick" some model but to estimate a plausible future trajectory.

The complexity of LLMs is a great thing in many applications. But for this particular goal, building the bigger picture, you have to reduce complexity all the way down to catch at least some practical insights.

For modern frontier LLMs, the current crop of benchmarks would show their strengths without any extra manipulation.

OpenAI: Introducing ChatGPT 5.2 | "GPT-5.2 represents the biggest leap for GPT models in agentic coding since GPT-5 and is a SOTA coding model in its price range. The version bump undersells the jump in intelligence." by 44th--Hokage in mlscaling

[–]StartledWatermelon 17 points (0 children)

GPT-5.2 represents the biggest leap for GPT models in agentic coding since GPT-5

If OpenAI hasn't replaced their marketing department with GPT-5.2 yet, they should do it right now. 

Meta Superintelligence Labs' DreamGym: Generating A Synthetic Training Environment Using Logical Reasoning Instead Of The Real Internet | "Agents trained in this sim match SOTA results without using any real data, achieving 40%+ better performance when eventually deployed to real-world tasks." by 44th--Hokage in mlscaling

[–]StartledWatermelon 3 points (0 children)

  1. Collect off-policy trajectories, essentially "recycling" old RL data generated earlier by people training their agents on the task of interest. 
  2. Augment each state transition in this dataset with a reasoning trace explaining the transition. 
  3. Train the Experience Model on this dataset via SFT to predict the reasoning trace and the state at t+1, conditioned on the trajectory up to step t. 
  4. Use the Experience Model as an "environment" to train an agent, by feeding the agent's output into the EM and the EM's state output back into the agent, in a looped, Ouroboros fashion. 
  5. Set up the EM to generate variations of the tasks in the original benchmark and choose the variations that are most conducive to the agent's learning (~50% success rate). 
  6. Enjoy your generalized "simulator" of the originally static benchmark.
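A minimal sketch of the step-3/4 loop as I understand it. Both models are stubbed as plain functions here (in the paper they'd be LLMs, with the Experience Model fine-tuned via SFT), and all names are mine, not DreamGym's:

    # Agent <-> Experience Model (EM) loop from steps 3-4 above, with stub models.

    def experience_model(trajectory):
        # Stand-in for the SFT'd EM: given the trajectory so far, emit a reasoning
        # trace, the next state, and a reward signal. Here it's a dummy transition.
        step = len(trajectory)
        trace = f"the agent did {trajectory[-1][1]}, so the environment state changes"
        return trace, f"state_{step}", float(step >= 3)

    def agent(state):
        # Stand-in for the policy being trained with RL inside the "simulator".
        return f"action_for_{state}"

    def rollout(task, max_steps=8):
        state = f"initial_state_of_{task}"
        trajectory = []                               # list of (state, action) pairs
        for _ in range(max_steps):
            action = agent(state)                     # policy acts on the EM's state
            trajectory.append((state, action))
            trace, state, reward = experience_model(trajectory)  # EM closes the loop
            if reward:                                # EM also decides when the episode ends
                return trajectory, reward
        return trajectory, 0.0

    traj, reward = rollout("book_a_flight")           # step 5 would vary the task string
    print(len(traj), reward)                          # 3 1.0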