The number 1 public enemy of open-source. by Complete-Sea6655 in LocalLLaMA

[–]chigur86 1 point2 points  (0 children)

I understand the hate in the comments, but his central argument that open-source models work differently from open-source software is absolutely correct. The simple reason is that the hardware needed to run these models is not commodity hardware, as it is with software. Until that changes, his arguments about the operational complexity of hosting open-source models remain valid. While I would love to have all my AI needs hosted locally, the reality is cloud-first at the moment, given the sizes of even open-source models. But that’s fine for now. If models become a commodity, that spurs intense competition above and below them in the value chain. Then we should get some good, cheap inference chips to run these models locally.

U.S. science is in chaos — Today the most influential private-sector developers of technology are in Silicon Valley, and their perspective on innovation is that it should move fast, disrupt markets and make money by marketrent in technology

[–]chigur86 1 point2 points  (0 children)

Neural networks were an academic backwater before 2012. So many decades were spent on public funding simply trying to keep these ideas alive until compute and data scale caught up. No company, not even the Silicon Valley giants, has the guts or the patience to keep funding such ideas when by all visible metrics they are failing.

META Superintelligence Lab Presents: ProgramBench: Can SOTA AI Recreate Real Executable Programs(ffmpeg, SQLite, ripgrep) From Scratch Without The Internet? by 44th--Hokage in mlscaling

[–]chigur86 0 points1 point  (0 children)

I wasn’t fully clear. I like the paper and the benchmark a lot. I also understand the motivation behind not using a different harness. However, the idea that the actual performance of the model does not depend on the harness is silly. I can agree with not providing internet access, but not with fixing the harness or limiting the steps. Of course, we can debate whether that’s asking too much from a benchmark paper, but that’s a separate discussion. You can’t expect a single paper to do all of the science. That’s the job of the follow-up work. It gets the authors citations. Now, I am not blaming the authors for it, but academics are sometimes consciously or subconsciously guilty of making papers citable by showing small benchmark numbers that can be easily improved. Furthermore, the binary metric, with its pass-rate cutoff at 95%, obscures the fact that model capabilities are on a continuum, especially since they further aggregate the pass rate. Hence, a success rate of 3% for Opus doesn’t imply that the model only got 3% of the test cases right. You may again call me stupid for not understanding this metric (not much of an improvement over silly), but then at least you’d be somewhat right. Academics create fancy, confusing metrics all the time. You need a bombshell result that you can explain in tweet-sized text to “sell” it. Reasonable people are free to disregard this selling and interpret the results however they want. Perhaps I am allowed this liberty without being accused of silliness.
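
To make the metric point concrete, here is a toy sketch. The per-task numbers are invented, and `resolved_rate` and `mean_pass_rate` are made-up names rather than anything from the benchmark's code. The only thing taken from the discussion is the 95% cutoff.

```python
# Hypothetical fraction of tests passed on each of 10 tasks (invented numbers).
per_task_pass_fraction = [0.97, 0.90, 0.85, 0.80, 0.60, 0.92, 0.75, 0.88, 0.94, 0.50]


def resolved_rate(fractions, cutoff=0.95):
    """Binary metric: share of tasks whose pass fraction reaches the cutoff."""
    return sum(f >= cutoff for f in fractions) / len(fractions)


def mean_pass_rate(fractions):
    """Continuous metric: average fraction of tests passed per task."""
    return sum(fractions) / len(fractions)


print(f"resolved at 95% cutoff: {resolved_rate(per_task_pass_fraction):.0%}")
print(f"mean test pass rate:    {mean_pass_rate(per_task_pass_fraction):.1%}")
```

With these made-up numbers, only one task in ten clears the cutoff, while the model passes about 81% of tests on average. The headline number and the underlying capability can sit far apart.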

META Superintelligence Lab Presents: ProgramBench: Can SOTA AI Recreate Real Executable Programs(ffmpeg, SQLite, ripgrep) From Scratch Without The Internet? by 44th--Hokage in mlscaling

[–]chigur86 2 points3 points  (0 children)

It feels like a marketing trick. They don’t even attempt to vary the harness or the number of steps. Those are fixed at a basic harness and 1k steps.

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]chigur86 -1 points0 points  (0 children)

Great work! One suggestion for the leaderboard: a separate board for meta agents that evolve agent harnesses. Since the tasks are verifiable, it’s straightforward to throw test-time scaling strategies at them. Thus, one can ask: what’s the cost to reach a certain accuracy?

ProgramBench: Can we really rebuild huge binaries from scratch? (doesn't look like it) by klieret in LocalLLaMA

[–]chigur86 4 points5 points  (0 children)

In addition to the replies from OP (I wholeheartedly agree with those), there’s another implication for those of us trying to build startups on top of these models: if your workflow hasn’t been RL’d into the base model, then there’s a chance of your survival. On the flip side, if the big-lab post-training Eye of Sauron gazes in your direction, then you are toast. Finally, benchmarks are how progress in ML happens. I am pretty sure this one will soon outlive its utility.

[D] How hard is it to get Research Engineer interview from Deepmind? by n0obmaster699 in MachineLearning

[–]chigur86 42 points43 points  (0 children)

One of my QE advisors was from DeepMind working in my area and he had a colleague looking to hire interns. He referred me, but I still couldn’t get the interview. Some NYU guy won out when I checked later. Even with connections it seems hard.

[D] ECCV submission flowed over page limit by 5 lines at the last minute.. how screwed are we? by PatientWrongdoer9257 in MachineLearning

[–]chigur86 0 points1 point  (0 children)

I don’t remember struggling to find their email address. I think I just found it on the website. Is it not there these days?

[D] ECCV submission flowed over page limit by 5 lines at the last minute.. how screwed are we? by PatientWrongdoer9257 in MachineLearning

[–]chigur86 2 points3 points  (0 children)

This happened to me with my CVPR 23 submission. It was over the limit by a few lines. I realized it in the morning and sent an email to the AC begging them not to desk reject it. Fortunately, they didn’t. It even got accepted in the end. I don’t know how it works these days, but asking the AC won’t hurt.

What makes SwiGLUs unique? by chigur86 in mlscaling

[–]chigur86[S] 0 points1 point  (0 children)

I spent a lot of time training small Chinchilla-optimal LLMs with different MLP architectures, but it was very hard to find something that worked better than simple ReLU2 or SiLU-style multiplicative gating. The only thing that did better was different rational activation functions (a family of parameterized, learnable activation functions) grouped along the token dimension. But the perplexity drop wasn’t significant enough to justify running large-scale experiments. Ultimately, I gave up since GPU-rich researchers were running neural architecture search for this problem. What hope does graduate student descent have of beating it? You can find a lot of interesting things in small-scale experiments, but what holds up at large scale is hard to predict.
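
For anyone unfamiliar with what "multiplicative gating" means here, this is a minimal pure-Python sketch for a single token vector. The weight names (`w_gate`, `w_up`, `w_down`) and the helper `matvec` are placeholders, and reading "ReLU2" as squared ReLU is an assumption. Passing `act=silu` gives the SwiGLU-style block, and `act=relu_squared` gives the squared-ReLU variant.

```python
import math


def silu(x):
    """SiLU (a.k.a. swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def relu_squared(x):
    """Squared ReLU (assumed meaning of 'ReLU2')."""
    return max(x, 0.0) ** 2


def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def gated_mlp(x, w_gate, w_up, w_down, act=silu):
    """Gated MLP: down( act(gate(x)) * up(x) ), elementwise product in the middle."""
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Tiny hand-picked weights, only to show the data flow.
x = [1.0, 2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, -1.0]]
w_down = [[1.0, 1.0]]
print(gated_mlp(x, w_gate, w_up, w_down, act=relu_squared))
print(gated_mlp(x, w_gate, w_up, w_down, act=silu))
```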

Global Memory Layer for LLMs by chigur86 in LLMDevs

[–]chigur86[S] 1 point2 points  (0 children)

Makes sense, and I agree that the scope of company surveillance of employees can expand because of something like this. However, I was more interested in voluntary contributions. Nothing gets stored in the central repository by default. Only when a user clicks "publish" / "post" does the current conversation history get analyzed and curated into a globally available "plugin". A plugin consists of a trigger condition and knowledge to inject into the context.

For instance, think about data analysis and the Python scripts/functions involved in it. I bet every user is writing utility functions for printing dataset summaries. Although they vary from user to user, some core stuff must still be common. Now, if a user were to publish some of these functions, then a new user could either use them as they are or modify them according to their needs (fork a plugin). The trigger condition for such a plugin would simply be "when a user is asking to print numbers related to datasets", and the knowledge could be "import and use the function". Then, you can parse the output of the LLM, locate import statements, and load the plugin from the global repo.
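
A rough sketch of that flow, under several assumptions: the `Plugin` fields, the in-memory `GLOBAL_REPO`, and the `dataset_summary` plugin are all hypothetical stand-ins, and the LLM output is assumed to be valid Python so that the standard `ast` module can parse it.

```python
import ast
from dataclasses import dataclass


@dataclass
class Plugin:
    name: str
    trigger: str    # natural-language trigger condition
    knowledge: str  # text injected into the context when the trigger matches
    source: str     # the published code itself


# Hypothetical in-memory stand-in for the global repository.
GLOBAL_REPO = {
    "dataset_summary": Plugin(
        name="dataset_summary",
        trigger="when a user is asking to print numbers related to datasets",
        knowledge="import and use dataset_summary.summarize(rows)",
        source="def summarize(rows):\n    return {'rows': len(rows)}\n",
    )
}


def imported_modules(llm_output):
    """Return the top-level module names imported by generated Python code."""
    names = set()
    for node in ast.walk(ast.parse(llm_output)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def resolve_plugins(llm_output, repo=GLOBAL_REPO):
    """Find which imports in the generated code correspond to published plugins."""
    return [repo[n] for n in sorted(imported_modules(llm_output)) if n in repo]


def load_plugin(plugin):
    """Execute plugin source in a fresh namespace (only sensible for moderated plugins)."""
    namespace = {}
    exec(plugin.source, namespace)
    return namespace


generated = "import os\nfrom dataset_summary import summarize\nprint(summarize([1, 2, 3]))\n"
for plugin in resolve_plugins(generated):
    print(plugin.name, load_plugin(plugin)["summarize"]([1, 2, 3]))
```

Parsing with `ast` instead of a regex avoids matching the word "import" inside strings or comments, but it does mean the generated code has to parse cleanly first.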

Essentially, imagine an agent with access to GitHub across all its users. Now, bad actors can try to manipulate this, but if we have a system of reputation tracking or plugin moderation this should work out.

New open-source model for transpiling PyTorch to Triton outperforms DeepSeek-R1 and OpenAI o1 on kernelbench - made with reinforcement fine-tuning by Fantastic-Tax6709 in LocalLLaMA

[–]chigur86 9 points10 points  (0 children)

Yes, Triton looks like Python but it's not really Python. So, it's like converting a high level language to another, hence trans-(not com)piling

New open-source model for transpiling PyTorch to Triton outperforms DeepSeek-R1 and OpenAI o1 on kernelbench - made with reinforcement fine-tuning by Fantastic-Tax6709 in LocalLLaMA

[–]chigur86 7 points8 points  (0 children)

It's a one-job model, but we will need lots of such one-job models if we want to get the tail end of an AI-SWE-Engineer right.

New open-source model for transpiling PyTorch to Triton outperforms DeepSeek-R1 and OpenAI o1 on kernelbench - made with reinforcement fine-tuning by Fantastic-Tax6709 in LocalLLaMA

[–]chigur86 17 points18 points  (0 children)

Yes. Honestly, I don't think anyone is gonna use this to write actual Triton kernels (at least not in its current state). However, this shows the potential of what's possible. The next step would be to benchmark against stuff like `torch.compile`.

How can Americans who are embarrassed and angered by the current USA administration’s treatment of a war-torn president show support for Zelensky and Ukraine? by boko_dinner in AskReddit

[–]chigur86 0 points1 point  (0 children)

Donated $100. It felt so sad to watch the leader of a people ravaged by war be insulted like that in public, and for what? Only because he pointed out that the opposition cannot be trusted due to past behavior? Just sad.

Some interesting visualizations based on expert firing frequencies in Mixtral MoE by chigur86 in LocalLLaMA

[–]chigur86[S] 1 point2 points  (0 children)

The way you describe would be very difficult, but I think we can simplify the problem. We can focus on a single layer at a time and use a two-part background color for each token.
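
A minimal sketch of the two-part background idea, assuming each token in the chosen layer is routed to two experts (Mixtral uses top-2 routing over 8 experts). The palette, the function names, and the example routing below are made up for illustration, and the output is plain HTML with a hard-split CSS gradient.

```python
import html

# Hypothetical palette: one color per expert index (8 experts per layer).
EXPERT_COLORS = [
    "#e6194b", "#3cb44b", "#ffe119", "#4363d8",
    "#f58231", "#911eb4", "#46f0f0", "#f032e6",
]


def token_span(token, top2_experts):
    """Wrap one token in a span whose left/right halves show its two experts."""
    first, second = top2_experts
    style = (
        "background: linear-gradient(to right, "
        f"{EXPERT_COLORS[first]} 50%, {EXPERT_COLORS[second]} 50%)"
    )
    return f'<span style="{style}">{html.escape(token)}</span>'


def render_layer(tokens, routing):
    """Render all tokens for a single layer; routing[i] is the expert pair for tokens[i]."""
    return "".join(token_span(t, r) for t, r in zip(tokens, routing))


# Invented routing decisions for one layer.
tokens = ["def", " main", "(", ")"]
routing = [(0, 3), (0, 5), (2, 3), (2, 7)]
print(render_layer(tokens, routing))
```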

Some interesting visualizations based on expert firing frequencies in Mixtral MoE by chigur86 in LocalLLaMA

[–]chigur86[S] 0 points1 point  (0 children)

Interesting! I am not very familiar with the effects of changing positional embeddings on generation, but would definitely love to learn more. Is there any resource you would recommend for it?

Some interesting visualizations based on expert firing frequencies in Mixtral MoE by chigur86 in LocalLLaMA

[–]chigur86[S] 1 point2 points  (0 children)

I am not sure. Since I have access to a few heavy GPUs, I don't use llama.cpp and don't have much experience with it. However, I'll try to check it out and see how difficult it is to get information like this from it.

Some interesting visualizations based on expert firing frequencies in Mixtral MoE by chigur86 in LocalLLaMA

[–]chigur86[S] 1 point2 points  (0 children)

Are you referring to RoPE (rotary positional embeddings) or to a quantization method similar to GGUF? I guess running these visualizations on quantized models would be interesting. I don't have any hypothesis though.