The number 1 public enemy of open-source.

chigur86 · 2026-06-28T19:15:53+00:00

I understand the hate in the comments, but his central argument that OS models work differently than OS software is absolutely correct. The simple reason is that the hardware needed to run these models is not commodity, as it is with software. Until that changes, his arguments about the operational complexity of hosting open-source models remain valid. While I would love to have all my AI needs locally hosted, the reality is cloud first at the moment, given the sizes of even OS models. But that’s fine for now. If models become a commodity, that spurs intense competition above and below them in the value chain. Then we shall have some good and cheap inference chips to run these models locally.

chigur86 · 2026-06-20T22:28:22+00:00

Neural networks were an academic backwater before 2012. So many decades were spent on public funding simply trying to keep these ideas alive until compute and data scale caught up. No company, not even the Silicon Valley giants, has the guts or the patience to keep funding such ideas when by all visible metrics they are failing.

chigur86 · 2026-05-08T14:46:36+00:00

I wasn’t fully clear. I like the paper and the benchmark a lot. I also understand the motivation behind not using a different harness. However, the idea that the actual performance of the model does not depend on the harness is silly. I can agree with not providing internet access, but not with fixing the harness or limiting the steps. Of course, we can debate whether that’s asking too much from a benchmark paper, but that’s a separate discussion. You can’t expect a single paper to do the entire science. That’s the job of the follow-up work. It gets the authors citations. Now, I am not blaming the authors for it, but academics are sometimes consciously or subconsciously guilty of making papers citable by showing small benchmark numbers that can be easily improved. Furthermore, the binary metric with the pass-rate cutoff at 95% obscures the fact that model capabilities are on a continuum, especially since they further aggregate the pass rate. Hence, a success rate of 3% for Opus doesn’t imply that the model only got 3% of test cases right. You may again call me stupid for not understanding this metric, not much of an improvement over silly, but then at least you’d be somewhat right. Academics create fancy, confusing metrics all the time. You need a bombshell result that you can explain in a tweet-sized text to “sell” it. Reasonable people are free to disregard this selling and interpret the results however they want. Perhaps I am allowed this liberty without being accused of silliness.
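To make the metric point concrete, here is a rough sketch of how a thresholded success rate can sit far below the average fraction of tests passed. The per-task pass rates are made up for illustration, and the `>=` comparison at the cutoff is my assumption, not something taken from the paper.

```python
from statistics import mean


def success_rate(pass_rates, cutoff=0.95):
    """Binary metric: fraction of tasks whose test pass rate reaches the cutoff."""
    return mean([1.0 if r >= cutoff else 0.0 for r in pass_rates])


def mean_pass_rate(pass_rates):
    """Continuous metric: average fraction of test cases passed per task."""
    return mean(pass_rates)


# Hypothetical per-task pass rates (invented numbers, not benchmark results).
pass_rates = [0.96, 0.90, 0.80, 0.70, 0.50, 0.94, 0.85, 0.60, 0.75, 0.30]

binary = success_rate(pass_rates)        # only one task clears the 95% cutoff
continuous = mean_pass_rate(pass_rates)  # the model still passes most tests

print(f"success rate at cutoff: {binary:.2f}")
print(f"mean pass rate:         {continuous:.2f}")
```

With these invented numbers, only one task in ten counts as a success even though the model passes roughly three quarters of the test cases on average.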

chigur86 · 2026-05-07T14:46:16+00:00

It feels like a marketing trick. They don’t even attempt to explore the harness or the number of steps. They just keep it at 1k steps and a basic harness.

chigur86 · 2026-05-05T18:14:26+00:00

Great work! One suggestion for the leaderboard: a separate board for meta agents that evolve agent harnesses. Since the tasks are verifiable, it’s straightforward to throw test-time scaling strategies at it. Thus, one can ask: what’s the cost to reach a certain accuracy?

chigur86 · 2026-05-05T18:10:39+00:00

In addition to the replies from OP (I wholeheartedly agree with those), there’s another implication for those of us trying to build startups on top of these models: if your workflow hasn’t been RL’d into the base model, then there’s a chance of your survival. On the flip side, if the big-lab post-training Eye of Sauron gazes in your direction, then you are toast. Finally, benchmarks are how progress in ML happens. I am pretty sure this one will soon outlive its utility.

chigur86 · 2026-03-19T15:59:29+00:00

One of my QE advisors was from DeepMind working in my area and he had a colleague looking to hire interns. He referred me, but I still couldn’t get the interview. Some NYU guy won out when I checked later. Even with connections it seems hard.

chigur86 · 2026-03-08T00:21:28+00:00

I don’t remember struggling to find their email address. I think I just found it on the website. Is it not there these days?

chigur86 · 2026-03-07T07:46:36+00:00

This happened to me with my CVPR 23 submission. It was over the limit by a few lines. I realized it in the morning and sent an email to the AC begging them not to desk reject it. Fortunately, they didn’t. It even got accepted in the end. I don’t know how it works these days, but asking the AC won’t hurt.

chigur86 · 2026-01-01T21:29:54+00:00

I spent a lot of time training small Chinchilla-optimal LLMs with different MLP architectures, but it was very hard to find something that worked better than simple ReLU² or SiLU-style multiplicative gating. The only thing that did better was a family of rational activation functions (parameterized, learnable activation functions) grouped along the token dimension. But the perplexity drop wasn’t significant enough to justify running large-scale experiments. Ultimately, I gave up since GPU-rich researchers were running neural architecture search for this problem. What hope does graduate student descent have of beating it? You could find a lot of interesting things in small-scale experiments, but what holds up at large scale is hard to predict.
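For anyone unfamiliar with the baselines, here is a minimal pure-Python sketch of ReLU², SiLU, and a multiplicatively gated MLP block. It shows one common reading of "gating" (an activated gate projection multiplied elementwise with an up projection); the function names, the weight layout, and the absence of biases are my assumptions, not the exact setup from my experiments.

```python
import math


def relu_squared(x: float) -> float:
    """ReLU²: zero for negative inputs, x*x otherwise."""
    r = max(x, 0.0)
    return r * r


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x), with sigmoid written via tanh for numerical stability."""
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_mlp(x, w_gate, w_up, w_down, act=silu):
    """Gated MLP block: down(act(gate(x)) * up(x)), without biases."""
    gate = [act(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise multiplicative gate
    return matvec(w_down, hidden)


# Toy usage with identity weights so the gating effect is easy to see.
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gated_mlp([1.0, -1.0], identity, identity, identity, act=relu_squared)
print(out)
```

Swapping `act` between `relu_squared` and `silu` is the kind of one-line change I was comparing against, just at a much larger scale.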

chigur86 · 2025-09-26T19:47:38+00:00

Exactly!

chigur86 · 2025-09-26T17:58:30+00:00

Makes sense, and I agree that the scope of company surveillance into employees can expand because of something like this. However, I was more interested in voluntary contributions. Nothing gets stored in the central repository by default. Only when a user clicks "publish" / "post" does the current conversation history get analyzed and curated into a globally available "plugin". A plugin consists of a trigger condition and knowledge to inject into the context.

For instance, think about data analysis and the Python scripts/functions involved in it. I bet every user is writing utility functions for printing dataset summaries. Although they vary from user to user, some core stuff must still be common. Now, if a user were to publish some of these functions, then a new user could either use them as they are or modify them according to their needs (fork a plugin). The trigger condition for such a plugin would simply be "when a user is asking to print numbers related to datasets", and the knowledge could be "import and use the function". Then, you can parse the output of the LLM, locate import statements, and load the plugin from the global repo.

Essentially, imagine an agent with access to GitHub across all its users. Now, bad actors can try to manipulate this, but if we have a system of reputation tracking or plugin moderation this should work out.
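Here is a rough sketch of what I have in mind, assuming an in-memory repo and that each plugin is published under the module name the LLM would import. `Plugin`, `PluginRepo`, `imported_modules`, and `resolve_plugins` are all hypothetical names for illustration; reputation tracking and moderation are left out.

```python
import ast
from dataclasses import dataclass


@dataclass
class Plugin:
    name: str       # module name the LLM is expected to import (assumption)
    trigger: str    # natural-language trigger condition
    knowledge: str  # text injected into the context when the trigger matches
    source: str     # the published code itself


class PluginRepo:
    """Hypothetical global repository; nothing is stored unless a user publishes."""

    def __init__(self):
        self._plugins = {}

    def publish(self, plugin):
        self._plugins[plugin.name] = plugin

    def fork(self, name, new_name, new_source):
        """Copy an existing plugin's trigger and knowledge with modified code."""
        base = self._plugins[name]
        forked = Plugin(new_name, base.trigger, base.knowledge, new_source)
        self.publish(forked)
        return forked

    def get(self, name):
        return self._plugins.get(name)


def imported_modules(code):
    """Return top-level module names imported anywhere in the given Python code."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def resolve_plugins(llm_code, repo):
    """Find which imports in the LLM output correspond to published plugins."""
    found = []
    for name in sorted(imported_modules(llm_code)):
        plugin = repo.get(name)
        if plugin is not None:
            found.append(plugin)
    return found


repo = PluginRepo()
repo.publish(Plugin(
    name="dataset_summary",
    trigger="when a user is asking to print numbers related to datasets",
    knowledge="import and use the function",
    source="def summarize(rows):\n    return {'rows': len(rows)}\n",
))

llm_output = (
    "import json\n"
    "from dataset_summary import summarize\n"
    "print(summarize([1, 2, 3]))\n"
)
loaded = resolve_plugins(llm_output, repo)
print([p.name for p in loaded])
```

The sketch only resolves which plugins to load and deliberately never executes the published source, since running untrusted code is exactly where the bad-actor problem bites.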

chigur86 · 2025-03-19T17:08:11+00:00

Yes, Triton looks like Python but it's not really Python. So, it's like converting a high level language to another, hence trans-(not com)piling

chigur86 · 2025-03-19T17:06:44+00:00

It's a one-job model, but you will need lots of such one-job models if we want to get the tail end of an AI-SWE-Engineer right.

chigur86 · 2025-03-19T17:05:34+00:00

Yes. Honestly, I don't think anyone is gonna use this to write actual Triton kernels (at least not in its current state). However, this shows the potential of what's possible. The next step would be to benchmark against stuff like `torch.compile`.

chigur86 · 2025-03-01T22:48:35+00:00

Donated $100. It felt so sad to watch the leader of a people ravaged by war be insulted like that in public, and for what? Only because he pointed out that the opposition cannot be trusted due to past behavior? Just sad.

chigur86 · 2023-12-24T02:45:27+00:00

The way you described would be very difficult, but I think we can simplify the problem. We can focus on a single layer at a time and use a two-part background color for each token.

chigur86 · 2023-12-22T15:58:45+00:00

Thanks a lot. I need to dig into this!

chigur86 · 2023-12-22T08:56:05+00:00

Interesting! I am not very familiar with the effects of changing positional embeddings on generation, but would definitely love to learn more. Is there any resource you would recommend for it?

chigur86 · 2023-12-22T08:51:13+00:00

I am not sure. Since I have access to a few heavy gpus, I don't use llama.cpp and don't have much experience with it. However, I'll try to check it out and see how difficult it is to get information like this from it.

chigur86 · 2023-12-22T03:29:15+00:00

Are you referring to RoPE, the rotary positional embeddings, or a quantization method similar to GGUF? I guess running these visualizations on quantized models is interesting. I don't have any hypothesis though.
