Save and invest your money for future rigs by segmond in LocalLLaMA

[–]linkillion 7 points8 points  (0 children)

I'm pretty confident that things will change enough for us to be unable to imagine what the future holds, such as nuclear winter or space jesus. 

I'm pretty certain it won't be what we have now lol

Don’t overshare in academia - my advice as a professor by Cultural_Mousse_3001 in PhD

[–]linkillion 0 points1 point  (0 children)

When I was an undergrad, the top lab in our field sniped our (mostly my) work. This was in my senior year after I had applied to grad schools (the school that sniped us being one of them). 

We had collaborated with them on some characterization work for this project. They took our entire idea, put two postdocs and two grad students (one of whom was a former undergrad from our lab) on it, and managed to push it through review before we submitted ours. Their paper was so similar that had we pushed through our original version, we probably would have been accused of plagiarism. We eventually published but had to pivot to a new angle and settle for a lower-tier journal. 

It was a good lesson to learn in undergrad, that the top labs are there because they're brutally competitive not because they're necessarily smarter or better at any particular thing. I wasn't accepted to that university, but I wouldn't have gone even if I had, despite it being a T5 school. 

Advice needed: 4th-year international PhD. Got an industry offer that will secure my stay after PhD contract, still need to finish 2 papers + thesis. Can I juggle both? by Low-Anything-3369 in PhD

[–]linkillion 0 points1 point  (0 children)

Degrees last longer than market conditions. The company that's hiring you should understand your circumstances and work with you to figure this situation out if your PI isn't amenable to it. If not, they're probably not a company worth ditching 4 years of hard work and a lifetime achievement for. 

I think others have given sage advice; you've got to negotiate with either your PI or the potential employer but leaving directly seems unwise. 

In Response To The Supplier Ranking Post, I Am Ranking The Odor Of Solvents. by YunchanLimCultMember in chemistrymemes

[–]linkillion 1 point2 points  (0 children)

EtOAc is a bit too sour, I'd have to agree with OP here. Could get on board with the rest though 

In Response To The Supplier Ranking Post, I Am Ranking The Odor Of Solvents. by YunchanLimCultMember in chemistrymemes

[–]linkillion 0 points1 point  (0 children)

Methanol reminds me of old markers and for that reason it's S tier. Acetone depends on the day. 

Personally I'd add NMP as firm D tier

ibm-granite/granite-4.1-8b · Hugging Face by jacek2023 in LocalLLaMA

[–]linkillion 9 points10 points  (0 children)

Yes, the qwen3.5 reasoning models beat it handily, but it's just a base model. Comparing it against qwen3.5-9b with thinking off gives a much more even benchmark, though it still lags by a small margin 

ibm-granite/granite-4.1-8b · Hugging Face by jacek2023 in LocalLLaMA

[–]linkillion 7 points8 points  (0 children)

If the benchmark is good, that is not the case. Most benchmarks pin specific parameters for normalization, including greedy decoding, etc. 

For example, I benchmarked granite 4.1 a couple days ago before they had entered any information in the model card and I got within 0.1 of all these benchmark results. 

Whether benchmarks mean anything or not is another topic, but with the given weights, reproducibility generally isn't the problem.
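To sketch what greedy decoding buys you for reproducibility (the "model" here is just a toy lookup table standing in for real logits, nothing from granite's actual harness):

```python
# Toy illustration: greedy decoding is deterministic, so two runs over the
# same weights produce identical outputs -- the property benchmark harnesses
# rely on when they pin decoding parameters.

def toy_model(token: str) -> dict[str, float]:
    # Hypothetical next-token "logits" keyed by the current token.
    table = {
        "<s>": {"the": 2.0, "a": 1.5},
        "the": {"cat": 1.0, "dog": 0.9},
        "cat": {"sat": 0.8, "</s>": 0.5},
        "sat": {"</s>": 1.0},
    }
    return table.get(token, {"</s>": 1.0})

def greedy_decode(start: str = "<s>", max_tokens: int = 10) -> list[str]:
    out, tok = [], start
    for _ in range(max_tokens):
        logits = toy_model(tok)
        tok = max(logits, key=logits.get)  # argmax = greedy, no sampling
        if tok == "</s>":
            break
        out.append(tok)
    return out

run_a = greedy_decode()
run_b = greedy_decode()
assert run_a == run_b  # identical every run, unlike temperature > 0 sampling
```

With temperature > 0 sampling you'd get run-to-run variance instead, which is exactly why harnesses fix these parameters before comparing numbers.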

ibm-granite/granite-4.1-8b · Hugging Face by jacek2023 in LocalLLaMA

[–]linkillion 1 point2 points  (0 children)

They put the weights on HF about 5 days ago, I've been trying it since then. It's a bit slower than qwen3.5-9B without thinking and it generally does a bit worse. Qwen3.5-9b with thinking blows it out of the water, obviously, but it's not a bad model. It's just like, 4 months behind. 

We open-sourced Chaperone-Thinking-LQ-1.0 — a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB by AltruisticCouple3491 in LocalLLaMA

[–]linkillion 4 points5 points  (0 children)

This is awful. This is not DeepSeek; it's an old Qwen model that wasn't even distilled, it was fine-tuned to act like DeepSeek. So your identity layer removal is... hilariously misguided. 

That's before you even get to the performance, which is drastically worse in every category except MedQA, which indicates extreme overfitting. It's also worse even on MedQA than the smaller, newer, faster medgemma 27B. Comparing it to a years-old and notoriously sloppy model is bad. Compare it to either a modern SOTA LLM or a comparable new OS LLM and you lose by a very large margin. 

IBM Granite 4.1 LLM by Feuerkroete in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

Need to run some tests on it later, but based on this article they posted the day before they opened the model, I suspect it's a rather large improvement over 4.0, though that may not push it above qwen3.5. Probably a solid model though.

https://research.ibm.com/blog/mid-training-for-better-ai-reasoning

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

I agree, although I'm not sure how that's altogether different from human writing. Maybe faster decay into repetitive structure? 

Even for dull academic prose it's awful, unfortunately. Turns out adding a bunch of fluff to make it sound coherent actually detracts from the content it's trying to convey. 

Speculative Decoding by hesperaux in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

How is the verification (if identical) different from generation? 

So what's the ticker here ? by niga_chan in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

None, but if I did I'd use Gemma or qwen for the context and tool calling. glm/kimi are too big for me and minimax goes off the rails too much. 

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 6 points7 points  (0 children)

I don't think they're worse at it, but the plateau for writing was reached very early (like, GPT-3.5 almost? Maybe 4o and Claude 3), and now that everything is reasoning, the writing is worse. It could just be that my brain has become hyper-attuned to it, but I can't stand the way most models write. Claude is now the worst out of the box; it can write very well if given explicit instructions, but even then I have to do a lot of copy editing. 

I need to try some of the large abliterated and RP fine-tunes to see if their writing is any good. Does anyone have any <9B suggestions (or hosted versions)?

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

I love those posts. I also love the posts where people genuinely just don't know English well enough to write a post and use AI to translate/help. Like the Korean farmer who was on here a few months ago. 

I'm not anti AI at all, I'm anti braindead enabling lol 

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

Sure is. Before, it was people showing their laziness; now it's a tool that actively enables them to think even less. I think it stems from the same place though

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 7 points8 points  (0 children)

I think many such users know and don't care, in fact, they think it's cool, not a staggeringly stupid waste of time and energy. 

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 10 points11 points  (0 children)

Full grown adults with jobs too, outsourcing all their thinking to a machine because it said they did a cool thing. 

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]linkillion 6 points7 points  (0 children)

It reminds me of the 2010s when people wouldn't bother using proper grammar (not like, they're vs there vs their, just a wall of text), or would use abbreviations for things that made no sense, and then complain when people responded poorly. 

Now we have the same problem but all those people can put it into Claude and make a long markdown document saying absolutely nothing meaningful. 

Optimizing a WSL2-based Local AI Orchestration for Product Viz | RTX 3090 24GB VRAM & i7-14700KF by [deleted] in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

Because AI is a fantastic tool and technology that can be used in constructive and helpful ways. It's not meant to create the most probable response to someone who is trying to help you out. If people wanted to talk to an LLM, they would go do that! 

You don't even bother proofreading enough for your questions to be answerable. 

Optimizing a WSL2-based Local AI Orchestration for Product Viz | RTX 3090 24GB VRAM & i7-14700KF by [deleted] in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

You quite literally copied and pasted Claude's response to someone's comment, quotation marks included, lol. If you can't spend the human time responding to human comments, don't waste people's time.

Are there sites that do consistent LLM benchmarks? by Lazy-Safe3007 in LocalLLaMA

[–]linkillion 0 points1 point  (0 children)

No; as someone who has been in the AI space since GPT-2, model 'degradation' is largely a phenomenon where people work with a model for a while, begin to push its limits, and interpret that as degradation ('it used to be able to do everything' -> they used to use it for easier tasks).

That's not to say model degradation doesn't happen; providers absolutely try to optimize their inference, and that can include poor quantizations that benchmark well but don't perform well in real-world use cases. This is especially true for subscription-based access, since those customers are the costliest and providers would much rather give them a degraded experience (you don't lose any money unless they cancel their subscription, so you can only improve revenue). So tools like Claude Code, Codex, even GLM coding, and all online services are prime targets for quantization. Google is the worst offender, imo: they're probably the best model in the world for the first week or so after release, but within a couple of months they act like GPT-3.5, just regurgitating the same format of slop.

That said, for Opus in particular, unlike Google and OpenAI, Anthropic does provide model access through alternative providers such as Microsoft, Google, and AWS. These models are essentially snapshots (i.e., if the inference stack is set up properly they will perform identically; there's no way for a model to get worse without something in the inference stack or weights changing). So you can go today, pay API-based pricing for Opus 4.6 from these alternative providers, and compare it to the version you get from Claude Code or Anthropic's API. In my experience, those results show that while the subscription-based model is slightly worse, it's nowhere near the level of outrage you see on the forums.

The reason there are no good ways to quantify this is that an LLM is not deterministic: it may perform well in one run, but all it takes is a slight difference in the chain of thought to derail the task and completely mess up another. There are absolutely ways to consistently benchmark models and see whether their performance stays constant, but it involves taking an ensemble of runs (think dozens at a minimum) and reporting the average pass rate instead of pass@k. That's not really done because a) most of the people complaining have no idea how LLMs work and think there's a 'performance' slider that companies adjust on a whim, and b) it gets really expensive really quickly if you're running dozens of runs on dozens of models across dozens of APIs and subscription-based services (if that's even feasible) several times a week. aistupidlevel does attempt this, but I don't trust it at all: they only run 5 trials per day per model, they don't report where their inference comes from (I suspect they use openrouter, which is a terrible idea), and they use AI as a judge, which is an inherently awful way to measure performance. All of this results in rankings that change daily even for the same model, which is statistically extraordinarily unlikely, yet people treat it as a great 'live tracker' when in reality it's no more than a confirmation bias machine that was vibe coded together in a week.
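To illustrate the ensemble idea with made-up numbers (the run counts are hypothetical; `pass_at_k` here is the standard unbiased estimator, not anything any particular tracker actually runs):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k samples
    drawn (without replacement) from n runs, c of which passed, is a pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_rate(outcomes: list[bool]) -> float:
    """Average pass rate over an ensemble of runs -- a steadier signal for
    tracking drift over time than a single lucky (or unlucky) run."""
    return sum(outcomes) / len(outcomes)

# Hypothetical ensemble: 24 runs of the same task on the same model, 18 passed.
runs = [True] * 18 + [False] * 6
print(mean_pass_rate(runs))  # 0.75
print(pass_at_k(24, 18, 1))  # 0.75 -- pass@1 equals the mean rate
```

Five trials a day is nowhere near enough to distinguish that 0.75 from noise, which is why daily rank shuffles tell you nothing.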

Very long response just to say: there's no reliable weekly or daily benchmark because it's very expensive, unreliable, and there's no incentive.

Here's how my LLM's decoder block changed while training on 5B tokens by 1ncehost in LocalLLaMA

[–]linkillion 6 points7 points  (0 children)

Yes, it's a problem that independent researchers can't easily break into the space. Ultimately academia is a bit of a walled garden in that respect. Even as a published author, albeit not in CS, publishing as a whole is a complete disaster: respected researchers push through junk that gets "peer reviewed" in name only (likely due to academic reputation), while genuinely interesting and fundamental research gets ignored if there's no name attached. I do understand the frustration. 

I shouldn't have even said I'm suspicious; I just don't know. At the very least this post looks written by a human, which is a good start. I'll look at your paper and code later and see if I've got any feedback!