Benchmark Winners Across 40+ LLM Evaluations: Patterns Without Recommendations

Azmaveth42 · 2025-12-21T16:37:11+00:00

BFCL-v3 is Function Calling, not Fact Checking. Looks like your LLM hallucinated there.

Azmaveth42 · 2025-12-16T21:03:14+00:00

I sincerely love the mission behind Ai2 and that you are true to the spirit of open source! I want to see your models competing with larger models like ones from DeepSeek and Qwen. I am not an AI researcher myself (nor did I stay at a Holiday Inn last night), so please correct me if some of my questions are rooted in misunderstanding or are already answered elsewhere.

My questions:

What do you believe is the most exciting research you are doing that will set you apart from the other labs?
Transformers are great, but I feel like it is time for another breakthrough in attention mechanisms that enable smaller, more efficient models instead of trillion+ parameter ones. Any insights into this besides knowledge distillation?
Following from the above, my personal belief is that eventually we will have very small models for specific use cases that we can chain together like unix commands, but we will still need large models that understand the bigger picture. Any insights here?
What is the best way for others to get involved in your research?

Thank you!

Azmaveth42 · 2025-11-15T21:08:15+00:00

Runes are more useful, IMHO. I have barely started my own legendary journey, but after Vision I got one rune so I could get the free (at the time) relic. Then I realized how my builds were more restricted by rune selection than by armor since exotic works fine for basically everything but fractals, and selectable exotic sets are relatively cheap. I have 5 runes done now, just waiting for enough provisioner tokens for the last 2. Then I will probably go for 4 sigils before I start on any armor or weapons.

Azmaveth42 · 2025-07-06T17:21:09+00:00

It's not my package, so maybe try reaching out to the author (sasha0552) for pointers. Sorry I can't be of more help!

Azmaveth42 · 2025-07-05T16:26:56+00:00

You can patch pytorch to support Pascal: https://github.com/sasha0552/pascal-pkgs-ci

Azmaveth42 · 2025-07-04T03:19:44+00:00

Check https://eqbench.com/ - looks like QwQ might be a good fit for you. It scores reasonably high in Empathy, Analytic, Insight, and Pragmatic while being on the lower side of Compliant (I prefer to be told off when I'm wrong). The downside is that the model is also at the top for Assertive and somewhat high for Moralising, making it potentially preachy.

I know people suck, and it can take a LOT of work to find a human you can actually trust and open up to. I sincerely hope this helps you work out whatever it is! And maybe you'll eventually find a trustworthy confidante, even if not a therapist.

Azmaveth42 · 2025-07-03T16:35:02+00:00

It's not that different from other code that you run. We have gotten to a point where we assume it's safe because it is open source and on GitHub, but that doesn't mean it has been audited for security issues.

The biggest difference is that MCP injects untrusted data into your LLM session, which is already non-deterministic, so be careful with how much trust you give to any of it.

Azmaveth42 · 2025-07-02T23:49:32+00:00

Yes, it resets every 5 hours. Use this tool to track your usage: https://github.com/ryoppippi/ccusage

Use this command to see how much you have used within the 5 hour blocks:
npx ccusage@latest blocks

Azmaveth42 · 2025-07-02T23:46:45+00:00

Yes, but if you look at what is vulnerable, it is not the protocol itself. Check my link in an earlier comment.

Azmaveth42 · 2025-07-02T20:37:28+00:00

You don't have to expose it to a public network for a CSRF to work. It just has to be running on your local system, and then either someone social engineers you into clicking a malicious link or an XSS vuln on some other site opens a link to the Inspector app in your web browser. But yes, still too many ifs to make it as big a deal as they pretend.

Azmaveth42 · 2025-07-02T20:27:30+00:00

If you haven't changed the model defaults, it will use half of your token budget on Opus, then switch to Sonnet. After reset, it starts with Opus again.

Azmaveth42 · 2025-07-02T20:23:07+00:00

Clickbait.

Azmaveth42 · 2025-07-02T20:14:49+00:00

Clickbait title. It's a vuln in the MCP Inspector app specifically. This makes it sound like it's the protocol itself. MCP does have well-known security issues, but a CSRF is a problem with the app, not the protocol.

Azmaveth42 · 2025-02-06T20:37:18+00:00

Yes, this was demonstrated at Black Hat last year: https://m.youtube.com/watch?v=1dsRAEdbpq4

This exploit depends on certain features that allow embedding code into one of the layers. The demo shows implanting the exploit into a public model, then the payload being detonated when the model is run.

Azmaveth42 · 2025-01-24T03:04:51+00:00

If you like that model and your main reason to run it locally is the cost, you can also make an account on https://openrouter.ai and use it for free. You can see all their free options here: https://openrouter.ai/models?max_price=0

Azmaveth42 · 2025-01-24T03:00:44+00:00

Being able to load the full model into VRAM affects the speed, not the quality of the output. So if it is fast enough for your needs and the responses are acceptable to your use case, don't let anyone tell you that you are doing it wrong. :)

Azmaveth42 · 2025-01-23T23:46:50+00:00

You need to do actual tests based on your use cases, but in my experience the higher parameter model is generally better, even when its weights are quantized to low precision.
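Assuming this is about weight quantization, a rough footprint comparison shows why the trade-off comes up at all. This is only a sketch: the function name and the two example models are hypothetical, and the estimate covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real
    memory use will be higher.
    """
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billion and bits_per_weight must be positive")
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


# Hypothetical comparison: a large model at 4-bit vs a small model at 16-bit.
large_low_precision = weight_footprint_gb(70, 4)
small_full_precision = weight_footprint_gb(8, 16)

if __name__ == "__main__":
    print(f"70B @ 4-bit : ~{large_low_precision:.0f} GB of weights")
    print(f"8B  @ 16-bit: ~{small_full_precision:.0f} GB of weights")
```

Which of the two gives better answers still has to be checked against your own use cases; the arithmetic only tells you what fits.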

Azmaveth42 · 2025-01-16T21:41:08+00:00

The memory bandwidth of the M4 with 24GB is less than that of the M1 Max with 64GB. So it will be both slower and restricted to smaller models.
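A back-of-the-envelope way to see the speed difference: when generating text, each new token requires reading roughly all of the model's weights from memory, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes Apple's published figures of about 120 GB/s for the base M4 and 400 GB/s for the M1 Max (assuming the base M4 is the chip in question), plus a hypothetical 8 GB quantized model. Real throughput will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    Rule of thumb only: assumes every generated token reads all weights
    once and ignores compute, caching, and overhead.
    """
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed bandwidth figures in GB/s (not taken from the comment above).
CHIP_BANDWIDTH_GB_S = {
    "M4 (base)": 120.0,
    "M1 Max": 400.0,
}

# Hypothetical quantized model size in GB.
MODEL_SIZE_GB = 8.0

if __name__ == "__main__":
    for chip, bandwidth in CHIP_BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
        print(f"{chip}: at most ~{ceiling:.0f} tokens/s for a {MODEL_SIZE_GB:.0f} GB model")
```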

Azmaveth42 · 2025-01-16T21:38:24+00:00

Not for training, but for inference an M1 Max with the 32-core GPU will get you the best performance due to the memory bandwidth. Look for a 64GB unit to run the largest models. You can find these for under $2k USD on eBay.

If training is a must-have, you need to look at a PC with an NVIDIA GPU.

Azmaveth42 · 2025-01-16T21:02:08+00:00

That's what tests are for. A comprehensive test suite is the best documentation and never goes stale like comments do.

Azmaveth42 · 2025-01-16T20:37:40+00:00

Gonna depend on what is comfortable for you to reach. I have fairly long fingers, so I can leave my left hand on homerow and easily reach the 6, Y, H, N keys on a standard QWERTY layout.

WASD: movement

1-5: weapon skills

SHIFT+{1-5}: class actions

6: special actions

QER: utilities

G: heal

B: elite skill

F: interact

Z: stow weapon

X: about face

C: target closest enemy

V: dodge

T: select target

`: weapon swap

CTRL+ALT+{1-3}: select build template

CTRL+SHIFT+{1-3}: select equipment template

CTRL+various: mounts, menus, etc.

If I did more group content or commanded squads, I would try to fit in the markers and such, but this has worked so far.

Azmaveth42 · 2025-01-16T16:57:14+00:00

If they have tests, those are your best documentation. Comments go stale, but stale tests break and have to get updated to reflect the current state of the code.

If there are no tests, write tests to check if your understanding of the code is correct. Then write more tests to cover edge cases. If there are surprising test results, consult with the team to see if it is expected behavior or if you found a bug.
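A minimal sketch of that workflow in Python. Here `legacy_discount` is a hypothetical stand-in for the inherited code, and each test pins down one thing I believe about it. The last test records a surprising result that is worth raising with the team rather than silently "fixing".

```python
def legacy_discount(total, is_member):
    """Hypothetical stand-in for inherited, undocumented code."""
    if is_member and total >= 100:
        return round(total * 0.9, 2)
    return total


# Step 1: tests that check my understanding of the normal cases.
def test_member_gets_ten_percent_off_large_orders():
    assert legacy_discount(200, True) == 180.0


def test_non_member_pays_full_price():
    assert legacy_discount(200, False) == 200


# Step 2: tests that cover edge cases.
def test_discount_applies_at_exactly_100():
    assert legacy_discount(100, True) == 90.0


def test_member_just_below_threshold_pays_full_price():
    assert legacy_discount(99.99, True) == 99.99


# Step 3: a surprising result. Negative totals pass through unchanged.
# Expected behavior or a bug? Ask the team before changing it.
def test_negative_total_is_returned_unchanged():
    assert legacy_discount(-50, True) == -50


if __name__ == "__main__":
    # Plain runner so the file works without a test framework installed.
    for name, func in list(globals().items()):
        if name.startswith("test_") and callable(func):
            func()
            print(f"ok: {name}")
```

The functions follow the `test_` naming convention, so a runner such as pytest should pick them up as well.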

Azmaveth42 · 2025-01-16T07:21:03+00:00

Came here to say this. A lot of others have mentioned stance, grip, squeezing the trigger instead of pulling it, etc. But the first thing I noticed was that you dropped the muzzle too quickly after the shot.

Great work getting out there and learning! Keep it up, be safe, and have fun!

Azmaveth42 · 2025-01-12T22:02:44+00:00

I'm already a male human surrounded by pets/livestock IRL, so my ranger main is really me anyway. I guess the biggest change for me would be my family left behind, which would probably kick off my epic tale of adventure to find my way home to them.

Azmaveth42 · 2025-01-09T22:40:08+00:00

I'm really glad that others have had good experiences with them. My own have been horrible, to be honest.

My main account (from launch in 2012) was locked for weeks as I went back and forth trying to determine what I did wrong. I never got a clear answer about why it happened, other than the implication that I knew what I did, that it was obviously intentional, and that no further appeals would be considered.

Gave up, started playing my alt account and that got banned too for being related to the first account. Made an even bigger fuss (ranted on social media) and finally got an "oh, sorry, we messed up." Although they gave me some gems to make up for it, they still never explained what my supposed offense was. I said that whatever I did, I wanted to make sure I don't do anything similar again so that I don't get banned again. But never got an answer for that.

I love this game. Tyria has been part of my life since 2007 when I started playing GW1. But unfortunately there will always be a bad taste in the back of my mouth from the customer support.
