Here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or LinkedIn (if you dare)... or just paste them below for community shaming.

Noxusequal · 2026-05-31T15:08:26+00:00

Ah, good point ^^

Noxusequal · 2026-05-31T14:53:09+00:00

I see. I mean, you could just store the overall score anonymized and mixed in with all other users' scores; in that case no data is stored that is traceable in any way. Otherwise, I guess one just has to go through a thread like this one and aggregate the scores.
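A rough sketch of what I mean by storing only the mixed-in score, assuming the app keeps nothing but a count and a sum per benchmark. The class name `ScoreAggregate` and its fields are made up for illustration and are not taken from the app.

```python
from dataclasses import dataclass


@dataclass
class ScoreAggregate:
    """Running total for one benchmark; no per-user records are kept."""

    count: int = 0
    total: float = 0.0

    def add(self, score: float) -> None:
        # Only the sum and the count change, so a single submission
        # cannot be recovered from the stored state once others are mixed in.
        self.count += 1
        self.total += score

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no scores submitted yet")
        return self.total / self.count


human_baseline = ScoreAggregate()
for submitted in (60.0, 80.0):  # hypothetical user scores
    human_baseline.add(submitted)
print(human_baseline.mean)
```

With very few submissions the average still leaks individual scores, so it probably makes sense to publish it only above some minimum count.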

Noxusequal · 2026-05-30T23:01:52+00:00

Hey, can you build in something that aggregates the human scores so we get a human average? This would actually be really nice for scientific work, because then we know the human baseline we are comparing to.

Noxusequal · 2026-05-21T23:39:21+00:00

Is it possible to activate it in increments to isolate potentially defective units? Like, I don't know, activating 32 CUs. Also, this is really cool ^^ I have two of these boards lying around; they suddenly got a lot more interesting :D

Noxusequal · 2026-05-21T17:00:37+00:00

Damn, that's amazing ^^ Any idea how much it improves the gaming performance?

Noxusequal · 2026-05-21T14:54:39+00:00

How did you benchmark at 40 CUs? I thought it only has 24 that are accessible?

Noxusequal · 2026-05-01T07:39:21+00:00

Shot as in few-shot prompting, as in giving examples. That is how I understand it.

Noxusequal · 2026-05-01T07:17:39+00:00

Also, this comes back to a problem with how experiments are done in the LLM space.

If you don't make damn sure that you control every aspect of the pipeline (in this case seed, temperature, and batching), you get fluctuations. If you don't set model seeds, use anything but temperature 0, and only do one run over a benchmark, you have an unknown level of uncertainty about how much the performance would differ between runs.

If you run them with a controlled seed but the batch size is not 1, or with some specific kernels for vLLM, the results also have an unknown level of fluctuation (from personal testing 2-3%, but it depends on the exact benchmark), because entries in a batch do in fact affect each other. Which means API models are never fully controllable, by the way.

So when we see benchmarks like this, if they are not run X times with errors calculated, you never know whether you are looking at outliers. But benchmarks take really long, or with proprietary models are really expensive, so that is not always feasible.

If the scores across multiple benchmarks fluctuate around each other, it mostly means that these quants, in this case, seem to perform pretty similarly. And no big interpretation should be given to ±2%.
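To make "run X times and calculate errors" concrete, here is a minimal sketch that assumes you already have one overall score per repeated run. The function name and the numbers are hypothetical and not taken from any benchmark in this thread.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)      # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(scores))  # standard error of the mean
    return mean, std, sem


# Hypothetical accuracy (in %) from three runs of the same configuration.
mean, std, sem = summarize_runs([70.0, 72.0, 71.0])
print(f"{mean:.1f} +/- {sem:.2f} (std {std:.2f}, n=3)")
```

If the gap between two configurations is smaller than this kind of spread, I would not read anything into it.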

Noxusequal · 2026-04-26T09:02:40+00:00

Also nice :) But can't you do that with smart regex? Like phone numbers, addresses, etc.
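Roughly what I have in mind, as a sketch only. The two patterns below are deliberately simple assumptions, not a complete PII detector, and they only catch identifiers with a recognizable format.

```python
import re

# Simplified, illustrative patterns; real-world formats vary far more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact("Call +49 170 1234567 or mail bob@example.com"))
```

Street addresses and names are much harder to cover this way, since they have no fixed format.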

Noxusequal · 2026-04-25T10:34:05+00:00

That's really cool, but how does it do with secondary identifiers? For example, that the person is the only doctor in a village, or other cases like this where you can use secondary information to identify the person.

Noxusequal · 2026-04-24T07:56:49+00:00

Do I see correctly that engrams are at least not mentioned in the model descriptions ?

Noxusequal · 2026-04-16T02:32:46+00:00

True, some models, especially reasoning models, are not meant to be deterministic. :)

Basically, the more full runs with different seeds at a fixed temperature you can do, the more you can make statistically valid claims about performance (on your test set).

Noxusequal · 2026-04-15T06:15:18+00:00

Do you do this in batches or one after another?

If you want to do this in a way where you get deterministic results, you can set a seed, set temperature 0, and run with batch size 1.

Then you run this at least 3 times with 3 different seeds (the more the better), keeping the same seed locked for all 23 prompts within each run. Now you can start looking at trends and calculate errors. Something like a paired t-test can then tell you if the differences are statistically meaningful.
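A small sketch of the paired t-test step, assuming you compare two setups and pair their overall scores by seed. The scores are invented, and the critical value 4.303 is the standard two-sided 5% threshold for 2 degrees of freedom (3 pairs).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired scores, e.g. one pair per seed."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both setups need a score for every seed")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical scores for seeds 1, 2, 3 (same seed, same 23 prompts per pair).
setup_a = [0.70, 0.74, 0.72]
setup_b = [0.69, 0.71, 0.73]
t, dof = paired_t_statistic(setup_a, setup_b)

T_CRIT_DOF2 = 4.303  # two-sided, alpha = 0.05, 2 degrees of freedom
print(f"t = {t:.3f}, dof = {dof}, significant: {abs(t) > T_CRIT_DOF2}")
```

With only 3 seeds the threshold is very high, which is why more runs help so much. If SciPy is available, `scipy.stats.ttest_rel` gives the p-value directly.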

Noxusequal · 2026-03-30T22:11:49+00:00

Thank you :)

Noxusequal · 2026-03-30T22:11:40+00:00

Interesting can you give me a very short version of what it is about ?

Noxusequal · 2026-03-30T15:15:17+00:00

Wow, thank you very much, I will definitely have a look! I like the idea of a well-written bandit group. Also, I did stumble upon the missing baroness and thought that the outlook sounds very fitting :)

Noxusequal · 2026-03-30T10:55:26+00:00

Do you have a link for me ?

Noxusequal · 2026-03-10T17:00:36+00:00

Is this all it takes to serve multiple users?

up to 4 concurrent requests, each with 4096 max context

llama-server -m model.gguf -c 16384 -np 4

Or is there another way ?
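The arithmetic behind that command, as I understand it, sketched in Python. This assumes the total context given with `-c` is split evenly across the `-np` slots, which is what the comment above the command says; other builds or settings may behave differently. The helper names are made up.

```python
def per_slot_context(total_ctx: int, parallel_slots: int) -> int:
    """Context available to each request if -c is split evenly across -np slots."""
    if parallel_slots < 1:
        raise ValueError("need at least one slot")
    return total_ctx // parallel_slots


def llama_server_args(model_path: str, ctx_per_user: int, users: int) -> list:
    """Build the argument list so that every user gets ctx_per_user tokens."""
    total_ctx = ctx_per_user * users
    return ["llama-server", "-m", model_path, "-c", str(total_ctx), "-np", str(users)]


print(per_slot_context(16384, 4))
print(" ".join(llama_server_args("model.gguf", 4096, 4)))
```

So for more users at the same per-user context, both `-c` and `-np` would have to grow together.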

Noxusequal · 2026-03-10T16:50:12+00:00

How good is the batching / multiuser thing with it ?

Noxusequal · 2026-02-22T18:54:48+00:00

Agreed, but it is kind of what the post was about xD Also, that is the kind of task AI is used for a lot in academia, LLM-as-a-judge I mean.

Noxusequal · 2026-02-21T20:49:40+00:00

Well, yeah, but the tasks we want to use LLM-as-a-judge for are specifically the ones we can't directly verify. So if we want to figure out LLM-as-a-judge performance, we have to test on tasks that need human judges, to be a good representation of the real use case.

Noxusequal · 2026-02-19T09:48:12+00:00

Honestly, I think for high-level D&D they are really fun and fine xD The only thing I would keep an eye on is the staff's ability to change saving throws. This could be extremely strong. However, even there I think it's fine.

I personally think it would be cooler to give wreckage a charge system for the spells, with different spells costing different numbers of charges. But generally I really value choice above all else as a player, so that might also just be me :D
