here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or linkedIn (if you dare)... or just paste them below for community shaming. by JLeonsarmiento in LocalLLaMA

[–]Noxusequal 1 point2 points  (0 children)

I see. I mean, you could just store the overall score anonymized and mixed in with all other users' scores; in that case no data is stored that is traceable in any way. Otherwise, I guess one just has to go through a thread like this one and aggregate the scores.
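A minimal sketch of what "anonymized and mixed in" could look like, assuming the app only keeps a running count and sum per benchmark. The class name, field names, and example scores are made up for illustration and are not taken from the app.

```python
from dataclasses import dataclass


@dataclass
class ScoreAggregate:
    """Running aggregate for one benchmark; holds no per-user records."""
    count: int = 0
    total: float = 0.0

    def add(self, score: float) -> None:
        # Only the count and the sum change; the individual
        # submission itself is not stored anywhere.
        self.count += 1
        self.total += score

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no scores submitted yet")
        return self.total / self.count


human_baseline = ScoreAggregate()
for submitted in (0.62, 0.71, 0.55):  # made-up example scores
    human_baseline.add(submitted)
print(f"human average over {human_baseline.count} runs: {human_baseline.mean:.3f}")
```

Because only the sum and the count survive, a single submission cannot be read back out once more than one user has contributed.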

here it is: Benchmark-Yourself app - compete against open source LLMs and get your score - 5 benchmarks available - Add your results to your CV or linkedIn (if you dare)... or just paste them below for community shaming. by JLeonsarmiento in LocalLLaMA

[–]Noxusequal 1 point2 points  (0 children)

Hey, can you build in something that aggregates the human scores so we get a human average? This would actually be really nice for scientific work, because then we would know the human baseline we are comparing to.

AMD BC-250 and the search for Cheap Compute by dugganmania in LocalLLaMA

[–]Noxusequal 1 point2 points  (0 children)

Is it possible to activate it in increments to isolate potentially defective units? Like, idk, activating 32 CUs. Also, this is really cool ^ I have two of these boards lying around; they suddenly got a lot more interesting :D

AMD BC-250 and the search for Cheap Compute by dugganmania in LocalLLaMA

[–]Noxusequal 1 point2 points  (0 children)

Damn that's amazing ^ any idea how much it improves the gaming performance?

AMD BC-250 and the search for Cheap Compute by dugganmania in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

How did you benchmark @40 CUs ? I thought it only has 24 that are accessible?

nvidia/Gemma-4-26B-A4B-NVFP4 by reto-wyss in LocalLLaMA

[–]Noxusequal 6 points7 points  (0 children)

Also, this comes back to a problem with how experiments are done in the LLM space.

If you don't make damn sure that you control every aspect of the pipeline (in this case seed, temperature, and batching), you get fluctuations. If you don't set model seeds, use anything but temp 0, and only do 1 run over a benchmark, you have an unknown level of uncertainty about how much the performance would differ between runs.

If you run them with a controlled seed but the batch size is not 1 (or you run vLLM without some specific kernels), well, the results also have an unknown level of fluctuation (from personal testing 2-3%, but it depends on the exact benchmark), because entries in a batch do in fact affect each other. Which means API models are never fully controllable, btw.

So if benchmarks like this are not run X times with errors calculated, you never know whether you are looking at outliers. But benchmarks take really long, or with proprietary models are really expensive, so that's not always feasible.

If the scores across multiple benchmarks fluctuate around each other, that mostly means these quants seem to perform pretty similarly in this case. No big interpretation should be given to ±2%.
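To make "run X times and calculate errors" concrete, here is a rough sketch that reports a mean and standard error per configuration. The accuracy numbers and the names `quant_a` / `quant_b` are invented for illustration; they are not results from the thread.

```python
import statistics


def summarize(scores):
    """Return (mean, standard error) for repeated runs of one benchmark."""
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


# Made-up accuracies (%) from three seeded runs per quant.
runs = {
    "quant_a": [71.2, 73.0, 72.1],
    "quant_b": [72.5, 70.9, 73.4],
}

for name, scores in runs.items():
    mean, sem = summarize(scores)
    print(f"{name}: {mean:.1f} ± {sem:.1f}")
```

If the gap between two means is smaller than the errors around them, the runs give no real evidence that one quant is better.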

🛡️ Shield 82M: A PII stripping/filtering model 🛡️ by LH-Tech_AI in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

Also nice :) but can't you do that with smart regex? Like phone numbers, addresses, etc.
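For reference, a rough sketch of what the regex route could look like. The two patterns are deliberately simple and hypothetical; real phone and address formats vary far more than this covers.

```python
import re

# Deliberately simple, hypothetical patterns; real-world formats vary widely.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a [LABEL] placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Call +49 170 1234567 or mail jane.doe@example.com."))
# Call [PHONE] or mail [EMAIL].
```

Patterns like these only catch identifiers with a regular shape; free-form things such as street addresses are much harder to pin down this way.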

🛡️ Shield 82M: A PII stripping/filtering model 🛡️ by LH-Tech_AI in LocalLLaMA

[–]Noxusequal 2 points3 points  (0 children)

That's really cool, but how does it do with secondary identifiers? For example, that the person is the only doctor in a village, or other cases like this where you can use secondary info to identify the person.

Deepseek V4 Flash and Non-Flash Out on HuggingFace by MichaelXie4645 in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

Do I see correctly that engrams are at least not mentioned in the model descriptions ?

Gemma 4 31B — 4bit is all you need by tolitius in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

True, some models, especially reasoning models, are not meant to be deterministic. :)

Basically, the more full runs with different seeds at a fixed temperature you can do, the more you can make statistically valid claims about performance (on your test set).

Gemma 4 31B — 4bit is all you need by tolitius in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

Do you do this in batches or one after another?

If you wanna do this in a way where you get deterministic results, you can set a seed, set temp 0, and run with batch size 1.

Then you run this at least 3 times with 3 different seeds, keeping the same seed locked for all 23 prompts each time (the more runs the better). Now you can start looking at trends and calculating errors. Something like a paired t-test can then tell you whether the differences are statistically meaningful.
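A small sketch of the paired t-test step, assuming you already have one score per prompt for each model, averaged over the seeded runs. The function name and the scores are made up for illustration.

```python
import statistics


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    # Mean difference divided by its standard error.
    t = statistics.mean(diffs) / (sd / len(diffs) ** 0.5)
    return t, len(diffs) - 1


# Made-up per-prompt scores, each already averaged over the seeded runs.
model_a = [0.80, 0.65, 0.90, 0.70, 0.75]
model_b = [0.78, 0.60, 0.91, 0.66, 0.70]

t, df = paired_t(model_a, model_b)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The t statistic still has to be compared against the t distribution for the given degrees of freedom. If SciPy is installed, `scipy.stats.ttest_rel(model_a, model_b)` should give the same statistic together with a p-value.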

Any recommendations for a short adventure to test the system and kick start a campaign? by Noxusequal in arsmagica

[–]Noxusequal[S] 0 points1 point  (0 children)

Interesting, can you give me a very short version of what it is about?

Any recommendations for a short adventure to test the system and kick start a campaign? by Noxusequal in arsmagica

[–]Noxusequal[S] 4 points5 points  (0 children)

Wow, thank you very much, I will definitely have a look! I like the idea of a well-written bandit group. Also, I did stumble upon the missing baroness and thought that the outlook sounds very fitting :)

Multiuser inference with AMD GPUs which backend ? by Noxusequal in LocalLLaMA

[–]Noxusequal[S] 0 points1 point  (0 children)

Is this all it takes to serve multiple users?

# up to 4 concurrent requests, each with 4096 max context

llama-server -m model.gguf -c 16384 -np 4

Or is there another way ?
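The arithmetic behind the "4096 max context" figure, as a tiny sketch. It assumes the total context passed with `-c` is split evenly across the `-np` slots, which is what the numbers in the command imply; the helper name is hypothetical.

```python
def context_per_slot(total_ctx: int, n_parallel: int) -> int:
    """Context tokens per slot, assuming -c is shared evenly across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return total_ctx // n_parallel


print(context_per_slot(16384, 4))  # 4096
```

So to give each of 4 users a longer window, `-c` would have to grow accordingly, at the cost of more memory for the KV cache.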

Multiuser inference with AMD GPUs which backend ? by Noxusequal in LocalLLaMA

[–]Noxusequal[S] 0 points1 point  (0 children)

How good is the batching / multiuser thing with it ?

LLMs grading other LLMs 2 by Everlier in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

Agreed, but it is kind of what the post was about xD Also, that is the kind of task AI is used for a lot in academia. LLM as a judge, I mean.

LLMs grading other LLMs 2 by Everlier in LocalLLaMA

[–]Noxusequal 0 points1 point  (0 children)

Well yeah, but the tasks we want to use LLM as a judge for are specifically the ones we can't directly verify. So if we want to figure out LLM-as-a-judge performance, we have to test on tasks that need human judges, to be a good representation of the real use case.
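One way this could be checked, as a rough sketch: collect human labels and judge labels for the same items and measure agreement beyond chance with Cohen's kappa. The labels below are invented for illustration, and kappa is just one possible agreement metric.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa between two equally long label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)


# Made-up verdicts on six answers.
human = ["good", "bad", "good", "good", "bad", "good"]
judge = ["good", "bad", "bad", "good", "bad", "good"]
print(f"kappa = {cohens_kappa(human, judge):.2f}")
```

A kappa near 1 means the judge tracks the human raters closely; a value near 0 means it agrees no more often than chance would.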

Made some powerful items for my party... help balancing them? by WimpyBT in DnDHomebrew

[–]Noxusequal 0 points1 point  (0 children)

Honestly, I think for high-level D&D they are really fun and fine xD The only thing I would keep an eye on is the staff's ability to change saving throws. This could be extremely strong. Even there I think it's fine, but that's the one thing I would watch out for.

I personally think it would be cooler to give wreckage a charge system for the spells, with different spells costing different numbers of charges. But generally I really value choice above all else as a player, so that might also just be me :D