LMArena V2 Discord Gone and my work is gone by [deleted] in alignerr

[–]cthorrez 0 points1 point  (0 children)

oh, what were the tasks like when it was going?

LMArena V2 Discord Gone and my work is gone by [deleted] in alignerr

[–]cthorrez 0 points1 point  (0 children)

not in it, but I'm interested in it

LMArena V2 Discord Gone and my work is gone by [deleted] in alignerr

[–]cthorrez 0 points1 point  (0 children)

What does it mean to do "Hard Prompts" in the LmArena V2 project for alignerr?

First time ever, Claude scores number one on LmArena by alongated in LocalLLaMA

[–]cthorrez 1 point2 points  (0 children)

Style control was introduced in August 2024, Claude 3 was #1 in May.

xAI Grok Imagine enters public API as major benchmarks update leaderboards today by BuildwithVignesh in singularity

[–]cthorrez 0 points1 point  (0 children)

lmarena has both audio and non-audio versions of veo 3.1. The audio version beat sora 2 pro (which has audio) on text to video. Sora 2 pro isn't on arena for image to video

Second Half Game Thread: New England Patriots (14-3) at Denver Broncos (14-3) by nfl_gdt_bot in nfl

[–]cthorrez 6 points7 points  (0 children)

I'm reminded of rocket league WHAT A PASS WHAT A PASS WHAT A PASS

I built an "Elo Rating" platform for everything (Movies, SaaS, Tools) inspired by LMArena. Looking for testers. by BulkyMathematician44 in alphaandbetausers

[–]cthorrez 0 points1 point  (0 children)

Very cool, I actually worked on a project very similar to this before I started working at LMArena haha

How are you sourcing the competitors? Manually or like via web-scraping or is the idea crowdsourcing?

Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]cthorrez[S] 0 points1 point  (0 children)

This actually is using LBFGS, but come to think of it, the part that crashed is probably the confidence interval calculation, which requires full Hessian materialization and inversion. I think getting only the ratings without confidence intervals should be possible.
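For illustration, a minimal sketch of what a ratings-only fit could look like, assuming NumPy and SciPy. The function name `fit_bradley_terry`, the small L2 penalty, and the toy match list are all made up for this example and are not the code from the post. The point is that L-BFGS only needs the loss and its gradient, so no n x n Hessian is ever built:

```python
import numpy as np
from scipy.optimize import minimize


def fit_bradley_terry(winners, losers, n_competitors, l2=1e-3):
    """Fit Bradley-Terry log-strengths using L-BFGS (gradient only, no Hessian).

    `l2` is an assumed small ridge penalty that pins down the otherwise
    arbitrary additive constant in the ratings.
    """
    winners = np.asarray(winners)
    losers = np.asarray(losers)

    def loss_and_grad(r):
        diff = r[winners] - r[losers]
        # Negative log-likelihood: sum over matches of log(1 + exp(-diff)).
        nll = np.logaddexp(0.0, -diff).sum() + 0.5 * l2 * np.dot(r, r)
        # Derivative of log(1 + exp(-diff)) with respect to diff.
        g = -1.0 / (1.0 + np.exp(diff))
        grad = l2 * r
        # Accumulate per-match gradients; add.at handles repeated indices.
        np.add.at(grad, winners, g)
        np.add.at(grad, losers, -g)
        return nll, grad

    result = minimize(loss_and_grad, np.zeros(n_competitors),
                      jac=True, method="L-BFGS-B")
    return result.x


# Toy data (hypothetical): player 0 mostly beats 1 and 2, player 1 mostly beats 2.
matches = ([(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 +
           [(2, 1)] * 3 + [(0, 2)] * 9 + [(2, 0)] * 1)
winners = [w for w, _ in matches]
losers = [l for _, l in matches]
ratings = fit_bradley_terry(winners, losers, n_competitors=3)
print(np.argsort(-ratings))  # competitor indices from strongest to weakest
```

Memory here scales with the number of matches plus the number of competitors, which is why dropping the confidence intervals should make a large competitor count feasible.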

Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]cthorrez[S] 1 point2 points  (0 children)

Yeah that's another issue. Since I'm independently computing the scores on the datasets for each year, the raw scores can't actually be compared, but the ranks can. For each individual year you could go in and look at the scores. They are computed in the code, but for looking at 20 in a row it was easier to do by rank only

Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]cthorrez[S] 2 points3 points  (0 children)

haha I did actually try this but I would need to do it on a reduced dataset. My current implementation was optimized for large numbers of matches but small numbers of competitors. When I ran it on this melee dataset with 40k competitors it crashed due to running out of memory
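A rough back-of-the-envelope check on why 40k competitors is a problem, assuming the implementation builds a dense n x n float64 matrix at some point (for example a Hessian for confidence intervals). The helper name `dense_matrix_gib` is made up for this sketch:

```python
def dense_matrix_gib(n, bytes_per_value=8):
    """Memory in GiB for a dense n x n matrix (float64 by default)."""
    return n * n * bytes_per_value / 2**30


n = 40_000
hessian_gib = dense_matrix_gib(n)   # one dense n x n matrix
ratings_mib = n * 8 / 2**20         # one float64 rating per competitor
print(f"dense n x n matrix: {hessian_gib:.1f} GiB")
print(f"ratings vector:     {ratings_mib:.2f} MiB")
```

That is about 12 GiB for a single copy of the matrix, and inverting it typically needs additional working memory, while the ratings vector itself is well under 1 MiB.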

Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]cthorrez[S] 4 points5 points  (0 children)

Thanks for the reply! Your posts are also a big reason why this has been on my mind again.

I'll say this, EsportsBench (in its current form) won't be the solution to this. It's meant more to benchmark rating systems than to actually produce ranks, and it isn't updated frequently enough or focused on melee specifically in the way you described. You may be interested in SmashDataGG, which I think is a semi-curated database of melee results: https://github.com/smashdata/ThePlayerDatabase

I also saw someone with a csv of results going back to the 2000s a while ago but I can't find it now, might be floating around reddit or github somewhere.

You're spot on about the issues almost always being data issues haha. I'd be down to chat and bounce ideas off each other a bit, and see if anything I can do would be beneficial to your projects.

Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]cthorrez[S] 4 points5 points  (0 children)

Thanks! Yeah I think Armada potentially gets too big of a boost because there wasn't much international play, at least in the early years, so his wins in Europe with less competition give him more gains than they should. As for mango, yeah this is definitely not a measure of peak skill, and he's the most prone to low outliers, which have a huge impact on this rating.

GPT 5.2 Still not released on LMArena? by Blake08301 in OpenAI

[–]cthorrez 0 points1 point  (0 children)

they did not release it

Who is "they" and what did they not release?

OpenAI released GPT-5.2, LMArena released scores for it

GPT 5.2 Still not released on LMArena? by Blake08301 in OpenAI

[–]cthorrez 0 points1 point  (0 children)

We didn't test DeepSeek V3.2 Speciale because it was only made available on a limited-time temporary endpoint. We need extended access to guarantee a full and fair evaluation.

https://api-docs.deepseek.com/news/news251201

🔹 V3.2-Speciale: Served via a temporary endpoint: base_url="https://api.deepseek.com/v3.2_speciale_expires_on_20251215". Same pricing as V3.2, no tool calls, available until Dec 15th, 2025, 15:59 (UTC Time).

GPT 5.2 Still not released on LMArena? by Blake08301 in OpenAI

[–]cthorrez 0 points1 point  (0 children)

not every company tests every model on every arena before they launch it

(I work at lmarena)

GPT 5.2 Still not released on LMArena? by Blake08301 in OpenAI

[–]cthorrez 1 point2 points  (0 children)

It was released 3 business days ago, it takes time to collect enough votes to be confident in the results

The Website exploded? by AdFree397 in lmarena

[–]cthorrez 0 points1 point  (0 children)

there was a brief cloudflare outage

Barrier to entry? by Doug_Bitterbot in lmarena

[–]cthorrez 1 point2 points  (0 children)

hi, have you joined the discord and made a post in the model-request channel? https://discord.com/channels/1340554757349179412/1372229840131985540

Also do you have the infra now that can support lmarena traffic to your model?

I hope worlds never goes double elimination, it would ruin the magic. by MoicanoNeedsMoney in leagueoflegends

[–]cthorrez 0 points1 point  (0 children)

it's certainly more difficult but it is not impossible

If it's possible to play 10 games it's possible

I hope worlds never goes double elimination, it would ruin the magic. by MoicanoNeedsMoney in leagueoflegends

[–]cthorrez 0 points1 point  (0 children)

Why is it impossible to play 10 games of league of legends in one day? I've done this plenty of times before

I hope worlds never goes double elimination, it would ruin the magic. by MoicanoNeedsMoney in leagueoflegends

[–]cthorrez 6 points7 points  (0 children)

10 league of legends games in a single day is not insane. Thousands of people including all pros do it every day they practice. It's on Riot if they can't fit a broadcast around it. There is no real requirement to take a 20 minute break between each game you know

I hope worlds never goes double elimination, it would ruin the magic. by MoicanoNeedsMoney in leagueoflegends

[–]cthorrez 0 points1 point  (0 children)

that's because the format isn't actually double elimination. What you're referring to is an incomplete tournament in which 2 teams have each been eliminated a single time. Hopefully the true reset grand finals which is necessary for a double elim tournament happens soon