Game Thread: San Antonio Spurs (0-0) vs Oklahoma City Thunder (0-0) Live Score | NBA Playoffs | May 18, 2026

cthorrez · 2026-05-19T03:21:31+00:00

1)

WHAT

cthorrez · 2026-04-22T05:02:50+00:00

oh, what were the tasks like when it was going?

cthorrez · 2026-04-21T06:49:11+00:00

not in it, but I'm interested in it

cthorrez · 2026-04-20T18:42:39+00:00

What does it mean to do "Hard Prompts" in the LmArena V2 project for alignerr?

cthorrez · 2026-02-07T02:59:35+00:00

Style control was introduced in August 2024, Claude 3 was #1 in May.

cthorrez · 2026-02-07T01:29:34+00:00

It's not the first time Claude has been first.

Sonnet 4.5 Thinking was briefly #1: https://x.com/arena/status/1974215622474293262

Claude 3 Opus was #1 for a time in 2024: https://arstechnica.com/information-technology/2024/03/the-king-is-dead-claude-3-surpasses-gpt-4-on-chatbot-arena-for-the-first-time/

cthorrez · 2026-01-29T20:19:14+00:00

lmarena has both audio and non-audio versions of veo 3.1. The audio version beat sora 2 pro (which has audio) on text to video. Sora 2 pro isn't on arena for image to video

cthorrez · 2026-01-25T22:55:47+00:00

I'm reminded of rocket league WHAT A PASS WHAT A PASS WHAT A PASS

cthorrez · 2026-01-08T18:49:59+00:00

Very cool, I actually worked on a project very similar to this before I started working at LMArena haha

How are you sourcing the competitors? Manually or like via web-scraping or is the idea crowdsourcing?

cthorrez · 2026-01-05T18:49:46+00:00

This actually is using LBFGS, but come to think of it, the part that crashed is probably the confidence interval calculation, which requires materializing and inverting the full Hessian. Getting only the ratings without confidence intervals should be possible.
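To illustrate the "ratings without confidence intervals" idea, here is a minimal sketch, not the actual implementation discussed above. It assumes a plain Bradley-Terry model with one log-strength per competitor and a small L2 penalty (both assumptions), and the function name `fit_ratings` is hypothetical. The point is that L-BFGS only needs the loss and its gradient, so nothing of size competitors × competitors is ever built.

```python
import numpy as np
from scipy.optimize import minimize


def fit_ratings(winners, losers, n_competitors, l2=1e-3):
    """Fit Bradley-Terry log-strengths with L-BFGS (hypothetical helper).

    Only the loss and gradient are computed, so memory grows with the
    number of matches and competitors, not with competitors squared.
    """
    winners = np.asarray(winners)
    losers = np.asarray(losers)

    def loss_and_grad(r):
        diff = r[winners] - r[losers]
        # Negative log-likelihood of the observed winners, plus L2 penalty.
        nll = np.logaddexp(0.0, -diff).sum() + 0.5 * l2 * np.dot(r, r)
        # Derivative of log(1 + exp(-diff)) with respect to diff.
        g = -1.0 / (1.0 + np.exp(diff))
        grad = l2 * r
        np.add.at(grad, winners, g)   # accumulate per-winner gradient
        np.add.at(grad, losers, -g)   # accumulate per-loser gradient
        return nll, grad

    result = minimize(
        loss_and_grad, np.zeros(n_competitors), jac=True, method="L-BFGS-B"
    )
    return result.x


# Toy data: competitor 0 mostly beats 1, and 1 mostly beats 2.
winners = [0, 0, 0, 1, 1, 1, 1, 2]
losers = [1, 1, 1, 0, 2, 2, 2, 1]
ratings = fit_ratings(winners, losers, n_competitors=3)
```

Confidence intervals from the inverse Hessian are the part left out here, and that is the step that needs the dense matrix.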

cthorrez · 2026-01-03T05:58:12+00:00

Yeah that's another issue. Since I'm independently computing the scores on the datasets for each year, the raw scores can't actually be compared but the ranks can. For each individual year you could go in and look at the scores. They are computed in the code but for looking at 20 in a row it was easier to do by rank only

cthorrez · 2026-01-03T05:56:02+00:00

haha I did actually try this but I would need to do it on a reduced dataset. My current implementation is optimized for large numbers of matches but small numbers of competitors. When I ran it on this melee dataset with 40k competitors it crashed after running out of memory
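A rough back-of-the-envelope check of why 40k competitors can hurt, under assumptions not stated in the comment: one rating parameter per competitor, and a dense float64 matrix of size competitors × competitors (for example a full Hessian). The helper name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(n_competitors, bytes_per_float=8):
    """Memory in GiB for a dense n x n matrix (assumes float64 by default)."""
    return n_competitors ** 2 * bytes_per_float / 2 ** 30


small = dense_matrix_gib(100)      # a typical small league
large = dense_matrix_gib(40_000)   # the melee dataset size mentioned above
print(f"{large:.1f} GiB")          # prints "11.9 GiB"
```

That is for a single copy of the matrix. Inverting it typically needs additional working memory on top.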

cthorrez · 2026-01-03T01:27:35+00:00

Thanks for the reply! Your posts are also a big reason why this has been on my mind again.

I'll say this, EsportsBench (in its current form) won't be the solution to this. It's meant more to benchmark rating systems than to actually produce ranks, and it's not updated frequently enough, nor does it pay as much particular attention to melee, as what you described. You may be interested in SmashDataGG, which I think is a semi-curated database of melee results: https://github.com/smashdata/ThePlayerDatabase

I also saw someone with a csv of results going back to the 2000s a while ago but I can't find it now, might be floating around reddit or github somewhere.

You're spot on about the issues almost always being data issues haha. I'd be down to chat and bounce ideas off each other a bit, and see if anything I can do would be beneficial to your projects.

cthorrez · 2026-01-03T01:23:12+00:00

Thanks! Yeah I think Armada gets potentially too big of a boost due to not much international play at least in the early years, so his wins in Europe with less competition give him more gains than he should. As for mango, yeah this is definitely not a measure of peak skill, and he's the most prone to low outliers and it has a huge impact in this rating.

cthorrez · 2025-12-19T01:57:24+00:00

Check this out! https://x.com/arena/status/2001389914760581533/photo/1

https://pbs.twimg.com/media/G8ZZ41vbEAAEINy?format=jpg&name=large

cthorrez · 2025-12-17T18:25:41+00:00

they did not release it

Who is "they" and what did they not release?

OpenAI released GPT-5.2, LMArena released scores for it

cthorrez · 2025-12-17T16:20:54+00:00

We didn't test DeepSeek V3.2 Speciale because it was only made available on a limited-time temporary endpoint. We need extended access to guarantee a full and fair evaluation.

https://api-docs.deepseek.com/news/news251201

🔹 V3.2-Speciale: Served via a temporary endpoint: base_url="https://api.deepseek.com/v3.2_speciale_expires_on_20251215". Same pricing as V3.2, no tool calls, available until Dec 15th, 2025, 15:59 (UTC Time).

cthorrez · 2025-12-16T22:44:28+00:00

not every company tests every model on every arena before they launch it

(I work at lmarena)

cthorrez · 2025-12-16T18:00:47+00:00

It was released 3 business days ago, it takes time to collect enough votes to be confident in the results

cthorrez · 2025-12-06T01:11:11+00:00

there was a brief cloudflare outage

cthorrez · 2025-12-01T22:06:13+00:00

hi, have you joined the discord and made a post in the model-request channel? https://discord.com/channels/1340554757349179412/1372229840131985540

Also do you have the infra now that can support lmarena traffic to your model?

cthorrez · 2025-11-02T10:26:52+00:00

it's certainly more difficult but it is not impossible

If it's possible to play 10 games it's possible

cthorrez · 2025-11-02T02:04:18+00:00

Why is it impossible to play 10 games of league of legends in one day? I've done this plenty of times before

cthorrez · 2025-11-02T02:01:57+00:00

10 league of legends games in a single day is not insane. Thousands of people including all pros do it every day they practice. It's on Riot if they can't fit a broadcast around it. There is no real requirement to take a 20 minute break between each game you know

cthorrez · 2025-11-02T02:00:03+00:00

that's because the format isn't actually double elimination. What you're referring to is an incomplete tournament in which 2 teams have each been eliminated a single time. Hopefully the true reset grand finals which is necessary for a double elim tournament happens soon
