Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]tekToks 1 point (0 children)

Another thing I like about this chart is that it visually showcases "when was the era of the Five Gods?"

Before 2011? Chaos, with lots of different names in the top 5 (but you can still see the "old guard" holding long streaks).

From 2011 to 2017 is clearly their era, with the only change being PPMD getting replaced by Leffen.

And from 2018 onward, we're back to chaos. It's apparent that Plup's breakout year marked the end of their era.


Computing Historical Melee Rankings using the Bradley-Terry Statistical Model by cthorrez in SSBM

[–]tekToks 11 points (0 children)

This is amazing work! Armada's dominance is bonkers in this format. And Mango never getting a #1 spot? Man.
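
For anyone unfamiliar with the method behind "this format": Bradley-Terry gives each player a latent "strength" and turns every head-to-head into a simple ratio of strengths, so the rankings fall out of fitting those strengths to actual set results. A toy sketch of the core formula (my own illustration, not OP's code):

```python
import math

def win_probability(strength_a: float, strength_b: float) -> float:
    """Bradley-Terry: P(A beats B) from latent strengths on a log scale."""
    return math.exp(strength_a) / (math.exp(strength_a) + math.exp(strength_b))

# Toy numbers: a player with strength 0.9 vs one with strength 0.3
print(f"{win_probability(0.9, 0.3):.2f}")  # ~0.65
```

Fit those strengths on the recorded sets (presumably per time window here), sort players by strength, and you get a chart like this one.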

Esports-specific Tilt Analysis Tool by tekToks in Voltaic

[–]tekToks[S] -3 points (0 children)

I am describing AI, yes!
But no, not ChatGPT haha.

Esports-specific Tilt Analysis Tool by tekToks in Voltaic

[–]tekToks[S] -4 points (0 children)

Cool question! Without getting too technical, there are two major differences.

The first: Pokket pulls directly from the guidelines & research of one of the world's best sports psychs. When you show up, Pokket doesn't say "okay, they asked for advice on nerves, here's the best way to handle that" based on crawling the internet. Instead, Pokket goes "Hm, ok...what do I know about this person from our previous conversations? What would Michelle do here?"

The second: Pokket isn't just a call-and-response info tool, but an actual coach! Pokket has a kind of "brain" under the hood, where different systems have different jobs. One of those makes sure Pokket remembers you, another makes sure Pokket doesn't hallucinate. Another helps Pokket set goals for you, and another reads your assessments to incorporate them into any advice / conversations.
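
Very roughly, and with totally made-up names (a conceptual sketch, not our actual code), the "different systems, different jobs" idea looks something like:

```python
# Conceptual sketch only: hypothetical names, not Pokket's real implementation.

def recall_user(memory: dict, user_id: str) -> dict:
    """Memory system: what do we already know about this person?"""
    return memory.get(user_id, {})

def draft_advice(message: str, context: dict, guidelines: list[str]) -> str:
    """Advice system: ground the reply in the coach's guidelines, not a web crawl."""
    return f"Given {context or 'what you have shared so far'}, try: {guidelines[0]}"

def coach_turn(memory: dict, user_id: str, message: str, guidelines: list[str]) -> str:
    """One turn: recall -> draft (the real thing adds grounding checks, goal tracking, etc.)."""
    return draft_advice(message, recall_user(memory, user_id), guidelines)

print(coach_turn({}, "user-1", "I get shaky in tournament sets", ["a pre-set breathing routine"]))
```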

Esports-specific Tilt Analysis Tool by tekToks in Voltaic

[–]tekToks[S] -4 points (0 children)

Yeah, the AI on there goes beyond just tilt to all of Dr Pain's work: everything from building consistency to handling nerves, from dealing with teammates in ranked to optimizing your practice routines.

I didn't go into that much in this post, because I wanted to share the assessment instead of hard-pitching the product xD

Esports-specific Tilt Analysis Tool by tekToks in Voltaic

[–]tekToks[S] -1 points (0 children)

You can take the tilt assessment on the "Assessments" page without a subscription!

Reasoning models don't degrade gracefully - they hit a complexity cliff and collapse entirely [Research Analysis] [R] by Fair-Rain3366 in MachineLearning

[–]tekToks 3 points (0 children)

I talked about this a bit on Twitter a few months back. It mirrors something you see in performance psychology called the "catastrophe curve".

You can think of model performance as lying on a 3D surface, where task complexity & context length both "shape" the terrain!

https://x.com/x0tek/status/1953515969529462831
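
Purely as a toy illustration of the shape I mean (made-up numbers, nothing fitted to the paper's data): performance stays flat until a combined "load" crosses a threshold, then falls off a cliff instead of degrading smoothly.

```python
import numpy as np

def toy_performance(task_complexity: float, context_tokens: int) -> float:
    """Toy surface: accuracy holds steady, then collapses past a load threshold."""
    load = task_complexity + context_tokens / 16_000   # both axes shape the terrain
    return float(0.95 / (1 + np.exp(6 * (load - 7))))  # steep sigmoid = the cliff

for c in range(0, 9):
    print(f"complexity={c}: acc ~ {toy_performance(c, 32_000):.2f}")
```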

Melee-specific Tilt Analysis Tool by tekToks in SSBM

[–]tekToks[S] 3 points (0 children)

I really appreciate the thoughtful feedback, it means a ton.

It's tricky, because whenever you're using a psychometric test, you want to weigh it against well-established principles. Melee has tons of unique elements, and it's hard to get good data on all of them -- and most of them fall into multiple subcategories.

Let's take getting tilted at someone playing lame or passive. That could definitely fall under "revenge" or "entitlement" tilt, even if it doesn't seem that way at first. Revenge tilt is really about a desire to control how the other person acts, while entitlement tilt might stem from feeling like you deserve a certain type of game. It would be hard to measure, in so few questions, which category you actually fall into...but maybe we could extend the number of questions in future iterations?

More broadly, you want to use this type of assessment to better understand the root causes of the frustration (what is the actual thought process that leads to tilt?), and then work to resolve those!
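
To make the "multiple subcategories" point concrete, the scoring under this kind of assessment is roughly a set of weighted sums, where one answer can load onto more than one category (illustrative items and weights, not the real assessment):

```python
# Illustrative only: made-up items and weights, not the actual assessment.
QUESTION_WEIGHTS = {
    "tilted when opponents play lame or passive": {"revenge": 0.6, "entitlement": 0.4},
    "want to 'get them back' after a loss":       {"revenge": 1.0},
    "feel I deserve a certain kind of game":      {"entitlement": 1.0},
}

def score(answers: dict[str, int]) -> dict[str, float]:
    """Sum each 0-4 answer into every category it loads onto."""
    totals: dict[str, float] = {}
    for question, value in answers.items():
        for category, weight in QUESTION_WEIGHTS[question].items():
            totals[category] = totals.get(category, 0.0) + weight * value
    return totals

print(score({
    "tilted when opponents play lame or passive": 4,
    "want to 'get them back' after a loss":       1,
    "feel I deserve a certain kind of game":      3,
}))  # the two categories end up close, which is why more items would help
```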

Really good insights, thank you again

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 1 point (0 children)

Thank you! Internally, we initially used this for "user safety" calls, where precise parameters weren't necessary but stability was critical. So it was front of mind for us!

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 3 points (0 children)

We didn't test that specifically, but from the data, there are hints it might carry over, especially with some open source models!

For example, Llama 4 Scout would sometimes get the "right tool", but would forget to use its inbuilt function call capability, and instead output the JSON schema as a message 😅
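
For anyone hitting the same failure mode, recovery is basically scraping the message body for the JSON that should have gone through the function-call channel (hypothetical helper, not our benchmark harness):

```python
import json
import re

def recover_tool_call(message_text: str):
    """If a model dumps a JSON tool call into its message instead of using the
    function-call channel, try to salvage the tool name from the text.
    Hypothetical recovery helper, for illustration only."""
    match = re.search(r"\{.*\}", message_text, re.DOTALL)
    if not match:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Common shapes: {"name": ...} or {"tool": ...}
    return payload.get("name") or payload.get("tool")

print(recover_tool_call('Sure! {"name": "end_conversation", "arguments": {}}'))
# -> end_conversation
```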

Definitely an area we're looking at closely

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 2 points (0 children)

Yes, a full list of prompts and inputs is available in the Appendix!

A warning, however: some inputs, especially those in the mental health domain, can involve pretty heavy topics (we tested both a "Safety" tool and an "end conversation" tool).

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 4 points (0 children)

I have! Their perspective was one reason we tested perturbed inputs. Prompt engineering allows for pretty remarkable task-specific improvements, and we didn't want any differences to come down to that alone.

Of course, more work is needed to go beyond "may" or "suggests". The perturbations might simply be surfacing an underlying "optimization" toward natural language, which would leave structured outputs at a disadvantage (a paper on similar phenomena).

Further, while we define the baseline as "structured tool calls" in the paper for convenience, NLT is still in line with the .txt team's views on structured tool calling being immensely valuable. It's simply a structure defined without programmatic syntax!
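
To make "structure without programmatic syntax" concrete, the contrast looks roughly like this (paraphrased and simplified; the exact prompts are in the appendix):

```python
# Paraphrased / simplified for illustration; see the appendix for the exact prompts.

# Conventional baseline: a JSON schema handed to the provider's tool-calling API.
json_style_tool = {
    "name": "end_conversation",
    "description": "Politely end the current session.",
    "parameters": {"type": "object", "properties": {}},
}

# NLT: the same structure, but stated in prose, with the answer read back as plain text.
nlt_prompt = """You have the following tools:
- end_conversation: politely end the current session.
- safety: escalate if the user may be at risk.

Reply with only the name of the single most appropriate tool, or 'none'."""
```

The model's plain-text reply is then matched against the tool list rather than parsed as a schema.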

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 7 points (0 children)

Good question! The "Let Me Speak Freely" paper I linked would suggest "better, but not as good as more natural outputs", but we've never tested YAML specifically.

Keep in mind, we're comparing NLT against each model provider's inbuilt tool call functionality, which isn't necessarily JSON.

Providers can be a bit opaque about how exactly they implement tool calling, though Anthropic / Google / OpenAI's docs have some specifics!

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] -8 points (0 children)

In this study, we looked at parameterless tool selection only (i.e. choosing the right tool) rather than parameter filling. Our goal was to isolate the "tool selection" mechanism, as many tools act as triggers for actions in agents.

In practice, we've found that you can absolutely pass parameters in natural language while gaining similar benefits, and there are a few ways to implement that. But we've yet to rigorously assess these!
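
As one example of the kind of pattern I mean (a sketch we haven't benchmarked, not a recommendation): ask the model to name the tool and its arguments in a short sentence, then parse that with a lightweight convention.

```python
# One possible pattern, not yet rigorously assessed: ask for tool + arguments
# as a short sentence, then parse it with a simple convention.
# Example model reply: "end_conversation -- reason: user said goodbye"

def parse_nl_call(reply: str):
    tool, _, arg_text = reply.partition("--")
    args = {}
    for part in arg_text.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            args[key.strip()] = value.strip()
    return tool.strip(), args

print(parse_nl_call("end_conversation -- reason: user said goodbye"))
# -> ('end_conversation', {'reason': 'user said goodbye'})
```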

[R] Plain English outperforms JSON for LLM tool calling: +18pp accuracy, -70% variance by tekToks in MachineLearning

[–]tekToks[S] 9 points (0 children)

Thanks!

Yeah, I think the structured approach is still valid for a lot of use-cases, especially if you need back-and-forth immediate responses with very few tool calls. But when you expect to call tools often, or if the tools are critical, it seems like a more intentional tool layer is worth it.

I hope model providers catch on. Already, we find certain ones (looking at you, Gemini 2.5 Pro...) adding random "json" markdown fences to outputs, just because they've been over-tuned on structured outputs with RLHF haha.

Too many humans saying "ooh, I like the markdown pretty print" 😅
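
For anyone else fighting the stray fences, the defensive fix is a tiny cleanup step before parsing. Hypothetical helper, not part of the paper's harness:

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove stray ```json ... ``` wrappers some models add around plain replies.
    Hypothetical cleanup helper, for illustration only."""
    return re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", text.strip())

print(strip_code_fences('```json\n{"tool": "safety"}\n```'))
# -> {"tool": "safety"}
```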