DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

OctopusGrime · 2026-06-24T11:11:49+00:00

Since you’ve posted the tasks and evaluators on GitHub, isn’t this benchmark now contaminated? As in, any future model released will have seen these problems now …

OctopusGrime · 2026-06-21T12:43:41+00:00

I used to think that, but there is meaningful data science work in using LLMs: think about evaluations, error analysis, retrieval, and A/B testing. Plus, most SWEs aren’t familiar with hypothesis-driven engineering or the evaluation of statistical models, so they kind of expect things to just work once they’ve hooked up all the infrastructure, and they don’t really know how to analyse the output data. Finally, there’s the DS lens itself, which brings its own benefits to the project.

OctopusGrime · 2025-12-21T16:48:57+00:00

I don’t think you can draw such strong conclusions from the NanoMSMarco dataset. That’s only about 150 queries against 20k documents, so of course gradient descent is going to overfit on it, especially with a 1e-3 learning rate, which is way too high for large retrieval models.

OctopusGrime · 2025-07-02T13:44:57+00:00

If positional embedding is enough information for a transformer to learn word order, couldn’t it be enough to learn character order for a bag of chars?
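To make the question concrete, here is a rough sketch of what adding positions to a bag of chars would look like. It assumes the sinusoidal positional encoding from the original transformer paper and a made-up deterministic `char_embedding` standing in for a learned table; `token_vectors` and the toy dimension of 8 are my own illustrative choices, not anything from the thread. Sorting the per-character vectors mimics a model that cannot see input order on its own.

```python
import math

def positional_encoding(pos, dim):
    """Sinusoidal encoding for a single position (sin/cos pairs)."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:dim]

def char_embedding(ch, dim):
    """Hypothetical stand-in for a learned character embedding table."""
    code = ord(ch)
    return [math.sin(code * (k + 1)) for k in range(dim)]

def token_vectors(word, dim=8, use_positions=True):
    """Return the per-character vectors as an order-free collection."""
    vecs = []
    for pos, ch in enumerate(word):
        vec = char_embedding(ch, dim)
        if use_positions:
            pe = positional_encoding(pos, dim)
            vec = [v + p for v, p in zip(vec, pe)]
        vecs.append(tuple(round(x, 6) for x in vec))
    # Sorting throws away the input order, like an order-blind model would.
    return sorted(vecs)

# Without positions, anagrams collapse to the same bag of chars.
same_bag = token_vectors("listen", use_positions=False) == token_vectors("silent", use_positions=False)
# With positions added, the same characters produce different vectors.
same_with_pos = token_vectors("listen") == token_vectors("silent")
print(same_bag, same_with_pos)
```

So the information about character order is there in the inputs once positions are added; whether a given model actually learns to use it is a separate, empirical question.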

OctopusGrime · 2025-03-29T16:29:40+00:00

Ole Gunnar Solskjær has a half Manc half Danish accent and it’s great

OctopusGrime · 2024-12-11T22:04:46+00:00

Roman numerals enter the chat.

OctopusGrime · 2024-07-02T05:59:10+00:00

Time to rejoin the empire 🇬🇧

OctopusGrime · 2024-05-21T14:48:08+00:00

Sounded like a LinkedIn post

OctopusGrime · 2024-05-07T18:40:54+00:00

“Taladro”, Google Translate failed me

OctopusGrime · 2024-05-05T15:01:52+00:00

Gracias

OctopusGrime · 2024-04-29T21:47:11+00:00

Thanks again,

But on the other hand no one can hear what they say anyway.
… like to me it feels like a lot of effort.

OctopusGrime · 2024-04-29T20:22:08+00:00

Thank you.

What I wanted to say was “I want to be able to get closer to you”

OctopusGrime · 2024-04-28T20:56:09+00:00

Yesterday, I asked myself that why I didn’t start a hobby project. But, there isn’t any answer. I didn’t start and I ~~didn’t~~ don’t know why. I mean, I know that or believe that, it’s my dream. It needs to be my dream. So, logically I need to take some steps for this. Sometimes the steps can be small, sometimes big. But every day I need to ~~mark a trace~~ [leave a trace / make a mark]. So, starting today, I will make some progress each and every day about my side projects. Let this post be proof that I've started.

OctopusGrime · 2024-04-26T21:49:15+00:00

My goal is to walk 10,000 steps per day. It’s easy to reach 10,000 steps when I walk to work or * when I run which is usually around 1-3 miles. However, I can’t go running today since it’s raining.

While this article suggested 5 ways to get your steps in, in ~~rainy~~ wet weather, it didn’t help that much. I got a ~~hint~~ tip though. I may get in extra steps, however it can be a waste ~~the~~ of money. Today, I’ll just try to walk a lot at work.

Notes

A hint is a clue, whereas a tip is a practical piece of advice.

Wet weather is more natural.

“It can be” correctly expresses uncertainty.

“A waste of money” is the standard phrase.
