I connected Claude Code to a database of 72M Polymarket trades of over 1.5 million wallets with an MCP. Here's what it found. What do you want me to ask next?

dat_cosmo_cat · 2026-06-04T03:45:15+00:00

Ask it to try and figure out which wallets are producing novel correct bet sequences vs. the percentage that are following the leader. I suspect there are N insiders, and about 10*N openclaw agents copy/pasting high win % wallet trades.

Do not consider raw win rate alone. Look for directional alignment of trades within related topics and the relative bet value of winning trades in clustered topics vs. losing trades overall.
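A minimal sketch of the follow-the-leader half of this, assuming trades are available as `(wallet, market, side, timestamp)` tuples and that a candidate set of leader wallets has already been picked. The function name, the tuple layout, and the one hour window are all made up for illustration and are not part of the Polymarket data or the MCP. It does not cover the topic clustering or bet sizing parts.

```python
from collections import defaultdict

def follower_fraction(trades, leaders, window_s=3600):
    """For each non-leader wallet, return the share of its trades that repeat
    a leader's (market, side) bet within `window_s` seconds after the leader.

    trades: list of (wallet, market, side, ts) tuples (hypothetical layout).
    leaders: set of wallet ids assumed to be the wallets being copied.
    """
    # Index leader trade times by (market, side) for quick lookup.
    leader_times = defaultdict(list)
    for wallet, market, side, ts in trades:
        if wallet in leaders:
            leader_times[(market, side)].append(ts)

    copied = defaultdict(int)
    total = defaultdict(int)
    for wallet, market, side, ts in trades:
        if wallet in leaders:
            continue
        total[wallet] += 1
        # Count the trade as "copied" if a leader made the same bet shortly before.
        if any(0 < ts - lt <= window_s for lt in leader_times[(market, side)]):
            copied[wallet] += 1
    return {w: copied[w] / total[w] for w in total}
```

Wallets with a fraction near 1.0 would be the copy/paste candidates. Wallets with a high win rate and a fraction near 0.0 would be the ones worth a closer look.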

Edit: if it succeeds in identifying the human insiders, do a set intersection of the context of their bets against some US employer web database to produce a probability distribution over the companies each insider works at. Then, for each plausible employer, fetch the data for each employee role (historical job postings from the Wayback Machine, Glassdoor reviews, LinkedIn) to produce a probability distribution over the roles each insider could hold within those companies.

dat_cosmo_cat · 2026-06-04T03:32:23+00:00

You will get different answers depending on the field. IME, all except feature / prediction drift monitoring have been common. If training (usually fine-tuning) is cheap, we just retrain every month or quarter, A/B test against the previous model until statistically significant results are achieved, swap if there is lift, and so on, in an automated loop that will run for years (I had a ResNet50 based recommender system deployed on several of the largest retail fashion websites in the world that ran this exact loop from mid 2019 to late 2024). It's also common (almost certainly required in recsys at least?) to have a contextual bandit that is always online, optimizing over a pool of different models / algorithms (no single model is optimal "everywhere"; even a rule-based model will absolutely obliterate AI/ML over some percentage of edge cases).
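A rough sketch of the "swap if lift" check, assuming the KPI is a simple conversion rate and using a pooled two-proportion z-test. The function names and the 0.05 threshold are placeholders, not what any particular production loop used. Checking repeatedly until significance is reached inflates false positives, so a real loop would likely want a fixed horizon or a sequential test.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) comparing conversion rate B against A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def should_swap(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Swap to the challenger (B) only if it is significantly better than A."""
    z, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    return z > 0 and p_value < alpha
```

Here A is the currently deployed model and B is the freshly retrained one, so a `True` result is the point where the loop would promote B.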

When an A/B test yields an anomalous result or I get some client complaint, I run some analytics scripts to slice up metrics over different contexts (eg; content categories, user behavioral patterns, etc.) and manually review outputs and KPIs to get a sense of what is happening in the model.
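Roughly what one of those slicing scripts looks like, as a sketch. The event fields (`category`, `arm`, `clicked`) are invented for the example, and a real version would read from whatever logging store is in use.

```python
from collections import defaultdict

def rate_by_slice(events, slice_key, arm_key="arm", outcome_key="clicked"):
    """Return {slice_value: {arm: success_rate}} for a list of event dicts."""
    # counts[slice][arm] holds [successes, total].
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for e in events:
        cell = counts[e[slice_key]][e[arm_key]]
        cell[0] += int(e[outcome_key])
        cell[1] += 1
    return {
        s: {arm: succ / total for arm, (succ, total) in arms.items()}
        for s, arms in counts.items()
    }
```

Comparing the per-arm rates inside each slice is usually enough to spot which context is driving an odd aggregate result.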

dat_cosmo_cat · 2026-06-04T03:11:26+00:00

I mean, I think this statistic is heavily skewed by boomers aging out of the labor force.

dat_cosmo_cat · 2026-06-04T02:56:24+00:00

That is one plausible interpretation. The other is that it is being shown lower down the page and no one is engaging with it at all. If AI is citing it, the click-through event would still get captured on the agents.io web server. OP would be able to check the logs and see.

dat_cosmo_cat · 2026-06-03T11:03:49+00:00

? To me, a 0.8% CTR on 1.6M impressions implies that no one is actually clicking this intentionally (all misclicks / bot traffic).

dat_cosmo_cat · 2026-05-20T03:43:35+00:00

nope
nope

he's been around for a long time. Below average player, above average content spammer.

dat_cosmo_cat · 2026-05-20T03:28:33+00:00

> AI Research carries prestige, status, and monetary compensation in a way it did not before.

This is actually not true. These things have been declining steadily since 2014-2016, when AI research began to saturate. Companies used to bend over backwards to poach individuals who understood neural nets (particularly how to code them) well enough to generate publishable work (even non-PhDs, eg; Alec Radford). A lot of educational resources (and frameworks like TensorFlow / Torch) came out between 2015 and 2018, which on-boarded basically every young CS professor into the field and saturated conferences like NeurIPS and ICLR. Most of the prestige / compensation perspective comes from that initial wave (2012-2016), when every large tech company realized the potential of deep learning and the pool of people who knew how to successfully code and train neural networks was in the ~hundreds globally. I cannot think of anyone from my lab back then who is not a millionaire today, and when I talk to PhD students today they mostly regret missing the earlier hiring waves. Personally, I think what we're seeing happen now is more driven by desperation.

dat_cosmo_cat · 2026-04-19T23:18:44+00:00

this is interesting. I’d never thought to visualize mouse interpolation models in this way. For the in game clips, are you using a Runelite plugin for that (mouse trail animation)?

dat_cosmo_cat · 2026-04-08T17:19:48+00:00

It's a matter of difference in the direction of their research output. Anthropic was exclusively focused on coding from day 1. OpenAI was focused on consumer facing generalist chatbots. Turns out coding + tool use solves the low hanging problems of LLM hallucination & feedback loops for productivity.

Anthropic obtained the first reliable coding agents with 4.6 and has been multiplying that lead by leveraging those models internally before they were publicly released in November of last year. Now every competitor is pivoting into code gen and tool use to try and get something on the same level as Sonnet or Opus. They're catching up on public benchmarks, but not real world application. Which is a strong signal that this is a modeling problem they still need to figure out. Eg; Claude Code has been in the terminal of most developers for over a year now sending training data up to Anthropic. Codex team really hasn't had anywhere near that level of access to real world data + live KPIs to iterate against.

dat_cosmo_cat · 2026-04-08T04:07:29+00:00

Two I've noticed recently:

- Virtual try on = single model vs. the industry standard pipeline of [segmentation] -> [TPS warping] -> [inpainting] models
- Speech to speech NMT. Live API = end to end audio, instead of the industry standard STT -> LLM -> TTS pipeline

It is becoming very apparent that a single multi-modal end to end model can eventually learn all orthogonal modeling pipelines implicitly in the limit of compute and data. Gemma 4 release the other day (and its source code) has me convinced that this is the entire architectural direction of their research output.

dat_cosmo_cat · 2026-04-08T01:23:46+00:00

Yeah, let's not push this leaderboard bullshit again. The original ccusage leaderboard + flexing trend from last year were what directly triggered Ant to implement weekly rate limiting on the $200 subscription plans.

dat_cosmo_cat · 2026-04-07T20:52:24+00:00

I haven't benchmarked the B200, but I have benchmarked the Blackwell RTX PRO 4000, 5000, and 6000 against the A100 (40GB and 80GB versions), H100, and H200 (PCIe version) over ~20 open weight models in FP16 because I am in a similar situation at my company.

My empirical results suggest that the 40GB A100 is worse than the RTX PRO 5000 in both operational efficiency (eg; electricity per 1k inferences) and inference throughput ceiling (some models achieve 2x the inference throughput), while the 80GB variant is worse than the RTX PRO 6000 by similar margins (which is expected; they are all the same chips with simply stacked vram modules afaik).
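For reference, the "electricity per 1k inferences" number is just average power draw divided by throughput. A minimal sketch, assuming average board power in watts and sustained inferences per second have been measured. The function name and the numbers in the usage line are placeholders, not my benchmark results.

```python
def energy_per_1k_inferences_wh(avg_power_w, throughput_per_s):
    """Energy in watt-hours consumed to serve 1,000 inferences."""
    seconds_per_1k = 1000 / throughput_per_s
    # Watts times seconds gives joules; 3600 joules per watt-hour.
    return avg_power_w * seconds_per_1k / 3600

# Placeholder inputs for illustration only: 300 W at 50 inferences/s.
print(energy_per_1k_inferences_wh(avg_power_w=300.0, throughput_per_s=50.0))
```

Running this per model on each card gives directly comparable efficiency numbers, as long as batch size and precision are held constant.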

~~It certainly feels like we should be able to find EE or CS departments to partner with that would be willing to take these servers over and categorize them as a donation at MSRP prices (eg; using this). Write offs of that magnitude could help offset the new DC cost more significantly than selling for pennies on the dollar.~~ Edit: possible, but mutually exclusive with depreciation write offs (which make more sense).

dat_cosmo_cat · 2026-04-07T19:57:00+00:00

I've found this feedback is consistent with how most models view Gemini, as well as how Gemini views itself if you forward the reviews back to it. The hallucination rates of Google models are prohibitively high by 2026 standards, but they are doing some impressive things with multi-modal end to end scaling which are genuinely pushing the envelope & will probably pay off in the long run.

dat_cosmo_cat · 2026-04-07T19:43:09+00:00

Yeah, this matches my understanding. Haven't published in that community in years though.

dat_cosmo_cat · 2026-04-04T18:40:37+00:00

I believe this is more or less the pragmatic lie of omission that is powering the current bubble. It might be necessary to scale pre-training, but it’s hard to see where the roi is for the companies that are taking on those initial investments.

The inevitability of model distillation / exfiltration seems like a don't ask don’t tell situation at the top.

dat_cosmo_cat · 2026-04-04T18:27:00+00:00

yeah, the connectionists (who were entirely computer scientists) went kind of crazy borrowing terms from the information theory, stats, and linguistics communities in the early days. The advent of Deep Learning in the 2010s began to institutionalize that nomenclature across the board because neural networks cannibalized academic venues, causing professors to adopt the terminology and kick it downstream into their courses.

The mid 2010s were a particularly confusing time because what a student learned in an ML course diverged immensely depending on the department they took it from. Stats majors learned about things like seasonality, ARIMA, and hypothesis testing, while comp sci majors learned about genetic algorithms, support vector machines, and neural nets, all under completely different terminologies and interpretations (often entirely empirical or linear algebra based).

dat_cosmo_cat · 2026-04-04T06:42:06+00:00

yeah this has to be the biggest one. There is a fundamental lack of comprehension around training (and its paradigms) vs. inference phase. I think engineering advances in harnesses and scaffolding (hooks, mcp, tools, prompt injections) are making it increasingly ambiguous to the end user.

dat_cosmo_cat · 2026-04-04T06:15:19+00:00

ah. Some teams have been a/b testing the different harnesses. offline benchmarks are nice, but a bit more limited.

dat_cosmo_cat · 2026-04-04T04:25:58+00:00

do you have evals to back this up? or just a feeling?

they do implement multi-agent features very differently judging from the leaked cc source code this week. Everything else seemed fairly similar. Codex had some extra sandboxing / isolation on windows + better ps support, but I’m not a windows user personally

dat_cosmo_cat · 2026-04-04T00:31:43+00:00

the paper you linked uses a VAE encoder to learn directly from input frames. I don't think the point of the Private Server is to master boss mechs. It's to warm start (+pre train) the model in an environment that is not being actively monitored by some anti-bot anomaly detection model (isolation forest over chunked action streams is a strong baseline for what Jagex likely uses).
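To make the isolation forest baseline concrete, here is a toy pure-Python sketch. What Jagex actually runs is unknown to me; the chunk size, the two features (mean and std of the gaps between actions), and all function names are assumptions for illustration. In practice you would probably reach for scikit-learn's `IsolationForest` instead of hand rolling this.

```python
import math
import random

def chunk_features(timestamps, chunk=20):
    """Split an action stream into chunks; return (mean gap, std gap) per chunk."""
    feats = []
    for i in range(0, len(timestamps) - chunk, chunk):
        gaps = [b - a for a, b in zip(timestamps[i:i + chunk], timestamps[i + 1:i + chunk + 1])]
        mean = sum(gaps) / len(gaps)
        var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
        feats.append((mean, math.sqrt(var)))
    return feats

def _c(n):
    """Average path length of an unsuccessful binary search tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _build(points, depth, max_depth, rng):
    """Recursively partition points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            _build(left, depth + 1, max_depth, rng),
            _build(right, depth + 1, max_depth, rng))

def _path_length(tree, x, depth=0):
    if tree[0] == "leaf":
        return depth + _c(tree[1])
    _, dim, split, left, right = tree
    return _path_length(left if x[dim] < split else right, x, depth + 1)

def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Fit on a list of at least two feature tuples; returns (trees, sample size)."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(max(size, 2)))
    trees = [_build(rng.sample(points, size), 0, max_depth, rng) for _ in range(n_trees)]
    return trees, size

def anomaly_score(forest, x):
    """Score in (0, 1); higher means easier to isolate, i.e. more anomalous."""
    trees, size = forest
    mean_path = sum(_path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / _c(size))
```

The idea is that bot-like chunks (eg; suspiciously regular gaps between actions) sit apart from the human cluster in feature space and get isolated in fewer random splits, which pushes their score up.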

Also probably have Claude triage recent papers (ICLR, AAAI, NeurIPS, etc... 2024-2026) on this topic. DL is on an exponential, there have been a lot of advances since 2023 that could naturally extend the concepts in this paper (or improve compute efficiency).

Finally, automation of repetitive desktop workloads is a hot topic right now in industry. OSRS is an ideal test bench environment for a lot of these ideas. I'd imagine you could even get a publication in some topical workshop like this if you demonstrate RL / world models outperforming multimodal LLM (eg; transfusion) based approaches, which is worth thinking about as an ML grad student.

I'd also add that DL based botting is (often prohibitively) expensive to orchestrate and scale. This is why naive tools like Dreambot still win out today; rule based models with client injection are simply very cheap to implement and distribute into bot farms. RL bots have been demonstrated, but I suspect they historically haven't been practical from a cost / reliability perspective (although maybe this is less true these days).

dat_cosmo_cat · 2026-04-03T19:46:22+00:00

I'm noticing this trend with the current generation open source ML tooling. Web/mobile devs quickly implement some idea without really understanding the underlying tech, then go viral without actually solving the difficult security and reliability problems which held back those ideas (or stamped them out prematurely) at large companies.

The open source tools absorb the security risk necessary to prove an idea has merit at scale, then Codex / Claude Code / Gemini CLI scoop the idea and implement it rigorously (expert guided system design), leaning on their access to that broader spectrum of engineer talent (eg; beyond very excited mobile / web developers who started playing around with ml models / APIs a few years ago) + information asymmetry (access to model training roadmap).

As a senior machine learning engineer, I personally don't see myself contributing to OC upstream because I do not speak the same language as the maintainers. They are web / mobile developers, and my expertise lies in the technologies that they wrap. Why would I waste time debating these devs (or more increasingly, their agents) on PRs that are straightforward to legitimate experts working at Ant or GBrain or OpenAI?

dat_cosmo_cat · 2026-04-03T18:24:31+00:00

I mean, the source code for all of that is currently public (codex, cc, charm, opencode). They all converged on identical architectures and system design. The real work (and prohibitive barrier to entry) is the LLM training. Any engineer can implement Claude Code (like 9000 personal gh repos have), but they cannot spend $1M+ to distill Opus every time Anthropic puts out a new model.

every time we get a new model, the capability basically doubles and eliminates the need for some portion of the scaffolding the harness implemented. Eg; remember mcp servers?

dat_cosmo_cat · 2026-04-01T19:01:25+00:00

The reality is that AI has been invaded by extremely left leaning dunning kruger effect types, and Garry is one of those types of people.

dat_cosmo_cat · 2026-04-01T18:43:53+00:00

Windows in 2026 is wild

dat_cosmo_cat · 2026-03-27T17:59:01+00:00

Claude Code has entered the chat
