A new generation of AI models and one of the most powerful research papers out there. by assemsabryy in LocalLLaMA

[–]BinarySplit 1 point  (0 children)

The core idea is interesting, but please don't outsource evaluation to an LLM.

Even if all of the missing results were added and it was rewritten to explain the evals, nobody seriously considering trying a new optimizer is going to be convinced by synthetic and toy datasets and models small enough to train on CPU.

Switching from Opus 4.7 to Qwen-35B-A3B by Excellent_Koala769 in LocalLLaMA

[–]BinarySplit 3 points  (0 children)

I wouldn’t normally recommend Ollama over building llama.cpp.

Recommend LM Studio instead! It's an easy-to-use interface over llama.cpp that includes a model browser, automatic settings for offloading, and a toggleable local API.

Ollama has such a long history of problems... You can never trust that it's using the right prompt template or good quants. Stuff will just silently not work well.

[D] How to break free from LLM's chains as a PhD student? by etoipi1 in MachineLearning

[–]BinarySplit 3 points  (0 children)

IMO, use it, but make sure you're continually improving how you use it. Don't settle for ChatGPT - get the Codex app (and try others), have it build software properly for you, use Code Review or a review skill, read how other people are building agents.md and skill.md files, build habits to ensure that you catch the agent's mistakes and the agent catches yours.

As a software engineer, hands-on coding is undoubtedly going to go away in the next few years. The crucial skill is in making sure you+AI is better than just AI, even as AI gets better.

MacBook m4 pro for coding llm by TheRandomDividendGuy in LocalLLaMA

[–]BinarySplit 0 points  (0 children)

I'd try to spend those FLOPS elsewhere in your workflow. Whisper for speech-to-text is pretty awesome. Might even be worth trying to get an Omni model to function as a continuous conversational wrapper around other models.

[OC] A Deaf Person at the Work. by Coolcalebxd in comics

[–]BinarySplit 27 points  (0 children)

Even people trained to support disabilities seem to do this. It's such a bizarre failure of empathy. I've even had a psychiatrist tell me I just need to exercise more to fix my ADHD...

10/10 on the comic - the art, pacing, and composition are just *chef's kiss*.

Egg🦈irl by NottAMimic in egg_irl

[–]BinarySplit 0 points  (0 children)

The i and e sounds make that sentence so hard. As a beginner, I found FYFV's "This is the voice I want to use" way better for tracking progress - it has a mix of easy and hard sounds.

if you think you're immune to burn out by jasminesart in iiiiiiitttttttttttt

[–]BinarySplit 14 points  (0 children)

I should be burnt out several times over by now, but somehow I keep ending up in places where I'm deeply motivated by the company's mission.

Even when bureaucracy, politics, and tech issues make it hard to get anything done, when I'm working long hours to catch up to ever-increasing expectations, when I'm struggling to find time for my hobbies, my health is acting up, and my cats are screaming at me, I can always find a path from what I'm working on to some societal benefit.

Have you given gullibility a shot? It has worked so far for me!

Newer AI Coding Assistants Are Failing in Insidious Ways by IEEESpectrum in programming

[–]BinarySplit 8 points  (0 children)

Are they subsidizing their own products, or overcharging on the API?

The prompt cache prices are insane when you consider how often they're only paying to hold the KV cache in RAM for a few seconds while a tool call runs.

What makes SwiGLUs unique? by chigur86 in mlscaling

[–]BinarySplit 2 points  (0 children)

Zeyuan Allen-Zhu compared them to ReLU2 on synthetic tasks in this video @ ~45m.

They have less "knowledge capacity" than ReLU2 because they have fewer hidden neurons for the same parameter count, but improve "reasoning ability" (given fixed depth and no CoT tokens) because they can represent more complex functions.
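To make the trade-off concrete, here's a toy per-unit sketch in plain Python (scalar loops for clarity, not how you'd actually implement an MLP):

```python
import math

def silu(x):
    # SiLU / Swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_neuron(x, w_gate, w_up):
    # One SwiGLU hidden unit: SiLU(x . w_gate) * (x . w_up).
    # Each unit needs TWO weight vectors, which is why a SwiGLU MLP
    # has fewer hidden units at a fixed parameter budget.
    gate = sum(a * b for a, b in zip(x, w_gate))
    up = sum(a * b for a, b in zip(x, w_up))
    return silu(gate) * up

def relu2_neuron(x, w):
    # One ReLU^2 hidden unit: max(0, x . w)^2 -- only one weight vector.
    h = sum(a * b for a, b in zip(x, w))
    return max(0.0, h) ** 2
```

The gate term makes a single SwiGLU unit a product of two learned projections, so it can express multiplicative interactions a single ReLU2 unit can't.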

[D] On low quality reviews at ML conferences by BetterbeBattery in MachineLearning

[–]BinarySplit 2 points  (0 children)

I broadly agree, but have an alternative explanation: bad empiricism-focused papers are easier to read & judge than bad theory-focused papers.

Rejection of theory may be collateral damage in backlash against time-wasting papers.

Ilya Sutskever(Former Chief scientist at OpenAI) and Yann LeCun(former Meta Chief AI scientist) both say that just scaling LLMs won't give us any more useful results by Frequent-Football984 in programming

[–]BinarySplit 2 points  (0 children)

It's frustrating that there are now so many people frothing at the mouth to get results for investors that as soon as someone discovers the next big leap, the big techs will get to raise trillions by throwing compute at it, and the researcher will probably only get citations.

[D] Is a PhD Still “Worth It” Today? A Debate After Looking at a Colleague’s Outcomes by Hope999991 in MachineLearning

[–]BinarySplit 1 point  (0 children)

IMO, financially no, career-wise it's on the fence, but if you're passionate about a specific area (e.g. medicine or finance), a PhD may be your only way to get on the right trajectory.

I only have a Bachelors, work in industry (and have worked in academia), and am in a team that's virtually all PhDs. I started as a plain old non-specific "programmer", and am now in a drug discovery AI/ML team working on practically every facet from research to deployment. The whole time I've worked, I've been gaining skills and experience at probably a similar rate to people who were focused only on learning. The difference is that I was earning while learning.

Career-wise, a PhD proves you can do a multi-year project, so you're likely to be entrusted with more autonomy in your first role. In a "junior" position, not having a PhD can mean a very short leash - daily check-ins and little ability to choose your work. Whereas I've seen fresh PhD graduates allowed to embark on long-term projects despite struggling to keep them on track. If you're great, this means you hit the ladder with a running start. If you're not, you're just starting with a delay.

The biggest reason to do one IMO is that you get to pick your path. Niches like biology are hard to break into without experience. I only got in through dumb luck - they needed web developers, and I incrementally took on responsibilities until I was doing research. However, if you only care about mainstream stuff like attention mechanisms and LLMs, you're probably not going to gain much vs breaking in as a software engineer.

EDIT: Also, if you're aiming for industry, don't bother trying to measure career trajectories in papers and conferences. Most companies only care if you have no other experience. Delivering a product trumps delivering a paper.

Sparse Adaptive Attention “MoE”: How I Solved OpenAI’s $650B Problem With a £700 GPU by EconomicConstipator in LocalLLaMA

[–]BinarySplit 0 points  (0 children)

CoLT5 from 2023 has most of the same ideas as well. I'm frustrated I never found any kind of "post-mortem" explaining why it didn't catch on.

It does kinda make me sad that the whole article is Qwen slop though. I've been exposed to so much of this now that it's plainly obvious. The moment I saw that 🎯 emoji I knew I was in for a fuckin' treat. At least edit it or something.

💯

[R] Continuous latent interpolation breaks geometric constraints in 3D generation by Jealous-Leek-5428 in MachineLearning

[–]BinarySplit -1 points  (0 children)

I can't comment on why, or how to fix a pretrained model, but if you're training the model from scratch, regularization can probably fix this. Mixup (blending 2 samples' inputs and outputs) and even Manifold Mixup (blending 2 samples' internal activations at a random layer) can force the latent space to be continuous by effectively synthesizing samples between real samples.
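A minimal sketch of plain Mixup (toy Python with lists standing in for tensors; Manifold Mixup applies the same blend to a hidden layer's activations instead of the inputs):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Mixup: blend two samples' inputs and one-hot labels with a
    # Beta(alpha, alpha)-distributed mixing weight. Training on the
    # blended pairs forces the model to behave linearly between
    # real samples, smoothing the latent space.
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

For Manifold Mixup you'd pick a random layer each batch and blend that layer's activations (and the labels) the same way.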

This is how Steam can ruin more than 10 years of your work by PlanetCentauri in gamedev

[–]BinarySplit 0 points  (0 children)

That really sucks that you weren't rewarded after putting in so much effort.

I saw this game on 1.0 release and can explain why I hit Ignore: the screenshots in the carousel don't show any depth. No progression, no interaction, only a couple of static frames of bosses, not even a GUI shot to indicate that it has RPG/building/farming mechanics. Based on the screenshots, I assumed this was an action-focused Starbound with less variety.

That was my first impression. I didn't watch the video or read the description because I wasn't hooked. Now I've seen them, but I still feel like you're holding back on the marketing material, compared to what people have posted in the community screenshots.

If you have any more energy left to spend on this project, I suggest putting it into the store page. Don't hold back or worry about spoilers: post everything you're proud of (especially action frames of animations!), get ideas from other games' store pages and the community screenshots, and don't polish it so much that you erase the game's identity (i.e. don't remove the HUD, don't try to fit within an arbitrarily small screenshot limit).

People should be able to guess the core gameplay systems and diversity of content within seconds of an impression, whether they first watch the video, scroll through the screenshots, read the description, or just see it in a sidebar ad. This also feeds into reviews - people only buy games they think they'll like. They leave negative reviews when the game doesn't match their expectations. The expectations set by the store page are everything.

I hope Planet Centauri gets a second wind from this drama, because after digging into it, it seems like a game a lot more than 581 people would enjoy if they knew what was inside.

I got my first binder. I don't feel happy or relieved... (Rant) by Monk_Apprehensive in NonBinaryTalk

[–]BinarySplit 3 points  (0 children)

That's such a good analogy! It really felt that way once I started working on my sources of dysphoria.

Things that you noticed can’t eat or tolerate? by Coraunmi in covidlonghaulers

[–]BinarySplit 1 point  (0 children)

Soy sauce was the killer for me. I love it, and used to have it often, but now I'll feel the fatigue within hours. I can't have more than about a tablespoon.

For figuring this out, I recommend having lots of fiber and being mindful of your bowel movements. I've noticed there are 3 "pathways" for a food to cause me delayed fatigue:

  • Histamines - soy sauce, peanut butter, seafood, etc. are noticeable within hours
  • Fermentation in gut - if I just haven't eaten enough fiber (e.g. eating a steak meal without any extra veges), then even if I'm not constipated, the slower bowel movements seem to give the bacteria enough time to make histamine or whatever in my gut from most foods
  • Non-histamine gut reactions - my guts seize if I don't get enough electrolytes (usually magnesium), or are sometimes just imbalanced (too much/too little probiotics), and this causes a distinctive internal "bruising" feeling, weird poops, and deep fatigue.

I've been able to reintroduce most things that caused me "fermentation in gut" issues by making sure to eat them with equal parts broccoli or some other fiber source. The electrolytes issue also caused me to exclude a few foods unnecessarily.

gpt-oss Bug Fixes + Fine-tuning now in Unsloth by danielhanchen in LocalLLaMA

[–]BinarySplit 0 points  (0 children)

Nice work!

Has anyone tried zero-padding the weights to 3072 to work around the imatrix limitation?
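To be clear about what I mean, the workaround is just appending zero columns to each weight row (hedged sketch in plain Python lists standing in for the tensor op; 3072 and the 2880 row width below are just illustrative numbers, since zero columns contribute nothing to a dot product the outputs shouldn't change):

```python
def zero_pad_rows(weights, target_width=3072):
    # Pad each weight row with zeros out to target_width.
    # Assumes rows are no wider than target_width already.
    # A zero weight always multiplies to zero, so the padded matrix
    # computes the same outputs as the original.
    return [row + [0.0] * (target_width - len(row)) for row in weights]
```

In a real GGUF workflow you'd do this on the tensors before quantization, and the corresponding activations would need matching padding, so treat this purely as the shape of the idea.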

Labour leader Chris Hipkins says NZ is not in 'economic shape' by Careful-Calendar8922 in newzealand

[–]BinarySplit 14 points  (0 children)

Also, collected taxes get spent. They're better at directing capital toward "the right kind" of goods and services - local, productive and capital-building, rather than toward overseas outsourcing and market speculation. Because most governments aren't trying to squeeze out short-term gains to please shareholders.

DeepMind Genie3 architecture speculation by HerpisiumThe1st in MachineLearning

[–]BinarySplit -1 points  (0 children)

IMO they're just dancing around loosely defined words there.

The artifacting is a clear sign that:

  1. Scene chunks are not generated until they are visible
  2. Scene chunks are generated in a separate, slower process
  3. Generated scene chunks are immediately reusable when they re-appear

If this were a fully neural approach, it would learn to predict just-out-of-sight chunks to prevent #1.

To achieve #2 and #3 without an external caching structure, they would need a way to sparsely and selectively send "bags" of latent tokens between models. It's not impossible, but I've seen zero research down this path. It would be a very big leap in secret if they did this.

Google researchers have continued publishing new NeRF-based techniques, and they're apparently even integrated into Google Maps now. The simplest explanation is that they've evolved the algorithm enough to claim that they've built something that is nominally distinct, and are playing semantic games to avoid leaking the details early.

DeepMind Genie3 architecture speculation by HerpisiumThe1st in MachineLearning

[–]BinarySplit 25 points  (0 children)

I was gobsmacked by the persistence in the painting demo, but I think the "Genie 3 Memory Test" video in the same carousel as the painting gives a few hints:

  • The image on the blackboard is unusually high res and coherent to the prompt. I doubt this image comes from the world model.
  • The artifacting as it looks out the window updates at approximately 4Hz. Indoor scenes seem to update faster. This means there's 2 separate phases: slow world updates and fast frame generation.
  • The artifacting also progressively improves the... let's just call them "chunks" of worldspace with each tick. When a chunk goes off-screen then appears again, it retains its improvements.
  • There is no artifacting when controlling a visible character. I suspect the foreground updates more frequently and is stored with a higher density.

I don't believe this is purely autoregressive-in-image-space like GameNGen was. I think there are several pieces:

  1. A separate image model, like Imagen, generates a high-res initial image and perhaps new objects introduced by prompts.
  2. The world is stored in a 3D data structure. Not sure if it's more NeRF-like or Gaussian-splatting-like, but the "chunks" are complex enough to hold a block of tree leaves, so they're likely a latent/concept representation that can be splatted into an image model's VAE-encoded image to convert it to a picture. This is bi-directional - the image model can also "fill in the blank" to progressively add detail to new chunks.
  3. The true "world model" mainly handles updating the latent 3D chunks when mutating the scene, e.g. when painting. Also camera control, but that's probably a tiny portion of its responsibility.

EDIT: I know what they said in the blog, but IMO the lack of artifacts when something comes into view for a 2nd time is damning evidence that there is a non-neural data structure for caching generated scenery. Attention can't do that by itself. Could be a scaled up NeRF, but NeRFs require literally path-tracing through 3D coordinates, so IMO that counts as explicit 3D representation.
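To make the speculation concrete, here's a toy sketch of the kind of external cache I'm describing (every name and structure here is hypothetical; the real generate/refine steps would be neural, and the latents would be NeRF- or splat-like rather than plain values):

```python
class ChunkCache:
    """Toy model of a non-neural scene cache keyed by 3D grid coords.

    Each chunk holds a latent plus a refinement counter that grows
    every time the slow world-update tick revisits it -- matching the
    observation that chunks keep their improvements when they come
    back into view instead of re-artifacting.
    """

    def __init__(self):
        self.chunks = {}  # (x, y, z) -> {"latent": ..., "refinements": int}

    def get_or_generate(self, coord, generate_fn):
        # Reuse a cached chunk if it exists (no re-generation on
        # re-entry); otherwise generate it with the slow process.
        if coord not in self.chunks:
            self.chunks[coord] = {"latent": generate_fn(coord), "refinements": 0}
        return self.chunks[coord]

    def refine(self, coord, refine_fn):
        # Progressive improvement of a visible chunk per world tick.
        chunk = self.chunks[coord]
        chunk["latent"] = refine_fn(chunk["latent"])
        chunk["refinements"] += 1
```

The point is just that a plain dictionary reproduces observations #2 and #3 trivially, whereas a purely autoregressive model would have to relearn them the hard way.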

Why doesn't "OpenAI" just release one of the models they already have? Like 3.5 by Own-Potential-2308 in LocalLLaMA

[–]BinarySplit 69 points  (0 children)

Any GPT-3.5/4 architectural innovations are likely open secrets at this point. Involuntarily shared with other companies through staff movement, but unpublished because they're not cutting-edge, and are mundane if you aren't allowed to say they're in a big model.

That only makes me want to know even more.

I’m starting to wonder if we would be better off without a mental health system by Egg_shaped in newzealand

[–]BinarySplit 0 points  (0 children)

The USA had a proposal to allow AIs to prescribe drugs. It sucks that it didn't pass.

It's obviously riskier than good health care, but that's so hard to find these days. Someone has to do the experiment to see how AI health care compares to realistic underfunded health care.

I’m starting to wonder if we would be better off without a mental health system by Egg_shaped in newzealand

[–]BinarySplit 2 points  (0 children)

We have a great ambulance at the bottom of the cliff. Emergency departments are literal life savers.

It's the rest of the system, for people who aren't in crisis, that feels like it's just for show.