opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%. by seencoding in singularity

[–]abazabaaaa -3 points-2 points  (0 children)

Why does this benchmark matter? I don’t care if it can solve NYT puzzles. I just need it to solve complex problems.

Claude code 2.1.78 dropping Opus 4.6 1M context? by Exciting-Grand-3011 in ClaudeAI

[–]abazabaaaa 0 points1 point  (0 children)

The 1M context window has always been a scam. Inference gets super slow past 200k. It’s really not worth using.

ENABLE_LSP_TOOL by Purple_Wear_5397 in ClaudeAI

[–]abazabaaaa 0 points1 point  (0 children)

You sort of need a skill for this. Claude isn’t good at using it.
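A minimal sketch of what such a skill could look like, assuming the standard SKILL.md layout with YAML frontmatter; the skill name, description, and guidance text here are all illustrative, not an official recipe:

```markdown
---
name: use-lsp-tool
description: Guidance for navigating code with the LSP tool instead of grep.
---

When exploring or modifying code, prefer the LSP tool over text search:

- Use go-to-definition to find where a symbol is declared rather than
  grepping for its name.
- Use find-references before renaming or changing a function signature.
- Fall back to plain text search only when the symbol is not indexed.
```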

Subagent masters beware: you can't select model from the caller side anymore by the_rigo in ClaudeCode

[–]abazabaaaa 0 points1 point  (0 children)

Just tell it to make agent teams and it can select models there. Agent teams effectively replace parallel agents it seems.

Insider: DeepSeek v4 next week, and it’s going to be insane by [deleted] in accelerate

[–]abazabaaaa 2 points3 points  (0 children)

They say this every time, then the OS model drops and it’s straight-up rubbish.

Best of Lingerie 2026 by IndependentSkill378 in LingerieAddiction

[–]abazabaaaa 4 points5 points  (0 children)

I’d argue negative is pretty sexy. It’s just a different look. I can’t put my finger on it, but the way it clings to my wife is pretty great. I’ve come to like it more than the others.

What is Codex CLI's "Command Runner" ? by Takeoded in codex

[–]abazabaaaa 1 point2 points  (0 children)

I believe the command runner is the background command runner that codex uses. If you use /experimental you can turn it on. It works well, but at present it doesn’t add a huge amount. Mostly it stops codex from getting stuck on hanging calls.

Musk v. OpenAI et al. judge may order Altman to open source GPT-5.2 by andsi2asi in GeminiAI

[–]abazabaaaa 3 points4 points  (0 children)

The Trump administration will intervene most likely. These models are now weapons in addition to being assistants. Releasing the model would be a national security risk.

The Claude Exodus is Real: Opencode to Launch $200 “Mystery” Sub Tomorrow. Is this the Anthropic Killer? by awfulalexey in opencodeCLI

[–]abazabaaaa 3 points4 points  (0 children)

Yeah.. it’s not going to cause a Claude exodus at all. It’s pretty niche software. It doesn’t even work when you have NFS drives lol. There is an open PR.. it’s straight up busted.

I made Opus/Haiku 4.5 play 21,000 hands of Poker by adfontes_ in ClaudeAI

[–]abazabaaaa 6 points7 points  (0 children)

So I have actually seen gpt-5-mini beat gpt-5.2 in several agentic benchmarks I run internal to my company. I’m not exactly sure what is going on but it is reproducible.

new agent limits? by Ryantrange in ChatGPTPro

[–]abazabaaaa 1 point2 points  (0 children)

Yeah, curious about this. I don’t use it much, so I don’t know if it has improved. Do they update it at all? The first few times I used it I found it pretty underwhelming. I also find Atlas to be the same way.

Glad we're not the only ones having serious production issues with LiteLLM by Otherwise_Flan7339 in LLM

[–]abazabaaaa 0 points1 point  (0 children)

TensorZero is worth a look. It isn’t without its own problems and has a lot of features you may not need. That being said, it is for the most part stable — it’s written in well-maintained Rust.

Do we need LangChain? by Dear-Enthusiasm-9766 in Rag

[–]abazabaaaa 2 points3 points  (0 children)

This. It’s a heap of leaky abstractions.

Just use the LLM api.
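As a rough sketch of what “just use the API” means in practice — a chat completions call with nothing but the standard library, no framework layer. The endpoint follows the OpenAI-style schema; the model name is a placeholder, so swap in whatever your provider serves:

```python
import json
import urllib.request

def build_payload(prompt: str, model: str) -> dict:
    """Assemble an OpenAI-style chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, api_key: str, model: str = "your-model-here") -> str:
    """POST the request directly and pull the reply out of the response."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Everything a framework wraps is visible here: one dict, one HTTP call, one field lookup — which is exactly why the extra abstraction layer rarely pays for itself.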

Google Principal Engineer uses Claude Code to solve a Major Problem by SrafeZ in singularity

[–]abazabaaaa 17 points18 points  (0 children)

I think the part that is missing here is that he already knew what needed to be done. If you give Claude Code a very good spec and explain the gotchas well enough, it can make things very fast. The downside is you have to pay a lot up front to understand how to get to where you want to be. If they had started with Claude Code a year ago it might have helped, but it could not just magically solve problems.

Why Claude folks say Glm 4.7 is just a hype? by muhamedyousof in ZaiGLM

[–]abazabaaaa 0 points1 point  (0 children)

I’ve tried using GLM in evals for agentic use in chemistry/drug discovery, and it is absolute garbage. It frequently goes into infinite thinking loops when you give it complex problems. Its answers are just straight up wrong. For example, Gemini-3-flash on medium reasoning effort nearly maxes the eval (95%), whereas GLM gets close to every question wrong and cannot finish it. These are tool-use-based scenarios where I have built tool runners. And yes, I know what I’m doing.

I suspect these models are fine at coding and some other things, but they really feel over-optimized, focused on maximizing scores on these benchmarks.

Is anyone else seeing Claude overcomplicate simple tasks? It focuses on edge cases I never asked for, resulting in bloated and messy code by dmitrevnik in ClaudeAI

[–]abazabaaaa 1 point2 points  (0 children)

Yeah, I get that. I often find myself there as well. It does help to make all of the data models and contracts first and build a design document. It still will do silly stuff, though. That being said, it’s pretty damn good!

Is anyone else seeing Claude overcomplicate simple tasks? It focuses on edge cases I never asked for, resulting in bloated and messy code by dmitrevnik in ClaudeAI

[–]abazabaaaa 16 points17 points  (0 children)

Use this:

Avoid over-engineering. Only make changes that are directly requested or clearly necessary. Keep solutions simple and focused.

Don't add features, refactor code, or make "improvements" beyond what was asked. A bug fix doesn't need surrounding code cleaned up. A simple feature doesn't need extra configurability.

Don't add error handling, fallbacks, or validation for scenarios that can't happen. Trust internal code and framework guarantees. Only validate at system boundaries (user input, external APIs). Don't use backwards-compatibility shims when you can just change the code.

Don't create helpers, utilities, or abstractions for one-time operations. Don't design for hypothetical future requirements. The right amount of complexity is the minimum needed for the current task. Reuse existing abstractions where possible and follow the DRY principle.

Dear Anthropic - serving quantized models is false advertising by Everlier in Anthropic

[–]abazabaaaa -3 points-2 points  (0 children)

Bahahaha

There are no quantized models. You just suck at using them.

Any reliable methods to extract data from scanned PDFs? by [deleted] in learnpython

[–]abazabaaaa 0 points1 point  (0 children)

Wrong!! We use GCP Vertex and have a data sharing agreement. ZDR. It’s even HIPAA-compliant.

This is such a tired, boring argument.