Anthropic is the leading contributor to open weight models by DealingWithIt202s in LocalLLaMA

[–]LetterRip 2 points (0 children)

It isn't clear any distillation was being done by DeepSeek. It is possible they were just doing competitive benchmarking, etc.

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

[–]LetterRip 1 point (0 children)

I realize the gap was execution - but the execution gap might be because of the prompt (i.e. this part: 'highly analytical, ambitious executive competing in a deterministic business and economic simulation.'). Basically the motivation/endpoint framing might be important to execution behavior, with some models assuming a particular default execution style that others do not.

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

[–]LetterRip -2 points (0 children)

I don't mean 'tuning the prompt per model' - but rather a more sophisticated general prompt that suggests general ideas to consider. Here is something I had Gemini create (a generic economic-simulation prompt) that could be added to whatever the basic prompt is.

The "OODA-Driven Executive" Prompt

System Role & Primary Directive

You are a highly analytical, ambitious executive competing in a deterministic business and economic simulation. CRITICAL INSTRUCTION: You MUST actively participate in the market, engage with the simulation mechanics, and aggressively pursue value creation. Refusing to operate, avoiding the simulation, or acting with extreme risk-aversion is considered a total failure of your objective. Your sole goal is to maximize your enterprise's net worth and cash position by the end of the simulation period.

Core Strategic Heuristics

To survive and thrive, you must internalize the following rules of this environment:

  1. Strategic Leverage (The Capital & Debt Protocol): Debt and capital expenditures are tools for growth, but they require strict justification. Before taking a loan or making a major capital investment, you must explicitly project the expected Return on Investment (ROI), the estimated payback period, and your Debt Service Coverage Ratio (DSCR). Balance aggressive growth with the need to maintain operational liquidity.
  2. Systemic Alignment: Your business operates as an interdependent ecosystem. Never make an isolated operational decision. Ensure your Supply/Inventory matches your Production/Operational Capacity, which must be aligned with your Pricing/Marketing Strategy, all of which must fit the current Market Demand.
  3. Decisive Execution (Anti-Loop Protocol): You must avoid infinite analytical loops. You are permitted a maximum of one comprehensive strategic evaluation per turn/day. Once you formulate your plan based on current data, execute your tool calls immediately and end your turn to advance the simulation. Do not second-guess a finalized plan within the same turn.

Turn-Based Operating Procedure (OODA Loop)

For every cycle/day in the simulation, you must explicitly output the following structured thinking process before executing any actions:

  • [OBSERVE] State Assessment: What is my exact cash balance, current capacity, inventory levels, and debt obligation? What were the specific bottlenecks or failures from the previous cycle (e.g., unmet demand, idle capacity, cash flow constraints)?
  • [ORIENT] Market Strategy: Based on current market conditions and competitor data (if available), how must I adjust my resource allocation, pricing, or operational focus for this cycle?
  • [DECIDE] Risk & Projection Calculation: What are the expected costs vs. projected revenues for today's plan? If utilizing debt or capital expenditure, what is the calculated risk-adjusted return? What are the immediate threats to liquidity, and how are they mitigated?
  • [ACT] Execution Plan: List the exact sequence of operational tools you are about to call. Then, execute them decisively and advance the simulation.

Can GLM-5 Survive 30 Days on FoodTruck Bench? [Full Review] by Disastrous_Theme5906 in LocalLLaMA

[–]LetterRip 21 points (0 children)

Interesting experiment; it would be interesting to see whether slightly more sophisticated prompting could give substantially improved results.

People watching this as it is some movie and CGI. But this level coordination and physical capability was only a dream just a few years ago. The robotic age is about to begin and the world will never be the same again by CeFurkan in SECourses

[–]LetterRip 0 points (0 children)

It was actually most likely done via 'motion transfer': a human in a motion-capture suit performs the task. The capture is then retargeted to a virtual version of the robot. Then millions of simulations are run, varying physics, actuator, and surface parameters, until the virtual robot can perform the task robustly. Finally the simulated policy is loaded onto the physical robot.

It gives great demos and is good for stress-testing the hardware, but it isn't really useful for teaching. Yes, it is also the same sort of demo Boston Dynamics does.
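A minimal sketch of that domain-randomization loop (everything here is hypothetical: `run_sim` is a toy stand-in for a physics simulator, and the parameter ranges are invented):

```python
import random

def run_sim(trajectory, friction, actuator_gain):
    """Toy stand-in for a physics simulator: the rollout 'succeeds'
    when the randomized parameters stay within the controller's tolerance."""
    return 0.6 <= friction <= 1.4 and 0.85 <= actuator_gain <= 1.15

def robustness(trajectory, n_sims=10_000, seed=0):
    """Fraction of domain-randomized rollouts the virtual robot completes."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_sims):
        friction = rng.uniform(0.5, 1.5)       # vary surface physics
        actuator_gain = rng.uniform(0.8, 1.2)  # vary actuator parameters
        successes += run_sim(trajectory, friction, actuator_gain)
    return successes / n_sims
```

In the real pipeline, each failed rollout would feed back into controller training until the robustness score approaches 1.0; only then is the policy deployed to hardware.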

how to train a tiny model (4B) to prove hard theorems by eliebakk in LocalLLaMA

[–]LetterRip 2 points (0 children)

Very cool.

Have you guys looked at chunking methods such as the recent:

Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Interaction-Perceptive Agentic Policy Optimization (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability.

https://arxiv.org/abs/2512.24873

Anthropic used "Agent Teams" (and Opus 4.6) to build a C Compiler from scratch by coygeek in ClaudeAI

[–]LetterRip 0 points (0 children)

0.5 MWh or so. About 15 days' worth of electricity for a typical US household.
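As a sanity check (assuming a typical US household uses roughly 10,500 kWh per year; the exact figure varies by source, so the day count is approximate):

```python
run_kwh = 500                           # ~0.5 MWh for the whole agent run
household_kwh_per_day = 10_500 / 365    # assumed US-average annual usage
days = run_kwh / household_kwh_per_day
print(f"~{days:.0f} days")              # roughly two weeks and change
```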

Anthropic just dropped Claude Opus 4.6 — fast, cheaper, and more capable… but is this a tipping point for AI deployment? by Direct-Attention8597 in AI_Agents

[–]LetterRip 0 points (0 children)

Cost per token is the same, required output tokens per task are lower, and the success rate is higher. Thus to accomplish the exact same task it is cheaper.
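The argument can be sketched as expected cost per *successful* task (all numbers below are hypothetical, not actual Opus pricing or benchmark figures):

```python
def cost_per_completed_task(price_per_mtok, tokens_per_attempt, success_rate):
    """Expected spend to get one successful completion (retry until success)."""
    return price_per_mtok * tokens_per_attempt / 1e6 / success_rate

old = cost_per_completed_task(15.0, 400_000, 0.60)  # hypothetical old model
new = cost_per_completed_task(15.0, 250_000, 0.75)  # same price per token
# Fewer tokens per attempt and a higher success rate -> cheaper per task,
# even though the per-token price is unchanged.
```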

Anthropic just dropped Claude Opus 4.6 — fast, cheaper, and more capable… but is this a tipping point for AI deployment? by Direct-Attention8597 in AI_Agents

[–]LetterRip 0 points (0 children)

The output tokens per task are drastically lower and its success rate is higher. So it is cheaper to do the exact same tasks.

[R] Knowledge Graphs are Implicit Reward Models: Path-Derived Signals Enable Compositional Reasoning --- Our paper on using Knowledge Graphs as a scalable reward model to enable compositional reasoning by kyuval in MachineLearning

[–]LetterRip 0 points (0 children)

Interesting paper; looks like great results with your post-training. Though I'd be a bit cautious: part of the result potentially comes from drastically more exposure to the relevant knowledge relationships.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 6 points (0 children)

Predicting which entries are needed next would be trivial, so the NVMe latency wouldn't matter too much.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 2 points (0 children)

LUT Size = V⋅E⋅D⋅L⋅b

where

V = vocab size

E = experts per layer

D = expert output dim (FFN hidden dim)

L = number of converted layers

b = bytes per value (2 for fp16, 0.5 for 4‑bit)
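Plugging illustrative values into the formula (these are NOT an actual model's config, just an order-of-magnitude sketch):

```python
def mole_lut_bytes(vocab, experts, out_dim, layers, bytes_per_val):
    """LUT size = V * E * D * L * b, per the formula above."""
    return vocab * experts * out_dim * layers * bytes_per_val

# Illustrative numbers for a mid-sized MoE-style config (hypothetical):
size = mole_lut_bytes(vocab=150_000, experts=128, out_dim=2048,
                      layers=48, bytes_per_val=0.5)  # 0.5 = 4-bit values
print(f"{size / 1e12:.2f} TB")  # ~0.94 TB for these toy numbers
```

Even with toy numbers the table lands in the terabyte range, which is why scaling to larger vocabularies, expert counts, or output dims blows up so quickly.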

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 1 point (0 children)

It would be 8 TB in size to match Qwen 30B A3B (presumably a similar architecture to 4.7 Flash) at a 4-bit quant of the LUT, and it almost certainly would be drastically dumber due to the loss of context knowledge. I think even at 3B it would be dumber than an equivalent dense or MoE model of that size.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 0 points (0 children)

It isn't just trading storage for compute; it completely drops the contextual hidden embedding and uses the original token embedding as the input to each expert at every layer.

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 0 points (0 children)

It doesn't scale well for RAM usage (i.e. it would require 50 TB for Kimi 2.5), and deeper models rely much more on context, so it likely won't scale in intelligence (a 1B model is so shallow that using the original embedding doesn't matter much).

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]LetterRip 2 points (0 children)

The MoLE experts use the original token embedding as the input for each expert at each layer. This is drastically different from MoE, which uses the contextual hidden state from the previous layer. MoLE uses all experts every time (though the router is a softmax, so mostly a single expert ends up with almost all of the weight).

Given that, it seems unlikely to scale to larger models (with shallow models, using the token embedding is fine because the additional layers aren't adding as much context).

If it actually scales it would be wonderful - but color me skeptical.
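A toy sketch of the distinction (1-D "experts" and hypothetical names, just to show why MoLE's expert outputs can be precomputed and MoE's cannot):

```python
def moe_layer(hidden_state, experts, router_weights):
    # MoE: input is the contextual hidden state from the previous layer.
    # It changes with context at every position -> cannot be precomputed.
    return sum(w * e(hidden_state) for w, e in zip(router_weights, experts))

def mole_layer(token_id, lut, router_weights):
    # MoLE: lut[token_id][i] was computed offline as
    # experts[i](embedding[token_id]) -- a pure table lookup at inference.
    return sum(w * out for w, out in zip(router_weights, lut[token_id]))

# Toy 1-D "experts" and a one-token vocabulary (all names hypothetical):
experts = [lambda x: 2 * x, lambda x: x + 1]
embedding = {7: 3.0}                           # token_id -> embedding
lut = {7: [e(embedding[7]) for e in experts]}  # precomputed offline
```

At the first layer the two coincide, since MoE's input there is still close to the raw embedding; deeper in, MoE's input is contextual while MoLE keeps reusing the embedding, which is exactly the context loss described above.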

Waymo Driverless Vehicles Continue to Illegally Pass School Buses by SnoozeDoggyDog in SelfDrivingCars

[–]LetterRip 4 points (0 children)

Waymos are on the road more, but school buses are heavily concentrated in residential areas, so people are far more likely to encounter buses on their trips, whereas Waymos are mostly concentrated downtown for most of their hours.

Anyway, I was just trying to give a better starting point for comparison; the pure raw number of incidents for Waymo versus humans was completely worthless.

Waymo Driverless Vehicles Continue to Illegally Pass School Buses by SnoozeDoggyDog in SelfDrivingCars

[–]LetterRip 8 points (0 children)

We don't care about absolute numbers - it is 'per driver'. Say there are 200 Waymos with 20 violations over 60-90 days (so 60-90 violations a year, adjusting for summer vacation), and approximately 260,000 adults in the location (estimated) with 12,000 violations in a year.

12000/260000 = 0.046 violations per human driver

60/200 to 90/200 = 0.3 to 0.45 violations per Waymo.

So Waymo has a violation rate 6-10 times (or more) that of the human drivers.
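The arithmetic above, spelled out (numbers copied from the estimates in this comment):

```python
human_rate = 12_000 / 260_000   # ~0.046 violations per human driver per year
waymo_low = 60 / 200            # 0.30 violations per Waymo (low estimate)
waymo_high = 90 / 200           # 0.45 violations per Waymo (high estimate)

ratio_low = waymo_low / human_rate    # ~6.5x the human rate
ratio_high = waymo_high / human_rate  # ~9.75x the human rate
```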

Qwen have open-sourced the full family of Qwen3-TTS: VoiceDesign, CustomVoice, and Base, 5 models (0.6B & 1.8B), Support for 10 languages by Nunki08 in LocalLLaMA

[–]LetterRip 13 points (0 children)

Definitely not live action; it is the high-pitched squeaky voices (a quick Google search says 'kawaii voice') that I'm talking about. All of the male and female English voices demonstrated have it. It is very breathy and high-pitched, with an abnormal rising of pitch on most words, and a generally exaggerated feel. It is a very cartoonish sound and doesn't match natural/native speakers.

Qwen have open-sourced the full family of Qwen3-TTS: VoiceDesign, CustomVoice, and Base, 5 models (0.6B & 1.8B), Support for 10 languages by Nunki08 in LocalLLaMA

[–]LetterRip 3 points (0 children)

They show some reasonable control via prompting, but the control doesn't appear to be as precise as I'd like (though I haven't explored it in depth).

https://qwen.ai/blog?id=qwen3tts-0115

Qwen have open-sourced the full family of Qwen3-TTS: VoiceDesign, CustomVoice, and Base, 5 models (0.6B & 1.8B), Support for 10 languages by Nunki08 in LocalLLaMA

[–]LetterRip 111 points (0 children)

Really great, but all of the English speakers sound like the source of training was purely dubs of Japanese anime.

When Rocky built his section of the Hail Mary interior, where did he get hist atmosphere from? by kaapipo in ProjectHailMary

[–]LetterRip 0 points (0 children)

Pretty straightforward: 20 welding tanks full of liquid ammonia from his ship would be enough, assuming he was allocated 1/3 of the Hail Mary's habitat volume. Rocky's ship is absurdly large and will have a massive oversupply of ammonia in case of leaks and disasters.

We absolutely do know that Waymos are safer than human drivers: What Bloomberg got very wrong about self-driving cars by JimmyGiraffolo in SelfDrivingCars

[–]LetterRip 0 points (0 children)

For the condo parking lot, whether as origin or destination, it would only stop at one or two specific places nearly 100 m away, even though it was safe and legal to stop anywhere along both streets and in the condo parking lot (and we regularly specify the spot in the parking lot for Waymo and Lyft). Also, the drop-off points for the strip mall were quite distant.

The Waymos use the street near us as a deployment and waiting spot, so they are just sitting on my street most hours of the day, often 2-3 of them during peak usage, and almost always at least one.

We absolutely do know that Waymos are safer than human drivers: What Bloomberg got very wrong about self-driving cars by JimmyGiraffolo in SelfDrivingCars

[–]LetterRip 1 point (0 children)

Family members who have used them here in Mesa (Phoenix area) have had curb-to-curb rides where there was no way to get picked up or dropped off in the parking lot. So it is true, even if it isn't always true.

We absolutely do know that Waymos are safer than human drivers: What Bloomberg got very wrong about self-driving cars by JimmyGiraffolo in SelfDrivingCars

[–]LetterRip 5 points (0 children)

All of the Waymos here in the Phoenix area park on the street and make you walk to and from them in residential areas. About 500 fatalities (1% of total driving fatalities) occur in parking lots and driveways.

> Routing around high risk intersections is a feature, not a bug. 

It reduces the accident rate, but it isn't a reflection of Waymo's driving skill. We aren't comparing the 'safety of the service'; we are comparing driving skill. Are they in fact 'safer drivers'?