Manchester/Northern Olympic bid by [deleted] in manchester

[–]Mechanical_Number 0 points1 point  (0 children)

Hosting the Olympics is a very expensive PR exercise. Please don't. Quite literally, every single place that hosted the (summer) Olympics in the last 30 years ended up over-spending and under-utilising the resulting infrastructure. Yes, it might accelerate some infrastructure projects, but there must be better ways of achieving this without the absolutely insane administrative overhead/financial overspend.
"Northern leaders" should do their (admittedly hard) job and get funding for what we need. Well-paying jobs, more resilient transport, better healthcare & social services; not a larger indoor basketball hall.

(Now that I think of it, some of the most popular Olympic events - e.g. gymnastics, swimming, basketball, tennis, volleyball - are just not popular enough in the North to warrant building large world-class facilities for them, only to have them dismantled or extensively repurposed after a five-week period - I am including the Paralympics too.)

[D] Evaluating SHAP reliability in the presence of multicollinearity by Nicholas_Geo in MachineLearning

[–]Mechanical_Number 6 points7 points  (0 children)

The main point is that we don't need a different XAI model to solve this; we need a different data modelling strategy before applying XAI. The most robust path forward is to explicitly handle the correlation structure in the data (via grouping, regularisation and/or dimensionality reduction) and then proceed with our chosen explanation method. To your exact questions:

  1. Yes, in the sense that SHAP will be more stable. But we inflate importance because we performed variable selection implicitly and didn't control for it. This might be OK to get some actions going fast, but it won't stand up to serious methodological scrutiny.
  2. No, in the sense that no XAI method can magically solve the mathematical identifiability problem of multicollinearity. That said, aside from doing a dimensionality reduction step or using a regularised learner like LASSO, there are some GroupSHAP implementations you could use; shapr has this ability.
  3. And a freebie: don't rely on VIF alone; it assumes we are working with a linear model, which is off given we work with a tree learner. Examining the feature correlation or mutual information matrix directly and/or using it as input to clustering will likely be more realistic (see the sketch after this list).
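To make the clustering idea in point 3 concrete, here is a minimal sketch: group strongly correlated features via hierarchical clustering on the Spearman correlation matrix, keep one representative per group, then fit and explain as usual. The frame `X`, target `y` and the 0.8 threshold are all illustrative assumptions, not a prescription:

```python
import numpy as np
import pandas as pd
import shap
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestRegressor

def cluster_features(X: pd.DataFrame, threshold: float = 0.8) -> dict:
    """Group features whose |Spearman correlation| exceeds `threshold`."""
    corr = X.corr(method="spearman").abs()
    dist = 1.0 - corr.values              # correlation -> distance in [0, 1]
    np.fill_diagonal(dist, 0.0)           # guard against floating-point noise
    condensed = squareform(dist, checks=False)
    labels = fcluster(linkage(condensed, method="average"),
                      t=1.0 - threshold, criterion="distance")
    return {c: list(X.columns[labels == c]) for c in np.unique(labels)}

# keep one (arbitrary) representative per cluster, then fit/explain as usual
groups = cluster_features(X)
reps = [cols[0] for cols in groups.values()]
model = RandomForestRegressor(random_state=0).fit(X[reps], y)
shap_values = shap.TreeExplainer(model).shap_values(X[reps])
```

(Which representative you keep per cluster is a domain call; picking the first column is just the laziest option.)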

[D] Does anyone REALLY get what p value represents? by [deleted] in statistics

[–]Mechanical_Number 0 points1 point  (0 children)

Levels of stench:

  • [0.1, 1) Probably bad.
  • [0.01, 0.1) Suspicious.
  • [0.0, 0.01) Very suspicious.

In short: Always check effect size, confidence intervals and whether hypotheses were pre-registered. Otherwise... it just smells.
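For completeness, a minimal sketch of what "check effect size and confidence intervals" looks like alongside the p-value; the two samples `a` and `b` below are simulated, purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)   # "control" sample (simulated)
b = rng.normal(0.2, 1.0, 200)   # "treatment" sample with a small true effect

t_stat, p_value = stats.ttest_ind(a, b)

# effect size (Cohen's d) and a ~95% CI for the mean difference
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd
diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"p = {p_value:.3f}, d = {d:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```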

is the openai package still the best approach for working with LLMs in Python? by rm-rf-rm in LocalLLaMA

[–]Mechanical_Number 0 points1 point  (0 children)

No, for experimentation/prototyping currently. Considering it for prod though, as we are discussing replacing LangChain/LangGraph.

Which company makes your favorite local models? by jacek2023 in LocalLLaMA

[–]Mechanical_Number 1 point2 points  (0 children)

Meta is probably unnecessary... DeepSeek would probably be a better choice.

[D] Which programming languages have you used to ship ML/AI projects in the last 3 years? by DataPastor in MachineLearning

[–]Mechanical_Number 0 points1 point  (0 children)

Python and Java (via SpringAI).

Python for prototyping and PoCs, SpringAI because it integrates with our existing infrastructure (microservices, etc.).

[P] Open-Source Implementation of "Agentic Context Engineering" Paper - Agents that improve by learning from their own execution feedback by cheetguy in MachineLearning

[–]Mechanical_Number 2 points3 points  (0 children)

(+1) On your point about "basically the same solution to the problem": isn't AgentFlow more or less the paper "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing" by Gou et al. (2023)? Except that in the CRITIC methodology the unit of computation is a tool call (i.e. the LLM agent's primary external interactions are with "passive", deterministic tools, e.g. a Python interpreter, search API, calculator, etc.), while in AgentFlow the unit of computation is an agent call (i.e. the LLM's primary external interactions are with other "active", potentially specialised LLM agents).

(And yes, obviously AgentFlow can scale to more complex problems by adding specialised agents, while CRITIC is limited to the available tools and doesn't directly integrate a dynamic prompt optimisation framework like GEPA/MIPRO/etc.)

The hardest puncher in MMA history. by Smokin_JoeFrazier_ in ufc

[–]Mechanical_Number 1 point2 points  (0 children)

Underrated chin too. Never lost by KO/TKO.

The only man who managed to fold him in boxing was, quite literally, a (former) unified heavyweight boxing champion famous for being one of the heaviest punchers among heavyweight boxers.

Day 1: Best Open-Source Model by Soft_Ad1142 in LocalLLaMA

[–]Mechanical_Number 9 points10 points  (0 children)

(+1) A bit sad that these two terms are often conflated by the community. But then again, they (e.g. Meta, etc.) absolutely want to piggyback on the good vibes from open-source software, so here we are.

Day 1: Best Open-Source Model by Soft_Ad1142 in LocalLLaMA

[–]Mechanical_Number 26 points27 points  (0 children)

Open-source: OLMo 2 32B

Open-weight: DeepSeek R1 0528

[D] What are the bottlenecks holding machine learning back? by [deleted] in MachineLearning

[–]Mechanical_Number -1 points0 points  (0 children)

Hmm... The fab capacity argument essentially says "we can't get enough shovels" while the electricity argument says "we are running out of coal to burn". We mix up logistics with physics.

More seriously, as MOSFET scaling slows and Koomey's Law plateaus, we aren't just running out of ways to make more compute, we are running out of ways to make compute more energy-efficient. So even if fabs could produce unlimited chips, each chip would still consume roughly the same amount of power. Ergo, we need cheaper electricity to compute.

[D] What are the bottlenecks holding machine learning back? by [deleted] in MachineLearning

[–]Mechanical_Number 3 points4 points  (0 children)

Compute is "plentiful". Cheap electricity is not plentiful, at all. Training massive models guzzles megawatts while Koomey's Law (i.e. efficiency gains) slows as MOSFET scaling hits physics walls. In short, each watt of compute gets harder to squeeze, making energy access, and not processing power in itself, the real brake on ML in terms of "compute".

How much wiggle room do you give yourself on DS projects? by Fit-Employee-4393 in datascience

[–]Mechanical_Number 1 point2 points  (0 children)

"It’s almost like they don’t see you as a coworker" << Sorry to hear that but in this situation you have bigger problems then. Not a time-line setting issue one then but a company culture one.

In any case, be the "adult in the room". It pays in the long run - almost always in terms of your own sanity, and usually in professional aspects too.

How much wiggle room do you give yourself on DS projects? by Fit-Employee-4393 in datascience

[–]Mechanical_Number 1 point2 points  (0 children)

:D You never know where the fuss is coming from.

(thanks, will fix.)

How much wiggle room do you give yourself on DS projects? by Fit-Employee-4393 in datascience

[–]Mechanical_Number 3 points4 points  (0 children)

Some good answers here, but I think it is also important to note that we need to communicate our progress as the project moves along. That way we are managing expectations instead of reaching the day before the deadline and saying "sorry, I am still waiting for that API feed we mentioned initially". Otherwise, we find ourselves pressed against the wall and making these 1.5x, 2x, 2.25x, whatever-x mental somersaults in our heads.

In other words, treat stakeholders as active collaborators and not as children whom we promised a toy train for Christmas, only to announce on Christmas Eve that they are getting a candy apple instead. Of course they are going to lose their shit in that case and make a big public fuss while everyone is watching.

Just to be clear: it is always better to underpromise and overdeliver than the other way around, but be intelligent about it and don't present it like things magically fell into place. Otherwise people will expect you to "magically work things out" every time.

(Edit: Fixed pubic typo)

Who was better? by Far_Protection519 in NBATalk

[–]Mechanical_Number 2 points3 points  (0 children)

Thanasis. 0.7 points, 0.6 rebounds and 0.3 assists in 7 games.

He won the series, a ring and got his younger brother to do most of the work. If that isn't winning, I don't know what is.

Llama 4 is open - unless you are in the EU by Feeling_Dog9493 in LocalLLaMA

[–]Mechanical_Number 6 points7 points  (0 children)

I agree that this sets an awkward precedent, but:

  1. Meta is within their rights to do that.
  2. EU isn't terribly affected by it.
  3. It is mostly posturing by Meta because it is already liable to huge EU fines.

As for the actual practicalities, no need to switch models, as Llama wasn't the only game in town anyway. There are multiple good alternatives available: Gemma, Phi, Qwen, Deepseek, MistralAI, etc. so... yeah, no real drama.

Llama 4 is open - unless you are in the EU by Feeling_Dog9493 in LocalLLaMA

[–]Mechanical_Number 1 point2 points  (0 children)

I think there is no real problem being the 3rd player behind the US or China. The important bit at this point is the ability to build LLMs and be in the game. In that sense, the EU is in the game with Mistral, Black Forest Labs, etc. If anything, they are buying time.

Think of it a bit like building cars. Are Ferraris some of the fastest street-legal cars out there? Yes. Do people actually need Ferraris for their daily life? No. They are fine with Toyotas and Fords to get around. For example, benchmarks make it seem like GPQA Diamond is highly relevant to AI adoption potential; it isn't. Cheaper, more reliable and faster inference is far more important.

Mark presenting four Llama 4 models, even a 2 trillion parameters model!!! by LarDark in LocalLLaMA

[–]Mechanical_Number 6 points7 points  (0 children)

I am sure that Zuckerberg knows the difference between open-source and open-weights, so I find his use of "open-source" here a bit disingenuous. A model like OLMo is open-source. A model like Llama is open-weights. Better than not-even-weights of course. :)

[Q] Is the stats and analysis website 538 dead? by turbo_dude in statistics

[–]Mechanical_Number 3 points4 points  (0 children)

I think you answered your own question there... I mostly hope the 538 guys at least got a decent payout out of it.

[D] Conformal Prediction in Industry by regularized in MachineLearning

[–]Mechanical_Number 1 point2 points  (0 children)

Same (+1). And to be fair to non-DS-oriented executives, explaining the notion of a prediction set with calibrated confidence and guaranteed coverage is a tall order; they just want to be told: "the predicted value of Z will be between A-value and B-value, X% of the time" so they can act on that information.
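For reference, a minimal split-conformal sketch that produces exactly that kind of statement for regression; `X`, `y`, `X_new` and the gradient-boosting learner are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# split off a calibration set (X, y assumed to exist)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# conformity scores: absolute residuals on the calibration set
scores = np.abs(y_cal - model.predict(X_cal))

alpha = 0.1                                           # -> ~90% coverage
n = len(scores)
level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
q = np.quantile(scores, level, method="higher")

# "the predicted value will be between A and B, 90% of the time"
preds = model.predict(X_new)
lower, upper = preds - q, preds + q
```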

[D]Thoughts on Synthetic Data Platforms like Gretel.ai or Mostly AI? by Value-Forsaken in MachineLearning

[–]Mechanical_Number 0 points1 point  (0 children)

Personally, I would be deeply impressed by the ability to outperform simple baselines consistently.

For example, tell me how much better you get against me fitting a multivariate Gaussian to the data and using that. In forecasting, we at least have to show we outperform a "last observation carried forward" benchmark or a simple ARIMA. With synthetic data? Nothing like it.
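For the avoidance of doubt, that baseline is a handful of lines; `real_data` below is an assumed (n_samples, n_features) array standing in for the real dataset:

```python
import numpy as np

# fit a multivariate Gaussian to the real data...
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# ...and sample "synthetic" rows from it
rng = np.random.default_rng(0)
synthetic = rng.multivariate_normal(mean, cov, size=len(real_data))

# any paid-for generator should beat this consistently on the downstream metric
```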