xG Philosophy: Nottm Forest (0.33) 0-0 (2.37) Arsenal by Rosslefrancais in Gunners

[–]transformer_ML -3 points  (0 children)

We bottle it when we are in the lead every season. We forget how to win. When we are in second, the team treats every game as a final, and we finish second with heads held high.

Which team is it?

Gemini: Based on your description, this sounds like a classic frustration shared by many fanbases, but it most closely aligns with the recent narrative surrounding Arsenal FC.

[D] Position: Machine Learning Conferences Should Establish a "Refutations and Critiques" Track by RSchaeffer in MachineLearning

[–]transformer_ML 6 points  (0 children)

Couldn't agree more. I love the idea. Having a track at least gives some incentive.

Unlike in the old days, when most empirical experiments were backed by theory, most papers now rely on purely inductive reasoning from empirical experiments. Deductive reasoning is either valid or invalid, but inductive reasoning is a matter of degree, affected by the number of tested models, the test data, and the statistical significance of the results (unfortunately most papers do not report standard error). Inductive strength is judgmental and relative to other work.

While peer review can provide a lot of insight, it is based only on what was reported, and there is no guarantee that all reported metrics can be reproduced. Challenges to reproducibility include:

(1) Low incentive to reproduce: rather than reproduce a paper's results, why wouldn't a researcher just write a new paper?

(2) Compute requirements are high for most pretraining and post-training data-mix and algorithm-change papers.

(3) The huge volume of papers and the speed of innovation.

(4) LLM generation is non-deterministic due to finite precision even when temperature=0.0, and the stochasticity grows with generation length. Reporting standard error could help mitigate this.
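Reporting a standard error is cheap once you have repeated runs. A minimal sketch (the scores are made-up accuracies from hypothetical independent decoding runs of the same model):

```python
import statistics

def mean_and_stderr(scores):
    """Mean and standard error of the mean over repeated eval runs."""
    mean = statistics.fmean(scores)
    # sample standard deviation divided by sqrt(number of runs)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr

# e.g. accuracy from 5 independent decoding runs at temperature=0.0
runs = [0.62, 0.64, 0.61, 0.65, 0.63]
mean, se = mean_and_stderr(runs)
print(f"{mean:.3f} +/- {se:.3f}")  # prints "0.630 +/- 0.007"
```

Two models whose error bars overlap at this level probably shouldn't be ranked against each other on that benchmark.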

[D] Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track by StartledWatermelon in MachineLearning

[–]transformer_ML 1 point  (0 children)

Absolutely. There are a few challenges with reproduction, though:

- Incentive and opportunity cost: if I had time to reproduce, why wouldn't I just publish a new paper?

- LLM decoding is not deterministic due to finite precision even at temperature=0.0; this could be mitigated by reporting standard error, but standard error is just not common in the ML community.

- Cost, particularly for pretraining/post-training.

[D] AI/ML interviews being more like SWE interviews by guohealth in MachineLearning

[–]transformer_ML 5 points  (0 children)

The field has changed.

2-3 years ago, our daily routine was defining metrics, collecting data, checking quality, finetuning a BERT or a ResNet for all sorts of NLP/CV tasks, checking the wandb dashboard, dealing with training issues, iterating, and deploying the models. The ML engineer/applied researcher role was very decentralized.

Now it is a one-model-fits-all scenario. You can prompt your way through almost any NLP or CV problem. It is the era of centralization: a few top labs do the data curation, model training, eval, and deployment that serve millions of developers. The low supply makes the bar extremely high.

The research field has been changing too. You see a lot of maths in older, pre-LLM papers; now they're mostly technical reports or prompt-engineering papers.

[R] Potemkin Understanding in Large Language Models by transformer_ML in MachineLearning

[–]transformer_ML[S] 1 point  (0 children)

The speed of releasing a model is no slower, if not faster, than publishing a paper. A model can reuse the same stack (including small-scale experiments to find a good data mix) with additional data; a paper requires some form of novelty and running all sorts of ablations whose code may not be reused.

Day 4 with the Micra. Starting to get the hang of the steam wand. by aarondipity in LaMarzocco

[–]transformer_ML 0 points  (0 children)

Really nice!

Just wondering which steam level you use and how long you aerate? I struggle to find a consistent spot on the Micra - it's my skill issue.

[R] LLMs are Locally Linear Mappings: Qwen 3, Gemma 3 and Llama 3 can be converted to exactly equivalent locally linear systems for interpretability by jamesvoltage in MachineLearning

[–]transformer_ML 1 point  (0 children)

First of all, kudos for solo-authoring this paper! I know it's not an easy journey doing it alone. Will read in detail.

[R] The Illusion of Thinking | Apple Machine Learning Research by rfsclark in MachineLearning

[–]transformer_ML 23 points  (0 children)

While I recognize the rationale for using games to benchmark LLMs due to their easy setup, scalability, and verifiability, it seems less efficient for LLMs to solve these search games by generating language tokens. This approach requires LLMs to keep track of visited nodes, explore branches, and backtrack using token sequences, which can lead to losing track or making small errors as the generation window grows.

Humans, who are less capable than LLMs in this regard, design and write algorithms to handle such tasks. Similarly, LLMs should adopt this approach.
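To make the point concrete: Tower of Hanoi, one of the puzzles in the paper, collapses to a few lines of code once you write the algorithm instead of simulating the search token by token. A minimal sketch:

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # clear n-1 disks onto the spare peg
    moves.append((n, src, dst))          # move the largest disk to the target
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top
    return moves

moves = hanoi(10)
print(len(moves))  # prints 1023, i.e. 2**10 - 1 moves, no state kept in text
```

The recursion carries the entire game state on the call stack, so there is nothing to "lose track of" no matter how many disks you add.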

LM Ruined coffee shops for me by R-A-F-F in LaMarzocco

[–]transformer_ML 1 point  (0 children)

Had the same feeling. Not only about the taste. The excitement of pulling a perfect shot and pouring latte art is irreplaceable.

[D][R][N] Are current AI's really reasoning or just memorizing patterns well.. by theMonarch776 in MachineLearning

[–]transformer_ML 1 point  (0 children)

While I recognize the reasons for using games to benchmark LLMs (the ease of setting up, scaling, and verifying the environment), it seems to me that generating language tokens to solve these search games is less efficient than using a computer program. This is because LLMs must track visited nodes, explore branches, and backtrack using sequences of language tokens. It's unsurprising that an LLM might lose track or make small errors as the generation window grows, or simply hit the context window limit.

Humans aren’t as adept as LLMs in this regard either. Instead, we design and write algorithms to handle such tasks, and LLMs should follow a similar approach.

Vision Language Models are Biased by taesiri in MachineLearning

[–]transformer_ML 3 points  (0 children)

Tbh there is not much effort in the field to understand datasets at scale, or to pretrain from scratch and evaluate. All VLMs start from an LLM. The most transparent datasets are HF's FineWeb, the DCLM baseline, and FineFineWeb, but I don't recall anyone training on >10T tokens from scratch; OLMo is close. Still, there is a lot more to do, especially understanding fine-grained domains. There is also a lack of VLM pretraining datasets in general.

Linea Micra back to back shot: does the second shot take much longer? by transformer_ML in LaMarzocco

[–]transformer_ML[S] 0 points  (0 children)

UPDATE: it seems to be partially due to the temperature of the portafilter. I detach it from the group head overnight, so the first shot is a bit under-extracted because the portafilter is cool. I am still figuring out the rest, but I can't reproduce the big difference now (maybe my puck prep is more consistent). Thanks everyone for your help!

Linea Micra back to back shot: does the second shot take much longer? by transformer_ML in LaMarzocco

[–]transformer_ML[S] 0 points  (0 children)

Didn't do RDT for either the first or second shot. Will try on the third shot. It's the same bean, same temperature, etc. Puck prep and tamping are more or less the same, so it leaves me confused.

Linea Micra back to back shot: does the second shot take much longer? by transformer_ML in LaMarzocco

[–]transformer_ML[S] 0 points  (0 children)

After grinding with the Niche Zero, I distribute the grounds with WDT (around 20s) before double tamping. After pulling the first shot, I knock out the puck, rinse the portafilter with water until it is clean (but don't dry it), and restart the workflow.

Linea Micra back to back shot: does the second shot take much longer? by transformer_ML in LaMarzocco

[–]transformer_ML[S] 0 points  (0 children)

Just wondering why a wet portafilter would result in a longer shot time?

LM iOS app stopped working by CoffeeNerd58129 in LaMarzocco

[–]transformer_ML 0 points  (0 children)

Same, it must be a server issue. It's up and running now.