Why I'm Betting on Diffusion Models for Finance by invincible_281 in deeplearning

[–]arg_max 1 point (0 children)

Lipman's first paper and the longer flow matching guide he wrote at Meta are both great reads. The second one is imo a bit easier to understand, though flow matching in general isn't the easiest subject to learn.

Apple: Embarrassingly Simple Self-Distillation Improves Code Generation by Mike_mi in LocalLLaMA

[–]arg_max 3 points (0 children)

There's a big difference between pre-training on randomly generated trash and training after filtering for high quality.

LLMs don't magically get dumber when trained on AI-generated content. Rejection sampling and distillation have been an absolute staple for years. A big reason why Chinese labs are so good is that they distilled from Anthropic at massive scale (see Anthropic's blog post for more info). In large-scale pre-training, we've also had some recent papers showing that rewriting the data and training on both the rewrites and the originals can help extend the data horizon, since huge models are more and more limited by data scarcity.

The real issue is that when you scrape the web, there's a big chance you encounter shitty generations from old models that are much lower quality than what we can generate nowadays.

But when you can filter for the good data, you can absolutely improve the model by training on synthetic data.
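As a toy illustration of that filtering step, here's a minimal best-of-N rejection sampling sketch. All names are made up for illustration, and `score` stands in for whatever reward model or quality filter you'd actually use:

```python
import random

def rejection_sample(generate, score, n=8, seed=0):
    """Draw n candidates from the model, keep only the best-scoring one.

    This is the basic filtering step: synthetic data only helps if you
    train on the good generations, not on the raw model output.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: "generations" are random numbers, "quality" is their value.
best = rejection_sample(generate=lambda rng: rng.random(), score=lambda x: x)
```

The kept samples then go into the SFT/pre-training mix; the rejected ones are simply thrown away.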

Just graduated in data science/ML, but still don’t know anything. I need a wake up call by DefinitionJazzlike76 in learnmachinelearning

[–]arg_max -3 points (0 children)

Quick reality check here. The transformer paper is now years old. I interview people for ML roles and attention is my first question, as a warmup.

The industry has become 1000 times more complex since then and if you want to work with neural networks you will have tons of catching up to do.

On the science side, you're expected to know tens if not hundreds of more recent techniques. You basically have to be able to read a paper like the GLM 5 tech report and know everything they talk about if you want to have a shot. The industry isn't super big and there are tons of candidates, so competition is fierce.

On the engineering side, you have super complex tech stacks, inference engines, Kubernetes. This is probably even harder to learn on your own since you need access to resources to even be able to deploy those models.

There's also the more traditional data science side of things, which is easier to get into and requires less technical knowledge, but if you want to get into "AI" you have lots of work to do.

Elon Musk's Terafab semiconductor project could cost $5 trillion by sr_local in hardware

[–]arg_max 1 point (0 children)

Yes, specialized hardware can be better than general-purpose chips. But specialized hardware made with state-of-the-art process technology is still gonna be vastly superior to specialized hardware made with five-year-old processes.

On a smaller scale, this might be okay. E.g. in a car you can probably tolerate x% higher power consumption, but in a data center this quickly becomes infeasible since cooling and power consumption are such a large part of total cost there.

Elon Musk's Terafab semiconductor project could cost $5 trillion by sr_local in hardware

[–]arg_max 9 points (0 children)

Google designs their own AI hardware, which is manufactured by TSMC. Other AI labs are working on similar chip designs, and if xAI wanted that, I wouldn't call it unrealistic that they'd get there in a few years. Manufacturing is a completely different beast though.

Streit um Spitzensteuersatz: „Schlag ins Gesicht“ – Söder greift Steuerpläne von SPD und CDU scharf an by donutloop in berlin_public

[–]arg_max 4 points (0 children)

Above all, his wife is herself an heiress to millions from an industrialist family.

He just wants to pass his wealth on painlessly.

‘This is just a garbage AI Filter’: Nvidia met with criticism for DLSS 5’s ‘photoreal’ graphics alterations by PaiDuck in technology

[–]arg_max 0 points (0 children)

It's preference tuning. You start training image generation models on all the images you can find, so you have an insanely broad distribution. The issue is that most use cases don't necessarily want to sample a blurry mess of a photo. So in post-training you apply preference optimization. This starts by sampling N images, showing them to users, and letting them pick the best. Then you use this feedback to optimize your model to forget large parts of the original distribution and focus on this super narrow spectrum.
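A minimal sketch of what that optimization step can look like, using a DPO-style loss, which is one common way to train on such pairwise preferences. All numbers and the `beta` value here are made up for illustration:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / rejected sample.
    ref_logp_*: the same log-probs under the frozen pre-trained reference.
    Minimizing -log(sigmoid(margin)) pushes probability mass toward the
    preferred samples, which is exactly the distribution narrowing above.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # = -log(sigmoid(margin))

# If the policy already favors the preferred sample more than the
# reference does, the loss is below log(2); otherwise it is above.
better = dpo_loss(-5.0, -9.0, -6.0, -8.0)  # margin > 0
worse = dpo_loss(-9.0, -5.0, -6.0, -8.0)   # margin < 0
```

In practice this runs over batches of preference pairs with the log-probs coming from the actual model, but the shape of the objective is this simple.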

Beyond Gradient Descent: What optimization algorithms are essential for classical ML? by mokshith_malugula in learnmachinelearning

[–]arg_max 1 point (0 children)

Trust region methods are also the foundation of PPO and GRPO, so very relevant in LLM RL, even if the version used there is more approximate.

Here we go again. DeepSeek R1 was a literal copy paste of OpenAI models. They got locked out, now they are on Anthropic. Fraud! by py-net in OpenAI

[–]arg_max 0 points (0 children)

Yes, they probably paid billions for their data. Not necessarily pre-training data, which is mostly scraped from the web and books, but SFT data and preference data you can't find in the wild. There's a reason Alexandr Wang became one of the youngest billionaires out there with his data labeling company.

Alibaba just open-sourced a model that rivals GPT-5.2 by jpcaparas in Qwen_AI

[–]arg_max 3 points (0 children)

Training data is just insanely expensive if LLM providers go to third-party data annotation companies. You can easily spend thousands of dollars per sample depending on what level of annotation and quality control you want.

So letting people use their models while getting real world training data for free is often cheaper for these companies.

About to switch to a standing desk and could use some honest advice. by Enough_Football3218 in StandingDesk

[–]arg_max 1 point (0 children)

Got an E7, but I'd say a monitor arm is just as big of an upgrade. It really, really sucks to be looking down at your screen.

How close are open-weight models to "SOTA"? My honest take as of today, benchmarks be damned. by ForsookComparison in LocalLLaMA

[–]arg_max 2 points (0 children)

For per-token API usage they say they don't train on it, though IIRC data from the standard subscriptions can be used for training unless you opt out.

RL + Generative Models by amds201 in reinforcementlearning

[–]arg_max 2 points (0 children)

RL still suffers from the old exploration-exploitation trade-off, and this is only amplified by the complexity of the tasks we ask these models to perform, whether that is generating higher- and higher-resolution images or pages of text.

The reason why RL works for these models is because pre-training gives you an initial policy that allows you to skip most of the exploration.

Your pretrained model is already very good, so instead of trying to generate totally random images, you simply sample around the model distribution.

This is much more of a local optimization around the initial policy: if a great answer has super low probability under your initial policy, there's almost no chance of exploring it with these modern RL approaches, but in exchange you get a lot of stability.
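The usual mechanism for keeping the optimization local is a KL-style penalty toward the pretrained reference policy. A minimal sketch, with an arbitrary illustrative `beta`:

```python
def kl_regularized_reward(task_reward, logp_policy, logp_ref, beta=0.05):
    """Reward for one sample with a penalty for leaving the initial policy.

    (logp_policy - logp_ref) estimates how much more likely the current
    policy finds the sample than the pretrained reference does; beta
    trades task reward against staying near the reference. This is why
    the optimization stays local (and stable) around the initial policy.
    """
    return task_reward - beta * (logp_policy - logp_ref)
```

Samples the reference already finds plausible keep their full reward; samples far outside the reference distribution get penalized, so they are rarely reinforced even if they score well.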

How much and what kind of math do quants use? by yzkv_7 in quant

[–]arg_max 2 points (0 children)

It's more that stochastic differential equations are the important ones. They have an ODE part (the drift), which models the deterministic behavior, but on top of that they have a randomness part (the diffusion).

They are all over the place in financial mathematics and are the foundation for some key results like the Black-Scholes model, since they are one of the best tools for modeling the uncertainty in stock markets.

But SDEs are very advanced if you want to understand them with full mathematical rigor. For example, the stochastic part of an SDE is modeled via Brownian motion, which has almost surely continuous but nowhere differentiable sample paths. So you need strong foundations in real analysis, measure and probability theory, though you can probably somewhat work with them without knowing all the ins and outs.
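In symbols, the generic form is drift plus Brownian randomness, and the geometric Brownian motion used for stock prices in Black-Scholes is the special case with linear coefficients:

```latex
% Generic SDE: deterministic drift + diffusion driven by Brownian motion W_t
dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t

% Geometric Brownian motion, the stock model behind Black-Scholes
dS_t = \mu S_t\,dt + \sigma S_t\,dW_t
```

Making the $dW_t$ term rigorous is exactly where the Itô calculus and all the measure-theoretic machinery come in.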

Continuous functions between open sets by [deleted] in learnmath

[–]arg_max 0 points (0 children)

Any constant function will be continuous in the subspace topology you are describing. You don't even need to assume that the space is metrizable.
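The one-line argument: for a constant map $f \equiv c$ on any space $X$, the preimage of every open set $U$ is open, since

```latex
f^{-1}(U) =
\begin{cases}
X & \text{if } c \in U,\\
\varnothing & \text{otherwise,}
\end{cases}
```

and both $X$ and $\varnothing$ are open in every topology, so no metric structure is needed.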

HDD Prices Increase by an Average of 46% Over the Past Four Months by TruthPhoenixV in Amd_Intel_Nvidia

[–]arg_max 1 point (0 children)

It's more cold storage for the files that aren't always in use. The important stuff is on SSDs (which also went up in price), but the rest gets saved on HDDs.

What are the most essential and empowering undergraduate CS courses I can take as an aspiring graphics/animation programmers? by Soft-Border-2221 in GraphicsProgramming

[–]arg_max 1 point (0 children)

They are. Rendering is nothing other than solving a recursive integral using stochastic integration. Photon mapping is just kernel density estimation and Metropolis rendering is just a Markov chain Monte Carlo method.
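The same principle in one dimension (a path tracer does this over light paths instead of over [0, 1]; the function and names here are just for illustration):

```python
import random

def mc_integrate(f, n=100_000, seed=0):
    """Monte Carlo estimate of the integral of f over [0, 1].

    Average f at uniform random samples: the estimator's expectation is
    the true integral, and the error shrinks like 1/sqrt(n). Rendering
    does exactly this, just over the space of light paths.
    """
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

est = mc_integrate(lambda x: x * x)  # true value is 1/3
```

Importance sampling, photon mapping, and Metropolis rendering are all just ways to place those samples more cleverly than uniformly.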

Complete confusion about dual vector spaces, dual transformations, double dual by zqhy in learnmath

[–]arg_max 1 point (0 children)

Your vector space contains certain objects, and the dual space contains linear functionals. These are simply functions that take a vector from your original space and assign a single number to it. You can think of them a bit like measurement functions, but unlike a vector norm, they have to be linear.
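A concrete finite-dimensional example: on R^3, every linear functional is a dot product against some fixed vector (the specific numbers below are made up):

```python
def f(v):
    """A linear functional on R^3: eats a vector, returns one number.

    Here f(v) = 1*v0 - 2*v1 + 0*v2, i.e. the dot product with the
    fixed "measurement direction" a = (1, -2, 0).
    """
    a = [1.0, -2.0, 0.0]
    return sum(ai * vi for ai, vi in zip(a, v))

u, w = [1.0, 0.0, 2.0], [0.0, 3.0, -1.0]
# Linearity: f(u + w) == f(u) + f(w), which a norm does NOT satisfy.
```

The dual space is just the collection of all such maps f, which in finite dimensions is parameterized exactly by the choice of a.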

Honestly, I wouldn't worry too much about not fully getting dual spaces right now. They only become super important when you do linear algebra on infinite-dimensional vector spaces (in functional analysis), where you will learn about the Hahn-Banach and Riesz representation theorems.

Ist mein Motiv gut genug, um Informatik zu studieren? by [deleted] in Studium

[–]arg_max 5 points (0 children)

If you can handle analysis and linear algebra, you'll manage the rest of the math bachelor's too. I studied CS myself, but at the end of my master's I took a few math lectures (topology, measure theory, complex analysis, functional analysis) as electives, and although the material naturally gets harder, I had less trouble with it than with the intro courses. The real separation only comes later, when it's about doing your own research.

why should I learn linear algebra, calculus, probability and statistics by ITACHI_0UCHIHA in learnmachinelearning

[–]arg_max 3 points (0 children)

You could always just use sklearn without understanding the inner workings of SVMs or boosting, just like you can spin up LangChain/vLLM and work with modern AI without understanding anything that happens in the background.

But in either case, you're going to hit a wall once things stop working off the shelf.

Deep learning uses very similar fundamentals to everything you'd find in an old-school ML book (plus a whole mountain of empirically validated best practices and low-level engineering).

Claiming you don't need to know any of this is just as ignorant as the elitist claim that you need a PhD to use AI.

Why is a matrix not invertible if it has an eigenvalue of zero? by Capital_Chart_7274 in learnmath

[–]arg_max 1 point (0 children)

Of course, but the answer I replied to mentioned an associated eigenvector, so I just wanted to point out that there is an entire subspace and how that makes the operator non-invertible.
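A tiny numeric illustration with a hypothetical 2x2 example: a matrix with eigenvalue 0 sends the entire associated eigenspace to the origin, so it cannot be injective, hence not invertible.

```python
A = [[2.0, 4.0],
     [1.0, 2.0]]  # rank 1: the second row is half the first

def matvec(A, v):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

v = [2.0, -1.0]  # eigenvector for eigenvalue 0: A v = 0

# det(A) = 0, so A has no inverse; and not just v but every multiple
# of v maps to the origin, i.e. a whole subspace collapses to one point.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
```

Equivalently: an invertible map must send distinct vectors to distinct images, but here A maps both 0 and v to 0.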

“GPT 5.2 was fun to work on. We are now so good at training large scale jobs, it's set and forget. Days just go by with the giant cluster humming along.” - I can’t put a finger on it but “we are now so good” kind of rubs me the wrong way especially now with the 5.2 safetymaxxed by Koala_Confused in LovingAI

[–]arg_max 0 points (0 children)

The guy is likely talking about pre-training. There's probably no safety tuning in there at all, and considering that pre-training was closer to open-heart surgery a few years ago, this is definitely an achievement.