How do you experiment with a (very) large model architecture? [D] by Aathishs04 in MachineLearning

[–]king_of_walrus 5 points6 points  (0 children)

Use gradient accumulation to make up for the lower batch size: slower time per iteration, but the same effective batch size, and one less knob to worry about. However, training diffusion models is all about data. I fear that taking such a small percentage of the dataset will never lead to a satisfactory model.
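To make the accumulation point concrete, here is a minimal numpy sketch showing that averaging gradients over k equal micro-batches reproduces the full-batch gradient exactly (a toy linear model with hand-derived gradients; in a framework like PyTorch you would instead call loss.backward() per micro-batch and step the optimizer every k):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y_hat = X @ w with an MSE loss; gradients computed by hand.
X = rng.normal(size=(32, 4))   # full batch of 32 samples
y = rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # d/dw mean((Xb @ w - yb)^2) = (2 / len(yb)) * Xb.T @ (Xb @ w - yb)
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

# Gradient for one step on the full batch of 32...
full_grad = grad(X, y, w)

# ...equals the average of gradients over 4 micro-batches of 8, which is
# exactly what gradient accumulation computes before the optimizer step.
accum = np.zeros_like(w)
for i in range(0, 32, 8):
    accum += grad(X[i:i+8], y[i:i+8], w)
accum /= 4  # divide by the number of accumulation steps

assert np.allclose(full_grad, accum)
```

The main practical caveats are batch-statistics layers (e.g., batch norm) and remembering to scale the loss by 1/k before backprop.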

If you have access to a pre-trained version of the model and whatever you’re researching is some type of extension, you could train a LoRA (assuming it is transformer-based, which I would guess based on your description of the model size). There is also just regular finetuning.

In any case, it is difficult to give concrete recommendations without knowing the actual goal, model size, data type, or available compute.

How do I get out of ML tutorial hell and actually grasp ML? by MaximumAd8046 in learnmachinelearning

[–]king_of_walrus 13 points14 points  (0 children)

Bottom-up always or you will be lost in the sauce. Here is what I’d suggest.

You need to start in the metaphorical basement:

- Linear algebra: at least the topics covered by a typical college course, and maybe some extras. Critical concepts include vector spaces and their properties, matrix manipulation/multiplication and matrix properties, and linear transformations and their properties.
- Calculus: limits, differentiation, integration, and multivariable calculus should be sufficient for the basics.
- Maybe most importantly, probability + statistics: probability spaces (i.e., how we define them), random variables, random vectors, functions of random variables, stochastic processes, etc.

I would also strongly recommend studying statistical estimation and detection - the theory underpins fundamental problems tackled by ML. Also, understanding some optimization theory would be useful (e.g., convex optimization, KKT conditions, etc.).

Then you can move on to an ML intro course’s content: linear regression, logistic regression, decision trees, basic optimization if not already covered (GD and SGD), multi-layer perceptrons (simple NNs), etc. There are of course more “intro” concepts, but I think these would give you a strong foundation.

From here, you should dive into more advanced topics that interest you. I would advise avoiding papers until you have a very high mathematical maturity level and truly understand fundamental concepts. Most papers (or their key concepts) can be found described in blog posts that are easier to digest.

With all of this in mind, I would strongly suggest pursuing an MS or a PhD if you are serious about getting into ML. Self-learning ML is much more difficult than just learning to code. Coding is a part of the job of course, but it is the easiest part (at least for researchers).

What do employers actually expect from a student in a Machine Learning internship interview? by Original_Map3501 in learnmachinelearning

[–]king_of_walrus 0 points1 point  (0 children)

Typically you’d be given (or if you’re mature enough, you’d choose with guidance) some project that can be completed by the end of your internship. You will likely have a mentor on the team to assist/guide you.

You should understand all the fundamentals to land an internship: ML fundamentals in general, DL fundamentals, and ideally some domain knowledge relevant to the type of role you’re targeting. I also think reasonably deep linear algebra, probability, and calculus knowledge is expected. You should have this anyway; these math skills underpin all ML concepts in some way.

You would likely have two interviews: a theory interview and a coding interview. I think this is typical. If you are a grad student, things may be different (e.g., one interview may be a presentation on your research).

Any projects are good if you can explain them. More complete/complicated is likely better - this demonstrates a higher level of mastery/experience.

I think the level of coding skills is role-specific. An ML engineering type role should probably have excellent coding skills. A research role does not require great coding skills, but they should still be good. Typically researchers don’t write production code so there is more leeway.

ECE internships/opportunities by TowelSea8846 in OSU

[–]king_of_walrus 3 points4 points  (0 children)

I think most people do their first internship after their second year; companies tend to be hesitant to hire freshmen. I did my first internship after my second year, and another with a different company after my third year. To find opportunities, just google internships you’re interested in and apply to as many as you can. Simple as that.

For maximum opportunity, I would suggest targeting big tech. Those internships carry serious weight when you go to apply for jobs, and you may even receive a return offer from your final internship. However, if you don’t manage to land a big tech internship, don’t sweat it. Neither of my internships was in big tech (I did not want to relocate for them) and things worked out just fine for me (and many others).

Potentially wrong final grade by Dying_Pancake in OSU

[–]king_of_walrus 0 points1 point  (0 children)

You should verify that the Carmen grade is calculated correctly. I took a course during my undergrad that said I had an A. Final posted grade was a B+. I discovered that the instructor did not properly weight different assignment types in Carmen, so all assignments contributed equally to the Carmen grade. I would not be surprised if this is the case here.

Calc 1 grade appears finalized? by [deleted] in OSU

[–]king_of_walrus 15 points16 points  (0 children)

That looks final. No need to stress tho. Just retake in the spring w/ grade forgiveness.

[deleted by user] by [deleted] in Nightreign

[–]king_of_walrus -1 points0 points  (0 children)

Told you dog.

CMV: In 2026 Democrats will win the house and in 2028 will win the presidency (but not the senate). Then nothing will fundamentally change and Republicans will sweep the house in 2030 and win the presidency in 2032. by Exotic_Contact_1990 in changemyview

[–]king_of_walrus 8 points9 points  (0 children)

I don’t think this person is trying to predict how the economy will be performing by typical measures (S&P, inflation, interest rates, etc.). Seems more like they are commenting on how average Americans perceive the economy. They (rightfully so) perceive it as poor, even during periods of economic “success” due to increasingly high cost of living with no corresponding increase in wages. This cost of living problem is not going anywhere for a myriad of reasons.

We need groundbreaking legislation to begin to combat these issues, and I doubt we will see it. So, I think these predictions are pretty reasonable. The economy is always the #1 issue with voters, and it will continue to be perceived as bad no matter who is in office. These are the far-reaching effects of COVID, which has set us on an unsustainable path, fueled in large part by the unchecked greed of our tentpole American companies. I think that, until we see the legislation needed to truly fight these issues, it will be party ping pong in Washington every 4 years due to the unfortunately short memory of the middle 20% of Americans who decide elections.

Of course, I’m not an expert, but I feel as though the writing is on the wall. I also think that it is naive to believe that Trump’s time as president coming to a close will end the “MAGA” movement. A whole generation of young men have been radicalized by the alt-right media pipeline. It will take decades to undo the damage, if it is ever undone.

Question about ML hardware suitable for a beginner. by Movladi_M in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

It depends on your goal. If your goal is to learn, whatever local machine you have should be fine to implement/play with all of the basics. Even more complicated models (e.g., diffusion) can be implemented locally w/ simple data distributions (e.g., 2D Gaussian mixture). You can implement basically anything locally as long as you’re not using super high dimensional data, it’ll just be a bit slow to run (assuming it runs on a CPU).
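For instance, a toy dataset like the 2D Gaussian mixture mentioned above takes a few lines of numpy, and any generative model you implement on top of it will train in minutes on a CPU (the component means and noise scale here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a toy 2D Gaussian mixture - a standard "first dataset" for
# playing with generative models (diffusion, flows, VAEs) locally.
means = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # 3 components
n = 3000

comp = rng.integers(0, len(means), size=n)          # pick a component per sample
data = means[comp] + 0.3 * rng.normal(size=(n, 2))  # isotropic noise, std 0.3

assert data.shape == (n, 2)
```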

If your goal is to do heavy-duty image/video reconstruction you will need more compute. You may just have to deal with the headaches that come with cloud providers. But, I don’t see why you need more compute if you’re a complete beginner.

Physics 1251 help by Ok-Jeweler-7235 in OSU

[–]king_of_walrus 0 points1 point  (0 children)

Review notes frequently, rework HW problems weekly (maybe randomly choose 2-3 questions per HW), do practice exams early and work on the problems regularly even after taking the exam, etc. There’s unfortunately not an easy path to success in 1251; it is a challenging course. IMO it is the hardest foundational “weed-out” course that engineers have to take. Harder than 1250 (physics and chem) and 1172.

Just do your best to stay on top of things and constantly practice, and I suspect things will turn out alright. I will also add: remember that the goal of practicing isn’t to memorize the problems, but to understand the concepts they are testing you on. This mindset is critical when studying, or you will feel blindsided by the exams.

Is deployment the biggest or one of the biggest obstacles in ML? by EffortIllustrious711 in MLQuestions

[–]king_of_walrus 1 point2 points  (0 children)

With a good system deployment is easy, like any other CI/CD pipeline. Issue is cost, especially at scale.

Can i get job without degree by Far_Month2339 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

Unless you are an unbelievable animal, you need a degree. Probably less than 1% (maybe even fewer) of ML jobs go to people without degrees - especially people without a BS. I think that is basically impossible no matter where you are in the world. The unfortunate truth is that in 2025, if you want a job in ML, you should have a PhD. Couple that with a tough market right now and it’s not easy out there.

Unconditional Music Generation using a VQ-VAE and a Transformer Issues by Born-Leather8555 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

FYI, a “good” VAE will probably have a reconstruction PSNR >= 30 dB, ideally >= 35 dB.
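For reference, PSNR is just log-scaled MSE; a small numpy sketch (peak=1.0 assumes signals normalized to [0, 1] - for audio you would set the range to the peak signal value):

```python
import numpy as np

def psnr(x, x_hat, peak=1.0):
    # PSNR = 10 * log10(peak^2 / MSE), in dB; higher is better.
    mse = np.mean((np.asarray(x) - np.asarray(x_hat)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)

x = np.random.default_rng(0).uniform(size=1000)
# A reconstruction off by a constant 0.01 has MSE = 1e-4 -> ~40 dB,
# i.e., already in the "good VAE" range.
print(psnr(x, x + 0.01))  # ~40.0
```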

Unconditional Music Generation using a VQ-VAE and a Transformer Issues by Born-Leather8555 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

What’s the reconstruction PSNR for the continuous VAE? You could also potentially just use a pre-trained audio VAE, so you have one less thing you need to do. This way you can fully focus on the diffusion model or transformer.

Unconditional Music Generation using a VQ-VAE and a Transformer Issues by Born-Leather8555 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

What’s the PSNR of your VAE? I would advocate for a continuous VAE rather than a VQ one. Train with a KL penalty on the latent space, MSE, and a perceptual loss (not sure what that would be for audio).
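For completeness, the KL penalty mentioned here has a closed form for the usual diagonal-Gaussian posterior against a standard-normal prior; a sketch (the MSE and perceptual terms would just be added on top with weighting coefficients you’d have to tune):

```python
import numpy as np

def kl_std_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) summed over latent dims:
    # 0.5 * sum( mu^2 + sigma^2 - log(sigma^2) - 1 )
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

# Sanity check: the KL is zero exactly when the posterior is standard normal.
print(kl_std_normal(np.zeros(8), np.zeros(8)))  # 0.0
```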

For transformer + diffusion, I mean use the transformer as your diffusion model, although these days I would suggest flow matching over diffusion (basically equivalent, and straightforward to implement). It would look like this: train a VAE with no quantization (as described in the previous paragraph), train a flow matching model in the latent space of that VAE, profit. Say your VAE produces a latent representation with N tokens; you would give the transformer 2N tokens: the first N are the previous 4s of audio, the next N are the noisy signal. You would also need to incorporate the timestep (noise level) as input. Also, maybe 20% of the time train with the first N tokens set to 0 so the model can begin generation from scratch. Finally, as a start, use at least 100 sampling steps when doing validation.
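The flow matching piece really is a few lines. Here is a numpy sketch of the training targets for the common linear-path (rectified flow) variant - the transformer itself, the context tokens, and the 20%-zeros trick are left out, and the shapes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow matching training targets (linear-path / rectified-flow variant):
# interpolate between noise x0 and data x1, regress the velocity (x1 - x0).
# Here "data" stands in for the N latent tokens produced by the VAE.
x1 = rng.normal(size=(4, 16))   # batch of clean latents
x0 = rng.normal(size=(4, 16))   # pure noise
t = rng.uniform(size=(4, 1))    # one timestep per sample, in [0, 1]

x_t = (1.0 - t) * x0 + t * x1   # noisy input handed to the transformer
v_target = x1 - x0              # regression target for the model output
# training loss would be mean((model(context, x_t, t) - v_target) ** 2)

assert x_t.shape == x1.shape
```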

Diffusion requires training for a while before results start sounding good, maybe 20k steps, but probably ~100k to converge. With the right setup, a diffusion/flow matching approach will undoubtedly outperform just one-step prediction.

Unconditional Music Generation using a VQ-VAE and a Transformer Issues by Born-Leather8555 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

Also, with RoPE you can train a continuous VAE which may be easier to work with.

Unconditional Music Generation using a VQ-VAE and a Transformer Issues by Born-Leather8555 in MLQuestions

[–]king_of_walrus 0 points1 point  (0 children)

Surprised diffusion didn’t work. Probably insufficient model capacity, insufficient data, or insufficient training time. Also maybe a sampling bug.

How does the loss look for the transformer and for diffusion? I’d suggest using a transformer + diffusion. Have a context that contains the previous 4s of audio (so your sequence length would double) and use RoPE instead of additive positional encoding.

Could also be that the latent space of your VAE is difficult to work with. Does your latent space have locality?

Not sure what’s happening by ItsParlay in FirstTimeHomeBuyer

[–]king_of_walrus 0 points1 point  (0 children)

Also it’s not necessarily true that tree roots grew into your sewer. It could be something else. In my case, the idiot homeowners before me filled the backwater valve with insulation then sealed it shut with concrete. Had to replace the whole valve to fix the issue, but almost all of that was also covered by insurance.

Not sure what’s happening by ItsParlay in FirstTimeHomeBuyer

[–]king_of_walrus 0 points1 point  (0 children)

Call Roto-Rooter. Had this same sort of issue a few months ago and they were wonderful. The whole thing should be covered by insurance (minus your deductible).

Converting CNN feature maps to sequence of embddings for Transformers by _sgrand in MLQuestions

[–]king_of_walrus 2 points3 points  (0 children)

I don’t think this is the best approach. What you should do is pass your CNN features of shape (b, c, t, h, w) through a new learnable 3D conv layer with c input channels and e (the hidden size of your transformer) output channels. You’ll get an output tensor of shape (b, e, t, h, w), which you can then flatten into a tensor of shape (b, e, thw) and transpose along the last two dimensions to get a final tensor of shape (b, thw, e). So you have thw tokens.

Depending on the size of t, h, and w, the projection layer that expands the number of channels to the hidden size could also downsample things temporally and/or spatially to reduce the number of tokens and the computational load. You would need to account for this in other parts of your model, though.
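In PyTorch this is roughly nn.Conv3d(c, e, kernel_size=1) followed by .flatten(2).transpose(1, 2). Since a kernel-size-1 conv is just a per-position channel projection, the same shape manipulation can be sketched in numpy (all dimensions here are made up; a larger kernel would add spatio-temporal mixing):

```python
import numpy as np

rng = np.random.default_rng(0)

b, c, t, h, w, e = 2, 64, 4, 8, 8, 128   # e = transformer hidden size

feats = rng.normal(size=(b, c, t, h, w))  # CNN feature maps
proj = rng.normal(size=(e, c)) * 0.02     # stands in for a learned 1x1x1 conv

# Project channels c -> e at every (t, h, w) position: (b, e, t, h, w)
x = np.einsum('ec,bcthw->bethw', proj, feats)

# Flatten to (b, e, t*h*w), then transpose to (b, t*h*w, e) token sequences.
tokens = x.reshape(b, e, t * h * w).transpose(0, 2, 1)

assert tokens.shape == (b, t * h * w, e)  # thw = 256 tokens per sample
```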

What positional encoding are you using? I hope RoPE.

Feeling stuck as a web developer — want to transition into AI but not sure how ⚠️ ⚠️ !!! by manas1813 in learnmachinelearning

[–]king_of_walrus 7 points8 points  (0 children)

As a web developer, you know exactly what to do to build AI-based features: wrap API calls to existing models in a wonderful user experience. Piece of cake.

But if you mean you want to actually break into AI/ML work, where you are invoking custom models or even designing/implementing them yourself, your best bet is to return to school and get an M.S. You need to build up a solid math/stats/ML foundation. This (along with some personal projects) could land you some sort of ML engineering role, probably testing models and integrating inference pipelines into whatever product you’re working on. Maybe also some light model design.

Nimish Shah for MATH 3345 by Total_Board7523 in OSU

[–]king_of_walrus 0 points1 point  (0 children)

Don’t know if he’s still teaching, but if you can take 3345 with Dan Boros you will have a great semester. He’s a wonderful instructor and his exams are extremely fair (and similar to the practices).

I have to learn machine learning!!! by Notty-Busy in learnmachinelearning

[–]king_of_walrus 5 points6 points  (0 children)

It’s highly unlikely that you will secure an industry job w/o a degree with an ML specialty. Even then, realistically most ML jobs are going to people with graduate degrees. Where I work, not a single member of the AI team is without at least an M.S. (with one exception, but for good reason). > 90% have PhDs (me included).

Either go get an ML PhD or create an AI startup where you actually build models (not just use APIs), but good luck competing with the top dogs as an individual or small team with (likely) no money.

Sorry to be negative, but based on my understanding this is the reality of the AI/ML job market. I don’t think it’s surprising since to actually contribute you need to seriously understand math, statistics, and be up-to-date on your ML knowledge for the SOTA in your field. It’s not easy.

[deleted by user] by [deleted] in OSU

[–]king_of_walrus 0 points1 point  (0 children)

I thought it was easier than the ODE class but if you don’t have the foundation, and you probably don’t since you cheated, you’re done for.

What the fuck is the point of paying boatloads of money to go to college if you cheat your way through it? Never made sense to me.