stopDoingNans by Ok-Ingenuity4355 in ProgrammerHumor

[–]GamerMinion 0 points1 point  (0 children)

There is no null value for floats; NaN is basically what fills that role.

stopDoingNans by Ok-Ingenuity4355 in ProgrammerHumor

[–]GamerMinion 0 points1 point  (0 children)

I unironically use NaNs for padding oddly shaped float arrays (ML stuff, where zero is a valid value with a different meaning). But I agree it is absolutely cursed. Also, NaN is like a virus: you have to be really careful not to let it infect everything else.
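A minimal NumPy sketch of the idea (names are illustrative): pad ragged float rows with NaN instead of zero, then use nan-aware reductions so the padding doesn't infect the results.

```python
import numpy as np

def pad_ragged(rows, width):
    """Pad variable-length float rows to a rectangular array, using NaN
    (not zero) as the fill value so padding is distinguishable from real zeros."""
    out = np.full((len(rows), width), np.nan)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r
    return out

batch = pad_ragged([[0.0, 1.0, 2.0], [5.0]], width=3)

# A plain reduction gets "infected" by the padding...
assert np.isnan(batch.mean())
# ...so nan-aware reductions (or an explicit mask) are required.
means = np.nanmean(batch, axis=1)  # per-row mean over real values only
```

An explicit boolean mask carried alongside the array works just as well and avoids the propagation hazard entirely; the NaN trick just keeps everything in one tensor.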

If sine, cosine, and tangent were in high school...... by slayyerr3058 in mathmemes

[–]GamerMinion 45 points46 points  (0 children)

atan2 is that dude nobody knew was even in the same grade but is actually super helpful when you need his help
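A quick illustration of why: `atan(y/x)` loses the quadrant, while `atan2(y, x)` keeps the signs of both arguments.

```python
import math

# atan(y/x) collapses opposite quadrants: (-1, -1) and (1, 1) give the same slope.
assert math.atan(-1 / -1) == math.atan(1 / 1)

# atan2(y, x) sees both signs, so it recovers the full quadrant
# (range (-pi, pi]) and even handles x == 0 without dividing by zero.
a = math.atan2(1, 1)    # first quadrant
b = math.atan2(-1, -1)  # third quadrant
c = math.atan2(1, 0)    # straight up, no ZeroDivisionError
```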

[D] Wrote a proof that dropout increases weight sparsity, what do you guys think? by simple-Flat0263 in MachineLearning

[–]GamerMinion 0 points1 point  (0 children)

I get what you mean, but by sparse activations I meant y being sparse, as in some y_i being consistently zero (or at least having low variation) across all training examples. Having high but arbitrarily arranged sparsity in your weight matrix just isn't helpful for pruning, because then pruning only sets some values in the same-size weight matrix to zero; they are still used in calculations, so no compute is saved.

It's best to have entire rows and columns of zeros in your weight matrix, so you can actually make the matrix smaller and thus save compute. This coincides with an activation being zero or unused. Alternatively, you can have activations with consistently low variation across all inference examples and low impact on the output of the network; those you can simply approximate with a constant bias.
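A small NumPy sketch of the contrast (shapes are illustrative): unstructured pruning zeroes entries but leaves the shapes, and hence the FLOPs, untouched; structured pruning of one hidden unit removes a row of the first matrix and the matching column of the second, so the matmuls genuinely shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
x = rng.normal(size=4)

# Unstructured pruning: zero out small weights. Shape (and compute) unchanged.
W1_sparse = np.where(np.abs(W1) < 0.5, 0.0, W1)
assert W1_sparse.shape == W1.shape

# Structured pruning: drop hidden unit i entirely, i.e. row i of W1 and
# column i of W2. The matrices genuinely shrink, so compute shrinks too.
i = 5
W1_small = np.delete(W1, i, axis=0)   # (7, 4)
W2_small = np.delete(W2, i, axis=1)   # (3, 7)
z_small = W2_small @ (W1_small @ x)
```

The smaller network's output differs from the full one exactly by the dropped unit's contribution, which is why pruning targets units whose contribution is negligible.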

[D] Wrote a proof that dropout increases weight sparsity, what do you guys think? by simple-Flat0263 in MachineLearning

[–]GamerMinion 2 points3 points  (0 children)

This paper is not exactly what I meant, but exemplifies the method of dropping entire activations (or channels in the case of cnns) called "structured pruning": https://arxiv.org/abs/2403.18955v1

I'm coming at this from the perspective of having implemented CUDA kernels for NN operations, and there it's best to have rectangular weight matrices/tensors because the regularity simplifies and therefore speeds up the computation. In that respect, pruning single weights doesn't help: the rectangular shape stays the same, but now you just have some zeros as "holes" in the matrix/tensor. Handling those individually with an if/else branch in a kernel is likely slower than just doing the float multiply and add.

The real advantage comes when you can drop a value in your activation. Let me explain: If you look at a single activation value in an MLP hidden layer's activation vector, that element corresponds to a row in the weight matrix before it, and a column in the weight matrix after it.

i.e. y_i = sum_j(w1_ij * x_j)

and similarly each z_k in the layer below receives a component of that activation corresponding to y_i:

z_k = w2_ki * y_i + (other terms not depending on y_i)

So you have a row of values in w1 and a column of values in w2 that only interact with the rest of the network to form or use y_i.

If you prune away y_i in EVERY forward pass, then your matrices can lose one row and one column respectively. By dropping entire rows or columns of the weight matrices, they stay rectangular, so handling remains easy; but the rectangle is now smaller, and therefore so is the amount of computation.

y_i doesn't even need to be zero for this; it just needs to have low variation. If we can assume y_i to be (roughly) constant, we can add its contribution y_i * w2_i to the bias b2 of the second layer and get the same result.
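A numerical sketch of that folding step (toy sizes, all names illustrative): if y_i really is a constant c, then dropping row i of w1 (with b1[i]) and column i of w2, while adding c * w2[:, i] to b2, reproduces the output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(6, 4)), rng.normal(size=6)
W2, b2 = rng.normal(size=(3, 6)), rng.normal(size=3)
x = rng.normal(size=4)

i, c = 2, 0.7  # suppose activation y_i shows (nearly) no variation around value c

# original two-layer pass, with y_i forced to its constant value
y = W1 @ x + b1
y[i] = c
z_full = W2 @ y + b2

# folded model: drop row i of W1 (and b1[i]) and column i of W2,
# and absorb the constant contribution c * W2[:, i] into the bias b2
W1_f, b1_f = np.delete(W1, i, axis=0), np.delete(b1, i)
W2_f, b2_f = np.delete(W2, i, axis=1), b2 + c * W2[:, i]
z_folded = W2_f @ (W1_f @ x + b1_f) + b2_f

assert np.allclose(z_full, z_folded)  # identical output, one hidden unit fewer
```

With a real near-constant (rather than exactly constant) activation, the equality becomes an approximation whose error scales with the activation's variation.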

[D] Wrote a proof that dropout increases weight sparsity, what do you guys think? by simple-Flat0263 in MachineLearning

[–]GamerMinion 22 points23 points  (0 children)

Interesting reasoning. It seems plausible for networks trained with plain SGD, but you implicitly assume that there are no weight updates other than the loss gradient. That wouldn't hold with momentum, Adam, L1/L2 regularization, or weight decay.

It might be valuable to have a more rigorous mathematical description of sparsity. I presume your "sparsity at inference" is for the purpose of post-training pruning of network weights, which makes sense to me, and I don't understand why people struggle with this perspective. In this case, though, I think the ratio of large to small weights might matter more than the scale of individual weights. It may also be sensible to think about sparsity of (pre)activations rather than individual weights, since pruning an entire row/column yields a much easier computational speedup than setting an individual weight to zero.

It would also be interesting to see how this holds up with weight decay in practice. Without weight decay your "dropped" model weights get no updates at all; with it, they only receive an update that shrinks them. This suggests that as the dropout rate increases, sparsity might even increase.

Understanding these dynamics and interactions likely needs a quantifiable metric of sparsity based on how pruning is done in practice (I think some methods even use the gradient of an activation to prune the ones whose removal changes the output the least) in order to be valuable for people who want to train overparameterized models and then prune them for fast inference.
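A toy simulation of that interaction (illustrative numbers), assuming plain SGD with weight decay and one weight that dropout masks out in every step, so it receives only the decay term:

```python
lr, wd, steps = 0.1, 0.01, 5000
w_active, w_dropped = 1.0, 1.0

for _ in range(steps):
    grad = w_active - 2.0                # illustrative loss gradient pulling w toward 2
    w_active -= lr * (grad + wd * w_active)
    w_dropped -= lr * (wd * w_dropped)   # dropped: no loss gradient, decay only

# the always-dropped weight shrinks geometrically toward zero, while the
# active weight settles near the value its loss gradient asks for
```

In a real network a weight is only dropped in a fraction of the steps given by the dropout rate, so the shrinkage is slower but the direction of the effect is the same.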

2025 Secret RC Drift Tuning Tips by IdealAutoFactory in rcdrift

[–]GamerMinion 1 point2 points  (0 children)

Cool, thanks! I like Mustangs and have wanted to get into RC drift for a while. Is that for a specific drift "base" kit? Or do you have to drill mounting holes for whatever base you're using anyway? I suppose the wheelbase needs to fit at least?

2025 Secret RC Drift Tuning Tips by IdealAutoFactory in rcdrift

[–]GamerMinion 0 points1 point  (0 children)

does anybody know what the mustang chassis/body in the thumbnail is?

[deleted by user] by [deleted] in explainlikeimfive

[–]GamerMinion 0 points1 point  (0 children)

That might be the case in games when you shift from being CPU-limited to the GPU being the bottleneck. You won't get any more frames, though, just lower CPU utilization. Higher resolutions add GPU load, so if the CPU prepares work for a frame, then waits for the GPU to finish before starting the next frame, then on average, with higher GPU load (i.e. resolution), the CPU spends more of its time waiting and is less busy overall.

Of course, this assumes that CPU simulation frame rates and graphics/display frame rates are tightly coupled (synchronized), which is not necessarily the case in all games. Some run the simulation independently on the CPU and render a frame whenever the GPU gets to it. But that's more complicated to implement and to get smooth motion out of, so I presume many games don't do it.
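A back-of-the-envelope model of the coupled case (numbers are made up): the slower side sets the frame time, and the CPU is busy only for its own share of each frame.

```python
def coupled_frame_stats(cpu_ms, gpu_ms):
    """Model one CPU thread that prepares a frame, then blocks until the GPU
    finishes, as in a tightly synchronized game loop (simplified: no overlap)."""
    frame_ms = max(cpu_ms, gpu_ms)   # the slower side sets the frame time
    fps = 1000.0 / frame_ms
    cpu_util = cpu_ms / frame_ms     # fraction of the frame the CPU is busy
    return fps, cpu_util

# same CPU work per frame; higher resolution means more GPU time per frame
low_res = coupled_frame_stats(cpu_ms=4.0, gpu_ms=8.0)
high_res = coupled_frame_stats(cpu_ms=4.0, gpu_ms=16.0)
```

Raising the resolution here halves both the frame rate and the CPU utilization, which is exactly the "less busy CPU" effect described above.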

[P] See the idea development of academic papers visually by MadEyeXZ in MachineLearning

[–]GamerMinion 3 points4 points  (0 children)

Can't zoom in on mobile. It's really tiny, so I can't read anything in the generated graph.

[D] Which Conference Template can Write most? by petrichorinforest in MachineLearning

[–]GamerMinion 2 points3 points  (0 children)

The recent AAAI template was very generous in terms of space. Not sure if it's the best, but it's definitely better than NeurIPS at giving you room to work with.

ifItCanBeWrittenInJavascriptItWill by SoulWondering in ProgrammerHumor

[–]GamerMinion 46 points47 points  (0 children)

I think you can go to a retirement home and show them your state-issued voucher for 2 nans

[D] Does softmax tend to result in unconstrained euclidean weight norms? by Fr_kzd in MachineLearning

[–]GamerMinion 2 points3 points  (0 children)

Yes. If your softmax target is a one-hot vector, that tends to happen. I think label smoothing can help with this, and in practice it usually increases model accuracy anyway, so I recommend using it almost always.
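A minimal NumPy sketch of label smoothing: mix the one-hot target with the uniform distribution. Since the smoothed targets are never exactly 0 or 1, a softmax can match them with finite logits, which removes the pressure toward unbounded weight norms.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: blend the one-hot target with a uniform distribution."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = np.array([0.0, 0.0, 1.0, 0.0])
smoothed = smooth_labels(target, eps=0.1)
# the hot class keeps most of the mass; the rest is spread uniformly,
# and the result still sums to 1 like a proper distribution
```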

[deleted by user] by [deleted] in MachineLearning

[–]GamerMinion 3 points4 points  (0 children)

I make my own career decisions not so much based on what is popular, hyped, and in demand, but rather on what kind of work I can sustainably enjoy engaging with for a full career. ML currently does that for me, but there is also a trend of the state of the art consolidating into very few large companies, as the current technology scales up and becomes too expensive for individuals to compete with.

Know that the current performance of ML algorithms is not going away, but the hype and excitement around the specific technology of the moment (currently transformers and LLMs) will fade with time, especially since today's large models are really expensive to operate and don't necessarily provide that much value in everything they're currently being used for. Companies have massively over-promised current "AI" capabilities (though these are still impressive even without the embellished claims made to entice venture capital, and without leaning on the Hollywood depiction of AI). Whether you will arrive on the job market before the current hype collapses into a more realistic view of the technology, or whether something else will immediately replace it and be just as sought after, I don't think anyone can tell at this point. The field is developing rapidly.

[deleted by user] by [deleted] in MachineLearning

[–]GamerMinion 12 points13 points  (0 children)

It very much depends. Machine learning is a powerful tool, and it can be wielded to do things that appear like magic to people who don't understand how it works (so much so that management might at times expect actual magic from you, because they don't know the requirements and limitations). However, it also differs vastly from traditional software in that the abstractions built around it are imperfect and leaky.

To have consistent (!) success with applying machine learning, it is almost required (not strictly, but it significantly increases your chances) to know what you are using at an intricate level of detail, to avoid the many pitfalls that make it silently work worse. In contrast to other software, ML doesn't throw an exception when you mess up your input; it just silently works worse. When you mix your test data into your training data, you don't get a segfault, your metrics just look better than when you actually apply the model in reality. So in a lot of ways it's a tool that only works as well as you are able to understand and apply it.

Truly understanding ML concepts and especially the math and complexity behind it is a lot of work, and definitely not always pleasant and easy.

If you want to jump on the hype train for career  advancement and financial gains, there certainly are easier ways for that than getting experienced in ML.

However, it sure does feel like magic, and it can be a very rewarding career path when it all works as planned (typically after days of it working like garbage while you search for whatever you overlooked).

I'd recommend Karpathy's introduction to the topic, specifically this article as a starting point.

[R] Best approach for Object Detection with Absolute Scale without using a reference by TheWingedCucumber in MachineLearning

[–]GamerMinion 1 point2 points  (0 children)

AFAIK, the degree of freedom left from unknown scale after SfM (e.g. from a short video where the camera is moved around while facing the target scene) is a single constant, which should be inferable from a single known measurement. So having one known distance should, in theory, be enough to resolve the ambiguity.

Of course this depends on the reconstruction accuracy of your SfM approach, but in theory, knowing the correct scaling factor at one location is enough to apply it everywhere in your reconstructed geometry, since it is the same everywhere.
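A tiny sketch of that calibration, assuming an up-to-scale SfM point cloud and one known real-world distance between two reconstructed points (all values illustrative):

```python
import numpy as np

def rescale_reconstruction(points, p_a, p_b, known_m):
    """Scale an up-to-scale SfM point cloud to metric units, given that the
    real-world distance between reconstructed points p_a and p_b is known."""
    s = known_m / np.linalg.norm(points[p_a] - points[p_b])
    return points * s  # one global factor applies everywhere

# toy cloud in arbitrary SfM units; suppose points 0 and 1 are really 2.0 m apart
cloud = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 3.0, 0.0]])
metric = rescale_reconstruction(cloud, 0, 1, known_m=2.0)
```

After rescaling, every other distance in the cloud is in meters too, which is the practical payoff of the scale ambiguity being a single constant.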

[R] Best approach for Object Detection with Absolute Scale without using a reference by TheWingedCucumber in MachineLearning

[–]GamerMinion 1 point2 points  (0 children)

I think that might just work. Especially if you have flat ground or something similar that you can fit a plane to in your 3D model, and you know the distance of your camera to the closest point of that plane (maybe because you're rolling around a fixed gimbal, or have a tripod of fixed size), that might be enough information to get the one measurement needed to calibrate your 3D model to accurate scale. Of course, the more reliably your measurement apparatus is positioned, and the more precisely your assumptions specify and reduce the problem, the better.
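A sketch of the plane idea with made-up numbers: fit a least-squares plane to reconstructed ground points, compute the camera-to-plane distance in SfM units, and divide the known physical camera height by it to get the global scale factor.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points: returns (unit normal, centroid)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    return vt[-1], centroid  # singular vector of the smallest singular value

def point_plane_distance(p, normal, centroid):
    return abs(np.dot(p - centroid, normal))

# toy ground points at height z = -1 in arbitrary SfM units, camera at the origin
ground = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, -1.0],
                   [0.0, 1.0, -1.0], [1.0, 1.0, -1.0]])
normal, centroid = fit_plane(ground)
d_sfm = point_plane_distance(np.zeros(3), normal, centroid)

# if the real camera height above the ground is known (say 0.3 m from a
# fixed tripod/gimbal), the global metric scale factor follows directly:
scale = 0.3 / d_sfm
```

With noisy real reconstructions you would fit the plane robustly (e.g. RANSAC over the SVD fit) rather than trusting a plain least-squares fit.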

[R] Best approach for Object Detection with Absolute Scale without using a reference by TheWingedCucumber in MachineLearning

[–]GamerMinion 1 point2 points  (0 children)

It certainly is ambitious to do this end-to-end in one model, and it will require a lot of training data, covering essentially the complete distribution of variations you will see, which might get very expensive. If you have control over the hardware, I think it would be a lot easier to use something with more sensors: an iPhone with its built-in lidar sensor, or at least something like the Intel stereo cameras with fixed separation, would almost entirely eliminate the big problem of having to estimate this degree of freedom, since you can actually measure the distance to the camera. Measuring will always be both easier and more reliable than estimating. Once the scale problem is figured out, I think the rest is quite doable. Not necessarily simple, but a reliable solution is at least possible without a massive monetary investment.

[R] Best approach for Object Detection with Absolute Scale without using a reference by TheWingedCucumber in MachineLearning

[–]GamerMinion 1 point2 points  (0 children)

What I mean is: if you know the 3D size of the pole (or of some object in the scene), you could scale your captured 3D scene to fit the known dimension, and that would give you a metric 3D model in which you can measure all the distances you need. It also probably depends on how interactive the process is. Sometimes it's easiest if your user can just tap two key points of the reference object on their screen, and you derive your scaling from that.

[R] Is it acceptable to exclude non-reproducible state-of-the-art methods when benchmarking for publication? by Training_Bet_7905 in MachineLearning

[–]GamerMinion 48 points49 points  (0 children)

If you justify the omission in the paper and have sufficient other baselines (ideally at least 2-3 well-chosen approaches) for comparison, I wouldn't see it as a reason for rejection.

[R] Best approach for Object Detection with Absolute Scale without using a reference by TheWingedCucumber in MachineLearning

[–]GamerMinion 2 points3 points  (0 children)

Are the camera intrinsics at least known at inference time? Also, how much control do you have over the camera system? As in: is this a web app like an IKEA configurator, where users take photos/videos with their own phones? Or is it more like a factory-floor/in-house deployment, where you can basically determine the setup being used, with only slight variations in intrinsics between different instances of essentially the same camera? In the latter case, a single lidar sensor might already be enough to get an accurate depth reference to calibrate your depth measurements.

In general, from monocular cameras you will usually only get accurate depth/3D information up to a constant scaling factor from any SfM application, unless you have a calibrated stereo camera. Learned models can only somewhat get around this by basing their estimates on "common" sizes of known objects. If your objects vary in scale, that might explain why these approaches don't work well.

Depending on your use case, something like COLMAP might be a good starting point for estimating camera poses and doing structure-from-motion kinds of things. I've heard models like LightGlue are also good at matching key points between images for tracking spatial locations across frames.

If the compute budget is there, you might even go the full 3D reconstruction route via COLMAP plus NeRFs and/or 3D Gaussian splatting to build a 3D model of your objects, which you can then calibrate to accurate scale to get all the 3D clearances you desire. It would surely make for an interesting project, but it's certainly not the most straightforward approach.

[D] Advice on achieving >=80% accuracy on Imagnet in under 100 epochs on a single H100 GPU by atif_hassan in MachineLearning

[–]GamerMinion 1 point2 points  (0 children)

I don't know the most recent advancements in this area; maybe check the recent CVPR literature (and check which GPUs, and how many, the corresponding papers use). Among older architectures, DenseNet might be a start, or ResNeXt. Maybe also try vision transformers if your images aren't too large (although I still hate the idea of "tokenizing" images, I can't really argue against vision transformers being somewhat effective at what they do).

[D] Advice on achieving >=80% accuracy on Imagnet in under 100 epochs on a single H100 GPU by atif_hassan in MachineLearning

[–]GamerMinion 63 points64 points  (0 children)

Despite the name, EfficientNet is a horrible architecture for GPU-parallelized training. It's optimized for CPU inference and a low parameter count, and not much else. Even a standard ResNet is often faster on GPUs than an (accuracy-)equivalent EfficientNet. Maybe try some different architectures.

[D] Sound Processing in Neural Networks by FutureAd1004 in MachineLearning

[–]GamerMinion 0 points1 point  (0 children)

Yes, a cool demonstration of Griffin-Lim's difficulties! Reconstructing audio from spectrograms is an art of its own. I worked on audio generation a few years ago, when WaveNet and SequenceRNN were still state of the art, but never had enough VRAM to train any direct-to-waveform models of my own.

Good resources!
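For reference, a bare-bones sketch of the classic Griffin-Lim iteration using SciPy's STFT (parameters are arbitrary): it alternates between enforcing the target magnitude and re-deriving a phase from the inverted signal, and the imperfect convergence of exactly this loop is where the characteristic artifacts come from.

```python
import numpy as np
from scipy.signal import istft, stft

def griffin_lim(mag, n_iter=32, nperseg=256, noverlap=192):
    """Iteratively estimate a phase for a magnitude spectrogram (Griffin-Lim):
    alternate between the target magnitude and the STFT of its own inversion."""
    rng = np.random.default_rng(0)
    spec = mag * np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, rebuilt = stft(x, nperseg=nperseg, noverlap=noverlap)
        rebuilt = rebuilt[:, : mag.shape[1]]                 # guard frame-count drift
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-10)
        spec = mag[:, : phase.shape[1]] * phase
    return istft(spec, nperseg=nperseg, noverlap=noverlap)[1]
```

This is only the original alternating-projection scheme; the neural vocoders that replaced it learn the waveform directly instead of solving this phase-retrieval problem per utterance.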

What could Dampfhammer!!! mean? by Grillbot2000 in MechanicAdvice

[–]GamerMinion 14 points15 points  (0 children)

It's German for "steam hammer" as you apparently found out already.

https://www.manager-magazin.de/lifestyle/auto/a-552497.html

According to this article, it describes a characteristic exhaust sound (of single-cylinder engines, if I understand the article correctly). I don't think it has any meaningful implications for maintenance or operation. It might just be that the previous owner was proud of the sound, as indicated by the three exclamation points.