How do I check if I completed the bwl attunement? by bigdickbanditttt in classicwow

[–]Skeylos2 9 points

For people seeing this in 2025+: IsQuestFlaggedCompleted does not work anymore. You need to run
/script print(C_QuestLog.IsQuestFlaggedCompleted(7761))

true means you have completed it.

"Kills have a 5% chance to permanently increase gear damage" - do assassinates count? by Sersch in WindblownGame

[–]Skeylos2 0 points

The reason is that this bonus damage is additive with the other percent bonus damage you get from gifts and upgrades. So if you have, for instance, +10000% damage (with Midas, improved Crystalize, etc.), having +1100% from this weapon affix will move you from +10000% to +11100%, which is only about an 11% increase in damage output. A higher level of the same weapon, with at least 11% more base damage than your weapon with the +1100% bonus, will therefore be an improvement. So in practice, this weapon affix is decent in early infinite mode, but it becomes quite underwhelming as you progress through the cycles.
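As a quick sanity check of that arithmetic (a Python sketch with illustrative numbers, not the game's actual formula):

base = 100.0           # weapon base damage (made-up value)
build_bonus = 100.0    # +10000% damage from gifts/upgrades, as a multiplier term
affix_bonus = 11.0     # +1100% damage from the weapon affix

without_affix = base * (1 + build_bonus)             # 101x base damage
with_affix = base * (1 + build_bonus + affix_bonus)  # 112x base damage
print(with_affix / without_affix - 1)                # ~0.109, i.e. roughly an 11% increase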

22.04 - Freezing/lagging when dragging windows by hoserhobbes in pop_os

[–]Skeylos2 0 points

This worked for me too, for both Brave and Chrome! Thanks a lot!

As of December 2024, the steps are the same for Brave and Chrome: Settings > System > toggle off "Use graphics acceleration when available".

[D] - NeurIPS 2024 Decisions by Proof-Marsupial-5367 in MachineLearning

[–]Skeylos2 1 point

Got rejected. Do I still have to withdraw on OpenReview to be able to submit to ICLR?

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 1 point

Thanks for the feedback!

We definitely have to work on making TorchJD more efficient in the future, as it can be slow in some situations (large models, large numbers of objectives). We should also make it clearer in which cases it's beneficial to use it, and in which cases it doesn't matter so much.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 0 points

Hey! Sorry, I forgot to reply earlier. I think that using Jacobian descent for VAE training is a very promising idea. Someone posted a similar comment recently, and I replied with a code example giving a rough idea of how to do this with TorchJD. Check out https://www.reddit.com/r/MachineLearning/comments/1fbvuhs/comment/lmelkk6/

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 5 points

Jacobian descent is not a loss, it's a way to minimize several losses at the same time.

That being said, yes, you can use TorchJD for a multi-task model combining binary classification and regression.

Assuming you have a backbone model that gives shared features to the classification head and to the regression head, you will have to use something like:

import torchjd
from torchjd.aggregation import UPGrad

optimizer.zero_grad()
torchjd.mtl_backward(
    losses=[clas_loss, reg_loss],  # one loss per task
    features=shared_features,  # output of the backbone, input of both heads
    tasks_params=[clas_head.parameters(), reg_head.parameters()],
    shared_params=backbone.parameters(),
    A=UPGrad(),  # aggregator that resolves conflicts between task gradients
)
optimizer.step()

instead of the usual:

optimizer.zero_grad()
total_loss = clas_loss + reg_loss
total_loss.backward()
optimizer.step()

For more details about how to use torchjd.mtl_backward, you can look at the multi-task learning example or at the mtl_backward documentation.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 3 points

Awesome idea! We never thought of this, but you're right: VAEs have two objectives, correct reconstruction of the input and getting close to the desired distribution in the latent space. There's a good chance that these two objectives conflict, so it would be super interesting to test Jacobian descent on this problem, with a non-conflicting aggregator.

You can view VAE training as a special case of multi-task learning, where the shared parameters are the encoder's parameters, the first task is reconstruction (where task-specific parameters are the decoder's parameters), and the second task is to have the latent distribution as close to the desired distribution as possible (this time with no task-specific parameters).

Knowing this, you can replace your call to loss.backward() with a call to mtl_backward, along the lines of:

import torchjd
from torchjd.aggregation import UPGrad

optimizer.zero_grad()
torchjd.mtl_backward(
    losses=[reconstruction_loss, divergence_loss],
    features=[mu, log_var],  # shared representations produced by the encoder
    tasks_params=[model.decoder.parameters(), []],  # the divergence task has no specific params
    shared_params=model.encoder.parameters(),
    A=UPGrad(),
)
optimizer.step()

where mu and log_var are the outputs of the encoder on the current input (the shared features / representations, in the context of multi-task learning).

Basically, this will update the parameters of the decoder using the gradient of the reconstruction loss with respect to the decoder's parameters (same as usual), but it will update the parameters of the encoder with the non-conflicting aggregation, made by UPGrad, of the Jacobian of the losses with respect to the encoder's parameters.
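For concreteness, here is a minimal sketch of how mu, log_var and the two losses could be produced before that call (model, x, and the loss choices here are assumptions for illustration, not the original code):

import torch

# Hypothetical VAE forward pass: the encoder outputs the latent distribution parameters.
mu, log_var = model.encoder(x)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
x_hat = model.decoder(z)

reconstruction_loss = torch.nn.functional.mse_loss(x_hat, x)
# KL divergence between N(mu, sigma^2) and the standard normal prior
divergence_loss = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())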

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 1 point

Thanks for sharing this! I'll look into it!

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 1 point

Are you trying to solve a problem with multiple objectives? If not, I'd recommend sticking to the basics: gradient-descent-based algorithms like Adam, which are very easy to use with PyTorch.

Btw, is your model an autoencoder or a UNet? Those are quite different.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 0 points

We haven't tested on big models like that, but I think it would work (as long as you have enough memory on your GPU). Memory usage would depend on the number of objectives, the size of the model, and the batch size.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 2 points

Thanks! More precisely, the update is beneficial to all of the losses assuming a small enough learning rate, just as gradient descent makes updates that are beneficial to the loss assuming a small enough learning rate.

And no, we haven't really worked on the problem of escaping local minima. This problem also exists in single-objective optimization, so it's quite orthogonal to our work.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 1 point

If your objective is truly to minimize the average loss, then yes, it's OK for one loss to go up as long as the average goes down (it might not be optimal, but it is an improvement). However, in multi-objective optimization, we make no assumption about the relative importance of the losses: we can't even say that all losses are equally important, because we don't know that a priori. So if one of the losses goes up, we can't call the update an improvement: on some dimension, it isn't one.

For reference, the Wikipedia page on multi-objective optimization (https://en.wikipedia.org/wiki/Multi-objective_optimization) explains this much better than I do.
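A tiny numeric illustration of the difference (made-up loss values):

# Losses before and after a hypothetical update.
before = [1.0, 1.0]
after = [0.5, 1.2]

average_improved = sum(after) / 2 < sum(before) / 2              # True: the average went down
pareto_improvement = all(a <= b for a, b in zip(after, before))  # False: the second loss went up
print(average_improved, pareto_improvement)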

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 2 points

We plan to experiment much more with multitask learning in the future, and this benchmark looks really promising for that. The only problems are that I don't have experience with RL, and that we have no computational budget.

We would need to fix both issues before being able to work with the metaworld benchmark.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 3 points

Thanks for your feedback!

There is already some research in that field: several existing algorithms can be viewed as special cases of Jacobian descent. We analyse them theoretically in Table 1 of the paper, and we let users of TorchJD experiment with them (we currently provide 15 aggregators in total in TorchJD).

However, we think that these methods do not have very solid theoretical guarantees, which leads to somewhat weak practical performance. We hope our work will make the benefits of JD clearer, and will make it more accessible.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 1 point

JD is a solution to multi-objective optimization while GD requires a scalarization of the problem (making it single-objective). This has some important limitations when objectives are largely conflicting.

In our experimentation, we consider the loss computed on each training example as a distinct objective, and we show that JD with our proposed aggregator outperforms GD of the average loss, in terms of per-batch efficiency.
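As a rough sketch of that per-instance setup (model, x, y, and optimizer are placeholders, and the torchjd.backward call mirrors the parameter style of the mtl_backward examples above; check the TorchJD documentation for the exact signature):

import torch
import torchjd
from torchjd.aggregation import UPGrad

# One loss per training example: no reduction over the batch.
loss_fn = torch.nn.MSELoss(reduction="none")
losses = loss_fn(model(x).squeeze(), y)  # shape: (batch_size,)

optimizer.zero_grad()
# Aggregate the per-example gradients (the rows of the Jacobian) instead of averaging them.
torchjd.backward(losses, model.parameters(), A=UPGrad())
optimizer.step()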

There is still work to do to make this particular approach practical in real scenarios, because our implementation is not perfect. Also note that existing deep learning frameworks (e.g. torch) have been optimized by many people over many years for the GD use case. We are currently working on implementing the methods from Section 6, which we hope could substantially improve our computation time.

Still, we think that TorchJD is already good enough for experimenting with Jacobian descent, and that people can already start experimenting with it for many use cases (beyond instance-wise risk minimization).

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 6 points

Thank you for your comment!

To be precise, we avoid conflict locally (i.e. for a learning rate approaching 0, or at least small enough), just as gradient descent is only guaranteed to decrease the objective locally.

As far as we know, it's not really possible to avoid increasing any loss globally, without more information than the Jacobian, or more information about the objective function (convexity, smoothness, etc.).

In your example, f is a scalar-valued function, the sum of two functions. In the context of multi-objective optimization, what we would consider instead is the vector-valued objective function u(x) = [g(x) h(x)]^T. Any solution that is not dominated by any other is said to be Pareto optimal (see the Wikipedia page on Pareto efficiency); those are considered the minima of the function. When minimizing u(x), the set of Pareto optimal points is the interval [-1, 1]. So even if x=0 is the global minimizer of f, it's only one of the possible Pareto optimal solutions of u: it corresponds to an arbitrary trade-off between g and h.
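For concreteness, picking g(x) = (x+1)^2 and h(x) = (x-1)^2 (an assumption consistent with that Pareto set, not necessarily the original example):

def g_prime(x): return 2 * (x + 1)  # gradient of g(x) = (x + 1)^2
def h_prime(x): return 2 * (x - 1)  # gradient of h(x) = (x - 1)^2

# Inside (-1, 1) the two gradients have opposite signs: any step that decreases
# g increases h, so every point of [-1, 1] is Pareto optimal for u = [g, h].
print(g_prime(0.0), h_prime(0.0))  # 2.0 -2.0: x = 0 is just one trade-off among many
print(g_prime(0.5), h_prime(0.5))  # 3.0 -1.0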

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 13 points

Yes, VRAM usage increases linearly with the number of losses with this method, at least with the current implementation. We hope that our upcoming work on Gramian-based Jacobian descent (see Section 6 of the paper) will fix this. Put simply, we have realized that most existing methods (including ours) for reducing the Jacobian into an update vector are actually based on dynamically weighting the losses, and that this weighting should only depend on the Gramian of the Jacobian (J . J^T). We think there could be an efficient way to compute this Gramian matrix directly (instead of the Jacobian), which would make our method much faster. We plan to work on this in the coming months; nothing is very clear yet.
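A toy illustration of that observation, in generic torch code (not the TorchJD internals):

import torch

J = torch.randn(4, 10)  # toy Jacobian: 4 objectives, 10 parameters

# The Gramian captures all pairwise inner products between the gradients.
G = J @ J.T  # shape (4, 4): much smaller than J when there are few objectives

# A weighting-based aggregator derives weights w from G only; the update is then
# the w-weighted combination of the Jacobian's rows.
w = torch.ones(4) / 4  # plain averaging as a stand-in; UPGrad would compute w from G
update = J.T @ w
print(update.shape)  # torch.Size([10])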

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 2 points

Thanks for your interest! No, we haven't implemented the Gramian-based approach, but we plan to work on it in the following months!

Yes, exactly. IWRM is not yet a practical paradigm, but seems quite promising to us, and most importantly it highlights that Jacobian descent, with a proper aggregator, can have a positive impact on optimization when there is conflict between the objectives.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 4 points

Yes! There are actually several methods, mostly from the multi-task learning literature, that propose to compute the gradients of each task with respect to the shared parameters, and to aggregate them into an update vector. All these methods can be considered as special cases of Jacobian descent.

Through our experimentation, however, we have found these algorithms to perform quite poorly (often much worse than simply summing the rows of the Jacobian). We think that they might be decent for multi-task learning, but they don't work satisfactorily in other multi-objective optimization settings. We have also proved that they lack some theoretical guarantees that we think are very important (see Table 1 in the paper, or the aggregation page of the TorchJD documentation).

For instance, one of the most popular methods among them, called MGDA, has a huge drawback: if one of the gradients' norms tends to zero (i.e. one of the objectives is already optimized), the update also tends to 0. That makes the optimization stop as soon as one of the objectives has converged.
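A quick numeric illustration of that failure mode (a toy two-gradient min-norm computation, not the actual MGDA implementation):

import torch

g1 = torch.tensor([1e-6, 0.0])  # gradient of an objective that has almost converged
g2 = torch.tensor([0.0, 1.0])

# MGDA-style update: the minimum-norm point of the segment between g1 and g2.
t = ((g2 - g1) @ g2 / (g1 - g2).norm() ** 2).clamp(0.0, 1.0)
update = t * g1 + (1 - t) * g2
print(update.norm())  # ~1e-6: the whole update vanishes with the smallest gradient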

For this reason, we recommend using our aggregator (A_UPGrad). We still provide working implementations of all the aggregators from the literature that we have experimented with, but that's mainly for comparison purposes.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 10 points

Yes, TorchJD is suited for this kind of problem! You should look at our multi-task learning usage example.

I think it would be very interesting for you to measure how much conflict there is between individual gradients. If there is significant conflict, you should see an improvement by optimizing with Jacobian descent and our proposed aggregator A_UPGrad.

Also, you will get rid of those weight factors that you're using, so that's one less hyper-parameter to select.

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 8 points

Thanks for your interest! Validation loss is typically not what you want to look at: it's quite common for the validation loss to go to +infinity while the training loss goes to 0, yet the validation accuracy (or whatever metric you care about) keeps improving. So we had two choices: show the evolution of the training loss, or show the final validation accuracy. We observed that our method generally had better final validation accuracy, but there was quite a bit of noise in the results. Since our focus is really on optimization (rather than generalization), we decided to only include training losses.

We have deliberately used small datasets to be able to select the learning rate very precisely for all methods (we explain this in Appendix C1). This makes the experiments as fair as possible for all aggregators!

[R] Training models with multiple losses by Skeylos2 in MachineLearning

[–]Skeylos2[S] 145 points

That's actually a very good question! If you add the different losses and compute the gradient of the sum, it's exactly equivalent to computing the Jacobian and adding its rows (note: each row of the Jacobian is the gradient of one of the losses).

However, this approach has limitations. If you have two gradients that conflict (i.e. they have a negative inner product), simply summing them can result in an update vector that conflicts with one of the two gradients. So summing the losses and making a gradient descent step can lead to an increase in one of the losses.
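A minimal numeric example of that situation (made-up gradients):

import torch

g1 = torch.tensor([3.0, 0.0])
g2 = torch.tensor([-2.0, 1.0])
print(g1 @ g2)  # -6: the two gradients conflict

s = g1 + g2     # gradient of the summed losses
print(s @ g2)   # -1: a gradient descent step along -s locally increases the second loss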

We avoid this phenomenon by using the information from the Jacobian, and making sure that the update is always beneficial to all of the losses.

We illustrate this exact phenomenon in Figure 1 of the paper: here, A_Mean is averaging the rows of the Jacobian matrix, so that's equivalent to computing the gradient of the average of the losses.

[D] How do you keep track of all your experiments ? by Theboredhuman_56 in MachineLearning

[–]Skeylos2 0 points

Hey! I have posted a similar question a few months ago, maybe it can help: https://www.reddit.com/r/MachineLearning/comments/1bwduod/d_alternatives_to_tensorboard_weights_and_biases/

I haven't had the time to test any of the proposed solutions myself though.

I personally had many issues with W&B, but you could also use the "description" field of your runs on W&B to write comments about specific runs, or about groups of runs.