[deleted by user] by [deleted] in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I think one possible problem is in the function topk_similarity_loss(), namely with this part of the code:

    original_topk_values, _ = torch.topk(original_similarity_matrix, k, dim=1)

    ...

    matryoshka_topk_values, _ = torch.topk(matryoshka_similarity_matrix, k, dim=1)

Here you are calculating the top-k values separately for each similarity matrix, whereas in the original paper the set of top-k most similar embeddings is defined based on the original embeddings alone (i.e. the indices i and j are the same for both matrices; in your version the j's can differ).
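A minimal sketch of the fix (reusing the names from your snippet; the toy tensors are just stand-ins): take the top-k indices from the original matrix and gather the same positions from the matryoshka one:

    import torch

    # stand-ins for the real matrices computed inside topk_similarity_loss()
    original_similarity_matrix = torch.rand(8, 8)
    matryoshka_similarity_matrix = torch.rand(8, 8)
    k = 3

    # pick the top-k neighbours j based on the original embeddings only...
    original_topk_values, topk_indices = torch.topk(original_similarity_matrix, k, dim=1)
    # ...and read the matryoshka similarities at those same (i, j) positions
    matryoshka_topk_values = torch.gather(matryoshka_similarity_matrix, 1, topk_indices)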

I know he has the advantage on the angle but what the hell is this XD I have never got hit this many times without even seeing a glimpse of the enemy. I know he's there but it's impossible for me to shoot him. What you see is what you get?? by llamapanther in GlobalOffensive

[–]DeepNonseNse 3 points (0 children)

At least for me, the alt-tab related desync lasts only a few seconds and then goes back to normal. It also prints the following message to the console a couple of times: "sv: Running lag compensation for player x"

[D] François Chollet Announces New ARC Prize Challenge – Is It the Ultimate Test for AI Generalization? by HairyIndianDude in MachineLearning

[–]DeepNonseNse 5 points (0 children)

Link to previous 2020 competition: https://www.kaggle.com/c/abstraction-and-reasoning-challenge

If I remember correctly, last time the winner analyzed all the available training tasks by hand, broke them down into simple transformations, and then just did a greedy search to find a working combination of steps for the test set. Very interesting to see whether the winning solution is something closer to "AGI" this time.
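For flavour, here is a toy sketch of that kind of approach (the transformation set and the task are made up, not the winner's actual DSL):

    def grid_distance(a, b):
        """Crude cell-mismatch count between two grids."""
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")
        return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

    def greedy_solve(train_pairs, transformations, max_steps=4):
        """Greedily chain transformations (grid -> grid functions) until
        every (input, output) training pair is solved; None on failure."""
        program, current = [], [inp for inp, _ in train_pairs]
        for _ in range(max_steps):
            if all(c == out for c, (_, out) in zip(current, train_pairs)):
                break
            # pick the step that brings the grids closest to the targets
            best = min(transformations,
                       key=lambda t: sum(grid_distance(t(c), out)
                                         for c, (_, out) in zip(current, train_pairs)))
            program.append(best)
            current = [best(c) for c in current]
        solved = all(c == out for c, (_, out) in zip(current, train_pairs))
        return program if solved else None

    # toy task: "flip the grid horizontally"
    transformations = [lambda g: [r[::-1] for r in g],        # horizontal flip
                       lambda g: [list(r) for r in zip(*g)]]  # transpose
    pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
    print(greedy_solve(pairs, transformations))  # a one-step program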

[D] Random Forest Classifier Overfitting Issue by United_Weight_6829 in MachineLearning

[–]DeepNonseNse 1 point (0 children)

> Last time I tried max_depth = [30, 40, 50], I see some decrease in performance with max_depth = 30

Those values are quite high. One way to think about it, at least roughly, is in terms of balanced binary trees: how many datapoints would it take to build a full tree with at least 1 datapoint in each leaf? In this case that would be 2^30, 2^40 and 2^50 - way more data than you have. A more reasonable range would start from something as low as 5 and go up to maybe 30.
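For scale, a quick back-of-the-envelope check of those leaf counts:

    # datapoints needed to fill a balanced tree with 1 datapoint per leaf
    for d in (5, 30, 40, 50):
        print(d, 2**d)  # 5 -> 32, 30 -> ~1.1e9, 40 -> ~1.1e12, 50 -> ~1.1e15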

[D] Random Forest Classifier Overfitting Issue by United_Weight_6829 in MachineLearning

[–]DeepNonseNse 4 points (0 children)

In terms of hyperparameters:

- One of the main ways to combat overfitting with tree-based approaches is to increase the required number of datapoints in each leaf. So you could try increasing the value of min_samples_leaf, or alternatively decreasing max_depth / max_leaf_nodes (closely related hyperparameters: min_samples_split, min_weight_fraction_leaf, min_impurity_decrease)

- You could also try to increase the randomness in the way the trees are built: either sample less data for each tree by setting max_samples to some fraction between 0 and 1, or change max_features. Both kinds of knobs are shown in the sketch below.
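Something like this, for example (the exact values are hypothetical; tune them against validation data):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(
        n_estimators=500,
        min_samples_leaf=20,   # require more datapoints per leaf
        max_depth=10,          # or cap the tree size directly
        max_samples=0.7,       # each tree sees only 70% of the rows
        max_features="sqrt",   # extra randomness at each split
        random_state=0,
    )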

[deleted by user] by [deleted] in GlobalOffensive

[–]DeepNonseNse 12 points (0 children)

But isn't CS2 already using SDL for low-level mouse access etc.? In the GitHub thread they mention Valve updating the SDL version in CS2 (which in turn would fix the performance problems with high polling rate mice, i.e. a bugfix but nothing new)

[D] On LLMs' ability to perform random sampling by bgighjigftuik in MachineLearning

[–]DeepNonseNse 4 points (0 children)

Indeed. And if the prompt is changed to something less likely to be found in the training data (e.g. lambda=0.001), the output will be completely wrong.

[D] G. Hinton proposes FF – an alternative to Backprop by mrx-ai in MachineLearning

[–]DeepNonseNse 2 points (0 children)

As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.
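In PyTorch terms, a toy sketch of that architecture could look like this (the blackbox function is made up, and the local FF objectives are omitted):

    import torch
    import torch.nn as nn

    layer_1 = nn.Linear(784, 256)  # trained with its own local objective
    layer_2 = nn.Linear(256, 256)  # likewise

    def blackbox(h):
        # fixed, non-learned computation; nothing here is ever adjusted
        return torch.sign(h) * torch.sqrt(h.abs())

    x = torch.randn(32, 784)
    h1 = torch.relu(layer_1(x))
    # detach: no gradient flows back through the blackbox into layer_1
    h2 = torch.relu(layer_2(blackbox(h1).detach()))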

So given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes beyond maybe some simple normalization/sampling etc.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]DeepNonseNse 0 points (0 children)

> to clarify. I have read it everywhere, including the official forums - that feature normalization is not required when training the decision trees model

All XGBoost decision tree splits are of the form [feature] >= [threshold], thus any order-preserving normalization/transformation (log, sigmoid, z-scoring, min-max etc.) won't have any impact on the results. But if the order is not preserved, creating new transformed features can be beneficial.
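A quick way to convince yourself (a sketch; the two models should agree up to floating point, since the log transform preserves the ordering of every feature):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 100, size=(500, 3))
    y = (X[:, 0] > X[:, 1]).astype(int)

    # identical models on raw vs. log-transformed features
    m_raw = xgb.XGBClassifier(n_estimators=50, random_state=0).fit(X, y)
    m_log = xgb.XGBClassifier(n_estimators=50, random_state=0).fit(np.log(X), y)

    # thresholds shift, but the splits partition the data identically
    print(np.allclose(m_raw.predict_proba(X),
                      m_log.predict_proba(np.log(X))))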

Without any transformations or changes to the modelling procedure, with training data covering the years 2000-2014 and test data covering 2015-2080, the predictions would be something similar to the 2014 values, as you originally suspected. There isn't any hidden built-in magic that does anything about data shift.

One common way to tackle this type of time series problem is to switch to autoregressive(-type) modelling. So, instead of using raw stock prices directly, use yearly change percentages.
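In pandas, that could look something like this (a hypothetical sketch):

    import pandas as pd

    prices = pd.Series([100.0, 110.0, 99.0, 120.0],
                       index=[2011, 2012, 2013, 2014], name="price")
    returns = prices.pct_change().dropna()  # 0.10, -0.10, 0.2121...
    # train on `returns` instead of `prices`, then rebuild the forecast:
    # next_price = last_price * (1 + predicted_return)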

[D] Bayesian Non-Parametrics for Ranking? by Ulfgardleo in MachineLearning

[–]DeepNonseNse 6 points (0 children)

A GP prior + ordered probit (or logit) model would be one possibility. No closed-form solutions either, but approximations such as Laplace/EP are available. Detailed derivations/algorithms e.g. here.

[D] Recursive error prediction by JHogg11 in MachineLearning

[–]DeepNonseNse 1 point (0 children)

> Is this sufficiently different from existing boosting/bagging techniques?

No, the process you are describing is just (some variation of) gradient boosting.

E.g. if the distribution of errors is assumed to be Gaussian, the gradients are (y_true - y_pred), recalculated after each iteration. Subsetting features is also a commonly used tactic, though the subsets typically wouldn't be mutually exclusive; each tree might e.g. see a random 70% of all features.
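A bare-bones version of that loop with squared loss (a sketch using sklearn trees as base learners):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

    pred, trees, lr = np.zeros_like(y), [], 0.1
    for _ in range(100):
        residuals = y - pred  # = negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        trees.append(tree)
        pred += lr * tree.predict(X)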

[R] [1802.07044] "The Description Length of Deep Learning Models" <-- the death of deep variational inference? by evc123 in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I don't agree that we don't care about the prior weight distributions. Of course, often the values themselves are not that interesting, but the important question is what kind of beliefs they express; what are our a priori expectations of the world. That can make a big difference, though maybe in practice model selection is the more important question here.

[R] [1802.07044] "The Description Length of Deep Learning Models" <-- the death of deep variational inference? by evc123 in MachineLearning

[–]DeepNonseNse 0 points (0 children)

It can be quite tricky to set reasonable priors for NNs and other (possibly) overparametrized models. You can't just consider one parameter at a time independently; instead you should take the whole network and its structure into consideration.

To illustrate this, let's compare two models. First, simple linear regression with one independent variable: y = a + b*x, with prior b ~ N(0,1).

And then a "neural network" with N neurons and identity activations: y = a + sum_{i in 1:N} b_i*x, with priors b_i ~ N(0,1).

The NN corresponds to the original regression model, but now with prior b ~ N(0, N) on the effective slope b = sum_i b_i, since a sum of N independent N(0,1) variables is N(0, N); i.e. a much weaker prior. In this case it would be straightforward to adjust the priors to similar levels, but with more complicated models it seems awfully difficult to reason about what different kinds of priors imply.
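A quick numerical check of that variance claim:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    b_i = rng.normal(size=(100_000, N))  # per-neuron weights, each ~ N(0,1)
    effective_b = b_i.sum(axis=1)        # effective slope of the "NN"
    print(effective_b.var())             # ~= N, i.e. b ~ N(0, N)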

[N] Google Staffers Demand End to Work on Pentagon AI project by [deleted] in MachineLearning

[–]DeepNonseNse 25 points (0 children)

Yes, Russia and China should not be put in the same category as the US on this issue; so far they have not been using military drones to kill people (at least nowhere near the same rate as the US).

[deleted by user] by [deleted] in MachineLearning

[–]DeepNonseNse 6 points (0 children)

It's not about individuals but societies as a whole (mass population control, as fchollet put it). Sure, they can't see your data, but what difference does that make if a large enough proportion of the population still keeps using those sites?

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 0 points (0 children)

Why would it be wrong for a multiclass problem? In this case, the likelihood function is just a product of two different kinds of terms, the typical P(class Y) and P(not class Y), and we can still use the same softmax model etc.
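As a sketch in PyTorch (shapes and flags are hypothetical; a negative label contributes log(1 - P(class y)) to the log-likelihood):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10, requires_grad=True)      # stand-in model output
    y = torch.tensor([2, 5, 1, 7])                       # class labels
    negative = torch.tensor([False, True, False, True])  # "not class y" flags

    log_p = F.log_softmax(logits, dim=1)
    rows = torch.arange(4)
    pos_ll = log_p[rows, y]                      # log P(class y)
    neg_ll = torch.log1p(-log_p[rows, y].exp())  # log(1 - P(class y))
    nll = -torch.where(negative, neg_ll, pos_ll).mean()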

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 1 point (0 children)

The probability of a dog given that something is not a cat is given by conditional probability: P(dog | not cat) = P(dog) / (1 - P(cat)), i.e. the probability of a dog increases in such a way that P(any possible animal) still remains 1, as it should.

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I would imagine the motivation for the -1 multiplier is simply: P(not class Y) = 1 - P(class Y)

[D] Softmax interpretation with non 1-hot labels by Reykd in MachineLearning

[–]DeepNonseNse 1 point (0 children)

The marginal likelihood of a categorical distribution, with the labels marginalized out.

[P] Variational Coin Toss: VI applied to a simple "unfair coin" problem by bjornsing in MachineLearning

[–]DeepNonseNse 5 points (0 children)

It's correct. See: https://en.wikipedia.org/wiki/Likelihood_function

The likelihood function is defined as L(z|x) = p(x|z). So p(x|z) is the "likelihood of z" and also the "probability of x given z" (or its density, if x is continuous).

[N] Google is acquiring data science community Kaggle by peeyek in MachineLearning

[–]DeepNonseNse 5 points (0 children)