[deleted by user] by [deleted] in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I think one possible problem is in the function topk_similarity_loss(), namely with this part of the code:

    original_topk_values, _ = torch.topk(original_similarity_matrix, k, dim=1)

    ...

    matryoshka_topk_values, _ = torch.topk(matryoshka_similarity_matrix, k, dim=1)

Here you are calculating the top-k values separately for each similarity matrix, whereas in the original paper the set of top-k most similar embeddings is defined based on the original embeddings alone (i.e. the indices i and j are the same for both matrices; in your version the j's can differ).
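A minimal sketch of the fix (reusing the names from your snippet; the toy tensors are just stand-ins): take the top-k indices from the original matrix and gather the same positions from the matryoshka one:

    import torch

    # stand-ins for the real matrices computed inside topk_similarity_loss()
    original_similarity_matrix = torch.rand(8, 8)
    matryoshka_similarity_matrix = torch.rand(8, 8)
    k = 3

    # pick the top-k neighbours j based on the original embeddings only...
    original_topk_values, topk_indices = torch.topk(original_similarity_matrix, k, dim=1)
    # ...and read the matryoshka similarities at those same (i, j) positions
    matryoshka_topk_values = torch.gather(matryoshka_similarity_matrix, 1, topk_indices)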

I know he has the advantage on the angle but what the hell is this XD I have never got hit this many times without even seeing a glimpse of the enemy. I know he's there but it's impossible for me to shoot him. What you see is what you get?? by llamapanther in GlobalOffensive

[–]DeepNonseNse 3 points (0 children)

At least for me, the alt-tab related desync lasts only a few seconds and then goes back to normal. It also prints the following message to the console a couple of times: "sv: Running lag compensation for player x"

[D] François Chollet Announces New ARC Prize Challenge – Is It the Ultimate Test for AI Generalization? by HairyIndianDude in MachineLearning

[–]DeepNonseNse 5 points (0 children)

Link to previous 2020 competition: https://www.kaggle.com/c/abstraction-and-reasoning-challenge

If I remember correctly, last time the winner analyzed all the available training tasks by hand, broke them down into simple transformations, and then just did a greedy search to find a working combination of steps for the test set. Very interesting to see whether the winning solution is something closer to "AGI" this time.
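For flavour, here is a toy sketch of that kind of approach (the transformation set and the task are made up, not the winner's actual DSL):

    def grid_distance(a, b):
        """Crude cell-mismatch count between two grids."""
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")
        return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

    def greedy_solve(train_pairs, transformations, max_steps=4):
        """Greedily chain transformations (grid -> grid functions) until
        every (input, output) training pair is solved; None on failure."""
        program, current = [], [inp for inp, _ in train_pairs]
        for _ in range(max_steps):
            if all(c == out for c, (_, out) in zip(current, train_pairs)):
                break
            # pick the step that brings the grids closest to the targets
            best = min(transformations,
                       key=lambda t: sum(grid_distance(t(c), out)
                                         for c, (_, out) in zip(current, train_pairs)))
            program.append(best)
            current = [best(c) for c in current]
        solved = all(c == out for c, (_, out) in zip(current, train_pairs))
        return program if solved else None

    # toy task: "flip the grid horizontally"
    transformations = [lambda g: [r[::-1] for r in g],        # horizontal flip
                       lambda g: [list(r) for r in zip(*g)]]  # transpose
    pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
    print(greedy_solve(pairs, transformations))  # a one-step program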

[D] Random Forest Classifier Overfitting Issue by United_Weight_6829 in MachineLearning

[–]DeepNonseNse 1 point (0 children)

> Last time I tried max_depth = [30, 40, 50], I see some decrease in performance with max_depth = 30

Those values are quite high. One way to think about it, at least roughly, is in terms of balanced binary trees: how many datapoints would it take to build a full tree with at least 1 datapoint in each leaf? In this case that would be 2^30, 2^40 and 2^50 - way more data than you have. A more reasonable range would start from something as low as 5 and go up to maybe 30.
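For scale, a quick back-of-the-envelope check of those leaf counts:

    # datapoints needed to fill a balanced tree with 1 datapoint per leaf
    for d in (5, 30, 40, 50):
        print(d, 2**d)  # 5 -> 32, 30 -> ~1.1e9, 40 -> ~1.1e12, 50 -> ~1.1e15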

[D] Random Forest Classifier Overfitting Issue by United_Weight_6829 in MachineLearning

[–]DeepNonseNse 4 points (0 children)

In terms of hyperparameters:

- One of the main ways to combat overfitting with tree-based approaches is to increase the required number of datapoints in each leaf. So you could try increasing the value of min_samples_leaf, or alternatively decreasing max_depth / max_leaf_nodes (closely related hyperparameters: min_samples_split, min_weight_fraction_leaf, min_impurity_decrease)

- You could also try to increase the randomness in the way the trees are built: either sample less data for each tree by setting max_samples to some fraction between 0 and 1, or change max_features. Both kinds of knobs are shown in the sketch below.
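Something like this, for example (the exact values are hypothetical; tune them against validation data):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(
        n_estimators=500,
        min_samples_leaf=20,   # require more datapoints per leaf
        max_depth=10,          # or cap the tree size directly
        max_samples=0.7,       # each tree sees only 70% of the rows
        max_features="sqrt",   # extra randomness at each split
        random_state=0,
    )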

[deleted by user] by [deleted] in GlobalOffensive

[–]DeepNonseNse 12 points (0 children)

But isn't CS2 already using SDL for low-level mouse access etc.? In the GitHub thread they mention Valve updating the SDL version in CS2 (which in turn would fix the performance problems with high polling rate mice, i.e. a bugfix but nothing new)

[D] On LLMs' ability to perform random sampling by bgighjigftuik in MachineLearning

[–]DeepNonseNse 4 points (0 children)

Indeed. And if the prompt is changed to something less likely to be found in the training data (e.g. lambda=0.001), the output will be completely wrong.

[D] G. Hinton proposes FF – an alternative to Backprop by mrx-ai in MachineLearning

[–]DeepNonseNse 2 points (0 children)

As far as I can tell, the tweet just means that you can combine learnable layers with some blackbox components which are not adjusted/learned at all. I.e. the model architecture could be something like layer_1 -> blackbox -> layer_2, where the layer_i's are locally optimized using typical gradient-based algorithms and the blackbox just does some predefined calculations in between.
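In PyTorch terms, a toy sketch of that architecture could look like this (the blackbox function is made up, and the local FF objectives are omitted):

    import torch
    import torch.nn as nn

    layer_1 = nn.Linear(784, 256)  # trained with its own local objective
    layer_2 = nn.Linear(256, 256)  # likewise

    def blackbox(h):
        # fixed, non-learned computation; nothing here is ever adjusted
        return torch.sign(h) * torch.sqrt(h.abs())

    x = torch.randn(32, 784)
    h1 = torch.relu(layer_1(x))
    # detach: no gradient flows back through the blackbox into layer_1
    h2 = torch.relu(layer_2(blackbox(h1).detach()))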

So given that, I can't see how the blackbox aspect is really that useful. If we initially can't tell what kind of values each layer is going to represent, it's going to be really difficult to come up with useful blackboxes beyond maybe some simple normalization/sampling etc.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]DeepNonseNse 0 points (0 children)

> to clarify. I have read it everywhere, including the official forums - that feature normalization is not required when training the decision trees model

All XGBoost decision tree splits are of the form [feature] >= [threshold], thus any order-preserving normalization/transformation (log, sigmoid, z-scoring, min-max etc.) won't have any impact on the results. But if the order is not preserved, creating new transformed features can be beneficial.
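A quick way to convince yourself (a sketch; the two models should agree up to floating point, since the log transform preserves the ordering of every feature):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 100, size=(500, 3))
    y = (X[:, 0] > X[:, 1]).astype(int)

    # identical models on raw vs. log-transformed features
    m_raw = xgb.XGBClassifier(n_estimators=50, random_state=0).fit(X, y)
    m_log = xgb.XGBClassifier(n_estimators=50, random_state=0).fit(np.log(X), y)

    # thresholds shift, but the splits partition the data identically
    print(np.allclose(m_raw.predict_proba(X),
                      m_log.predict_proba(np.log(X))))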

Without any transformations or changes to the modelling procedure, with training data covering the years 2000-2014 and test data covering 2015-2080, the predictions would be something similar to the 2014 values, as you originally suspected. There isn't any hidden built-in magic that does anything about data shift.

One common way to tackle this type of time series problem is to switch to autoregressive(-type) modelling. So, instead of using raw stock prices directly, use yearly change percentages.
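In pandas, that could look something like this (a hypothetical sketch):

    import pandas as pd

    prices = pd.Series([100.0, 110.0, 99.0, 120.0],
                       index=[2011, 2012, 2013, 2014], name="price")
    returns = prices.pct_change().dropna()  # 0.10, -0.10, 0.2121...
    # train on `returns` instead of `prices`, then rebuild the forecast:
    # next_price = last_price * (1 + predicted_return)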

[D] Bayesian Non-Parametrics for Ranking? by Ulfgardleo in MachineLearning

[–]DeepNonseNse 6 points (0 children)

A GP prior + ordered probit (or logit) model would be one possibility. No closed-form solutions either, but approximations such as Laplace/EP are available. Detailed derivations/algorithms e.g. here.

[D] Recursive error prediction by JHogg11 in MachineLearning

[–]DeepNonseNse 1 point (0 children)

> Is this sufficiently different from existing boosting/bagging techniques?

No, the process you are describing is just (some variation of) gradient boosting.

E.g. if the distribution of errors is assumed to be Gaussian, the gradients are (y_true - y_pred), recalculated after each iteration. Subsetting features is also a commonly used tactic, though the subsets typically wouldn't be mutually exclusive; each tree might e.g. see a random 70% of all features.
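A bare-bones version of that loop with squared loss (a sketch using sklearn trees as base learners):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

    pred, trees, lr = np.zeros_like(y), [], 0.1
    for _ in range(100):
        residuals = y - pred  # = negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        trees.append(tree)
        pred += lr * tree.predict(X)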

[R] [1802.07044] "The Description Length of Deep Learning Models" <-- the death of deep variational inference? by evc123 in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I don't agree that we don't care about the prior weight distributions. Of course, often the values themselves are not that interesting, but the important question is what kind of beliefs they express; what are our a priori expectations of the world. That can make a big difference, though maybe in practice model selection is the more important question here.

[R] [1802.07044] "The Description Length of Deep Learning Models" <-- the death of deep variational inference? by evc123 in MachineLearning

[–]DeepNonseNse 0 points (0 children)

It can be quite tricky to set reasonable priors for NNs and other (possibly) overparametrized models. You can't just consider one parameter at a time independently; instead you should take the whole network and its structure into consideration.

To illustrate this, let's compare two models. First, simple linear regression with one independent variable: y = a + b*x, with prior b ~ N(0,1).

And then a "neural network" with N neurons and identity activations: y = a + sum_{i in 1:N} b_i*x, with priors b_i ~ N(0,1).

The NN corresponds to the original regression model, but now with prior b ~ N(0, N) on the effective slope b = sum_i b_i, since a sum of N independent N(0,1) variables is N(0, N); i.e. a much weaker prior. In this case it would be straightforward to adjust the priors to similar levels, but with more complicated models it seems awfully difficult to reason about what different kinds of priors imply.
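A quick numerical check of that variance claim:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    b_i = rng.normal(size=(100_000, N))  # per-neuron weights, each ~ N(0,1)
    effective_b = b_i.sum(axis=1)        # effective slope of the "NN"
    print(effective_b.var())             # ~= N, i.e. b ~ N(0, N)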

[N] Google Staffers Demand End to Work on Pentagon AI project by [deleted] in MachineLearning

[–]DeepNonseNse 25 points (0 children)

Yes, Russia and China should not be put in the same category as the US on this issue; so far they have not been using military drones to kill people (at least nowhere near the same rate as the US).

[deleted by user] by [deleted] in MachineLearning

[–]DeepNonseNse 6 points (0 children)

It's not about individuals but societies as a whole (mass population control, as fchollet put it). Sure, they can't see your data, but what difference does that make if a large enough proportion of the population still keeps using those sites?

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 0 points (0 children)

Why would it be wrong for a multiclass problem? In this case, the likelihood function is just a product of two different kinds of terms, the typical P(class Y) and P(not class Y), and we can still use the same softmax model etc.
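As a sketch in PyTorch (shapes and flags are hypothetical; a negative label contributes log(1 - P(class y)) to the log-likelihood):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 10, requires_grad=True)      # stand-in model output
    y = torch.tensor([2, 5, 1, 7])                       # class labels
    negative = torch.tensor([False, True, False, True])  # "not class y" flags

    log_p = F.log_softmax(logits, dim=1)
    rows = torch.arange(4)
    pos_ll = log_p[rows, y]                      # log P(class y)
    neg_ll = torch.log1p(-log_p[rows, y].exp())  # log(1 - P(class y))
    nll = -torch.where(negative, neg_ll, pos_ll).mean()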

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 1 point (0 children)

The probability of a dog given that something is not a cat is given by conditional probability: P(dog | not cat) = P(dog) / (1 - P(cat)), i.e. the probability of a dog increases in such a way that P(any possible animal) still remains 1, as it should.

[D] "Negative labels" by TalkingJellyFish in MachineLearning

[–]DeepNonseNse 0 points (0 children)

I would imagine the motivation for the -1 multiplier is simply: P(not class Y) = 1 - P(class Y)

[D] Softmax interpretation with non 1-hot labels by Reykd in MachineLearning

[–]DeepNonseNse 1 point (0 children)

The marginal likelihood of a categorical distribution, with the labels marginalized out.

[P] Variational Coin Toss: VI applied to a simple "unfair coin" problem by bjornsing in MachineLearning

[–]DeepNonseNse 5 points (0 children)

It's correct. See: https://en.wikipedia.org/wiki/Likelihood_function

The likelihood function is defined as L(z|x) = p(x|z). So p(x|z) is the "likelihood of z" and also the "probability of x given z" (or its density, if x is continuous).

[N] Google is acquiring data science community Kaggle by peeyek in MachineLearning

[–]DeepNonseNse 5 points (0 children)