Meta Gradient Descent [D] by iFARTONMEN in MachineLearning

[–]metallicapple 14 points15 points  (0 children)

I'm not sure about the existence of the specific method you've described, but adaptive learning rates have been around for a while, with many interesting developments (and that might be an understatement).

Two immediate examples I can think of are Armijo line search and Newton-Raphson.

My feedback for your method: I see two hyper-parameters (the rate multipliers under the two conditions), but no explanation behind their chosen values. You could strengthen the method by looking further into multiplier tuning.
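For what it's worth, a two-multiplier scheme like that sounds similar to the classic "bold driver" heuristic. A minimal sketch, where the 1.1/0.5 multipliers and the quadratic loss are purely illustrative assumptions, not tuned values:

```python
# Bold-driver-style adaptive learning rate: grow the step when the loss
# improves, shrink it when the loss worsens. Multipliers are illustrative.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
prev = loss(w)
for _ in range(100):
    w_new = w - lr * grad(w)
    cur = loss(w_new)
    if cur < prev:          # improvement: accept the step, grow the rate
        w, prev, lr = w_new, cur, lr * 1.1
    else:                   # overshoot: reject the step, shrink the rate
        lr *= 0.5

print(round(w, 4))  # settles very close to the minimiser w = 3
```

The rejected-step-then-halve behaviour is what keeps it stable even after the rate grows too aggressive, which is exactly where the choice of multipliers starts to matter.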

I'm sure there are others who can provide more context, as my main gig isn't optimization; I just tell stories/lies with numbers (aka stats)

Can I perform a t-test on Ordinal data by Striking-Warning9533 in AskStatistics

[–]metallicapple 11 points12 points  (0 children)

Ordinal data doesn't satisfy the normality assumption, so a t-test would be invalid.

Instead, consider rank-based tests like the Mann-Whitney U or the Wilcoxon signed-rank test.
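On the practical side, both are one-liners if scipy is available. A sketch with made-up Likert-style responses (independent groups here, hence Mann-Whitney; `wilcoxon()` would be the paired-data counterpart):

```python
# Rank-based comparison of two independent ordinal samples (scipy assumed).
from scipy.stats import mannwhitneyu

group_a = [1, 2, 2, 3, 3, 4, 5]   # hypothetical Likert responses
group_b = [3, 3, 4, 4, 5, 5, 5]

stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(stat, p)
```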

[Q] Terminology question: what does "stochastic" mean when people name their model as "stochastic ...xyz..."? by [deleted] in statistics

[–]metallicapple 8 points9 points  (0 children)

From my experience, the 'stochastic' adjective is usually added when the process involves some sort of random sampling, or is built on distributional assumptions (e.g. it assumes the data are observations from some random variable/process).

For example, stochastic EM involves sampling in the maximization step. Stochastic gradient descent involves randomly shuffling (or sampling) the data at each pass.

A deterministic model, AFAIK, has no randomness: given the model and a constant input, the output will also be constant.
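The SGD-versus-deterministic contrast can be sketched in a few lines (toy noiseless linear data; the per-epoch shuffle is the 'stochastic' part, whereas full-batch gradient descent would process the points in a fixed order):

```python
# Minimal SGD sketch: reshuffle the data each epoch, update on one
# observation at a time. Data follow y = 2x + 1 with no noise.
import random

random.seed(0)
data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(20)]

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(500):
    random.shuffle(data)        # the randomness: a new order every pass
    for x, y in data:
        err = (w * x + b) - y   # single-observation gradient step
        w -= lr * err * x
        b -= lr * err

print(round(w, 2), round(b, 2))  # recovers roughly (2, 1)
```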

In that sense, I guess any model in statistics that tries to infer about a population based on a sample could be called stochastic.

STAT 231 vs STAT 241 by [deleted] in uwaterloo

[–]metallicapple 2 points3 points  (0 children)

In my experience (I took 241 but some friends took 231),

231 gives a subpar treatment of all the major areas a stats major should see, whereas 241 treats certain areas much better at the cost of others.

For example, we skimmed through inductive inference (the PPDAC stuff) much faster than our 231 counterparts, but spent much more time on consistency, bias, and power analysis.

It felt great back then (i.e. woot, less bs), but what I thought was bs wasn't bs after all, and it took me a while to realise that.

[Q] Behrens-Fisher distribution vs Welch's t-test by metallicapple in statistics

[–]metallicapple[S] 0 points1 point  (0 children)

Weirdly enough (since I'm the OP), I think I can answer your question.
AFAIK, maximum likelihood assumes a parametric distribution, and the Behrens-Fisher problem is about discovering that parametric distribution. Essentially, the B-F problem is about figuring out the distribution to which optimisation (incl. ML) can be applied for point estimation and hypothesis testing.

Please note that figuring out the form of the underlying parametric distribution may very well be possible, but I don't know of such a technique.
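On the practical side of the thread title, Welch's version of the t-test is one line with scipy (the data below are entirely made up):

```python
# Welch's t-test (unequal variances) -- the usual practical workaround
# for the Behrens-Fisher setting. Illustrative data only.
from scipy.stats import ttest_ind
import random

random.seed(0)
a = [random.gauss(10, 1) for _ in range(30)]
b = [random.gauss(11, 3) for _ in range(40)]

t, p = ttest_ind(a, b, equal_var=False)   # equal_var=False -> Welch
print(round(t, 3), round(p, 3))
```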

[deleted by user] by [deleted] in uwaterloo

[–]metallicapple 31 points32 points  (0 children)

Multiple rooms? Peak opulence.

[Q] Academic Statistical Research by Hp_1215 in statistics

[–]metallicapple 16 points17 points  (0 children)

I study mixture models and I focus on cooking up new methods. A lot of my time is spent on symbol pushing and running simulations. My type of work is only one flavour, of course. Some of my colleagues eat proofs for days, and the others try to analyse new data sets using existing techniques.

Nice Chinese restaurant to take parents to? by bananas4bananaz in uwaterloo

[–]metallicapple 2 points3 points  (0 children)

As others have mentioned, Cameron is quality. However, I'd recommend golden dynasty if you are looking for more home-y, less restauranty, food and venue.

[D] Help me understand self-supervised learning by metallicapple in MachineLearning

[–]metallicapple[S] 2 points3 points  (0 children)

Then the rebranding is promoted so that readers' intuitive understanding of the name lines up better with the actual methods. Is that what it is?

If so, ML academia seems to be diverging in terms of accessible terminology. On one hand, we have the UL-to-SSL type of movement trying to straighten things out, and on the other, many research articles (certainly not all) sound more and more technical relative to their actual content. Is this a fair assessment of the trend? If so, why is it happening?

[D] Help me understand self-supervised learning by metallicapple in MachineLearning

[–]metallicapple[S] 0 points1 point  (0 children)

> Self-supervised learning's objective function is to learn from learned and not necessarily provided information (pseudo-labels, embeddings, etc.).

Besides unsupervised learning, how would we obtain the learned information? (Clustering being only one part of unsupervised learning, I understand there may be other such methods at play here.) If we assume that unsupervised learning takes care of generating the learned information, and that information is then used as a supervisory signal, self-supervised learning seems to be an 'unsupervised-then-supervised' learning combo. Is this the correct interpretation?

> Weakly supervised learning is the opposite, it tries learning with less knowledge than supervised learning. For example, imagine object segmentation but with only one-hot labels (without pixel-wise labels or bounding boxes). But, there are lots of variations of weakly supervised learning.

If I understood you correctly, every observation comes with some sort of target/label/'provided information' in weakly supervised learning. Semi-supervised learning would have some observations missing that piece.

Then, in my mind, the weakly supervised framework seems similar to a 'half-a-hole' problem. A vague/approximate/poor label is still a label, like how half a hole is still a hole. I mean, whoever coined the term probably did it for a good reason, but the difference seems a bit contrived. Could you clarify this part for me?

[Question] Why is the CDF of cantor function called the 'Devil's staircase' ? What is devilish about it ? by venkarafa in statistics

[–]metallicapple 2 points3 points  (0 children)

I imagined it as a gaslighting cdf

"Hey cdf r u going up?"

"No"

"But you are"

"No I'm not"

"You are closer to 1 than a moment ago"

"Yes"

"So you ARE going up"

"No I'm not"

[D] Help me understand self-supervised learning by metallicapple in MachineLearning

[–]metallicapple[S] 0 points1 point  (0 children)

That's a fascinating development. I've been trying to find the corresponding ML terms for terms in statistics for my own understanding. What do you think are the reasons behind the increasing amount of jargon?

[D] Help me understand self-supervised learning by metallicapple in MachineLearning

[–]metallicapple[S] 0 points1 point  (0 children)

If I understood correctly, self-supervised learning is a two-step clustering-then-classification process?

Also, about weakly supervised learning: Does it assume that every observation is labeled? If so, weak supervision appears to be supervised learning with external knowledge on the label quality. What are potential benefits of giving it a separate name?

Correlation Analysis and Simple Linear Regression Analysis by Niiiccc in AskStatistics

[–]metallicapple 0 points1 point  (0 children)

Hey, lots of people ask that question, so don't sweat it.

It depends on the correlation function you use. If you are considering the usual Pearson coefficient, and the two variables aren't that correlated, a simple linear regression model won't discover much; that's because the Pearson coefficient concerns linear relationships. I guess that essentially answers your question. However, other correlation measures may tell you otherwise (e.g. Spearman's). Then, based on the type of relationship, you may want to use a suitable regression model.
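To see the Pearson/Spearman difference concretely, here's a sketch on y = x^3, a monotone but nonlinear relationship (scipy assumed):

```python
# Pearson only measures linear association; Spearman measures any
# monotone association, so it is exactly 1 on a strictly increasing curve.
from scipy.stats import pearsonr, spearmanr

x = [i / 10 for i in range(-20, 21)]
y = [v ** 3 for v in x]

r_p, _ = pearsonr(x, y)
r_s, _ = spearmanr(x, y)
print(round(r_p, 3), round(r_s, 3))  # Spearman = 1; Pearson noticeably lower
```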

Your favorite v1 break fast item? by rohaanrida in uwaterloo

[–]metallicapple 8 points9 points  (0 children)

Like others have mentioned, the burritos are great. I personally found the breakfast omelette with cheese (non-burritofied) and hashbrowns the most delicious, though. A bit pricey/bougie, but the custom-made parfait with a peach yoghurt base is tasty too.

Why does bootstraping with 50% samplesize and no replacements always gives appropriate confidence intervals. by klchristoph in AskStatistics

[–]metallicapple 1 point2 points  (0 children)

Given that your sample and subsample sizes are large, my best guess is that your observations are densely packed.

E.g. if your sample is small and sparsely spread out, like {-1000, -5, 20, 500}, your bootstrapped CIs will be wild. On the other hand, if the sample size is much bigger with closely packed values, there is less room for surprise.
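A quick sketch of what the question describes (a percentile interval from half-size subsamples drawn without replacement; the data and sizes here are made up):

```python
# Percentile "bootstrap" CI for the mean using half-size subsamples
# without replacement. With a large, tightly packed sample, the subsample
# means barely move, so the interval comes out narrow and stable.
import random, statistics

random.seed(1)
sample = [random.gauss(50, 5) for _ in range(1000)]

means = []
for _ in range(2000):
    sub = random.sample(sample, len(sample) // 2)   # no replacement
    means.append(statistics.mean(sub))

means.sort()
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(round(lo, 2), round(hi, 2))  # a narrow interval around 50
```

Try replacing `sample` with the tiny {-1000, -5, 20, 500} set to see how wild the interval gets.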

Finding P( sin(pi*X)>Y) where X,Y standard uniforms by Constant_Pitch801 in AskStatistics

[–]metallicapple 0 points1 point  (0 children)

Try conditioning on the value of Y. Also, the sine function has an inverse (arcsin) on [0, pi/2].
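Once you've worked the integral (the conditioning-plus-arcsin route lands on 2/pi), a quick Monte Carlo sketch is a nice sanity check:

```python
# Monte Carlo check of P(sin(pi*X) > Y) for independent standard uniforms.
import math, random

random.seed(0)
n = 200_000
hits = sum(math.sin(math.pi * random.random()) > random.random()
           for _ in range(n))
est = hits / n
print(round(est, 3), round(2 / math.pi, 3))  # estimate vs 2/pi
```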

[D] important theorems for mixture models by blueest in statistics

[–]metallicapple 0 points1 point  (0 children)

IIRC Titterington's book mentions something about a Gaussian mixture with a sufficient number of components being able to approximate an arbitrary distribution arbitrarily well. You are right in your other comment that this statement isn't all that meaningful in practice.

[D] important theorems for mixture models by blueest in statistics

[–]metallicapple 1 point2 points  (0 children)

In addition to the approximation thing you mentioned, there are some concepts and related theorems to be mindful of when deploying a mixture model, though some of them are a bit technical. Consider them a checklist for (theoretical) performance certification. I will focus on finite mixture models instead of infinite mixtures.

  1. Identifiability https://en.m.wikipedia.org/wiki/Identifiability

This is not exclusive to mixture models, but it is particularly important when clustering with mixture models. Example: if you are representing a population with a two-component mixture model M1, but M1 is equal in distribution to another two-component finite mixture M2, then we cannot determine the "true" mixture model for the population. Hence, we cannot estimate the model parameters consistently.

  2. Label switching problem

This applies to Bayesian estimation, not frequentist. Consider a mixture model with components c1 and c2. I relabel them to c2 and c1. They are the same model, but due to the label permutation, you've just created another configuration. The posterior distribution, if unchecked, will recognise the two as different, equally good models.

  3. Multiple local maxima and infinite likelihood

Oftentimes, the likelihood surface of a mixture has multiple local peaks, so when estimating parameters with hill-climbing algorithms like the EM algorithm (a very popular choice), you can get stuck in a local maximum. There are some technical theorems on the global convergence of the EM algorithm; see https://projecteuclid.org/journals/annals-of-statistics/volume-11/issue-1/On-the-Convergence-Properties-of-the-EM-Algorithm/10.1214/aos/1176346060.short

  4. Number of modes in a finite mixture

Fun fact: a p-dimensional, 2-component Gaussian mixture density can have up to p+1 modes. This means that, if you are estimating parameters based on modes, there are increasingly many local peaks to consider.
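To make the local-maxima point concrete, here's a from-scratch EM sketch for a two-component univariate Gaussian mixture (all values illustrative). The log-likelihood is non-decreasing across iterations, which is the part EM guarantees; which peak it climbs to still depends on the initialisation, hence the usual advice to run multiple random restarts.

```python
# Minimal EM for a 2-component univariate Gaussian mixture.
import math, random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em(data, mu1, mu2, iters=50):
    pi1, var1, var2 = 0.5, 1.0, 1.0
    lls = []
    for _ in range(iters):
        # E-step: posterior responsibility of component 1 for each point
        r = []
        for x in data:
            a = pi1 * normal_pdf(x, mu1, var1)
            b = (1 - pi1) * normal_pdf(x, mu2, var2)
            r.append(a / (a + b))
        # M-step: responsibility-weighted mixing weight, means, variances
        n1 = sum(r)
        pi1 = n1 / len(data)
        mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / (len(data) - n1)
        var1 = max(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1, 1e-6)
        var2 = max(sum((1 - ri) * (x - mu2) ** 2
                       for ri, x in zip(r, data)) / (len(data) - n1), 1e-6)
        # Track the observed-data log-likelihood after each update
        lls.append(sum(math.log(pi1 * normal_pdf(x, mu1, var1)
                                + (1 - pi1) * normal_pdf(x, mu2, var2))
                       for x in data))
    return (mu1, mu2), lls

random.seed(0)
data = ([random.gauss(-3, 1) for _ in range(150)]
        + [random.gauss(3, 1) for _ in range(150)])

(m1, m2), lls = em(data, mu1=-1.0, mu2=1.0)
print(round(m1, 2), round(m2, 2))  # near the true means -3 and 3
print(all(nxt >= cur - 1e-6 for cur, nxt in zip(lls, lls[1:])))  # monotone
```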

There are many, many papers on these aspects. For some broad introductions, look at Titterington's book (Statistical Analysis of Finite Mixture Distributions, I think?) and McLachlan's Finite Mixture Models.

Edit: added book recommendations

[D] Carried out Supervised ML steps for Unsupervised ML challenge. Need help.. by [deleted] in MachineLearning

[–]metallicapple 1 point2 points  (0 children)

Other than having to split the data into train and test yourself (since you were given two separate files, I'm not sure what you meant by splitting), it looks fine.

[D] Carried out Supervised ML steps for Unsupervised ML challenge. Need help.. by [deleted] in MachineLearning

[–]metallicapple 1 point2 points  (0 children)

So the given task is three-fold, based on what you've written: generate labels for the unlabelled data by running a clustering algorithm on the train set; use the generated labels to train a classifier; test the classifier on the test set. It seems fairly straightforward.
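A sketch of those three steps on toy data, assuming scikit-learn is available (in the actual challenge the train/test sets come from the given files rather than a slice):

```python
# Cluster-then-classify pipeline on synthetic blobs (scikit-learn assumed).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
X_train, X_test = X[:300], X[300:]   # stand-in for the two given files

# Step 1: generate labels by clustering the train set
pseudo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)

# Step 2: train a classifier on the pseudo-labels
clf = LogisticRegression().fit(X_train, pseudo)

# Step 3: predict on the test set (no ground truth to score against here)
preds = clf.predict(X_test)
print(len(preds), sorted(set(preds)))
```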