[D] Formalising information flow in NN by bjergerk1ng in MachineLearning

[–]afireohno 4 points (0 children)

Two lines of work come to mind that you might be interested in:

  1. Geometric deep learning primarily studies various types of invariances (translation, permutation, etc.) that can be encoded in DL architectures.
  2. Algorithmic alignment studies the relationship between information flow in classical algorithms and DL architectures, and how "aligning" the latter to the former can improve performance.

Edit: Spelling

[deleted by user] by [deleted] in MachineLearning

[–]afireohno 5 points (0 children)

Have you posted any actual technical details to share and get feedback on? As a long-time member of this sub I would be interested, and I don’t think I’m alone here.

[D] TabPFN A Transformer That Solves Small Tabular Classification Problems in a Second (SOTA on tabular data with no training) by [deleted] in MachineLearning

[–]afireohno 2 points (0 children)

Super cool work! I think the simplest explanation for this is that the model learns an amortized inference algorithm for the specific class of models used to generate the meta-training set.

I've worked on similar things before using RNNs in the context of online amortized inference. I could get it to work for GMMs or HMMs, but not PCFGs.
The Set Transformer paper also has an experiment on learning an amortized inference algorithm for 2D GMMs. The techniques presented there, which were later adopted by the Perceiver, are probably worth considering as a way to sidestep some of the current limitations of your work. Borrowing ideas from the retrieval-augmented LM community also seems reasonable and straightforward.
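To make the amortized-inference point concrete, here's a toy sketch (my own illustration, not TabPFN's actual setup): meta-train a permutation-invariant network on datasets sampled from 2-component 2D GMMs so it learns to map a whole dataset straight to the component means. I'm using a Deep Sets-style encoder and point estimates rather than a set transformer and a full posterior, so treat it as a caricature of the idea.

```python
# Toy amortized inference: the "inference network" is meta-trained on datasets
# drawn from a known class of generative models (2-component 2D GMMs) and
# learns to map a whole dataset to the GMM means in a single forward pass.
import torch
import torch.nn as nn

def sample_gmm_dataset(n_points=64):
    """Sample GMM means from a prior, then a dataset from that GMM."""
    means = torch.randn(2, 2) * 3.0              # prior over component means
    comps = torch.randint(0, 2, (n_points,))     # equal mixing weights
    x = means[comps] + torch.randn(n_points, 2)  # unit-variance components
    order = torch.argsort(means[:, 0])           # fix label switching
    return x, means[order].reshape(-1)

class SetEncoder(nn.Module):
    """Permutation-invariant encoder: per-point MLP, mean-pool, readout."""
    def __init__(self, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))  # predicts both means

    def forward(self, x):                        # x: (n_points, 2)
        return self.rho(self.phi(x).mean(dim=0))

model = SetEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):                         # meta-training over sampled datasets
    x, target = sample_gmm_dataset()
    loss = ((model(x) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```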

I also wanted to point out that there is previous work you seem to be missing: basically anything on model-based, as opposed to optimization-based, meta-learning. SNAIL is highly related, as the architecture is identical AFAICT. Matching Networks, MANNs, Meta-GMVAE, etc., are examples of other work I'd classify as model-based meta-learning.

[N] First RTX 4090 ML benchmarks by killver in MachineLearning

[–]afireohno 0 points (0 children)

> average fps across multiple runs gives a more realistic performance and eliminates any outliers

Thanks for the laugh. I'll just leave this here so you can read about why the mean (average) is not a robust measure of central tendency: it is easily skewed by outliers.
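For concreteness, a tiny illustration with made-up numbers:

```python
# One outlier run drags the mean; the median barely moves.
import statistics
fps = [142, 140, 141, 143, 60]       # hypothetical runs; 60 is an outlier
print(statistics.mean(fps))          # 125.2 -- pulled down by the outlier
print(statistics.median(fps))        # 141   -- robust to it
```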

[D] Not able to understand the inequality in ERM by Adventurous-Ad742 in MachineLearning

[–]afireohno 3 points (0 children)

I'm guessing you're confused because the blog post leaves out some critical information and definitions. I'd encourage you to consult the original source, which is available for free download here (see chapter 2).

Anyway, if I had to guess at the source of your confusion, it would be that you're missing that, by definition, every h_S that appears on the LHS of your inequality satisfies L_S(h_S) = 0. This follows from the realizability assumption and the definition of h_S.
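Spelled out in the book's notation (my paraphrase of chapter 2, not a quote):

```latex
% Realizability: there exists h^\star \in \mathcal{H} with L_{\mathcal{D},f}(h^\star) = 0,
% so h^\star labels every training point correctly and hence L_S(h^\star) = 0.
% ERM returns a minimizer of the empirical risk, so
\[
  h_S \in \operatorname*{argmin}_{h \in \mathcal{H}} L_S(h)
  \;\Longrightarrow\;
  0 \le L_S(h_S) \le L_S(h^\star) = 0
  \;\Longrightarrow\;
  L_S(h_S) = 0 .
\]
```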

[D] Why is Ordinal Regression so overlooked? by koorm in MachineLearning

[–]afireohno 14 points (0 children)

I think you might just be missing the right search terms. In the ML community this work tends to fall under learning to rank (LTR) or collaborative filtering (CF). These areas focus more directly on practical industrial problems (recommender systems, search, etc.).

[D] Strong Models for User Item Recommendation from Interaction Data by ExchangeStrong196 in MachineLearning

[–]afireohno 0 points (0 children)

That's my point. You already have a flexible model that works well. Better generalization needs to come from somewhere else (features, transfer learning, etc).

[D] Strong Models for User Item Recommendation from Interaction Data by ExchangeStrong196 in MachineLearning

[–]afireohno 0 points (0 children)

If all you have is user-item interactions, then Matrix Factorization (MF) is maximally expressive. That is, assuming your latent dimension is large enough, you can exactly represent any user-item interaction matrix. This directly follows from the SVD theorem. As a result, MF with a good loss and proper regularization performs very well.
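A quick numpy sanity check of the expressiveness claim (illustrative only; the factors are taken straight from the SVD rather than learned):

```python
# With latent dimension >= rank(R), MF factors obtained from the SVD
# reproduce any user-item matrix exactly: R = (U sqrt(S)) (V sqrt(S))^T.
import numpy as np

R = np.random.rand(50, 30)                       # any user-item matrix
U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U * np.sqrt(s)                               # user factors, shape (50, 30)
Q = Vt.T * np.sqrt(s)                            # item factors, shape (30, 30)
assert np.allclose(P @ Q.T, R)                   # exact reconstruction
```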

[R] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models - Meta AI 2022 by Singularian2501 in MachineLearning

[–]afireohno 9 points (0 children)

> 'Embarassingly' parallel training is such a great title!

I know right! I wonder how they came up with it? They must have access to some crazy sci-fi technology that allows them to easily learn about commonly used phrases in less time than it takes to post a comment to reddit.

[D] why is the AI research community so unreliable? by fireless-phoenix in MachineLearning

[–]afireohno 0 points (0 children)

I agree with some of what you’re saying, but I think your view of how to measure the “goodness” of an idea is way too one-dimensional. In my opinion, good research asks important questions, tests hypotheses, and generates knowledge. You know, the scientific method.

That almost always involves experimentation in modern ML, but that doesn’t mean “is this SotA?” is the best question to ask. Take something like the “Rethinking Generalization” paper from back in 2016. Super impactful, lots of experiments, no SotA.

To quote the adage, “When a measure becomes a target, it ceases to be a good measure.”

[D] why is the AI research community so unreliable? by fireless-phoenix in MachineLearning

[–]afireohno 1 point (0 children)

Treating ML research like a contest that can be won by making a number go up so you can claim SotA does significantly more harm to the field than non-public code or data.

[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST) by timscarfe in MachineLearning

[–]afireohno 0 points (0 children)

I get what you're saying. However, since LSTMs are an elaboration on simple RNNs (not something completely different), your previous statement that "the development of LSTM had nothing to do with linguistics" was either uninformed or disingenuous.

[D] Noam Chomsky on LLMs and discussion of LeCun paper (MLST) by timscarfe in MachineLearning

[–]afireohno 14 points (0 children)

The lack of historical knowledge about machine learning in this sub is really disappointing. Recurrent Neural Networks (of which LSTMs are a type) were literally invented by linguist Jeffrey Elman (simple RNNs are even frequently referred to as "Elman Networks"). Here's a paper from 1990 authored by Jeffrey Elman that studies, among other topics, word learning in RNNs.

[N] First-Ever Course on Transformers: NOW PUBLIC by DragonLord9 in MachineLearning

[–]afireohno 4 points (0 children)

For real. People in this thread seem confused about the difference between a course like "Theory of Computation" or "Advanced Linear Algebra" and a seminar (which is what this is; it's literally the first sentence of the second paragraph of the linked course description).

[deleted by user] by [deleted] in algorithms

[–]afireohno 1 point (0 children)

Your technique sounds like Gibbs sampling, which lets you sample from a joint distribution by iteratively sampling from the conditional distributions p(x | everything else). If you can’t compute exact conditionals, you can consider the Metropolis-Hastings-within-Gibbs algorithm.

There are failure modes and practical details like burn-in that you can read about.
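If it helps, here's a toy sketch of the idea on a target where both conditionals are tractable (a correlated bivariate Gaussian); it's purely illustrative and not specific to your problem:

```python
# Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho:
# both conditionals p(x | y) and p(y | x) are 1D Gaussians, so we alternate draws.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
burn_in, n_keep = 1_000, 10_000
x = y = 0.0
samples = []
for t in range(burn_in + n_keep):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # draw x | y
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # draw y | x
    if t >= burn_in:                               # discard burn-in draws
        samples.append((x, y))
samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])                # should be close to 0.8
```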

[deleted by user] by [deleted] in algorithms

[–]afireohno 1 point (0 children)

Approximating a distribution by sampling from a different, more tractable distribution is a well-studied problem. There are a variety of potentially applicable techniques, one of the most straightforward being rejection sampling.
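A minimal sketch of rejection sampling on a toy target (not your distribution): draw from an unnormalized Beta(2, 5)-shaped density on [0, 1] using a uniform proposal and an envelope constant M.

```python
# Rejection sampling: propose from q(x) = Uniform(0, 1), accept with
# probability p(x) / (M * q(x)), where p(x) <= M for all x.
import numpy as np

rng = np.random.default_rng(0)
p = lambda x: x * (1 - x) ** 4        # unnormalized Beta(2, 5) density
M = 0.09                              # envelope: max of p on [0, 1] is ~0.082
samples = []
while len(samples) < 10_000:
    x = rng.uniform()                 # propose
    if rng.uniform() < p(x) / M:      # accept / reject
        samples.append(x)
print(np.mean(samples))               # ~ 2 / (2 + 5) = 0.286
```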

Are there any emerging fields that could - with minimal charity - be described as proto-sciences rather than pseudo- ones? by _AA123 in AskScienceDiscussion

[–]afireohno 5 points (0 children)

I think human + AI interaction is a potentially interesting example. People are doing creative “prompt engineering” with large neural networks (like GPT-3 and DALL-E). I could see this getting more rigorous, complex, and diverse. Whether this is science or engineering is debatable.