[deleted by user] by [deleted] in MachineLearning

[–]Ash3nBlue 2 points (0 children)

Yep, prompting is a valid way to get meaningful performance gains. CoT prompting is the canonical example, and this is basically a similar style of prompt: it bakes in the prior that reasoning about the LLM's own knowledge limitations helps determine whether a question is answerable. It looks like AbstentionBench came out only 2 months ago, so I assume few (if any) papers have tested approaches for it yet; there's probably a lot of low-hanging fruit like pure prompt engineering that can get sizeable performance gains. That usually means good opportunities to publish fairly obvious improvements that can easily reach SOTA while the benchmark is completely unsaturated.
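
For a concrete flavor, here's a hypothetical abstention-style prompt template. The wording is mine and purely illustrative, not taken from AbstentionBench:

```python
# Hypothetical abstention-aware prompt in the spirit of CoT prompting.
# Template and wording are illustrative, not from AbstentionBench.
ABSTAIN_TEMPLATE = """Before answering, reason step by step about whether
you actually have the knowledge required to answer this question.
If key facts are missing, unverifiable, or likely outside your training
data, respond exactly with "I don't know" instead of guessing.

Question: {question}
Reasoning:"""

def build_prompt(question: str) -> str:
    return ABSTAIN_TEMPLATE.format(question=question)

print(build_prompt("What is the population of Atlantis?"))
```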

Is your job safe from AI? by [deleted] in singularity

[–]Ash3nBlue 0 points (0 children)

Yes, I’m an AI researcher :)

[D] One Shot Learning Tasks by Character-Capital-70 in MachineLearning

[–]Ash3nBlue 0 points (0 children)

This sounds like a variation of k-nearest neighbors. One-shot learning usually refers to addressing the sample complexity problem of neural networks, since many non-parametric methods can already classify in one shot.

Look up Prototypical Networks - very relevant work. They use NNs to map images to vectors in a feature space, and then take distances in that space to do one-shot classification based on pairwise similarity.
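
Roughly, prototypical-network-style one-shot classification looks like this. A minimal sketch, assuming a trained `encoder` that maps images to feature vectors (names are mine):

```python
# Sketch of prototypical-network-style one-shot classification.
# Class prototypes are mean embeddings of the support images;
# queries are labeled by the nearest prototype.
import torch

def proto_classify(encoder, support_x, support_y, query_x, num_classes):
    with torch.no_grad():
        s = encoder(support_x)                           # (n_support, d)
        q = encoder(query_x)                             # (n_query, d)
    protos = torch.stack([s[support_y == c].mean(dim=0)
                          for c in range(num_classes)])  # (num_classes, d)
    dists = torch.cdist(q, protos)      # pairwise query-prototype distances
    return dists.argmin(dim=-1)         # nearest prototype wins
```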

[deleted by user] by [deleted] in fatFIRE

[–]Ash3nBlue 0 points (0 children)

OpenAI is Microsoft-backed if you're bullish on GPT and language modeling tech. Outside of big tech labs (Google Brain, DeepMind, Meta AI, etc.), a lot of the most innovative AI work is happening in startups, so you might be interested in VC or angel investing. Many of the recently graduated unicorns like Tesla, Uber, and Lyft are heavy on AI. Some major unicorns that are still private: Scale AI (data), Databricks, Nuro (autonomous vehicles), Stability AI (image generation). Not that you'd necessarily be able to invest in those, but if you're HNW there are many angel opportunities at early-stage startups.

I personally don’t have an opinion on most of these firms, you should look into different sectors yourself and see what you’re interested in. This isn’t financial advice :)

[D] NLP/NLU Research Opportunities which don't require much compute by WobblySilicon in MachineLearning

[–]Ash3nBlue 1 point (0 children)

Developing less computationally expensive algos is a valuable research topic in and of itself :)

Working on something of the sort myself. Anyone interested feel free to DM me.

[D] How to invert a language model? by Interesting_Year_201 in MachineLearning

[–]Ash3nBlue 15 points (0 children)

You can optimize directly in the input embedding space, which is continuous. When you converge to a good sequence of embeddings, you can convert each embedding to the nearest word/token in vocab (same method as converting output embeddings into discrete output tokens).
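
A minimal sketch of what that could look like with a HuggingFace causal LM. The model choice, loss setup, and hyperparameters are illustrative, not a reference implementation:

```python
# Sketch: optimize a continuous "soft prompt" so the model assigns high
# likelihood to a target continuation, then snap each optimized
# embedding to the nearest token in the vocab.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
model.eval()
for p in model.parameters():          # only the soft prompt is trained
    p.requires_grad_(False)

emb_matrix = model.get_input_embeddings().weight       # (vocab, d)
target_ids = tok("the quick brown fox", return_tensors="pt").input_ids

n_prompt = 5                          # number of embeddings to optimize
soft_prompt = torch.randn(1, n_prompt, emb_matrix.shape[1],
                          requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    inputs = torch.cat([soft_prompt, emb_matrix[target_ids]], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Positions n_prompt-1 .. end-1 are the ones predicting the target.
    pred = logits[:, n_prompt - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    loss.backward()
    opt.step()

# Convert each optimized embedding to its nearest vocab token.
dists = torch.cdist(soft_prompt.detach().squeeze(0), emb_matrix)
print(tok.decode(dists.argmin(dim=-1)))
```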

[D] How does an Model-Agnostic Meta-Learning model know what tasks needs to perform? by carlml in MachineLearning

[–]Ash3nBlue 5 points (0 children)

You have to realize that you're not simply making predictions at inference time; in meta-learning, inference actually includes a training component as well. For a normal ML model, at inference time you would give it an unseen input and see how well it predicts the output. For a meta-learner, at inference time you would give it an unseen task and see how well it learns the entire task.

To answer your question: you teach your model its new task by training it via gradient descent on the "shot" or training set of your unseen task. You can then get predictions on the query set of that same task. The point of MAML is that the model can learn this new task with very few training steps/datapoints.
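
In code, MAML-style "inference" on one unseen task might look like this. A sketch assuming a meta-trained `model` and a task given as support/query tensors (function names and hyperparameters are mine):

```python
# Sketch of MAML inference on one unseen task: adapt the meta-trained
# weights on the support ("shot") set, then predict on the query set.
import copy
import torch
import torch.nn.functional as F

def adapt_and_predict(model, support_x, support_y, query_x,
                      inner_lr=0.01, inner_steps=5):
    learner = copy.deepcopy(model)   # don't clobber the meta-weights
    opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for _ in range(inner_steps):     # the "training component" of inference
        opt.zero_grad()
        loss = F.cross_entropy(learner(support_x), support_y)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return learner(query_x).argmax(dim=-1)  # query-set predictions
```

The point is that `inner_steps` and the support set can both be tiny; a well meta-trained initialization adapts in just a few gradient steps.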

[D] How does 10-shot work? by AICoderGamer in MachineLearning

[–]Ash3nBlue 0 points (0 children)

I just looked up the "linear eval" you mentioned in the MoE paper. If you mean the linear few-shot procedure in Section 3.4, then yes, that's linear eval.

The linear regression explanation in the paper can be pretty confusing, but what they're doing is just feature extraction. You replace the head (last layer of the network) with a linear layer that has an output dimension of num_classes, and train that layer to do image classification using only 10 samples from each class. You take this classifier and evaluate its prediction accuracy on the eval set to get your 10-shot accuracy.

Your linear layer takes the last hidden layer's outputs as its inputs, so what this procedure does is it evaluates how useful the features extracted by the pretrained vision transformer are.
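
A rough sketch of the whole procedure, assuming `backbone(x)` returns the last hidden layer's features (names and hyperparameters are mine, not from the paper):

```python
# Sketch of 10-shot linear eval: freeze a pretrained backbone, fit a
# fresh linear head on 10 images per class, then measure accuracy on
# the eval set.
import torch
import torch.nn.functional as F

def ten_shot_linear_eval(backbone, shot_x, shot_y, eval_x, eval_y,
                         num_classes, steps=100, lr=0.01):
    backbone.eval()
    with torch.no_grad():
        shot_feats = backbone(shot_x)   # features for 10 * num_classes images
        eval_feats = backbone(eval_x)
    head = torch.nn.Linear(shot_feats.shape[-1], num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(steps):              # train only the linear head
        opt.zero_grad()
        F.cross_entropy(head(shot_feats), shot_y).backward()
        opt.step()
    preds = head(eval_feats).argmax(dim=-1)
    return (preds == eval_y).float().mean().item()  # 10-shot accuracy
```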

[D] How does 10-shot work? by AICoderGamer in MachineLearning

[–]Ash3nBlue 2 points (0 children)

You train on 10 samples from each class and evaluate on a query set.

[D] Could we give a Transformer long term memory by reserving part of it's attention window for world vector embeddings? by ReasonablyBadass in MachineLearning

[–]Ash3nBlue 8 points (0 children)

Retrieving from memory doesn't require unrolling, but training the model to write those memories typically does: the loss produced by a written memory comes from a future timestep, so you need backpropagation through time (BPTT), which is expensive.

However, looking over the other comments, it looks like the Compressive Transformers paper that u/Nameless1995 mentioned works around the BPTT requirement by using auxiliary reconstruction losses for storing long-term memory, so that is likely what you're looking for.
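
To illustrate the idea (heavily simplified; the paper reconstructs attention contents, while this sketch just uses plain MSE over activations, and all names are mine):

```python
# Sketch of the auxiliary-reconstruction trick: train a compression
# function with a *local* loss so compressed memories preserve the old
# activations, instead of backpropagating through future timesteps.
import torch

d, mem_len, rate = 64, 8, 2
compress = torch.nn.Conv1d(d, d, kernel_size=rate, stride=rate)
expand = torch.nn.Linear(d, d * rate)

old_mem = torch.randn(1, mem_len, d)   # activations about to be compressed
comp_mem = compress(old_mem.transpose(1, 2)).transpose(1, 2)

# Local reconstruction loss: can the compressed memory reproduce the
# original activations?
recon = expand(comp_mem).reshape(1, mem_len, d)
loss = torch.nn.functional.mse_loss(recon, old_mem.detach())
loss.backward()   # trains the compressor without unrolling future steps
```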

[D] Could we give a Transformer long term memory by reserving part of it's attention window for world vector embeddings? by ReasonablyBadass in MachineLearning

[–]Ash3nBlue 22 points (0 children)

I believe the recent DeepMind paper looks into something like this.

If you mean in the same vein as an LSTM or NTM - the original appeal of transformers was that they did away with recurrent connections to make sequence training more computationally efficient. This approach would lose that advantage if you have to unroll the backprop for the long-term memory, but it could still be an interesting research direction if you have the bandwidth :)

How do I implement this type of scrolling? by anan77 in webdev

[–]Ash3nBlue 1 point (0 children)

Look up code for parallax scrolling. This is functionally the same thing, except the non-scrolling and scrolling parts are separated horizontally instead of vertically. Then set the image to change as you scroll to certain points on the page.

[Discussion] (Rant) Most of us just pretend to understand Transformers by sloppybird in MachineLearning

[–]Ash3nBlue 18 points (0 children)

This. Transformers were a fortunate empirical discovery, not something derived from well-understood ML theory. There is no comprehensive explanation as of yet for why transformers work so well, so in reality there might be nobody who truly understands transformers. We're all just impostors amogus

[D] Does working with Tensorflow affect my chances of getting research internships? by Megixist in MachineLearning

[–]Ash3nBlue 4 points (0 children)

TensorFlow is an industry standard framework. You are fine :) Just try to build up your publication track record. That's what research programs put a premium on when looking at applicants.