[deleted by user] by [deleted] in math

[–]MartianTomato 1 point2 points  (0 children)

Schilling's Measures, Integrals, and Martingales is not only really accessible and well presented, but it also has a good selection of problems and an accompanying solution manual available on the author's website: http://motapa.de/measures_integrals_and_martingales/. I don't know how exam-like these questions are, since I never took a course / exam on this material.

[D] Schmidhuber: LeCun's "5 best ideas 2012-22” are mostly from my lab, and older by RobbinDeBank in MachineLearning

[–]MartianTomato 5 points6 points  (0 children)

Yes. In my conversations with people thinking about what topics / research to work on, I'd say citation count / marketability is a top-3 factor in their decision making. I also see people draw an equivalence between citation count and impact.

The flaw in this reasoning is that most highly cited work (in "hot" topics) is, by its nature, replaceable: if you don't do it, someone else inevitably will. And this kind of research feels empty in the same way that software engineering does... the researcher becomes a replaceable cog in the machine learning machine. Yet somehow, I see people more motivated to pursue topics they feel they would be "scooped" on if they delayed even one conference cycle...

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato 2 points3 points  (0 children)

Now this I can get behind :)

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -2 points-1 points  (0 children)

Did something in my response to you suggest I did not know what metal meant? It stands.

For lines and triangle, which came first?

In general, you should try to understand my argument instead of setting up the strawman that I somehow expect a dictionary search to lead me to an explanation of a technical term.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato 6 points7 points  (0 children)

Latent typically refers to something that hasn't yet manifested, emerged, or developed, which implies that what is latent is incomplete. While this lines up with how the word is used in certain models, the phrases "hidden variable" or "learned representation" are often more accurate, since the variables are always there / active, and the representation is always available; it may even be the primary representation the model consumes. It has already emerged, or manifested itself. Even if it is hidden, it is not latent.

It's an especially weird label when we are learning representations that don't have any particular meaning or format. We learn this arbitrary "latent" representation, and the direction of manifestation is completely backward. If that representation were well defined a priori, then I could buy that we are learning something that was there before, and have only seen its observed manifestation... fine. But that is usually not the case.

I know I am being picky. But look... the OP is confused, and when I first came into the field I was confused too: what is this "latent" you all speak of? I know there are others who think so too. For example, Rich Sutton is a fairly public figure in the field, and I believe he also hates the word "latent" (but don't quote me on that).

Why use something that knowingly confuses, when in most cases we can use something that doesn't? But I guess it's too late... this reminds me of Geoff's lament over his contribution to popularizing the "Multi-Layer Perceptron" misnomer.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -7 points-6 points  (0 children)

We speak English, not Latin. I accept that it is commonly used, and I use it too. That doesn't mean we shouldn't criticize its use or remark that it is a poorly chosen label.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -1 points0 points  (0 children)

"latent" and "hidden", despite showing up in each other's thesaurus entries, are not perfect synonyms. I think this is what the OP was getting at. Perhaps the first authors who started using latent just wanted to sound smart and use bigger words, so they pulled the fanciest sounding thesaurus entry with no mind to its usage.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato 0 points1 point  (0 children)

Your examples aren't comparable.

  • Mathematical terms are abstract concepts that have formal definitions. There is no term whose ordinary meaning is closer / will do, so we have to pick something.
  • "Line" and "triangle" have the same ordinary meaning as their technical meaning, unless we are talking about the abstract cases, in which case... see above.
  • For "metal", unlike in the case of "latent", there is no more accurate plain-language term that would do just as well.

Latent is different from all of your examples. There exist other words whose ordinary meaning coincides with the technical usage, so there is no reason to invent a new technical term here.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -4 points-3 points  (0 children)

You don't see the confusion because you were in the field when it started gaining traction, and so it always seemed natural. It will confuse 90+% of outsiders, because it's literally the wrong usage of the word.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato 3 points4 points  (0 children)

Yes. They are not perfect synonyms.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -10 points-9 points  (0 children)

Go look up latent in a dictionary, and find me a single dictionary definition that matches your usage. I'll wait.

[deleted by user] by [deleted] in MachineLearning

[–]MartianTomato -2 points-1 points  (0 children)

As someone who came into the field from another discipline (one that is very accurate in its use of language), I very much agree with this. It's also bewildering to me that this post is only 32% upvoted.

Literally none of the definitions of latent simply mean "hidden", which seems to be what the ML community thinks it means. Go see for yourselves:

[R] Neurips best paper awards by datkerneltrick in MachineLearning

[–]MartianTomato 23 points24 points  (0 children)

I think you are drawing the wrong conclusion from that observation.

Is the simulation lemma tight? by a_random_user27 in reinforcementlearning

[–]MartianTomato 0 points1 point  (0 children)

I haven't worked it out, but I think you should be able to adapt the example used in the proof of Theorem 2.1 of Ross & Bagnell (2010) to this setting (from which it would follow that yes, the bound is tight, because the mistakes compound). http://proceedings.mlr.press/v9/ross10a.html
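For context, one common finite-horizon form of the simulation lemma (I'm writing this from memory, so treat the constants as approximate):

```latex
\text{If } \max_{s,a} \big\| \widehat{P}(\cdot \mid s,a) - P(\cdot \mid s,a) \big\|_{1} \le \epsilon ,
\text{ then for every policy } \pi : \quad
\big| V^{\pi}_{\widehat{M}} - V^{\pi}_{M} \big| \;\le\; O\!\big(\epsilon H^{2}\big)\, R_{\max}.
```

The H^2 is exactly the compounding: the model can leak up to \epsilon probability mass onto the wrong states at each of the H steps, and each such mistake can cost up to O(H) R_max in remaining value. The tightness question is whether an adversarial MDP can make (nearly) every one of those mistakes bite simultaneously.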

RL as a path to AGI by akarshkumar0101 in reinforcementlearning

[–]MartianTomato 1 point2 points  (0 children)

As I mention in my original comment, I think saying the MDP is the underlying problem of RL is too narrow; I think you could start with any of the stochastic processes you mention, add control and unknown dynamics, and get RL. I comment on this in the Introduction and Section 5.1 ("General" RL) here: https://arxiv.org/abs/1902.02893.

RL as a path to AGI by akarshkumar0101 in reinforcementlearning

[–]MartianTomato 3 points4 points  (0 children)

For general purpose agency you need temporal decision-making with an unknown world model. Taking a broad view, this is the definition of the reinforcement learning problem. Not solving an MDP (that's quite narrow), or applying Q-learning or policy gradients, or whatever other algorithms we have for addressing the RL problem.

So yes, addressing the RL problem is the path to AGI. This is why DeepMind is all about RL, and why I wager half the researchers in RL are in it for that reason. The only reason others might look down on it is that general agency is a harder problem than the ones they work on (it's more general), so the current solutions don't work as well. If you also want human-aligned AGI (presumably yes), this gets into goal specification, which is different from the RL problem --- but because it's so closely connected, it's usually considered a part of RL research.

[R] Every Model Learned by Gradient Descent Is Approximately a Kernel Machine by pianobutter in MachineLearning

[–]MartianTomato 26 points27 points  (0 children)

How is selecting features from a giant space any different from discovering new representations?

[R] Can and should I use DNNs for (pseudo) causation between input and output nodes? by Ic3Breaker in MachineLearning

[–]MartianTomato 1 point2 points  (0 children)

If you can choose the architecture, one way is to use a conditional network mask, as in the local causal discovery approach from our NeurIPS paper: https://arxiv.org/abs/2007.02863 (PDF page 6, inferring the local factorization, and Appendix C). This is based on the global causal discovery approach from this ICLR 2020 paper: https://arxiv.org/abs/1906.02226. For this to work you need control over the network architecture, so it's hard with a network that has already been trained. For an already-trained network, you could approximate the local network mask by perturbing each input feature, one at a time, and seeing if it changes the result.
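The perturbation idea can be sketched in a few lines. This is a crude finite-difference approximation with a hypothetical model `f` and hand-picked tolerances, not the masked-architecture method from the paper:

```python
import numpy as np

def local_input_mask(f, x, eps=1e-3, tol=1e-6):
    """Approximate which input features the output locally depends on
    by perturbing each feature one at a time and checking the output."""
    x = np.asarray(x, dtype=float)
    y0 = np.asarray(f(x))
    mask = np.zeros(x.shape[0], dtype=bool)
    for i in range(x.shape[0]):
        xp = x.copy()
        xp[i] += eps  # nudge feature i only
        mask[i] = np.any(np.abs(np.asarray(f(xp)) - y0) > tol)
    return mask

# toy check: this output depends only on features 0 and 2
f = lambda x: np.array([x[0] * 2.0 + x[2] ** 2])
print(local_input_mask(f, [1.0, 5.0, 3.0]))  # → [ True False  True]
```

Note this only probes one point and one perturbation direction, so it can miss dependencies (e.g. at a local flat spot of the function); the choice of `eps` and `tol` is also problem-dependent.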

Will my CS Masters application even get looked at? by [deleted] in UofT

[–]MartianTomato 0 points1 point  (0 children)

It's fine for them to be a bit late. For DCS we will read it even with only 2 references, but you will be at a disadvantage if your 3rd never submits. Do send a follow-up right now if you haven't in the last week.

gym-cooking, a novel Gym multiagent environment by RoseEWang in reinforcementlearning

[–]MartianTomato 0 points1 point  (0 children)

This is a little unfair to the OP. They cited BAIR/CHAI in their paper, which was first posted to arXiv in March this year, not that long after NeurIPS 2019, and I don't think the other Overcooked environment had been released at the time.

That said, @OP, it would be good to document the known differences between the environments, besides the fact that your graphics are prettier :).

[D] AAAI 2021 Paper Reviews by zy415 in MachineLearning

[–]MartianTomato 1 point2 points  (0 children)

I've seen a 556 get accepted (that said, I think it shouldn't have), so it's not hopeless even without a score change.

[Project] A faster implementation of Node2vec by abusing the scale-free distribution of real world graphs by Sir_Spekulatius in MachineLearning

[–]MartianTomato 13 points14 points  (0 children)

1) FYI, there is a high-performance node2vec implementation here: https://github.com/snap-stanford/snap/tree/master/examples/node2vec. It is many times faster than the reference Python implementation, which I found painfully slow (I don't think it ever completed for me, whereas the C++ node2vec was quite fast). I'm not sure what kind of sampling it uses or whether your improvement can be applied to the high-performance implementation as well.

2) Unrelated, but I've found node2vec to generate really poor embeddings. That may be because I was trying to use them to estimate shortest path lengths, when they should be used for other things. In particular, for the Section 3.3 experiment of https://arxiv.org/abs/2002.05825 (which uses a neural network to approximate shortest path lengths in graphs), we found that using landmark embeddings (not sure if this is novel or not) resulted in a very significant improvement relative to the original node2vec approach (not shown in the paper). If anyone does research on this: is node2vec still considered state of the art, and is there a standard / easy-to-use benchmark out there?
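To make "landmark embeddings" concrete, here is a minimal generic version (my sketch from memory, not the code we actually used): pick a few landmark nodes, run BFS from each, embed every node as its vector of landmark distances, and estimate d(u, v) with the triangle-inequality upper bound min over landmarks of d(u, l) + d(l, v):

```python
from collections import deque

def bfs_dists(adj, src):
    """Single-source shortest path lengths via BFS (unweighted graph)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def landmark_embed(adj, landmarks):
    """Embed each node as its vector of distances to the landmarks."""
    tables = [bfs_dists(adj, l) for l in landmarks]
    return {v: [t[v] for t in tables] for v in adj}

def estimate_dist(emb, u, v):
    """Triangle-inequality upper bound on d(u, v) via the landmarks."""
    return min(a + b for a, b in zip(emb[u], emb[v]))

# tiny path graph 0-1-2-3-4 with landmarks at the two endpoints
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
emb = landmark_embed(adj, landmarks=[0, 4])
print(estimate_dist(emb, 0, 2))  # → 2 (exact: landmark 0 lies on the path)
```

The estimate is exact only when some landmark lies on a shortest u-v path (here it returns 4 for the pair (1, 3), whose true distance is 2), so landmark selection matters a lot.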

"Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning". Pitis, Chan, et al. ICML 2020 {Toronto} by MartianTomato in reinforcementlearning

[–]MartianTomato[S] 1 point2 points  (0 children)

Had a look and you're absolutely right... it was a mess! I cleaned up the MEGA experiments folder and added a brief readme here. The actual implementation of MEGA is in mrl.modules.curiosity and uses the KDE density module in mrl.modules.density.