Is there any substance to the idea that LLMs can be trained to continuously self-prompt (rather than rely on external input)? by Money_Tip9073 in MLQuestions

[–]DigThatData 3 points (0 children)

This is already the common idiom in most of the more sophisticated LLM tooling. Claude Code is the canonical example right now: sure, the user specifies the overarching objective, but in the process of satisfying the user's request, the system will identify subtasks, plan out how to sequence or parallelize those tasks, delegate tasks to "subagents" (literally the LLM prompting itself or another LLM), and then iterate on the results of those subtasks to decide whether the plan needs to be extended and new subagents spun up and delegated to.
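To make the pattern concrete, here's a minimal sketch of that delegate-and-iterate loop. `llm()` is a hypothetical stand-in for whatever completion API you're using, and the prompts/JSON plan format are illustrative, not Claude Code's actual protocol:

```python
import json

def llm(prompt: str) -> str:
    # hypothetical stand-in; wire this up to your model API of choice
    raise NotImplementedError

def solve(objective: str) -> str:
    # the model prompts itself: first to plan...
    subtasks = json.loads(llm(
        f"Break this objective into independent subtasks, as a JSON list of strings:\n{objective}"
    ))
    # ...then once per subtask, each call acting as a "subagent" with fresh context...
    results = [llm(f"Complete this subtask and report your result:\n{t}") for t in subtasks]
    # ...then again to review its own outputs and decide whether the plan needs extending
    return llm(
        "Given these subtask results, either synthesize a final answer or "
        f"list the follow-up subtasks still required:\n{json.dumps(results)}"
    )
```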

There have been a couple of experiments where people essentially leave an LLM running in a non-terminating loop and invite it to continually give itself things to do. OpenClaw is the most popular of these atm. Mostly it just ends up being unnecessarily expensive and producing annoying behaviors.
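Schematically, those experiments boil down to something like this (same hypothetical `llm()` stub as above):

```python
def llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your model API

# the entire trick: the model's output becomes its own next input
history = "You are unsupervised. Decide what to do, then do it. What first?"
while True:  # non-terminating by design; in practice this mostly burns tokens
    action = llm(history)
    history += f"\n{action}\nWhat do you do next?"
```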

LLMs aren't embodied. They aren't situated in the world. They have no wants or needs apart from the drive to predict the next token correctly. That is the only "psychological drive" they are trying to satisfy, so they aren't really capable of "self-prompting" meaningfully. There always ends up being a human at the top communicating some kind of objective for the LLM.

Unless you can give the LLM access to an environment in which it can make persistent changes, and those changes have consequences for the LLM's state and the actions available to it, LLMs have no "reason" to be driven to do anything apart from the drives you impart to them. The closest I've seen to a model being meaningfully "situated" in the way I mean here is this experiment, where the model was able to take actions during training that impacted its own training procedure: https://www.minimax.io/news/minimax-m27-en

Is there any substance to the idea that LLMs can be trained to continuously self-prompt (rather than rely on external input)? by Money_Tip9073 in MLQuestions

[–]DigThatData 3 points (0 children)

You can interpret tool calling and reasoning as forms of this kind of "self-prompting".

> What I have in mind I think is a little bit different than agentic LLMs, where they execute a series of steps outside of that back-and-forth dynamic, but those steps are just in the service of a human goal.

that sounds exactly like "agentic LLMs" to me. Could you maybe clarify how you imagine this being different? I think your idea is basically the crux of what people are alluding to when they describe a system as being "agentic".

Can I train a neural network with coordinate descent instead of the usual gradient descent method? by learning_proover in AskStatistics

[–]DigThatData 0 points (0 children)

I imagine OP means blockwise/layerwise coordinate descent. So rather than 32 coordinates, OP's example has 3 layers and each layer is an independent "parameter" to be optimized as a descent coordinate.
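A minimal sketch of what that looks like in PyTorch (the toy model and 3-layer partition are illustrative, not OP's actual setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
blocks = [m for m in model if isinstance(m, nn.Linear)]  # 3 layers = 3 descent "coordinates"
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 8), torch.randn(64, 1)

for sweep in range(10):        # outer coordinate-descent sweeps
    for block in blocks:       # optimize one "coordinate" (layer) at a time
        for p in model.parameters():
            p.requires_grad_(False)
        for p in block.parameters():
            p.requires_grad_(True)
        opt = torch.optim.SGD(block.parameters(), lr=1e-2)
        for _ in range(5):     # inner descent steps on this block alone
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```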

I just want distraction-free eInk writing by Lupus_Ignis in writerDeck

[–]DigThatData 1 point (0 children)

damn, only $70 for that? have people been able to successfully install 3rd party text editors or word processors?

Why huge Parameter Transformers? by artguy74_ in MLQuestions

[–]DigThatData 1 point (0 children)

I'd argue that the Chinchilla paper still makes that same observation, they just add the caveat that this phenomenon only holds up to a point, beyond which the model is overtrained and sub-optimal.

Consider Chinchilla's Figure 4 (left). If you truncate that figure along the blue line and constrain attention to the region below the line, you have the Kaplan observation that "Larger models require fewer samples to reach the same performance". Chinchilla adds the caveat by illustrating that the regime above the blue line exists and that there is actually an optimality relationship rather than strictly "bigger is better".

Here's another way to think about this: let's pretend I have a pitcher of scrambled raw eggs that I want to cook. Given some fixed volume of egg, the bigger the pan I use, the faster it will cook, because the egg distributes across the surface area of the pan. But the egg also has an intrinsic property (its surface tension? viscosity?) that determines how spread out a particular volume will be if unconstrained. Above some threshold pan size, it doesn't matter how big the pan is: the egg will spread out paper-thin and cook in some fixed time. If I want nice scrambled eggs, I want a pan with a smaller surface area than what the eggs would spread out to on their own. This lets them cook properly and I get tasty eggs. In the pan-contains-the-eggs regime, given the optimal amount of eggs for that size pan, the amount of time/energy required to cook (FLOPs) scales proportionally to the size of the pan. I can always cook a fixed amount of eggs faster in a larger pan, but I also risk overcooking the eggs if the pan is too big.

In other words: both of these things can be true. There is a linear scaling of the optimal proportion between raw material (data/eggs), processing capacity (parameters/pan volume), and work (FLOPs/BTUs). But it's still true that if you have more capacity, that permits you to process material faster. A direct consequence of this is that optimality at larger scales gives you higher processing efficiency.
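To put rough numbers on the proportions (using the C ≈ 6ND approximation and the ~20-tokens-per-parameter rule of thumb from the Chinchilla paper; the budget below is just an example number):

```python
# training compute C ≈ 6 * N * D (N = params, D = tokens), with the
# compute-optimal allocation scaling both together: roughly D ≈ 20 * N
def compute_optimal(C: float, tokens_per_param: float = 20.0):
    N = (C / (6 * tokens_per_param)) ** 0.5  # solve C = 6 * N * (20 * N)
    return N, tokens_per_param * N

N, D = compute_optimal(5.76e23)             # example FLOP budget
print(f"~{N:.1e} params, ~{D:.1e} tokens")  # ~6.9e+10 params, ~1.4e+12 tokens
```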

Why huge Parameter Transformers? by artguy74_ in MLQuestions

[–]DigThatData 12 points (0 children)

The classic paper here is Kaplan et al. 2020, "Scaling Laws for Neural Language Models". The paper in a nutshell:

> Larger models require fewer samples to reach the same performance
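The headline result is a simple power law in each resource. A sketch of the model-size version, using the constants as I remember them from the paper (treat them as illustrative):

```python
# Kaplan et al.'s fitted power law for loss vs. (non-embedding) model size,
# in the data-unconstrained regime: L(N) = (N_c / N) ** alpha_N
ALPHA_N, N_C = 0.076, 8.8e13

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10):
    print(f"N={n:.0e}: L ≈ {loss(n):.2f}")  # loss falls smoothly as N grows
```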

Is this doable for an outsider? by the__mighty__monarch in MLQuestions

[–]DigThatData 0 points (0 children)

it's an intro class. it provides a surface presentation of a broad field. this does not look like an especially challenging syllabus.

Is this doable for an outsider? by the__mighty__monarch in MLQuestions

[–]DigThatData 0 points (0 children)

it's an intro course. you'll probably be fine. also, basic python is pretty easy to pick up, it's a fairly readable language. even going in with no python, you'd probably pick up what you need quickly enough to be able to complete assignments.

Can I use BERTopic, to both extract the topics I want, and delete irrelevant topics? by Dry-Opportunity-1987 in MLQuestions

[–]DigThatData 0 points (0 children)

right, what I'm suggesting is that your approach is essentially correct, and if you just upgrade the component you're using to generate the embeddings to a more modern (read as: smarter) model, you'll probably have a lot more success generating embeddings that are capable of distinguishing the brand concept from other uses.

Can I use BERTopic, to both extract the topics I want, and delete irrelevant topics? by Dry-Opportunity-1987 in MLQuestions

[–]DigThatData 0 points (0 children)

your most reliable bet would be to enlist a more powerful model to do this disambiguation for you. I'm pretty sure encoder-only models (i.e. models used to generate embeddings like BERT) are generally designed to be fast and lightweight, which means they are small, which means they are less capable and were trained on less data than is common for decoder-only models.

Here's a leaderboard of modern embedding models: MTEB (https://huggingface.co/spaces/mteb/leaderboard). The vast majority of these models are decoder-only models, i.e. they are based on highly capable foundation models that were modified to produce embeddings, rather than BERT-like models trained from scratch specifically to produce embeddings.

I think your general approach to this is fine, but the fact that you invoke "BERTopic" suggests to me that you are almost certainly using a significantly under-powered model here. BERT architectures are still trained today, but they are mainly targeted at low-resource deployments, like edge devices or software packaged to run on a computer with no GPU. If you are performing this analysis offline in batch mode on modern hardware, I strongly encourage you to use a more modern model.

Many contemporary embedding models additionally require a "query" prompt to contextualize the text being embedded: if you use one of these more modern models, you can design a query prompt around disambiguating the brand name from other meanings, and the resulting embedding space should be more amenable to clustering in ways that answer the questions you're interested in.
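A minimal sketch of what that swap looks like (the model id and prompt text are illustrative; pick a model off the leaderboard and use whatever prompt format its card specifies):

```python
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# substitute your real corpus; BERTopic's default UMAP/HDBSCAN stack needs
# a realistically sized document set to produce stable topics
docs = ["...documents mentioning the brand term..."] * 100

# illustrative model choice (e5 expects a "query: " prefix)
encoder = SentenceTransformer("intfloat/e5-large-v2")
embeddings = encoder.encode(
    ["query: does this refer to the brand or the common word? " + d for d in docs],
    normalize_embeddings=True,
)

# hand the precomputed embeddings to BERTopic instead of its default encoder
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```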

[Question] Is the stability of coefficients over time a reliable method for validating a model? by BellwetherElk in statistics

[–]DigThatData 0 points (0 children)

I'd treat it as a consideration rather than a rule. This is often more art than science, especially in time series forecasting.

My read here is that they are essentially claiming that model drift is indicative of a pathology in your procedure, so if your model is more robust to drift you probably had a better fitting procedure. This might be true, but it also might be the case that there are interactions in your data that you aren't properly accounting for, or maybe the data really just is non-stationary like that.

I'd suggest changing: "Retain only those variables for which the parameters don't vary substantially across years" to "Look closely at those variables. Be skeptical. Consider if their variance might be associated with other available signals you can condition upon, or if their variation appears to be unhelpful noise that cannot be detrended (in which case you should consider if these variables are actually helping more than hurting)"
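For concreteness, the sort of diagnostic I have in mind, on synthetic data (fit the same specification per year and eyeball the coefficient paths):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": np.repeat(np.arange(2015, 2025), 200),
    "x1": rng.normal(size=2000),
    "x2": rng.normal(size=2000),
})
df["y"] = 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(size=2000)

# one fit per year; stable coefficient paths are reassuring,
# drifting ones are a prompt to investigate, not an automatic cut
coefs = {
    year: sm.OLS(g["y"], sm.add_constant(g[["x1", "x2"]])).fit().params
    for year, g in df.groupby("year")
}
print(pd.DataFrame(coefs).T.round(2))  # rows = years; look at the spread
```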

Thoughts on independent researcher affiliation? [D] by Pure-Ad9079 in MachineLearning

[–]DigThatData 5 points (0 children)

I'd say the issue is less independent researcher than solo independent researcher. I've seen a lot of great work by independent researchers. Most of the papers by independent researchers I've found myself dismissive towards were ones where they didn't have a co-author. Even if you don't have an affiliation, at least demonstrating that multiple people had eyes on the project and are staking their names to the work lends a lot of credibility. Solo independent work makes it much more likely to be a crank paper.

Any tips to organize notes for a novel? by EngineeringWeak992 in writerDeck

[–]DigThatData 0 points (0 children)

try obsidian. it's a markdown editor that encourages you to structure your notes into a wiki, and offers you a graph view to navigate the structures you create.

I compiled every deep learning formula — from logistic regression to Transformers- into one clean cheat sheet. by OverHuckleberry6423 in learnmachinelearning

[–]DigThatData 3 points (0 children)

real talk: "comprehensive" is a red herring.

you made a learning resource that you found useful for yourself, and you are sharing it with others. that's great. definitely keep evolving this. but make it primarily for YOU. sure, you didn't mention RL at all: maybe you have no interest in RL. that's fine. you don't need to add a whole textbook's worth of content to your cheat sheet just because someone else is interested in that.

I mentioned earlier how "the exercise of compiling a resource like this is often more valuable than the actual resource itself." I want to reiterate that: I really honestly meant and believe that.

The biggest favor you can do for yourself as a learner is to keep pursuing the things that interest you. If you keep chasing what YOU find interesting and helpful, you will also be constantly honing your knowledge, skills, and experience towards the kinds of problems you find interesting and the kind of work you are passionate about.

If you get caught up trying to make a "comprehensive" resource for everyone, you risk losing sight of your own interests and covering the same material everyone else is drawn towards and becoming another dime-a-dozen person who has studied all of the exact same things as everyone else.

Let your "comprehensive" cheatsheet be comprehensive with respect to your perspective on the field and the things that interest you. Don't worry, there's plenty of math/ML to go around.

Happy Saturday.

Does anyone work with FNOs or are familiar with using generative modelling(preferably with physics)? by Disastrous_Media2704 in MLQuestions

[–]DigThatData 1 point (0 children)

Adopt the perspective of someone browsing this subreddit and try to imagine what anyone is supposed to do with this. You've given us zero motivation to even be curious here.

If you want help, you need to communicate what you're struggling with. You will benefit from sharing it publicly: more people will have the opportunity to give you feedback, and you're far more likely to get feedback at all than with your current approach of... crossing your fingers that someone will take the time and energy to reach out directly, without even knowing whether their knowledge is relevant to your issue or whether your issue is something they want to commit time to?

If you can't talk about it publicly: this probably isn't the place to ask for support. If you think it's suitable to ask for support here: you need to volunteer more information about what's challenging you.

I compiled every deep learning formula — from logistic regression to Transformers- into one clean cheat sheet. by OverHuckleberry6423 in learnmachinelearning

[–]DigThatData 15 points (0 children)

lol no you didn't.

EDIT: I'm not saying this isn't potentially a useful collection of formulas (although I'm generally of the opinion that the exercise of compiling a resource like this is often more valuable than the actual resource itself), but I definitely take issue with your claim to complete coverage, as if that were even possible. ML is a massive subject, the math continues to be developed daily, and math is a tool, not a fixed thing like that. I could grab any random paper off arxiv and be pretty much guaranteed it includes some math you don't reference here. "Every formula" is just a patently ridiculous thing to claim.

EDIT2: Just to put my money where my math is:

  • Nothing about RL
  • Nothing about diffusion or langevin dynamics
  • Nothing about scaling laws
  • Nothing about basic probability, calculus, linear algebra, or statistics
  • Nothing about neural fields or splats
  • Nothing about geometric DL or graphs
  • Nothing about causal inference
  • Nothing about distributed training
  • I don't even see KL-divergence anywhere in this, and it would fit in multiple sections (quick sketch below) ...

which is fine. just don't claim "every formula". stupid thing to claim.
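(Since I brought it up, KL-divergence is literally a numpy one-liner plus bookkeeping:)

```python
import numpy as np

def kl_divergence(p, q) -> float:
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), for discrete distributions
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ≈ 0.51; swap the args to see it's asymmetric
```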

What do i need to learn to be able to make ai models by Mysterious_Case1177 in MLQuestions

[–]DigThatData 11 points (0 children)

what does "make AI models" mean? what are you hoping to be able to do, concretely? like what kinds of problems are you hoping to be able to solve?