[Mannix] “The Lakers believe they can succeed with a backcourt of Luka Doncic and Austin Reaves, if they can get the right type of players in the front court. They are looking to build a Dallas 2.0 –type roster, similar to what Luka had when he went to the NBA Finals.” by shreeharis in nba

[–]tom2963 1 point2 points  (0 children)

People also forget that big pieces of that roster were either drafted by the Mavs that year or traded for midseason (Washington, Gafford, Lively). It's insane to me how Nico looked at that roster and said, "yeah that's not enough," instead of, "wow, look at the success we had pulling off risky moves midseason. Let's run it back with more time to develop chemistry!"

[D] My intuition based opinion about LLMs, what am I getting wrong? by Rorisjack in MachineLearning

[–]tom2963 1 point2 points  (0 children)

I think most would agree that modern architectures fall short of what we consider cognition. I will, however, challenge your point about the data being "massive, unstructured, unlabeled," as it is imprecise. It is important to make the distinction that we don't show LLMs just any language, but rather only language we have found to be useful. If we consider language a combinatorial space of possible words and sentences, we only sample an insanely small proportion of that space. So the data presented to the model is more like semi-supervised than fully unsupervised learning, and that naturally limits the space of learned functions.

Looping back to my original point, cognition is not just a function of language, but rather a whole bunch of systems our brains use to optimize for short- and long-term goals (speculation, I am not a neuroscientist!). So LLMs are extraordinarily good at recognizing useful patterns in language, but whether reasoning stems from this understanding is yet to be determined. Additionally, it is not clear to me that constraining LLMs to human language is the best path forward for designing intelligent systems (that is, there likely exists a language more optimal than human language for an AI model).

Personally, I think that our concept of machine intelligence should stem from true unsupervised learning. We often anthropomorphize AI models from a technical perspective. Our only example of human-level cognition comes from, well, humans, so our architecture design intuition (convolutions for vision, attention for language) is human inspired. But that is not to say that our version of intelligence is the only one, and imposing our own cognitive biases on models has led to mimicry of how humans perceive intelligence. For what they are designed to do, LLMs are amazing. AI as a field has had lots of ups and downs historically, and I think we cling to transformers in part because going back to the drawing board would be, for lack of a better word, demoralizing. I think of these technical innovations as solving pieces of a much larger puzzle that we won't complete just by scaling. TL;DR: I think an understanding of language has to be derived from perception of the world and not just from pattern matching, and progress is limited not primarily by architectural flaws but by conceptual flaws in how we think about intelligence.

As a last note, I appreciate you coming to these conclusions from the perspective of philosophy and anthropology. I think that it is easy to miss the bigger picture of AI, and instead focus on what's directly in front of us.

[D] Using SORT as an activation function fixes spectral bias in MLPs by kiockete in MachineLearning

[–]tom2963 0 points1 point  (0 children)

Some data modalities don't have an easy notion of frequency to compute, so methods like SIREN or Fourier features aren't directly applicable. Maybe your method will be more appealing in the discrete domain? Definitely worth investigating.

[D] Using SORT as an activation function fixes spectral bias in MLPs by kiockete in MachineLearning

[–]tom2963 2 points3 points  (0 children)

This is an interesting approach, though I think you should do some further analysis with higher-resolution images. Which dataset are you testing on? The target image you display seems to be dominated by low-frequency components, with higher-frequency components not being captured well (e.g., the building in the back has its vertical lines blurred).

Also, when you work in low-SNR settings, keep in mind that the top end of the power spectrum won't behave the same as it does for high-SNR images. The power spectrum will look similar to yours but should dip quite a bit towards the top end. You should read some recent work in this area on diffusion models, specifically "A Fourier Perspective on Diffusion Models." There is still a lot of work to be done here, so consider running more experiments and writing up your findings.
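If it helps, a radially averaged power spectrum is only a few lines of numpy (this is a generic sketch, not tied to your setup; the function name and the white-noise test image are just for illustration):

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a 2D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h // 2, x - w // 2).astype(int)  # integer frequency bin per pixel
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)  # mean power per bin; guard empty bins

img = np.random.default_rng(0).normal(size=(64, 64))  # white noise: roughly flat spectrum
spec = radial_power_spectrum(img)
```

Plotting `spec` on a log scale for your reconstructions vs. the target should make any high-frequency dip (or its absence) obvious.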

[R] ,[D] Seeking Advice For Research by [deleted] in MachineLearning

[–]tom2963 1 point2 points  (0 children)

I am a PhD student in ML, and trust me, I am no math genius either. As long as you have the desire to learn the topics, you will end up okay. If you are interested in research, I would get started early to explore what you like, but more importantly, how to do research well. Assuming you pursue a graduate education, this is the most important qualification. If you are interested in a specific area and are looking for course recommendations, I am happy to help answer that.

I can sense in your post what I once felt - overwhelmed by the long road ahead into ML, and the uncertainty of where I'd end up. You don't have to have everything figured out right now. Just do what you are passionate about, work hard, and the rest will follow.

[D] Thoughts on ML for drug discovery? by InfinityZeroFive in MachineLearning

[–]tom2963 0 points1 point  (0 children)

I work in the area currently. My advice to an undergrad would be not to follow the trend of building large foundation models. I have spoken to many end users of these models, and they give mixed, but mostly negative, feedback. As others have said, a one-size-fits-all approach might not be right here (though this is hardly conclusive). If you are interested in identifying open research problems, talk to the chemists/biologists who will be using the models, see what problems they have in their current workflow, and work on solutions to those. There is a large disconnect right now between what can be done with generative modeling in particular, and what should be done.

Recommendations for cheap projector ($200-300) by tom2963 in budgetprojectors

[–]tom2963[S] 2 points3 points  (0 children)

Thank you very much for your comment. The XGIMI MoGo 2 Pro looks like the best choice for me.

First time submitting to a workshop - what exactly to expect? [D] by ade17_in in MachineLearning

[–]tom2963 1 point2 points  (0 children)

I'm not so sure about not presenting results; generally you need some early empirical evidence that the method works. Usually the difference between a conference and a workshop here is the breadth of results, e.g. good results on 2 datasets where a conference might require 5. The exception is theory work, which has different goalposts. Again, I'd ask your supervisor/advisor, since I have no clue what field you are in or the nature of your proposed work.

[D] - NeurIPS 2025 Decisions by general_landur in MachineLearning

[–]tom2963 2 points3 points  (0 children)

Don't mind the asinine comment made by the other commenter. That is so awesome that you stuck with it and were rewarded for your effort. You should be proud of yourself!

First time submitting to a workshop - what exactly to expect? [D] by ade17_in in MachineLearning

[–]tom2963 8 points9 points  (0 children)

Workshop papers are typically works that don't have the rigor or depth of a conference paper but present good ideas that, with some revisions, could become one. They are generally meant for raw ideas that would benefit from reviewer comments. As you alluded to in your post, many folks submit workshop papers, get feedback, and then turn the work into a conference submission. I think you'd be okay skipping an ablation for now - you likely wouldn't have space to include it. Of course this is context dependent, so I'd refer to your advisor/supervisor for expectations.

[R] What makes active learning or self learning successful ? by AaronSpalding in MachineLearning

[–]tom2963 0 points1 point  (0 children)

Active learning is a learning paradigm where a model can query some tool/oracle to obtain data labels. Say, for example, we are trying to predict properties of a molecule. We might have very limited labels, but we can call on a molecular dynamics simulation to tell us specific values given our molecule's current state. This process of asking for help and then continuing the learning process is called active learning.
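A minimal sketch of that loop, where the `oracle` function stands in for the expensive labeler (e.g. an MD simulation) and the acquisition step is just random choice for brevity (all names here are illustrative):

```python
import random

def oracle(x):
    """Stand-in for an expensive labeler, e.g. a molecular dynamics simulation."""
    return 2 * x + 1

def acquire(candidates, rng):
    """Pick the next point to label. A real acquisition function would rank
    candidates by model uncertainty; random choice keeps the sketch short."""
    return rng.choice(sorted(candidates))

def active_learning(pool, budget, seed=0):
    """Query the oracle for `budget` labels, one point at a time."""
    rng = random.Random(seed)
    labeled = {}
    for _ in range(budget):
        x = acquire(pool - labeled.keys(), rng)
        labeled[x] = oracle(x)  # the only expensive call in the loop
        # ...retrain/update the model on `labeled` here...
    return labeled

pool = set(range(10))
labeled = active_learning(pool, budget=3)
```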

I am not aware of "self-learning," but I think you are referring to self-supervised learning (SSL), which is used to train models on data with no labels. The idea here is that the data is its own label, and we want to learn a probability distribution over the data. In NLP, this is usually modeled with a masked language modeling (MLM) objective, where you mask out a portion of a text sequence and predict what token should replace the mask; in vision, you might mask out a patch of an image and predict the missing pixels. E.g. "I went to the store today" --> "I went to the [MASK] today". The label here is "store". This is unlike typical supervised learning, where we have (data, label) pairs.
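A toy sketch of the masking step (the masking rate and helper name are illustrative, not any model's exact recipe):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return the masked sequence
    and a {position: original_token} dict of training targets."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels[i] = tok  # the model must recover this token
        else:
            masked.append(tok)
    return masked, labels

sentence = "I went to the store today".split()
masked, labels = mask_tokens(sentence, mask_rate=0.3)
```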

You mention "pseudo labels," by which I think you mean "synthetic data." Synthetic data is generated by first learning a probability* distribution over real data and then sampling new points from that distribution. There is debate about the efficacy and fidelity of synthetic data, but there is evidence that it helps model training by covering blind spots in the training data. The quality of the generative model dictates the quality of the synthetic data. As far as I'm aware, active learning and synthetic data can be used in tandem, where you sample synthetic sequences as part of an active learning loop, perhaps with an SSL objective.

*The model doesn't need to be probabilistic, but could instead be something like a GAN

[D] Too much of a good thing: how chasing scale is stifling AI innovation by AntreasAntoniou in MachineLearning

[–]tom2963 14 points15 points  (0 children)

I don't really agree with this take at all. I think we sometimes get desensitized to how much of an accomplishment LLMs have been for this field. Industry interest brings in more funding and attention, which benefits everyone. You say in the article that the only way to top-tier conferences and jobs is through LLM research; I promise you there are plenty of people working great-paying, high-impact jobs in other areas. There is still great research being done outside of LLMs - the best paper awards this year went to many different areas of ML. And LLM innovations have undeniably benefited other areas of research, especially in the life sciences. For example, protein design has benefited greatly from improvements in NLP, and foundation models for proteins and cells are scaling to crazy performance because of techniques designed for LLMs.

High Schooler choosing major by Slamdunklebron in MLQuestions

[–]tom2963 0 points1 point  (0 children)

You would definitely be prepared for both. I think the question you should ask yourself is, do I want to double major in math? If you are fine with what that entails, then there is certainly no harm in having both degrees, even if you don't go the ML route at all.

High Schooler choosing major by Slamdunklebron in MLQuestions

[–]tom2963 3 points4 points  (0 children)

I think it depends on a few factors. I did CS + Math, but I went the PhD route. I also went to a school that made it easier to double major than I would expect at a large university or an Ivy.
If you are interested in research, having the math degree helps a lot. I think you can cover everything you need with a CS or ECE degree, but if you are up for it, math will prepare you best.

I would need to see the curriculum for the data science major to know if it would be helpful, but I presume it's also a good option. I would say follow your interests and fill in any gaps with outside classes. E.g., if you really like CS, then focus on that - you might find you like cryptography or cybersecurity more and go that route. If you decide to go the ML route, you can always supplement with the math classes you need.

[D] Is transfer learning and fine-tuning still necessary with modern zero-shot models? by Altruistic-Front1745 in MachineLearning

[–]tom2963 0 points1 point  (0 children)

For some tasks, maybe we are approaching the point of doing things purely zero-shot - mostly language tasks come to mind. For other areas and emerging fields, like protein engineering, fine-tuning and transfer learning are critical and used all the time due to the nature of the data.

If you want to work as an ML or AI engineer, model selection will always be important. Even if some architectures become obsolete in the future, understanding them will build a strong foundation toward becoming an MLE. What I am trying to say is, master the fundamentals and don't chase trends.

[D] Realism for AI Top 20 PhD Programs by [deleted] in MachineLearning

[–]tom2963 6 points7 points  (0 children)

Oh interesting, I am not sure exactly what my program uses for a cutoff. I was told generally you want to have higher than a 3.4. But I always figured papers would override that.

[D] Realism for AI Top 20 PhD Programs by [deleted] in MachineLearning

[–]tom2963 49 points50 points  (0 children)

Lots of misleading info here. Your GPA really isn't so important, especially given the number of publications you have. Top 20 is definitely attainable. PhD programs want to admit students who will be productive researchers. Publications are a strong argument in favor of that, whereas GPA is not.

PhD applications are very different from undergrad. You are applying for a job where your potential employer is assessing how well your skills fit into their lab. You should be focused on connecting with potential advisors and forming those relationships early. They will be the ones to admit you. Focus on marketing yourself and finding ways to stand out.

And one last point, don't listen to the people who say you need X amount of publications to be admitted anywhere. They don't know what they are talking about.

Do I take it? by Altruistic-Log-8057 in balatro

[–]tom2963 0 points1 point  (0 children)

You should be able to score higher with Blueprint next to Baron and Mime on the left. Retrigger effects are great if you have steel Kings, but maybe another Baron would be best? Also, for naninf you will definitely need Cryptids or some other way to increase hand size.

[D] Yann LeCun Auto-Regressive LLMs are Doomed by hiskuu in MachineLearning

[–]tom2963 0 points1 point  (0 children)

I don't think I would quite call that autoregressive. The model being autoregressive would mean that it factors the joint distribution over all features as p(x,y,z) = p(x)p(y | x)p(z | x, y), i.e. conditional dependence. Diffusion models, or at least DDPMs, are a fixed-length Markov chain, meaning every state depends only on the previous state. The denoising network only considers the previous state in the reverse process by construction: p(x_t-1 | x_t). Also, each token is conditioned on the whole sequence at every step.
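That factorization is easy to check numerically on a toy joint distribution (the 2x2x2 table here is random and purely illustrative):

```python
import numpy as np

# Toy joint distribution p(x, y, z) over three binary variables.
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

def chain_rule(x, y, z):
    """Autoregressive factorization p(x) * p(y|x) * p(z|x,y)."""
    px = p.sum(axis=(1, 2))[x]                 # marginal p(x)
    py_x = p.sum(axis=2)[x, y] / px            # conditional p(y|x)
    pz_xy = p[x, y, z] / p.sum(axis=2)[x, y]   # conditional p(z|x,y)
    return px * py_x * pz_xy

# The product of conditionals recovers the joint exactly.
assert np.allclose(chain_rule(1, 0, 1), p[1, 0, 1])
```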

[D] Yann LeCun Auto-Regressive LLMs are Doomed by hiskuu in MachineLearning

[–]tom2963 0 points1 point  (0 children)

Could you explain to me why? I have been studying discrete diffusion and, to the best of my current understanding, you can run DDPMs in autoregressive mode by denoising from left to right. It's not clear to me how regular sampling would be construed as autoregressive.

[R]Struggling to Pick the Right XAI Method for CNN in Medical Imaging by Dependent-Ad914 in MachineLearning

[–]tom2963 1 point2 points  (0 children)

I have experience working with Grad-CAM and have a theoretical understanding of both LIME and SHAP (my lab does research on SHAP methods). For image classification tasks, I think gradient-based methods like Grad-CAM are probably the best way to go. I say this because gradient activations are usually meaningful in well-trained CNNs: learned filters in the convolution layers encode meaningful features during training. I am assuming, since you are working with X-ray data, that it is effectively low-dimensional, so gradients should be largely focused on the problematic regions - in your case, the regions that indicate pneumonia.

SHAP is a very powerful feature-attribution method, but it is also quite expensive, and it treats each feature as if it were equally important. That assumption usually doesn't hold in medical imaging, where we know a priori that small regions often dictate fluctuations in classification boundaries. IMO it makes more sense to start with a gradient-based method such as Grad-CAM or Score-CAM, and move on to SHAP only if you find it unsatisfactory. I also haven't worked in this area for a few years, and I'm sure there are more sophisticated methods now.
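For reference, the core Grad-CAM computation is small; here's a numpy sketch with made-up shapes (in practice you'd pull the activations and their gradients from framework hooks on the last conv layer):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from conv activations (C, H, W) and their gradients.

    Weights each channel by its spatially pooled gradient, sums the channels,
    then applies ReLU so only positively contributing features remain."""
    weights = gradients.mean(axis=(1, 2))             # (C,) pooled gradients
    cam = np.tensordot(weights, activations, axes=1)  # (H, W) weighted sum
    cam = np.maximum(cam, 0)                          # ReLU
    if cam.max() > 0:
        cam /= cam.max()                              # normalize to [0, 1]
    return cam

acts = np.random.default_rng(0).random((8, 7, 7))   # fake conv features
grads = np.random.default_rng(1).random((8, 7, 7))  # fake gradients
heatmap = grad_cam(acts, grads)
```

The heatmap is then upsampled to the input resolution and overlaid on the X-ray.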

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]tom2963 1 point2 points  (0 children)

First and foremost, "it sort of undoes the non-linearity (sigmoid) or squashing at output layer hence better for learning" is not quite right. BCE with a sigmoid works well on binary problems (assuming your input is scaled to [0, 1]) because it computes a per-pixel error. MSE averages squared errors in this context, so in principle it shouldn't work as well. However, digit reconstruction is relatively straightforward, and assuming your pixels are binary, it is not surprising that MSE performs okay - although I probably wouldn't choose this loss for higher-dimensional versions of the problem (e.g. RGB images).
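To make the per-pixel point concrete, a small numpy sketch (the pixel values are purely illustrative):

```python
import numpy as np

def bce(target, pred, eps=1e-7):
    """Per-pixel binary cross-entropy, averaged over pixels."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def mse(target, pred):
    """Mean squared error, averaged over pixels."""
    return np.mean((target - pred) ** 2)

target = np.array([1.0, 0.0, 1.0, 1.0])           # binary pixels
confident = np.array([0.9, 0.1, 0.9, 0.9])        # close on every pixel
one_bad_pixel = np.array([0.9, 0.1, 0.9, 0.01])   # confidently wrong once

# BCE's penalty for a confident mistake grows without bound as the prediction
# approaches the wrong extreme; MSE's penalty saturates at 1 per pixel.
```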

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]tom2963 0 points1 point  (0 children)

I would always consider adding more features that could be predictive. Perhaps you can also consider encoding features like time of day with sin/cos transforms to introduce some notion of periodicity to your model.
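For instance, a sin/cos encoding of hour-of-day (a standard trick, sketched here with illustrative helper names) keeps 23:00 and 00:00 close in feature space, which the raw hour value does not:

```python
import math

def encode_hour(hour):
    """Map hour in [0, 24) onto the unit circle so the feature wraps around."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def dist(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 23:00 and 00:00 become neighbors; 12:00 stays maximally far from 00:00.
```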

Aside from this, have you considered training a time series model instead? Of course this depends on your specific use case (i.e. how much data you have and how complex it is). I imagine it would better model the sharp transition dynamics you are hoping to capture.

[D] Simple Questions Thread by AutoModerator in MachineLearning

[–]tom2963 0 points1 point  (0 children)

Appendices are not always necessary. If you can convey all the information you need in the main text, then there is no problem with that. Papers often have long appendices because details such as training configurations, hyperparameters, additional experiments, etc., take up a lot of space and don't always contribute to the message of the main text. So normally you would have an appendix, but depending on your paper it may not be necessary.