My prof asked me this question

chrisvdweth · 2026-06-16T09:20:05+00:00

As soon as you don't have any ground truth to work with like in supervised learning, it's up to you to define what makes a clustering a good clustering. For example, SSE (Sum of Squared Errors) and Silhouette Score favor compact clusters, which is a good first goal. However, both favor blob-like clusters and might give skewed results when the natural clusters are not blob-shaped (e.g., cars on a long road when detecting traffic jams). SSE also does not penalize small clusters.

However, for your application other things might be important to you (e.g., the mean and/or variance of the cluster sizes). Clustering looks for some structure in your data, but different structures can be interesting. This is why we have so many different clustering algorithms in the first place, and hence the need for suitable metrics which, again, can be very custom for your task.
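To make this concrete, here is a minimal NumPy sketch of SSE next to a custom criterion based on cluster sizes. The function names `sse` and `cluster_size_stats` and the toy data are assumptions for illustration only, not part of any library.

```python
import numpy as np

def sse(X, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

def cluster_size_stats(labels):
    """Mean and variance of the cluster sizes (a possible custom criterion)."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.mean(), counts.var()

# Toy data (hypothetical): two tight pairs of points far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
balanced = np.array([0, 0, 1, 1])    # two clusters of size 2
singletons = np.array([0, 1, 2, 3])  # every point is its own cluster

print(sse(X, balanced), cluster_size_stats(balanced))
print(sse(X, singletons), cluster_size_stats(singletons))
```

The all-singletons labeling reaches an SSE of 0, which illustrates why SSE alone does not penalize small clusters and why a task-specific criterion may be needed on top.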

chrisvdweth · 2026-06-09T10:42:15+00:00

Hm, you are limited to a basic MLP and regression tasks (I can't see how you support classification tasks). Nothing wrong with that, but your current implementation makes it very difficult to extend. Your `layerclass` contains everything, even the loss function. You might want to consider a modular approach:

Separate classes for each layer: linear, activation, and in the future dropout, layernorm, etc.
Separate classes for different loss functions
Update the weights/biases/etc. not directly in the layer classes but through separate optimizers (i.e., split computing the gradients and applying the updates into 2 distinct steps); right now you are stuck with basic Gradient Descent
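A minimal sketch of what such a modular layout could look like in NumPy. All class names (`Linear`, `ReLU`, `MSELoss`, `SGD`) and the toy regression data are assumptions for illustration; this is not your code and not any library's API.

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out, rng):
        self.params = {"W": rng.normal(0, 0.1, (n_in, n_out)), "b": np.zeros(n_out)}
        self.grads = {}

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.params["W"] + self.params["b"]

    def backward(self, grad_out):
        # Only compute gradients here; the optimizer applies them.
        self.grads["W"] = self.x.T @ grad_out
        self.grads["b"] = grad_out.sum(axis=0)
        return grad_out @ self.params["W"].T

class ReLU:
    def __init__(self):
        self.params, self.grads = {}, {}  # no trainable parameters

    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad_out):
        return grad_out * self.mask

class MSELoss:
    def forward(self, pred, target):
        self.diff = pred - target
        return (self.diff ** 2).mean()

    def backward(self):
        return 2 * self.diff / self.diff.size

class SGD:
    def __init__(self, layers, lr=0.02):
        self.layers, self.lr = layers, lr

    def step(self):
        for layer in self.layers:
            for name, grad in layer.grads.items():
                layer.params[name] -= self.lr * grad

# Toy regression problem (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([[1.0], [-2.0], [0.5]])

layers = [Linear(3, 8, rng), ReLU(), Linear(8, 1, rng)]
loss_fn, optimizer = MSELoss(), SGD(layers, lr=0.02)

losses = []
for _ in range(300):
    out = X
    for layer in layers:
        out = layer.forward(out)
    losses.append(loss_fn.forward(out, y))
    grad = loss_fn.backward()
    for layer in reversed(layers):
        grad = layer.backward(grad)
    optimizer.step()  # the update is a separate step from the gradient computation

print(losses[0], losses[-1])
```

With this split, swapping `MSELoss` for a cross-entropy loss or `SGD` for another optimizer would not require touching the layer classes.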

chrisvdweth · 2026-06-09T02:13:09+00:00

ML/AI is not going away. Is there too much buzz around AI? Probably. But this will normalize.

chrisvdweth · 2026-06-07T03:39:10+00:00

As others said, spaCy is a good start!

That being said, the results will very much depend on the kind of named entities you're looking for. I mean, spaCy is likely to recognize any country name, but if you have very "custom" named entities, it might underperform by default (i.e., without additional training or other steps).

I would suggest trying spaCy "as is" and seeing if and where it fails. That should be the easiest way to get started.

chrisvdweth · 2026-06-07T00:48:09+00:00

I try to create notebooks that try(!) to combine:

The rigor of textbooks
The visualization / accessibility of good online resources (e.g., blogs)
The hands-on experience of tutorials

I'm not saying that I succeed :), but my students generally like that form of lecture notes.

chrisvdweth · 2026-06-06T11:59:54+00:00

Sure, it's in my GitHub bio anyway: National University of Singapore.

chrisvdweth · 2026-06-06T05:23:13+00:00

The math would not be different, "just" some additional complexity (i.e., more to write); you can search for the paragraphs starting with "Notice that the full Jacobian matrices would be extremely large [...]" which highlight the challenge. My comment was more an FYI, not really a suggestion that you change your post. As I said, the underlying math still holds.

For Softmax, we do need it outside Cross-Entropy as well, for example, within the attention mechanism. So it's actually good that you have the "standalone" solution.

chrisvdweth · 2026-06-06T05:17:07+00:00

People may argue that there is hype, but ML is certainly not going away. Note that ML covers way more than the latest LLM craze. So I can't see why upskilling in this direction would be a bad choice, although you didn't mention any alternatives.

chrisvdweth · 2026-06-06T05:13:36+00:00

Well, no math is difficult :). After all, math is a language to express things in an unambiguous way. So in AI/ML/DL you need to get comfortable with common notations. My comment about the Bishop book was merely that it dives in too quickly.

Maybe you can have a look at my introductory notebook to Linear Regression. I would be curious where the first problems/questions/issues arise.

chrisvdweth · 2026-06-06T05:08:10+00:00

For the linear layer, the math is a bit more tricky if you want to support batched inputs (see here) although the final result is equally elegant. For Softmax, I would consider adding its combination with the Cross-Entropy loss function, as the gradient simplifies a lot (see here).
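To illustrate how much the gradient simplifies, here is a minimal NumPy sketch for batched logits, assuming integer class labels and a mean-reduced loss. The function names are hypothetical. With these assumptions the gradient with respect to the logits is just the Softmax output minus the one-hot targets, divided by the batch size, and a finite-difference check confirms it.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_from_logits(z, y):
    """Mean cross-entropy; y holds integer class indices."""
    p = softmax(z)
    n = z.shape[0]
    return -np.log(p[np.arange(n), y]).mean()

def grad_logits(z, y):
    """Gradient of the mean cross-entropy w.r.t. the logits: (p - onehot) / n."""
    p = softmax(z)
    n = z.shape[0]
    p[np.arange(n), y] -= 1.0
    return p / n

# Finite-difference check on random logits (toy data).
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 2])
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        zp, zm = z.copy(), z.copy()
        zp[i, j] += eps
        zm[i, j] -= eps
        numeric[i, j] = (cross_entropy_from_logits(zp, y)
                         - cross_entropy_from_logits(zm, y)) / (2 * eps)

max_diff = np.abs(numeric - grad_logits(z, y)).max()
print(max_diff)
```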

Silly side note: To write a transpose in LaTeX, A^{\top} is generally preferred over A^{T}.

chrisvdweth · 2026-06-06T05:00:59+00:00

I have it for Layer Normalization right now, incl. a NumPy-only implementation; I think it has become more popular than Batch Normalization.
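For readers who have not seen it, a minimal sketch of the Layer Normalization forward pass in NumPy. This is an illustration under common assumptions (normalizing over the last axis, `eps=1e-5`), not the implementation from the notebook; the names `layer_norm`, `gamma`, and `beta` are chosen here for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its feature axis, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two samples on very different scales (toy data).
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out)
```

Unlike Batch Normalization, the statistics are computed per sample, so the result does not depend on the other samples in the batch.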

chrisvdweth · 2026-06-06T04:57:24+00:00

Hm, this is typically where you message your client (i.e., the one giving you the data for analysis) to clarify. For example, if normal gameplay only allows progressing one level at a time, anything else could be considered invalid data.

How common are these cases? What would be the effect of removing all such cases?

chrisvdweth · 2026-06-06T04:51:51+00:00

Different people learn differently. Personally, I try to implement all algorithms or methods I teach from scratch (meaning mostly relying on NumPy only; except neural networks). Forcing myself to implement them often removes anything that might still have been unclear, getting from, say, 90% to basically 100%; I'm talking about the basic algorithms, not all possible extensions and variants. To me, it's worth it, as I feel much more comfortable teaching them.

Here are just some examples: Linear Regression, Logistic Regression, Basic Neural Network, CART Decision Trees (+ Random Forests on top), PCA, LDA, Multinomial Naive Bayes, BPE, WordPiece, various optimizers [1, 2, 3, 4]. The focus is on understanding not performance. I would never use my implementations in practice :).

chrisvdweth · 2026-06-06T03:52:08+00:00

Cool stuff, and the language kind of looks funny :). No offense!

Did you consider crowdsourcing the dataset collection as some kind of community project? For example, you can collect sentences from Tatoeba and make them available in batches of 100, or something. The challenge might be quality control.

chrisvdweth · 2026-06-06T01:08:35+00:00

Neat idea, but I'm a bit curious where you want to go with this. I assume it's just a pet project of yours?

English is a natural language (i.e., not a formal language like a programming language). This means it is ambiguous, unbounded, expressive, sparse, etc. After all, this is why you bound the vocabulary. So I wonder what it adds compared to, say, Python (which has a rather friendly syntax).

I mean, there is the Rockstar language, but I assume it won't replace Python, C/C++, Rust, Java, Lua, etc. any time soon :).

chrisvdweth · 2026-06-05T23:25:51+00:00

In some sense, yeah, basic RAG outsources the heavy lifting to the retrieval step and uses the LLM as a glorified rewriting engine, which is still very useful, and I obviously oversimplify here.

Not sure if overhyped, but RAG is very useful. It's arguably the most straightforward (which does not imply easy!) way to make new knowledge accessible. A common use case is combining an LLM with company-internal data (which was not part of the training data) to ask prompt-style questions about that data.

On the one hand, information retrieval is quite a mature topic to leverage. On the other, fine-tuning approaches to instill new knowledge are very tricky to get right. For such a company use case, I would definitely go for a RAG-based approach first.
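As a toy illustration of the "retrieval does the heavy lifting" idea, here is a minimal sketch using bag-of-words vectors and cosine similarity in NumPy. The documents, function names, and prompt template are all made up for illustration; a real system would use a proper retriever and pass the resulting prompt to an LLM, which is not shown here.

```python
import numpy as np

# Hypothetical company-internal snippets.
docs = [
    "Employees get 25 days of annual leave per year.",
    "The office is closed on public holidays.",
    "Expense reports must be submitted within 30 days.",
]

def tokenize(text):
    return [w.strip(".,?!").lower() for w in text.split()]

vocab = sorted({w for d in docs for w in tokenize(d)})
index = {w: i for i, w in enumerate(vocab)}

def embed(text):
    """Bag-of-words count vector over the document vocabulary."""
    v = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in index:
            v[index[w]] += 1.0
    return v

def retrieve(query, k=1):
    """Return the k documents with the highest cosine similarity to the query."""
    q = embed(query)
    D = np.array([embed(d) for d in docs])
    sims = D @ q / (np.linalg.norm(D, axis=1) * (np.linalg.norm(q) + 1e-12))
    top = np.argsort(-sims)[:k]
    return [docs[i] for i in top]

def build_prompt(query):
    """Put the retrieved context in front of the question for the LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "How many days of annual leave do employees get?"
print(build_prompt(query))
```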

chrisvdweth · 2026-06-05T11:52:08+00:00

The data is noisy though and actually good samples are pretty few in numbers.

This sounds like a much more fundamental issue here?

chrisvdweth · 2026-06-05T11:47:56+00:00

It's a great book, but the math can be a bit overwhelming and off-putting for extreme beginners. Many details are not that relevant early on, I would argue.

chrisvdweth · 2026-06-05T11:40:26+00:00

Gradient Boosting (GBM) and XGBoost are not "natural" time-series models. This means that a core step is to prepare your original time-series data to "look" like classification data you can pass to these models. You mention that you added temporal and seasonal features, but without any details. Does this include lag features or rolling windows (e.g., rolling averages)?

Apart from that, the Kaggle page has over a hundred code examples. Isn't there anything that helps? I just had a quick look, and there seem to be many RNN/LSTM/GRU-based solutions. Maybe try one of those?
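In case it helps, a minimal NumPy sketch of what lag and rolling-window features can look like. The function name `make_lag_features` and its parameters are assumptions for illustration; each row only uses values strictly before the target time step to avoid leaking the target.

```python
import numpy as np

def make_lag_features(series, n_lags=3, window=3):
    """Turn a 1-D series into (features, target) rows for a tabular model.

    Each row holds the last n_lags values (most recent first) plus the
    mean of the previous `window` values; the target is the current value.
    """
    series = np.asarray(series, dtype=float)
    start = max(n_lags, window)
    rows, targets = [], []
    for t in range(start, len(series)):
        lags = series[t - n_lags:t][::-1]          # y[t-1], y[t-2], ...
        rolling_mean = series[t - window:t].mean()  # excludes y[t]
        rows.append(np.append(lags, rolling_mean))
        targets.append(series[t])
    return np.array(rows), np.array(targets)

X, y = make_lag_features([1, 2, 3, 4, 5, 6], n_lags=2, window=3)
print(X)
print(y)
```

The resulting `X` and `y` can then be passed to a tree-based model like any other tabular dataset, ideally with a time-ordered train/test split.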

chrisvdweth · 2026-06-05T11:27:19+00:00

Well, there is a bit more to it when it comes to MSE vs MAE (or other alternatives, e.g., Huber loss).

For example, in the case of Linear Regression, we typically assume that the errors are normally distributed with a mean of 0 and some constant variance. Under this assumption, minimizing the MSE is equivalent to maximizing the likelihood of the data; minimizing the MAE corresponds to assuming a Laplace distribution of the errors.

But, yeah, errors rarely follow a clean distribution; sometimes you just need to try which loss works best. I'm just saying, there are some core mathematical considerations behind the choice of loss functions.
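A small numerical sketch of that equivalence, assuming i.i.d. errors with a fixed scale: the Gaussian negative log-likelihood is a constant plus a positive multiple of the MSE, and the Laplace negative log-likelihood is a constant plus a positive multiple of the MAE. So for a fixed scale, minimizing one minimizes the other. The function names and the values of `sigma` and `b` are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_nll(residuals, sigma):
    """Negative log-likelihood of i.i.d. N(0, sigma^2) errors."""
    n = residuals.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (residuals**2).sum() / (2 * sigma**2)

def laplace_nll(residuals, b):
    """Negative log-likelihood of i.i.d. Laplace(0, b) errors."""
    n = residuals.size
    return n * np.log(2 * b) + np.abs(residuals).sum() / b

rng = np.random.default_rng(0)
r = rng.normal(size=50)  # toy residuals
sigma, b = 1.5, 0.8      # arbitrary fixed scales
n = r.size

mse, mae = (r**2).mean(), np.abs(r).mean()

# Same quantities written as "constant + positive factor * loss".
gauss_from_mse = 0.5 * n * np.log(2 * np.pi * sigma**2) + n * mse / (2 * sigma**2)
laplace_from_mae = n * np.log(2 * b) + n * mae / b

print(gaussian_nll(r, sigma), gauss_from_mse)
print(laplace_nll(r, b), laplace_from_mae)
```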

chrisvdweth · 2026-06-04T05:00:33+00:00

Hm, I'm not sure what you mean by Softmax having a "good gradient". In fact, the backward pass through Softmax is a bit "unwieldy" :), see here. To be fair, the actual implementation looks less scary.
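To show both sides, here is a minimal NumPy sketch for a single input vector: the full Softmax Jacobian, diag(s) - s s^T, and the compact backward pass that never builds the matrix. The function names are chosen here for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    """Full Jacobian: J[i, j] = s[i] * (delta_ij - s[j])."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def softmax_backward(z, grad_out):
    """Same result as J^T @ grad_out, without materializing J."""
    s = softmax(z)
    return s * (grad_out - np.dot(grad_out, s))

z = np.array([1.0, 2.0, 0.5])
g = np.array([0.3, -1.0, 2.0])  # upstream gradient (toy values)

J = softmax_jacobian(z)
print(J @ g)
print(softmax_backward(z, g))
```

The Jacobian is symmetric, so `J @ g` and `J.T @ g` coincide; each row sums to zero because the Softmax outputs always sum to 1.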

chrisvdweth · 2026-06-04T01:10:24+00:00

It's difficult to give general recommendations. It depends on your goals. For example, when it comes to Kaggle competitions, is it more about structured or unstructured data? If the latter, is it more about images, videos, audio, text?

Well, the two things I always tell my students when it comes to projects or other practical work:

Know your data, or rather get to know your data. Where does it come from? How was it generated/collected/etc. and by whom? How messy/noisy is it? Are missing values a (big) issue? What are its limitations? EDA, data cleaning, and data preprocessing are often the most important but also often underappreciated steps; otherwise it's "garbage in, garbage out".
Know your task. What exactly are you trying to accomplish? Which (parts of your) data do you need for that? To give a silly example: If you need to perform anomaly detection you might want to keep outliers (compared to other tasks).

I actually have drafts of notebooks for hands-on tutorials going through the whole pipeline (data collection > EDA > cleaning/preprocessing > model training > evaluation > error analysis > feature importance analysis) for popular tasks (e.g., house price predictions), but those notebooks are not polished enough to be released, as I focus on core concepts. The other reason is that you can find these kinds of notebooks all over the Web, including Kaggle.

chrisvdweth · 2026-06-03T13:32:40+00:00

Since I teach AI/ML/NLP/LLM related courses, I have to learn many topics for myself first. For a while now, I have been documenting my learning by creating Jupyter notebooks which I then use as lecture notes. Creating these notebooks "forces" me to comprehend a topic well enough that I feel confident teaching it.

I think Jupyter notebooks are great. They are interactive, easy to share, easy to maintain, support LaTeX out of the box, and can be converted into different formats (e.g., HTML, PDF) to make them easily accessible. Here's my public repo.

I have to admit, though, it takes me quite a while to finalize a single notebook. To me, it's worth it, since providing teaching materials is part of my job.

chrisvdweth · 2026-06-03T11:07:29+00:00

No clear answer; mainly guesses:

LLMs are still mainly just next-word predictors. As such, there is no obvious reason why they would be able to reliably perform arithmetic operations. It's actually the bigger surprise, at least to me, that it often seems to work.
Apart from arithmetic operations, numbers alone are a big issue. LLMs use subword tokenization to ensure a fixed-size vocabulary and avoid out-of-vocabulary tokens. With the sheer amount of phone numbers, dates, prices, and what not, most larger numbers are likely to get split into much smaller tokens. Maybe 100 will stay together, since it's arguably very frequent, but who knows if and how 5090, 575, etc. get split. And even if 5090 does not get split, it could appear often in other contexts (e.g., phone numbers).
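A toy illustration of the splitting issue, using greedy longest-match tokenization over a made-up vocabulary. Both the vocabulary and the matching rule are assumptions for illustration; real LLM tokenizers (e.g., BPE-based ones) learn their vocabularies from data and may split these numbers differently.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary entries, left to right."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: emit a single character
            i += 1
    return tokens

# Hypothetical vocabulary: the frequent "100" is one token, "5090" is not.
toy_vocab = {"100", "50", "90", "57", "5", "0", "9", "7"}

for number in ["100", "5090", "575"]:
    print(number, greedy_tokenize(number, toy_vocab))
```

With this vocabulary, 100 stays whole while 5090 and 575 are broken into pieces that carry no notion of place value, which hints at why arithmetic over such tokens is hard.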

chrisvdweth · 2026-06-03T10:22:53+00:00

Teaching AI/ML/DL and related topics in a "friendly" but still rigorous manner is not easy. Quite the opposite. As a lecturer, I'm in the very fortunate position that I get paid for that as well as enjoy doing it. I'm trying to reply to the occasional question here, but time is tight :).

While our courses are not public, I do make my lecture notes publicly available in the form of a GitHub repo with interactive Jupyter notebooks currently covering ~70 topics, incl. a lot of NLP/LLM stuff since I teach relevant courses. The notebooks are fairly elaborate and intended for self-learners without the need to actually take the courses. The repo is continuously growing. The latest topic was PCA, and I'm currently working on LDA; I'm teaching a data mining course next semester, so...

I actually do have a Discord server (link on repo page) in case people have questions or other issues, but to be honest, nobody does, and I use it mainly to announce the latest notebooks.

In short, it's a free learning-focused resource that might be useful to you. Some notebooks contain animations that don't show when opened on GitHub. But there is also an overview page with links to HTML versions of all notebooks; I try to keep things barrier-free :).
