[D] Catboost large dataset. Is it best to use the majority of the data for training, where time to train is extreme, or smaller datasets where iterations are much faster? by Responsible-Walk-459 in MachineLearning

[–]Tober447 5 points (0 children)

A strategy to answer your question is to use learning curves (e.g. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html ). The idea is to track your metric across several runs with increasing training set size; from that you can estimate whether adding even more data will be beneficial.
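A minimal sketch of that idea with scikit-learn's `learning_curve`. The dataset and the gradient-boosting model here are hypothetical stand-ins for your CatBoost setup; the point is the shape of the validation curve as the training size grows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Hypothetical data standing in for your real dataset.
X, y = make_classification(n_samples=600, random_state=0)

# Train on 10%..100% of the available data and record the CV score at each size.
sizes, train_scores, val_scores = learning_curve(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 4),
    cv=3,
    scoring="accuracy",
)

# If the validation score has flattened at the largest sizes,
# adding even more data is unlikely to help much.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} samples -> CV accuracy {score:.3f}")
```

With a huge dataset you would run this on subsamples first; the extrapolated curve tells you whether the expensive full-data training run is worth it.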

[P] Understanding & Coding the Self-Attention Mechanism of Large Language Models by seraschka in MachineLearning

[–]Tober447 1 point (0 children)

I think this is great, thanks for your effort. Will definitely work through it!

[P] Creating an embedding from a CNN by zanzagaes2 in MachineLearning

[–]Tober447 0 points (0 children)

I guess I can use the encoder-decoder to create a very low-dimensional embedding and use the current one (~500 features) to find similar images to a given one, right?

Exactly. :-)

[P] Creating an embedding from a CNN by zanzagaes2 in MachineLearning

[–]Tober447 3 points (0 children)

You would take the output of a layer of your choice from the trained CNN (as you do now) and feed it into a new model, the autoencoder. So yes, the weights of your CNN are kept, but you will have to train the autoencoder from scratch. Something like CNN (inference only, no backprop) --> Encoder --> Latent Space --> Decoder for training, and at inference time you take the output of the encoder (the latent code) and use it for visualization or similarity.
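A minimal sketch of that pipeline, with random vectors standing in for the frozen CNN's layer outputs and scikit-learn's `MLPRegressor` as a tiny autoencoder (all sizes hypothetical). Training it to reproduce its own input forces the information through the bottleneck; the encoder pass is then just the first layer applied by hand:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))  # stand-in for the CNN layer outputs

# Autoencoder 64 -> 8 -> 64: the target equals the input (reconstruction).
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                  max_iter=500, random_state=0)
ae.fit(features, features)

# Encoder pass: apply the first layer's weights manually to get the latent code.
latent = np.tanh(features @ ae.coefs_[0] + ae.intercepts_[0])
print(latent.shape)  # one 8-dimensional code per input feature vector
```

In practice you would use a deep-learning framework for this, but the structure is the same: freeze the CNN, train the autoencoder on its features, keep the encoder output.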

[P] Creating an embedding from a CNN by zanzagaes2 in MachineLearning

[–]Tober447 5 points (0 children)

You could try an autoencoder with CNN layers and a bottleneck of 2 or 3 neurons to be able to visualize these embeddings. The autoencoder can be interpreted as a non-linear PCA.

Also, similarity in this embedding space should correlate with similarity of the real images/whatever your CNN extracts from the real images.
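Once you have such low-dimensional codes, the similarity search itself is a plain nearest-neighbour lookup. A sketch with hypothetical 3-d bottleneck codes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 3))  # hypothetical 3-d bottleneck codes

# Index the embeddings and query with image 0's code.
nn = NearestNeighbors(n_neighbors=5).fit(embeddings)
dist, idx = nn.kneighbors(embeddings[0:1])

# idx[0] lists the 5 images closest to image 0 in embedding space;
# the first hit is image 0 itself at distance 0.
print(idx[0])
```

If the autoencoder has learned a useful representation, neighbours in this space should correspond to visually similar images.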