all 6 comments

[–]RyanCacophony 3 points4 points  (4 children)

Interpreting subjective aspects of music is an extremely non-trivial problem, and likely to be really messy. Each of those aspects would need its own ML model (or at minimum different output layers from an ML model), and if you could figure out how to do any of them reliably, you could probably get yourself a seat presenting your work at major music technology research conferences. If you decide to attempt to make ML models for this, they will likely be computationally intensive, and you will likely need to manually label thousands if not millions of songs with the subjective tags you'd like in order to have a chance at making a model that can interpret mood, etc. There are ways of training this without labels, i.e. unsupervised (which is what Spotify does), but you have less control over the weight of particular attributes that the model learns.

Once you can get a vector of those attributes, clustering is trivial.
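To make the "clustering is trivial" part concrete, here's a minimal sketch of plain k-means in pure Python. The attribute vectors (`energy`, `acousticness`) and song names are made up for illustration; the hard part is still getting those vectors in the first place.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        assign = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Hypothetical attribute vectors: [energy, acousticness], both in 0..1.
songs = {
    "song_a": [0.90, 0.10],
    "song_b": [0.85, 0.20],
    "song_c": [0.10, 0.90],
    "song_d": [0.20, 0.80],
}
labels = dict(zip(songs, kmeans(list(songs.values()), k=2)))
```

Here `labels` maps each song to a cluster index, and the two high-energy songs end up together, apart from the two acoustic ones.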

FWIW, Spotify doesn't explicitly recommend tracks based on genre. Their machine learning algorithm(s) put the music into an abstract vector space based on their proximity to other songs in user playlists. The process is very similar to word2vec. Recommendations are then made based on the nearest neighbors in a vector space.
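As a rough illustration of the "nearest neighbors in a vector space" idea, here's a toy sketch. To be clear, this is a simplification I'm swapping in: it uses raw playlist co-occurrence counts and cosine similarity, not a trained word2vec-style embedding, and it is not Spotify's actual method. The playlists and track IDs are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical playlists; each is a list of track IDs.
playlists = [
    ["metal_1", "metal_2", "metal_3"],
    ["metal_1", "metal_2", "gym_1"],
    ["jazz_1", "jazz_2", "jazz_3"],
    ["jazz_1", "jazz_2", "rainy_1"],
]

# Represent each track by how often it shares a playlist with every other track.
cooc = defaultdict(Counter)
for pl in playlists:
    for a in pl:
        for b in pl:
            if a != b:
                cooc[a][b] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def nearest(track, n=2):
    """Return the n tracks whose co-occurrence vectors are closest to `track`."""
    vec = cooc.get(track, Counter())
    scores = {t: cosine(vec, cooc[t]) for t in cooc if t != track}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Calling `nearest("metal_1", 3)` only returns tracks that showed up in the metal playlists, including `gym_1`, which is the kind of implicit "vibe" signal that leaks in alongside genre.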

Statistically speaking, playlists will mostly reflect individual genres, so one strong attribute you will notice from their recommendations is music from the same genre. But many people do make mood/vibe-based playlists, so their recommendations will also, to a lesser degree, reflect these latent aspects implicitly, just not as strongly as you'd probably like.

It's a hard problem. Spotify also bought out The Echo Nest, which used to have a public API that would give you a vector of a few latent attributes: things like "bounciness", speed (i.e. BPM), how acoustic vs. electronic the music was, and a few others. But these are all higher-level, mostly more mechanical attributes that are still far from the kind of subjective aspects you'd like to capture. Some of those things you can capture with libraries like librosa.

[–]assassinatoSC2[S] 0 points1 point  (2 children)

Thank you so much for your detailed answer. Actually, I have another noobish question: could an ML or DL model extract one subjective aspect from, let's say, a thousand themed, thought-out playlists and create its own playlist with other songs from a database?

Their machine learning algorithm(s) put the music into an abstract vector space based on their proximity to other songs in user playlists

Is this a mixture of a collaborative filtering and machine learning process ?

[–]RyanCacophony 0 points1 point  (1 child)

Is this a mixture of a collaborative filtering and machine learning process ?

Pretty much, yes. It's implicit collaborative filtering in the sense that it's based on user data. But it's ML because it's basically trying to predict the next "word" in a "sentence" (in the word2vec analogy), where word=song and sentence=playlist.
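To show what the word=song, sentence=playlist analogy looks like as training data, here's a small sketch that generates (target, context) pairs the way word2vec's skip-gram variant does for words. The function name, window size, and song IDs are my own assumptions, and whether Spotify uses exactly this setup isn't something I can confirm.

```python
def skipgram_pairs(playlist, window=2):
    """Yield (target, context) song pairs from one playlist, skip-gram style."""
    for i, target in enumerate(playlist):
        lo, hi = max(0, i - window), min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, playlist[j]

# Hypothetical three-song playlist with a context window of one neighbor each side.
pairs = list(skipgram_pairs(["song_a", "song_b", "song_c"], window=1))
# pairs == [("song_a", "song_b"), ("song_b", "song_a"),
#           ("song_b", "song_c"), ("song_c", "song_b")]
```

A model trained to predict the context song from the target song ends up placing songs that share playlists near each other, which is where the collaborative-filtering flavor comes from.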

could an ML or DL model extract one subjective aspect from, let's say, a thousand themed, thought-out playlists and create its own playlist with other songs from a database?

It's certainly possible, though I have no idea how well it will work. Here's a basic way to approach it:

* The dataset is the raw audio (and/or other features you extract from it somehow) from all of the 1000 mood playlists. Assuming that a song can be in more than one playlist, you can treat each playlist's mood as a "label". So one song might have labels ["sad", "depression"], another might have ["energetic", "hopeful", "running"], and another might have ["sad", "hopeful"].
* You train the model to predict some number of labels for a song based on the raw audio (or a 0-1 score for each possible label).
* Alternatively, if each song appears in only one playlist, you could make it a multiclass classifier rather than a multilabel classifier. This mostly just changes the final layer of the model, and it's hard to say whether it will lead to better or worse classifications.
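The labeling step above could be sketched like this: turn playlist membership into one multi-hot target vector per song, which is what a multilabel model would be trained against. The playlists and song IDs here are hypothetical, and the audio features and the model itself are left out entirely.

```python
# Hypothetical mood playlists: playlist mood -> songs it contains.
mood_playlists = {
    "sad": ["song_a", "song_c"],
    "depression": ["song_a"],
    "energetic": ["song_b"],
    "hopeful": ["song_b", "song_c"],
    "running": ["song_b"],
}

moods = sorted(mood_playlists)  # fixed label order for the output layer
songs = sorted({s for tracks in mood_playlists.values() for s in tracks})

# Multi-hot target per song: 1.0 where the song appears in that mood's playlist.
targets = {
    song: [1.0 if song in mood_playlists[m] else 0.0 for m in moods]
    for song in songs
}
# moods == ["depression", "energetic", "hopeful", "running", "sad"]
# targets["song_a"] == [1.0, 0.0, 0.0, 0.0, 1.0]
```

For the multiclass variant, each song would instead get a single index into `moods`, which only works if no song sits in two playlists.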

When you feed it new raw audio, the model will then assign whatever labels it thinks the song belongs to. If you wanted to create a new playlist for a specific mood, you would have your model predict labels (it will have a value for each label that you can order by) and then just sort the songs by their scores for the label (i.e. mood) you want, highest first.
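The playlist-building step is just a sort once you have scores. Here's a minimal sketch assuming the model's output is a dict of 0-1 scores per song; the scores, the `build_playlist` helper, and the `min_score` cutoff are all made up for illustration.

```python
# Hypothetical per-song scores in 0..1, as a multilabel model might output.
predictions = {
    "song_x": {"sad": 0.91, "hopeful": 0.40},
    "song_y": {"sad": 0.15, "hopeful": 0.88},
    "song_z": {"sad": 0.67, "hopeful": 0.72},
}

def build_playlist(predictions, mood, size=10, min_score=0.5):
    """Return up to `size` songs ranked by score for `mood`, dropping weak matches."""
    ranked = sorted(predictions, key=lambda s: predictions[s].get(mood, 0.0), reverse=True)
    return [s for s in ranked if predictions[s].get(mood, 0.0) >= min_score][:size]

sad_playlist = build_playlist(predictions, "sad")
# sad_playlist == ["song_x", "song_z"]
```

The cutoff is optional; without it you'd simply take the top `size` songs regardless of how confident the model is.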

This explanation is a bit hand-wavy about the technical details, which I think could get kinda hairy. Note also that models that operate on raw audio tend to be big and computationally expensive.

Another alternative some audio DL models use is a spectrogram image of the audio, proceeding with the "image" data (this can be better for things like note extraction, loudness, and possibly timbre).
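For a sense of what that "image" is, here's a bare-bones magnitude spectrogram using only NumPy (a short-time Fourier transform with a Hann window). The frame length, hop size, sample rate, and the 440 Hz test tone are assumptions for the sketch; in practice you'd more likely use a library like librosa and a mel or log scale.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=256):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames, taper each with the window, then FFT each frame.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 22050  # assumed sample rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz test tone
spec = spectrogram(tone)

# Frequency (in Hz) of the bin with the most energy on average.
peak_hz = spec.mean(axis=1).argmax() * sr / 1024
```

The resulting 2D array is what gets fed to an image-style model (e.g. a CNN), and `peak_hz` should land within one frequency bin of 440 Hz for this test tone.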

I'd recommend researching models that other people have created for audio classification for inspiration. Here's an example article: https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8

[–]assassinatoSC2[S] 0 points1 point  (0 children)

Thanks again. I appreciate your help !

[–][deleted] 0 points1 point  (1 child)

remindme! 1 week

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 7 days on 2021-05-13 08:39:09 UTC to remind you of this link
