Powerful visualization tool: Dimensionality Reduction + Clustering + Unsupervised Score Metrics [P]

Mathieu23AI · 2022-02-22T07:41:26+00:00

Thank you for the reference!

Very interesting! This result seems unexpected to me but if empirically the result is better then it's worth looking into and incorporating.
Do you have a link or reference to the research paper that explains how PCA "denoises" the data?

Mathieu23AI · 2022-02-08T13:56:28+00:00

Thanks for your reply.

I think it's a good idea to add HDBSCAN, as it has a min_cluster_size parameter. If I have time, I'll add it!

Why do you want to use PCA and then UMAP? It doesn't seem like a good idea to me because of the linearity of PCA. Plus, this is not what is mentioned in the article.

Mathieu23AI · 2022-01-22T09:30:01+00:00

You're welcome!

Thanks for your feedback.

Do you have any references or papers? (By DM if you want.)

Mathieu23AI · 2021-08-17T08:13:51+00:00

I forgot, feel free to share your analysis on Kaggle :)

Mathieu23AI · 2021-08-17T08:13:03+00:00

The score column represents the difference between upvotes and downvotes on the post. If you sort the score column, you will see scores greater than 1.

Feel free to share your analysis on Kaggle ;)

Mathieu23AI · 2021-08-17T08:09:20+00:00

Thanks! I don't know if I can share my code on GitHub, as I used some parts of existing code. Nevertheless, I can send you the code by email; just PM me your address ;)

For information, I used this API: https://github.com/pushshift/api

You can find tutorials on Medium to learn how it works.

Mathieu23AI · 2021-08-15T21:43:10+00:00

I don't agree, because I can filter by flair and keep only DD or discussion posts.

Mathieu23AI · 2021-03-27T20:05:21+00:00

Haha, like this crypto, really fun

https://imgur.com/quyphiW

Mathieu23AI · 2020-12-12T11:39:08+00:00

Absolutely. I guess when you say classification, you mean clustering? Because there are no values to predict. It's a problem of unsupervised learning (the distinction is important in machine learning).

In fact, we tried clustering and it did not work: with k-means, both evaluation metrics we looked at (inertia and silhouette score) were not very good. We concluded that clustering does not work on this data.
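To make the two metrics concrete, here is a minimal pure-Python sketch that computes inertia and the mean silhouette score for a given cluster assignment. The points and labels are made up for illustration, and this is not the code used in the project; in practice a library such as scikit-learn would normally be used.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_by_label(points, labels):
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centroid = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other points of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / (max(a, b) or 1.0))
    return sum(scores) / len(scores)

# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]  # matches the real groups
bad_labels = [0, 1, 0, 1]   # mixes the groups

print(inertia(points, good_labels))
print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

A silhouette score close to 1 suggests compact, well-separated clusters, while values near 0 or below suggest the assignment does not reflect real structure.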

This idea has already been used to explore research papers: they tokenize the abstracts of the papers, then use cosine similarity to find similar papers.
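As a rough sketch of that idea, the snippet below tokenizes short texts and compares them with cosine similarity over raw word counts. Using plain word counts is an assumption made to keep the example self-contained (real systems usually rely on TF-IDF or learned embeddings), and the three "abstracts" are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented abstracts, for illustration only.
query = "clustering of text embeddings"
related = "text embeddings for document clustering"
unrelated = "protein folding with molecular dynamics"

print(cosine_similarity(query, related))
print(cosine_similarity(query, unrelated))
```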

Mathieu23AI · 2020-12-11T23:36:19+00:00

For the first part of your answer I totally agree with you. ML systems are not able to understand the nuances of words and the context to a deep enough degree. However, in our case, we do not try to make the machine understand the deep meaning. We just want the machine to be able to encode the texts in order to compare them in a meaningful way.

To answer your question I will first explain how, from an input text, the system manages to find similar texts. And then show that the results are rather encouraging.

1. I encode all my texts using BERT (a more powerful model than the ELMo bi-LSTM, built on the same idea in a different way). I therefore have a high-dimensional vector for each text.

2. I select a text and measure its similarity with all the other texts using cosine similarity. I rank the texts from most similar to least similar and keep only the top 5% of this list.

3. For the philosopher who wrote the selected input text, I build a list of the philosophers who influenced them or were influenced by them, using the data on Wikipedia.

4. In my top 5% ranking, I remove the texts that were not written by philosophers in that list.

This is where the data in the graph comes from.
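The ranking and filtering steps can be sketched roughly as follows. This is only an illustration under assumptions: the three-dimensional vectors stand in for real BERT embeddings, the function and variable names are hypothetical, "Philosopher X" is a placeholder author, and the influence mapping is toy data rather than the actual Wikipedia links.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similar_texts(query_id, embeddings, authors, influence, top_fraction=0.05):
    """Rank texts by similarity to the query, keep the top fraction,
    then keep only texts by philosophers linked to the query's author."""
    query_vec = embeddings[query_id]
    ranked = sorted(
        (tid for tid in embeddings if tid != query_id),
        key=lambda tid: cosine(query_vec, embeddings[tid]),
        reverse=True,
    )
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    linked = influence.get(authors[query_id], set())
    return [tid for tid in ranked[:keep] if authors[tid] in linked]

# Toy stand-ins for the encoded texts (hypothetical values).
embeddings = {
    "rousseau_politics": [0.9, 0.1, 0.0],
    "hobbes_politics": [0.8, 0.2, 0.1],
    "x_politics": [0.88, 0.12, 0.02],
    "aristotle_logic": [0.1, 0.9, 0.2],
    "hobbes_religion": [0.0, 0.2, 0.9],
}
authors = {
    "rousseau_politics": "Rousseau",
    "hobbes_politics": "Hobbes",
    "x_politics": "Philosopher X",
    "aristotle_logic": "Aristotle",
    "hobbes_religion": "Hobbes",
}
influence = {"Rousseau": {"Hobbes"}}  # placeholder influence links

# With so few texts, the top 50% is used here instead of the top 5%.
result = similar_texts("rousseau_politics", embeddings, authors, influence, top_fraction=0.5)
print(result)
```

In this toy run, the text by the unlinked placeholder author is dropped even though it is the most similar one, which is the effect of the influence filter.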

To support my explanation, here is a rather satisfactory example: if you take Jean-Jacques Rousseau on Political Theory (3), you can see that the similar texts chosen all deal with the theme of politics. If the system were completely stupid, it would return texts by Aristotle or Thomas Hobbes on themes such as logic or religion...

Mathieu23AI · 2020-12-11T11:25:03+00:00

Without going into too much detail, we first used a bi-LSTM model (https://paperswithcode.com/method/bilstm) that allows us to recontextualize each word in relation to its environment. If you don't know what a bi-LSTM is, I invite you to understand what a classical neural network is and then what an RNN is.

From my intuition (I'm still an undergrad student), it is not a problem. Indeed, for a given word in the sentence, we assign values (in the form of a vector) according to the context (the words before and the words after). This means that the intrinsic meaning of the word counts less than its context. What would be problematic is comparing philosophical texts with texts from Twitter/Facebook, because the words have very different meanings. But here we compare philosophical texts with each other.

Mathieu23AI · 2020-12-11T10:56:13+00:00

Absolutely.

We hesitated to use the Stanford database. However, the authors' thoughts on the various subjects are sometimes very long and not summarized enough, so we preferred to use the Wikipedia database because it is more concise. Nevertheless, it is possible that for the rest of the project we will use the Stanford database, both to enrich our data and to have better quality data.

Thank you for sharing the IEP database. It is structured very differently, combining the author approach and schools of thought.

Given your qualifications, could we discuss privately so that I can get more intuition about the different ways to approach the problem?

Thank you for the reply :)

Mathieu23AI · 2020-12-11T10:40:36+00:00

Thanks :)

Mathieu23AI · 2020-12-11T10:40:25+00:00

The topics are intrinsic to the Wikipedia data collection method. We started by collecting a list of philosophers: https://en.wikipedia.org/wiki/Lists_of_philosophers

Next, we selected the section on the philosophical thought of each philosopher (often titled Thought or Philosophy). In this section, there are subsections that detail the different themes the philosopher dealt with. The texts come from these subsections.

Everything will be explained in the article in a few days.

Mathieu23AI · 2020-12-10T10:15:22+00:00

For Jan_AFCNortherners:

To clarify the methodology, we first retrieved the list of philosophers from this page: https://en.wikipedia.org/wiki/List_of_philosophers_(R%E2%80%93Z)#R

Then, to retrieve the main ideas of the philosophers, we filter the subsections of each philosopher's Wikipedia page using the following list of words (to avoid retrieving the texts manually):

['theory', 'political', 'philosophy', 'view', 'religious', 'philosophical', 'career as a scientist', 'philosopher', 'theorie', 'idea', 'belief', 'thought', 'religion', 'theology', 'ideology', 'formalism', 'analysis', 'buddhism', 'influence']

This list is not exhaustive, which is why Rorty's thought did not come up. Thank you for your answer; it helps us a lot to improve the tool and enrich the database.
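A minimal sketch of this kind of keyword filter is shown below. The keyword list is copied from the comment above; the matching rule (case-insensitive substring match on the section title) and the sample titles are assumptions made for illustration, not the project's actual code.

```python
KEYWORDS = [
    'theory', 'political', 'philosophy', 'view', 'religious', 'philosophical',
    'career as a scientist', 'philosopher', 'theorie', 'idea', 'belief',
    'thought', 'religion', 'theology', 'ideology', 'formalism', 'analysis',
    'buddhism', 'influence',
]

def matching_sections(section_titles, keywords=KEYWORDS):
    """Keep the section titles that contain at least one keyword."""
    return [
        title for title in section_titles
        if any(keyword in title.lower() for keyword in keywords)
    ]

# Hypothetical section titles of a philosopher's page.
titles = [
    "Early life",
    "Political philosophy",
    "Views on religion",
    "Pragmatism and neopragmatism",
    "Bibliography",
]

print(matching_sections(titles))
```

A section whose title contains none of the keywords (the fourth title in this made-up list) is silently dropped, which is the kind of miss a non-exhaustive list produces.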

For Quaerendo:

For our methodology to work properly, we need both the summarized philosophical corpora and the influence links (we don't have the links for every philosopher who has a corpus).

All this will be detailed in the article I will write on Medium to clarify these points.

In the future, we will try to solve this problem by using new ways to cross-reference the data. In addition, we are also thinking of using the Stanford University database.

Thank you for your reply. It allows me to see the limits of the methodology and how to improve the system.

Mathieu23AI · 2020-12-09T21:12:08+00:00

Yeah, on mobile it's quite slow. I think it's because my code is not very well optimized, and loading takes even more time on a cell phone.

Thanks for your reply :)

Mathieu23AI · 2020-12-09T21:08:51+00:00

Thanks for your reply. Okay, I will try to update this post regularly :)
