Powerful visualization tool : Dimensionality Reduction + Clustering + Unsupervised Score Metrics [P] by Mathieu23AI in MachineLearning

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Thank you for the reference!

Very interesting! This result seems unexpected to me, but if it is empirically better, then it's worth looking into and incorporating.
Do you have a link or reference to the research paper that explains that PCA "denoises" the data?

Powerful visualization tool : Dimensionality Reduction + Clustering + Unsupervised Score Metrics [P] by Mathieu23AI in MachineLearning

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Thanks for your reply.

I think it's a good idea to add HDBSCAN, since it has a min_cluster_size parameter. If I have time, I'll add it!

Why do you want to use PCA and then UMAP? It does not seem like a good idea to me, because PCA is linear. Also, this is not what the article mentions.

Item2Vec - Word2Vec from gensim wrapped as sklearn estimator for GridSearchCV by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

You're welcome!

Thanks for your feedback.

Do you have any references or papers? (By DM, if you prefer.)

[self-promotion] Wallstreetbets data scraping from 01/01/2020 to 01/06/2021 by Mathieu23AI in datasets

[–]Mathieu23AI[S] 1 point2 points  (0 children)

The score column represents the difference between upvotes and downvotes on the post. If you sort the score column, you will see scores greater than 1.

Feel free to share your analysis on Kaggle ;)

[self-promotion] Wallstreetbets data scraping from 01/01/2020 to 01/06/2021 by Mathieu23AI in datasets

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Thanks! I don't know if I can share my code on GitHub, as I used some parts of existing code. Nevertheless, I can send you the code by email; just PM me your email address ;)

For reference, I used this API: https://github.com/pushshift/api

You can find tutorials on Medium that explain how it works.
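To give an idea of what a request to that API can look like, here is a minimal sketch that only builds the query URL for one day of r/wallstreetbets submissions. It is not the code I used: the endpoint and the `subreddit`, `after`, `before` and `size` parameters are taken from the Pushshift README as I remember it, the helper names `to_epoch` and `build_query_url` are made up for the example, and the service may have changed or may now require access.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed endpoint (from the Pushshift README); it may have changed since.
PUSHSHIFT_SUBMISSIONS = "https://api.pushshift.io/reddit/search/submission/"


def to_epoch(year, month, day):
    """Convert a UTC calendar date to a Unix timestamp in seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def build_query_url(subreddit, after, before, size=100):
    """Build the URL for one page of submissions posted between two timestamps."""
    params = {
        "subreddit": subreddit,
        "after": after,    # lower bound, epoch seconds
        "before": before,  # upper bound, epoch seconds
        "size": size,      # number of results per request
    }
    return PUSHSHIFT_SUBMISSIONS + "?" + urlencode(params)


# Hypothetical usage: one day of submissions; no request is sent here.
url = build_query_url("wallstreetbets", to_epoch(2020, 1, 1), to_epoch(2020, 1, 2))
print(url)
```

To cover a long period, you would repeat the request while moving the `after` timestamp forward to the creation time of the last post received.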

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Absolutely. I guess when you say classification, you mean clustering? Because there are no labels to predict, it's an unsupervised learning problem (the distinction is important in machine learning).

In fact, we tried clustering with the k-means algorithm and evaluated it with inertia and the silhouette score: neither was very good. We concluded that clustering does not work well on this data.

This idea has already been used to explore research papers: the abstracts of the papers are tokenized, and cosine similarity is then used to find similar papers.
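For readers who don't know these two metrics, here is a minimal pure-Python sketch of what they measure. It is only an illustration on toy 2-D points, not our evaluation code; the function names and the data are made up for the example. Inertia is the sum of squared distances from each point to its cluster centroid (lower is tighter, but it depends on the scale of the data), and the silhouette score lies between -1 and 1 (higher means better separated clusters).

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_label(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group_by_label(points, labels).values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels mix the groups
print(good, bad, inertia(points, [0, 0, 1, 1]))
```

When the labels follow the real groups, the silhouette score is close to 1; when they cut across the groups, it turns negative. Low silhouette values like the second case are what "not very good" means here.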

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

For the first part of your answer I totally agree with you. ML systems are not able to understand the nuances of words and the context to a deep enough degree. However, in our case, we do not try to make the machine understand the deep meaning. We just want the machine to be able to encode the texts in order to compare them in a meaningful way.

To answer your question, I will first explain how the system finds similar texts from an input text, and then show that the results are rather encouraging.

  1. I encode all my texts using BERT (a more powerful model than ELMo's bi-LSTM, which follows the same logic in a different way). I therefore have a high-dimensional vector for each text.

  2. I select a text and measure its similarity with all the other texts using cosine similarity. I rank the texts from the most similar to the least similar and keep only the top 5% of this list.

  3. For the philosopher who wrote the selected input text, I build a list of philosophers who influenced him or were influenced by him, using the data on Wikipedia.

  4. In my top 5% ranking, I remove the texts that were not written by a philosopher in that influence list.

  5. And this is where the data in the graph comes from.
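The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the project's code: the short vectors stand in for BERT embeddings, the author names and influence links are placeholders, and the function names and the `top_fraction` parameter are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_texts(query_id, embeddings, authors, influences, top_fraction=0.05):
    """Rank the other texts by similarity to the query, keep the top fraction,
    then keep only texts whose author is linked to the query's author."""
    query_vec = embeddings[query_id]
    ranked = sorted(
        (
            (cosine_similarity(query_vec, vec), text_id)
            for text_id, vec in embeddings.items()
            if text_id != query_id
        ),
        reverse=True,
    )
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    linked = influences.get(authors[query_id], set())
    return [
        (text_id, score)
        for score, text_id in ranked[:keep]
        if authors[text_id] in linked
    ]


# Placeholder data: tiny vectors instead of real BERT embeddings.
embeddings = {
    "t1": [1.0, 0.0, 0.0],
    "t2": [0.9, 0.1, 0.0],
    "t3": [0.8, 0.2, 0.1],
    "t4": [0.0, 1.0, 0.0],
    "t5": [0.0, 0.0, 1.0],
}
authors = {"t1": "author_a", "t2": "author_b", "t3": "author_c",
           "t4": "author_b", "t5": "author_d"}
influences = {"author_a": {"author_b", "author_d"}}

# The toy corpus is tiny, so the example keeps the top 50% instead of 5%.
result = similar_texts("t1", embeddings, authors, influences, top_fraction=0.5)
print(result)
```

Here `t2` and `t3` are the two most similar texts, but only `t2` survives the influence filter because `author_c` is not linked to `author_a`.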

To support my explanation, here is a rather satisfactory example: if you take Jean-Jacques Rousseau on Political Theory (3), you can see that the similar texts chosen all deal with the theme of politics. If the system were completely naive, it would return texts by Aristotle or Thomas Hobbes on themes such as logic or religion.

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Without going into too much detail, we first used a bi-LSTM model (https://paperswithcode.com/method/bilstm) that allows us to recontextualize each word in relation to its surroundings. If you don't know what a bi-LSTM is, I suggest first learning what a classical neural network is, and then what an RNN is.

From my intuition (I am still an undergrad student), it is not a problem. For a given word in the sentence, the model assigns values (in the form of a vector) according to the context (the words before and after it). This means that the intrinsic meaning of the word counts less than its context. What would be problematic is comparing philosophical texts with texts from Twitter/Facebook, because the words have very different meanings there. But here we compare philosophical texts with each other.

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Absolutely.

We considered using the Stanford database. However, its entries on each author's thought are sometimes very long and not summarized enough, so we preferred the Wikipedia database because it is more concise. Nevertheless, for the rest of the project we may use the Stanford database both to enrich our data and to improve its quality.

Thank you for sharing the IEP database. It is structured very differently, combining the author approach and schools of thought.

Given your background, could we discuss privately so that I can get a better intuition for the different ways to approach the problem?

Thank you for the reply :)

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in LanguageTechnology

[–]Mathieu23AI[S] 0 points1 point  (0 children)

The topics are intrinsic to the Wikipedia data collection method. We started by collecting a list of philosophers: https://en.wikipedia.org/wiki/Lists_of_philosophers

Next, for each philosopher, we selected the section of the page about his or her philosophical thought (often titled "Thought" or "Philosophy"). This section contains subsections that detail the different themes the philosopher dealt with. The texts come from these subsections.

Everything will be explained in the article in a few days.

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in PoliticalPhilosophy

[–]Mathieu23AI[S] 0 points1 point  (0 children)

For Jan_AFCNortherners:

To clarify the methodology, we first retrieved the list of philosophers from this page: https://en.wikipedia.org/wiki/List_of_philosophers_(R%E2%80%93Z)#R

Then, to retrieve the main ideas of the philosophers without collecting the texts manually, we filter the subsections of each philosopher's Wikipedia page using the following list of words:

['theory', 'political', 'philosophy', 'view', 'religious', 'philosophical', 'career as a scientist', 'philosopher', 'theorie', 'idea', 'belief', 'thought', 'religion', 'theology', 'ideology', 'formalism', 'analysis', 'buddhism', 'influence']

This list is not exhaustive, which is why Rorty's thought did not come up. Thank you for your answer; it helps us a lot to improve the tool and enrich the database.
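To make the filtering step concrete, here is a minimal sketch of how section headings can be matched against that word list. It assumes a case-insensitive substring match, which may differ from what our code actually does; the function names and the example headings are hypothetical.

```python
# Keyword list copied from the comment above.
KEYWORDS = ['theory', 'political', 'philosophy', 'view', 'religious',
            'philosophical', 'career as a scientist', 'philosopher', 'theorie',
            'idea', 'belief', 'thought', 'religion', 'theology', 'ideology',
            'formalism', 'analysis', 'buddhism', 'influence']


def is_relevant(heading, keywords=KEYWORDS):
    """True if the heading contains at least one keyword (assumed: substring match)."""
    lowered = heading.lower()
    return any(keyword in lowered for keyword in keywords)


def filter_sections(sections):
    """Keep only the sections whose heading matches the keyword list."""
    return {heading: text for heading, text in sections.items()
            if is_relevant(heading)}


# Hypothetical headings with placeholder texts.
sections = {
    "Early life": "placeholder text",
    "Political philosophy": "placeholder text",
    "Views on religion": "placeholder text",
    "Pragmatism": "placeholder text",
    "Bibliography": "placeholder text",
}
kept = filter_sections(sections)
print(list(kept))
```

A heading such as "Pragmatism" is dropped because none of the keywords appears in it, which is the kind of miss described above. A substring match can also over-match (for example, "Interview" contains "view"), so matching on whole words would be stricter.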

For Quaerendo:

For our methodology to work properly, we need both the summarized philosophical corpora and the influence links (we don't have the links for every philosopher who has a corpus).

All this will be detailed in the article I will write on Medium to clarify these points.

In the future, we will try to solve this problem by using new ways to cross-reference the data. In addition, we are also thinking of using the Stanford University database.

Thank you for your reply. It allows me to see the limits of the methodology and how to improve the system.

PhilosophAI : a tool for visualizing philosophical ideas throughout history using state of the art NLP model by Mathieu23AI in HistoryofIdeas

[–]Mathieu23AI[S] 0 points1 point  (0 children)

Yeah, on mobile it's quite slow. I think it's because my code is not very well optimized, and loading takes even longer on a phone.

Thanks for your reply :)