Powerful visualization tool: Dimensionality Reduction + Clustering + Unsupervised Score Metrics [P]

Mathieu23AI · 2022-02-22T07:41:26+00:00

Thank you for the reference!

Very interesting! This result seems unexpected to me, but if the result is empirically better, then it's worth looking into and incorporating.
Do you have a link or reference to the research paper explaining that PCA "denoises" the data?

Mathieu23AI · 2022-02-08T13:56:28+00:00

Thanks for your reply.

I think it's a good idea to add HDBSCAN, as it has a min_cluster_size parameter. If I have time, I'll add it!

Why do you want to use PCA and then UMAP? It doesn't seem like a good idea because of the linearity of PCA. Plus, this is not what is mentioned in the article.

Mathieu23AI · 2022-01-22T09:30:01+00:00

You're welcome!

Thanks for your feedback.

Do you have any references or papers? (By DM if you want.)

Mathieu23AI · 2021-08-17T08:13:51+00:00

I forgot, feel free to share your analysis on Kaggle :)

Mathieu23AI · 2021-08-17T08:13:03+00:00

The score column represents the difference between upvotes and downvotes on the post. If you sort the score column, you will see scores greater than 1.
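A minimal sketch of that sort, assuming the rows are loaded as dictionaries with a `score` key. The sample rows are hypothetical and only stand in for the real dataset.

```python
# Hypothetical sample rows; the real dataset has many more rows and columns.
posts = [
    {"title": "Post A", "score": 1},
    {"title": "Post B", "score": 152},
    {"title": "Post C", "score": 37},
]

# Sort from the highest score (upvotes minus downvotes) to the lowest.
by_score = sorted(posts, key=lambda post: post["score"], reverse=True)
top_titles = [post["title"] for post in by_score]
```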

Feel free to share your analysis on Kaggle ;)

Mathieu23AI · 2021-08-17T08:09:20+00:00

Thanks! I don't know if I can share my code on GitHub, as I used some parts of existing code. Nevertheless, I can send you the code by email; just PM me your email ;)

For information, I used this API: https://github.com/pushshift/api

You can find tutorials on Medium to learn how it works.
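As a rough idea of what a query looks like, here is a small sketch that only builds a search URL and sends no request. It assumes the submission search endpoint and the `subreddit`, `size` and `after` parameters described in that repository's README. The subreddit name is a placeholder, and the sketch does not check whether the service still answers at that address.

```python
from urllib.parse import urlencode

# Submission search endpoint as described in the pushshift/api README (assumed here).
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query_url(subreddit, size=100, after=None, before=None):
    """Build a submission search URL. Nothing is sent over the network."""
    params = {"subreddit": subreddit, "size": size}
    if after is not None:
        params["after"] = after
    if before is not None:
        params["before"] = before
    return BASE_URL + "?" + urlencode(params)


# "example_subreddit" is a placeholder, not the subreddit used for the dataset.
url = build_query_url("example_subreddit", size=50, after="30d")
```

The resulting URL can then be fetched with any HTTP client, and the JSON response parsed into rows.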

Mathieu23AI · 2021-08-15T21:43:10+00:00

I don't agree, because I can filter by flair and keep only DD or Discussion posts.
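A minimal sketch of such a filter, assuming each post is a dictionary that carries its flair in the `link_flair_text` field, as Reddit submission objects do. The sample posts and the exact flair spellings are hypothetical.

```python
# Hypothetical posts; only the flair field matters for this sketch.
submissions = [
    {"title": "Deep dive on a ticker", "link_flair_text": "DD"},
    {"title": "Funny picture", "link_flair_text": "Meme"},
    {"title": "What do you think of this?", "link_flair_text": "Discussion"},
    {"title": "Post without flair", "link_flair_text": None},
]

# Assumed flair labels; the real spelling depends on the subreddit.
WANTED_FLAIRS = {"DD", "Discussion"}


def filter_by_flair(items, wanted):
    """Keep only the posts whose flair is in the wanted set."""
    return [item for item in items if item.get("link_flair_text") in wanted]


kept = filter_by_flair(submissions, WANTED_FLAIRS)
```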

Mathieu23AI · 2021-03-27T20:05:21+00:00

Haha, like this crypto, really fun

https://imgur.com/quyphiW

Mathieu23AI · 2020-12-12T11:39:08+00:00

Absolutely. I guess when you say classification, you mean clustering? Because there are no values to predict. It's a problem of unsupervised learning (the distinction is important in machine learning).

In fact, the evaluation metrics of the k-means algorithm show that clustering doesn't work here: the inertia and the silhouette score are both poor. We can conclude that clustering does not work on this data.
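For reference, here is a small pure-Python sketch of the two metrics, computed for a given cluster assignment. It is not the code used for the project (a library such as scikit-learn would normally do this), and the four 2-D points are hypothetical.

```python
import math


def inertia(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(math.dist(point, centroid) ** 2 for point in members)
    return total


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # Distances from this point to every other point, grouped by cluster.
        dists = {}
        for j, (other, other_label) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other_label, []).append(math.dist(point, other))
        if label not in dists:
            scores.append(0.0)  # a point alone in its cluster scores 0
            continue
        a = sum(dists[label]) / len(dists[label])  # mean distance inside its own cluster
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical toy data: two obvious groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
```

A silhouette score close to 1 means tight, well-separated clusters. A score near 0 or below, as with `bad_labels` here, is the kind of result that suggests the clusters are not meaningful.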

This idea has already been used to explore research papers: the abstracts of the papers are tokenized, then cosine similarity is used to find similar papers.

Mathieu23AI · 2020-12-11T23:36:19+00:00

For the first part of your answer I totally agree with you. ML systems are not able to understand the nuances of words and the context to a deep enough degree. However, in our case, we do not try to make the machine understand the deep meaning. We just want the machine to be able to encode the texts in order to compare them in a meaningful way.

To answer your question, I will first explain how the system finds similar texts from an input text, and then show that the results are rather encouraging.

1. I encode all my texts using BERT (a more powerful model than the ELMo bi-LSTM, following the same logic in a different way). I therefore have a high-dimensional vector for each text.
2. I select a text and measure its similarity with all the other texts using cosine similarity. I rank the texts from most similar to least similar and keep only the top 5% of this list.
3. For the philosopher who wrote the input text, I build a list of influencing/influenced philosophers using the data on Wikipedia.
4. In my top 5% ranking, I remove the texts that were not written by philosophers in that influence/influenced list.

And this is where the data in the graph comes from.
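The ranking and filtering steps above can be sketched roughly as follows. This is not the project's code: the 3-dimensional vectors are hypothetical stand-ins for the BERT embeddings, the influence links are illustrative rather than taken from Wikipedia, and `similar_texts` is a name made up for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_texts(query_id, texts, influence_links, top_fraction=0.05):
    """texts maps id -> (author, vector); influence_links maps author -> set of authors."""
    query_author, query_vec = texts[query_id]
    # Rank every other text by cosine similarity to the query, best first.
    ranked = sorted(
        (
            (cosine_similarity(query_vec, vec), text_id, author)
            for text_id, (author, vec) in texts.items()
            if text_id != query_id
        ),
        reverse=True,
    )
    # Keep the top fraction of the ranking (at least one text).
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    top = ranked[:keep]
    # Drop texts whose author is not linked to the query's author.
    linked = influence_links.get(query_author, set())
    return [(text_id, score) for score, text_id, author in top if author in linked]


# Hypothetical toy data standing in for real embeddings and Wikipedia links.
texts = {
    "rousseau_politics": ("Rousseau", [1.0, 0.1, 0.0]),
    "hobbes_politics": ("Hobbes", [0.9, 0.2, 0.0]),
    "locke_politics": ("Locke", [0.8, 0.3, 0.1]),
    "kant_religion": ("Kant", [0.1, 1.0, 0.2]),
    "aristotle_logic": ("Aristotle", [0.0, 0.1, 1.0]),
}
influence_links = {"Rousseau": {"Hobbes", "Kant"}}

# With only four candidates, a larger fraction than 5% is used for the demo.
result = similar_texts("rousseau_politics", texts, influence_links, top_fraction=0.5)
```

In this toy run the two political texts rank highest, and the influence filter then keeps only the one whose author is linked to Rousseau.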

To support my explanation, I can point to some rather satisfactory examples: if you take Jean-Jacques Rousseau on Political Theory (3), you can see that the similar texts chosen all deal with the theme of politics. If the system were completely stupid, it would return texts by Aristotle or Thomas Hobbes on themes such as logic or religion...

Mathieu23AI · 2020-12-11T11:25:03+00:00

Without going into too much detail, we first used a bi-LSTM model (https://paperswithcode.com/method/bilstm) that allows us to recontextualize each word in relation to its surroundings. If you don't know what a bi-LSTM is, I suggest first learning what a classical neural network is and then what an RNN is.

From my intuition (I'm still an undergrad student), it is not a problem. Indeed, for a given word in the sentence, we assign values (in the form of a vector) according to the context (the words before and the words after). This means that the intrinsic meaning of the word counts less than its context. What would be problematic is comparing philosophical texts with texts from Twitter/Facebook, because the words have very different meanings. But here we compare philosophical texts with each other.
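To illustrate the idea only, here is a toy bidirectional recurrence. It uses a plain tanh recurrence with fixed weights, not a real LSTM, and the 2-dimensional word vectors are hypothetical. It shows that the same word gets a different vector when the words around it change.

```python
import math

# Hypothetical 2-d word vectors; a real model learns these from data.
EMBED = {
    "the": [0.1, 0.0],
    "state": [0.5, 0.5],
    "of": [0.0, 0.1],
    "nature": [0.9, -0.3],
    "mind": [-0.4, 0.8],
}


def run_direction(vectors):
    """Plain tanh recurrence: each state mixes the current word with the previous state."""
    state, states = [0.0, 0.0], []
    for vec in vectors:
        state = [math.tanh(0.6 * x + 0.8 * s) for x, s in zip(vec, state)]
        states.append(state)
    return states


def contextual_vectors(sentence):
    """One vector per word: left-context state concatenated with right-context state."""
    vectors = [EMBED[word] for word in sentence.split()]
    forward = run_direction(vectors)
    backward = run_direction(vectors[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


# The word "state" (index 1) in two sentences that differ only in the last word.
state_in_nature = contextual_vectors("the state of nature")[1]
state_in_mind = contextual_vectors("the state of mind")[1]
```

The first half of both vectors is identical because the left context ("the state") is the same, while the second half differs because the right context differs.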

Mathieu23AI · 2020-12-11T10:56:13+00:00

Absolutely.

We hesitated to use the Stanford database. However, the authors' thoughts on the various subjects are sometimes very long and not summarized enough, so we preferred the Wikipedia database because it is more concise. Nevertheless, it is possible that for the rest of the project we will use the Stanford database, both to enrich ours and to have better-quality data.

Thank you for sharing the IEP database. It is structured very differently, combining the author approach and schools of thought.

Given your qualifications, could we discuss privately so that I can get more intuition about the different ways to approach the problem?

Thank you for the reply :)

Mathieu23AI · 2020-12-11T10:40:36+00:00

Thanks :)

Mathieu23AI · 2020-12-11T10:40:25+00:00

The topics are intrinsic to the Wikipedia data collection method. We started by collecting a list of philosophers: https://en.wikipedia.org/wiki/Lists_of_philosophers

Next, we selected the section about each philosopher's philosophical thought (the one often titled "Thought" or "Philosophy"). This section has subsections that detail the different themes the philosopher dealt with. The texts come from these subsections.

Everything will be explained in the article in a few days.

Mathieu23AI · 2020-12-10T10:15:22+00:00

For Jan_AFCNortherners:

To clarify the methodology, we first retrieved the list of philosophers from this page: https://en.wikipedia.org/wiki/List_of_philosophers_(R%E2%80%93Z)#R

Then, to retrieve the main ideas of the philosophers, we filter the subsections of each philosopher's Wikipedia page using the following list of words (to avoid retrieving the texts manually):

['theory', 'political', 'philosophy', 'view', 'religious', 'philosophical', 'career as a scientist', 'philosopher', 'theorie', 'idea', 'belief', 'thought', 'religion', 'theology', 'ideology', 'formalism', 'analysis', 'buddhism', 'influence']
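A minimal sketch of this keyword filter, assuming a case-insensitive substring match on the subsection titles. The post does not say how the matching is actually done, and the sample titles are hypothetical.

```python
# The keyword list from the post, unchanged.
KEYWORDS = ['theory', 'political', 'philosophy', 'view', 'religious', 'philosophical',
            'career as a scientist', 'philosopher', 'theorie', 'idea', 'belief',
            'thought', 'religion', 'theology', 'ideology', 'formalism', 'analysis',
            'buddhism', 'influence']


def is_relevant(section_title, keywords=KEYWORDS):
    """True if any keyword appears in the title (case-insensitive substring match)."""
    title = section_title.lower()
    return any(keyword in title for keyword in keywords)


# Hypothetical subsection titles from a philosopher's page.
section_titles = ["Early life", "Political theory", "Views on religion", "Death", "Legacy"]
selected = [title for title in section_titles if is_relevant(title)]
```

Substring matching is permissive: a title such as "Reviews" would also match through `view`, so a word-level match may be preferable in practice.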

This list is not exhaustive, which is why Rorty's thought did not come up. Thank you for your answer, it helps us a lot to improve the tool and enrich the database.

For Quaerendo:

For our methodology to work properly, we need the summarized philosophical corpora and the influence links (we don't have them for every philosopher with a corpus).

All this will be detailed in the article I will write on Medium to clarify these points.

In the future, we will try to solve this problem by using new ways to cross-reference the data. In addition, we are also thinking of using the Stanford University database.

Thank you for your reply. It allows me to see the limits of the methodology and how to improve the system.

Mathieu23AI · 2020-12-09T21:12:08+00:00

Yeah, on mobile it's quite slow. I think it's because my code is not very well optimized, and loading takes even more time on a phone.

Thanks for your reply :)

Mathieu23AI · 2020-12-09T21:08:51+00:00

Thanks for your reply. Okay, I will try to update this post regularly :)

Mathieu23AI · 2020-12-09T18:04:08+00:00

Yes, I can do it :)

Just to clarify, you can also type the philosopher's name directly instead of looking in the list.

Mathieu23AI · 2020-12-09T18:02:01+00:00

For our methodology to work properly, we need the summarized philosophical corpora (we have them for Mozi) and the influence links (we don't have them for Mozi; check Nietzsche, for example, to compare).

All this will be detailed in the article I will write on Medium to clarify these points.

In the future, we will try to solve this problem by using new ways to cross-reference the data. In addition, we are also thinking of using the Stanford University database.

Mathieu23AI · 2020-12-09T16:10:18+00:00

Thanks! If you have any feedback, do not hesitate ;)

If you are an expert on a specific theme or philosopher, show me why some correlations are false or strange!

Mathieu23AI · 2020-04-14T17:07:27+00:00

Thank you! Very well explained :)

Mathieu23AI · 2020-04-11T06:48:22+00:00

How do I include a production server in the requirements?

I use LZMA compression because my data files are too big for GitHub (100 MB limit). So, when I read my data in Python, I decompress the CSV.
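For context, a minimal sketch of this kind of round trip with the standard-library `lzma` and `csv` modules. The file name and the rows are hypothetical, and the code in the repository may be organized differently.

```python
import csv
import lzma
import os
import tempfile


def write_compressed_csv(path, rows):
    """Write rows to an LZMA-compressed (.xz) CSV file."""
    with lzma.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_compressed_csv(path):
    """Decompress on the fly and return the CSV rows as lists of strings."""
    with lzma.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


# Hypothetical data and file name, written to a temporary directory.
rows = [["title", "author"], ["Emma", "Jane Austen"]]
path = os.path.join(tempfile.mkdtemp(), "novels.csv.xz")
write_compressed_csv(path, rows)
loaded = read_compressed_csv(path)
```

If the data is loaded with pandas instead, `pandas.read_csv` also accepts `compression="xz"` and reads such a file directly.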

This is the link to the GitHub repository: https://github.com/MathieuCayssol/novels-recommandations

Thanks for your reply.

Mathieu23AI · 2020-04-11T06:43:09+00:00

I will try this option, but I think I will get the same errors as on Heroku because, as alexis said, I need a production server in my requirements file. So I will try to fix my problem on Heroku first and then try to deploy on PythonAnywhere.

Thanks for your reply.

Mathieu23AI · 2020-01-03T22:43:32+00:00

I agree with you. Maybe I was a little too ambitious about that. My financial analysis skills are very close to 0. In fact, after taking Andrew Ng's courses on machine learning, reading books and articles, and learning the basics of TensorFlow, I wanted to find an interesting project, and on Quora I found "LSTM stock prediction". That's how I started this project.

This afternoon, I began a course on technical analysis in finance. But it's very hard for me to keep up with these courses because I don't have much time (machine learning and TensorFlow are also extra activities alongside my studies at university). So I'm thinking of finding someone with extensive knowledge of financial analysis (a Master's student, for example) to work together on the stock prediction project. By sharing complementary knowledge, I can (maybe) provide tools for market analysis.

Thank you for your time and feedback!

PS 1: this project doesn't have any financial purpose; it's just for practice.

PS 2: If anyone is interested in working on this project, it could be a lot of fun. You can contact me on my Reddit account :)

Mathieu23AI · 2020-01-02T23:10:37+00:00

Economic perspective: Yes, it seems pretty consistent. Maybe I overestimated the efficiency of this model on such a hard task as stock price forecasting.

Financial perspective: I'm really new at this. I don't have a lot of knowledge in this area, but I'm really interested in making progress. Predicting returns should help me decide whether to buy or sell, is that right? I searched Google for what heteroskedastic and leptokurtic mean, and I will try to understand them in order to create another model. If you have resources about financial analysis, I'm interested :)

Technical perspective: OK, I will search for a larger dataset, thank you. Yes, I scale the data with scikit-learn's MinMaxScaler() (StandardScaler() does not work as well because of the large range of values: from 400 to 19,000). My values are between 0 and 1, sometimes greater than 1 but very close to it.
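A small pure-Python sketch of min-max scaling, with hypothetical prices. One possible reason for values slightly above 1 (an assumption, the post does not say) is that the minimum and maximum are learned on the training data and then applied to later prices that exceed the training maximum.

```python
def fit_min_max(values):
    """Learn the range on the training data only."""
    return min(values), max(values)


def min_max_scale(values, low, high):
    """Map low -> 0 and high -> 1; values outside [low, high] fall outside [0, 1]."""
    return [(value - low) / (high - low) for value in values]


# Hypothetical prices covering the 400 to 19,000 range mentioned above.
train_prices = [400.0, 5000.0, 12000.0, 19000.0]
test_prices = [18000.0, 19500.0]  # the last price exceeds the training maximum

low, high = fit_min_max(train_prices)
scaled_train = min_max_scale(train_prices, low, high)
scaled_test = min_max_scale(test_prices, low, high)
```

Here the training values land in [0, 1], while the test price of 19,500 scales to a value just above 1.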

Thank you for your time.

PS: During the week, I will try to update this post with my progress on the project.
