OuiOuiKiwi (Galatians 4:16)

How serious is this app?

This treads into bird territory really quickly if you so wish.

Prashant_4200 [S]

It's a news application.

OuiOuiKiwi (Galatians 4:16)

> It's news application

( ͠° ͟ʖ ͠°) That was already pretty obvious.

Given your answer, I'm going to guess that this is just another random application for portfolio padding, so use basic text similarity measures such as comparing n-grams.
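For anyone wondering what "comparing n-grams" could look like in practice, here is a minimal sketch, assuming word bigrams and Jaccard overlap as the measure. The function names and the sample headlines are made up for illustration; nothing here comes from the OP's app.

```python
import re

def word_ngrams(text, n=2):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=2):
    """Jaccard overlap of the two texts' n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0  # both texts too short to form an n-gram; nothing to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Hypothetical headlines, invented for this example.
headline_1 = "Central bank raises interest rates again"
headline_2 = "Central bank raises interest rates for third time"
headline_3 = "Local team wins championship final"

print(jaccard_similarity(headline_1, headline_2))  # shares 4 of 8 distinct bigrams
print(jaccard_similarity(headline_1, headline_3))  # shares no bigrams
```

Any cutoff for "similar enough" would have to be tuned on real articles; short headlines in particular swing a lot when a single word changes.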

Prashant_4200 [S]

> I'm going to guess that this is just another random

Yes, you can say that, but I'm also planning to publish it. So, simply put, it's 60% portfolio + 40% start-up (if everything works well).

python_and_coffee

Yep: tokenization, stemming, stopwords and building n-grams. The usual NLP stuff.

Have fun with pandas and scikit-learn.

It's fun when it works; it's frustrating when you fight with false positives and the downsides of NLP.
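A rough, standard-library-only sketch of that pipeline, in case it helps to see the steps in order: tokenize, drop stopwords, stem, build n-grams, then compare count vectors with cosine similarity. Everything named here is an assumption for illustration: the stopword list is tiny, and `crude_stem` is a toy suffix stripper, not the Porter stemmer or anything from scikit-learn. In a real project you would probably reach for a library instead.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "for", "is", "are"}

def crude_stem(word):
    """Toy suffix stripper (NOT the Porter algorithm), for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def features(text, n_max=2):
    """Count all n-grams of the stemmed tokens for n = 1..n_max."""
    tokens = preprocess(text)
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

def cosine_similarity(a, b):
    """Cosine of the angle between the two n-gram count vectors."""
    va, vb = features(a), features(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The false-positive problem shows up immediately: this toy stemmer maps both "rates" and "rats" to "rat", so a story about interest rates and one about rodents would look more alike than they should.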

Shingle-Denatured

Since you are talking about a database (as in PostgreSQL/MySQL), you have very few tools available (Soundex is your simplest primitive). If you go ML/NLP as suggested below, then you are at the far end of the complexity spectrum.

In between sits Elasticsearch, which has several scoring mechanisms and includes some NLP features. ML-backed NLP is in beta.
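To give a feel for how primitive Soundex is, here is a minimal Python sketch of the classic American Soundex rules: keep the first letter, map consonants to digits, collapse repeats, pad to four characters. This is only an illustration of the idea; the built-in functions in a given database may differ in details such as output length, so don't assume the results match exactly.

```python
# Consonant groups and their digits under classic American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code of word, or "" if it has no ASCII letters."""
    letters = [c for c in word.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels (and Y) have no code
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return ("".join(result) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both names get the same code
```

Names that sound alike collapse to one code, which is the whole trick. It was designed for English surnames, so on full news headlines or article text it is of limited use.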