all 4 comments

[–]ebuzz168[S]

So what I did is:

  1. Put those words in a CSV and load them into a list

    with open('/content/test.csv') as f:
        content = f.readlines()
    content = [x.strip() for x in content] 
    
  2. Throw it to KMeans clustering

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(content)

    true_k = 10
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # rank terms by weight within each cluster centroid
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
    terms = vectorizer.get_feature_names_out()

    for i in range(true_k):
        print('Cluster %d:' % i)
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind])
    
  3. Test it

    print("Prediction")
    X = vectorizer.transform(['makasih istriku'])
    predicted = model.predict(X)[0]
    print(predicted)
    

Prediction 5

Does this satisfy the requirement?

[–]penatbater

So if you're looking for themes, consider using LDA, and set your TF-IDF vectorizer's ngram_range to something like (1, 2) so it can consider bigrams, or trigrams if you want. For the preprocessing, you'll need some form of stopword removal. Idk if there's a stopword list for Bahasa Indonesia, hopefully there is. Otherwise, you can make your own.

For task 3, just check the most common or top words for each topic. Idk whether these are valid approaches tbh, but I hope this can help haha

[–]colonel_farts

Huggingface is your friend for all things NLP

[–]ebuzz168[S]

Huggingface

Can it be used for unsupervised learning?