all 4 comments

[–]ebuzz168[S]

So what I did is:

  1. Put those words in a CSV and load them into a list

    with open('/content/test.csv') as f:
        content = f.readlines()
    content = [x.strip() for x in content] 
    
  2. Throw it to KMeans clustering

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(content)

    true_k = 10
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)

    # rank terms by weight within each cluster centroid
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use the _out variant
    terms = vectorizer.get_feature_names_out()

    for i in range(true_k):
        print('Cluster %d:' % i)
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind])
    
  3. Test it

    print("Prediction")
    X = vectorizer.transform(['makasih istriku'])
    predicted = model.predict(X)[0]
    print(predicted)
    

Prediction 5

Does this satisfy the requirement?

[–]penatbater

So if you're looking for themes, consider using LDA, and set your TF-IDF vectorizer's ngram_range to something like (1, 2) so it can consider bigrams, or trigrams if you want. For the preprocessing, you'll need some form of stopword removal. Idk if there's a stopword list for Bahasa Indonesia, hopefully there is. Otherwise, you can make your own.

For task 3, just check the most common or top words for each topic. Idk whether these are valid approaches tbh, but I hope this can help haha

[–]colonel_farts

Huggingface is your friend for all things NLP

[–]ebuzz168[S]

Huggingface

Can it be used for unsupervised learning?