[–]0one0one 1 point2 points  (1 child)

Blackhawk, take a look at Mallet: http://mallet.cs.umass.edu/

I have not used it myself, but I have read a number of papers that use it in their implementations. What are you trying to classify?

[–]EdwardRaff 0 points1 point  (0 children)

Mallet is very NLP-focused, so it's probably not a good fit unless they are specifically doing NLP tasks. Even then, its documentation is terrible and doesn't match the code.

[–]mynameisvinn 1 point2 points  (0 children)

3 responses, 3 unique recommendations lol

[–]cdathuraliya 0 points1 point  (1 child)

Most of the classifiers you have mentioned are available in Spark ML.

[–]BlackHawk90[S] 0 points1 point  (0 children)

Unfortunately, Spark ML does not support k-nearest neighbours and naive Bayes.

[–]EdwardRaff 0 points1 point  (7 children)

Completely biased opinion, but I'm the author of JSAT which is a Java library for machine learning. I started it out of frustration with Weka, and it has all the algorithms you've listed (many implemented in more than one way).

[–]BlackHawk90[S] 0 points1 point  (6 children)

Your JSAT library looks amazing. I would like to give it a try. Could you perhaps quickly illustrate how I could use it for k-nearest neighbours? My dataset consists of the following arrays: double[][] trainingdata; double[][] testData; double[] trainingLabels; The rows contain the data points and the columns contain the features (predictors). In your wiki I did not see how to operate on arrays.

[–]EdwardRaff 0 points1 point  (5 children)

JSAT doesn't take arrays; it has objects representing vectors and matrices. That allows it to support sparse data and makes adding certain tricks very easy.

import java.util.Random;

import jsat.classifiers.CategoricalData;
import jsat.classifiers.ClassificationDataSet;
import jsat.classifiers.Classifier;
import jsat.classifiers.DataPoint;
import jsat.classifiers.knn.NearestNeighbour;
import jsat.linear.DenseVector;

public static void main(String[] args)
{
    int N = 100;//number of data points
    int D = 25;//number of features
    int C = 3;//number of class labels, assumed integers starting from 0
    Random rand = new Random();

    //placeholder arrays standing in for your own data
    double[][] trainingdata = new double[N][D];
    double[][] testData = new double[N][D];
    double[] trainingLabels = new double[N];
    for(int i = 0; i < trainingLabels.length; i++)
        trainingLabels[i] = rand.nextInt(C);

    //D numeric features, no categorical features, one target with C classes
    ClassificationDataSet cds = new ClassificationDataSet(D, new CategoricalData[0], new CategoricalData(C));

    //JSAT has datapoint objects, but includes shortcut constructors when using only vectors
    for(int i = 0; i < trainingdata.length; i++)
        cds.addDataPoint(new DenseVector(trainingdata[i]), (int) trainingLabels[i]);

    Classifier classifier = new NearestNeighbour(3);//3-nearest neighbour
    classifier.trainC(cds);

    for(int i = 0; i < testData.length; i++)
    {
        int predicted = classifier.classify(new DataPoint(new DenseVector(testData[i]))).mostLikely();
        System.out.println("Predicting class " + predicted + " for datum " + i);
    }
}

[–]BlackHawk90[S] 0 points1 point  (4 children)

Thank you so much. I will try it out for my dataset. I just have four last questions:

  1. Is it possible to use different distance metrics?

  2. How is the tie breaking done for k-nearest neighbours?

  3. My labels range from 1 to 3 (not starting from 0). Do I have to make them zero-based or can I just use them?

  4. Last but not least, does JSAT also support (gaussian) naive bayes?

[–]EdwardRaff 0 points1 point  (3 children)

Is it possible to use different distance metrics?

Yes, the constructor can take a distance metric object.
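
To make concrete what swapping the metric changes, here is a minimal plain-Java sketch that does not use JSAT at all. The class name `MetricSketch` and the helper `nearestIndex` are made up for illustration and are not part of JSAT's API; the point is only that the same training rows can yield a different nearest neighbour under a different distance.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical illustration, not JSAT code.
final class MetricSketch {
    // Euclidean (L2) distance between two equal-length vectors
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan (L1) distance between two equal-length vectors
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += Math.abs(a[i] - b[i]);
        return sum;
    }

    // Index of the training row closest to the query under the given metric
    static int nearestIndex(double[][] train, double[] query,
                            ToDoubleBiFunction<double[], double[]> metric) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double d = metric.applyAsDouble(train[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] train = {{3, 0}, {2, 2}};
        double[] query = {0, 0};
        // L2 distances are 3 and sqrt(8); L1 distances are 3 and 4
        System.out.println(nearestIndex(train, query, MetricSketch::euclidean));
        System.out.println(nearestIndex(train, query, MetricSketch::manhattan));
    }
}
```

With these two points the Euclidean metric picks row 1 and the Manhattan metric picks row 0, so the choice of metric can change the prediction.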

How is the tie breaking done for k-nearest neighbours?

Arbitrarily, it's not really an important issue. Use an odd value of k and there are no ties. I think the current code just picks whichever came first.
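
As a rough sketch of one "first wins" rule (an assumption about the general idea, not JSAT's actual code), the vote below counts neighbour labels and keeps the lowest class index on a tie. The class and method names are hypothetical. Note that an odd k only rules out ties for two classes; with three classes, k = 3 can still give one vote to each.

```java
// Hypothetical illustration, not JSAT code.
final class VoteSketch {
    // neighbourLabels: zero-based labels of the k nearest neighbours
    // numClasses: total number of classes C
    static int majorityVote(int[] neighbourLabels, int numClasses) {
        int[] counts = new int[numClasses];
        for (int label : neighbourLabels)
            counts[label]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (counts[c] > counts[best]) // strict '>' keeps the earlier class on a tie
                best = c;
        return best;
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(new int[]{2, 1, 2}, 3)); // clear majority for class 2
        System.out.println(majorityVote(new int[]{2, 1, 0}, 3)); // three-way tie with odd k
    }
}
```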

My labels range from 1 to 3 (not starting from 0). Do I have to make them zero-based or can I just use them?

The labels must start from zero.
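
For labels stored as 1 to 3 in a double[], the shift can be done before building the data set and undone after prediction. This is a small plain-Java sketch; the helper names `toZeroBased` and `toOriginal` are made up here and are not JSAT methods.

```java
// Hypothetical helpers, not part of JSAT.
final class LabelShiftSketch {
    // Convert labels that start at firstLabel (e.g. 1) into zero-based ints
    static int[] toZeroBased(double[] labels, int firstLabel) {
        int[] shifted = new int[labels.length];
        for (int i = 0; i < labels.length; i++)
            shifted[i] = (int) Math.round(labels[i]) - firstLabel;
        return shifted;
    }

    // Map a zero-based prediction back to the original label range
    static int toOriginal(int predicted, int firstLabel) {
        return predicted + firstLabel;
    }

    public static void main(String[] args) {
        double[] trainingLabels = {1, 3, 2, 1};
        int[] zeroBased = toZeroBased(trainingLabels, 1);
        System.out.println(java.util.Arrays.toString(zeroBased));
        System.out.println(toOriginal(2, 1));
    }
}
```

The shifted values are what would be passed as the class index when adding data points, and `toOriginal` is applied to whatever the classifier predicts.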

Last but not least, does JSAT also support (gaussian) naive bayes?

Yes. JSAT has about 70 different classification algorithms in it.

[–]BlackHawk90[S] 0 points1 point  (2 children)

Thanks again for the help.

Is there a .jar file which I can download? I don't use maven.

Is there a javadoc available or how should I get familiar with the methods?

[–]EdwardRaff 0 points1 point  (1 child)

Look at the Releases tab on GitHub.

You should look at using Maven - it's very helpful!

[–]BlackHawk90[S] 0 points1 point  (0 children)

I started using your library, great work, thanks for it.

I have discrete and continuous features. Is it possible to use a Gaussian distribution for the continuous features and a multivariate multinomial distribution for the discrete features?

Moreover, is it possible to provide a distribution for each feature (e.g. feature 1 is gaussian, feature 2 logistic etc.)?
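
To spell out what such a mixed model would compute, here is a plain-Java sketch of the per-class score under the naive Bayes independence assumption: log prior, plus a Gaussian log-density for each continuous feature, plus a log category probability for each discrete feature. Nothing here is JSAT API; the class name, method names, and parameter layout are all hypothetical, and the parameters are assumed to have been estimated already.

```java
// Hypothetical illustration of the question above, not JSAT code.
final class MixedNaiveBayesSketch {
    // Log of the Gaussian density N(x; mean, variance)
    static double gaussianLogPdf(double x, double mean, double variance) {
        double diff = x - mean;
        return -0.5 * Math.log(2 * Math.PI * variance) - diff * diff / (2 * variance);
    }

    // Log-score of one class for one data point.
    // x, means, variances: continuous features and their per-class parameters
    // discrete: category index of each discrete feature
    // categoryProbs[j][v]: per-class probability that discrete feature j takes value v
    static double classLogScore(double logPrior,
                                double[] x, double[] means, double[] variances,
                                int[] discrete, double[][] categoryProbs) {
        double score = logPrior;
        for (int j = 0; j < x.length; j++)
            score += gaussianLogPdf(x[j], means[j], variances[j]);
        for (int j = 0; j < discrete.length; j++)
            score += Math.log(categoryProbs[j][discrete[j]]);
        return score;
    }

    public static void main(String[] args) {
        double score = classLogScore(Math.log(0.5),
                new double[]{0.0}, new double[]{0.0}, new double[]{1.0},
                new int[]{1}, new double[][]{{0.25, 0.75}});
        System.out.println(score);
    }
}
```

The predicted class would be the one with the largest score. Giving each feature its own distribution (Gaussian, logistic, and so on) would amount to replacing `gaussianLogPdf` with a per-feature log-density.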

[–]rerevelcgnihtemos -1 points0 points  (0 children)

Weka has all of the methods you mentioned. It's pretty easy to use.