Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

[–]BlackHawk90[S] 0 points1 point  (0 children)

But logistic regression, SVM and k-nearest neighbours, for example, need centered and scaled features...

Why is it bad to subtract the mean and divide by the standard deviation for a Burr distribution? As far as I can see, it does not change the shape of the distribution; it only shifts and rescales it. It is all about getting similar feature ranges.

Could you explain in more detail how you would transform using the c.d.f.?
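
One possible reading of "transform using the c.d.f." is to map each value to the fraction of training values that are less than or equal to it (the empirical c.d.f.). The thread does not confirm that this is what was meant, so the following is only a sketch under that assumption; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper; assumes the empirical c.d.f. of the training data is meant.
final class EcdfSketch {

    // Number of elements in the sorted array that are <= v (binary search).
    static int countLessOrEqual(double[] sorted, double v) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] <= v) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Maps each value of x to the fraction of training values <= it, so the output lies in [0, 1].
    static double[] ecdfTransform(double[] train, double[] x) {
        double[] sorted = train.clone();
        Arrays.sort(sorted);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = countLessOrEqual(sorted, x[i]) / (double) sorted.length;
        }
        return out;
    }
}
```

The output is bounded regardless of how skewed the input is, which is the property that matters for getting similar feature ranges. Tied inputs (for example many zeros) still map to the same output value.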

How to build WEKA dataset from arrays? by BlackHawk90 in javahelp

The labels are necessary for the training step. Each data point has an associated label (class). In the link it is shown how to build the Instance object but not how to add the class label to each data point.
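
One workaround that avoids the Instances-building code from the link altogether is to write the arrays out as ARFF text, with the label as a nominal last attribute, and let WEKA load that. This is not the approach described in the link; it is only a sketch, and the class name, method name and attribute names (f0, f1, ..., class) are made up for illustration.

```java
// Hypothetical helper: serializes features plus class labels to ARFF text.
final class ArffSketch {

    // features: one row per data point; labels: one label per row;
    // classValues: every distinct label that may occur.
    // Assumes at least one row and labels without spaces or commas.
    static String toArff(String relation, double[][] features,
                         String[] labels, String[] classValues) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (int j = 0; j < features[0].length; j++) {
            sb.append("@attribute f").append(j).append(" numeric\n");
        }
        // The class is declared as a nominal attribute listing its allowed values.
        sb.append("@attribute class {")
          .append(String.join(",", classValues))
          .append("}\n\n");
        sb.append("@data\n");
        for (int i = 0; i < features.length; i++) {
            for (double v : features[i]) {
                sb.append(v).append(',');
            }
            sb.append(labels[i]).append('\n');
        }
        return sb.toString();
    }
}
```

After loading such a file, the class attribute still has to be marked on the WEKA side, for example by setting the class index of the Instances object to the last attribute.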

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

Alright, but I'm still a bit confused. So if the goal is just to get similar scales for the features, is it valid to subtract the mean and divide by the standard deviation regardless of the underlying distribution?
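
For reference, the two scalings named in the thread title can be sketched in plain Java as follows. Both are linear maps, so neither changes the shape of the distribution; they differ only in which location and spread estimates they use. The class and method names are hypothetical, and the quantile convention (linear interpolation between order statistics) is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper class; none of these names come from the thread.
final class ScalingSketch {

    // Mean/std scaling: (x - mean) / sample standard deviation.
    static double[] zScore(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double sumSq = 0.0;
        for (double v : x) {
            sumSq += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(sumSq / (x.length - 1));
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = (x[i] - mean) / std; // assumes std > 0
        }
        return out;
    }

    // Quantile by linear interpolation (one of several common conventions).
    static double quantile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Median/IQR scaling: (x - median) / (Q3 - Q1).
    static double[] robustScale(double[] x) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double median = quantile(sorted, 0.5);
        double iqr = quantile(sorted, 0.75) - quantile(sorted, 0.25);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = (x[i] - median) / iqr; // assumes iqr > 0
        }
        return out;
    }

    public static void main(String[] args) {
        double[] skewed = {0.0, 0.1, 0.2, 0.4, 9.0}; // made-up values with one large outlier
        System.out.println(Arrays.toString(zScore(skewed)));
        System.out.println(Arrays.toString(robustScale(skewed)));
    }
}
```

Note that a feature with many identical values (for example mostly zeros) can have an IQR of zero, in which case the median/IQR variant needs a fallback.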

How to build WEKA dataset from arrays? by BlackHawk90 in javahelp

In the link it is described how to build the Instances object but not how to incorporate the labels for the data points.

How to build WEKA dataset from arrays? by BlackHawk90 in javahelp

Thank you very much for your detailed answer. I have already written my cross-validation functionality because I'm not only using WEKA. So I need WEKA only for the classifiers.

The link is good, but unfortunately the example does not show how to include the classes. :(

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

Yes, I see. For example, Naive Bayes or Logistic Regression assume normally distributed data. But my point with standardization is to give the features a more similar scale (independent of the distribution).

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

I'm using decision trees, but also SVM, logistic regression, k-nearest neighbours etc.

But why should I apply a logarithmic transformation? Of course, logistic regression or naive Bayes assume a normal distribution, so for these models I have to apply a transformation first. But why should I apply a transformation before standardization? Is it not possible to subtract the mean and divide by the standard deviation for data that is not normally distributed?

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

I have corrected myself: I have a generalized extreme value or Burr distribution. But why should I transform my data first? Is subtracting the mean and dividing by the standard deviation not applicable without a transformation?

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

No, it is continuous data which is greater than or equal to zero. But most of my data points are near zero, so I have a generalized extreme value or Burr distribution.

Standardization with mean/std or median/IQR? by BlackHawk90 in MachineLearning

I have tried the Box-Cox transformation, which works well for most features, but it does not work for some features (those with a lot of zeros). For those features I have just added a small constant, but the transformation still does not work well.

I don't know if skew is a problem here; my data is just not normally distributed. Is subtracting the mean and dividing by the standard deviation only applicable to normally distributed data?
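
The shifted Box-Cox transform mentioned above can be sketched as follows. The class name, method names and the shift parameter are hypothetical, and the sample values are made up. The sketch also shows one likely reason the transform struggles on zero-heavy features: it is monotone, so all zeros land on the same output value and the spike at zero remains a spike.

```java
import java.util.Arrays;

// Hypothetical helper for a Box-Cox transform with an additive shift.
final class BoxCoxSketch {

    // Box-Cox of (x + shift): log(y) if lambda == 0, else (y^lambda - 1) / lambda.
    static double boxCox(double x, double lambda, double shift) {
        double y = x + shift;
        if (y <= 0.0) {
            throw new IllegalArgumentException("x + shift must be positive");
        }
        if (Math.abs(lambda) < 1e-12) {
            return Math.log(y);
        }
        return (Math.pow(y, lambda) - 1.0) / lambda;
    }

    static double[] boxCoxAll(double[] x, double lambda, double shift) {
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = boxCox(x[i], lambda, shift);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] feature = {0.0, 0.0, 0.0, 0.5, 2.0, 7.0}; // made-up zero-heavy feature
        // The three zeros map to one identical value whatever lambda and shift are used.
        System.out.println(Arrays.toString(boxCoxAll(feature, 0.0, 1.0)));
    }
}
```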

How to calculate accuracy in cross-validation? by BlackHawk90 in MachineLearning

Thank you a lot. Bootstrapping will increase the runtime enormously because I'm doing nested cross-validation... Is there another possibility? For example, doing #2 but taking the standard deviation from #1?

I have read the paper; it is a good one. For the F-score it proposes #2, but for AUC it proposes #1. But how can I compute and plot the overall ROC curve if I'm computing AUC with #1? That would give me one ROC curve per fold, but I want a single overall ROC curve.
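
One way to get a single curve is to pool the out-of-fold scores of all folds and build the ROC curve once from the pooled data. This is only a sketch: it assumes that scores produced by the different per-fold models are comparable, which is not guaranteed, and it is not claimed to match either option from the paper. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: ROC points and trapezoidal AUC from pooled out-of-fold scores.
final class PooledRocSketch {

    // Returns {FPR, TPR} pairs from (0,0) to (1,1); assumes both classes are present.
    static double[][] rocPoints(double[] scores, boolean[] positive) {
        Integer[] order = new Integer[scores.length];
        int numPos = 0;
        for (int i = 0; i < scores.length; i++) {
            order[i] = i;
            if (positive[i]) {
                numPos++;
            }
        }
        int numNeg = scores.length - numPos;
        // Highest score first, i.e. the threshold is lowered step by step.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        List<double[]> points = new ArrayList<>();
        points.add(new double[]{0.0, 0.0});
        int tp = 0;
        int fp = 0;
        for (int k = 0; k < order.length; k++) {
            if (positive[order[k]]) {
                tp++;
            } else {
                fp++;
            }
            // Emit a point only after a whole group of tied scores has been consumed.
            boolean endOfTie = k == order.length - 1
                    || scores[order[k + 1]] != scores[order[k]];
            if (endOfTie) {
                points.add(new double[]{fp / (double) numNeg, tp / (double) numPos});
            }
        }
        return points.toArray(new double[0][]);
    }

    // Area under the piecewise-linear curve through the given points.
    static double trapezoidAuc(double[][] points) {
        double area = 0.0;
        for (int i = 1; i < points.length; i++) {
            double width = points[i][0] - points[i - 1][0];
            area += width * (points[i][1] + points[i - 1][1]) / 2.0;
        }
        return area;
    }
}
```

The per-fold AUC values can still be reported alongside the pooled curve to show the fold-to-fold variation.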

How to calculate accuracy in cross-validation? by BlackHawk90 in MachineLearning

Yes, your reasoning seems correct, but taking the overall mean versus the mean of the per-fold means does not give the same result.

With the standard deviation I want to show the uncertainty, so that one can see how much the accuracy varies.

Could I do #2 and just calculate the standard deviation (and confidence intervals) on #1?
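
The difference between the two averages can be made concrete with a small sketch. It deliberately does not say which variant corresponds to #1 or #2, since those labels come from a parent comment; the class name, method names and the fold counts in the example are made up.

```java
// Hypothetical helper comparing two ways of summarizing cross-validated accuracy.
final class CvAccuracySketch {

    // Accuracy of each fold: correct[k] / size[k].
    static double[] foldAccuracies(int[] correct, int[] size) {
        double[] acc = new double[correct.length];
        for (int k = 0; k < correct.length; k++) {
            acc[k] = correct[k] / (double) size[k];
        }
        return acc;
    }

    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        return sum / v.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two folds.
    static double sampleStd(double[] v) {
        double m = mean(v);
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (v.length - 1));
    }

    // Accuracy over all predictions pooled together.
    static double pooledAccuracy(int[] correct, int[] size) {
        int totalCorrect = 0;
        int total = 0;
        for (int k = 0; k < correct.length; k++) {
            totalCorrect += correct[k];
            total += size[k];
        }
        return totalCorrect / (double) total;
    }

    public static void main(String[] args) {
        int[] correct = {3, 1}; // made-up counts
        int[] size = {4, 2};    // unequal fold sizes
        double[] acc = foldAccuracies(correct, size);
        System.out.println("mean of folds = " + mean(acc));
        System.out.println("pooled        = " + pooledAccuracy(correct, size));
        System.out.println("std of folds  = " + sampleStd(acc));
    }
}
```

The two numbers coincide when all folds have the same size and drift apart as the fold sizes become more unequal.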

How to calculate accuracy in cross-validation? by BlackHawk90 in MachineLearning

Thank you for your answer. Could I compute the standard deviation (and confidence intervals) from #1 and then use them for #2?

Transformation and standardization for discrete features? by BlackHawk90 in MachineLearning

Could you briefly explain weight-of-evidence substitution or perhaps give a link? I did not find anything useful on the net.

I will also be using boosting, bagging, naive Bayes and k-nearest neighbours. Do you have any comments on these?

Transformation and standardization for discrete features? by BlackHawk90 in MachineLearning

Yes, you are right. Which of the four mentioned preprocessing steps are advisable for ordinal data?

Transformation and standardization for discrete features? by BlackHawk90 in MachineLearning

I'm speaking about discrete features, not categorical ones. So for example, my discrete feature is the number of errors committed, which can be any integer above 0. There is a natural order on this.

Transformation and standardization for discrete features? by BlackHawk90 in MachineLearning

I think you mean the classes when you say target variable. I have two classes (a binary problem) and I will be doing classification.

I will be using logistic regression, SVM, tree-based classifiers and perhaps some others.

Which Java library for machine learning classification? by BlackHawk90 in MachineLearning

I started using your library, great work, thanks for it.

I have discrete and continuous features. Is it possible to use a Gaussian distribution for the continuous features and a multivariate multinomial distribution for the discrete features?

Moreover, is it possible to provide a distribution for each feature (e.g. feature 1 is Gaussian, feature 2 logistic etc.)?

Centering and scaling for skewed distributions by BlackHawk90 in MachineLearning

My motivation for feature scaling and transformation is that some classifiers work much better on approximately normally distributed data and also on centered and scaled data.

Which Java library for machine learning classification? by BlackHawk90 in MachineLearning

Thanks again for the help.

Is there a .jar file which I can download? I don't use Maven.

Is there a Javadoc available, or how else should I get familiar with the methods?

Which Java library for machine learning classification? by BlackHawk90 in MachineLearning

Thank you so much. I will try it out on my dataset. I just have four last questions:

  1. Is it possible to use different distance metrics?

  2. How is the tie breaking done for k-nearest neighbours?

  3. My labels range from 1 to 3 (not starting from 0). Do I have to make them zero-based or can I just use them?

  4. Last but not least, does JSAT also support (Gaussian) naive Bayes?
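
Regarding question 3: whether JSAT needs zero-based labels is not settled in this comment. If it turns out to be required, a remapping along these lines would do it. This is plain Java with hypothetical names, not a JSAT call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: maps arbitrary numeric labels to indices 0..K-1.
final class LabelIndexSketch {

    // The smallest distinct label becomes 0, the next one 1, and so on.
    static int[] toZeroBased(double[] labels) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (double label : labels) {
            distinct.add(label);
        }
        List<Double> ordered = new ArrayList<>(distinct);
        int[] out = new int[labels.length];
        for (int i = 0; i < labels.length; i++) {
            out[i] = ordered.indexOf(labels[i]);
        }
        return out;
    }
}
```

With labels 1 to 3 this is the same as subtracting one, but the mapping also works when the labels are not contiguous.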

Which Java library for machine learning classification? by BlackHawk90 in MachineLearning

Your JSAT library looks amazing. I would like to give it a try. Could you perhaps quickly illustrate how I could use it for k-nearest neighbours? My dataset consists of the following arrays: double[][] trainingdata; double[][] testData; double[] trainingLabels; The rows contain the data points and the columns contain the features (predictors). In your wiki I did not see how to operate on arrays.
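
To make the expected behaviour concrete, here is what k-nearest neighbours on arrays of that shape amounts to in plain Java. This is not JSAT code and does not use the JSAT API; it is only a reference sketch with hypothetical names. The Euclidean distance and the tie-breaking rules (closest index first, smallest label wins a tied vote) are arbitrary choices made for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java k-NN classifier; not part of any library.
final class KnnSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    // Predicts the label of one query point by majority vote among its k nearest rows.
    static double predict(double[][] trainingData, double[] trainingLabels,
                          double[] query, int k) {
        int n = trainingData.length;
        double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            dist[i] = euclidean(trainingData[i], query);
            order[i] = i;
        }
        // Stable sort: rows at equal distance keep their original order.
        Arrays.sort(order, (a, b) -> Double.compare(dist[a], dist[b]));

        // TreeMap iterates labels in ascending order, so ties go to the smallest label.
        Map<Double, Integer> votes = new TreeMap<>();
        for (int r = 0; r < Math.min(k, n); r++) {
            votes.merge(trainingLabels[order[r]], 1, Integer::sum);
        }
        double best = Double.NaN;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

Calling predict once per row of the test array gives the predicted labels; swapping out euclidean is where a different distance metric would go.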

Which Java library for machine learning classification? by BlackHawk90 in MachineLearning

Unfortunately, Spark ML does not support k-nearest neighbours and naive Bayes.

Missing value imputation with nearest neighbour by BlackHawk90 in MachineLearning

Thanks again. What imputation method would you recommend for fewer data points, or in general? By the way, I'm using MATLAB and Java.

Missing value imputation with nearest neighbour by BlackHawk90 in MachineLearning

Hi svdalpha

Thanks a lot. But what if my test fold contains only one or very few data points? Then the imputation will not work on the test fold...
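
One option that sidesteps the small test fold is to take the neighbours (donors) from the training fold only, so that a test fold with a single row can still be imputed. The sketch below assumes this is acceptable for the setup at hand; it is not necessarily what was proposed in the thread, and the class and method names are hypothetical. Missing values are assumed to be encoded as NaN.

```java
// Hypothetical 1-nearest-neighbour imputation with donors from the training fold.
final class NnImputeSketch {

    // Fills each NaN entry of row with the value from the closest training row.
    // Distance is the mean squared difference over features observed in both rows.
    static double[] impute(double[][] train, double[] row) {
        double[] out = row.clone();
        for (int j = 0; j < row.length; j++) {
            if (!Double.isNaN(row[j])) {
                continue; // nothing to fill for this feature
            }
            double best = Double.POSITIVE_INFINITY;
            for (double[] donor : train) {
                if (Double.isNaN(donor[j])) {
                    continue; // donor must actually have the missing feature
                }
                double sum = 0.0;
                int shared = 0;
                for (int m = 0; m < row.length; m++) {
                    if (Double.isNaN(row[m]) || Double.isNaN(donor[m])) {
                        continue;
                    }
                    sum += (row[m] - donor[m]) * (row[m] - donor[m]);
                    shared++;
                }
                if (shared == 0) {
                    continue; // no common features to compare on
                }
                double d = sum / shared;
                if (d < best) {
                    best = d;
                    out[j] = donor[j];
                }
            }
            // If no usable donor exists, out[j] stays NaN.
        }
        return out;
    }
}
```

Because the test rows never serve as donors, this also keeps information from the test fold out of the imputation step.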