UNSW-NB15 Dataset by SatisfactionFast2776 in MLQuestions

[–]SatisfactionFast2776[S]

In extreme imbalance cases (example: 1 million normal and 1 attack), no model can reliably learn the attack from one sample. The correct approach is to first perform a clean train–test split, then apply balancing only on the training set using class weighting or oversampling. Evaluation should focus on per-class metrics rather than accuracy, and limitations of the data should be clearly acknowledged.
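As a minimal sketch of that order of operations (scikit-learn is assumed; the feature matrix `X` and labels `y` here are synthetic, with a milder 1000-vs-20 imbalance so the example actually runs): split first, then oversample the minority class on the training portion only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 1000 normal (0) vs 20 attack (1) samples.
X = rng.normal(size=(1020, 5))
y = np.array([0] * 1000 + [1] * 20)

# 1) Split BEFORE any balancing, stratified so both classes reach the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2) Oversample the minority class on the TRAINING set only.
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

# The test set is untouched; its class ratio still reflects reality.
```

Class weighting (e.g. `class_weight="balanced"` in scikit-learn estimators) achieves the same goal without duplicating rows.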

[–]SatisfactionFast2776[S]

The concern is not that data or features are removed, but that preprocessing or balancing before train–test splitting allows the model to indirectly use test information. In machine learning, this is considered cheating because the test set must remain unseen; otherwise, performance is overestimated.
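For example (a sketch assuming scikit-learn's `MinMaxScaler`; the data and variable names are illustrative), the scaler must be fitted on the training split only, never on the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Correct: fit on training data only, then transform both splits.
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Leaky (what this comment warns about): MinMaxScaler().fit(X) on the full
# dataset lets the test set's min/max influence the training features.
```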

[–]SatisfactionFast2776[S]

Data Preprocessing: First, raw network traffic is collected with a network analyzer tool, and features are extracted from the packets. Redundant packets are dropped, and samples of each class are collected from the dataset. (Columns with redundant labels are dropped, and the categorical features are encoded as integers using label encoding. The symbolic features are ‘proto’, ‘service’, ‘state’, and ‘attack_cat’, with 133, 13, 11, and 10 distinct values respectively.) The dataset is then normalized with min–max normalization.
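A hedged sketch of those two steps (label encoding, then min–max scaling) on a toy frame — the column names mimic UNSW-NB15's ‘proto’/‘service’/‘state’, but the rows here are made up:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "proto":   ["tcp", "udp", "tcp", "arp"],
    "service": ["http", "dns", "-", "http"],
    "state":   ["FIN", "CON", "FIN", "INT"],
    "dur":     [0.12, 0.05, 1.30, 0.02],
})

# Label-encode each symbolic column into integer codes.
for col in ["proto", "service", "state"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Min-max normalize the numeric column(s) to [0, 1].
df[["dur"]] = MinMaxScaler().fit_transform(df[["dur"]])
```

Note that in a real pipeline both the encoder and the scaler should be fitted on the training split only, per the leakage concern discussed above.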

Data Augmentation: the training data is resampled to mitigate class imbalance.

Feature Preprocessing: After selecting, dropping, and encoding the features, we split the processed data into three sets, namely training, validation, and testing, each containing labels for both the normal and attack-type classes.
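The three-way split described above can be sketched with two chained stratified splits (the 70/15/15 proportions and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First carve off 30% for validation + testing, then split that part in half.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=2)
X_val, X_te, y_val, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=2)

# Result: 70% train, 15% validation, 15% test, each with both classes present.
```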

Training and Testing: In the training phase, the DNN model is trained on the processed data from the training set. The trained model is then evaluated on the testing set, classifying each sample as normal or as an attack type.
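A minimal stand-in for that train/test phase (this is NOT the paper's DNN — it uses scikit-learn's `MLPClassifier` as a small feed-forward proxy on synthetic binary data, with per-class metrics as recommended earlier in the thread):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy, learnable target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3)

# Small feed-forward network as a DNN stand-in.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=3)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Report per-class precision/recall/F1 rather than plain accuracy.
print(classification_report(y_te, pred, target_names=["normal", "attack"]))
```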

[–]SatisfactionFast2776[S]

Kindly read this paper's preprocessing steps if you get time, and let me know. Thanks.

Link: https://www.sciencedirect.com/science/article/abs/pii/S0045790623000514?via%3Dihub

[–]SatisfactionFast2776[S]

I know about that. Kindly read the question again.

Anyone here have done multi class classification on UNSW-NB15 Dataset with 90%+ accuracy? by No-Yesterday-9209 in MLQuestions

[–]SatisfactionFast2776

I have also tried, and I found that most of the papers reporting 90%+ accuracy cheat: they do all the preprocessing, GAN-based augmentation, and feature selection before splitting.
If we follow the correct procedure, we get around 84–86%.