I am at a crossroads about what to do next. I set out to learn and apply some machine learning algorithms to a complicated dataset, and I have now done that. My plan from the very beginning was to combine two classifiers into a multi-classifier system.
But here is where I am stuck. I chose a clustering algorithm (Fuzzy C-Means, after learning from some sample K-means code) and Naive Bayes as the two candidates for the MCS (Multi-Classifier System).
I can use both independently to classify the data but I am struggling to combine the two in a meaningful way.
For instance, the fuzzy clustering catches almost all "Smurf" attacks except, usually, for one. I am not sure why it doesn't catch this oddball; all I know is that it doesn't. One of the clusters will be dominated by the smurf attacks, and usually I will find just one smurf in the other clusters. This is where I run into the problem scenario: if I train the Bayes classifier on all the different attack types (smurf, normal, neptune, etc.) and apply it to the remaining clusters in an attempt to find that last smurf, it will have a high false alarm rate.
I'm not sure how to proceed. I don't want to take the other attacks out of the training set, but I only want to train the Bayes classifier to spot "Smurf" attacks. At the moment it is trained to try to spot everything, and I think (though I'm not sure) that accuracy drops as a result.
So this is my question: when using the naive Bayes classifier, how would you get it to look only for smurf and categorise everything else as "Other"?
rows = 1000;
columns = 6;
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
data = fulldata(indX, indY)
indX1 = randperm( size(fulldata,1) );
indX1 = indX1(1:rows)';
%% apply normalization method to every cell
%data = zscore(data);
training_data = data;
target_class = labels(indX,:)
class = classify(test_data,training_data, target_class, 'diaglinear')
confusionmat(target_class,class)
What I was thinking of doing was manually changing target_class so that all the normal traffic and all the attacks that aren't smurf are labelled "other". Then, as I already know that FCM correctly classifies all but one smurf attack, I just have to use the naive Bayes classifier on the remaining clusters.
For instance:
Cluster 1 = 500 smurf attacks (repeating this step might shift the "majority" of smurf attacks from the 1000 samples into a different cluster, so I have to check or iterate through the clusters to find the biggest one; once found, I can exclude it from the naive Bayes classifier stage)
Then I test the classifier on each remaining cluster. I'm not sure how to write loops etc. in MATLAB yet, so at the moment I have to pick the clusters manually during processing.
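The "find the biggest cluster, then classify every other cluster" step can be written as a loop. The following is only a minimal sketch on toy data: the two-feature random samples, the hand-made assignment vector y (standing in for the result of [~,y] = max(U)), and the names counts, biggest, members, predicted and smurfHits are all assumptions made for illustration. It also assumes, as described above, that the largest cluster is the smurf-dominated one, and it needs the same toolbox that provides classify.

```matlab
% Toy stand-ins for the variables used in the post (assumed, not real data)
rng(1);                                               % reproducible toy data
training_data = [randn(30,2) + 4; randn(30,2)];       % 30 "smurf" rows, 30 "other" rows
target_class  = [repmat({'smurf'}, 30, 1); repmat({'other'}, 30, 1)];
data = [randn(50,2) + 4; randn(50,2)];                % the 100 clustered samples
y = [ones(1,50), 2*ones(1,20), 3*ones(1,20), 4*ones(1,10)];  % stand-in for [~,y] = max(U)
clusters = 4;

% Count the members of each cluster and pick the largest one
counts = accumarray(y(:), 1, [clusters 1]);
[~, biggest] = max(counts);

% Run the classifier on every cluster except the largest
smurfHits = zeros(clusters, 1);
for k = setdiff(1:clusters, biggest)
    members = find(y == k);                           % rows assigned to cluster k
    if isempty(members), continue; end                % skip empty clusters
    predicted = classify(data(members,:), training_data, target_class, 'diaglinear');
    smurfHits(k) = sum(strcmp(predicted, 'smurf'));   % rows flagged as smurf
    fprintf('cluster %d: %d rows, %d flagged as smurf\n', k, numel(members), smurfHits(k));
end
```

With the variables from the post, data(members,:) would presumably become fulldata(indX(members), indY), so that each cluster is mapped back to its original rows.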
clusters = 4;
CM = colormap(jet(clusters));
options(1) = 12.0;
options(2) = 1000;
options(3) = 1e-10;
options(4) = 0;
[centers, U, objFun] = fcm(data, clusters, options); % cluster 1000 sample data rows
[~,y] = max(U); % assign each row to its highest-membership cluster (U only exists after fcm has run)
training_data = newTrainingData(indX1,indY); % this is the numeric data
test_data = fulldata(indX(y==2),indY); % this is cluster 2 from the FCM phase which will be classified (same columns as training_data, which classify requires)
test_class = labels(indX(y==2),:); % true labels for cluster 2; thanks to amro, this lets the confusion matrix give an unbiased detection rate
target_class = labels(indX,:) % labels for the training_data: only the smurf attacks keep their name, everything else is classed as "other"
class = classify(test_data,training_data, target_class, 'diaglinear')
confusionmat(test_class,class)
I then repeat the Bayes classifier for each of the remaining clusters, looking for that one smurf attack.
My problem is what happens if it misclassifies an "other" attack as a smurf, or doesn't find the one remaining smurf.
I feel kind of lost about finding a better way of doing it. I am in the process of trying to pick a good ratio of smurf attacks to "other", as I don't want to over-fit, which was explained in a previous question here.
But this will take me some time, as I don't yet know how to replace the existing labels for neptune, back, ipsweep and wareclient attacks with "other" in MATLAB, so I can't yet test this theory out (I will get there).
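Both failure cases can at least be counted per cluster, so that each one is visible separately. This is only a sketch on hand-made labels: test_class and predicted below are toy cell arrays (predicted stands in for the output of classify), and isSmurfTrue, isSmurfPred, falseAlarms, missed and falseAlarmRate are names assumed for illustration.

```matlab
% Toy true and predicted labels for one cluster (assumed values)
test_class = {'other'; 'smurf'; 'other'; 'other'; 'other'};
predicted  = {'other'; 'smurf'; 'smurf'; 'other'; 'other'};

isSmurfTrue = strcmp(test_class, 'smurf');      % rows that really are smurf
isSmurfPred = strcmp(predicted,  'smurf');      % rows the classifier called smurf

falseAlarms = sum(~isSmurfTrue & isSmurfPred);  % "other" flagged as smurf
missed      = sum( isSmurfTrue & ~isSmurfPred); % smurf that slipped through

% Fraction of true "other" rows that were flagged (guard against division by zero)
falseAlarmRate = falseAlarms / max(1, sum(~isSmurfTrue));
fprintf('false alarms: %d, missed smurfs: %d, false alarm rate: %.2f\n', ...
        falseAlarms, missed, falseAlarmRate);
```

These are the same quantities as the off-diagonal cells of confusionmat(test_class, class), just with names attached.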
So my questions are:
1) Is there a better method of finding that one elusive smurf attack?
2) How can I search target_class (labels) and replace everything that isn't smurf with "other"?
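For the second question, one possible approach is logical indexing with strcmp. This is a minimal sketch that assumes the labels are stored as a cell array of strings (or a char matrix, which cellstr converts) and that the smurf label is spelled exactly 'smurf'; the toy labels and the isSmurf variable are made up for illustration.

```matlab
% Toy labels standing in for labels(indX,:) (assumed format: one string per row)
labels = {'smurf'; 'normal'; 'neptune'; 'smurf'; 'back'; 'ipsweep'};

target_class = cellstr(labels);              % also accepts a char matrix
isSmurf = strcmp(target_class, 'smurf');     % true where the label is exactly 'smurf'
target_class(~isSmurf) = {'other'};          % overwrite every non-smurf label

disp(target_class)
```

If the labels in the data carry extra characters (for example a trailing period), the string passed to strcmp would need to match that exact spelling.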