Network Intrusion Detection
Introduction
In this post we deal with the "Network Intrusion Detection" problem. The objective of this
assignment is to determine, from a set of features, whether computer network traffic is
normal or anomalous. We begin with a Kaggle dataset. The "Network
Intrusion Detection" dataset (https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection)
is a collection of network traffic data that can be used to detect and classify
unauthorized or malicious activities within a computer network. Binary
classification refers to the task of categorizing instances into one of two
classes, in this case, normal or intrusive network traffic.
Data preparation
Data preparation is an essential step
in the machine learning process to ensure that the data is in a suitable format
for analysis. Here are the data preparation steps we followed for
binary classification using the "Network Intrusion Detection" dataset:
Load the Dataset: Start by loading the
dataset into our programming environment (Google Colab). We used the Pandas
library in Python to read the dataset file and create a DataFrame that
allows for easy manipulation and analysis.
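For illustration, a minimal loading sketch (assuming the dataset's training CSV has already been downloaded into the Colab working directory and is named Train_data.csv):
import pandas as pd

# Read the CSV into a DataFrame for easy manipulation and analysis.
df = pd.read_csv("Train_data.csv")   # file name assumed
print(df.shape)
print(df.head())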
Explore the Data: Take a closer look
at the dataset to understand its structure and contents. Examine the columns,
data types, and any missing values present. Get an overview of the distribution
of the target variable (e.g., the proportion of normal and anomalous instances)
to ensure it is balanced or identify any class imbalance issues.
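A short exploration sketch along these lines (the target column is assumed to be named 'class', with values 'normal' and 'anomaly'):
# Inspect columns, data types, missing values and the class balance.
print(df.info())
print(df.isnull().sum())
print(df['class'].value_counts(normalize=True))   # proportion of normal vs. anomalous traffic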
Data Pre-processing
Convert Categorical Variables: In the
dataset, some of the columns contain categorical variables. Therefore, we need
to convert them into numerical representations that machine learning algorithms
can handle. This can be done by using label encoding (scikit-learn package).
Ensure that each category is represented by a distinct numerical value or a
binary column.
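A minimal label-encoding sketch, assuming the categorical columns are those with object dtype (in this dataset, typically protocol_type, service, flag and the class label):
from sklearn.preprocessing import LabelEncoder

# Replace every categorical column (including the class label) with integer codes.
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])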
Handle Missing Values: We checked for
missing values in the dataset and found none.
Normalize or Scale Numerical Features:
Normalize or scale numerical features to ensure that they are on a similar
scale. This step is important as it prevents features with larger magnitudes
from dominating the model training process. Therefore, we used min-max scaling
(values between 0 and 1) using the MinMaxScaler class from the sklearn.preprocessing package.
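A sketch of the scaling step (assuming the encoded DataFrame from above and a target column named 'class'):
from sklearn.preprocessing import MinMaxScaler

# Separate features and label, then scale the features to the [0, 1] range.
X = df.drop('class', axis=1).values
y = df['class'].values
X = MinMaxScaler().fit_transform(X)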
Split the Dataset: Divide the dataset
into training and testing sets. The training set is used to train the binary
classification model, while the testing set is used to evaluate its
performance. A common split is 80% for training and 20% for testing, but here
we used 75% for training and 25% for testing because the dataset contains
25192 instances and this split gave the best validation accuracy. We used the
scikit-learn library (train_test_split) to perform this split.
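The split itself, as a minimal sketch (75% / 25% with a fixed seed, matching the description above):
from sklearn.model_selection import train_test_split

# 75% of the samples for training, 25% for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)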
Verify Data Quality: We performed a final
check on the prepared data to ensure it is clean and ready for analysis,
looking for any inconsistencies, outliers, or errors that might affect the
model's performance.
Model Design
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Initial model: 16-unit hidden layer, sigmoid output, Adam optimizer
# (layer size and optimizer match the model summary and training description below).
model = Sequential()
model.add(Dense(16, input_dim=41, activation='relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
- Model Initialization:
- model = Sequential():
Initializes a sequential model, which is a linear stack of layers.
- Adding Layers:
- model.add(Dense(16, input_dim=41,
activation='relu')): Adds the first dense layer with 16
neurons (units). The input_dim parameter is set to 41, indicating
the number of input features. The activation function used is ReLU, which
introduces non-linearity into the model.
- model.add(Dense(1)):
Adds the second dense layer with 1 neuron. This layer serves as the
output layer and does not specify any activation function.
- model.add(Activation('sigmoid')):
Adds an activation layer after the output layer. The activation function
used here is the sigmoid function, which squashes the output between 0
and 1, representing the probability of belonging to the positive class
(binary classification).
- Compiling the Model:
- model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy']): Compiles the model and configures
the loss function, optimizer, and evaluation metric(s).
- loss='binary_crossentropy':
Specifies the loss function for binary classification tasks. Binary
cross-entropy is commonly used for this purpose.
- optimizer='adam':
Specifies the optimizer algorithm used to update the model's weights
during training. 'Adam' is a popular optimization algorithm that adapts
the learning rate based on the gradient's estimated moments.
- metrics=['accuracy']:
Specifies the evaluation metric(s) to be computed during training and
testing. In this case, the accuracy metric is used to evaluate the
model's performance.
Overall, the model design consists of
an input layer with 41 features, a hidden layer with 16 neurons, and an output
layer with 1 neuron because of binary classification. The ReLU activation
function is used in the hidden layer, while the sigmoid activation function is
used in the output layer. The model is trained using binary cross-entropy loss
and optimized using the Adam algorithm. The accuracy metric is used to evaluate
the model's performance.
Hyperparameter selection
Number of Hidden Units (16):
The number
of hidden units in a layer determines the capacity or complexity of the model
to learn and represent patterns in the data. The value of 16 for the number of
hidden units was chosen based on the complexity of the problem and the
size of the dataset. Increasing the number of hidden units can potentially
increase the model's capacity to capture more complex relationships in the
data, but it may also lead to overfitting if not balanced with appropriate
regularization techniques.
Activation Function ('relu' and
'sigmoid'):
- The activation function introduces non-linearity into the
neural network, allowing it to learn complex relationships between the
input and output.
- 'relu' (Rectified Linear Unit) is a common choice for the
activation function in hidden layers as it helps the model learn
non-linear representations effectively.
- 'sigmoid' is used as the activation function in the output
layer because it maps the model's output to a probability-like value
between 0 and 1, suitable for binary classification problems.
Loss Function ('binary_crossentropy'):
- The loss function measures the discrepancy between the predicted values and the actual labels during training.
- 'binary_crossentropy' is a commonly used loss function for binary classification tasks.
- It is appropriate when the model outputs probabilities and aims to minimize the difference between the predicted probabilities and the true labels.
Optimizer ('adam'):
- The optimizer determines the algorithm used to update the model's weights during training, based on the computed gradients.
- 'adam' is a popular optimization algorithm that adapts the learning rate based on estimated moments of the gradients, which can help improve convergence and training efficiency in different scenarios.
Number of Epochs (50):
- The number of epochs defines how many times the model will
iterate over the training data during training.
- The choice of the number of epochs depends on factors such as
the complexity of the problem, the size of the dataset, and the
convergence behavior of the model. A lower number of epochs may not allow
the model to fully learn from the data, leading to underfitting, while a
higher number of epochs may risk overfitting.
- The value of 50 epochs was selected based on prior experimentation and domain knowledge.
Batch Size (32):
- The batch size determines the number of samples processed
before the model's weights are updated.
- A larger batch size can provide more stable gradient
estimates, but it may also require more memory. The choice of batch size
depends on factors such as the available computational resources, dataset
size, and training dynamics.
- The value of 32 for the batch size was chosen based on experimentation and performance considerations.
Train test split ratio:
- Specifies that 25% of the data will be allocated to the
testing set, while the remaining 75% will be used for training.
- Allocating a larger proportion of the data to the training
set allows the model to learn from more samples, potentially improving its
ability to capture patterns and generalize well. However, reducing the
size of the training set means fewer samples are available for training,
which can affect the model's ability to learn complex relationships.
- This split ratio is often used as a default or starting point in many tutorials, examples, and frameworks.
- The random seed is set to 42 (random_state=42) for reproducibility. Using the same seed ensures that the data is split in the same way each time the code is run.
Overview of the Implementation platform
- Language: Python
Python is the programming language
used for implementing our project. Python is widely used in the field of
machine learning and provides a rich ecosystem of libraries and frameworks for
developing and training models.
- Libraries and Frameworks:
- Keras: We are using the Keras library as
the primary deep learning framework for building and training neural network
models.
- TensorFlow: Keras runs on top of
TensorFlow, a popular open-source deep learning library.
- scikit-learn: We are using the train_test_split
function from the scikit-learn library for splitting the data into
training and testing sets.
- Model Architecture:
- Neural Network Model: Our implementation
involves building a neural network model.
- Sequential Model: We are using the
Sequential model from Keras, which allows us to stack layers in a linear
sequence.
- Dense Layers: We are adding Dense layers
to our model, which are fully connected layers where each neuron is
connected to all neurons in the previous layer.
- Activation Functions: We are using
activation functions to introduce non-linearity in the network.
- 'relu' activation function is used in
the hidden layer(s) to capture complex patterns.
- 'sigmoid' activation function is used
in the output layer for binary classification, producing a
probability-like output.
- Model Compilation:
- Loss Function: We are using the
'binary_crossentropy' loss function.
- Optimizer: We are using the 'adam'
optimizer, which is an adaptive learning rate optimization algorithm.
- Metrics: We are evaluating the model's
performance using the 'accuracy' metric.
- Data Preparation:
- Data Preprocessing: We have performed
data preprocessing steps such as handling missing values,
scaling/normalizing features, and encoding categorical variables.
- Train-Test Split: We are using the train_test_split
function from scikit-learn to split the data into training and testing
sets with a test size of 0.25.
- Model Training and Evaluation:
- Training: We are using the fit
method in Keras to train the model.
- Epochs: The model is trained for 50
epochs, indicating the number of times the model iterates over the
training data.
- Batch Size: The model is trained using a
batch size of 32, specifying the number of samples processed before the
model's weights are updated.
- Evaluation: The model's performance is
evaluated using the testing set provided through the validation_data
parameter.
- Cross-Validation:
- K-Fold Cross-Validation: We are using
the KFold class from scikit-learn with 10 folds for performing
cross-validation (a minimal sketch follows this list).
- Each fold represents a different split
of the data into training and validation sets.
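A minimal sketch of the 10-fold evaluation; build_model is a hypothetical helper that returns a freshly compiled copy of the Sequential model described above, and X and y are the prepared NumPy arrays:
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = build_model()  # hypothetical helper returning the compiled model
    model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=32, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)
print("Score for each split:", scores)
print("Mean accuracy:", np.mean(scores))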
Training
history = model.fit(X_train, y_train, verbose=1, epochs=50, batch_size=32, validation_data=(X_test, y_test))
- X_train and y_train: These are the
training data features and labels, respectively. They are used to train
the model.
- verbose=1:
This parameter controls the verbosity of the training process. Setting it
to 1 displays a progress bar during training.
- epochs=50: The
number of epochs specifies how many times the model will iterate over the
training data. Within each epoch the data is processed batch by batch, with
one forward pass and one backward pass (gradient descent) per batch.
- batch_size=32: The
batch size determines the number of samples used in each iteration of the
gradient descent algorithm. In this case, 32 samples will be processed
before updating the model's weights.
- validation_data=(X_test, y_test):
This parameter provides the validation data, which is used to evaluate the
model's performance during training. It helps monitor the model's
generalization ability on unseen data.
The training process involves the
following steps for each epoch:
- Forward Pass: The model takes the training data (X_train)
and makes predictions based on the current weights. The predicted outputs
are compared to the actual labels (y_train).
- Loss Calculation: The loss function (binary_crossentropy)
calculates the discrepancy between the predicted outputs and the actual
labels. This quantifies the model's error.
- Backward Pass (Gradient Descent): The gradients of the loss
with respect to the model's weights are computed using backpropagation.
These gradients indicate the direction and magnitude of weight adjustments
required to minimize the loss.
- Weight Update: The optimizer (adam) uses the gradients
to update the model's weights, adjusting them in the direction that
reduces the loss.
- Validation: After each epoch, the model's performance is
evaluated on the validation data (X_test and y_test). The
loss and accuracy metrics are computed to monitor how well the model
generalizes to unseen data.
- Repeat: The above steps are repeated for the specified number
of epochs (50 in this case), gradually refining the model's weights and
improving its performance.
During training, the model learns to
optimize its weights to minimize the loss function and improve its ability to
make accurate predictions. The training process aims to find the optimal set of
weights that generalize well to unseen data. The training progress and
performance metrics (loss and accuracy) can be stored in the history
variable, allowing you to analyze and visualize the model's learning progress
over the epochs.
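For example, a minimal plotting sketch with matplotlib, using the history object returned by fit:
import matplotlib.pyplot as plt

# Plot training vs. validation loss and accuracy across epochs.
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()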
Model Summary
Model: "sequential_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_42 (Dense)             (None, 16)                672
dense_43 (Dense)             (None, 1)                 17
activation_21 (Activation)   (None, 1)                 0
=================================================================
Total params: 689
Trainable params: 689
Non-trainable params: 0
_________________________________________________________________
These are the accuracy scores from the 10 folds of cross-validation:
Score for this split is: 0.9686507936507937
Score for this split is: 0.9722222222222222
Score for this split is: 0.9710202461294164
Score for this split is: 0.9753870583564906
Score for this split is: 0.9706232631996824
Score for this split is: 0.9678443826915443
Score for this split is: 0.9722111949186185
Score for this split is: 0.9583167923779278
Score for this split is: 0.962683604605002
Score for this split is: 0.963080587534736
Mean of the accuracy scores:
Accuracy: 0.9682040145686435
Below are the final epochs of the training stage:
Epoch 45/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0989 - accuracy: 0.9645 - val_loss: 0.0970 - val_accuracy: 0.9644
Epoch 46/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0954 - accuracy: 0.9652 - val_loss: 0.0931 - val_accuracy: 0.9649
Epoch 47/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0924 - accuracy: 0.9660 - val_loss: 0.0901 - val_accuracy: 0.9660
Epoch 48/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0896 - accuracy: 0.9666 - val_loss: 0.0874 - val_accuracy: 0.9667
Epoch 49/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0872 - accuracy: 0.9672 - val_loss: 0.0852 - val_accuracy: 0.9684
Epoch 50/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0853 - accuracy: 0.9681 - val_loss: 0.0829 - val_accuracy: 0.9682
Accuracy of the model
Accuracy = 96.82438969612122 %
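This figure was presumably obtained by evaluating the trained model on the held-out test split; a minimal sketch of such an evaluation (an assumption about the exact call used):
# Evaluate loss and accuracy on the test split.
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy =", acc * 100, "%")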
Training and Validation loss and accuracy graphs
The graphs indicate overfitting: the validation loss is greater than the training loss.
Confusion matrix
TP ≈ 3300
TN ≈ 2900
FP ≈ 120
FN = 60
Score after confusion matrix
evaluation
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      2970
           1       0.96      0.98      0.97      3328

    accuracy                           0.97      6298
   macro avg       0.97      0.97      0.97      6298
weighted avg       0.97      0.97      0.97      6298
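A sketch of how these scores can be produced with scikit-learn (thresholding the sigmoid outputs at 0.5 is an assumption about the exact post-processing used):
from sklearn.metrics import confusion_matrix, classification_report

# Convert predicted probabilities into hard 0/1 labels and compare with the ground truth.
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))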
Here, the false-positive count is relatively large. Therefore, we want to
optimize the model so that it classifies more precisely.
Optimizing the hyper parameters
Since our model is overfitting, meaning it performs well on the training data
but poorly on the validation or test data, adjusting the number of neurons in
our model is a potential solution.
Reduce the number of neurons:
Overfitting can occur when the model is too complex and has too many neurons,
allowing it to memorize the training data instead of learning generalizable
patterns.
Change the optimizer to 'rmsprop': RMSprop uses adaptive per-parameter
learning rates and is known for its effectiveness on non-stationary gradient
problems.
Therefore, we decided to reduce the number of neurons in the hidden layer to 4.
This helps prevent overfitting by reducing the model's capacity to memorize
noise or irrelevant details in the training data. We also switched to the
RMSprop optimizer during training, which can improve the training process and
potentially enhance the model's performance.
After reducing the number of neurons in the hidden layer:
Model Summary
Model: "sequential_32"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_64 (Dense)             (None, 4)                 168
dense_65 (Dense)             (None, 1)                 5
activation_32 (Activation)   (None, 1)                 0
=================================================================
Total params: 173
Trainable params: 173
Non-trainable params: 0
_________________________________________________________________
Below are the final epochs of the training stage:
Epoch 45/50
591/591 [==============================] - 2s 3ms/step - loss: 0.0563 - accuracy: 0.9778 - val_loss: 0.0578 - val_accuracy: 0.9768
Epoch 46/50
591/591 [==============================] - 2s 3ms/step - loss: 0.0558 - accuracy: 0.9781 - val_loss: 0.0577 - val_accuracy: 0.9767
Epoch 47/50
591/591 [==============================] - 1s 3ms/step - loss: 0.0554 - accuracy: 0.9789 - val_loss: 0.0568 - val_accuracy: 0.9773
Epoch 48/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0547 - accuracy: 0.9789 - val_loss: 0.0561 - val_accuracy: 0.9779
Epoch 49/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0541 - accuracy: 0.9798 - val_loss: 0.0556 - val_accuracy: 0.9782
Epoch 50/50
591/591 [==============================] - 1s 2ms/step - loss: 0.0535 - accuracy: 0.9806 - val_loss: 0.0553 - val_accuracy: 0.9785
Accuracy of the model
Accuracy = 97.8578299999237 %
Training and Validation loss and accuracy graphs
After 30 epochs the model is a good
fit for the data.
The model shows strong performance
with an accuracy of 97.85% and a loss of 0.05. It also demonstrates good
precision and recall, making it a good fit for the data.
Final Optimized Model
model = Sequential()
model.add(Dense(4, input_dim=41, activation='relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
Model Architecture: The model
architecture consists of a sequential stack of layers. It starts with a dense
layer with 4 neurons, followed by a dense layer with 1 neuron, and finally an
activation layer with the 'sigmoid' activation function. This architecture can
be adjusted based on the specific problem and dataset.
Optimizer Selection: The optimizer
used in the model is 'rmsprop', which stands for Root Mean Square Propagation.
It is known for its effectiveness in handling non-stationary gradient problems
and is commonly used in deep learning tasks.
Loss Function: The loss function used
is 'binary_crossentropy', which is appropriate for binary classification
problems. It measures the discrepancy between the predicted and actual values.
Metrics: The model is evaluated using
the 'accuracy' metric, which provides the accuracy rate of the model's
predictions.
To see our Full code: Code
Test Results
Confusion matrix
TP ≈ 3300
TN ≈ 2900
FP = 38
FN = 41
Score after confusion matrix
evaluation
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      2970
           1       0.98      0.98      0.98      3328

    accuracy                           0.98      6298
   macro avg       0.98      0.98      0.98      6298
weighted avg       0.98      0.98      0.98      6298
Discussion
True positive = 3.3e+03
The model correctly predicted approximately 3300 instances as positive (class 1).
These are the cases where the model correctly identified positive instances:
about 3300 instances of class 1 were classified as such.
True negative = 2.9e+03
The model correctly predicted approximately 2900 instances as negative (class 0).
These are the cases where the model correctly identified negative instances:
about 2900 instances of class 0 were classified as such.
False positive = 38
The model incorrectly predicted 38
instances as positive (class 1), but they were actually negative (class 0).
These are the cases where the model falsely identified negative instances as
positive, leading to a false positive prediction.
False negative = 41
The model incorrectly predicted 41
instances as negative (class 0), but they were actually positive (class 1).
These are the cases where the model falsely identified positive instances as
negative, leading to a false negative prediction.
- Precision: The precision for both classes (0 and 1) is 0.98,
indicating that the model has a high proportion of correct positive
predictions out of all instances predicted as positive. This means that
when the model predicts an instance as positive, it is correct 98% of the
time for both classes.
- Recall: The recall for both classes is also 0.98, indicating
that the model correctly identifies 98% of the positive instances out of
all true positive instances. This implies that the model has a high
sensitivity in detecting positive instances for both classes.
- F1-Score: The F1-score for both classes is 0.98, which is the
harmonic mean of precision and recall. This balanced measure suggests that
our model achieves a good trade-off between precision and recall for both
classes.
- The overall accuracy of our model is 0.98, which means that it correctly
predicts the class labels for 98% of the instances in the test set (the
metric definitions are sketched below).
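For reference, the metrics above are computed from the confusion-matrix counts as follows (a generic sketch, not tied to the rounded counts reported here):
# Standard definitions of the reported metrics in terms of TP, TN, FP, FN.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)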