Network Intrusion Detection

 

Introduction

In this post we deal with the "Network Intrusion Detection" problem. The objective of this assignment is to classify computer network traffic as normal or anomalous based on a set of features. We begin with a Kaggle dataset: the "Network Intrusion Detection" dataset (https://www.kaggle.com/datasets/sampadab17/network-intrusion-detection) is a collection of network traffic data that can be used to detect and classify unauthorized or malicious activities within a computer network. Binary classification refers to the task of categorizing instances into one of two classes; in this case, normal or intrusive network traffic.

 

Data preparation

Data preparation is an essential step in the machine learning process to ensure that the data is in a suitable format for analysis. Here are the data preparation steps we followed for binary classification using the "Network Intrusion Detection" dataset:

Load the Dataset: Start by loading the dataset into our programming environment (Google Colab). We used libraries such as Pandas in Python to read the dataset file and create a DataFrame that allows for easy manipulation and analysis.
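
A minimal sketch of this step (the file name Train_data.csv and its location in the Colab working directory are assumptions; adjust the path to wherever the Kaggle file was downloaded):

    # Read the Kaggle CSV into a pandas DataFrame.
    # NOTE: the file name below is an assumption about the downloaded dataset.
    import pandas as pd

    df = pd.read_csv('Train_data.csv')
    print(df.shape)   # expect (25192, 42): 41 feature columns plus the class label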

Explore the Data: Take a closer look at the dataset to understand its structure and contents. Examine the columns, data types, and any missing values present. Get an overview of the distribution of the target variable (e.g., the proportion of normal and anomalous instances) to check whether it is balanced or to identify any class-imbalance issues.
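
For example (a short sketch; the target column name 'class' is an assumption that should be verified against your copy of the dataset):

    # Inspect the structure and data types of the columns.
    df.info()

    # Check the class balance of the target variable.
    print(df['class'].value_counts(normalize=True))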

Data Pre-processing

Convert Categorical Variables: In the dataset, some of the columns contain categorical variables. Therefore, we need to convert them into numerical representations that machine learning algorithms can handle. This can be done using label encoding (from the scikit-learn package). Ensure that each category is represented by a distinct numerical value or a binary column.
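
A sketch of the encoding step, assuming the categorical columns are simply the object-typed columns (typical examples in this dataset are protocol_type, service and flag, but the exact names should be verified):

    from sklearn.preprocessing import LabelEncoder

    # Replace every object-typed column (including the target) with integer codes,
    # one distinct value per category.
    le = LabelEncoder()
    for col in df.select_dtypes(include='object').columns:
        df[col] = le.fit_transform(df[col])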

Handle Missing Values: We checked for missing values in the dataset. There are no such values.
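
A quick check such as the following confirms this (a sketch):

    # Count missing values per column; every count should be zero for this dataset.
    print(df.isnull().sum())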

Normalize or Scale Numerical Features: Normalize or scale numerical features to ensure that they are on a similar scale. This step is important as it prevents features with larger magnitudes from dominating the model training process. Therefore, we used min-max scaling (values between 0 and 1) via the MinMaxScaler class (sklearn.preprocessing package).
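
A sketch of the scaling step (it assumes the encoded DataFrame df from above and a target column named 'class'):

    from sklearn.preprocessing import MinMaxScaler

    # Separate the 41 feature columns from the target, then rescale every
    # feature to the [0, 1] range.
    X = df.drop('class', axis=1).values
    y = df['class'].values

    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)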

Split the Dataset: Divide the dataset into training and testing sets. The training set is used to train the binary classification model, while the testing set is used to evaluate its performance. A common split is 80% for training and 20% for testing, but here we used 75% for training and 25% for testing: with 25,192 instances in the dataset, this still leaves a large enough test set to obtain a reliable validation accuracy. We used the train_test_split function from the scikit-learn library to perform this split.
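
The split itself is a single call (a sketch; test_size=0.25 and random_state=42 follow the settings described later in this report):

    from sklearn.model_selection import train_test_split

    # 75% of the data for training, 25% for testing, with a fixed seed
    # so the split is reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)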

Verify Data Quality: We performed a final check on the prepared data to ensure it is clean and ready for analysis, looking for any inconsistencies, outliers, or errors that may affect the model's performance.

Model Design

 

    from keras.models import Sequential
    from keras.layers import Dense, Activation

    model = Sequential()
    model.add(Dense(16, input_dim=41, activation='relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

 

  1. Model Initialization:
    • model = Sequential(): Initializes a sequential model, which is a linear stack of layers.
  2. Adding Layers:
    • model.add(Dense(16, input_dim=41, activation='relu')): Adds the first dense layer with 16 neurons (units). The input_dim parameter is set to 41, indicating the number of input features. The activation function used is ReLU, which introduces non-linearity into the model.
    • model.add(Dense(1)): Adds the second dense layer with 1 neuron. This layer serves as the output layer and does not specify any activation function.
    • model.add(Activation('sigmoid')): Adds an activation layer after the output layer. The activation function used here is the sigmoid function, which squashes the output between 0 and 1, representing the probability of belonging to the positive class (binary classification).
  3. Compiling the Model:
    • model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']): Compiles the model and configures the loss function, optimizer, and evaluation metric(s).
      • loss='binary_crossentropy': Specifies the loss function for binary classification tasks. Binary cross-entropy is commonly used for this purpose.
      • optimizer='adam': Specifies the optimizer algorithm used to update the model's weights during training. 'Adam' is a popular optimization algorithm that adapts the learning rate based on the gradient's estimated moments.
      •  metrics=['accuracy']: Specifies the evaluation metric(s) to be computed during training and testing. In this case, the accuracy metric is used to evaluate the model's performance.

Overall, the model design consists of an input layer with 41 features, a hidden layer with 16 neurons, and an output layer with a single neuron, as required for binary classification. The ReLU activation function is used in the hidden layer, while the sigmoid activation function is used in the output layer. The model is trained with the binary cross-entropy loss and optimized using the Adam algorithm. The accuracy metric is used to evaluate the model's performance.

 

Hyperparameter selection

Number of Hidden Units (16):

The number of hidden units in a layer determines the capacity, or complexity, of the model to learn and represent patterns in the data. The value of 16 hidden units was chosen based on the complexity of the problem and the size of the dataset. Increasing the number of hidden units can potentially increase the model's capacity to capture more complex relationships in the data, but it may also lead to overfitting if not balanced with appropriate regularization techniques.

 

Activation Function ('relu' and 'sigmoid'):

  • The activation function introduces non-linearity into the neural network, allowing it to learn complex relationships between the input and output.
  • 'relu' (Rectified Linear Unit) is a common choice for the activation function in hidden layers as it helps the model learn non-linear representations effectively.
  • 'sigmoid' is used as the activation function in the output layer because it maps the model's output to a probability-like value between 0 and 1, suitable for binary classification problems.

 

Loss Function ('binary_crossentropy'):

  • The loss function measures the discrepancy between the predicted values and the actual labels during training.
  • 'binary_crossentropy' is a commonly used loss function for binary classification tasks.
  • It is appropriate when the model outputs probabilities and aims to minimize the difference between the predicted probabilities and the true labels.

 

Optimizer ('adam'):

  • The optimizer determines the algorithm used to update the model's weights during training, based on the computed gradients.

  • 'adam' is a popular optimization algorithm. It adapts the learning rate for each parameter based on estimates of the first and second moments of the gradients, which can help improve convergence and training efficiency in different scenarios.

 

 

Number of Epochs (50):

  • The number of epochs defines how many times the model will iterate over the training data during training.
  • The choice of the number of epochs depends on factors such as the complexity of the problem, the size of the dataset, and the convergence behavior of the model. A lower number of epochs may not allow the model to fully learn from the data, leading to underfitting, while a higher number of epochs may risk overfitting.
  • The value of 50 epochs was selected based on prior experimentation and domain knowledge.

 

Batch Size (32):

  • The batch size determines the number of samples processed before the model's weights are updated.
  • A larger batch size can provide more stable gradient estimates, but it may also require more memory. The choice of batch size depends on factors such as the available computational resources, dataset size, and training dynamics.
  • The value of 32 for the batch size was chosen based on experimentation and performance considerations.

 

Train test split ratio:

  • Specifies that 25% of the data will be allocated to the testing set, while the remaining 75% will be used for training.
  • Allocating a larger proportion of the data to the training set allows the model to learn from more samples, potentially improving its ability to capture patterns and generalize well. The trade-off is that fewer samples remain for testing, which can make the evaluation on unseen data less precise.

  • This split ratio is often used as a default or starting point in many tutorials, examples, and frameworks.
  • The random seed is set to 42 for reproducibility. Using the same seed ensures that the data is split in the same way each time the code is run.

 

 

 

Overview of the Implementation platform

 

  1. Language: Python

Python is the programming language used for implementing our project. Python is widely used in the field of machine learning and provides a rich ecosystem of libraries and frameworks for developing and training models.

  2. Libraries and Frameworks:
    • Keras: We are using the Keras library as the primary deep learning framework for building and training neural network models.
    • TensorFlow: Keras runs on top of TensorFlow, a popular open-source deep learning library.
    • scikit-learn: We are using the train_test_split function from the scikit-learn library for splitting the data into training and testing sets.
  3. Model Architecture:
    • Neural Network Model: Our implementation involves building a neural network model.
    • Sequential Model: We are using the Sequential model from Keras, which allows us to stack layers in a linear sequence.
    • Dense Layers: We are adding Dense layers to our model, which are fully connected layers where each neuron is connected to all neurons in the previous layer.
    • Activation Functions: We use activation functions to introduce non-linearity into the network.
      • 'relu' activation function is used in the hidden layer(s) to capture complex patterns.
      • 'sigmoid' activation function is used in the output layer for binary classification, producing a probability-like output.
  4. Model Compilation:
    • Loss Function: We are using the 'binary_crossentropy' loss function.
    • Optimizer: We are using the 'adam' optimizer, which is an adaptive learning rate optimization algorithm.
    • Metrics: We are evaluating the model's performance using the 'accuracy' metric.
  5. Data Preparation:
    • Data Preprocessing: We have performed data preprocessing steps such as handling missing values, scaling/normalizing features, and encoding categorical variables.
    • Train-Test Split: We are using the train_test_split function from scikit-learn to split the data into training and testing sets with a test size of 0.25.
  6. Model Training and Evaluation:
    • Training: We are using the fit method in Keras to train the model.
    • Epochs: The model is trained for 50 epochs, indicating the number of times the model iterates over the training data.
    • Batch Size: The model is trained using a batch size of 32, specifying the number of samples processed before the model's weights are updated.
    • Evaluation: The model's performance is evaluated using the testing set provided through the validation_data parameter.
  7. Cross-Validation:
    • K-Fold Cross-Validation: We are using the KFold class from scikit-learn with 10 folds for performing cross-validation.
    • Each fold represents a different split of the data into training and validation sets (a sketch of this loop is shown below).
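
A minimal sketch of the 10-fold loop, assuming a helper build_model() (a hypothetical name) that rebuilds and compiles the network described above so each fold starts from fresh weights; shuffle=True and random_state=42 are assumptions:

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    from sklearn.model_selection import KFold
    import numpy as np

    def build_model():
        # Hypothetical helper: recreate the initial 16-unit model for each fold.
        model = Sequential()
        model.add(Dense(16, input_dim=41, activation='relu'))
        model.add(Dense(1))
        model.add(Activation('sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        return model

    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kfold.split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=50, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        print('Score for this split is: ', acc)
        scores.append(acc)

    print('Accuracy: ', np.mean(scores))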

 

Training

history = model.fit(X_train, y_train, verbose=1, epochs=50, batch_size=32,
                    validation_data=(X_test, y_test))

 

  • X_train and y_train: These are the training data features and labels, respectively. They are used to train the model.
  • verbose=1: This parameter controls the verbosity of the training process. Setting it to 1 displays a progress bar during training.
  • epochs=50: The number of epochs specifies how many times the model will iterate over the entire training data. Each epoch consists of a full pass over the training set, with one forward pass and one backward pass (gradient update) per batch.
  • batch_size=32: The batch size determines the number of samples used in each iteration of the gradient descent algorithm. In this case, 32 samples will be processed before updating the model's weights.
  • validation_data=(X_test, y_test): This parameter provides the validation data, which is used to evaluate the model's performance during training. It helps monitor the model's generalization ability on unseen data.

 

 

The training process involves the following steps for each epoch:

  1. Forward Pass: The model takes the training data (X_train) and makes predictions based on the current weights. The predicted outputs are compared to the actual labels (y_train).
  2. Loss Calculation: The loss function (binary_crossentropy) calculates the discrepancy between the predicted outputs and the actual labels. This quantifies the model's error.
  3. Backward Pass (Gradient Descent): The gradients of the loss with respect to the model's weights are computed using backpropagation. These gradients indicate the direction and magnitude of weight adjustments required to minimize the loss.
  4. Weight Update: The optimizer (adam) uses the gradients to update the model's weights, adjusting them in the direction that reduces the loss.
  5. Validation: After each epoch, the model's performance is evaluated on the validation data (X_test and y_test). The loss and accuracy metrics are computed to monitor how well the model generalizes to unseen data.
  6. Repeat: The above steps are repeated for the specified number of epochs (50 in this case), gradually refining the model's weights and improving its performance.

 

During training, the model learns to optimize its weights to minimize the loss function and improve its ability to make accurate predictions. The training process aims to find the set of weights that generalizes well to unseen data. The training progress and performance metrics (loss and accuracy) are stored in the history variable, allowing us to analyze and visualize the model's learning progress over the epochs.
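
Since history keeps the per-epoch metrics, the loss and accuracy curves shown later in this post can be drawn with a few lines of matplotlib (a sketch):

    import matplotlib.pyplot as plt

    # Training vs. validation loss per epoch.
    plt.plot(history.history['loss'], label='Training loss')
    plt.plot(history.history['val_loss'], label='Validation loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()

    # Training vs. validation accuracy per epoch.
    plt.plot(history.history['accuracy'], label='Training accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()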

Model Summary

Model: "sequential_21"

_________________________________________________________________

 Layer (type)                Output Shape              Param #  

=================================================================

 dense_42 (Dense)            (None, 16)                672      

                                                                

 dense_43 (Dense)            (None, 1)                 17       

                                                                

 activation_21 (Activation)  (None, 1)                 0        

                                                                

=================================================================

Total params: 689

Trainable params: 689

Non-trainable params: 0

 

 

These are the accuracy scores for the 10 k-fold splits:

Score for this split is:  0.9686507936507937

Score for this split is:  0.9722222222222222

Score for this split is:  0.9710202461294164

Score for this split is:  0.9753870583564906

Score for this split is:  0.9706232631996824

Score for this split is:  0.9678443826915443

Score for this split is:  0.9722111949186185

Score for this split is:  0.9583167923779278

Score for this split is:  0.962683604605002

Score for this split is:  0.963080587534736

Mean of the accuracy scores:

Accuracy:  0.9682040145686435

 

 

 

Below are the final epochs of the training stage:

Epoch 45/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0989 - accuracy: 0.9645 - val_loss: 0.0970 - val_accuracy: 0.9644

Epoch 46/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0954 - accuracy: 0.9652 - val_loss: 0.0931 - val_accuracy: 0.9649

Epoch 47/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0924 - accuracy: 0.9660 - val_loss: 0.0901 - val_accuracy: 0.9660

Epoch 48/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0896 - accuracy: 0.9666 - val_loss: 0.0874 - val_accuracy: 0.9667

Epoch 49/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0872 - accuracy: 0.9672 - val_loss: 0.0852 - val_accuracy: 0.9684

Epoch 50/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0853 - accuracy: 0.9681 - val_loss: 0.0829 - val_accuracy: 0.9682

Accuracy of the model

Accuracy =  96.82438969612122 %

 

 

 

Training and Validation loss and accuracy graphs

This indicates overfitting: the validation loss is higher than the training loss.

 

Confusion matrix

TP = 3.3e+03

TN = 2.9e+03

FP = 1.2e+02

FN = 60

Score after confusion matrix evaluation

              precision    recall  f1-score   support

 

           0       0.98      0.96      0.97      2970

           1       0.96      0.98      0.97      3328

 

    accuracy                           0.97      6298

   macro avg       0.97      0.97      0.97      6298

weighted avg       0.97      0.97      0.97      6298
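
The confusion matrix and the classification report above can be reproduced with scikit-learn (a sketch; the seaborn heatmap and the 0.5 threshold used to turn the sigmoid outputs into class labels are assumptions):

    from sklearn.metrics import confusion_matrix, classification_report
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Convert the sigmoid probabilities into hard 0/1 predictions.
    y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

    # Heatmap of the confusion matrix (like the figure above).
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True)
    plt.show()

    # Precision, recall, F1-score and support per class.
    print(classification_report(y_test, y_pred))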

 

Here, the false-positive count (about 120) is relatively large. Therefore, we want to optimize the model so that it classifies more precisely.

 

 

Optimizing the hyper parameters

Since our model is overfitting, meaning it performs well on the training data but more poorly on the validation or test data, adjusting the number of neurons in the model is a potential solution.

Reduce the number of neurons: overfitting can occur when the model is too complex and has too many neurons, allowing it to memorize the training data instead of learning generalizable patterns.

We also change the optimizer to "rmsprop".

Therefore, we decided to reduce the number of neurons in the hidden layer from 16 to 4. This helps prevent overfitting by reducing the model's capacity to memorize noise or irrelevant details in the training data. In addition, we use the RMSprop optimizer during training, which is known for its effectiveness in handling non-stationary gradient problems; it can help improve the training process and potentially enhance the model's performance.

 

After the reduction of the neurons in the hidden layer:

Model Summary

Model: "sequential_32"

_________________________________________________________________

 Layer (type)                Output Shape              Param #  

=================================================================

 dense_64 (Dense)            (None, 4)                 168            

                                                                

 dense_65 (Dense)            (None, 1)                 5        

                                                                

 activation_32 (Activation)  (None, 1)                 0        

                                                                

=================================================================

Total params: 173

Trainable params: 173

Non-trainable params: 0

_________________________________________________________________


 

 

Below are the final epochs of the training stage:

 

Epoch 45/50

591/591 [==============================] - 2s 3ms/step - loss: 0.0563 - accuracy: 0.9778 - val_loss: 0.0578 - val_accuracy: 0.9768

Epoch 46/50

591/591 [==============================] - 2s 3ms/step - loss: 0.0558 - accuracy: 0.9781 - val_loss: 0.0577 - val_accuracy: 0.9767

Epoch 47/50

591/591 [==============================] - 1s 3ms/step - loss: 0.0554 - accuracy: 0.9789 - val_loss: 0.0568 - val_accuracy: 0.9773

Epoch 48/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0547 - accuracy: 0.9789 - val_loss: 0.0561 - val_accuracy: 0.9779

Epoch 49/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0541 - accuracy: 0.9798 - val_loss: 0.0556 - val_accuracy: 0.9782

Epoch 50/50

591/591 [==============================] - 1s 2ms/step - loss: 0.0535 - accuracy: 0.9806 - val_loss: 0.0553 - val_accuracy: 0.9785

 

Accuracy of the model

97.8578299999237 %

Training and Validation loss and accuracy graphs

After 30 epochs the model is a good fit for the data.

The model shows strong performance with an accuracy of 97.85% and a loss of 0.05. It also demonstrates good precision and recall, making it a good fit for the data.

 

 

Final Optimized Model

 

    model = Sequential()
    model.add(Dense(4, input_dim=41, activation='relu'))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])

 

 

Model Architecture: The model architecture consists of a sequential stack of layers. It starts with a dense layer with 4 neurons, followed by a dense layer with 1 neuron, and finally an activation layer with the 'sigmoid' activation function. This architecture can be adjusted based on the specific problem and dataset.

Optimizer Selection: The optimizer used in the model is 'rmsprop', which stands for Root Mean Square Propagation. It is known for its effectiveness in handling non-stationary gradient problems and is commonly used in deep learning tasks.

Loss Function: The loss function used is 'binary_crossentropy', which is appropriate for binary classification problems. It measures the discrepancy between the predicted and actual values.

Metrics: The model is evaluated using the 'accuracy' metric, which provides the accuracy rate of the model's predictions.

To see our Full code: Code

Test Results

 

Confusion matrix

TP = 3.3e+03

TN = 2.9e+03

FP = 38

FN = 41

 

Score after confusion matrix evaluation

              precision    recall  f1-score   support

 

           0       0.98      0.98      0.98      2970

           1       0.98      0.98      0.98      3328

 

    accuracy                           0.98      6298

   macro avg       0.98      0.98      0.98      6298

weighted avg       0.98      0.98      0.98      6298

 

 

Discussion

 

True positive = 3.3e+03

The model correctly predicted approximately 3,300 instances as positive (class 1). These are the cases where the model correctly identified positive instances; in this context, it means that roughly 3,300 instances of class 1 were correctly classified as such.

True negative = 2.9e+03

The model correctly predicted approximately 2,900 instances as negative (class 0). These are the cases where the model correctly identified negative instances; in this context, it means that roughly 2,900 instances of class 0 were correctly classified as such.

False positive = 38

The model incorrectly predicted 38 instances as positive (class 1), but they were actually negative (class 0). These are the cases where the model falsely identified negative instances as positive, leading to a false positive prediction.

False negative = 41

The model incorrectly predicted 41 instances as negative (class 0), but they were actually positive (class 1). These are the cases where the model falsely identified positive instances as negative, leading to a false negative prediction.

 

  • Precision: The precision for both classes (0 and 1) is 0.98, indicating that the model has a high proportion of correct positive predictions out of all instances predicted as positive. This means that when the model predicts an instance as positive, it is correct 98% of the time for both classes.
  • Recall: The recall for both classes is also 0.98, indicating that the model correctly identifies 98% of the positive instances out of all true positive instances. This implies that the model has a high sensitivity in detecting positive instances for both classes.
  • F1-Score: The F1-score for both classes is 0.98, which is the harmonic mean of precision and recall. This balanced measure suggests that our model achieves a good trade-off between precision and recall for both classes.

  • The overall accuracy of our model is 0.98, which means that it correctly predicts the class labels for 98% of the instances in the test set (the metric definitions are sketched below).
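
For reference, the definitions behind these numbers can be written as a small sketch (only the standard formulas are shown; the exact counts are not repeated here because the heatmap values above are rounded):

    # Standard definitions of the metrics reported above.
    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f1(p, r):
        # Harmonic mean of precision and recall.
        return 2 * p * r / (p + r)

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)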

 

 

 

 

 

 

 

 

Download Doc Here - https://drive.google.com/file/d/1gJ42uk2DYjWMkCgLGubQ3WvvvNUMccmn/view?usp=sharing
