With the widespread adoption of Internet of Things (IoT) technology, the increasing number of IoT devices has led to a rise in serious network security issues. Botnets, a major threat in network security, have garnered significant attention over the past decade. However, detecting these rapidly evolving botnets remains a challenge, with current detection accuracy being relatively low. Therefore, this study focuses on designing efficient botnet detection models to enhance detection performance. This paper improves the initial population generation strategy of the Dung Beetle Optimizer (DBO) by using the centroid opposition-based learning strategy instead of the original random generation strategy. The improved DBO is applied to optimize Catboost parameters and is employed in the field of IoT botnet detection. Performance comparison experiments are conducted using real-world IoT traffic datasets. The experimental results demonstrate that the proposed method outperforms other models in terms of accuracy and F1 score, indicating the effectiveness of the proposed approach in this field.
Keywords: IoT; botnet; catboost; dung beetle optimizer
The rapid development of the Internet of Things (IoT) has brought many conveniences to modern society, such as smart homes [[
On the one hand, a large number of commonly used devices such as televisions, refrigerators, surveillance cameras, doors, windows, cups, and light bulbs are interconnected with other devices through technologies such as Bluetooth and ZigBee, greatly facilitating people's lives. On the other hand, this also provides attackers with more opportunities to intrude on devices and steal user information, threatening both privacy and property security [[
One common attack scenario is when network attackers target IoT devices to turn them into part of a botnet. Once infected, these devices are controlled by the attackers and used to expand the botnet continuously. After gaining control over multiple IoT devices, the attackers use the infected devices to carry out various attacks, such as gaining unauthorized access to user information, sending massive volumes of spam email, and executing DDoS attacks. The impact of IoT botnets is extremely severe, as exemplified by the Mirai botnet in 2016, which infected millions of devices and conducted one of the largest DDoS attacks in history [[
This paper aims to propose an IoT botnet detection model based on DBO-Catboost. In this model, the DBO algorithm is improved by incorporating a centroid opposition-based learning strategy to generate the initial population. The improved DBO algorithm produces initial solutions that are closer to the algorithm's optimal solution. The improved DBO algorithm is then used to optimize the hyperparameters of Catboost, and applied to the IoT botnet detection. The experiments utilize real IoT datasets, Botnet and Bot-IoT, and make comparisons with existing detection models. The experimental results demonstrate that the IoT botnet detection model based on IDBO-Catboost outperforms other state-of-the-art classifiers in terms of accuracy and F1 score. It also exhibits higher detection efficiency and possesses a certain degree of generalization capability.
With the continuous development of machine learning, its algorithms have been widely applied in the field of IoT security and play a crucial role in IoT botnet detection.
Ryu et al. [[
Currently, many detection models are based on intelligent optimization algorithms. Table 1 summarizes the related research on intelligent optimization algorithms in IoT botnet detection models.
In conclusion, the academic community has made significant progress in research on detecting IoT botnet attacks. However, due to the complex and ever-changing nature of the IoT environment, as well as the rapid evolution of botnets, the accuracy of detection models still requires further improvement. Additionally, most studies have only conducted experiments on a single dataset, which fails to demonstrate the generalizability of the proposed methods. This work optimizes model hyperparameter selection using a novel swarm intelligence optimization algorithm. To the best of our knowledge, it is the first time that the DBO algorithm has been combined with the Catboost model and applied in the field of IoT botnet detection, achieving optimal classification results.
The Catboost algorithm [[
The dung beetle optimizer (DBO) [[
(1) Ball-rolling behavior: rolling beetles navigate along a fixed direction guided by celestial cues, exploring the entire search space.

(2) Breeding behavior: female beetles lay their brood balls within a safe oviposition area near the current local best position.

(3) Foraging behavior: hatched small beetles search for food within an optimal foraging zone around the global best position.

(4) Stealing behavior: thief beetles steal dung balls from other beetles in the vicinity of the global best position, strengthening exploitation.
Overall, the Dung Beetle Optimizer leverages the behavior of dung beetles to offer an efficient and effective optimization algorithm capable of balancing exploration and exploitation for solving complex problems.
The position updating rules for different types of beetles are as follows:
To simulate the rolling behavior of the rolling beetles, they need to move along a given direction within the entire search space. During the rolling process, the position of a rolling beetle is updated and can be represented as follows:
x_i(t + 1) = x_i(t) + α × k × x_i(t − 1) + b × Δx (1)

Δx = |x_i(t) − X^w| (2)

where t represents the current iteration number of the algorithm and x_i(t) denotes the position of the i-th beetle at the t-th iteration; k ∈ (0, 0.2] is a deflection coefficient, b ∈ (0, 1) is a constant, α is set to 1 or −1 to indicate whether the beetle deviates from its original direction, X^w denotes the global worst position, and Δx is used to simulate the change in light intensity.
Additionally, when the beetle's path encounters obstacles, it needs to change its forward direction through a "dance" behavior. This behavior is mimicked using a tangent function. Once the new path is determined, the beetle's position is updated using the following formula:
x_i(t + 1) = x_i(t) + tan(θ) |x_i(t) − x_i(t − 1)| (3)

where θ ∈ [0, π] is the deflection angle; when θ equals 0, π/2, or π, the position of the beetle is not updated.
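As an illustration, the two rolling-phase updates can be sketched in NumPy. The coefficient values below (k, b, α) are assumptions within the ranges the algorithm allows, not values fixed by this paper:

```python
import numpy as np

def roll_update(x_t, x_prev, x_worst, k=0.1, b=0.3, alpha=1):
    """Obstacle-free rolling: move along the current direction, steered by
    the distance to the global worst position (light-intensity term)."""
    delta_x = np.abs(x_t - x_worst)                 # light-intensity change
    return x_t + alpha * k * x_prev + b * delta_x   # rolling position update

def dance_update(x_t, x_prev, theta):
    """Dancing: reorient with a tangent term when the path is blocked;
    the position is left unchanged when tan(theta) is zero or undefined."""
    if np.isclose(np.tan(theta), 0.0) or np.isclose(theta, np.pi / 2):
        return x_t
    return x_t + np.tan(theta) * np.abs(x_t - x_prev)
```

For example, `roll_update(np.array([0.5]), np.array([0.4]), np.array([1.0]))` moves the beetle forward by the momentum term plus the light-intensity term.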
The position updating of breeding beetles requires setting a boundary to simulate the oviposition area of female beetles. It can be defined as:
Lb* = max(X* × (1 − R), Lb) (4)

Ub* = min(X* × (1 + R), Ub) (5)

where X* denotes the current local best position, R = 1 − t/T_max with T_max being the maximum number of iterations, and Lb and Ub denote the lower and upper bounds of the search space. The brood balls of the breeding beetles are updated within this oviposition area.
After the hatchlings emerge, it is necessary to establish an optimal foraging zone to guide the beetles in their search for food. The optimal foraging zone can be represented as follows:
Lb^b = max(X^b × (1 − R), Lb) (6)

Ub^b = min(X^b × (1 + R), Ub) (7)

where X^b denotes the global best position, and Lb^b and Ub^b denote the lower and upper bounds of the optimal foraging zone. The position of a foraging beetle is then updated as follows:
x_i(t + 1) = x_i(t) + C1 × (x_i(t) − Lb^b) + C2 × (x_i(t) − Ub^b) (8)

where C1 is a random number following a normal distribution and C2 ∈ (0, 1) is a random vector.
Finally, the beetles referred to as thieves steal dung balls from other beetles. Since X^b represents the global best food source, the area around X^b can be regarded as the optimal location to compete for food. The position of a thief beetle is updated as follows:

x_i(t + 1) = X^b + S × g × (|x_i(t) − X*| + |x_i(t) − X^b|) (9)

where g is a random vector of size 1 × D following a normal distribution, and S denotes a constant.
Controlling the quality of the initial population in DBO is crucial for enhancing the algorithm performance. The quality is primarily determined by two factors: the search range and the initial positions. If the search range is too narrow, it hampers the algorithm's ability to explore. Conversely, if the initial positions are close to the global optimal solution, the population can effectively uncover valuable information within a superior solution space. However, the standard DBO adopts a random initialization method to generate the population, which makes it difficult to meet the above two requirements.
To address this issue, this paper introduces the centroid opposition-based learning strategy (COBL) [[
Let (X_1, X_2, ..., X_n) be n points in a D-dimensional search space, and assume that each point carries unit mass. The centroid of this system can be defined as:

M = (X_1 + X_2 + ... + X_n) / n (10)

then we have:

M_j = (1/n) × Σ_{i=1}^{n} X_{i,j}, j = 1, 2, ..., D (11)
If the centroid of a discrete and uniformly distributed set is denoted by M, the centroid opposite point X̃_i of an individual X_i is defined as:

X̃_i = 2 × M − X_i (12)
When the opposite point exceeds the search space, it is recomputed according to Equation (13):

X̃_{i,j} = rand(a_j, b_j), if X̃_{i,j} < a_j or X̃_{i,j} > b_j (13)

In the equation, a_j and b_j denote the lower and upper bounds of the j-th dimension, and rand(a_j, b_j) generates a random number uniformly distributed in [a_j, b_j].
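The strategy above can be sketched in NumPy as follows. The function and parameter names are illustrative, and the final selection step assumes a minimization objective:

```python
import numpy as np

def cobl_init(pop_size, dim, lb, ub, fitness, rng=None):
    """Centroid opposition-based initialization (sketch).

    Generates a random population, builds the centroid-opposite population,
    repairs out-of-range opposite points by resampling within the bounds,
    then keeps the pop_size fittest individuals from the union."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.uniform(lb, ub, size=(pop_size, dim))   # random population
    M = X.mean(axis=0)                              # centroid of the population
    X_opp = 2.0 * M - X                             # centroid-opposite points
    # Repair: resample any coordinate that leaves the search space
    out = (X_opp < lb) | (X_opp > ub)
    lo = np.broadcast_to(lb, X_opp.shape)
    hi = np.broadcast_to(ub, X_opp.shape)
    X_opp[out] = rng.uniform(lo[out], hi[out])
    # Evaluate the union and keep the best pop_size individuals (minimization)
    union = np.vstack([X, X_opp])
    fit = np.apply_along_axis(fitness, 1, union)
    return union[np.argsort(fit)[:pop_size]]
```

Compared with purely random initialization, the selected individuals tend to start closer to promising regions, which is the effect the improved DBO relies on.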
Optimizing the hyperparameters of a machine learning model can significantly impact its performance [[
In comparison to these methods, the dung beetle optimization algorithm is a novel swarm intelligence optimization algorithm that demonstrates fast convergence and strong global search capabilities. It can effectively discover the optimal combination of parameters in machine learning models.
When predicting botnet attacks using the Catboost model, inappropriate parameter settings can have a significant impact on the model's predictions. Therefore, in this paper an improved dung beetle optimization (IDBO) algorithm was chosen to optimize the important hyperparameters of the Catboost model. The ensemble model addresses the problem of the Catboost model easily becoming stuck in local optima while further improving the accuracy of the predictions.
The design concept of the IDBO-Catboost detection model combines the dung beetle optimizer algorithm with the Catboost algorithm, optimizing the key parameters in Catboost that have the most significant impact on classification performance. This approach leverages the parameter optimization capabilities of the DBO algorithm and combines it with the strengths of Catboost in handling classification problems, achieving an organic integration of the two. It eliminates the tedious manual tuning process. The IDBO-Catboost detection model is illustrated in Figure 1.
The optimization process of Catboost using the improved dung beetle optimizer can be described in the following steps:
Step 1: Initialize the Catboost model parameters and determine the optimal range based on experience. Set algorithm control parameters such as population size, maximum number of iterations, and position boundaries. Randomly define the positions of the beetles in the search space and form a new population according to the centroid opposition-based learning strategy.
Step 2: Calculate the fitness values of all beetles based on the objective function.
Step 3: Start the loop and update the positions of the beetles.
Step 4: Check if each beetle is out of bounds.
Step 5: Calculate the individual best and global best solutions, as well as their fitness values, based on the current positions of the beetles. The individual best represents the best solution found by each beetle, and the global best is selected from these individual best solutions.
Step 6: Repeat the above steps until the stopping criteria are met—either the maximum number of iterations is reached or the model performance reaches a pre-defined threshold. When the algorithm finishes, the global best solution and its fitness value are output. The global best solution represents the optimal parameter combination for the Catboost algorithm, and the fitness value represents the classification performance of the model in terms of AUC (Area Under the Curve).
In this experiment, the parameters that have the most significant impact on the performance of the Catboost algorithm are the number of gradient boosting trees (iterations), learning_rate, and maximum tree depth (depth). The dung beetle optimizer is used to search for the optimal values of these parameters. The pseudocode for constructing the IDBO-Catboost model is shown in Algorithm 1.
1: Initialization: population size, search dimensionality, maximum number of iterations, search range.
2: Form a new population using the centroid opposition-based learning strategy.
3: while the maximum number of iterations has not been reached do
4:   for each rolling beetle do
5:     δ = rand(1)
6:     if δ < 0.9 then
7:       Update the position of the rolling beetle using Equation (1)
8:     else
9:       Update the position of the rolling beetle using Equation (3)
10:    end if
11:  end for
12:  for each breeding beetle do
13:    Update the position of the breeding beetle using Equations (4) and (5)
14:  end for
15:  for each foraging beetle do
16:    Update the position of the foraging beetle using Equation (8)
17:  end for
18:  for each thief beetle do
19:    Update the position of the thief beetle using Equation (9)
20:  end for
21:  if the fitness of a new position is better than the current best then
22:    Update the corresponding position
23:  end if
24: end while
25: Output the global best solution and its fitness value
26: Build the IDBO-Catboost model with the optimal parameters
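The workflow of Algorithm 1 can be sketched as follows. For brevity, the sketch collapses the four beetle behaviors into a single foraging-style move toward the current best solution, and the objective is a stand-in: in the actual model it would train Catboost with the candidate (iterations, learning_rate, depth) and return a loss such as 1 − AUC on a validation split. The bounds and target values here are illustrative assumptions:

```python
import numpy as np

# Illustrative search bounds for (iterations, learning_rate, depth)
LB = np.array([100.0, 0.01, 3.0])
UB = np.array([1000.0, 0.30, 10.0])

def objective(p):
    """Stand-in fitness (lower is better). In IDBO-Catboost this would train
    CatBoost with iterations=int(p[0]), learning_rate=p[1], depth=int(p[2])
    and return 1 - AUC on a validation split."""
    target = np.array([800.0, 0.15, 7.0])       # hypothetical optimum
    return float(np.sum(((p - target) / (UB - LB)) ** 2))

def idbo_optimize(pop_size=20, max_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    # Centroid opposition-based initialization (Algorithm 1, lines 1-2)
    X = rng.uniform(LB, UB, size=(pop_size, 3))
    X_opp = np.clip(2.0 * X.mean(axis=0) - X, LB, UB)
    union = np.vstack([X, X_opp])
    fit = np.array([objective(x) for x in union])
    order = np.argsort(fit)
    X = union[order[:pop_size]]
    best, best_fit = X[0].copy(), float(fit[order[0]])
    # Main loop (simplified position updates)
    for t in range(max_iter):
        R = 1.0 - t / max_iter                      # shrinking search radius
        lb_t = np.maximum(best * (1.0 - R), LB)     # foraging-zone bounds
        ub_t = np.minimum(best * (1.0 + R), UB)
        C1 = rng.normal(size=X.shape)
        C2 = rng.random(X.shape)
        X = np.clip(X + C1 * (X - lb_t) + C2 * (X - ub_t), LB, UB)
        for x in X:                                 # greedy best update
            f = objective(x)
            if f < best_fit:
                best, best_fit = x.copy(), f
    return best, best_fit
```

A call such as `best, loss = idbo_optimize()` would return the best parameter vector found within the bounds, which would then be rounded where needed (iterations, depth) and passed to the final Catboost training run.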
In this section, the performance of the proposed improved DBO-Catboost detection model is compared and evaluated using publicly available datasets.
The hardware platform used in this experiment included a 12th-generation Intel Core i5-12400F CPU, an NVIDIA GeForce RTX 3060 Ti GPU, 32 GB of 3200 MHz RAM, and a 1 TB SSD for storage.
The experiment was conducted on the Windows 10 operating system, using Python 3.7 as the primary programming language. Tools and packages such as pandas, sklearn, and numpy were employed.
The dataset used in this experiment was the Botnet dataset [[
Data preprocessing is an essential step in machine learning, involving cleaning, transforming, and standardizing raw data before feeding it into a model. The main purpose of data preprocessing is to improve data quality, enhance model performance, and increase generalization ability.
Typically, data preprocessing involves several steps:
Step 1: Data cleaning. The primary objective is to remove duplicate and invalid values from the dataset. This also involves handling missing values by imputation.
Step 2: Data type conversion. This refers to converting the data type of features to a different format. Machine learning algorithms typically require numerical data, so non-numeric features need to be converted into a numerical format before the algorithms can process them.
Step 3: Standardization. Some features in the dataset may have varying scales or ranges of values. To simplify the learning process of the model, the features in the dataset are standardized.
Step 4: Feature engineering. The Botnet dataset consists of 83 features, many of which are redundant. These redundant features can negatively impact model training. In this study, principal component analysis (PCA) [[
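The four preprocessing steps can be sketched with pandas and scikit-learn. The label column name and the number of retained PCA components here are assumptions, not values fixed by this paper:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame, label_col: str = "label", n_components: int = 20):
    """Sketch of Steps 1-4; column names and n_components are illustrative."""
    # Step 1: data cleaning -- drop duplicates, impute missing numeric values
    df = df.drop_duplicates().reset_index(drop=True)
    num_cols = df.drop(columns=[label_col]).select_dtypes(include=np.number).columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    # Step 2: type conversion -- encode non-numeric feature columns as integers
    obj_cols = df.drop(columns=[label_col]).select_dtypes(exclude=np.number).columns
    for c in obj_cols:
        df[c] = df[c].astype("category").cat.codes
    X = df.drop(columns=[label_col]).to_numpy(dtype=float)
    y = df[label_col].to_numpy()
    # Step 3: standardization -- zero mean, unit variance per feature
    X = StandardScaler().fit_transform(X)
    # Step 4: feature reduction with PCA
    X = PCA(n_components=min(n_components, X.shape[1])).fit_transform(X)
    return X, y
```

In practice, the scaler and PCA would be fitted on the training set only and then applied to the test set, to avoid information leakage.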
It is worth noting that Catboost has strong capabilities in handling missing values. It can automatically detect and handle missing data, eliminating the need for additional feature engineering and reducing the time required for data preprocessing before training. In this section, to ensure consistency across all experiments, the preprocessed dataset was used.
This article utilized evaluation metrics such as the confusion matrix, precision, recall, accuracy, and F1 score to analyze and assess the classification results. The confusion matrix is defined as shown in Table 3.
Accuracy: refers to the proportion of correctly predicted samples out of the total samples:
Accuracy = (TP + TN) / (TP + TN + FP + FN) (14)
Precision: refers to the probability of correctly predicting a positive instance:
Precision = TP / (TP + FP) (15)
Recall: refers to the probability of correctly predicting a positive instance among all actual positive samples:
Recall = TP / (TP + FN) (16)
F1 Score: represents the harmonic mean of precision and recall:
F1 = (2 × Precision × Recall) / (Precision + Recall) (17)
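Computed directly from the confusion-matrix counts, the four metrics can be written as:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
    precision = tp / (tp + fp)                   # correctness of positive calls
    recall = tp / (tp + fn)                      # coverage of actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```

For instance, with TP = 90, FP = 10, FN = 5, TN = 95, the accuracy is 185/200 = 0.925 and the precision is 90/100 = 0.9.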
Parameters such as the maximum number of trees (iterations), tree depth (depth), and learning rate (learning_rate) have a significant impact on the classification performance of the Catboost algorithm. In this study, the dung beetle optimizer was used to optimize these parameters for Catboost. The iteration count for DBO-Catboost was initially set to 100; however, after several preliminary experiments, it was observed that 30 iterations were sufficient for the DBO-Catboost model to converge. Therefore, the iteration count was set to 30, the number of beetle populations was set to 40, and the parameter search ranges were set as follows: Iterations = [
To verify the optimization effectiveness of the improved DBO algorithm on the model, particle swarm optimization (PSO), grey wolf optimization (GWO), and the original dung beetle optimization (DBO) algorithms were used for comparison. The number of iterations was set to 30, and the AUC value of the model was used as the fitness function. The results are shown in Figure 2.
From the results in Figure 2, it can be observed that the improved DBO algorithm converged faster, reaching convergence after 16 iterations. In terms of the AUC value, after 30 iterations the IDBO-Catboost model had the highest AUC, indicating better optimization performance. Compared with the unimproved DBO algorithm, the improved DBO algorithm had both a higher initial fitness value and a higher final fitness value, and it converged faster. This indicates that the improved DBO generated a better initial population closer to the optimal solution, requiring fewer iterations to converge to the vicinity of the best solution. This result validates the effectiveness of our improvement to the initial population generation rules of DBO.
The experimental results on the Botnet dataset, comparing Catboost, PSO-Catboost, GWO-Catboost, and the original DBO-Catboost models, are shown in Table 4.
According to the information in Table 4, the optimization strategy of IDBO-Catboost effectively found the best parameter combination for the model, improving its performance. In a vertical comparison with Catboost, IDBO-Catboost achieved an improvement of 2.17% in both accuracy and F1 score. In a horizontal comparison, IDBO-Catboost outperformed PSO-Catboost and GWO-Catboost by 1.21% and 1.38%, respectively, in both accuracy and F1 score. Compared with the original DBO-Catboost, IDBO-Catboost showed a further improvement of 0.35% in accuracy and F1 score. Although this improvement is modest, it is still valuable in real-world deployments. The main reason for these differences is that the improved initialization strategy increases the diversity of the population, allowing the algorithm to better utilize the search space. By generating an initial population closer to the optimal solution, the algorithm also enhances its search capability. Therefore, the improved DBO algorithm can comprehensively explore the search space and achieve better global search capability to obtain the optimal parameter combination for the Catboost algorithm.
Figure 3 shows the ROC curves of several models. According to the information in Figure 3, we can see that the ROC curve of IDBO-Catboost was closer to the top-left corner, indicating that IDBO-Catboost achieved a better balance between sensitivity and specificity. Additionally, the AUC value of IDBO-Catboost was the highest among all the comparison models, reaching 0.99. Therefore, this result effectively demonstrates that the model has good performance and high discriminative ability at different thresholds.
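The ROC curve and AUC value for any of the compared models can be obtained with scikit-learn. The scores below are illustrative; in the experiments they would come from the classifier's predicted probabilities on the test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Illustrative labels and scores (in practice: model.predict_proba on the test set)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
```

Plotting `fpr` against `tpr` for each model reproduces a figure like Figure 3; a curve closer to the top-left corner (higher AUC) indicates a better sensitivity/specificity trade-off.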
After validating the optimization effect of IDBO on the Catboost algorithm, we compared it with the recently proposed models, namely BO-GP-DT (2020), GWO-OCSVM (2020), PSO-ONE-SVM (2021), BO-LGBM (2022), and the original DBO-Catboost. The parameter settings for the models are shown in Table 5, and the experimental results from the Botnet dataset are presented in Table 6.
From the results in Table 6, it can be observed that IDBO-Catboost achieved the highest accuracy of 96.10% among all the comparative models. It outperformed BO-GP-DT, BO-LGBM, GWO-OCSVM, PSO-ONE-SVM, and the original DBO-Catboost by 1.04%, 0.63%, 0.72%, 0.55%, and 0.35%, respectively. Similarly, IDBO-Catboost also demonstrated superior precision, recall, and F1 score compared to the other models, indicating its optimal classification performance.
These results can be explained as follows. First, compared to decision trees and SVM algorithms, gradient-boosting tree-based algorithms such as Catboost and LGBM perform better on large-scale datasets, and the Catboost algorithm is particularly effective in handling noisy data. Second, for optimization algorithms, the ability to balance global and local search is crucial: in the dung beetle optimizer, breeding behavior ensures that new individuals have better fitness, foraging behavior accelerates local search, stealing behavior exploits better solutions discovered during the global search, and rolling behavior increases diversity, helping the algorithm escape local optima and further improving global search capability. Finally, the improved DBO algorithm generates an initial population closer to the optimal solution, allowing it to better find the optimal parameter combination. Through the combined effect of these behaviors, the algorithm balances global and local search, enhancing its search ability and optimization performance and effectively improving the detection and classification performance of the model.
To evaluate the generalization ability of the proposed method, the Bot-IoT dataset [[
Table 7 presents the performance of the IDBO-Catboost and the comparison models on the processed Bot-IoT dataset.
From the information in Table 7, it can be observed that compared to other models, IDBO-Catboost achieved higher detection accuracy on the Bot-IoT dataset, with an accuracy and F1 score of 98.57%. This indicates that the hyperparameters identified by IDBO-Catboost were highly effective, and the model exhibited good generalization ability. The main reason for this result was the presence of the dynamically changing parameter R in the dung beetle optimizer, which allowed the algorithm to adapt better to different problems and enhance the model's generalization performance.
In addition to performance metrics such as accuracy and F1 score, the average time (the sum of model training and prediction time) of the proposed model and the other detection models on the Botnet and Bot-IoT datasets is compared in Table 8. It can be observed that the IDBO-Catboost detection model requires less time than the other detection models. This is because the IDBO-Catboost model can utilize GPU acceleration for training, thereby improving detection efficiency. Compared with DBO-Catboost and Catboost, the IDBO-Catboost model converges in fewer iterations, resulting in a shorter training time.
In summary, this paper conducted experiments using two IoT botnet datasets, Botnet and Bot-IoT. Compared with existing detection models, the proposed improved DBO-Catboost detection model demonstrated the best classification performance in terms of accuracy and detection efficiency. Therefore, it can be effectively applied to IoT botnet detection tasks.
To address the challenge of designing an effective botnet detection model and improving detection performance, this paper proposes a botnet detection model based on the improved DBO-Catboost. We improved the initial population generation strategy of the dung beetle optimization algorithm, replacing random initialization with the centroid opposition-based learning strategy, and applied it to the field of IoT botnet detection. The improved DBO-Catboost model utilized the dung beetle optimizer to optimize the hyperparameters of Catboost, reducing the cumbersome and repetitive process of hyperparameter tuning and improving efficiency. The proposed method was validated using real-world IoT traffic datasets. Experimental results demonstrate that the proposed approach achieved accuracy rates of 96.10% and 98.57% on the Botnet and Bot-IoT datasets, respectively. It outperformed all the compared models in terms of classification performance and exhibited a certain level of generalization capability.
There are some limitations in this study that need to be addressed. When dealing with imbalanced datasets, further research is needed to improve the focus on the minority class and enhance the detection performance of the model on imbalanced data. Additionally, in the future, introducing interpretability mechanisms will be beneficial to improve the explainability of the model.
Graph: Figure 1 IDBO-Catboost Botnet Detection Model.
Graph: Figure 2 Changes in AUC values for the number of iterations. (a) IDBO-Catboost, (b) DBO-Catboost, (c) PSO-Catboost, and (d) GWO-Catboost.
Graph: Figure 3 AUC values and ROC curves of different models.
Graph: Figure 4 Imbalanced data handling process.
Graph: Figure 5 Dataset sample distribution.
Table 1 Related studies on intelligent optimization algorithms for IoT botnet detection.
Year | Reference | Methodology | Dataset | Model | Performance
2020 | Injadat et al. [ | BO-GP | Bot-IoT | BO-GP-DT | Accuracy: 99.99%
2020 | Al et al. [ | GWO | N-BaIoT | GWO-OCSVM | G-Mean: 96.8%
2021 | Salam et al. [ | PSO | N-BaIoT | PSO-ONE-SVM | G-Mean: 98.7%
2022 | Gong et al. [ | BO | CTU-13 | BO-LGBM | Accuracy: 96.62%, F1 Score: 86.71%
Table 2 Basic Information of Botnet Dataset.
Set | Number of Samples | Number of Features | Attack | Normal
Training Set | 331,852 | 84 | 173,192 | 158,660
Testing Set | 345,746 | 84 | 212,154 | 133,592
Table 3 Confusion Matrix.
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
Table 4 Comparison of experimental results.
Methods | Precision | Recall | Accuracy | F1-Score
Catboost | 94.09% | 93.93% | 93.93% | 93.93%
DBO-Catboost | 95.84% | 95.75% | 95.75% | 95.75%
IDBO-Catboost | 96.21% | 96.10% | 96.10% | 96.10%
PSO-Catboost | 95.00% | 94.89% | 94.89% | 94.89%
GWO-Catboost | 94.83% | 94.72% | 94.72% | 94.72%
Table 5 Parameter settings.
Methods | Parameter Settings
BO-GP-DT | max_depth = 6, max_leaf_nodes = 11, min_samples_leaf = 2, min_samples_split = 7
BO-LGBM | max_depth = 18, learning_rate = 0.58, n_estimators = 800, num_leaves = 35
DBO-Catboost | Iterations = 849, learning_rate = 0.12, depth = 8
IDBO-Catboost | Iterations = 816, learning_rate = 0.17, depth = 7
GWO-OCSVM | 0.061, 0.21, Kernel = rbf
PSO-ONE-SVM | 0.053, 0.22, Kernel = rbf
Table 6 Experimental Comparison of Various Models on Botnet.
Methods | Precision | Recall | Accuracy | F1-Score
BO-GP-DT | 95.19% | 95.06% | 95.06% | 95.06%
BO-LGBM | 95.53% | 95.47% | 95.47% | 95.47%
DBO-Catboost | 95.84% | 95.75% | 95.75% | 95.75%
IDBO-Catboost | 96.21% | 96.10% | 96.10% | 96.10%
GWO-OCSVM | 95.45% | 95.38% | 95.38% | 95.38%
PSO-ONE-SVM | 95.62% | 95.55% | 95.55% | 95.55%
Table 7 Experimental Comparison of Various Models on BotIoT.
Methods | Precision | Recall | Accuracy | F1-Score
Catboost | 94.38% | 94.25% | 94.25% | 94.25%
BO-GP-DT | 96.73% | 96.68% | 96.68% | 96.68%
BO-LGBM | 97.10% | 97.06% | 97.06% | 97.06%
DBO-Catboost | 98.13% | 98.12% | 98.12% | 98.12%
IDBO-Catboost | 98.62% | 98.57% | 98.57% | 98.57%
GWO-OCSVM | 96.54% | 96.50% | 96.50% | 96.50%
PSO-ONE-SVM | 97.08% | 97.01% | 97.01% | 97.01%
Table 8 Average time.
Average time in seconds.
Dataset | IDBO-Catboost | DBO-Catboost | BO-GP-DT | BO-LGBM | GWO-OCSVM | PSO-ONE-SVM | Catboost
Botnet | 38.21 | 39.84 | 46.75 | 45.31 | 47.82 | 46.52 | 51.89
BotIoT | 20.19 | 23.44 | 31.67 | 27.05 | 30.70 | 28.55 | 37.42
Conceptualization, C.Y.; Writing—original draft, C.Y.; experiment, C.Y.; resources, C.Y.; formal analysis, C.Y. and W.G.; project running, W.G. and Z.F.; supervision, W.G. and Z.F.; funding acquisition, Z.F. All authors have read and agreed to the published version of the manuscript.
The dataset used can be located at https://research.unsw.edu.au/projects/bot-iot-dataset (accessed on 20 April 2023) and https://
The authors declare no conflict of interest.
By Changjin Yang; Weili Guan and Zhijie Fang