Cancer is the second-leading cause of death worldwide, and therapeutic peptides that target and destroy cancer cells have received a great deal of interest in recent years. Traditional wet-lab experiments are expensive and inefficient for identifying novel anticancer peptides (ACPs); therefore, the development of an effective computational approach is essential to recognize ACP candidates before experimental methods are used. In this study, we propose ACP-ADA, an AdaBoost algorithm with random forest as the base learner, which integrates the binary profile feature, amino acid index, and amino acid composition into a 210-dimensional feature vector to represent the peptides. Training samples in the feature space were augmented to increase the sample size and further improve the performance of the model in the case of insufficient samples. Furthermore, we used five-fold cross-validation to find the model parameters, and the cross-validation results showed that ACP-ADA outperforms existing methods for this feature combination with data augmentation in terms of performance metrics. Specifically, ACP-ADA recorded an average accuracy of 86.4% and a Matthews correlation coefficient of 74.01% for dataset ACP740, and 90.83% and 81.65%, respectively, for dataset ACP240; consequently, it can be a very useful tool in drug development and biomedical research.
Keywords: anticancer peptides; ada-boosting algorithm; data augmentation; binary profile feature; amino acid index; amino acid composition
Cancer is currently the second most common cause of death and a leading cause of morbidity worldwide [[
Anticancer Peptides (ACPs) do not interfere with healthy bodily processes; rather, they provide new therapeutic options. The discovery of ACPs has opened new avenues for cancer treatment. ACPs are made up of 10 to 60 amino acids and feature an amphipathic cationic [[
Numerous computational techniques in the domain of bioinformatics are utilized to solve various types of issues [[
The number of ACPs used in the above strategies has not surpassed 1000 cases, which is not a large sample. The prediction performance of these strategies can be further improved if additional ACPs are included [[
There are four steps involved in the proposed method, as shown in Figure 1. First, the given peptide sequences are input and each peptide sequence is preprocessed to an equal length. Second, we calculate the BPF (140-dimensional feature vector), AAINDEX (50-dimensional feature vector, with the features selected based on minimum redundancy-maximum relevance (mRMR)), and AAC (20-dimensional feature vector) of the peptides, which together form a 210-dimensional feature vector. Third, the training samples are augmented based on this feature vector, and the augmented training samples are used to train the boosting classifier. Finally, to test the performance of the proposed technique, we apply five-fold cross-validation to evaluate ACP-ADA on two benchmark datasets, ACP740 and ACP240. We assess the effectiveness of this strategy using several classification metrics and measure the outcome of augmentation using different classifiers. The experimental results demonstrate that data augmentation based on the concatenated hybrid feature vector, that is, BPF, AAINDEX, and AAC, can improve the prediction of ACPs given a suitable choice of classifier. Thus, the proposed ACP-ADA method is suitable for prediction.
In this section, we illustrate the effects of concatenated features (BPF+AAINDEX+AAC) on the performance of the proposed method when using different classifiers with and without data augmentation. Finally, we compare the proposed method with existing methods using a different classifier.
The parameter affecting the performance of the model is Lx, the peptide length after pre-processing, which was selected as a length of 40, 50, or 60. In the data augmentation stage, N is an additional parameter connected to the number of new positive (negative) samples in the model. Thus, N can be set to 100, 200, or 300 percent of the initial positive (negative) sample number.
The prediction performance of the model established based on different values of the Lx parameter, which is the peptide length, and 'N', which represents the percentage of augmentation for databases ACP740 and ACP240, is presented in Table 1 and Table 2. MCC is a threshold-independent performance evaluation metric that generates a high score only if the classifier correctly predicts most of the positive and negative data instances. Therefore, we chose the best parameters, namely, Lx = 50 and N = 100% for ACP740 and Lx = 50 and N = 300% for ACP240, according to the maximum MCC value. Because ACP240 has fewer samples than ACP740, the value of N is larger for ACP240 than for ACP740, implying that more pseudo-samples are required for ACP240 than for ACP740. In addition, the performance of the model was evaluated on the ACP214 test dataset. The results for ACP-ADA on the independent test dataset are explained in the Supplementary Materials Section S2, Figure S2, Section S2.1.
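The parameter selection described above reduces to choosing the (Lx, N) pair with the maximum cross-validated MCC. A minimal sketch of this choice for ACP740, using the MCC values from Table 1 together with the reported best setting (Lx = 50, N = 100, MCC = 74.01):

```python
# Mean five-fold-CV MCC (%) per (Lx, N%) setting on ACP740, from Table 1,
# plus the reported best setting Lx = 50, N = 100 (MCC = 74.01).
cv_mcc = {
    (40, 100): 71.77, (40, 200): 72.64, (40, 300): 71.25,
    (50, 100): 74.01, (50, 200): 71.81, (50, 300): 71.58,
    (60, 100): 73.05, (60, 200): 72.05, (60, 300): 72.86,
}

# Select the parameter pair with the maximum MCC.
best_lx, best_n = max(cv_mcc, key=cv_mcc.get)
```

Maximizing over this table recovers the chosen setting Lx = 50 and N = 100%.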
BPF and k-mer sparse matrix have proven to be effective in ACP-DL [[
BPF, AAINDEX, and AAC are the three individual features, and BPF+AAINDEX, BPF+AAC, AAINDEX+AAC, and BPF+AAINDEX+AAC are their combinations. The performance of the models for the individual features and their concatenations is depicted in Figure 2. When the three features were applied separately, BPF and AAC performed the best. Based on the MCC value, the BPF+AAINDEX+AAC combination produced the best results for ACP740 and ACP240 among the four feature concatenations, as shown in Figure 2. We therefore chose the BPF+AAINDEX+AAC concatenation to represent the peptide sequences. The feature importance for anticancer peptide prediction is explained in the Supplementary Materials Section S2.
We used the concatenated BPF+AAINDEX+AAC vector to represent peptides. It was then necessary to determine which classifier worked best with our strategy. In Figure 3, the horizontal axis represents the classifier and the vertical axis represents the MCC value for each classifier with and without data augmentation. We analyzed the performance of the prediction model with and without data augmentation on seven selected models: Multi-Layer Perceptron (MLP), a neural-network-based model; Support Vector Machine (SVM), which classifies peptides using a separating hyperplane; Random Forest (RF), a tree-based model that classifies peptides based on if-then rules; k-Nearest Neighbours (KN), which classifies a sample by a majority vote of its nearest neighbours; Extremely Randomized Trees (ET), an ensemble built from randomized decision trees; Gradient Boosting (GB), a boosting method that focuses on the previous incorrect classifications of a weak learner and tries to improve the prediction; and AdaBoost (base learner = RF), an adaptive boosting method constructed using random forest as the weak learner. We utilized MCC to assess and compare the models' performance because it is a comprehensive metric. The performance of the selected models on the ACP740 and ACP240 datasets is shown in Figure 3.
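The classifier comparison can be sketched as follows. This is a reduced illustration on synthetic data, not the published experiment: it omits the fold-wise augmentation step (which must be applied to the training folds only, something a plain `cross_val_score` cannot express) and uses small ensemble sizes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y, models, folds=5):
    """Mean five-fold cross-validated MCC per candidate model."""
    scorer = make_scorer(matthews_corrcoef)
    return {name: cross_val_score(model, X, y, cv=folds, scoring=scorer).mean()
            for name, model in models.items()}

# Synthetic, well-separated two-class data standing in for the peptide features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 10)), rng.normal(3.0, 1.0, (30, 10))])
y = np.array([0] * 30 + [1] * 30)

scores = compare_classifiers(X, y, {
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),
    "ET": ExtraTreesClassifier(n_estimators=10, random_state=0),
})
```

Running each candidate with and without augmented training folds, and comparing the resulting MCC values, reproduces the kind of comparison shown in Figure 3.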
Figure 3 confirms that, on the ACP740 dataset, the prediction models built using MLP, RF, ET, GB, and ADA show performance improvements in terms of the MCC value. However, data augmentation causes performance degradation in the models based on SVM and KN. On the ACP240 dataset, data augmentation enhances the performance of the prediction models based on RF, KN, ET, GB, and ADA, while the relative prediction performance of the models based on MLP and SVM decreases. Thus, when using RF, ET, GB, and ADA, data augmentation improved the performance of the ACP prediction model on both datasets. This finding indicates that the effectiveness of data augmentation is linked to the classifier selected for prediction. Therefore, MLP, SVM, and KN were not suitable for our prediction model.
Based on MCC as the comprehensive metric for evaluating the performance of the model, we chose the AdaBoost classifier (ADA) to build the final predictive model. Though the GB method achieved the best performance on ACP740, its classification performance after data augmentation on the ACP240 dataset, which consists of relatively fewer samples, was much weaker. Therefore, the ADA method was selected as a more robust alternative for classifying ACPs and non-ACPs on both datasets; ADA shows a better performance improvement after data augmentation on both datasets compared to the other classifiers. We call the resulting method ACP-ADA; AdaBoost itself has exhibited outstanding performance in various fields in recent years. The results of our adaptive boosting classifier on the ACP740 and ACP240 peptide datasets show significant improvement compared with previous state-of-the-art models. It achieves better performance in terms of both ACC and MCC, which indicates that the proposed ACP-ADA model can be used as an anticancer peptide model for investigating ACPs and non-ACPs.
To ensure the effectiveness and efficiency of the proposed method, we compared the performance of ACP-ADA with ACP-DA [[
Compared with ACP-DA, the use of our method has a distinct advantage. It is accompanied by a concatenated feature vector (BPF+AAINDEX+AAC) representing the order, composition, and physicochemical properties to represent the peptides with data augmentation and the boosting classifier, which is an ensemble learner that focuses on incorrectly classified samples. The proposed method with concatenated hybrid feature vectors with data augmentation outperforms ACP-DA in most metrics, especially the two most important performance metrics, ACC and MCC.
As shown in Figure 4, the performance of the proposed method on the ACP740 and ACP240 datasets was better than that of ACP-DA, ACP-DL, DeepACP, and AntiCP 2.0. Compared to ACP-DA, the strongest of the existing models, our method showed improvements in ACC by 5%, PRE by 5%, SPE by 6%, and MCC by 9% for the ACP740 dataset. For ACP240, the number of samples was lower than for ACP740; nonetheless, our method improved the ACC by 3%, PRE by 1%, SPE by 2%, and MCC by almost 6%. The proposed method outperformed the alternatives on the ACP240 dataset in terms of both the ACC and MCC evaluation metrics, indicating that our strategy is well suited to datasets with fewer samples. This method applies Gaussian-noise oversampling together with an AdaBoost classifier using random forest as the base learner and a feature vector representing order and composition along with physicochemical properties, which improves the prediction of ACPs. In addition, the performance of ACP-ADA and all control methods was evaluated on the ACP214 test dataset. The details are provided in the Supplementary Materials Section S2.
Tracing the etiology of cancer remains challenging because of its ambiguous mechanisms. According to a systematic examination, individual feature vectors do not offer viable biomarkers for predicting peptide activity. Therefore, in order to investigate a suitable feature vector, we used BPF, AAINDEX, AAC, and their combinations to represent the order, composition, and physicochemical properties of peptides. From the feature-comparison experiments based on the maximum MCC value for the ACP740 and ACP240 datasets, we selected the concatenation of BPF, AAINDEX, and AAC to represent the peptides. We extracted 210-dimensional feature vectors from this feature combination to represent peptides in the feature space. Here, we propose an ACP prediction method called ACP-ADA which uses a boosting method along with data augmentation of the training samples. According to the results on the two datasets, the proposed model has good overall performance. Compared with existing methods, ACP-ADA had better results in classifying whether peptides were ACPs or non-ACPs; its accuracy may be attributed to the following reasons.
First, we used effective feature representation methods to characterize peptide sequences. To find the feature combinations, we concatenated three feature representation methods to form robust features using BPF, AAINDEX, and AAC. Experiments on the ACP740 and ACP240 datasets show that the concatenated features obtain the best performance; therefore, we used triad feature combination to represent the peptide sequences.
Second, to compensate for the lack of samples in the training set, data augmentation was applied to generate pseudosamples. We generated a pseudosample by adding perturbation to the training samples in the 210-dimension feature space of the original samples. The feature space of the samples was formed by the concatenation of BPF, AAINDEX, and AAC as a hybrid feature, resulting in a 210-dimensional numerical feature vector. BPF is composed of vectors of 1 and 0, which are incompatible with the addition of noise; thus, we only added noise to AAINDEX and AAC to generate pseudosamples. Augmented training samples were used to train the machine learning model to further improve the performance of the prediction model, which showed a significant impact based on the choice of the classifier.
Finally, the various models showed good performance in many bioinformatic classifications. However, it remains unclear whether data augmentation can improve the performance of prediction models using different classifiers. Therefore, we analyzed the effect of this methodology using seven different classifiers. The results show that data augmentation is effective when using RF, ET, GB, and ADA classifiers with RF as the base learner. Therefore, we selected ADA, which is a boosting classifier, as the final classifier with the best overall performance.
In summary, the proposed method for the identification of ACPs showed improved performance; it is our hope that ACP-ADA can play an important role in biomedical research and the development of new anticancer drugs. Furthermore, a comparative analysis with other methods showed that ACP-ADA was better than the other methods in most cases.
To accurately and quickly identify ACPs, a boosting classifier was applied to discriminate peptide sequences using a 210-dimensional feature vector; the classifier prioritizes incorrectly classified samples while constructing random forests to form the complete AdaBoost classifier. As an ensemble learning method, boosting effectively prevents overfitting; it performed well on the test data and achieved a comparative improvement in the prediction of ACPs. In addition, predicted secondary- and tertiary-structure characteristics of peptides could be added to this model as feature descriptors, which may further improve the performance of the model with the data augmentation method. Furthermore, neural network methods could be used for the identification of ACPs as the dataset size increases.
Because of the successful results with data augmentation on the dataset with a low sample count (ACP240) using machine learning boosting methods, we conclude that this peptide data augmentation methodology can also be applied when training deep learning models such as convolutional neural networks, recurrent neural networks, Transformers, and language models. Based on the predictive performance improvement on the dataset with fewer positive and negative samples, we expect that this method of peptide data augmentation can enhance predictive performance on small datasets when combined with advanced deep learning models, which can be further explored in peptide-based research.
In this study, a machine learning model called the boosting method is proposed to predict ACPs. Called ACP-ADA, the proposed method uses concatenated features provided by BPF, AAINDEX, and AAC. We evaluated the predictive performance of ACP-ADA for ACPs on the ACP740 and ACP240 benchmark datasets. Furthermore, using the common tool CD-HIT [[
The main dataset, ACP740, includes 364 non-ACPs (negative examples) and 376 experimentally validated ACPs (positive examples).
The alternate dataset, ACP240, includes 111 non-ACPs (negative examples) and 129 experimentally validated ACPs (positive examples).
In addition, we built datasets with a CD-HIT similarity cutoff of 0.35, named ACP614 and ACP214. A description of the datasets and experimental results is provided in the Supplementary Materials Section S2.
The iLearn Python package [[
First, the physicochemical characteristics of each amino acid sequence were determined using the AAINDEX function in the iLearn Python package [[
(
BPF represents the residue order; AAINDEX represents the peptides in terms of the physicochemical properties (activity-based features) of the 20 amino acid residues; and AAC represents the proportion of residues dominant in ACPs and non-ACPs. Thus, the combination collectively represents the residue order, activity, and percentage of each residue for each peptide. Combining these features captures local residue-order information, sequence-level physicochemical features, and the proportions of amino acids that are highly prevalent in ACPs and non-ACPs as explainable inputs for the model. For this reason, we selected and extracted BPF, AAINDEX, and AAC as the predictive features in our proposed method. Each individual feature was used, along with their combination, as a predictor for the machine learning model. Finally, the training samples were augmented in the feature space and used to train the machine learning model, with the trained model assigning a class label to the test sets.
The newly constructed datasets ACP614 and ACP214 (with a CD-HIT similarity cutoff of 0.35) were featurized based on PSSM. The details are explained in the Supplementary Materials Section S2.
Converting peptides of various lengths into feature vectors of a fixed length is the primary goal of feature representation. The unprocessed peptide sequence P can be modeled as
P = P[1] P[2] P[3] ... P[L]
where P[i] denotes the i-th amino acid residue and L is the length of the peptide. After preprocessing, each sequence is cut or padded to the fixed length Lx.
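A minimal sketch of this equal-length preprocessing step, assuming longer sequences are truncated and shorter ones padded; the padding character 'X' (unknown residue) is an assumption, as the paper does not state it:

```python
def preprocess(seq, lx=50, pad_char="X"):
    """Cut or pad a peptide sequence to the fixed length Lx.

    Truncate sequences longer than Lx; pad shorter ones with 'X'
    (the padding character is an assumption).
    """
    seq = seq.upper()
    return seq[:lx] if len(seq) >= lx else seq + pad_char * (lx - len(seq))
```

With Lx = 50, every peptide in the dataset maps to a 50-character string before feature extraction.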
The binary profile has the advantage of providing an order of residues in the peptides, which is not feasible with composition-based characteristics [[
F(BPF) = [f(P[1]), f(P[2]), ..., f(P[k])]
where k is the number of N-terminal residues that are encoded and f maps each residue to a 20-dimensional one-hot vector, yielding a 20 x k-dimensional feature. The experiments suggest that setting k to 7 produces the best results [[
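With k = 7, the binary profile feature is a 140-dimensional vector. A self-contained sketch of this encoding:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def binary_profile_feature(seq, k=7):
    """One-hot encode the first k N-terminal residues into a 20*k vector.

    Any non-standard residue (e.g. an 'X' padding character) maps to an
    all-zero column, so short sequences still yield a fixed-length vector.
    """
    vector = []
    for residue in seq[:k].ljust(k, "X"):
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector
```

Each of the k positions contributes exactly one 1 (or none, for padding), so the vector both fixes the dimension and preserves the residue order at the N-terminus.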
The most useful qualities for representing biological reactions are the physicochemical characteristics of amino acids, which have been widely employed in bioinformatic studies. Numerous published indices that represent the physicochemical characteristics of amino acids can be found in the AAINDEX database [[
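As an illustrative sketch of an AAINDEX-style encoding, the snippet below maps each residue to one published physicochemical scale (Kyte-Doolittle hydropathy, used here purely as an example); the actual method draws on many AAINDEX entries and retains a 50-dimensional feature vector after mRMR selection, which is omitted here:

```python
# Kyte-Doolittle hydropathy values, one example of an AAINDEX scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def aaindex_encode(seq, lx=50):
    """Per-residue physicochemical values, zero-padded to length Lx.

    Non-standard residues (e.g. padding) map to 0.0.
    """
    values = [KYTE_DOOLITTLE.get(residue, 0.0) for residue in seq[:lx]]
    return values + [0.0] * (lx - len(values))
```

In the published pipeline, such per-residue index values are computed with the iLearn AAINDEX function and then reduced to 50 dimensions by mRMR.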
The frequency of each residue in the peptide sequence was determined using AAC encoding. AAC, which demonstrates that particular residues are more prevalent in ACPs than in non-ACPs, can be used to discriminate between ACPs and non-ACPs. As a result, the AAC feature was added to represent the peptide, then extracted into a fixed-dimensional feature vector using the iLearn Python tool. All 20 natural amino acid frequencies (i.e., "ACDEFGHIKLMNPQRSTVWY") can be described by Equation (
F(AAC[t]) = N(t) / N,  t = 1, 2, ..., 20
Here, N(t) is the repetition of an amino acid of type t, N is the length of a protein or peptide sequence, and F(AAC[N]) results in a 20-dimensional feature vector representing the AAC of the peptide sequence. A conjoint feature vector was formed to represent peptides using BPF (
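The AAC computation is a direct frequency count; a minimal sketch:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(seq):
    """Frequency of each of the 20 standard residues: AAC(t) = N(t) / N."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]
```

Concatenating BPF (140), the selected AAINDEX features (50), and AAC (20) then yields the 210-dimensional vector used to represent each peptide.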
When solving scientific problems, data imbalance and insufficient data are common issues in machine learning and deep learning technologies [[
F(new) = F(i) + a x V
where F(i) is a random training sample of the peptide sequences, i = 1, ..., N, and N is the total number of positive (negative) samples; V is a 210-dimensional vector used to generate the perturbation added to F(i). In order to improve model learning, we performed peptide augmentation by adding noise to the training samples following the Gaussian distribution, and left the test set without data augmentation; because the test sets are used to evaluate model performance, they are not suitable for augmentation. Here, V is composed of three parts: a 140-dimensional zero vector corresponding to the BPFs, followed by a 50-dimensional random vector and a 20-dimensional random vector with values between 0 and 1, corresponding to AAINDEX and AAC, respectively. Thus, perturbation was added to AAINDEX and AAC while the BPFs were kept unchanged in the pseudo-sample set F(new); 'a' is the coefficient of perturbation and was set to 0.02 for ACP740 and 0.005 for ACP240.
We tried different perturbation values, generally preferring noise in the range of 0 to 1 so that the perturbed features remained on the same scale as AAINDEX and AAC. After training and testing with different values, we found 0.02 and 0.005 to be the best coefficients for ACP740 and ACP240, respectively, as the resulting distributions closely resemble those of AAINDEX and AAC. Augmenting the samples with these values led to improved prediction performance; therefore, these fixed noise coefficients were treated as standard for augmenting the samples in ACP740 and ACP240. To obtain N new samples, the sampling process was repeated N times using these noise values.
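A sketch of generating one pseudo-sample under this scheme. The uniform distribution on [0, 1) for the noise vector is an assumption drawn from the "value between 0 and 1" description; the key point is that the BPF block of V is zero so the binary profile is never perturbed:

```python
import numpy as np

def augment_sample(feature, a=0.02, bpf_dim=140, rng=None):
    """Create one pseudo-sample F(new) = F(i) + a * V.

    The first 140 (BPF) entries of V are zero, so the binary profile stays
    unchanged; the remaining 70 (AAINDEX + AAC) entries are random values
    in [0, 1), scaled by the perturbation coefficient 'a' (0.02 for ACP740,
    0.005 for ACP240).
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    v = np.zeros(len(feature))
    v[bpf_dim:] = rng.random(len(feature) - bpf_dim)
    return np.asarray(feature, dtype=float) + a * v
```

Repeating this sampling over the training set (never the test set) yields the N additional positive and negative samples.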
AdaBoost Random Forest Model
When adaptive boosting is used in conjunction with the random forest approach, there are two options. The first is "boost in the forest", in which an AdaBoost classifier is generated for each random vector k (i.e., a set of variables); a series of 'simple' AdaBoost classifiers, each with a limited number of variables, is then used to arrive at a final result [[
An AdaBoost classifier is a meta-estimator that starts with the original dataset and then fits new copies on the same dataset while adjusting the weights of poorly classified instances in order to ensure that succeeding classifiers focus on more difficult cases. Owing to its excellent performance, this classifier has gained popularity in many fields of bioinformatics [[
To evaluate the performance of ACP-ADA, we used a five-fold cross-validation strategy. Five performance metrics were used to evaluate the strength of the binary classification tasks: accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), and the Matthews correlation coefficient (MCC) [[
ACC = (TP + TN) / (TP + TN + FP + FN)
PRE = TP / (TP + FP)
SEN = TP / (TP + FN)
SPE = TN / (TN + FP)
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
where TP stands for true positive predictions, TN for true negative predictions, FP for false positive predictions, and FN for false negative predictions. In addition to these metrics, we used the F1-score to evaluate the performance of the classifiers. The detailed results are provided in the Supplementary Materials Section S2.
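The five metrics follow directly from the confusion-matrix counts; a self-contained helper:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Five evaluation metrics computed from the confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "PRE": tp / (tp + fp),
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "MCC": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

MCC combines all four counts, which is why it is used here as the comprehensive metric: it is high only when both classes are predicted well.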
The proposed ACP-ADA method can determine whether peptides are anticancer or non-anticancer based solely on the concatenation of hybrid sequence feature vectors representing order, composition, and physicochemical properties, together with data augmentation. The results obtained by ACP-ADA via five-fold cross-validation on the benchmark datasets ACP740 and ACP240 indicate that the proposed method is comparably better than, or at the very least capable of supplementing, other computational models in this area. Because of its success rate on the alternate ACP240 dataset with a lower number of (positive/negative) samples, ACP-ADA is expected to become a useful high-throughput tool that is widely used in drug development and biomedical research. This confirms the data augmentation method as an alternative to over-sampling techniques, as it can boost the performance of various sequence-based peptide and non-peptide models depending on the choice of features and classifier. In the future, we intend to consider more complex feature extraction methods and machine learning algorithms to further improve the performance of ACP prediction models.
DIAGRAM: Figure 1 Step flow diagram of ACP-ADA: binary profile feature (BPFs), amino acid index (AAINDEX) features after feature selection, and amino acid composition (AAC) were integrated to represent peptides, and the samples in the training set were augmented in the feature space. After data augmentation, the samples were used to train a machine learning model for the prediction of anticancer peptides (ACPs).
Graph: Figure 2 Comparison of feature efficacy for prediction using the BPF, AAINDEX, AAC, and their possible concatenations on the ACP740 and ACP240 datasets.
Graph: Figure 3 Comparison of the prediction models with and without data augmentation on the ACP740 and ACP240 datasets.
Graph: Figure 4 Comparison of ACP-ADA with existing methods on the ACP740 and ACP240 datasets.
Graph: Figure 5 Histogram of the sequence length of peptides in ACP740 and ACP240 datasets.
Table 1 Performance of ACP-ADA with parameters 'Lx' and 'N' on the ACP740 dataset along with performance metrics (the best metrics are shown in bold).
Lx  N%   ACC%   PRE%   SEN%   SPE%   MCC%
40  100  85.81  87.60  84.05  87.64  71.77
40  200  86.21  88.13  84.32  88.17  72.64
40  300  85.54  87.55  83.52  87.64  71.25
50  200  85.81  87.82  83.79  87.91  71.81
50  300  85.67  87.83  83.52  87.91  71.58
60  100  86.48  88.61  84.32  88.73  73.05
60  200  85.94  87.84  84.05  87.91  72.05
60  300  86.35  88.33  84.32  88.46  72.86
Table 2 Performance of ACP-ADA with parameters 'Lx' and 'N' on the ACP240 dataset along with performance metrics (the best metrics are shown in bold).
Lx  N%   ACC%   PRE%   SEN%   SPE%   MCC%
40  100  87.08  88.29  87.60  86.44  74.06
40  200  87.09  88.15  87.63  86.48  74.21
40  300  86.66  87.15  88.36  84.66  73.19
50  100  88.33  89.18  89.13  87.35  76.57
50  200  88.34  89.28  89.14  87.39  76.81
60  100  90.41  90.78  91.47  89.16  80.75
60  200  89.16  90.51  89.16  89.16  78.30
60  300  90.02  90.67  90.70  89.16  79.91
Table 3 Model Parameters for AdaBoost-Random Forest for ACP Prediction.
Parameter             Setting
Base Learner          Random Forest
Learning Rate         0.04
Seed                  121
Number of Estimators  406
Conceptualization, S.B. and H.T.; methodology, S.B.; software, S.B.; validation, S.B., H.T. and K.T.C.; resources, K.T.C.; data curation, S.B.; writing—original draft preparation, S.B. and H.T.; writing—review and editing, K.T.C. and K.-S.K.; supervision, H.T. and K.T.C.; project administration, H.T.; funding acquisition, K.T.C. All authors have read and agreed to the published version of the manuscript.
Not applicable.
Not applicable.
The datasets and codes are available at https://github.com/Sadik90/ACP-Prediction/find/main (accessed on 6 July 2022).
The authors declare no conflict of interest.
The following abbreviations are used in this manuscript:
ACP      Anticancer Peptide
BPF      Binary Profile Feature
AAINDEX  Amino Acid Index
AAC      Amino Acid Composition
mRMR     Minimum Redundancy-Maximum Relevance
MLP      Multi-Layer Perceptron
SVM      Support Vector Machine
RF       Random Forest
KN       k-Nearest Neighbours
ET       Extremely Randomized Tree
GB       Gradient Boosting
ADA      Adaptive Boosting with Random Forest
The following supporting information can be downloaded at: https://
By Sadik Bhattarai; Kyu-Sik Kim; Hilal Tayara and Kil To Chong