Interactions among proteins are essential to all life activities and underlie the metabolic activity of cells. Studying protein-protein interactions (PPIs) helps researchers interpret protein function and decode biological phenomena, and it has great practical value in the design of new drugs. Although many high-throughput techniques have been devised for large-scale detection of PPIs, these methods remain expensive and time-consuming. Computational methods for predicting PPIs at the scale of the entire proteome are therefore much needed. In this article, we propose a new approach to predicting PPIs that combines the Rotation Forest (RF) classifier with a matrix-based representation of protein sequences. We apply the Position-Specific Scoring Matrix (PSSM), which contains biological evolutionary information, to represent protein sequences, and extract features with the two-dimensional Principal Component Analysis (2DPCA) algorithm. The descriptors are then sent to the rotation forest classifier for classification. When applied to the yeast PPI data, the proposed method obtained 97.43% prediction accuracy with 94.92% sensitivity at a precision of 99.93%. To evaluate its performance, we compared it with other methods on the same dataset and validated it on independent datasets. The results show that the proposed method is an appropriate and promising method for predicting PPIs.
Since the interactions among proteins play an extremely important role in almost all biological processes, many researchers have designed innovative techniques for detecting Protein-Protein Interactions (PPIs) in post genome era[
More recently, researchers have become increasingly interested in determining whether proteins interact by using information obtained directly from the protein amino acid sequence[
In this article, we develop a new sequence-based approach to predicting PPIs that combines matrix-based protein sequence descriptors with the Rotation Forest (RF). In detail, we first represent each protein sequence as a Position-Specific Scoring Matrix (PSSM) and use the two-dimensional Principal Component Analysis (2DPCA) algorithm to extract numerical descriptors that characterize the protein amino acid sequence. We then construct the feature vector of a protein pair by coding the two protein vectors in the pair. Finally, the feature vectors of the protein pairs are sent to the RF classifier for classification. To assess the ability of the proposed model to predict PPIs, we verify it on the Yeast and Helicobacter pylori datasets. In the experiments, our model achieved 97.43% and 88.07% prediction accuracy with 94.92% and 78.20% sensitivity at a precision of 99.93% and 97.44% on these two datasets, respectively. Furthermore, we evaluated the proposed model on independent datasets (C.elegans, E.coli, H.sapiens and M.musculus), where it reached prediction accuracies of 91.43%, 99.93%, 92.00% and 90.73%, respectively.
In this study, we use five-fold cross-validation to verify the predictive power of our model. All samples are randomly divided into 5 subsets of almost equal size, each containing both interacting and non-interacting protein pairs. Each time, four subsets are used as the training set and the remaining subset as the test set; the process is repeated five times so that every subset serves as the test set once. The performance of the method is the average over the 5 test sets. The evaluation criteria used to estimate the predictive power of our model are accuracy (Accu.), sensitivity (Sen.), precision (Prec.), and the Matthews correlation coefficient (MCC), calculated as follows:

$$\mathrm{Accu.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where True Positive (TP) is the number of positive samples (interacting pairs) correctly classified as positive, False Positive (FP) is the number of negative samples (non-interacting pairs) incorrectly classified as positive, True Negative (TN) is the number of negative samples correctly classified as negative, and False Negative (FN) is the number of positive samples incorrectly classified as negative.
We also produce Receiver Operating Characteristic (ROC) curves[
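As a minimal sketch of how an ROC curve and its area (AUC) are obtained from classifier scores, the following code sweeps the decision threshold over the sorted scores and integrates the curve with the trapezoidal rule. This is the standard construction, not necessarily the tooling used in the experiments; the function names and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold.

    labels: 1 for interacting pairs, 0 for non-interacting pairs.
    scores: classifier confidence that a pair is interacting.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    i = 0
    while i < len(pairs):
        # Samples with tied scores cross the threshold together.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Toy example (illustrative values only).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
print(round(auc(fpr, tpr), 4))
```

The x-axis of the resulting curve is the false positive rate and the y-axis is the true positive rate, matching the ROC figures in this article.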
We appraise the ability of our model using the gold standard datasets. To ensure the stability of the experimental results, five-fold cross-validation is used in the experiments. The parameters of the rotation forest (the number of feature subsets K and the number of decision trees L) were tuned by grid search over a range of values. Weighing the accuracy against the time cost of the rotation forest, we selected K = 20 and L = 2.
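The grid search over K and L can be sketched as an exhaustive loop over candidate values. In this sketch `score_fn` is a hypothetical callable that would, in practice, train a rotation forest with the given K and L and return its mean five-fold accuracy; the toy score function and the candidate ranges below are placeholders, not the ranges searched in the experiments. The sketch also ranks candidates by score alone, whereas the selection described above additionally weighs time cost.

```python
from itertools import product


def grid_search(score_fn, k_values, l_values):
    """Return (best_score, best_k, best_l) over all (K, L) combinations."""
    best = None
    for k, l in product(k_values, l_values):
        score = score_fn(k, l)
        if best is None or score > best[0]:
            best = (score, k, l)
    return best


def toy_score(k, l):
    """Placeholder score with a single peak at K = 3, L = 5 (illustrative)."""
    return -((k - 3) ** 2) - (l - 5) ** 2


best_score, best_k, best_l = grid_search(toy_score, [1, 3, 5], [2, 5, 8])
print(best_k, best_l)
```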
The experimental results of the RF classifier combined with the matrix-based representation of protein amino acid sequences are summarized in Table 1. As Table 1 shows, the average accuracy of our approach is as high as 97.43%. To show the predicted results more fully, we also calculated precision, sensitivity, MCC, and AUC. Our model achieved a sensitivity of 94.92%, a precision of 99.93%, an MCC of 94.97%, and an AUC of 97.51%. Furthermore, the standard deviations of accuracy, sensitivity, precision, MCC, and AUC are 0.30%, 0.43%, 0.17%, 0.59% and 0.47%, respectively. Figure 1 plots the ROC curve generated by our method on the Yeast dataset; the x-axis shows the false positive rate (FPR) and the y-axis the true positive rate (TPR).
Figure 1: The ROC curves obtained on the Yeast dataset using the proposed method.
To further evaluate the ability of our approach to predict PPIs, we tested it on the H. pylori dataset, using the same classifier parameters and feature extraction algorithm. Table 2 lists the cross-validation results. We achieved an accuracy of 88.07%, a sensitivity of 78.20%, a precision of 97.44%, an MCC of 77.66%, and an AUC of 88.76% on the H. pylori dataset. In addition, Table 2 shows that the standard deviations of accuracy, sensitivity, precision, MCC, and AUC are 0.77%, 0.97%, 0.90%, 1.46% and 1.32%, respectively. Figure 2 plots the ROC curve generated by our method on the H. pylori dataset.
Figure 2: The ROC curves obtained on the H. pylori dataset using the proposed method.
Machine learning has been applied successfully and reliably to PPI prediction, and the support vector machine (SVM) is one of the best-known algorithms based on statistical learning theory. To verify the predictive ability of our approach, we compare the RF classifier with the SVM classifier using the same feature extraction method. For the SVM classifier, the LIBSVM we used can be downloaded at
The prediction results of the SVM combined with the protein sequence descriptor are listed in Table 3. The average accuracy of the SVM on the Yeast dataset is 87.29%, with the five folds yielding 87.84%, 85.47%, 87.71%, 89.23%, and 86.21%. In contrast, the rotation forest classifier achieves an average accuracy of 97.43%. To compare the two classifiers more comprehensively, we also calculated precision, sensitivity, MCC, and AUC. As Table 3 shows, the SVM classifier achieved a sensitivity of 84.42%, a precision of 89.58%, an MCC of 74.73%, and an AUC of 94.59%, with standard deviations of accuracy, sensitivity, precision, MCC, and AUC of 1.48%, 1.88%, 2.02%, 2.93% and 0.56%, respectively. The accuracy, sensitivity, precision, MCC and AUC of the RF classifier are thus 10.14, 10.50, 10.35, 20.24 and 2.92 percentage points higher than those of the SVM classifier; on every evaluation criterion the SVM classifier scores lower than our model. The ROC curves of the support vector machine classifier on the Yeast dataset are shown in Fig. 3.
Figure 3: The ROC curves obtained on the Yeast dataset using the SVM classifier.
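The arithmetic behind this comparison can be checked directly from the values quoted in the text. The short script below recomputes the mean and the sample standard deviation of the five SVM fold accuracies and the differences between the RF and SVM averages; the variable names are illustrative, and the use of the sample (n - 1) standard deviation is an assumption that happens to reproduce the reported 1.48%.

```python
import statistics

# Per-fold SVM accuracies on the Yeast dataset, as quoted in the text (%).
svm_fold_accuracy = [87.84, 85.47, 87.71, 89.23, 86.21]

svm_mean = statistics.mean(svm_fold_accuracy)
svm_std = statistics.stdev(svm_fold_accuracy)  # sample standard deviation
print(round(svm_mean, 2), round(svm_std, 2))

# Average results of both classifiers, as quoted in the text (%).
rf = {"accuracy": 97.43, "sensitivity": 94.92, "precision": 99.93,
      "mcc": 94.97, "auc": 97.51}
svm = {"accuracy": 87.29, "sensitivity": 84.42, "precision": 89.58,
       "mcc": 74.73, "auc": 94.59}

# Differences are in percentage points.
gains = {name: round(rf[name] - svm[name], 2) for name in rf}
print(gains)
```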
To further evaluate the performance of our approach, we also compared it with different descriptors. We selected the Auto Covariance (AC) and Discrete Cosine Transform (DCT) feature extraction algorithms and ran experiments on the Yeast dataset; these algorithms are introduced in the supplementary file. In addition, we evaluated the protein descriptors without any feature extraction. Table 4 summarizes the comparison of the proposed feature descriptor with these three descriptors. Our feature descriptor obtained the best accuracy, sensitivity, and MCC, and its precision is only 0.07% lower than that of the AC and DCT descriptors, which achieved the highest precision. This indicates that the 2DPCA algorithm can effectively extract protein features and helps improve the prediction performance of the model.
In the past few years, many research teams have proposed computational methods for PPI prediction. Comparing our model with these methods on the Yeast and H. pylori datasets allows a clearer evaluation of the proposed method. We selected accuracy, precision, sensitivity, and MCC as the evaluation indicators, which are listed in Tables 5 and 6. Table 5 summarizes the results of the different approaches on the Yeast dataset. The other methods yield accuracies from 75.08% to 89.33%, sensitivities from 75.81% to 89.93%, and precisions from 74.75% to 90.24%, all lower than the corresponding values of 97.43%, 94.92%, and 99.93% achieved by our method, which also reaches an MCC of 94.97%. Table 6 shows the performance of the different models on the H. pylori dataset. The other approaches yield accuracies from 75.80% to 87.50%, sensitivities from 69.80% to 88.95%, and precisions from 80.20% to 86.15%, whereas our method obtained an accuracy of 88.07%, a sensitivity of 78.20%, a precision of 97.44%, and an MCC of 77.66%. Our accuracy and precision therefore exceed the highest values reported by the other approaches, while our sensitivity is below the highest reported value (88.95%) and our MCC is slightly lower than the best competing value.
To further assess the proposed model, we verified its performance on independent datasets. We used all 11188 pairs from the Yeast dataset as the training set for the final prediction model; the test sets are the C.elegans, E.coli, H.sapiens and M.musculus datasets from the DIP database, which contain 4013, 6954, 1413, and 313 protein pairs, respectively. In the experiment, we applied the same matrix representation and feature extraction algorithm to these datasets and used the same rotation forest parameters. Table 7 lists the prediction results of our method on the four independent datasets: an accuracy of 91.43% on the C.elegans dataset, 99.93% on the E.coli dataset, 92.00% on the H.sapiens dataset, and 90.73% on the M.musculus dataset. These results demonstrate that our approach is also suitable for predicting interactions in other species.
In this article, we develop an efficient and practical prediction approach that combines protein sequence information with feature descriptors to predict protein interactions accurately and at high speed. The most important challenge for sequence-based algorithms is to find features that adequately represent the information of protein interactions. For this purpose, we transform protein sequences into PSSMs and use the 2DPCA algorithm to extract their features, capturing as much as possible of the information hidden in the primary sequence of the protein. The rotation forest is then applied to make the prediction reliable. In comparison with the SVM classifier and other approaches, our model achieved excellent results. Furthermore, we validated our model on independent datasets, and the results show that it performs well in the prediction of protein interactions. In future research, we will focus on finding better ways to describe protein sequences in order to identify interacting and non-interacting protein pairs accurately.
In the experiments we used the real Yeast PPIs dataset, which was collected from Saccharomyces cerevisiae core subset of Database of Interacting Proteins (DIP)[
Position-Specific Scoring Matrix (PSSM) is proposed by Gribskov et al.[
In our experiment, we introduced the Position-Specific Iterated BLAST (PSI-BLAST) tool[
Two-dimensional Principal Component Analysis (2DPCA)[
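Assume that the number of samples is N and that the ith sample matrix is $V_i$ ($i = 1, 2, \ldots, N$). The average matrix of all samples is

$$\bar{V} = \frac{1}{N}\sum_{i=1}^{N} V_i$$

In the 2DPCA algorithm, the matrix V is projected onto the optimal projection axis X, which gives the following formula:

$$F = VX$$

Thus we obtain an M-dimensional projection vector F. The optimal projection axis X is determined by the dispersion of the projected vectors F, measured by the following criterion:

$$J(X) = \mathrm{trace}(S_x)$$

where $S_x$ is the covariance matrix of the projected vectors F.

Define the total scatter matrix $G_t$ as

$$G_t = E\left[(V - EV)^T (V - EV)\right]$$

The formula for calculating $G_t$ from the N samples is

$$G_t = \frac{1}{N}\sum_{i=1}^{N} (V_i - \bar{V})^T (V_i - \bar{V})$$

Therefore, the criterion function can be written as

$$J(X) = \mathrm{trace}(X^T G_t X)$$

where X is a unit column vector. The orthonormal eigenvectors of $G_t$ corresponding to its d largest eigenvalues constitute the optimal projection axes $X_1, X_2, \ldots, X_d$.

A new set of feature vectors $F_1, F_2, \ldots, F_d$ is then obtained by projecting V onto these axes:

$$F_k = V X_k, \quad k = 1, 2, \ldots, d$$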
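The 2DPCA steps described above can be sketched in NumPy as follows. This is a minimal illustration of the standard 2DPCA formulation, not the authors' implementation. It assumes that all input matrices have the same shape (M rows, n columns); how variable-length PSSMs are brought to a common shape is not specified here. The function names `fit_2dpca` and `project_2dpca` and the random test data are hypothetical.

```python
import numpy as np


def fit_2dpca(matrices, d):
    """Return the d optimal projection axes as the columns of an (n, d) array.

    matrices: array-like of shape (N, M, n), i.e. N samples of size M x n
    (assumption: all samples share the same shape).
    """
    V = np.asarray(matrices, dtype=float)
    centered = V - V.mean(axis=0)
    # Total scatter matrix G_t = (1/N) * sum_i (V_i - mean)^T (V_i - mean).
    G_t = np.einsum("imj,imk->jk", centered, centered) / V.shape[0]
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(G_t)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order]


def project_2dpca(matrix, axes):
    """Project one M x n matrix onto the axes: F_k = V X_k for each column."""
    return np.asarray(matrix, dtype=float) @ axes


# Random stand-in data (illustrative only): 10 samples of size 6 x 4.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10, 6, 4))
axes = fit_2dpca(samples, d=2)
features = project_2dpca(samples[0], axes)  # shape (6, 2)
print(features.shape)
```

Flattening the resulting M x d feature matrix gives one descriptor vector per protein; whether the descriptors of a pair are concatenated or combined in another way is left open in this sketch.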
Rotation forest[
Assume that {x
Select a suitable parameter K and randomly divide S into K disjoint subsets, each containing n/K features.
Let S
Principal component analysis is performed on X′i,j to obtain M
The coefficients obtained in the matrix M
During the prediction period, a test sample x generated by the classifier F
Then assign the category with the largest λ
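The core of the procedure outlined above, building a rotation matrix for one ensemble member and combining the members' confidences, can be sketched in NumPy. This is a simplified illustration under stated assumptions rather than the authors' implementation: the bootstrap fraction of 0.75 follows the commonly cited rotation forest recipe and is not confirmed by the text above, the training of the decision trees on the rotated data is omitted, and the function names are hypothetical.

```python
import numpy as np


def build_rotation_matrix(X, K, rng, sample_fraction=0.75):
    """Build the rotation matrix for one ensemble member.

    X: (n_samples, n_features) training data.
    K: number of disjoint feature subsets.
    sample_fraction: bootstrap size per subset (assumed value, see lead-in).
    """
    n_samples, n_features = X.shape
    # Randomly split the feature indices into K disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), K)
    R = np.zeros((n_features, n_features))
    n_boot = max(2, int(sample_fraction * n_samples))
    for subset in subsets:
        # Bootstrap sample of the rows, restricted to this feature subset.
        rows = rng.choice(n_samples, size=n_boot, replace=True)
        X_sub = X[np.ix_(rows, subset)]
        X_sub = X_sub - X_sub.mean(axis=0)
        # PCA: eigenvectors of the scatter matrix are the principal axes.
        _, eigvecs = np.linalg.eigh(X_sub.T @ X_sub)
        # Place the coefficients at the original feature positions, so the
        # columns of R already follow the original feature order.
        R[np.ix_(subset, subset)] = eigvecs[:, ::-1]
    return R


def predict_by_average_confidence(member_probas):
    """Assign each sample to the class with the largest mean confidence.

    member_probas: array of shape (L, n_samples, n_classes).
    """
    return np.mean(member_probas, axis=0).argmax(axis=1)


# Random stand-in data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
R = build_rotation_matrix(X, K=3, rng=rng)
X_rotated = X @ R  # each decision tree would be trained on its own X @ R
```

Because every ensemble member draws its own feature split and bootstrap samples, each tree sees a differently rotated version of the same data, which is the source of diversity in the ensemble.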
Lei Wang and Zhu-Hong You contributed equally.
This work is supported in part by the National Science Foundation of China, under Grants 61373086, 61572506, 61702444, in part by Guangdong Natural Science Foundation, under Grant 2014A030313555, in part by the Pioneer Hundred Talents Program of Chinese Academy of Sciences, and in part by the CCF-Tencent Open Fund. The authors would like to thank all anonymous reviewers for their constructive advice.
L.W., Z.Y. and X.Y. conceived the algorithm, carried out the analyses, prepared the data sets, carried out experiments, and wrote the manuscript. S.X., F.L., L.L., W.Z. and Y.Z. designed, performed and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.
The authors declare no competing interests.
- 1 Zhang QC, et al., Structure-based prediction of protein-protein interactions on a genome-wide scale, Nature, 2012, 490, 556, 10.1038/nature11503
- 2 Krogan NJ, Global landscape of protein complexes in the yeast Saccharomyces cerevisiae, Nature, 2006, 440, 637-643, 10.1038/nature04670
- 3 Ito T, A comprehensive two-hybrid analysis to explore the yeast protein interactome, Proceedings of the National Academy of Sciences of the United States of America, 2001, 98, 4569-4574, 10.1073/pnas.061034498
- 4 Ho Y, Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry, Nature, 2002, 415, 180-183, 10.1038/415180a
- 5 Templin MF, Protein microarrays: Promising tools for proteomic research, Proteomics, 2003, 3, 2155-2166, 10.1002/pmic.200300600
- 6 Trinkle-Mulcahy L, Identifying specific protein interaction partners using quantitative mass spectrometry and bead proteomes, Journal of Cell Biology, 2008, 183, 223-239, 10.1083/jcb.200805092
- 7 Guo Y, Yu L, Wen Z, Li M, Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences, Nucleic Acids Research, 2008, 36, 3025-3030, 10.1093/nar/gkn159
- 8 You ZH, Yin Z, Han K, Huang DS, Zhou X, A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network, BMC Bioinformatics, 2010, 11, 10.1186/1471-2105-11-343
- 9 Zhu L, You ZH, Huang DS, Wang B, LSE: A Novel Robust Geometric Approach for Modeling Protein-Protein Interaction Networks, PLoS ONE, 2013, 8, 10.1371/journal.pone.0058368
- 10 Xia JF, You ZH, Wu M, Wang SL, Zhao XM, Improved Method for Predicting pi-Turns in Proteins Using a Two-Stage Classifier, Protein and Peptide Letters, 2010, 17, 1117-1122, 10.2174/092986610791760315
- 11 You ZH, Lei YK, Gui J, Huang DS, Zhou XB, Using manifold embedding for assessing and predicting protein interactions from high-throughput experimental data, Bioinformatics, 2010, 26, 2744-2751, 10.1093/bioinformatics/btq510
- 12 You ZH, Li LP, Yu HJ, Chen SF, Wang SL, Increasing Reliability of Protein Interactome by Combining Heterogeneous Data Sources with Weighted Network Topological Metrics, Advanced Intelligent Computing Theories and Applications, 2010, 6215, 657-663, 10.1007/978-3-642-14922-1_82
- 13 Lei YK, You ZH, Ji Z, Zhu L, Huang DS, Assessing and predicting protein interactions by combining manifold embedding with multiple information integration, BMC Bioinformatics, 2012, 13, 10.1186/1471-2105-13-s7-s3
- 14 Zhang QC, Structure-based prediction of protein-protein interactions on a genome-wide scale (vol 490, pg 556, 2012), Nature, 2013, 495, 127, 10.1038/nature11977
- 15 You ZH, Yu JZ, Zhu L, Li S, Wen ZK, A MapReduce based parallel SVM for large-scale predicting protein-protein interactions, Neurocomputing, 2014, 145, 37-43, 10.1016/j.neucom.2014.05.072
- 16 Gao ZG, et al., Ens-PPI: A Novel Ensemble Classifier for Predicting the Interactions of Proteins Using Autocovariance Transformation from PSSM, BioMed Research International, 2016, 8, 10.1155/2016/4563524
- 17 Zhao XM, Wang Y, Chen LN, Aihara K, Protein domain annotation with integration of heterogeneous information sources, Proteins-Structure Function and Bioinformatics, 2008, 72, 461-473, 10.1002/prot.21943
- 18 Huang Y-A, Construction of reliable protein-protein interaction networks using weighted sparse representation based classifier with pseudo substitution matrix representation features, Neurocomputing, 2016, 218, 131-138, 10.1016/j.neucom.2016.08.063
- 19 Wang L, An ensemble approach for large-scale identification of protein-protein interactions using the alignments of multiple sequences, Oncotarget, 2017, 8, 5149
- 20 Yang YD, Faraggi E, Zhao HY, Zhou YQ, Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates, Bioinformatics, 2011, 27, 2076-2082, 10.1093/bioinformatics/btr350
- 21 Yin Z, et al., Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens, BMC Bioinformatics, 2008, 9, 10.1186/1471-2105-9-264
- 22 Yang YD, Zhou YQ, Specific interactions for ab initio folding of protein terminal regions with secondary structures, Proteins-Structure Function and Bioinformatics, 2008, 72, 793-803, 10.1002/prot.21968
- 23 Chen W, Feng PM, Lin H, Chou KC, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Research, 2013, 41, 10.1093/nar/gks1450
- 24 Lin H, The modified Mahalanobis Discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition, Journal of Theoretical Biology, 2008, 252, 350-356, 10.1016/j.jtbi.2008.02.004
- 25 Wang L, Advancing the prediction accuracy of protein-protein interactions by utilizing evolutionary information from position-specific scoring matrix and ensemble classifier, Journal of Theoretical Biology, 2017, 418, 105-110, 10.1016/j.jtbi.2017.01.003
- 26 Wang L, et al., An improved efficient rotation forest algorithm to predict the interactions among proteins, Soft Computing, 2017, 1-9
- 27 Luo X, et al., A Highly Efficient Approach to Protein Interactome Mapping Based on Collaborative Filtering Framework, Scientific Reports, 2015, 5, 10.1038/srep07702
- 28 Zhao XM, Wang Y, Chen LN, Aihara K, Gene function prediction using labeled and unlabeled data, BMC Bioinformatics, 2008, 9, 10.1186/1471-2105-9-57
- 29 Pitre S, PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs, BMC Bioinformatics, 2006, 7, 10.1186/1471-2105-7-365
- 30 Shen J, Predicting protein-protein interactions based only on sequences information, Proceedings of the National Academy of Sciences of the United States of America, 2007, 104, 4337-4341, 10.1073/pnas.0607879104
- 31 Zweig MH, Campbell G, Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine, Clinical Chemistry, 1993, 39, 561-577
- 32 Chang CC, Lin CJ, LIBSVM: A Library for Support Vector Machines, ACM Transactions on Intelligent Systems and Technology, 2011, 2, 10.1145/1961189.1961199
- 33 Xenarios I, DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions, Nucleic Acids Research, 2002, 30, 303-305, 10.1093/nar/30.1.303
- 34 Martin S, Roe D, Faulon JL, Predicting protein-protein interactions using signature products, Bioinformatics, 2005, 21, 218-226, 10.1093/bioinformatics/bth483
- 35 Gribskov M, McLachlan AD, Eisenberg D, Profile analysis: detection of distantly related proteins, Proceedings of the National Academy of Sciences of the United States of America, 1987, 84, 4355-4358, 10.1073/pnas.84.13.4355
- 36 Altschul SF, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 1997, 25, 3389-3402, 10.1093/nar/25.17.3389
- 37 Yang J, Zhang D, Frangi AF, Yang JY, Two-dimensional PCA: A new approach to appearance-based face representation and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26, 131-137, 10.1109/TPAMI.2004.1261097
- 38 Yang J, Yang JY, From image vector to matrix: a straightforward image projection technique - IMPCA vs. PCA, Pattern Recognition, 2002, 35, 1997-1999, 10.1016/S0031-3203(02)00040-7
- 39 Wang L, RFDT: A Rotation Forest-based Predictor for Predicting Drug-Target Interactions Using Drug Structure and Protein Sequence Information, Current Protein & Peptide Science, 2018, 19, 445-454, 10.2174/1389203718666161114111656
- 40 Zhou YZ, Gao Y, Zheng YY, Prediction of Protein-Protein Interactions Using Local Description of Amino Acid Sequence, Advances in Computer Science and Education Applications, Pt II, 2011, 202, 254-262, 10.1007/978-3-642-22456-0_37
- 41 Yang L, Xia J-F, Gui J, Prediction of Protein-Protein Interactions from Protein Sequence Using Local Descriptors, Protein and Peptide Letters, 2010, 17, 1085-1090, 10.2174/092986610791760306
- 42 You ZH, Lei YK, Zhu L, Xia J, Wang B, Prediction of protein-protein interactions from amino acid sequences with ensemble extreme learning machines and principal component analysis, BMC Bioinformatics, 2013, 14, 10.1186/1471-2105-14-s8-s10
- 43 Bock JR, Gough DA, Whole-proteome interaction mining, Bioinformatics, 2003, 19, 125-134, 10.1093/bioinformatics/19.1.125
- 44 Nanni L, Hyperplanes for predicting protein-protein interactions, Neurocomputing, 2005, 69, 257-263, 10.1016/j.neucom.2005.05.007
- 45 Nanni L, Lumini A, An ensemble of K-local hyperplanes for predicting protein-protein interactions, Bioinformatics, 2006, 22, 1207-1210, 10.1093/bioinformatics/btl055
- 46 Liu B, et al., QChIPat: a quantitative method to identify distinct binding patterns for two biological ChIP-seq samples in different experimental conditions, BMC Genomics, 2013, 14, 10.1186/1471-2164-14-s8-s3
Supplementary information accompanies this paper at 10.1038/s41598-018-30694-1.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
By Lei Wang; Zhu-Hong You; Xin Yan; Shi-Xiong Xia; Feng Liu; Li-Ping Li; Wei Zhang and Yong Zhou