Recent research has found that tumor cell invasion, proliferation, and other biological processes are controlled by circular RNA (circRNA). Understanding the association between circRNAs and diseases is an important way to explore the pathogenesis of complex diseases and promote disease-targeted therapy. Most methods based on the analysis of high-throughput expression data, such as k-mer and PSSM, tend to assume that functionally similar nucleic acids lack direct linear homology; they disregard positional information and quantify only nonlinear sequence relationships. However, in many complex diseases, the nonlinear sequence relationship between pathogenic and ordinary nucleic acids differs very little. Therefore, analyzing positional information can help to predict the complex associations between circRNAs and diseases. To fill this gap, we propose a new method, named iCDA-CGR, to predict circRNA-disease associations. In particular, we introduce circRNA sequence information and quantify the nonlinear sequence relationship of circRNAs by Chaos Game Representation (CGR) technology based on biological sequence position information, for the first time in a circRNA-disease prediction model. In the cross-validation experiment, our method achieved an AUC of 0.8533, which was higher than that of other existing methods. In the validation on independent data sets, including circ2Disease, circRNADisease and CRDD, the prediction accuracy of iCDA-CGR reached 95.18%, 90.64% and 95.89%, respectively. Moreover, in the case studies, 19 of the top 30 circRNA-disease associations predicted by iCDA-CGR on the circR2Disease dataset were confirmed by newly published literature. These results demonstrate that iCDA-CGR is robust and stable, and can provide highly credible candidates for biological experiments.
Author summary: Understanding the association between circRNAs and diseases is an important step to explore the pathogenesis of complex diseases and promote disease-targeted therapy. Computational methods contribute to discovering potential disease-related circRNAs. Based on the analysis of the position information of biological sequences, the iCDA-CGR model is proposed to predict circRNA-disease associations by integrating multi-source information, including circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information. In particular, the position information of circRNA sequences is introduced into a circRNA-disease association prediction model for the first time. The promising results on cross-validation and independent data sets demonstrate the effectiveness of the proposed model. We further carried out case studies, and 19 of the top 30 associations scored by the proposed model were confirmed by recent experimental reports. The results show that the iCDA-CGR model can effectively predict potential circRNA-disease associations and provide highly reliable candidates for biological experiments, thus helping to further understand complex disease mechanisms.
Circular RNA (circRNA) is a type of non-coding RNA without 5' end caps or a 3' end poly (A) tails [[
In recent years, in order to unify the standards of circRNAs obtained by experiment, many databases were established as circBase, CIRCpedia, deepBase, CircNet and circRNADb [[
The above analysis shows that although current computational models have achieved good results, they also have some defects. First, the training data used by current models are limited, which affects the robustness of the models. The lack of training data also limits coverage: the potential associations that these models can predict all number around 10,000. Secondly, they are mainly based on a single data description method, which does not integrate the behavior information and attribute information of circRNAs and diseases in the network to comprehensively define their features, resulting in limited prediction performance. Finally, they do not take circRNA sequence information into account and cannot accurately measure circRNA similarity. Therefore, to address the drawbacks of current computational models, we propose the iCDA-CGR model to identify CircRNA-Disease Associations based on Chaos Game Representation. By introducing the circFunBase database and sequence information, the problems of limited model coverage and limited predictive performance are addressed. iCDA-CGR integrates multi-source information, including circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information. In particular, iCDA-CGR extracts the biological sequence position information and quantifies the nonlinear sequence relationship of circRNAs by Chaos Game Representation (CGR) technology [[
Graph: Fig 1 The workflow of iCDA-CGR model to predict potential circRNA-disease associations.
In the past year, a number of benchmark databases have been proposed for collecting circRNA-disease associations, such as circR2Disease, circRNADisease, circFunBase, and Circ2Disease, which contain the association between experimentally validated diseases and circRNAs [[
circR2Disease. To evaluate the reliability of our method, the widely used benchmark set circR2Disease was selected. The dataset was preprocessed to remove duplicate entries and non-human circRNA-disease associations. Specifically, after removing the circRNAs for which no gene symbol could be found, we obtained 612 confirmed circRNA-disease associations involving 533 circRNAs and 89 diseases, as shown in Table 1. The base dataset circR2Disease can be defined as:
Graph
where Z
Graph
Table 1 Data distribution of the benchmark sets circR2Disease and circFunBase of circRNA-disease associations.

| Benchmark set | circRNAs | Diseases | Associations |
|---|---|---|---|
| circR2Disease | 533 | 89 | 612 |
| circFunBase | 2597 | 67 | 2984 |
circFunBase. CircFunBase is a database that provides high-quality functional circRNA resources, yet few models have used it. To address the small coverage of current models, we also performed experiments on this dataset. After removing circRNAs that did not match any gene symbol, 2984 confirmed circRNA-disease associations were obtained, involving 2597 circRNAs and 67 diseases, as shown in Table 1. The benchmark database circFunBase can be defined as:
Graph
where Z
Sequence information and gene symbols information for circRNAs are provided by many public databases such as circBase, CIRCpedia, deepBase, CircNet and circRNADb[[
It is an iterative mapping technique for processing sequences[[
Graph
Where ν is the nucleotide contribution factor and we set it to be 0.5. g
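The iterative mapping can be illustrated with a minimal sketch. It assumes the standard CGR update rule, in which each new point moves a fraction ν of the way from the previous point toward the corner assigned to the current nucleotide, starting from the centre of the unit square. The corner layout in `CORNERS` and the function name `cgr_points` are illustrative assumptions and are not taken from the source.

```python
from typing import Dict, List, Tuple

# Assumed corner assignment on the unit square; the source's layout is not shown.
CORNERS: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(sequence: str, nu: float = 0.5) -> List[Tuple[float, float]]:
    """Return one CGR point per nucleotide, starting from the square's centre."""
    x, y = 0.5, 0.5
    points: List[Tuple[float, float]] = []
    # RNA sequences use U; map it to T so both alphabets are accepted.
    for base in sequence.upper().replace("U", "T"):
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move a fraction nu of the way toward the nucleotide's corner.
        x += nu * (cx - x)
        y += nu * (cy - y)
        points.append((x, y))
    return points
```

With ν = 0.5, every prefix of the sequence maps to a distinct point, so the point cloud keeps both the composition and the order of the nucleotides.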
The Medical Subject Headings (MeSH) database categorizes diseases rigorously, which helps to calculate the semantic similarity of diseases. It can be downloaded from https://
Graph
We define the amount of DAGs which includes disease r as n(DAGs(r)) and the quantity of all diseases as n(disease). Therefore, the semantic similarity score
Graph
Graph
where N
Many studies have applied the Gaussian interaction profile kernel (GAS) to measure the similarity between diseases, based on the assumption that pathologically similar diseases tend to be associated with functionally similar circRNAs. In this study, the
Graph
Graph
Where
Graph
Graph
We define the parameter as the width parameter of the function, τ
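As a hedged illustration of a Gaussian interaction profile kernel, the sketch below follows the commonly used formulation: each entity is described by its row of the binary association matrix, and the width parameter is normalised by the mean squared norm of all rows. The names `gip_kernel`, `assoc` and `tau_prime`, and the default value 1.0, are assumptions made for this example; the source's exact normalisation is not recoverable from the text.

```python
import math
from typing import List


def gip_kernel(assoc: List[List[int]], tau_prime: float = 1.0) -> List[List[float]]:
    """Gaussian interaction profile similarity between the rows of `assoc`."""
    n = len(assoc)
    # Mean squared norm of the interaction profiles, used to scale the width.
    mean_sq_norm = sum(sum(v * v for v in row) for row in assoc) / n
    if mean_sq_norm == 0:
        raise ValueError("association matrix contains no known associations")
    tau = tau_prime / mean_sq_norm
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(assoc[i], assoc[j]))
            sim[i][j] = math.exp(-tau * sq_dist)
    return sim
```

Passing the disease-by-circRNA matrix gives a disease similarity; passing its transpose gives the corresponding circRNA similarity.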
By analyzing the disease similarity measures from multiple perspectives, we obtain the similarity matrices, including
Graph
Graph
Graph
Graph
Graph
Previous research has found that circular RNA regulates the activity of RNA polymerase and promotes the transcription of parental genes. Because RNAs that affect the same human disease tend to have similar functions [[
Graph
Where the elements in
Graph
Graph
Many studies have used the Gaussian interaction profile kernel (GAS) to measure the similarity between biomolecules [[
Graph
Graph
Graph
Where
Graph
Existing sequence alignment algorithms quantify only position information or only non-linear information, and few algorithms have been proposed that combine both. Therefore, a new CGR-based method is proposed that uses the Pearson correlation coefficient to quantify the similarity and difference between sequences in terms of both position and non-linear information. The specific calculation process is as follows.
Firstly, the CGR space is divided into N
Graph: Fig 2 A) The CGR of hsa_circ_0005931, plotted with the average coordinates of each cell of the 8 × 8 grid represented. B) A matrix of the nucleotides of hsa_circ_0005931 with probabilities for the chaos game representation.
Graph
Secondly, the abscissa point.x and ordinate point.y in each grid are accumulated respectively to quantify position information.
Graph
Graph
Thirdly, we calculate the z-scores of each grid Z
Graph
Graph
Finally, each grid can be described as three attributes, and we fused the attributes to construct the descriptors descriptors(c( i )) to determine the sequence similarity
Graph
Graph
Graph
where Cov(descriptors(c( i ))) is the covariance of descriptors(c( i )), D(descriptors(c( i ))) is the variance of descriptors(c( i )). The size of circRNA sequence similarity matrix
Graph
Graph: Fig 3 The workflow of circRNA sequence-based similarity.
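The four steps of the sequence similarity calculation can be sketched as follows. This is an illustrative reading of the text, not the authors' code: it assumes an 8 × 8 grid (as in Fig 2), the same hypothetical corner layout for the CGR, a z-score computed over the per-cell point counts, and a descriptor formed by concatenating the three per-cell attributes. All function names are hypothetical.

```python
import math
from typing import List, Tuple

# Assumed corner layout; the source's assignment is not shown.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str, nu: float = 0.5) -> List[Tuple[float, float]]:
    """CGR coordinates of a sequence, starting from the centre of the square."""
    x, y = 0.5, 0.5
    pts = []
    for base in seq.upper().replace("U", "T"):
        if base in CORNERS:
            cx, cy = CORNERS[base]
            x, y = x + nu * (cx - x), y + nu * (cy - y)
            pts.append((x, y))
    return pts


def grid_descriptor(seq: str, side: int = 8) -> List[float]:
    """Concatenate per-cell sums of x, sums of y and z-scores of point counts."""
    cells = side * side
    count = [0] * cells
    sum_x = [0.0] * cells
    sum_y = [0.0] * cells
    for x, y in cgr_points(seq):
        col = min(int(x * side), side - 1)
        row = min(int(y * side), side - 1)
        k = row * side + col
        count[k] += 1   # step 1: points per cell
        sum_x[k] += x   # step 2: accumulated abscissa
        sum_y[k] += y   # step 2: accumulated ordinate
    mean = sum(count) / cells
    std = math.sqrt(sum((c - mean) ** 2 for c in count) / cells)
    z = [(c - mean) / std if std > 0 else 0.0 for c in count]  # step 3
    return sum_x + sum_y + z  # step 4: fused descriptor


def pearson(u: List[float], v: List[float]) -> float:
    """Pearson correlation coefficient of two equally long descriptors."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # correlation is undefined for a constant descriptor
    return cov / math.sqrt(var_u * var_v)
```

Calling `pearson(grid_descriptor(s1), grid_descriptor(s2))` for every pair of circRNA sequences would fill a symmetric sequence similarity matrix with ones on the diagonal.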
By analyzing circRNA's characteristics from different perspectives, we can obtain three similarity matrices, including
Graph
Graph
Graph
Graph
Graph
Graph
The Support Vector Machine (SVM) was introduced in 1963 by Vapnik et al. and has demonstrated many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems. Because the number of training samples used in iCDA-CGR is small, SVM is selected to build the model for predicting potential circRNA-disease associations. Prediction is divided into three main steps: 1. Construct the positive and negative sample sets; 2. Form the association descriptors based on the characteristics of the circRNAs and diseases; 3. Train the model on the descriptors to predict potential circRNA-disease associations. Each step is described in detail below.
Firstly, we built the positive and negative sample sets. Specifically, the 612 experimentally supported circRNA-disease pairs in circR2Disease were chosen as positive samples. Meanwhile, we randomly selected the same number of circRNA-disease pairs without experimental support as negative samples.
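A minimal sketch of this balanced sampling step is given below. It assumes that circRNAs and diseases are indexed by integers and that every pair absent from the positive set is a candidate negative; the function name `sample_negatives` and the fixed seed are illustrative choices, not details from the source.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[int, int]  # (circRNA index, disease index)


def sample_negatives(positives: Set[Pair], n_circ: int, n_dis: int,
                     seed: int = 0) -> List[Pair]:
    """Draw as many unlabelled pairs as there are positive pairs."""
    candidates = [(c, d) for c in range(n_circ) for d in range(n_dis)
                  if (c, d) not in positives]
    if len(candidates) < len(positives):
        raise ValueError("not enough unlabelled pairs to balance the classes")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    return rng.sample(candidates, len(positives))
```

For circR2Disease this would draw 612 pairs from the 533 × 89 possible pairs that have no confirmed association. Such pairs are unlabelled rather than proven negatives, so some of them may be true associations.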
Secondly, the association descriptors based on the characteristics of the circRNA and disease were formed. We calculated the semantic similarity
Graph
Graph
Graph
where S
Graph
Graph
Graph
where S
Graph
where (f
Finally, a support vector machine (SVM) is trained on the samples to build the predictive model. More specifically, we first set the labels of the training set. If the samples are in Z
In this work, five-fold cross-validation (5-CV) is used to evaluate the effectiveness of iCDA-CGR in predicting disease-related circRNAs. We divided the base dataset Z evenly into five parts:
Graph
where ∅ is empty set. ∪ and ∩ are the union and intersection of set theory. Subset Z
Graph
The relationship between the i th positive subset
Graph
Graph
Graph
where the quantity of sample in the i th positive subset
Graph
Graph
Graph
Graph
Graph
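The partition described above can be sketched as follows, assuming that positive and negative samples are split separately into five disjoint subsets and that the i-th test set is the union of the i-th positive and i-th negative subsets. The helper names and the shuffling seeds are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Labelled = Tuple[T, int]  # (sample, label)


def split_folds(items: Sequence[T], k: int = 5, seed: int = 0) -> List[List[T]]:
    """Shuffle the items and deal them into k disjoint, near-equal subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validation_sets(positives: Sequence[T], negatives: Sequence[T],
                          k: int = 5, seed: int = 0
                          ) -> Iterator[Tuple[List[Labelled], List[Labelled]]]:
    """Yield (train, test) lists of labelled samples for each of the k folds."""
    pos_folds = split_folds(positives, k, seed)
    neg_folds = split_folds(negatives, k, seed + 1)
    for i in range(k):
        test = [(s, 1) for s in pos_folds[i]] + [(s, 0) for s in neg_folds[i]]
        train = [(s, 1) for j in range(k) if j != i for s in pos_folds[j]]
        train += [(s, 0) for j in range(k) if j != i for s in neg_folds[j]]
        yield train, test
```

With 612 positive and 612 negative samples, each test set holds 244 or 246 samples and the remaining samples form the training set.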
Three evaluation criteria were introduced for assessing the performance of iCDA-CGR. Accu. is the ratio of the number of samples correctly classified by the classifier to the total number of samples.
Graph
where TP and FP are the numbers of true positive and false positive samples, respectively, and TN and FN are the numbers of true negative and false negative samples, respectively. Sen. is the ratio of the number of positive samples correctly classified by the classifier to the total number of positive samples.
Graph
Prec. is the ratio of the number of true positive samples to the sum of true positive and false positive samples.
Graph
F
Graph
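The evaluation criteria can be written out as a short sketch using the standard definitions in terms of TP, FP, TN and FN, with the F1-score taken as the harmonic mean of precision and sensitivity. The function name and the confusion counts in the usage example are hypothetical and are not reported in the source; the counts were picked so that the rounded output agrees with the fold 1 row of Table 2.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, precision and F1-score from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1}


# Hypothetical counts for a test fold of 123 positive and 123 negative samples.
metrics = classification_metrics(tp=109, fp=26, tn=97, fn=14)
print({name: round(value, 4) for name, value in metrics.items()})
```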
To evaluate the capabilities of the model, we performed experiments on the circR2Disease and circFunBase datasets. The five-fold cross-validation results on the circR2Disease dataset are summarized in Table 2. iCDA-CGR achieved an average AUC of 0.8533±0.0249; the AUCs of the five folds are 0.8923 (fold 1), 0.8252 (fold 2), 0.8390 (fold 3), 0.8723 (fold 4) and 0.8385 (fold 5), as shown in Fig 4. iCDA-CGR achieved an average AUPR of 0.7584±0.0351; the AUPRs of the five folds are 0.8240 (fold 1), 0.7463 (fold 2), 0.7187 (fold 3), 0.7566 (fold 4) and 0.7465 (fold 5), as shown in Fig 5. The average accuracy, sensitivity, precision and F1-score are 81.95%, 88.08%, 78.46% and 82.97%, respectively (Table 2).
Graph: Fig 4 ROC curves performed by iCDA-CGR on circR2Disease dataset.
Graph: Fig 5 PR curves performed by iCDA-CGR on circR2Disease dataset.
Graph
Table 2 The five-fold cross-validation results obtained by iCDA-CGR on the circR2Disease dataset.

| Testing set | Accuracy | Precision | Sensitivity | F1-score |
|---|---|---|---|---|
| 1 | 83.74% | 80.74% | 88.62% | 84.50% |
| 2 | 78.86% | 76.30% | 83.74% | 79.84% |
| 3 | 81.15% | 76.76% | 89.34% | 82.58% |
| 4 | 84.84% | 79.72% | 93.44% | 86.04% |
| 5 | 81.15% | 78.79% | 85.25% | 81.89% |
| Average | 81.95±2.11% | 78.46±1.70% | 88.08±3.39% | 82.97±2.14% |
On the circFunBase dataset, the mean and standard deviation over the five folds are likewise reported as the experimental results; they are summarized in Table 3. iCDA-CGR achieved an average AUC of 0.8049±0.0169; the AUCs of the five folds are 0.7820 (fold 1), 0.8316 (fold 2), 0.8104 (fold 3), 0.7926 (fold 4) and 0.8080 (fold 5), as shown in Fig 6. The AUPRs of the five folds are 0.7276 (fold 1), 0.8037 (fold 2), 0.7816 (fold 3), 0.7437 (fold 4) and 0.7727 (fold 5), as shown in Fig 7. The average accuracy, precision, sensitivity and F1-score are 78.03%, 79.96%, 74.94% and 77.31%, respectively (Table 3).
Graph: Fig 6 ROC curves performed by iCDA-CGR on circFunBase dataset.
Graph: Fig 7 PR curves performed by iCDA-CGR on circFunBase dataset.
Graph
Table 3 The five-fold cross-validation results obtained by iCDA-CGR on the circFunBase dataset.

| Testing set | Accuracy | Precision | Sensitivity | F1-score |
|---|---|---|---|---|
| 1 | 77.22% | 80.37% | 72.03% | 75.97% |
| 2 | 80.40% | 82.35% | 77.39% | 79.79% |
| 3 | 77.22% | 80.83% | 71.36% | 75.80% |
| 4 | 76.88% | 76.27% | 78.06% | 77.15% |
| 5 | 78.44% | 80.00% | 75.84% | 77.86% |
| Average | 78.03±1.30% | 79.96±2.01% | 74.94±2.75% | 77.31±1.45% |
In the above experiments, iCDA-CGR obtained reliable results. To justify the choice of classifier, we compared the support vector machine (SVM) with random forest (RF), decision tree (DT) and k-nearest neighbor (KNN) on the benchmark database circR2Disease.
The support vector machine (SVM) is a binary classification model. Its purpose is to find a hyperplane that separates the samples. The separation principle is to maximize the margin, which is finally transformed into a convex quadratic programming problem. The decision tree (DT) adopts a top-down recursive method. The basic idea is to construct the tree along which entropy, as measured by information entropy, declines fastest, so that the entropy at each leaf node is 0. The random forest (RF) is an ensemble learning method of the bagging type. It combines multiple weak classifiers and obtains the final result by voting or averaging, which gives the whole model higher accuracy and better generalization. The main idea of the k-nearest neighbor (KNN) algorithm is that if most of the k nearest samples in the feature space belong to a certain category, then the sample also belongs to this category and shares the characteristics of the samples in it.
In Table 4, we compare the results of the support vector machine with those of the other three classifiers on the circR2Disease database. The accuracies of the four classifiers are 82.44% (support vector machine), 76.32% (k-nearest neighbor), 70.61% (random forest) and 73.06% (decision tree). Their AUCs are 0.8645 (support vector machine), 0.8479 (k-nearest neighbor), 0.7927 (random forest) and 0.7281 (decision tree), as shown in Fig 8.
Graph: Fig 8 The ROCs of four different classifiers which are support vector machines, decision tree, random forest and k-nearest neighbor on circR2Disease dataset.
Graph
Table 4 Performance comparison among four different classifiers: k-nearest neighbor, random forest, decision tree and support vector machine.

| Method | Accuracy (%) | Sensitivity (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| KNN | 76.32 | 86.15 | 73.68 | 79.43 |
| RF | 70.61 | 71.54 | 72.66 | 72.09 |
| DT | 73.06 | 76.39 | 73.53 | 75.19 |
| SVM | 82.44 | 87.69 | 80.85 | 84.13 |
To further evaluate the reliability of iCDA-CGR, we compared it to five related prediction models: KATZHCDA, GHICD, RWRHCD, CD-LNLP and ICFCDA. The details of the comparison are summarized in Table 5. From the table, we can see that KATZHCDA, GHICD, RWRHCD and our model iCDA-CGR are all based on circR2Disease data set and use the five-fold cross-validation method, so iCDA-CGR can be directly compared with these three models. In terms of AUC scores reflecting the overall performance of the model, KATZHCDA, GHICD and RWRHCD achieved 0.7936, 0.7290 and 0.6660 respectively, while the proposed model iCDA-CGR achieved 0.8533. The results show that iCDA-CGR is significantly better than these methods.
Graph
Table 5 Performance comparison (AUC scores) among six prediction models: iCDA-CGR, KATZHCDA, GHICD, RWRHCD, CD-LNLP and ICFCDA.

| Method | AUC | Dataset | Associations | Assessment method |
|---|---|---|---|---|
| GHICD | 0.7290 | circR2Disease | 592 | 5-CV |
| KATZHCDA | 0.7936 | circR2Disease | 312 | 5-CV |
| RWRHCD | 0.6660 | circR2Disease | 592 | 5-CV |
| iCDA-CGR | 0.8533 | circR2Disease | 612 | 5-CV |
| CD-LNLP | 0.9007 | circ2Disease | 273 | LOOCV |
| ICFCDA | 0.9460 | circR2Disease | 212 | LOOCV |
In the last two rows of Table 5, we list the performance of CD-LNLP and ICFCDA, which are 0.9007 and 0.9460, respectively. However, because the dataset or assessment methods used by these two models are inconsistent with the proposed model, we cannot directly compare them, so they are used as a reference for model performance. The specific reasons that cannot be directly compared are as follows:
For model CD-LNLP, it uses the circ2Disease database instead of the more commonly used circR2Disease database. Due to the different data sources used, the training model evaluation criteria will be different. Furthermore, CD-LNLP uses leave-one-out cross validation (LOOCV) to evaluate model performance instead of the more commonly used five-fold cross validation (5-CV). Based on previous work, using the same model and data, LOOCV assessments are usually higher than 5-CV [[
For model ICFCDA, it uses the circR2Disease database, but this method removes more noisy data. The training data of ICFCDA includes 212 associations consisting of 200 circRNAs and 42 diseases. The predicted coverage of this model is 7976 associations, which is 17.25% of the coverage of iCDA-CGR. This operation makes the model performance stronger, but sacrifices the model's coverage. In addition, ICFCDA also uses LOOCV. Therefore, ICFCDA cannot be directly compared with the proposed model.
In summary, the proposed model has superior performance and coverage, which indicates that CGR-based sequence extraction technology and characterization of intrinsic structure and circRNA-disease association information could effectively improve the reliability of prediction.
To verify the performance of the model in predicting potential associations based on confirmed associations, we carried out a case study. To be specific, we define the training samples and test samples as follows:
Graph
In the validation, confirmed associations Z
Graph
Graph
Table 6 The top 30 circRNA-disease associations predicted from the known associations in circR2Disease.

| Rank | circRNA | Disease | Evidence (PMID) |
|---|---|---|---|
| 1 | Circ_MED12L | Hepatoblastoma | unconfirmed |
| 2 | hsa_circ_0070933 | Oral squamous cell carcinoma | unconfirmed |
| 3 | hsa_circ_0070934 | Diabetic myocardial fibrosis | unconfirmed |
| 4 | hsa_circ_0002113 | Breast cancer | 28803498 |
| 5 | hsa_circ_0070934 | Hypertension | unconfirmed |
| 6 | hsa_circ_0067934 | Hepatocellular carcinoma | 29458020 |
| 7 | hsa_circ_0001445 | Pancreatic cancer | unconfirmed |
| 8 | hsa_circ_0014717 | Gastric cancer | 28544609 |
| 9 | hsa_circ_0001649 | Gastric cancer | 28167847 |
| 10 | hsa_circ_0001649 | Glioma | 29343848 |
| 11 | hsa_circ_0067934 | Esophageal squamous cell carcinoma | 27752108 |
| 12 | hsa_circ_0003838 | Breast cancer | 28803498 |
| 13 | circETFA | Breast cancer | 29221160 |
| 14 | mmu_circ_0001052 | Immunosenescence | unconfirmed |
| 15 | circMED13 | Breast cancer | 29221160 |
| 16 | hsa_circ_0068087 | Rheumatoid arthritis | unconfirmed |
| 17 | hsa_circ_0007031 | Colorectal cancer | 28656150 |
| 18 | hsa_circ_0068033 | Breast cancer | 29045858 |
| 19 | Circ_SMARCA5 | Glioma | 26873924 |
| 20 | hsa_circ_0000504 | Colorectal cancer | 28656150 |
| 21 | circ-Foxo3 | Acute ischemic stroke | unconfirmed |
| 22 | hsa_circ_0072359 | Hepatoblastoma | 29414822 |
| 23 | Circ_ZNF148 | Glioma | 26873924 |
| 24 | hsa_circ_0081342 | Papillary thyroid carcinoma | 28288173 |
| 25 | mmu_circ_0000290 | Primary great saphenous vein varicosities | unconfirmed |
| 26 | circ-FBXW7 | Glioblastoma | 28903484 |
| 27 | hsa_circ_0085495 | Breast cancer | 28803498 |
| 28 | hsa_circ_0001824 | Breast cancer | unconfirmed |
| 29 | Circ_ADCY1 | Glioma | 26873924 |
| 30 | circDLGAP4 | Cardiovascular disease | unconfirmed |
Similar to the definition above, the confirmed associations provided by the circFunBase database were selected as the training set Z
Graph
Table 7 The top 30 circRNA-disease associations predicted from the known associations in circFunBase.

| Rank | circRNA | Disease | Evidence (PMID) |
|---|---|---|---|
| 1 | hsa_circ_0078768 | Facet joint osteoarthritis | unconfirmed |
| 2 | hsa_circ_0000893 | Breast cancer | 28744405 |
| 3 | hsa_circ_0046264 | Coronary artery disease | unconfirmed |
| 4 | hsa_circ_0039353 | Bladder cancer | unconfirmed |
| 5 | hsa_circ_0071896 | Facet joint osteoarthritis | 29470979 |
| 6 | hsa_circ_0001112 | Colorectal cancer | unconfirmed |
| 7 | hsa_circ_0087537 | Facet joint osteoarthritis | 29470979 |
| 8 | circVRK1 | Breast cancer | 29221160 |
| 9 | hsa_circ_0003570 | Basal cell cancer | unconfirmed |
| 10 | hsa_circ_0020397 | Colorectal cancer | 28707774 |
| 11 | hsa_circ_0011316 | Colorectal cancer | unconfirmed |
| 12 | hsa_circ_0098964 | Coronary artery disease | 28045102 |
| 13 | hsa_circ_0051172 | Coronary artery disease | 28947970 |
| 14 | hsa_circ_0000069 | Colorectal cancer | 28003761 |
| 15 | hsa_circ_0078768 | Active pulmonary tuberculosis | 28846924 |
| 16 | hsa_circ_0003838 | Breast cancer | 28803498 |
| 17 | hsa_circ_0007006 | Colorectal cancer | 28656150 |
| 18 | circRPAP2 | Cutaneous squamous cell cancer | unconfirmed |
| 19 | hsa_circ_0058792 | Coronary artery disease | unconfirmed |
| 20 | hsa_circ_0001667 | Breast cancer | 28803498 |
| 21 | hsa_circ_0088452 | Active pulmonary tuberculosis | 28846924 |
| 22 | hsa_circ_0001087 | Breast cancer | unconfirmed |
| 23 | hsa_circ_0002874 | Breast cancer | 28803498 |
| 24 | circUGP2_2 | Cervical cancer | unconfirmed |
| 25 | circC3 | Facet joint osteoarthritis | unconfirmed |
| 26 | hsa_circ_0089378 | Coronary artery disease | unconfirmed |
| 27 | hsa_circRNA_104333 | Basal cell cancer | unconfirmed |
| 28 | hsa_circ_0002495 | Bladder cancer | 29558461 |
| 29 | hsa_circ_0001721 | Breast cancer | 28744405 |
| 30 | hsa_circ_0000745 | Gastric cancer | 28974900 |
The results indicate that this method is reliable for circRNA-disease association prediction. In order to further support this conclusion, we verified the method in other databases (CRDD, circRNADisease, and Circ2Disease). It is not possible to identify all potential circRNA disease associations because each database is incomplete. So, we assume that the associations in the database are the only known associations that have been experimentally verified, and the rest are set to unknown associations. The training samples and test samples are described as follows:
Graph
where
Graph
Graph
Graph
Graph
As Table 8 shows, the model trained on circR2Disease reached accuracies of 95.18% (Circ2Disease), 90.64% (circRNADisease) and 95.89% (CRDD), while the model trained on circFunBase reached 63.26% (Circ2Disease), 73.43% (circRNADisease) and 72.72% (CRDD). These experiments indicate that iCDA-CGR generalizes to independent databases.
Graph
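Table 8 Predictive results of iCDA-CGR on the three other databases.

| Benchmark dataset | Database | Test pairs | True pairs | Accuracy (%) |
|---|---|---|---|---|
| circR2Disease | Circ2Disease | 83 | 79 | 95.18 |
| circR2Disease | circRNADisease | 171 | 155 | 90.64 |
| circR2Disease | CRDD | 438 | 420 | 95.89 |
| circFunBase | Circ2Disease | 49 | 31 | 63.26 |
| circFunBase | circRNADisease | 128 | 94 | 73.43 |
| circFunBase | CRDD | 121 | 88 | 72.72 |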
In this study, we proposed the computational model iCDA-CGR, which quantifies position and non-linear information to identify circRNA-disease associations. The model integrates circRNA sequence information, gene-circRNA association information, circRNA-disease association information and disease semantic information, and predicts the final results with an SVM classifier. In particular, we introduce circRNA sequence information, extract the biological sequence position information and quantify the nonlinear sequence relationship of circRNAs by Chaos Game Representation, for the first time in a circRNA-disease prediction model. The model achieved outstanding results in five-fold cross-validation, in comparisons with other methods, and on independent data sets. Furthermore, 19 of the top 30 circRNA-disease associations predicted in the case studies were confirmed by recently published literature. Owing to the addition of sequence information, iCDA-CGR exhibited strong reliability and stability in predicting potential circRNA-disease associations. These experimental results indicate that sequence information has sufficient coverage of nucleic acids, and that iCDA-CGR has great potential for nucleic acid function analysis.
S1 Table. Data distribution of the benchmark set circR2Disease and circFunBase of circRNA-disease association.
(XLSX)
S2 Table. Known circRNA-disease associations obtained from circR2Disease database.
(XLSX)
S3 Table. Names of 533 circRNAs involved in known circRNA-disease associations obtained from circR2Disease database.
(XLSX)
S4 Table. Names of 89 diseases involved in known circRNA-disease associations obtained from circR2Disease database.
(XLSX)
S5 Table. The final disease similarity matrix.
(XLSX)
S6 Table. The final circRNA similarity matrix.
(XLSX)
By Kai Zheng; Zhu-Hong You; Jian-Qiang Li; Lei Wang; Zhen-Hao Guo and Yu-An Huang