In the course of sample preparation for Next Generation Sequencing (NGS), DNA is fragmented by various methods. Fragmentation shows a persistent bias in the cleavage rates of different dinucleotides. With the exception of CpG dinucleotides, the previously described biases were consistent with the results of DNA cleavage in solution. Here we computed the cleavage rates of all dinucleotides, including methylated and unmethylated CpG dinucleotides, using whole genome sequencing datasets from the 1000 Genomes project. We found that the cleavage rate of CpG is significantly higher for methylated CpG dinucleotides. Using this information, we developed a classifier that distinguishes cancer from healthy tissues based on the fragmentation status of their CpG islands. A simple Support Vector Machine classifier based on this algorithm shows an accuracy of 84%. The proposed method allows the detection of epigenetic markers purely on the basis of mechanochemical DNA fragmentation, which can be detected by a simple analysis of NGS data.
Sergei L. Grokhovsky is deceased.
DNA methylation level of CpG islands, genomic sequences with a high occurrence of methylated CpG dinucleotides, is an important regulator of gene expression. The level of gene expression can increase or decrease, depending on the methylation level in the CpG sites inside a CpG island. Analysis of methylation levels in regulatory regions of various genes can provide information on their involvement in the development of various diseases, including cancer.
Previously we showed that sonication of restriction DNA fragments leads to sugar-phosphate backbone breaks, which depend on the nucleotide sequence. Breaks in CpG dinucleotides occur about 1.5 times more often than in other dinucleotides[
In this work the cleavage rates for methylated and unmethylated CpG dinucleotides of the human genome were estimated. We found that the cleavage rate for methylated CpG dinucleotides is about 1.5 times higher than that for unmethylated ones. On the basis of this observation one can estimate the CpG methylation level in CpG islands without any experimental data on DNA methylation. Further, we show that tumor and healthy tissues differ significantly in the methylation status of CpG islands. In human somatic cells approximately 80% of CpG dinucleotides are methylated. Bisulfite sequencing was the first method for detection of cytosine methylation[
We randomly selected 100 whole genome sequencing datasets from the 1000 Genomes project that contained reads mapped on a reference genome (GRCh37). According to[
Graph
where X,Y = {A, T, G, C
Figure 1. Average cleavage rates r(XY) of 18 dinucleotides (the 16 common dinucleotides along with methylated CpGs (CMG) and unmethylated CpGs (CUG)). Cleavage rates are plotted on the Y axis; dinucleotides are listed on the X axis. CpG methylation results in a substantial increase of the cleavage rate in comparison with unmethylated CpGs and other dinucleotides.
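The kind of per-dinucleotide counting behind a plot like Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the 0-based break-position convention, the `methylated_c` set and the normalisation (breaks per occurrence, divided by the mean over all dinucleotide classes) are all assumptions made for the example, and only forward-strand read starts are considered.

```python
from collections import Counter

BASES = set("ACGT")


def dinucleotide_label(genome, i, methylated_c):
    """Label the dinucleotide genome[i-1:i+1]; CpG is split by methylation state."""
    dinuc = genome[i - 1:i + 1]
    if set(dinuc) - BASES:
        return None  # skip N and other ambiguous bases
    if dinuc == "CG":
        # Hypothetical convention: methylated_c holds 0-based positions of methylated C.
        return "CMG" if (i - 1) in methylated_c else "CUG"
    return dinuc


def cleavage_rates(genome, break_positions, methylated_c=frozenset()):
    """Relative cleavage rate per dinucleotide class.

    break_positions: 0-based index i of each read 5' end, i.e. the backbone
    broke between genome[i-1] and genome[i] (assumed convention).
    """
    occurrences = Counter()
    for i in range(1, len(genome)):
        key = dinucleotide_label(genome, i, methylated_c)
        if key:
            occurrences[key] += 1

    breaks = Counter()
    for i in break_positions:
        if 0 < i < len(genome):
            key = dinucleotide_label(genome, i, methylated_c)
            if key:
                breaks[key] += 1

    # Breaks per occurrence, then scaled so the mean over all classes is 1.
    raw = {d: breaks[d] / n for d, n in occurrences.items()}
    if not raw:
        return {}
    mean = sum(raw.values()) / len(raw)
    return {d: (r / mean if mean else 0.0) for d, r in raw.items()}


if __name__ == "__main__":
    # Toy reference with one methylated CpG (C at index 1) and one unmethylated CpG.
    print(cleavage_rates("ACGTACGT", [2, 2, 6, 1], methylated_c={1}))
```

In this toy input the methylated CpG is broken twice and the unmethylated one once, so the ratio of the two rates is 2; with real data the positions would come from mapped read starts and a methylation call set.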
The results of the ANOVA and Kruskal-Wallis tests suggest that the effect of the dinucleotide type on the relative frequency of its ultrasonic cleavage is statistically significant at the level p < α = 0.05 (see Supplementary Table 4). The results of the multiple comparisons (Duncan's test and the Kruskal-Wallis test) are shown in Supplementary Tables 2 and 3. The sample means in Supplementary Table 2 are combined into subsets so that the means from different subsets (columns 1, 2, 7, 8, 9, 10) differ statistically significantly at p < 0.05. Therefore the mean of r(C
According to[
We set out to develop an approach that distinguishes tumor tissue samples from normal ones by their dinucleotide cleavage characteristics. The approach is based on the observation that methylated CpG islands have higher cleavage rates: reads whose 5'-end nucleotide is a G preceded by a C in the genomic sequence occur more frequently if the C is methylated. For a practical evaluation of this method we used data from EGA datasets (24 datasets of T-cell lymphoma, 15 high-coverage datasets of prostate cancer, a large dataset (> 300 samples) of breast cancer, 24 datasets of hepatocarcinoma, and 37 datasets of medulloblastoma (average coverage ~ 40×) from the ICGC database, as well as control datasets of healthy tissues). We selected a subset of EGA T-cell datasets with good coverage (> 40×). Our approach allows prediction of the difference between the methylation statuses of CpG islands in healthy and tumor tissues. Figure 2 shows the CpG islands with the largest absolute difference in their mean cleavage levels. In almost all cases we observed a demethylation effect in the CpG islands of tumor tissues, with one exception: the CpG island with coordinates chr13:-25,212,380–25,212,623, which could be associated with the promoter region of the pseudogene TNFRSF1A. This gene is a well-known oncogene involved in tumorigenesis[
Figure 2. Colormaps for the CpG islands with the highest difference in cleavage rates between normal and tumor tissues in medulloblastoma. The X axis corresponds to samples, and the Y axis to CpG islands with their coordinates in the genome. The intensity of the color scale corresponds to the mean cleavage rate for CpG in every CpG island, which was calculated according to (
The detection of cancer-specific methylation of particular CpG islands is a complicated task even when high-quality results of bisulfite-sequencing experiments are available. Therefore, to predict disease status effectively from dinucleotide cleavage rates, we employed machine learning. First, for each CpG island in each dataset (control and tumor) we computed the average CpG dinucleotide cleavage rate and converted every dataset into a high-dimensional vector. Each element of this vector is the average CpG dinucleotide cleavage rate in a specific CpG island. These vectors were used to train a binary classifier based on the support vector machine (SVM) method with a linear kernel. This SVM classifier was used to predict the state of a given sample (cancer or normal). For training we used the same number of control and tumor datasets for each cancer type. In our research we applied the SVM implementation in the e1071 package[
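The two steps described above (one mean CpG cleavage rate per island, then a linear classifier on the resulting vectors) can be sketched as follows. This is only an illustration under stated assumptions: the text uses the linear-kernel SVM from the R package e1071, whereas this sketch substitutes a plain hinge-loss linear classifier trained by subgradient descent in pure Python. The function names, the island identifiers, the +1 (tumor) / -1 (normal) labels and the choice to encode a missing island as 0.0 are all hypothetical.

```python
def island_features(sample, island_ids):
    """Turn {island_id: [per-CpG cleavage rates]} into a fixed-order vector of island means."""
    vec = []
    for island in island_ids:
        rates = sample.get(island, [])
        # Assumption: an island with no covered CpG is encoded as 0.0.
        vec.append(sum(rates) / len(rates) if rates else 0.0)
    return vec


def train_hinge_classifier(X, y, lr=1.0, lam=0.0, epochs=200):
    """Subgradient descent on the hinge loss; labels must be +1 or -1.

    With lam > 0 this minimises the soft-margin linear SVM objective;
    with lam = 0 (default here) it is an unregularised margin perceptron.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 shrinkage of the weights (no effect when lam == 0).
            w = [(1.0 - lr * lam) * wj for wj in w]
            if margin < 1.0:
                # Sample is inside the margin or misclassified: move towards it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (tumor) or -1 (normal) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    islands = ["cgi_a", "cgi_b"]  # hypothetical island identifiers
    samples = [
        {"cgi_a": [0.8], "cgi_b": [0.9]},  # tumor
        {"cgi_a": [0.7], "cgi_b": [1.0]},  # tumor
        {"cgi_a": [1.4], "cgi_b": [1.5]},  # normal
        {"cgi_a": [1.5], "cgi_b": [1.3]},  # normal
    ]
    labels = [1, 1, -1, -1]
    X = [island_features(s, islands) for s in samples]
    w, b = train_hinge_classifier(X, labels)
    print([predict(w, b, x) for x in X])
```

The toy values mimic the reported trend of lower cleavage rates (demethylation) in tumor samples; real vectors would have one element per CpG island in the genome.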
To control the accuracy of this algorithm, the jack-knife method was used. In each jack-knife round we excluded a random subset for each cancer type and trained the model on the remaining datasets. Each subset contained an equal number of control and tumor datasets. We then made predictions on the excluded subsets and computed true and false positive rates (Table 1).
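The validation loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: a nearest-centroid rule stands in for the SVM so that the example needs only the standard library, and the function names, the number of rounds, the hold-out size and the +1 (tumor) / -1 (normal) labels are assumptions.

```python
import random


def train_centroids(X, y):
    """Stand-in model (not an SVM): the mean vector of each class."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return {label: mean([x for x, t in zip(X, y) if t == label]) for label in (1, -1)}


def predict_centroid(model, x):
    """Assign x to the class with the nearer centroid (+1 tumor, -1 normal)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 1 if dist2(model[1]) <= dist2(model[-1]) else -1


def jackknife_rates(tumor, normal, rounds=100, holdout=1, seed=0,
                    train=train_centroids, predict=predict_centroid):
    """Repeatedly hold out `holdout` tumor and `holdout` normal samples,
    train on the rest, and pool the predictions on the held-out samples."""
    rng = random.Random(seed)
    tp = fn = tn = fp = 0
    for _ in range(rounds):
        held_t = set(rng.sample(range(len(tumor)), holdout))
        held_n = set(rng.sample(range(len(normal)), holdout))
        keep_t = [x for i, x in enumerate(tumor) if i not in held_t]
        keep_n = [x for i, x in enumerate(normal) if i not in held_n]
        model = train(keep_t + keep_n, [1] * len(keep_t) + [-1] * len(keep_n))
        for i in held_t:
            if predict(model, tumor[i]) == 1:
                tp += 1
            else:
                fn += 1
        for i in held_n:
            if predict(model, normal[i]) == -1:
                tn += 1
            else:
                fp += 1
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


if __name__ == "__main__":
    # Toy one-island feature vectors; real vectors have one element per CpG island.
    tumor = [[0.7], [0.8], [0.9], [0.75]]
    normal = [[1.3], [1.4], [1.5], [1.45]]
    print(jackknife_rates(tumor, normal, rounds=20))
```

Holding out the same number of tumor and control samples in every round keeps both the training set and the evaluation set balanced, as the text requires; any classifier with the same `train`/`predict` shape can be passed in place of the stand-in.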
Table 1. Results of the prediction of tumor/normal status for different cancer types, calculated by the SVM model described in the main text.
Cancer type        EGA ID             True positive (%)   True negative (%)   False positive (%)   False negative (%)
Breast Cancer      EGAD00001000126    79                  60                  21                   40
Medulloblastoma    EGAD00001000816    76                  96                  24                   4
T-Cell Lymphoma    EGAD00001002738    84                  82                  16                   18
Prostate Cancer    EGAD00001000263    70                  97                  30                   3
Hepatocarcinoma    EGAD00001001881    87                  71                  13                   29
Percentages of positive cases and negative cases are calculated separately.
The accuracy of our predictions was comparable to that of classifiers based on bisulfite sequencing. The accuracy of the sample status prediction in[
We analyzed raw WGS data and found that cytosine methylation strongly affects the cleavage rate of CpG dinucleotides in the human genome. This fact agrees with recent observations of another group[
Using our method, one can estimate the total level of whole-genome base methylation and compare the degree of methylation in different genomic regions with similar functions at different stages of organism development. The inferred correlation between the probability of mechanochemical cleavage of the DNA sugar-phosphate backbone and the local nucleotide sequence enabled us to find an important connection between structural parameters of specific nucleotide sequences of DNA and their biological functions. Moreover, it should be possible to conduct a comparative analysis of methylation profiles for cells during differentiation and during aging by estimating the total DNA methylation level in various cells, at least at the resolution of CpG islands. The main problem for practical use of our method is poor coverage of CpG islands and their heterogeneity in different cancer cells. For example, the worst results of tumor/control separation were obtained for breast cancer, where the samples were poorly covered.
On the basis of these results, the method can be developed for detecting the total level of the CpG methylation from the raw WGS data without any additional experiments. We hope that the developed method makes it possible to recognize base modifications in genomes other than methylation of cytosines in CpG dinucleotides[
We used the whole genome sequencing samples of human individuals from the 1000 Genomes project[
GRCh37 was used as a reference genome. To be sure about the type of dinucleotides observed we masked all low sequence complexity regions by RepeatMasker[
We used bisulfite-sequencing data from NGSmethDB database[
The authors thank V.J. Makeev, D.Y. Nechipurenko, M.V. Khodykov, and M.V. Fridman. This work was supported by project no. 0112-2019-0001 and RFFI grants 18-07-00354 A and 17-07-01331 A, the program of the Presidium of the Russian Academy of Sciences for Molecular and Cellular Biology no. 01201456592, the Russian Foundation for Fundamental Investigations project no. 14-04-01269, and the Program of fundamental research for state academies for 2013-2020, no. 01201363818.
S.L.G., I.A.I. and Yu.D.N conceived the project, F.A.K. and R.V.P. designed the experiments, E.T.A., L.A.U. and I.R.U. conducted all simulations and analysis, L.A.P performed statistical analysis of the obtained data, S.L.G., I.A.I, Yu.D.N., R.V.P., E.T.A., L.A.U., L.A.P. and F.A.K. wrote and prepared the manuscript.
The authors declare no competing interests.
Supplementary information is available for this paper at 10.1038/s41598-020-65406-1.
By Leonid A. Uroshlev; Eldar T. Abdullaev; Iren R. Umarova; Irina A. Il'icheva; Larisa A. Panchenko; Robert V. Polozov; Fyodor A. Kondrashov; Yury D. Nechipurenko and Sergei L. Grokhovsky