While CpG dinucleotides are significantly reduced compared to other dinucleotides in mammalian genomes, they can congregate and form CpG islands, which localize around the 5ʹ regions of genes, where they function as promoters. CpG-island promoters are generally unmethylated and are often found in housekeeping genes. However, their nucleotide sequences and existence per se are not conserved between humans and mice, which may be due to evolutionary gain and loss of the regulatory regions. In this study, human and rhesus monkey genomes, with moderately conserved sequences, were compared at base resolution. Using transcription start site data, we first validated our methods' ability to identify orthologous promoters and indicated a limitation using the 5ʹ end of curated gene models, such as NCBI RefSeq, as their transcription start sites. We found that, in addition to deamination mutations, insertions and deletions of bases, repeats, and long fragments contributed to the mutations of CpG dinucleotides. We also observed that the G + C contents tended to change in CpG-poor environments, while CpG content was altered in G + C-rich environments. While loss of CpG islands can be caused by gradual decreases in CpG sites, gain of these islands appear to require two distinct nucleotide altering steps. Taken together, our findings provide novel insights into the process of acquisition and diversification of CpG-island promoters in vertebrates.
Electronic supplementary material The online version of this article (10.1007/s00335-020-09844-2) contains supplementary material, which is available to authorized users.
In vertebrate genomic sequences, the content of CpG dinucleotides is significantly lower than expected based on the nucleotide composition (Bird [
Approximately half of all human protein-coding genes bear a CpG island in the promoter regions, and hence, promoter regions are classified into two distinct types, CpG-island and non-CpG-island promoters (Davuluri et al. [
Considering that housekeeping genes reportedly bear a CpG island in their promoter regions, it is likely that the presence of CpG-islands is conserved among various species (Zhu et al. [
To conduct comparative genomics at base resolution, nucleotide sequences should be highly conserved. However, if aligned sequences share high similarity, informative output cannot be obtained. Using the human genomic sequence as a control, mouse and chimpanzee non-coding sequences are considered to be too different and too similar, respectively. Thus, the rhesus monkey genome was employed as it exhibits moderate divergence. The macaque diverged from our last common ancestor approximately 25 million years ago, and an average identity of the genomes was reported to be ~ 93% (Gibbs et al. [
Since promoters are non-coding, sequence alignment is not as straightforward as with protein-coding sequences, which consist of anchored 3-bp frames. Therefore, we employed transcription start sites (TSSs) as an alternative anchor. TSSs can be determined by the Oligo-capping method or Cap Analysis Gene Expression (CAGE), both of which exploit the 5ʹ cap of mRNA molecules (Shiraki et al. [
Although CpG islands are often associated with gene promoters (Bird [
Graph: Fig. 1 Sequence logos obtained from 40-nt promoter sequences. Human, macaque, and mouse logos are displayed from left to right. Transcription start sites (TSSs) are located at position 0, which corresponds to the traditional + 1 start site. a TSSs were inferred from 5ʹ end of RefSeq entries. Promoter elements are not clear in all the three species. b Human representative TSSs were obtained from DBTSS. The Inr or YR consensus dinucleotide is evident. Disappearance of G and C nucleotides around position -27 suggests presence of TATA boxes. Macaque and mouse TSSs were inferred from sequence alignments against human TSSs
Generally, one gene locus has many TSSs, which form a cluster around a core promoter. Considering frequency of usage and genomic position, a single TSS is selected as a representative for each core promoter in DBTSS (Wakaguri et al. [
In addition to human data, sequence logos of mouse and macaque data were drawn (Fig. 1). For the latter two species, TSSs were inferred from human representative TSSs compiled in DBTSS. Clear YR consensus sequences were obtained, suggesting this method to effectively predict TSSs for organisms without experimentally determined TSSs. Those promoter sequences, however, should be conserved to some extent, e.g., desirably to the degree seen between humans and macaque.
The upstream regions from the TSSs appeared to be G + C-rich for the three species, although CpG-island promoters were not selected in this analysis (Fig. 1). Interestingly, from position − 25 to − 29, enrichment of G and C was lost for all species. Because the position corresponds to that of the TATA box (Juven-Gershon et al. [
Graph: Fig. 2 Genomic locations of possible TATA boxes relative to TSSs. a Histograms depict distributions of 5′-TATAAA-3′ fragments in promoters. Position 0 corresponds to 5ʹ end of RefSeq entries. b When DBTSS was employed, a sharp peak was observed around position − 31 in each species. In mice, the number of inferred start sites was not large due to a high degree of sequence divergence from humans
Graph: Fig. 3 Locational difference of transcription start sites between RefSeq and DBTSS. Histograms indicate relative positions of RefSeq 5ʹ ends compared to TSS data from DBTSS. RefSeq 5′ ends tend to be positioned downstream from TSSs. As for macaque, available RefSeq entries were limited compared to human and mouse
Several definitions of a CpG island have been proposed (Gardiner-Garden and Frommer [
Changes in G + C content and CpG scores between the human and macaque promoters are illustrated as a scatter plot (Fig. 4, Supplementary Fig. S1). Orthologous counterparts were connected by a single line, while vertical and horizontal lines were drawn to discern the two promoter types. CpG-island promoters formed an apparent cluster in the top right region, in which G + C content and CpG score were more than 0.5 and 0.6, respectively. In addition, the two lines partitioned the plot into four regions, namely, low-G + C/high-CpG (left top), high-G + C/high-CpG (right top), low-G + C/low-CpG (left bottom), and high-G + C/low-CpG (right bottom). The majority of promoter pairs were located in the second region, CpG-island promoters. Others were principally located in the two bottom regions or low-CpG promoters. Appearance of low-G + C/high-CpG promoters was exceptional in humans and macaque. Large numbers of changes were observed between high-G + C/high-CpG and high-G + C/low-CpG (right top–bottom), as well as between low-G + C/low-CpG and high-G + C/low-CpG (right–left bottom). Connecting lines that span a border of two promoter groups were counted and illustrated (Fig. 4). It is likely that G + C contents and CpG scores have drifted evolutionarily in either of the following two ways: changes in G + C content in low-CpG genomic context, or CpG content in high-G + C genomic context. Performing a gene ontology analysis revealed that such liable promoters were preferentially found in genes associated with alternative splicing (150 of 258 hCmN and 99 of 158 hNmC).
Graph: Fig. 4 Scatter plot of aligned human and macaque promoter pairs. Horizontal and vertical axes represent G + C content and CpG score, respectively. The vertical 0.5 and horizontal 0.6 lines split the promoters into four groups. The top right group corresponds to CpG-island promoters; the other three groups correspond to non-CpG-island ones. a Each line connecting human (blue dot) and macaque (red dot) promoters represents a change in the contents and scores between the two species. Most connecting lines were short, indicating the contents and scores are highly conserved between human and macaque. A scalable version of this scatter plot is also provided (supplementary Fig. S1). b Numbers representing change in promoter types between humans and macaque. The numbers were obtained by counting the connecting lines that span the group border and are shown with gray arrows. Width of the gray arrow represents number of changes between two groups. CpG score changes in high-G + C content were predominant. Intriguingly, almost all invertebrate ascidian promoters clustered in the top left group
The number of 601-nt promoter sequences detected in the macaque genome was 12,543, which were then realigned to the corresponding human queries with ClustalW 2.1 to examine base changes between the two species. Pairwise alignments with alignment scores less than 90 were discarded from further analysis. Additionally, those with a mismatch in the YR consensus dinucleotide were also discarded. More than 97% of promoter pairs between human and macaque preserved their promoter types, i.e., CpG-island or non-CpG-island promoters (Table 1). However, the TBCE gene represents one of the rare cases that showed a discrepancy of promoter types (Fig. 5). Although the orthologous promoter sequences were highly similar between the two species, a major difference was a 47-bp indel near the 3ʹ end. BLAT search for the 47-bp fragment using chimpanzee, gorilla, baboon, and marmoset reference sequences revealed that a deletion event may have occurred in a common ancestor of gorilla, chimpanzee, and humans. Compared to the macaque sequence, four CpG sites were lost by the deletion in the human TBCE promoter, resulting in transition from a CpG-island promoter to a non-CpG-island promoter in hominids, namely great apes.
Classification of pairwise-aligned promoters
rheMac8 (macaque) mm10 (mouse) CpG-island promoter Non-CpG-island promoter Subtotal CpG-island promoter Non-CpG-island promoter Subtotal hg19 CpG-island promoter 6893 259 7152 1557 167 1724 (human) Non-CpG-island promoter 158 5288 5446 47 968 1015 Subtotal 7051 5547 12598 1604 1135 2739
Graph: Fig. 5 Examples of indels in human and macaque promoters. Sequence alignments are shown for the PNP, CEND1, and TBCE promoters, in which an indel altered the promoter types between the two species. Harr plots were drawn to detect tandem repeats related to insertion or deletion events (rheMac8 chr7:81,686,561–81,686,600 for PNP, hg19 chr11:790,324–790,373 for CEND1, and rheMac8 chr1:211,651,491–211,651,557 for TBCE). Each dot indicates a 5-base perfect match. Indel-related sequences are shown in red. Multiple alignments of the six primate species were drawn to determine the type of mutation event, namely insertion or deletion. The gray and black triangles indicate the occurrence of an insertion and deletion, respectively. CpG islands were gained in macaque PNP and human CEND1 through insertion and lost in hominoid TBCE via deletions. All CpG sites are shaded in blue
In vertebrate genomes, CpG sites are subject to cytosine methylation, often followed by deamination and mutation to TpG (or CpA if deamination occurred in the complementary strand). In the human genome, the frequency of the CpG dinucleotide was extremely low among all 16 dinucleotide sequences. Conversely, frequencies of resultant TpG and CpA were the highest, except for A-rich or T-rich ones, i.e., ApA, ApT, TpA, and TpT (Okamura et al. [
Graph: Fig. 6 Dinucleotide alterations between human and macaque promoters. a Heat map showing alteration rates of each dinucleotide. Increased rates were observed when corresponding dinucleotides were not conserved between the two species. In the top two rows, human promoters were aligned to macaque CpG-island and non-CpG-island promoters, respectively. In the bottom two rows, macaque promoters were aligned to human CpG-island and non-CpG-island promoters, respectively. Both single-nucleotide substitutions and indels were considered for the rate calculation. Frequent alteration of CpG sites was observed when a promoter aligned to a non-CpG-island promoter of the counter species. Alteration of TpA dinucleotide was also high when aligned to a CpG-island promoter of the counter species. b Band graphs indicate alignment of dinucleotides to CpG sites. Between human and macaque CpG-island promoters, CpG sites were well aligned to CpG sites in the counter species (top bands). In background genomic sequences, however, many CpG sites were aligned to TpG or CpA sites (bottom bands). hCmC human CpG-island and macaque CpG-island promoter pairs, hNmN human non-CpG-island and macaque non-CpG-island promoter pairs
As for CpG sites in CpG-island promoters, more than 80% of the dinucleotide sequences were conserved between the two species. In particular, 85.7% and 86.7% of the CpG sites in human and macaque CpG-island promoters, respectively, were aligned without mismatch (Fig. 6, Supplementary Table S2). The slight difference in these percentage values arose as the total number of CpG sites between the counterparts was not identical. However, dinucleotide conservation was not observed in non-CpG-island promoters. In type-change promoter pairs, a large number of CpG-to-TpG/CpA transitions were found. The dinucleotide changes were observed 2.8 and 3.5 times more frequently in humans and macaque, respectively, than those in CpG-island promoters. The frequencies were also higher than those in background genomic sequences. Thus, it is likely that deamination mutations contributed to promoter variation.
While most of the human–macaque promoter pairs consisted of the same type between the two species, i.e., CpG-island or non-CpG-island type, 417 pairs (158 pairs + 259 pairs) showed discordance (Table 1). In addition to nucleotide substitutions, many insertions and deletions were detected in their sequence alignments, suggesting that such events could have evolutionarily altered the promoter types as shown in the TBCE locus. Additionally, multiple sequence alignments of the PNP and CEND1 promoters are provided as examples of insertion by repeat expansion (Fig. 5). From all of the indels, we selected 439 sites whose neighboring 10-bp regions on both sides showed 90%, or greater, similarity between human and macaque genomic sequences. Approximately 10% were 10-bp or longer indels (Fig. 7). Alterations in the numbers of CpG dinucleotides, including in the nearby regions, were observed in 82 indel sites. Among these, only 16 sites showed two or more CpG-site alterations, including six sites harboring a 20-bp, or longer, indel. In CpG-island promoters, long indels often contained several CpG sites. If an inserted or deleted fragment contained CpG sites, the event could contribute to gain or loss of CpG-island-like sequences. For all the detected indel sequences longer than 4 nt, Harr plots were drawn along with neighboring sequences (Fig. 5, Supplementary Table S1, Supplementary Fig. S2). Of all 88 indels, 77 were associated with tandem repeats, including 16 indels with inverted repeats (Supplementary Fig. S2). In total, repetitive sequences were found in 365 out of the 439 indel sites, including shorter indels (Supplementary Table S3). More than 50% of the indels were single-nucleotide insertions or deletions (Fig. 7). In many cases, they could be considered to be an extension or shortening of homopolymeric repeats. Inserted or deleted single nucleotides clearly depended on promoter types; while nucleotide bias was not observed in non-CpG-island promoters, frequent C or G indels were identified in CpG-island promoters.
Graph: Fig. 7 Pie charts showing number of insertions or deletions between human and macaque promoters. a Number of detected indel sites is classified by length. More than half were single-nucleotide insertions or deletions. The longest indel detected was 35 bp. b–d Types of nucleotides detected in alignments are shown between CpG-island promoters (b), CpG-island and non-CpG-island promoters (c), and non-CpG-island promoters (d). In CpG-island promoters, inserted or deleted nucleotides were biased toward C or G. hCmC CpG-island promoter in human and macaque, hNmC CpG-island promoter in macaque and not in human, hCmN CpG-island promoter in human and not in macaque, and hNmN non-CpG-island promoters in the two species
Although the rhesus monkey is an instrumental model organism for understanding many aspects of humans, the number of documented genes or transcripts in these monkeys is much smaller than in humans and mice. Further, genomic annotation of macaque has been limited. Therefore, curated cDNA sequences, such as datasets provided by NCBI RefSeq, have been used to infer putative promoter regions (Pruitt et al. [
We found that 5ʹ ends of curated cDNA sequences tend to be positioned far upstream of TSSs determined by the Oligo-capping method (Maruyama and Sugano [
The two representative mammalian species show high sequence conservation in protein-coding regions. However, it is quite difficult to align non-coding sequences, including regulatory regions such as promoters. Furthermore, the presence of CpG-island promoters is not conserved for many human–mouse gene loci (Jiang et al. [
Despite moderate sequence divergence, changes were detected in 417 promoters, most of which were C-to-T transitions. Using outgroup species, multiple studies reported loss of CpG as much more frequent than changes of other dinucleotides to CpG (Jiang et al. [
Certain limitations were noted in the current study. For instance, this study was designed under the presumption that the human and macaque genomic sequences can be readily aligned at base resolution; therefore, a bias may have occurred toward highly conserved promoters. Although up to 35-bp indels were identified in our alignment settings, it is difficult to detect much larger structural alterations, such as deletion, segmental duplication, retrotransposition, or gene conversion, within the promoter regions. These alternate mutational events could also provide opportunities for gain or loss of a CpG island. By using orthologous gene pairs, highly diverged promoters may also be compared. Nonetheless, our study presents distinctive insights into the evolution of CpG density, which likely contributed to the fine-tuning of gene expression from early vertebrates to primates. Understanding this process may contribute to the development of platforms capable of altering gene regulation at our own convenience in the future.
Human promoter sequences were used to locate macaque promoter regions at their orthologous regions. Using DBTSS (https://dbtss.hgc.jp/) ver. 6, each human promoter sequence was prepared as a 601-bp fragment bearing the transcription start site (TSS) at the center and aligned to the UCSC rheMac8 assembly (Baylor College of Medicine Genome Sequencing Center Mmul_8.0.1) considering the transcriptional orientation. Meanwhile, the UCSC hg19 (GRCh37) assembly was used for the human reference genome. BLAT ver. 36, using rheMac8 as the reference, was employed for identification. Soft-masked bases in rheMac8 were hard-masked or replaced with Ns as primate genomes contain G + C-rich Alu sequences. Macaque sequence hits with ≥ 90% identity of over 200 bp were adopted as promoter candidates. When several fragments were obtained, the longest was selected. If a YR consensus dinucleotide was present corresponding to humans, the candidate fragment was extended to 601 bp centering the dinucleotide, and the macaque promoter underwent further analyses. Approximately 10% of candidates were eliminated due to lack of a YR dinucleotide. Fragments that could not be extended to the fixed length due to contig gaps were also eliminated. Mouse promoter sequences were located using the above-mentioned approach with the UCSC mm10 (GRCm38) assembly sequences.
To confirm the existence and positions of promoter elements, including the Inr and TATA boxes, aligned genomic sequences were graphically represented. Sequence logos spanning 35 bp upstream and 5 bp downstream regions from TSSs were drawn using WebLogo 3.5.0 (Crooks et al. [
TSS positions can be inferred from 5ʹ-end positions of gene model sequences such as NCBI RefSeq. For comparison to DBTSS data, separated distances were calculated and represented in histograms. The distance was obtained by subtracting the DBTSS genomic position from that of RefSeq. Minus distance values indicated a RefSeq position upstream of the DBTSS position. Since genes curated in RefSeq have many transcript variants with different 5ʹ-end positions, only the nearest 5ʹ-end position to the corresponding DBTSS position was selected for each gene. As for DBTSS positions, TSSs supported by less than three clones were eliminated.
A CpG-island promoter was defined as a genomic region based on the following criteria: (
Orthologous pairs of promoters were grouped into the following four groups: CpG-island promoter in human and macaque (hCmC); CpG-island promoter in human and not in macaque (hCmN); CpG-island promoter in macaque and not in human (hNmC); and non-CpG-island promoters in the two species (hNmN). We used DAVID 6.8 to perform the gene ontology analysis (Huang et al. [
After BLAT search, putative orthologous promoter sequences were meticulously aligned using ClustalW 2.1 with gap extension penalty 0.2 (Thompson et al. [
Using the human and macaque PNP, CEND1, and TBCE promoter sequences as BLAT queries, orthologous sequences of other primates were obtained. Subject reference assemblies were UCSC panTro6, gorGor5, papAnu4, and calJac3 for chimpanzee, gorilla, baboon, and marmoset, respectively. Output sequences by ClustalW were visually inspected and manually realigned, if necessary.
To detect repetitive sequences in promoter regions, Harr plots were drawn with our custom scripts. We used a 5-bp sliding window with a 1-bp stride. One dot indicated a 5-bp perfect match. Except for the PNP, CEND1, and TBCE promoters, Harr plots are provided as supplementary material (supplementary fig. S2).
This work was supported by NCCHD (Grant Number 2019C-15 to SA) and KAKENHI (Grant Number 23770273 to KO).
MF was supported by Project Encouraging Science Undergraduates by MEXT and Ochanomizu University when she was an undergraduate student. The computing resources were mainly provided by cluster of Hitachi HA8000/RS210 at the Center for Regenerative Medicine, National Research Institute for Child Health and Development.
KO conceived and designed the study. SA and MF performed the analysis. SA, MF, KY, and KO interpreted the data and discussed the results. SA and KO drafted the manuscript.
All data are available upon request.
Custom scripts, which were written in Python, are available upon request.
The authors have no conflicts of interest and declare no competing financial interests.
All authors agree the order of author listing and read the final manuscript.
All authors approved submission of the final manuscript.
Below is the link to the electronic supplementary material.
Graph: Supplementary file1 (PDF 1398 kb)
Graph: Supplementary file2 (PDF 1539 kb)
Graph: Supplementary file3 (PDF 185 kb)
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
By Saki Aoto; Mayu Fushimi; Kei Yura and Kohji Okamura
Reported by Author; Author; Author; Author