While the malaria parasite Plasmodium falciparum has low average genome-wide diversity levels, likely due to its recent introduction from a gorilla-infecting ancestor (approximately 10,000 to 50,000 years ago), some genes display extremely high diversity levels. In particular, certain proteins expressed on the surface of human red blood cell–infecting merozoites (merozoite surface proteins (MSPs)) possess exactly 2 deeply diverged lineages that have seemingly not recombined. While of considerable interest, the evolutionary origin of this phenomenon remains unknown. In this study, we analysed the genetic diversity of 2 of the most variable MSPs, DBLMSP and DBLMSP2, which are paralogs (descended from an ancestral duplication). Despite thousands of available Illumina WGS datasets from malaria-endemic countries, diversity in these genes has been hard to characterise as reads containing highly diverged alleles completely fail to align to the reference genome. To solve this, we developed a pipeline leveraging genome graphs, enabling us to genotype them at high accuracy and completeness. Using our newly resolved sequences, we found that both genes exhibit 2 deeply diverged lineages in a specific protein domain (DBL) and that one of the 2 lineages is shared across the genes. We identified clear evidence of nonallelic gene conversion between the 2 genes as the likely mechanism behind sharing, leading us to propose that gene conversion between diverged paralogs, and not recombination suppression, can generate this surprising genealogy; a model that is furthermore consistent with high diversity levels in these 2 genes despite the strong historical P. falciparum transmission bottleneck.
Merozoite surface proteins expressed by the malaria parasite Plasmodium falciparum on the surface of human red-blood-cell-infecting merozoites possess exactly two deeply-diverged lineages that have seemingly not recombined. The evolutionary origin of this phenomenon remains unknown but genomic analysis of DBLMSP and DBLMSP2 reveals the answer to this long-standing puzzle.
Plasmodium falciparum is a single-celled eukaryotic parasite causing malaria disease in humans. Malaria burden remains high worldwide, with 241 million cases and 627,000 deaths in 2020 according to WHO [[
Historically, several cell-surface antigens called merozoite surface proteins (MSPs) were found to display unusual genealogies, with exactly 2 deeply diverged lineages: This includes MSP1, MSP2, MSP3, and MSP6 [[
In this study, we focussed on 2 MSPs called DBLMSP and DBLMSP2, both among the most diverse genes in P. falciparum [[
The evolutionary history of P. falciparum surface antigens, including DBLMSP and DBLMSP2, has been difficult to study until now because of reference bias: Reads spanning highly diverged nonreference alleles fail to align to a reference genome, making them hard to reconstruct. To address this, we previously developed gramtools, a software for mapping reads and genotyping using a genome graph incorporating multiple references simultaneously [[
For the remainder of this paper, we refer to DBLMSP and DBLMSP2 collectively as DBLMSP1/2.
To analyse variation in DBLMSP1/2, we used data from malariaGEN, a consortium releasing Illumina whole-genome sequencing data from global P. falciparum samples [[
To evaluate our genotype calls, we implemented 2 orthogonal approaches (see Methods) and compared our pipeline outputs with those from malariaGEN's existing pipeline, based on GATK. GATK is a state-of-the-art genotyping framework [[
To analyse polymorphism levels in DBLMSP1/2, we translated all confidently resolved gene sequences from the 2 pipelines into proteins and computed 2 measures of sequence diversity, as shown in Fig 1 (measures computed from multiple-sequence alignments). In panel (a), we show within-gene heterozygosity (y-axis), defined as the probability that, for a given gene and at a given aligned position (x-axis), 2 randomly chosen amino acids from the population differ. For both DBLMSP and DBLMSP2, much less diversity was recovered by the GATK-based pipeline (left-hand side panels) compared to ours (right-hand side). In our sequences only, a central region of each gene is particularly polymorphic and spans their DBL domain, delimited with blue vertical dotted lines (see Methods for annotation).
Graph: In panel (a), the y-axis measures the probability that, at each aligned protein position (x-axis), 2 randomly chosen amino acids differ, for each gene and each of the GATK- and gramtools-based pipelines. A region of extreme diversity spans the DBL domain, annotated with blue vertical dotted lines, and is only visible with our new pipeline. Panel (b) shows the probability that 2 randomly chosen amino acids, one from each gene, differ. A value of 1 indicates no amino acids in common, i.e., full divergence of the 2 genes. The DBL domain lies in a region of shared sequence, where no amino acid has fully diverged, which is indicated with red vertical dotted lines; we call this the DBL-spanning region (DSR). We note that a smaller C-terminal region also displays positions with putative sequence sharing, but these are in fact gap characters in an indel-rich region of the alignment. The data and code to generate this Figure can be found at.
In panel (b), we show between-gene heterozygosity, defined as the probability that, for 2 genes at an aligned position, 2 randomly chosen amino acids (one from each gene) differ. A value of 1 indicates no amino acids in common between the sequences of the 2 genes (fully diverged position), while a value of 0 indicates a single identical amino acid is found in both genes (fixed position). While many positions are fully diverged, a region spanning the DBL domain, shown with red vertical dotted lines, has zero fixed differences between the genes, indicating sequence sharing. This observation is impossible with previous methods (i.e., the GATK-based or any single-reference/non-pangenomic pipeline). We call this region the DBL-spanning region, or DSR, and focus the rest of the analysis in this paper on this region, using our gramtools-based pipeline results.
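The two heterozygosity measures can be illustrated with a minimal Python sketch. It assumes that amino acids are drawn with replacement, so that the probability of a difference is 1 minus the sum of products of matching frequencies, and that gap characters are counted like any other character. The function names and toy alignments are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def frequencies(column):
    """Relative frequency of each character in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def heterozygosity(column_1, column_2):
    """Probability that one character drawn from each column differs.

    Assumption: draws are made with replacement. Passing the same column
    twice gives the within-gene measure; passing one column per gene gives
    the between-gene measure.
    """
    f1 = frequencies(column_1)
    f2 = frequencies(column_2)
    shared = sum(f1[aa] * f2.get(aa, 0.0) for aa in f1)
    return 1.0 - shared


def per_position(msa_1, msa_2):
    """Heterozygosity at each aligned position of two equal-width alignments."""
    return [heterozygosity(c1, c2) for c1, c2 in zip(zip(*msa_1), zip(*msa_2))]


# Toy alignments (hypothetical sequences, 3 aligned positions each).
gene_1 = ["MKA", "MKT", "MRA"]
gene_2 = ["MQA", "MQG", "MQA"]

within_gene_1 = per_position(gene_1, gene_1)
between_genes = per_position(gene_1, gene_2)
```

In this toy example, the first position is fixed in both genes (between-gene value 0), the second position has no amino acid in common (value 1), and the third is intermediate.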
We then built a hierarchical clustering tree from all unique protein sequences (278 in total) in the multiple-sequence alignment of DBLMSP1/2 to visualise the sequence relationships in the DSR (Fig 2). The tree clearly shows 3 main lineages, marked A, B, and C. Lineages A and C consist exclusively of DBLMSP2 (blue in innermost coloured ring) and DBLMSP (yellow) sequences, respectively, while lineage B contains sequences from both genes. We thus call lineage B the "shared lineage" and lineages A and C the "private lineages" (for details, see S11 Fig). We note that the shared lineage is abundant in populations: It occurs at approximately 25% to 50% frequencies in all 16 countries with more than 50 samples (S12 Fig). This is consistent with balancing selection maintaining both shared and private lineages in populations.
Graph: We built a hierarchical clustering tree of all unique DBL-spanning protein sequences (see Methods). The inner ring colours sequences by gene of origin (DBLMSP, DBLMSP2), and the outer ring shows species of origin, for P. falciparum and its 3 most closely related species. Three main lineages exist in the tree, labelled A, B, and C: Lineages A and C contain only representatives of DBLMSP2 and DBLMSP, respectively ("private lineages"), and lineage B contains representatives of both ("shared lineage"). The data and code to generate this Figure can be found at.
To probe the evolution of DBLMSP1/2, we searched for orthologs in the 6 closest known relatives of P. falciparum (all part of subgenus Laverania). Using sequencing data and genome assemblies from Otto and colleagues [[
The species of origin of each DBLMSP/DBLMSP2 sequence is shown in Fig 2 (outer ring). For DBLMSP, the orthologs fall in a distinct sublineage of lineage C, consistent with the known Laverania phylogeny. For DBLMSP2, the 2 orthologs in P. reichenowi are pseudogenes (consistent with prior knowledge [[
Overall, in P. falciparum, 2 deeply diverged lineages exist per gene, one of which is shared across both—leading to 3 lineages in total, instead of 4. In S16 Fig, we show HMM logos of what each sequence lineage prototypically looks like at the amino acid level. In other highly diverse MSPs, recombination between the 2 lineages has been reported to be rare or absent [[
To detect recombination in our DBLMSP1/2 protein sequences, we used a method developed by Zilversmit and colleagues [[
Graph: (a) Visual confirmation of MosaicAligner's inferred alignment for the first DBLMSP sequence (target; fully shaded sequence), aligned to 2 other DBLMSP sequences (donors; partially shaded sequences). The red vertical line marks the switch in alignment between the 2 donors. Either side of the switch, the target aligns to the shaded donor with many fewer edits than to the other donor (red-coloured letters flag mismatches). (b) Full mosaic alignments illustrated for 3 out of 35 representative sequences in the panel. In each panel, the aligned target is the fully opaque row labelled with an arrow, and the donors are shown as partially opaque rows. Illustrations for all 35 alignments are available with this paper. (c) For each gene, the aggregated locations of all breakpoints from the mosaic alignments are shown. The breakpoints did not seem to cluster into hotspots. The data and code to generate this Figure can be found at.
For 3 of the mosaic alignments, the donor sequences came from different genes (last alignment of Figs 3B and S21), consistent with sequence exchange between the paralogs. This can notably occur during repair of double-strand breaks and subsequent sequence pasting—also called gene conversion—from a nearby unbroken template. The template is usually a homologous gene copy, either from an identical sister chromatid (e.g., after genome replication) or from a homologous chromosome (e.g., during recombination in meiosis). In some cases, a nearby paralog can act as the template instead; this is also called nonallelic conversion [[
Graph: (a) This scheme explains the matrix that follows in panel (b). For each of the samples in which both DBLMSP1/2 gene sequences were confidently resolved, we aligned their DNA sequences in the DSR and recorded positions where codons were identical between the 2 genes (beige cells) versus different (black cells). Gene conversion should appear as contiguous strips of beige cells. (b) The 209 samples with >50% identical codons between DBLMSP and DBLMSP2 are shown (rows) at each position of the DSR (columns). The strips of near-all beige indicate likely sequence copying between the 2 genes in a sample, supporting within-genome gene conversion. Two main sets of samples can be distinguished visually, consistent with at least 2 distinct conversion events (labelled on the right) having occurred in the lineages leading to these samples. The data and code to generate this Figure can be found at.
The samples in Fig 4 panel b are consistent with 2 main conversion events having occurred ancestrally, with different breakpoints (start and end positions of the beige strips) and sequences (S23 Fig). Samples from the 2 events are geographically widespread, occurring across both west and east sub-Saharan Africa and Southeast Asia (S24 Fig), suggesting they are both being actively maintained, either through selection or recurrent conversion.
In Fig 5, we illustrate the relationship between gene conversion and the lineages of DBLMSP1/2. In panel a, we show once more the clustering tree from Fig 2, with the addition of an outer ring marking samples belonging to the 2 different conversion events identified in Fig 4. The coloured stars mark the putative locations of new subclades in the tree that were created by gene conversion.
Graph: (a) The same clustering tree as in Fig 2 is shown (built from alleles of the DSR), with the addition of an outer ring labelling the sequences shown in Fig 4. These were divided into the 2 distinct conversion events identified in Fig 4 and labelled green (conversion event 1) and pink (conversion event 2). The 2 events gave rise to new subclades in the tree; for how, see the main text and the following panel. (b) A simplified schematic showing how gene conversion event 1 created 2 deeply diverged lineages either in DBLMSP2 (scenario i) or in DBLMSP (scenario ii), depending on the direction of the sequence pasting. The data and code to generate this Figure can be found at.
Conversion event 1 is labelled in green in Fig 5A, and panel b illustrates its effect on an ancestral tree of DBLMSP1/2 sequences, depending on whether DBLMSP pasted into DBLMSP2 (scenario i), or vice versa (scenario ii). For example, in scenario ii, a preexisting sequence from lineage B.2 in DBLMSP2 pasted into a sequence from DBLMSP lineage C, giving rise to lineage B.1 in DBLMSP and the creation of 2 deeply diverged lineages in DBLMSP. In scenario i, 2 deeply diverged lineages are instead created in DBLMSP2, through pasting from a DBLMSP allele in lineage B.1. Note that because most of the sequence has been pasted (approximately 80%; Fig 4B), the recipient sequence ends up in a lineage close to the donor sequence; the opposite would hold if a small minority (e.g., 20%) of the sequence had been pasted. We hypothesise scenario ii is more likely, leading to the birth of subclade B.1 (green star in Fig 5A), because we identified sequences from P. praefalciparum in these 2 lineages (B.2 and C), but no sequences from either lineage B.1 (DBLMSP) or lineage A (DBLMSP2).
Similarly, conversion event 2 led to the birth of lineage A.1 or lineage B.1.1 and likely occurred after conversion event 1, as subclade B.1.1 lies nested within subclade B.1. Here, we cannot speculate on which gave rise to the other, so we show 2 pink stars in Fig 5 panel a. Note that because of the intermediate fraction of pasted sequence for this event (approximately 0.55; Fig 4B), subclades A.1 and B.1.1 are not located close to each other in the tree. Finally, while our data are clearly consistent with sublineage birth in the tree of DBLMSP by gene conversion, they do not explain the preexistence of deeply diverged lineages in DBLMSP2 (Fig 5B). We return to this in the discussion.
The recombination and gene conversion events studied so far were inferred indirectly from population-level data. To test for direct evidence of these events in DBLMSP1/2, we also analysed data from repeatedly sequenced isolates through time. We looked for mutations in DBLMSP1/2 in 2 sources: the "clone trees" from Hamilton and colleagues [[
The existence of exactly 2 deeply diverged lineages that have not recombined in specific P. falciparum genes, historically called "allelic dimorphism" in the malaria literature, has been a long-standing puzzle [[
This idea is consistent with the recent introduction of P. falciparum into humans by zoonosis from a common ancestor with the gorilla-infecting P. praefalciparum, only about 10,000 to 50,000 years ago [[
In the future, our evolutionary model could also be tested in other MSPs that have been called "dimorphic": notably, MSP2 occurs in tandem with another MSP (MSP4), and MSP3 and MSP6 both occur in the same 8-gene paralog tandem array as DBLMSP and DBLMSP2 (S1 Fig). Interestingly, and as for DBLMSP1/2 here, in 2003, Nielsen and colleagues reported gene conversion between the paralogous genes FP2A and FP2B located 10 kbp apart, causing the genes to look far more diverse than consistent with a recent bottleneck [[
In terms of evolutionary constraints, we note that in P. falciparum, DBLMSP2 appears more highly constrained than DBLMSP: of 234 DBLMSP1/2 gene sequences with premature stop codons, 196 lie in DBLMSP and 38 in DBLMSP2. By contrast, in P. reichenowi and P. praefalciparum, all 4 identified DBLMSP orthologs had a complete open reading frame (2 in each species), while 1 of 4 DBLMSP2 orthologs did (in P. praefalciparum). This raises the intriguing possibility of a human-specific function (or constraint) having evolved in DBLMSP2. To fully test this will require population-level data in P. reichenowi and P. praefalciparum.
We also note that gene conversion has only occurred in (or been selected in) the DSR, while the rest of the DBLMSP1/2 genes have diverged substantially (Fig 1). In other P. falciparum proteins, the DBL domain is key to parasite invasion and persistence: In the EBA family, the DBL domain mediates RBC invasion, and in the vars, it binds to receptors on endothelial cells and other infected RBCs (iRBC), enabling iRBC sequestration [[
In conclusion, our study highlights the importance of paralogous gene evolution, and we hope that the higher-resolution sequence data in DBLMSP1/2 will help contextualise their biological function once it is elucidated.
Of the >7,000 samples available in the MalariaGEN release that we analysed (released November 2020 [[
The sequenced life stage of P. falciparum is haploid, so a single haplotype can be expected in each sample. However, samples with multiple co-occurring strains (multiplicity of infection (MOI) > 1) are common in P. falciparum and hard to genotype confidently. We thus further filtered out samples with evidence of MOI > 1, using the F
For each sample in the analysis set, reads were downloaded from the ENA, trimmed using trimmomatic [[
We also aligned reads to the 3D7 reference genome using bwa-mem [[
We provide only a brief summary of genotyping evaluation and performance here and refer the reader to S1 Text for full details. We used 2 approaches to evaluate the genotype calls made by both pipelines (S4 and S5 Figs). The first used 14 independent samples with both Illumina and PacBio data, from which high-quality truth assemblies were built in [[
We then defined a gene sequence as confidently resolved if, across the full gene (DBLMSP: 2,094 base pairs (bp); DBLMSP2: 2,289 bp), the reads contained no positions with fewer than 5 aligned reads, no positions where the majority of reads disagreed with the induced reference, and <15% of reads with large insert sizes. This resulted in 5,895 confidently resolved DBLMSP1/2 sequences in total for the gramtools-based pipeline (including the two 3D7-reference representatives). For the gramtools-based pipeline results, we observed that a further 200 DBLMSP1/2 sequences (92 DBLMSP and 108 DBLMSP2) had no coverage gaps, <15% of reads with large insert sizes, and a single majority-pileup difference. We corrected these single-SNP sequences using a custom script (available with this paper) and added them to our analysis set. We also added the 28 DBLMSP1/2 sequences from the 14 samples assembled by Otto and colleagues [[
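The filtering criteria above can be sketched in a few lines of Python. This is an illustration only: the `GenePileup` container and its field names are hypothetical, and the per-position depths, majority-disagreement flags, and large-insert fraction are assumed to have been computed beforehand from reads remapped to the induced reference.

```python
from dataclasses import dataclass
from typing import List

MIN_DEPTH = 5                     # positions with fewer aligned reads are coverage gaps
MAX_LARGE_INSERT_FRACTION = 0.15  # fraction of reads with large insert sizes


@dataclass
class GenePileup:
    """Hypothetical per-gene summary of reads remapped to the induced reference."""
    depths: List[int]             # aligned read depth at each gene position
    majority_differs: List[bool]  # True where most reads disagree with the reference
    large_insert_fraction: float  # fraction of reads with large insert sizes


def coverage_gaps(pileup):
    return sum(depth < MIN_DEPTH for depth in pileup.depths)


def pileup_differences(pileup):
    return sum(pileup.majority_differs)


def is_confidently_resolved(pileup):
    """No coverage gaps, no majority-pileup differences, few large inserts."""
    return (coverage_gaps(pileup) == 0
            and pileup_differences(pileup) == 0
            and pileup.large_insert_fraction < MAX_LARGE_INSERT_FRACTION)


def is_single_snp_candidate(pileup):
    """Same criteria, but with exactly one majority-pileup difference."""
    return (coverage_gaps(pileup) == 0
            and pileup_differences(pileup) == 1
            and pileup.large_insert_fraction < MAX_LARGE_INSERT_FRACTION)


# Toy pileups over 3 positions (hypothetical values).
clean = GenePileup([12, 9, 30], [False, False, False], 0.02)
one_snp = GenePileup([12, 9, 30], [False, True, False], 0.02)
gapped = GenePileup([12, 3, 30], [False, False, False], 0.02)
```

Here `clean` passes the filter, `one_snp` would be a candidate for single-SNP correction, and `gapped` is rejected by both checks.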
Each step in the gramtools-based pipeline gradually improved genotyping performance across both evaluation metrics, and most samples were confidently resolved at the end (S6 Fig). Overall, the gramtools-based pipeline also clearly outperformed the GATK-based pipeline (S7–S9 Figs).
DBLMSP1/2 gene sequences were translated to protein using seqkit [[
The heterozygosities in Fig 1 were computed on multiple-sequence alignments built using mafft [[
At a given aligned position, we define the set of all observed amino acids as {a
Graph
The within-gene heterozygosity is the equation above evaluated on a single gene j (h
The DBL domain was annotated on the 3D7 sequence of DBLMSP by downloading its HMM model from InterPro (https://
To obtain DBLMSP and DBLMSP2 sequences from other Laverania parasites, we used the data from Otto and colleagues [[
To find DBLMSP and DBLMSP2 in the assemblies (plus an additional assembly for P. reichenowi [[
As P. praefalciparum is particularly relevant to the evolution of P. falciparum, we worked to resolve DBLMSP and DBLMSP2 in a further 3 isolates with only Illumina data [[
The hierarchical clustering tree was built using scipy [[
The 5,889 DBLMSP1/2 protein sequences in the DSR were first clustered at 96% identity using cd-hit [[
To perform "mosaic alignment," MosaicAligner [[
All mosaic alignments included at least 1 recombination breakpoint. To validate these, we compared the edit distance of each target to its MosaicAligner-inferred donor path with the edit distance to the single closest donor. The former was always smaller than the latter (S18 Fig).
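The validation logic can be sketched as follows. This is a simplified illustration, not MosaicAligner's own output format: the donor path is assumed to be available as a list of donor segments that are concatenated into one mosaic sequence, and the helper names and toy sequences are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # match or substitution
            ))
        previous = current
    return previous[-1]


def compare_path_to_closest_donor(target, donors, path_segments):
    """Return (distance to mosaic donor path, distance to single closest donor).

    Assumption: path_segments are the donor substrings used on either side of
    each inferred breakpoint, in order.
    """
    mosaic = "".join(path_segments)
    to_path = edit_distance(target, mosaic)
    to_closest = min(edit_distance(target, donor) for donor in donors)
    return to_path, to_closest


# Toy example: the target is a mosaic of the first half of donor 1
# and the second half of donor 2 (hypothetical protein sequences).
donors = ["MKLVQQQQ", "RRRRNNEE"]
target = "MKLVNNEE"
to_path, to_closest = compare_path_to_closest_donor(
    target, donors, ["MKLV", "NNEE"])
```

A breakpoint is supported when the first value is smaller than the second, as in this toy case.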
When comparing the sequences of DBLMSP1/2 in individual samples, we aligned their DNA sequences, as gene conversion occurs at the DNA level, and measured the fraction of identical codons, not nucleotides, to match the protein-level analysis. Codon-level identity is closer to protein-level identity than nucleotide-level identity is, although it is a lower bound on it, because 2 identical amino acids can be encoded by 2 different codons.
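As an illustration of the codon-level measure, the sketch below compares two aligned, in-frame DNA sequences codon by codon. It assumes a codon-aware alignment of equal length and skips codons that are gaps in both sequences; both choices, and the function names, are assumptions made for this example.

```python
def codon_identity_flags(dna_1, dna_2):
    """Per-codon identity flags for two aligned, in-frame DNA sequences."""
    if len(dna_1) != len(dna_2) or len(dna_1) % 3 != 0:
        raise ValueError("expected two in-frame aligned sequences of equal length")
    flags = []
    for i in range(0, len(dna_1), 3):
        codon_1, codon_2 = dna_1[i:i + 3], dna_2[i:i + 3]
        if codon_1 == "---" and codon_2 == "---":
            continue  # assumption: codons that are gaps in both sequences are skipped
        flags.append(codon_1 == codon_2)
    return flags


def codon_identity(dna_1, dna_2):
    """Fraction of compared codons that are identical."""
    flags = codon_identity_flags(dna_1, dna_2)
    return sum(flags) / len(flags) if flags else 0.0


# GCT and GCC both encode alanine, so the proteins are identical here,
# yet the middle codon counts as different.
flags = codon_identity_flags("ATGGCTAAA", "ATGGCCAAA")
identity = codon_identity("ATGGCTAAA", "ATGGCCAAA")
```

In this toy pair, 8 of 9 nucleotides and 2 of 3 codons are identical while the encoded proteins are the same, which shows why codon identity is a lower bound on protein identity.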
For identifying samples with evidence of gene conversion, we looked for stretches of identical sequence between the 2 paralogs on the same genome, for all samples where both DBLMSP1/2 sequences were "confidently resolved," meaning (as defined above) no coverage gaps or pileup-based differences and no high levels of large-insert sizes. To rule out the possibility of erroneously attributing sequence sharing to a duplication event in 1 gene, we further filtered out samples with evidence of a possible gene copy-number variation (CNV; see S25 Fig). Of the 212 samples with a codon-level identity >0.5, 3 had a possible CNV, leaving 209 samples all shown in Fig 4.
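The search for stretches of identical sequence can be sketched as a run-length scan over per-codon identity flags (True where the codons of the 2 paralogs are identical in a sample). The 0.5 identity threshold comes from the text; the function names and the choice to report only the longest run are illustrative assumptions.

```python
def identical_runs(flags):
    """Return (start, length) for every maximal run of identical codons."""
    runs = []
    start = None
    for i, same in enumerate(flags):
        if same and start is None:
            start = i
        elif not same and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(flags) - start))
    return runs


def conversion_candidate(flags, min_identity=0.5):
    """Longest identical run if codon-level identity exceeds the threshold.

    Returns None when the sample does not pass the identity threshold.
    """
    if not flags:
        return None
    identity = sum(flags) / len(flags)
    if identity <= min_identity:
        return None
    return max(identical_runs(flags), key=lambda run: run[1])


# Toy samples (hypothetical flags over 6 and 4 codons).
shared_sample = [False, True, True, True, False, True]
diverged_sample = [True, False, False, False]

longest = conversion_candidate(shared_sample)
rejected = conversion_candidate(diverged_sample)
```

The start and end of the longest run give a rough estimate of the breakpoints of a putative conversion tract in that sample.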
We further validated 8 samples from each gene conversion event in Fig 4 by manually inspecting read coverage levels and insert sizes in IGV [[
For the P. falciparum genetic crosses, we used all 4 publicly available crosses, between strains 3D7 and HB3 [[
We downloaded all available read accessions (284 clone tree samples in 6 clone trees, and 142 samples in 4 crosses) from the ENA. For each sample, we performed preprocessing as above (trimmomatic + rasusa) and then genotyped each sample with gramtools on the graph built from the 3,589 "analysis-set" samples above. To discover any missed variation (as these samples were not part of the 3,589 in the graph), or mutational events in progeny samples, we then ran all steps of our gramtools-based pipeline up to and including Gapfiller (S4 Fig).
By our evaluation pipeline standards, all samples were confidently resolved. To infer any mutations, we then aligned each progeny sample to its parent(s): in the clone trees, to the only parent, and in the crosses, to both parents, keeping the closest one (the parents usually being highly diverged). The only mutational events found were 2 SNPs, one in each DBLMSP1/2 gene, in 1 progeny sample of the cross HB3xDd2 (S27 Fig).
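The parent-progeny comparison can be sketched as below. This is a simplification under stated assumptions: sequences are already aligned to equal length, the gene is treated as inherited whole from one parent (a progeny that recombined within the gene would need a per-segment comparison), and the function names and toy sequences are hypothetical.

```python
def differences(parent, progeny):
    """List (position, parent_base, progeny_base) where aligned sequences differ."""
    if len(parent) != len(progeny):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, p, q) for i, (p, q) in enumerate(zip(parent, progeny)) if p != q]


def mutations_vs_closest_parent(progeny, parents):
    """Pick the parent with the fewest differences and report those differences.

    parents maps a parent name to its aligned gene sequence. A clone tree
    has a single entry, a cross has two.
    """
    closest = min(parents, key=lambda name: len(differences(parents[name], progeny)))
    return closest, differences(parents[closest], progeny)


# Toy cross with 2 diverged parents and 1 progeny carrying a single SNP
# relative to its closest parent (hypothetical sequences).
parents = {"parent_1": "ACGTACGT", "parent_2": "TTGTACCA"}
closest, mutations = mutations_vs_closest_parent("ACGTACGA", parents)
```

Any differences that remain against the closest parent are candidate de novo mutations to inspect further.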
The analysis steps underlying the results presented in this paper were implemented using bioinformatic workflows written in Snakemake [[
S1 Text
Text file supporting the supplementary figures of the paper.
(DOCX)
S1 Fig
Genomic context and protein domains in DBLMSP and DBLMSP2.
The 2 genes, marked with grey arrows, are located at a distance of 16.1 kbp from each other, inside an array of 8 contiguous genes spanning 32 kbp on chromosome 10. These genes are likely paralogs due to observed sequence sharing: All 8 have an N-terminal shared motif, a further 6 have a C-terminal SPAM domain, and DBLMSP and DBLMSP2 further share a DBL domain (domains shown as coloured circles below each gene). Figure annotated from a screenshot of the gene track for DBLMSP2 taken from PlasmoDB.
(TIF)
S2 Fig
Geographical distribution of the 3,589 analysed P. falciparum samples.
A total of 29 countries are represented, with most samples located in the 2 regions with highest endemicity: sub-Saharan Africa and Southeast Asia. The base map comes from the freely distributed R package "maps," under a GPL-2 licence: https://cran.r-project.org/package=maps. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S3 Fig
Read statistics for the 3,589 analysed P. falciparum samples.
The upper panels show statistics measured on the reads directly: per-base quality (top-left panel) and read lengths (top-right panel). Per-base quality q gives the Illumina-estimated sequencing error rate ϵ as
ϵ = 10^(−q/10)
(TIF)
S4 Fig
Existing and novel genotyping pipelines applied to the MalariaGEN data.
Panel (a) illustrates MalariaGEN's existing GATK-based pipeline, and panel (b) illustrates our new pipeline. Both first discover variants in each sample individually before regenotyping each sample at the union of all variants. GATK relies on the linear reference genome to do this, while gramtools uses a genome graph.
(TIF)
S5 Fig
Framework used to evaluate the variant calls from the GATK and gramtools-based pipelines.
Starting from a tool's variant calls in a VCF file (middle), 2 independent evaluations were performed. First, for 14 samples with truth assemblies, the calls were directly compared to the truth, by applying them to the 3D7 gene sequence and measuring edit distance for the whole gene (part a). Second, the calls were all applied to the reference genome, and the reads remapped to this induced reference. Incorrect or missing calls then appear from read pileups, as majority-differences compared to the reference base, coverage gaps, or inconsistent insert sizes between read pairs (part b).
(TIF)
S6 Fig
Performance of the gramtools pipeline steps in DBLMSP and DBLMSP2.
The 2 panels, a and b, correspond to parts a and b of the evaluation framework in S5 Fig. Panel (a) shows the mean edit distance between the inferred gene sequence and the truth assembly for the 14 samples with truth assemblies (edit distance is scaled by gene length). Panel (b) shows the fraction of positions with pileup-based differences (top) and with low read coverage (bottom), after the sequencing reads are remapped to the 3D7 reference genome with each tool's called variants applied. A pileup-based difference is when the majority of reads disagree with the reference at a given position, given a minimum of 5 mapped reads, and low read coverage is defined as a position with fewer than 5 mapped reads. Each bar in panel b shows the mean across 500 of the 3,589 analysed samples. Across both panels, each coloured bar corresponds to one additional step in the gramtools-based pipeline, in the same order they are run (see S4 Fig). The "baseline" condition is not part of the pipeline and refers to using 3D7 reference gene sequence with no variants applied (in panel a: 3D7 sequence aligned to the truth assemblies; in panel b: sample reads aligned to the 3D7 reference genome). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S7 Fig
Global performance of the gramtools-based and GATK-based pipelines.
Panels a and b show the same metrics as S6 Fig. Metrics in panel b were computed on all 3,589 analysed samples. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S8 Fig
Frequency distributions of the evaluation metrics for the gramtools-based and GATK-based pipelines.
Each subplot shows the frequency distribution, across all 3,589 analysed samples, of the fraction of positions with pileup-based gaps (left-hand side plots) or differences (right-hand side plots), for DBLMSP (top) and DBLMSP2 (bottom). The mean is shown as a red vertical line (value shown in text next to it) and corresponds to the height of the coloured bars in S7 Fig panel b. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S9 Fig
Sequence filtering using pileup-based metrics.
In each panel ("baseline": no variant calling, "gram_joint_geno": gramtools-based pipeline, "malariaGEN": GATK-based pipeline), the total fraction of remaining gene sequences (out of the 3,589 analysed samples) passing filters is shown. Filters (colours) are applied in succession, on each set of remaining gene sequences, in the order they appear in the legend. The number of remaining sequences is given above each coloured bar. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S10 Fig
Peptides defined as shared are shared in many different countries.
The number of shared peptides by our definition (y-axis) that are found in both genes inside the same country, for up to 16 countries with high levels of sampling (defined as >50 available DBLMSP1/2 sequences; x-axis). A value of zero on the x-axis means the shared peptide is not found on both genes in any of these countries, and 16 means it is found on both genes in all of them. A majority (57%) of shared peptides are found in all of these countries, and 86% are found in at least 2 different countries, showing that the shared peptides are, overall, highly widespread geographically. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S11 Fig
Clustering tree with sequence sharing.
The 2 innermost rings show the gene and species of origin (as in Fig 2), and the outermost ring measures the level of sequence sharing between the 2 genes (see definition in the text). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S12 Fig
Shared peptide 10-mers are highly prevalent.
The frequency of shared peptides (y-axis), at each position (x-axis), is shown for the 16 countries with more than 50 sequences. Colour indicates frequency in each gene. Shared peptides are found at high frequencies, between 25% and 50%, across all countries. By extension, private peptides are also frequent, as any nonshared peptide is a private peptide. Values of zero, at the left and right ends of the x-axis, show the diverged flanks of the region, while values of one correspond to peptides that are always identical in both genes, i.e., where any mutations are likely eliminated by selection. On the left-hand side of the plots, DBLMSP2 displays a region with low shared peptide frequency across all countries, indicating this region has almost fully diverged between DBLMSP and DBLMSP2. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S13 Fig
Identification of P. falciparum orthologs in Laverania assemblies.
For each P. falciparum gene (panels), orthologs were searched for using minimap2 (preset: "-x asm20"). The y-axis shows the length of each hit normalised by the length of the P. falciparum gene sequence, and hits are coloured by % identity between query sequence and target in each Laverania assembly. The first 7 panels show genes occurring contiguously in a 40-kbp stretch of chromosome 10 on the P. falciparum 3D7 reference genome, and AMA1 was added as we expected it to be well conserved and found in single-copy. AMA1 could indeed be found in full length across all 6 Laverania assemblies, as was MSP11, a gene located in-between DBLMSP and DBLMSP2. We note that many genes are missing in P. blacklocki; this is most likely due to a restrictive form of whole-genome amplification prior to sequencing, which the original authors noted led to missing core genes in the resulting assembly [[
(TIF)
S14 Fig
Clustering tree of full-length DBLMSP1/2 sequences.
This Figure is the same as Fig 2, except that the tree was built from all unique DBLMSP1/2 full-length protein sequences, and not just of the DSR. While DBLMSP sequences from P. reichenowi and P. billcollinsi are outgroups in the clade of DBLMSP alleles, the sequences of DBLMSP and DBLMSP2 from P. praefalciparum fall nested within clades of P. falciparum alleles. This is consistent with a recent radiation of P. falciparum from a P. praefalciparum-like ancestor. DBLMSP2 is absent in P. billcollinsi and not shown in the tree for P. reichenowi as it is pseudogenised. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S15 Fig
Clustering tree of full-length AMA1 sequences.
As in S14 Fig above, the orthologous sequence from P. praefalciparum falls inside a P. falciparum clade, consistent with a recent radiation of P. falciparum from a P. praefalciparum-like ancestor, while orthologs from the other Laverania species occur as outgroups to P. falciparum alleles. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S16 Fig
HMM logos of private and shared DB sequences in the DSR.
One logo was produced for peptides found only in DBLMSP (top panel), only in DBLMSP2 (middle panel), and found on both genes (lower panel, labelled "Both"). The 3 tracks are broken into segments for visual clarity. At each position, observed amino acids are shown, with letter height proportional to amino acid frequency. In-between diverged N- and C-terminal regions, there is mostly 1 prototypical private sequence for each gene (first 2 tracks) and 1 prototypical shared sequence (or 2, in the C-terminal half of the protein domain). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S17 Fig
Number of distinct shared and private peptides per position.
For the 2 private and 1 shared MSAs, containing peptides only found in DBLMSP (top panel), only found in DBLMSP2 (middle panel), or both (lower panel), the total number of distinct peptide 10-mers at each position is shown. Mostly 1 to 4 peptides were observed at each position in the shared category, while 2 to 6 were observed on each gene only. This figure complements S16 Fig, which shows that mostly two 10-mer peptides occur in each gene with high frequency—here, total number is shown, regardless of frequency. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
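The per-position counts described in the S17 Fig legend can be illustrated with a minimal sketch. It assumes that peptides are compared between genes at the same alignment column and that windows containing gaps are skipped; neither choice is stated in the legend, and the function names and toy sequences are hypothetical, not taken from the authors' code.

```python
from collections import defaultdict

K = 10  # peptide length used in the legend (10-mers)


def distinct_kmers_per_position(aligned_seqs, k=K):
    """Map each alignment column to the set of distinct k-mers starting there."""
    kmers = defaultdict(set)
    for seq in aligned_seqs:
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            if "-" in window:  # assumption: ignore windows that contain gaps
                continue
            kmers[start].add(window)
    return kmers


def count_private_and_shared(dblmsp_seqs, dblmsp2_seqs, k=K):
    """Count distinct k-mers per column that are private to each gene or shared."""
    a = distinct_kmers_per_position(dblmsp_seqs, k)
    b = distinct_kmers_per_position(dblmsp2_seqs, k)
    counts = {}
    for pos in sorted(set(a) | set(b)):
        shared = a[pos] & b[pos]
        counts[pos] = {
            "DBLMSP_only": len(a[pos] - shared),
            "DBLMSP2_only": len(b[pos] - shared),
            "both": len(shared),
        }
    return counts


# Toy aligned protein sequences (hypothetical, 11 residues each).
toy_dblmsp = ["ACDEFGHIKLM", "ACDEFGHIKLW"]
toy_dblmsp2 = ["ACDEFGHIKLM", "YCDEFGHIKLM"]
toy_counts = count_private_and_shared(toy_dblmsp, toy_dblmsp2)
```

Because every start column is visited, overlapping 10-mers (positions 1 to 10, 2 to 11, and so on) are all counted in this sketch.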
S18 Fig
Validation of MosaicAligner-inferred recombination events.
The blue dots show, for each target, its edit distance to the path of donors inferred by MosaicAligner (y-axis) and the edit distance to the single closest donor (x-axis). The grey dotted line shows y = x. For every target, the inferred donor path has a smaller edit distance to the target than the single closest donor does, which supports adding recombination breakpoints to the alignment. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S19 Fig
Patterns of within- and interlineage recombinations.
The same clustering tree as in Fig 2 of the main text is shown, with the addition of dotted lines that connect 2 sequences if they were inferred to have recombined at some point in the past (see main text and Methods for how). Most recombination events occurred within the main lineages of the tree (e.g., within A, or within B.1), but a few events also occurred between highly diverged lineages of the tree (e.g., between C and A, or C and B.2). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S20 Fig
Specific examples of within and interlineage recombinations.
Five different recombinations are shown inside matrices, where, as in Fig 4 of the main text, each matrix depicts the mosaic alignment of 1 target sequence to the panel of 35 sequences. Sequences from DBLMSP (top) and DBLMSP2 (bottom) are separated by a white horizontal strip. Each cell is coloured by whether the size-10 peptide centred at that position occurs only in DBLMSP (blue-green), only in DBLMSP2 (orange), or in both (yellow). Recombinations mostly occur within the private DBLMSP2 lineage (all donors are mostly orange) and within the shared DBLMSP2 lineage (all donors are mostly yellow). In the last panel, the target is a recombinant of a highly private and a highly shared sequence. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S21 Fig
Three mosaic alignments support gene conversion.
In each panel, 2 recombination breakpoints can be seen (red vertical lines). The target sequence is the fully opaque one (along its entire length; indicated with a black arrow), and the donor sequences (those the target aligns to) are shown as highlighted in places they match the target, and less opaque where they do not. In each panel, the target aligns to donors across the 2 different genes, consistent with gene conversion between the genes. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S22 Fig
Distribution of DBLMSP1/2 codon-level identity across 2,882 confidently resolved samples.
For all samples in which both DBLMSP1/2 sequences were confidently resolved, the DNA sequences of DBLMSP and DBLMSP2 within a single genome were aligned and the fraction of identical codons in the DSR recorded. Most samples have quite low identity levels (e.g., 0.2 up to 0.4), and a minority of samples have high identity levels, defined as >0.5 identity. The latter samples are illustrated in Fig 4 of the main text. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
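The codon-level identity described in the S22 Fig legend can be sketched as follows. This is an illustrative calculation only: it assumes the two DSR sequences are already aligned in frame and of equal length, and that codons containing gaps are excluded, which the legend does not specify. The function names are hypothetical.

```python
def codon_identity(seq_a, seq_b):
    """Fraction of identical codons between two in-frame, aligned DNA sequences."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, and in frame")
    codons_a = [seq_a[i:i + 3] for i in range(0, len(seq_a), 3)]
    codons_b = [seq_b[i:i + 3] for i in range(0, len(seq_b), 3)]
    # Assumption: codons with an alignment gap in either sequence are skipped.
    compared = [(a, b) for a, b in zip(codons_a, codons_b)
                if "-" not in a and "-" not in b]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def has_high_identity(identity, threshold=0.5):
    """Apply the legend's definition of high identity (>0.5)."""
    return identity > threshold


# Toy example: three codons, the middle one differs.
toy_identity = codon_identity("ATGAAACCC", "ATGAAGCCC")
```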
S23 Fig
Sequence motifs of samples from the 2 gene conversion events.
One logo was produced for each of the 2 conversion events in Fig 4 of the main text, and each logo split into 3 portions for visual clarity. While at many positions, the sequences in each conversion event overlap, each is enriched for different amino acids, and some positions have entirely different amino acids. This supports a distinct evolutionary trajectory for each event and thus at least 2 distinct gene conversion events having occurred in DBLMSP1/2. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S24 Fig
Geographical distribution of the 2 gene conversion events.
The 2 panels correspond to the 2 gene conversion events identified in Fig 4 of the main text (in the same order). In each panel, the number of samples in each geographical region is shown, both through the size and colour of dots. For both conversion events, samples are geographically widespread, occurring across west and east Africa and Southeast Asia. The base map comes from the freely distributed python package "plotly" (function "plotly.express.scatter_geo"), under an MIT licence: https://github.com/plotly/plotly.py. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S25 Fig
Identification of samples with putative CNVs.
For all 3,589 analysis-set samples, the mean and standard deviation (std) of the per-base read coverage of reads realigned to the "induced reference" (S5 Fig, panel b) were measured in genes DBLMSP, DBLMSP2, and AMA1. For each gene, we produced a coverage interval [mean − 2 × std, mean + 2 × std], which we consider a "plausible range" of gene-level coverage. The x-axis shows the ratio of the mean coverage in DBLMSP1/2 to that in AMA1, a gene that we assume to be single-copy in all samples. The marginal distribution histogram is shown on top. Most samples have a ratio of 1, and some have ratios <0.5 or >2, indicating possible copy-number changes. The y-axis shows the fraction of the DBLMSP or DBLMSP2 coverage interval overlapped by the AMA1 coverage interval. Most samples have totally overlapping intervals (marginal distribution on right-hand side). Small overlap values indicate more likely true differences in coverage. Of the 6,123 analysed ("confidently resolved") DBLMSP and DBLMSP2 sequences, 31 had a fold-coverage >2 and an overlap <0.5 (bottom-right of plot), indicating putative duplication. Three of these overlapped with samples with evidence of gene conversion and were filtered out in that analysis (Fig 4 of the main text). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
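The two quantities plotted in S25 Fig (coverage ratio and interval overlap) can be sketched as below. This is a minimal illustration under stated assumptions: the population standard deviation is used (the legend does not say which estimator was applied), the coverage values are toy numbers, and all function names are hypothetical rather than the authors' implementation.

```python
from statistics import mean, pstdev


def coverage_interval(per_base_cov, n_std=2):
    """Return (mean - n_std * std, mean + n_std * std) for per-base coverage."""
    m = mean(per_base_cov)
    s = pstdev(per_base_cov)  # assumption: population standard deviation
    return (m - n_std * s, m + n_std * s)


def overlap_fraction(gene_iv, ref_iv):
    """Fraction of the gene interval that is covered by the reference interval."""
    width = gene_iv[1] - gene_iv[0]
    if width == 0:  # degenerate interval: a single point
        return 1.0 if ref_iv[0] <= gene_iv[0] <= ref_iv[1] else 0.0
    lo = max(gene_iv[0], ref_iv[0])
    hi = min(gene_iv[1], ref_iv[1])
    return max(0.0, hi - lo) / width


def is_putative_duplication(gene_cov, ama1_cov, min_ratio=2.0, max_overlap=0.5):
    """Flag a gene when fold-coverage > 2 and interval overlap < 0.5."""
    ratio = mean(gene_cov) / mean(ama1_cov)
    overlap = overlap_fraction(coverage_interval(gene_cov),
                               coverage_interval(ama1_cov))
    return ratio > min_ratio and overlap < max_overlap


# Toy per-base coverage values (hypothetical).
ama1 = [28, 30, 32, 30]
duplicated_gene = [68, 70, 72, 70]
flag_duplicated = is_putative_duplication(duplicated_gene, ama1)
flag_single_copy = is_putative_duplication(ama1, ama1)
```

Requiring both a high ratio and a low overlap guards against flagging samples whose mean coverage differs only because coverage is noisy within the gene.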
S26 Fig
Diversity and divergence levels in the DBL-spanning region (DSR) of DBLMSP1/2.
The first 2 panels show, for each of DBLMSP and DBLMSP2, the percent codon identity of 2,882 randomly chosen gene pairs, a measure of sequence diversity. The third panel shows the percent codon identity between DBLMSP and DBLMSP2 across all 2,882 samples where they were confidently resolved, a measure of sequence divergence. Between-gene divergence exceeds within-gene diversity (lower codon identity across genes than within genes). The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S27 Fig
Two SNPs identified in 1 genetic cross progeny.
In 1 progeny sample from genetic cross HB3xDd2, 2 SNPs were identified in DBLMSP1/2, one in each gene (panel a: DBLMSP2, panel b: DBLMSP). In both panels, the top track shows the parent gene sequence (HB3), and 4 subsequent tracks are shown below, each representing 1 different aligned sequence (grey horizontal bars). The first aligned sequence is the child sample gene sequence, showing a single SNP difference to the parent. To confirm these were spontaneous mutations and not single-base gene conversions from a homolog, 3 homologous sequences that could have been conversion donors were aligned to the parent: the orthologous sequence from the other cross parent (Dd2, second track), and the paralogous sequences from both parents (third and fourth tracks). No matches to these at the SNP positions can be seen. The data and code to generate this Figure can be found at https://zenodo.org/doi/10.5281/zenodo.7677547.
(TIF)
S1 Table
Characteristics of the tools used in our new genotyping pipeline.
Each tool's approach and main strengths are summarised. "Specific" refers to low false-positive rates in variant calling, and "Sensitive" to high true-positive rates.
(DOCX)
Roberts Roland G Senior Editor
30 Mar 2023
Dear Dr Letcher,
Thank you for submitting your manuscript entitled "Gene conversion drives allelic dimorphism in two paralogous surface antigens of the malaria parasite P. falciparum" for consideration as a Research Article by PLOS Biology.
Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.
However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.
Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. After your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://
If your manuscript has been previously peer-reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.
If you would like us to consider previous reviewer reports, please edit your cover letter to let us know and include the name of the journal where the work was previously considered and the manuscript ID it was given. In addition, please upload a response to the reviews as a 'Prior Peer Review' file type, which should include the reports in full and a point-by-point reply detailing how you have or plan to address the reviewers' concerns.
During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit
Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.
Kind regards,
Roli Roberts
Roland Roberts, PhD
Senior Editor
PLOS Biology
rroberts@plos.org
Roberts Roland G Senior Editor
18 May 2023
Dear Dr Letcher,
Thank you for your patience while your manuscript "Gene conversion drives allelic dimorphism in two paralogous surface antigens of the malaria parasite P. falciparum" was peer-reviewed at PLOS Biology. It has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.
You'll see that reviewer #1 is positive, but worries about availability of the sequence assemblies (please ensure that you are fully compliant with the PLOS data availability policy!), has semantic issues with the word "dimorphism" (and its relationship to paralogy – this also confused me first time round), and wants you to mark the mechanistic model as speculative and tone down the title. Reviewer #2 is also positive, but thinks that validation of the pipeline duplicates somewhat your previous paper and detracts from the main message (s/he suggests moving it to the supplement, leaving space for considering other stuff). Reviewer #3 is similarly positive, but wonders if you could have obtained the result simply by using the two haplotypes as two parallel reference samples, and asks about the timing with respect to the zoonotic jump from gorillas. The Academic Editor, in discussing these comments, said "R1 and R3 query the gorilla orthologs and R3 specifically asks for these to be included in the analysis which doesn't seem unreasonable in light of the authors' bottleneck hypothesis. I think this additional analysis would justify 'major revision'. R1 emphasises the usefulness of the method details but R2 wants them shifted to suppl. I think if the method has been published separately as R2 claims then the suppl is appropriate. I think providing the assemblies as requested by R1 is also appropriate."
In light of the reviews, which you will find at the end of this email, we would like to invite you to revise the work to thoroughly address the reviewers' reports.
Given the extent of revision needed, we cannot make a decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is likely to be sent for further evaluation by all or a subset of the reviewers.
We expect to receive your revised manuscript within 3 months. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension.
At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw it.
**IMPORTANT - SUBMITTING YOUR REVISION**
Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:
*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.
You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.
2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted" file type.
*Re-submission Checklist*
When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist
To submit a revised version of your manuscript, please go to https://
Please make sure to read the following important policies and guidelines while preparing your revision:
*Published Peer Review*
Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:
https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/
*PLOS Data Policy*
Please note that as a condition of publication PLOS' data policy (
*Blot and Gel Data Policy*
We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements
*Protocols deposition*
To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,
Roli Roberts
Roland Roberts, PhD
Senior Editor
PLOS Biology
rroberts@plos.org
------------------------------------
REVIEWERS' COMMENTS:
Reviewer #1:
This paper explores gene dimorphism in P. falciparum, by considering a pair of paralogous genes (DBLMSP and DBLMSP2) which were the subject of previous studies by some of the authors. Here, they propose a new genotyping pipeline, largely based on de novo assemblies, to obviate short-read mapping issues that occur in high-similarity paralogs. Using this method, they reconstruct the DBLMSP and DBLMSP2 sequences of several thousand samples included in the MalariaGEN dataset and, unsurprisingly, show that the pipeline is able to resolve these sequences better than the default GATK-based genotyping used by MalariaGEN. The authors then investigate sequence dimorphism in a domain of the two paralogs, and find that one of the "forms" is shared by the two genes. They also describe a number of recombination and conversion events, proposing a model for how dimorphism might have emerged as a result of population bottlenecks, speculating that this might have occurred when Pf ancestors jumped from gorilla to human hosts.
The paper is well-written and informative about the genetics of dimorphism, and the pipeline seems a valuable contribution. Generally, it would be of interest to those interested in parasite genetics, and suitable for publication. There are, however, some revisions, including some conceptual adjustments, that I think are necessary.
1. Major Point: The primary output data from this analysis (namely, the thousands of reconstructed nucleotide sequences of DBLMSP and DBLMSP2) are not made available as far as I can see. I believe these data will be very useful to the research community. I appreciate that the authors have gone to some effort to make the pipeline available for reproducibility, but realistically this dataset could only be replicated from scratch with the resources of a major northern institution (e.g. EMBL-EBI), to the disadvantage of researchers in malaria-endemic countries. Please make a downloadable dataset of these sequences available, labelling them as they are labelled in Figure 6 (e.g. "PA1234-C DBLMSP2").
2. Important point: I feel the nomenclature and concept need clarifying, in particular the concept of "dimorphism". The authors insist that these genes have two, and exactly two forms; and then proceed to show a whole variety of different forms, 28 recombination breakpoints, gene conversions. So these domains are not dimorphic at all in the strict sense of the word, they're highly polymorphic, although each SNP within the domain appears to be dimorphic. It may be more correct to say that, for each locus, ancestries coalesce to exactly two individuals (to simplify, perhaps one could say that there are two major lineages that frequently interact with each other). I think the authors should spend more effort in the intro and discussion to clarify this.
3. Important point: Related to the above, one problem with using the DBLMSP and DBLMSP2 gene pair is that you're convolving "dimorphism" with paralogy. I can see that this gene pair was a logical choice in terms of showing the advantages of your pipeline, whose strength is primarily to resolve similar paralog sequences. But it does bring you into this rather confusing space where you're essentially analyzing the two DBs as a single entity (some sort of "pseudo-diploid"), and you're in fact showing that they are actually trimorphic (three clades in your Figure 6). I think a lot of readers may lose the plot at this point (the story would be a lot simpler had you picked other dimorphic genes). I believe you need to think how to guide the reader through these analyses.
4. Important point: The proposed model in the discussion is speculative (albeit plausible) and needs to be clearly marked as such. I am afraid that this also makes the manuscript title inappropriate. Even if the model is correct, and an additional form was generated by gene conversion, there is no explanation of how this form persists at ~50% prevalence, so to say that the conversion "drives" dimorphism does not seem correct. Also, the paper does not say whether the duplication leading to the paralogs occurred before or after the species jump. Is there evidence of this paralogy in gorillas?
Reviewer #2:
Letcher et al have undertaken a study in which they have employed a variant detection pipeline, gramtools, which uses genome graphs to assess variants that are called with a range of different genotype callers and is less affected by divergent haplotypes than other callers. After extensive validation of this pipeline they have applied it to merozoite surface proteins in P. falciparum which appear to show highly diverged haplotypes, and are subject to frequent ectopic recombination.
The improved variant calls suggest haplotypes are shared between the two MSPs, and further analysis of recombination patterns and phylogenetics strongly suggest this is a case of gene conversion.
The work is extremely thorough, comprehensive and convincing, and the results are notable in the context of malaria evolution. I have little hesitation in recommending it for publication. If I have a criticism to make, it is that the validation of gramtools in the context of Pf takes up a large part of the paper and does not show significant novelty beyond Letcher 21 / Hunt 22. This detracts from the more interesting story of gene conversion between MSPs and the paper might be better served by placing this in the supplementary. This may leave space for other evolutionary implications of this result (assuming they are not being dealt with elsewhere) such as the biological effects of these haplotypes and whether metrics such as KaKs etc indicate differing selective pressures on both private and shared haplotypes.
General nitpicking below:
L113 - the range of tools appears to be cortex + octopus; these tools (and the reasons for choosing each) should be included
L127 / supplemental: Numbering and ordering of supplemental figures is either out of order or very confusing.
L181 / fig s-2.1: without context it is hard to know how high this is - how does it compare to other paralogues that are not involved in cytoadhesion?
L256: Analysis appears sound, but it took me a few reads to figure it out, so perhaps some clarification that this is an alignment between genes is needed.
L261 / Fig 5 do branch lengths derive from both the non-converted and converted regions? i.e. is there divergence between samples where gene conversion has occurred between DBs and if so how much?
L280: This figure is *very* complex and I'm not sure it adds much to the plot.
L318: This section feels like a response to a reviewer question, but IMO doesn't add a lot to the paper.
Reviewer #3:
The authors have analyzed sequence reads from the MalariaGEN project (Plasmodium falciparum genome sequences from many thousands of samples). They have used a new pipeline they devised (and published in Ref.28) to characterize previously overlooked divergent allele sequences of two surface antigens (DBLMSP and DBLMSP2). I would not claim to fully understand the pipeline that leads to improved resolution of these sequences, but I think that they present convincing evidence of the efficacy of this method (Fig. 2). They then go on to show that there is evidence of recombination (probable gene conversion) between divergent sequences, within and between the two nearby loci. I found this very interesting.
Major comments
1. Given that it is already known that these two loci are dimorphic, would a standard assembly approach, run twice (using the two forms of each locus as reference) work just as well to resolve these sequences?
2. The authors note that Plasmodium falciparum arose by transmission of a gorilla parasite via a tight bottleneck, and discuss a model (Fig. 7) for the subsequent evolution of dimorphic loci through gene conversion between paralogous loci. So I was surprised that they did not include the DBLMSP and DBLMSP2 sequences from the gorilla parasite in their analysis; Otto et al. (2018; Ref.17) presented three genome sequences from this species.
Minor comments
1. Line 32, and later: I think the authors should avoid using "Pf" to denote P. falciparum.
Similarly, line 98 and later, I think the authors should avoid using "DBs". Finally, line 144 and later, I think the authors should avoid using "MSA". Filling the paper with acronyms/abbreviations does not help its readability, and shortens the paper only marginally.
2. Figure 1. I did not find this helpful - I found it lacked sufficient detail/information to clarify anything beyond what is already stated in the text.
3. Figure 2a and line 360: With two amino acids at a site in different alleles, the heterozygosity can rise to a maximum value of 0.5, when the two are at equal frequencies. In Figure 2a (right panels), for both genes there are numerous sites with values greater than 0.5, seemingly indicating that there are more than two alleles. Is this due to a number of rare amino acids at these sites?
4. Line 179, and lines 558 onwards: could clarify whether overlapping 10-mers are considered. (For example, if the 10-mer at sites 101-110 is considered, are those at 102-111, 103-112, etc., also considered?)
5. Figure 3. I did not find the explanation at lines 207-209 clear. Perhaps giving examples would help. Am I correct to think that the motif LRWFREWST, found in both genes near the bottom right, is a counterexample to what is stated in lines 207-209?
6. Line 285: Please give the number of unique protein sequences at this point.
7. Fig.6: I assume that the "subform" mentioned at line 314 is clade A.1.
8. Line 443: explain MOI.
9. Line 461: 240kbp ? - which 240kbp ?
10. Line 463: "lied" should be "lay"
11. The numbering of the supplemental figures is odd.
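The bound raised in minor comment 3 can be checked numerically. The sketch below assumes the standard expected heterozygosity, H = 1 − Σ p_i², which may differ from the exact statistic used in the manuscript; the function name and the example frequencies are hypothetical. Under this definition, k equally frequent amino acids give the maximum H = 1 − 1/k, so values above 0.5 require at least three amino acids at a site.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) for allele counts or frequencies."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    # Normalise so that raw counts and frequencies are both accepted.
    return 1 - sum((f / total) ** 2 for f in freqs)


# Two amino acids at equal frequency: the two-allele maximum.
h_two_equal = expected_heterozygosity([0.5, 0.5])

# Two common amino acids plus a rare third one (hypothetical frequencies).
h_with_rare_third = expected_heterozygosity([0.45, 0.45, 0.10])

# Three amino acids at equal frequency: the three-allele maximum, 1 - 1/3.
h_three_equal = expected_heterozygosity([1, 1, 1])
```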
Attachment
Submitted filename: letcher_gene_conversion - Google Docs.pdf
25 Oct 2023
Attachment
Submitted filename: round_1_revision_response_to_reviewers.pdf
Roberts Roland G Senior Editor
18 Dec 2023
Dear Dr Letcher,
Thank you for your patience while we considered your revised manuscript "Evolution of deeply-diverged lineages in two paralogous cell-surface antigens of the malaria parasite P. falciparum" for publication as a Research Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors, the Academic Editor, and two of the original reviewers.
Based on the reviews, we are likely to accept this manuscript for publication, provided you satisfactorily address the remaining points raised by reviewer #3 and the following data and other policy-related requests.
IMPORTANT - please attend to the following:
a) We wonder whether you could make the Title more accessible and appealing? Maybe "Role for gene conversion in the allelic dimorphism of malaria parasite cell surface antigens" or "Evolution of deeply-diverged alleles of cell surface antigens of the malaria parasite" - happy to discuss this further by email.
b) Please address reviewer #3's remaining concerns.
c) Please address my Data Policy requests below; specifically, we need you to supply the numerical values underlying Figs 1AB, 2, 3AB, 4B, 5A, S3, S6AB, S7AB, S8AB, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S22, S23, either as a supplementary data file or as a permanent DOI'd deposition. I note that you already have an associated GitHub deposition and a frozen Zenodo version thereof. Please could you confirm whether the aforementioned Figure panels can be generated using the data and code supplied in your Zenodo deposition, and if not, supply the underlying numerical values.
d) Please cite the location of the data clearly in all relevant main and supplementary Figure legends, e.g. "The data underlying this Figure can be found in S1 Data" or "The data underlying this Figure can be found in https://zenodo.org/record/8171279".
As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.
We expect to receive your revised manuscript within two weeks.
To submit your revision, please go to https://
- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list
- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable, if not applicable please do not delete your existing 'Response to Reviewers' file.)
- a track-changes file indicating any changes that you have made to the manuscript.
NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:
https://journals.plos.org/plosbiology/s/supporting-information
*Published Peer Review History*
Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:
https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/
*Press*
Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.
*Protocols deposition*
To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols
Please do not hesitate to contact me should you have any questions.
Sincerely,
Roli Roberts
Roland Roberts, PhD
Senior Editor,
rroberts@plos.org,
PLOS Biology
------------------------------------------------------------------------
DATA POLICY:
You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction:
Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:
Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1AB, 2, 3AB, 4B, 5A, S3, S6AB, S7AB, S8AB, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S22, S23. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).
IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.
Please ensure that your Data Statement in the submission system accurately describes where your data can be found.
------------------------------------------------------------------------
DATA NOT SHOWN?
- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).
------------------------------------------------------------------------
REVIEWERS' COMMENTS:
Reviewer #1:
I have now reviewed the authors' replies and I am happy to accept the manuscript for publication.
Reviewer #3:
I have the revised version, and find it greatly improved. I have only a few comments, which you will find below.
PBIOLOGY-D-23-00695R2
LETCHER et al.
Line 169: DBLMSP2 was not found in P. billcollinsi, even though an assembled genome is available. Laverania retain strong synteny, so it should be possible to comment whether the gene is indeed missing. Do the more divergent Laverania species have DBLMSP2? Or was it created by a duplication event after the divergence of the ancestor of P. billcollinsi?
Line 176: It would be strange if DBLMSP2 has retained a conserved function only in P. falciparum. That would imply that in both P. reichenowi and P. praefalciparum (and P. billcollinsi, if the duplication happened earlier) the gene was independently inactivated subsequent to divergence from the ancestor with P. falciparum; it is not very parsimonious to suggest the gene was required all along the specific lineage leading to P. falciparum (through various host species), but not in any divergent lineages.
Figure 2: The P. praefalciparum alleles are not easy to see. But having found them, it then seems remarkable that one within the B lineage is nested within a radiation of P. falciparum alleles. How can that happen?
Lines 258-261: "The 209 samples ...... supporting gene conversion occurring within each genome" sounds like the authors are suggesting 209 conversion events, whereas they are really invoking only two major events (line 263).
Line 290: I do not understand "we note that paralogous sequences diverge faster than orthologous sequences"?
In general, at the interspecific level, that would not seem to make sense. Is this comment specific to intraspecific observations, reflecting limited recombination between paralogues?
Line 296: it is commented that the alleles in conversion cluster 1 do not cluster in the tree, with the excuse that the fraction of pasted sequence is lower. But the alleles in conversion cluster 2 do not cluster in the tree either.
Figure 5: For clarity: when I saw "Fraction of identical codons between paralogs" I first read this to mean the fraction of codons, within the converted region, that remain identical. Whereas I now believe this to be an indication of the length of the converted region.
Line 378: I think it should be "selected) in"
17 Jan 2024
Attachment
Submitted filename: round_2_revision_response_to_reviewers.pdf
Roberts Roland G Senior Editor
19 Jan 2024
Dear Dr Letcher,
Thank you for the submission of your revised Short Report "Role for gene conversion in the evolution of cell-surface antigens of the malaria parasite Plasmodium falciparum" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Michael Duffy, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.
IMPORTANT:
a) I've taken two liberties with your manuscript. First, I've edited the genus name into the Title. Second, I've changed the article type to Short Report, which we think is more appropriate for the nature of the study (I meant to ask you to do this at the previous decision, but neglected to do so). The paper is already quite concise, so no reformatting is required.
b) Many thanks for clarifying the situation regarding your Zenodo deposition. However, we need it to be cited in all relevant main and supplementary Figure legends, e.g. "The data and code needed to generate this Figure can be found in https://zenodo.org/records/XXXXXXX" (this may seem repetitive, but it serves to make the Figs more standalone. At a guess, the relevant Figs are Figs 1AB, 2, 3AB, 4B, 5A, S3, S6AB, S7AB, S8AB, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S25, S26). I have asked my colleagues to include this request with the aforementioned format-related requirements.
Please take a minute to log into Editorial Manager at
PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.
We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit
Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study.
Sincerely,
Roli Roberts
Roland G Roberts, PhD
Senior Editor
PLOS Biology
rroberts@plos.org
The authors thank Leah Roberts for reviewing the manuscript, Richard Pearson and Gavin Band for discussions of malaria genomics, and Richard Pearson for sharing MalariaGEN data ahead of the Pf7 release.
• CNV - copy-number variation
• DSR - DBL-spanning region
• EBA - erythrocyte binding antigen
• ML - maximum-likelihood
• MOI - multiplicity of infection
• MSP - merozoite surface protein
• RBC - red blood cell
By Brice Letcher; Sorina Maciuca and Zamin Iqbal