Annotation of plant genomes is still a challenging task due to the abundance of repetitive sequences, especially long terminal repeat (LTR) retrotransposons. LTR_FINDER is a widely used program for the identification of LTR retrotransposons but its application on large genomes is hindered by its single-threaded processes. Here we report an accessory program that allows parallel operation of LTR_FINDER, resulting in up to 8500X faster identification of LTR elements. It takes only 72 min to process the 14.5 Gb bread wheat (Triticum aestivum) genome in comparison to 1.16 years required by the original sequential version. LTR_FINDER_parallel is freely available at https://github.com/oushujun/LTR_FINDER_parallel.
Keywords: Genome annotation; Transposable element; LTR retrotransposon; LTR_FINDER
Transposable elements (TEs) are the most prevalent components in eukaryotic genomes. Among different TE classes, long terminal repeat (LTR) retrotransposons, including endogenous retroviruses (ERVs), are among the most repetitive TEs due to their high copy numbers and large element sizes [[
Annotation of LTR retrotransposons relies primarily on de novo approaches due to their highly diverse terminal repeats. For this purpose, many computational programs have been developed in the past two decades. LTR_FINDER is one of the most popular LTR search engines [[
We hypothesized that complete sequences of highly complex genomes may contain a large number of complicated nested structures that exponentially increase the search space. To break down these complicated sequence structures, we split chromosomal sequences into relatively short segments (1 Mb) and execute LTR_FINDER on these segments in parallel. With this design, we expect the time complexity of LTR_FINDER_parallel to be O(n). For highly complicated regions (e.g., centromeres), a single segment could still take a rather long time (e.g., hours). To avoid extended run times in such regions, we used a timeout scheme (300 s) to cap the time a child process can run. If a segment times out, the 1 Mb segment is further split into 50 Kb segments to salvage LTR candidates. After all segments are processed, the regional coordinates of LTR candidates are converted back to genome-level coordinates for the convenience of downstream analyses.
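The segmentation, timeout fallback, and coordinate conversion described above can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation: the function names (`split_segments`, `to_genome_coords`, `search_with_fallback`) and the `search` callable are hypothetical, the segments are processed sequentially rather than in parallel, and a timeout is assumed to surface as a `TimeoutError` raised by `search`. The 1 Mb and 50 Kb sizes come from the text.

```python
from typing import Callable, Iterator, List, Tuple

Hit = Tuple[int, int]  # 1-based inclusive (start, end) of an LTR candidate


def split_segments(seq_len: int, size: int, offset: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield (genome_offset, length) for consecutive windows covering seq_len bases."""
    for start in range(0, seq_len, size):
        yield offset + start, min(size, seq_len - start)


def to_genome_coords(offset: int, hit: Hit) -> Hit:
    """Shift segment-local coordinates by the 0-based offset of the segment."""
    return hit[0] + offset, hit[1] + offset


def search_with_fallback(
    seq: str,
    search: Callable[[str], List[Hit]],  # hypothetical wrapper around the search engine
    big: int = 1_000_000,                # 1 Mb segments (from the text)
    small: int = 50_000,                 # 50 Kb fallback segments (from the text)
) -> List[Hit]:
    """Search each segment; on timeout, retry the same region in smaller pieces."""
    hits: List[Hit] = []
    for off, length in split_segments(len(seq), big):
        try:
            local = search(seq[off:off + length])
            hits.extend(to_genome_coords(off, h) for h in local)
        except TimeoutError:
            # Assumption: `search` raises TimeoutError once its time budget
            # (300 s in the text) is exceeded.
            for sub_off, sub_len in split_segments(length, small, offset=off):
                try:
                    local = search(seq[sub_off:sub_off + sub_len])
                except TimeoutError:
                    continue  # give up on this small piece
                hits.extend(to_genome_coords(sub_off, h) for h in local)
    return hits
```

In a real pipeline the `search` callable would launch the search engine as a child process with a time limit, and the outer loop would be dispatched to a pool of workers; both are omitted here to keep the sketch self-contained.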
LTR_FINDER_parallel is a Perl program that can be downloaded and run directly, without any form of installation. We used the original LTR_FINDER as the search engine, which is distributed as a binary and is also installation-free. Based on our previous study [[
To benchmark the performance of LTR_FINDER_parallel, we selected four plant genomes with sizes varying from 120 Mb to 14.5 Gb, which are Arabidopsis thaliana (version TAIR10) [[
Using our method, we observed a 5X to 8500X increase in speed for plant genomes of varying sizes (Table 1). For the 14.5 Gb bread wheat genome, the original LTR_FINDER took 10,169 h, or 1.16 years, to complete, while the multithreaded version completed in 72 min on a modern server with 36 threads, demonstrating an 8500X increase in speed (Table 1). Even when we analyzed each wheat chromosome separately, the original LTR_FINDER still took 20 days on average to complete. Among the genomes we tested, the parallel version of LTR_FINDER produced slightly different numbers of LTR candidates when compared to those generated using the original version (0–2.73%; Table 1), which is likely due to the use of the dynamic task control approach for processing of heavily nested regions. By filtering out LTR candidates in the rice genome with LTR_retriever [[
Table 1. Benchmarking the performance of LTR_FINDER_parallel

| Genome | Arabidopsis | Rice | Maize | Wheat |
|---|---|---|---|---|
| Version | TAIR10 | MSU7 | AGPv4 | CS1.0 |
| Size | 119.7 Mb | 374.5 Mb | 2134.4 Mb | 14,547.3 Mb |
| Original memory (1 thread^a) | 0.37 Gbyte | 0.55 Gbyte | 5.00 Gbyte | 11.88 Gbyte^b |
| Parallel memory (36 threads^a) | 0.10 Gbyte | 0.12 Gbyte | 0.82 Gbyte | 17.67 Gbyte |
| Original time (1 thread) | 0.58 h | 2.1 h | 448.5 h | 10,169.3 h^b |
| Parallel time (36 threads) | 6.4 min | 2.6 min | 10.3 min | 71.8 min |
| Speed up | 5.4 X | 48.5 X | 2613 X | 8498 X |
| # of LTR candidates (1 thread) | 226 | 2851 | 60,165 | 231,043 |
| # of LTR candidates (36 threads) | 226 | 2834 | 59,658 | 237,352 |
| % difference in candidate # | 0.00% | 0.60% | 0.84% | −2.73% |
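The derived rows of Table 1 (speed up and % difference in candidate number) follow from the raw rows by simple arithmetic. The short Python sketch below recomputes them as a consistency check. The helper names `speedup` and `pct_diff` are illustrative only, and the formulas are inferred from the table values rather than stated in the text.

```python
# Raw values copied from Table 1:
# (original time in h, parallel time in min, candidates with 1 thread, candidates with 36 threads)
TABLE1 = {
    "Arabidopsis": (0.58, 6.4, 226, 226),
    "Rice": (2.1, 2.6, 2851, 2834),
    "Maize": (448.5, 10.3, 60165, 59658),
    "Wheat": (10169.3, 71.8, 231043, 237352),
}


def speedup(orig_hours: float, parallel_minutes: float) -> float:
    """Ratio of original to parallel wall time, after converting hours to minutes."""
    return orig_hours * 60 / parallel_minutes


def pct_diff(orig_n: int, parallel_n: int) -> float:
    """Difference in candidate count as a percentage of the original count."""
    return (orig_n - parallel_n) / orig_n * 100


for genome, (t_orig, t_par, n_orig, n_par) in TABLE1.items():
    print(f"{genome}: {speedup(t_orig, t_par):.1f}X, {pct_diff(n_orig, n_par):.2f}%")
```

A negative percentage, as for wheat, means the parallel run reported more candidates than the original run.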
This study was supported by the National Science Foundation (IOS-1740874 to N.J.) and by the United States Department of Agriculture National Institute of Food and Agriculture and AgBioResearch at Michigan State University (Hatch grant MICL02408 to N.J.).
We wish to acknowledge Matthew Hufford (Iowa State University) and Candice Hirsch (University of Minnesota) for helpful feedback on a previous version of this manuscript.
SO and NJ conceived this study. SO developed the code and analyzed the genomes. SO and NJ wrote and revised the manuscript. All authors read and approved the final manuscript.
LTR_FINDER_parallel is freely available at https://github.com/oushujun/LTR_FINDER_parallel.
Not applicable.
Not applicable.
The authors declare that they have no competing interests.
• ERV: Endogenous retrovirus
• LTR retrotransposon: Long terminal repeat retrotransposon
• TE: Transposable element
By Shujun Ou and Ning Jiang