Identifying and removing haplotypic duplication in primary genome assemblies
Motivation
Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often causes assemblers to create two copies of a region rather than one, which breaks contiguity and compromises downstream steps such as gene annotation. Several tools have been developed to resolve this problem, but they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.
Results
Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
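To make the idea of combining read depth with sequence similarity concrete, here is a minimal sketch. It is not the purge_dups implementation. The `Contig` fields, the `classify` function, and the thresholds `depth_tol` and `min_cov` are all hypothetical and chosen only for illustration. The sketch assumes that a duplicated haplotype copy shows roughly half the diploid read depth and is largely covered by an alignment to another contig.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float    # mean read depth across the contig (hypothetical input)
    best_hit_cov: float  # fraction of the contig covered by its best alignment
                         # to another contig (hypothetical input)


def classify(contig, diploid_depth, depth_tol=0.25, min_cov=0.8):
    """Label a contig 'haplotig' or 'primary' under the assumptions above.

    depth_tol and min_cov are illustrative thresholds, not values used by
    purge_dups.
    """
    haploid_depth = diploid_depth / 2
    # Depth evidence: the contig sits near the haploid (half-depth) level.
    near_haploid = abs(contig.mean_depth - haploid_depth) <= depth_tol * haploid_depth
    # Similarity evidence: most of the contig aligns to another contig.
    if near_haploid and contig.best_hit_cov >= min_cov:
        return "haplotig"
    return "primary"


contigs = [
    Contig("ctg1", mean_depth=60.0, best_hit_cov=0.10),
    Contig("ctg2", mean_depth=31.0, best_hit_cov=0.95),
    Contig("ctg3", mean_depth=29.0, best_hit_cov=0.20),
]
labels = {c.name: classify(c, diploid_depth=60.0) for c in contigs}
print(labels)
```

Requiring both signals is the point of the sketch: a half-depth contig with no similar partner (ctg3) is kept, since low depth alone could have other causes.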
Availability and implementation
The source code is written in C and is available at https://github.com/dfguan/purge_dups.
Supplementary information
Supplementary data are available at Bioinformatics online.
Top-30
Journals
- Wellcome Open Research: 1313 publications, 55.54%
- Scientific Data: 136 publications, 5.75%
- G3: Genes, Genomes, Genetics: 51 publications, 2.16%
- Genome Biology and Evolution: 50 publications, 2.12%
- Journal of Heredity: 43 publications, 1.82%
- Nature Communications: 22 publications, 0.93%
- Molecular Ecology Resources: 21 publications, 0.89%
- BMC Genomics: 20 publications, 0.85%
- GigaScience: 20 publications, 0.85%
- DNA Research: 17 publications, 0.72%
- Molecular Ecology: 15 publications, 0.63%
- Molecular Biology and Evolution: 15 publications, 0.63%
- Communications Biology: 14 publications, 0.59%
- Horticulture Research: 13 publications, 0.55%
- Genome Biology: 12 publications, 0.51%
- BMC Biology: 12 publications, 0.51%
- bioRxiv: 12 publications, 0.51%
- Science Advances: 11 publications, 0.47%
- iScience: 10 publications, 0.42%
- Current Biology: 10 publications, 0.42%
- Plant Journal: 10 publications, 0.42%
- Open Research Europe: 10 publications, 0.42%
- BMC Genomic Data: 9 publications, 0.38%
- Frontiers in Genetics: 8 publications, 0.34%
- Science: 8 publications, 0.34%
- F1000Research: 8 publications, 0.34%
- Frontiers in Plant Science: 7 publications, 0.3%
- Scientific Reports: 7 publications, 0.3%
- Genome Research: 7 publications, 0.3%
Publishers
- F1000 Research: 1330 publications, 56.26%
- Springer Nature: 299 publications, 12.65%
- Cold Spring Harbor Laboratory: 236 publications, 9.98%
- Oxford University Press: 217 publications, 9.18%
- Wiley: 74 publications, 3.13%
- Elsevier: 64 publications, 2.71%
- Frontiers Media S.A.: 20 publications, 0.85%
- MDPI: 19 publications, 0.8%
- American Association for the Advancement of Science (AAAS): 19 publications, 0.8%
- Public Library of Science (PLoS): 12 publications, 0.51%
- The Royal Society: 6 publications, 0.25%
- American Society for Microbiology: 6 publications, 0.25%
- eLife Sciences Publications: 5 publications, 0.21%
- Proceedings of the National Academy of Sciences (PNAS): 5 publications, 0.21%
- GigaScience Press: 4 publications, 0.17%
- Research Square Platform LLC: 4 publications, 0.17%
- Scientific Societies: 2 publications, 0.08%
- Pensoft Publishers: 2 publications, 0.08%
- Maximum Academic Press: 2 publications, 0.08%
- The Company of Biologists: 2 publications, 0.08%
- Korean Society of Mycology: 1 publication, 0.04%
- American Physiological Society: 1 publication, 0.04%
- Institute of Cytology and Genetics SB RAS: 1 publication, 0.04%
- Association for Computing Machinery (ACM): 1 publication, 0.04%
- S. Karger AG: 1 publication, 0.04%
- Taylor & Francis: 1 publication, 0.04%
- We do not take into account publications without a DOI.
- Statistics recalculated weekly.