Open Access
Volume 36, issue 9, pages 2896-2898

Identifying and removing haplotypic duplication in primary genome assemblies

Publication type: Journal Article
Publication date: 2020-01-23
SCImago: Q1
WoS: Q1
SJR: 2.451
CiteScore: 9.6
Impact factor: 5.4
ISSN: 1367-4803, 1367-4811, 1460-2059
Biochemistry
Computer Science Applications
Molecular Biology
Statistics and Probability
Computational Mathematics
Computational Theory and Mathematics
Abstract
Motivation

Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results

Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.
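
As a rough illustration of how sequence similarity and read depth might be combined, the sketch below flags a contig as a candidate duplicate when most of its length aligns to a longer contig and its mean read depth is close to half the diploid depth. This is not the purge_dups implementation (which is written in C); the `Contig` class, the function name `flag_duplicate_candidates`, the input layout and both thresholds are hypothetical choices made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    length: int        # contig length in bases
    mean_depth: float  # mean read depth across the contig


def flag_duplicate_candidates(contigs, aligned_bases, diploid_depth,
                              depth_tolerance=0.25, min_covered=0.8):
    """Return names of contigs that look like duplicated haplotypes.

    aligned_bases maps (query_name, target_name) to the number of query
    bases aligned to the target. The thresholds are illustrative
    assumptions, not values taken from purge_dups.
    """
    half = diploid_depth / 2.0
    low, high = half * (1 - depth_tolerance), half * (1 + depth_tolerance)
    by_name = {c.name: c for c in contigs}
    flagged = []
    for (query, target), n_aligned in aligned_bases.items():
        q, t = by_name[query], by_name[target]
        if q.length > t.length:
            continue  # only the shorter contig of a pair is a candidate
        if not (low <= q.mean_depth <= high):
            continue  # depth is not near the haploid level
        if n_aligned / q.length >= min_covered and query not in flagged:
            flagged.append(query)
    return flagged


# Toy data: ctgB is short, at about half depth, and 95% aligned to ctgA.
contigs = [
    Contig("ctgA", 1000, 30.0),
    Contig("ctgB", 400, 16.0),
    Contig("ctgC", 500, 31.0),
]
aligned_bases = {
    ("ctgB", "ctgA"): 380,
    ("ctgC", "ctgA"): 450,
    ("ctgA", "ctgB"): 380,
}
print(flag_duplicate_candidates(contigs, aligned_bases, diploid_depth=30.0))
# prints ['ctgB']
```

In this toy input, ctgC is also well covered by an alignment to ctgA but sits at full diploid depth, so it is left alone; using depth alongside similarity is what separates the two cases here.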

Availability and implementation

The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Supplementary information

Supplementary data are available at Bioinformatics online.

Top-30

Journals

Wellcome Open Research
1313 publications, 55.54%
Scientific data
136 publications, 5.75%
G3: Genes, Genomes, Genetics
51 publications, 2.16%
Genome Biology and Evolution
50 publications, 2.12%
Journal of Heredity
43 publications, 1.82%
Nature Communications
22 publications, 0.93%
Molecular Ecology Resources
21 publications, 0.89%
BMC Genomics
20 publications, 0.85%
GigaScience
20 publications, 0.85%
DNA Research
17 publications, 0.72%
Molecular Ecology
15 publications, 0.63%
Molecular Biology and Evolution
15 publications, 0.63%
Communications Biology
14 publications, 0.59%
Horticulture Research
13 publications, 0.55%
Genome Biology
12 publications, 0.51%
BMC Biology
12 publications, 0.51%
bioRxiv
12 publications, 0.51%
Science advances
11 publications, 0.47%
iScience
10 publications, 0.42%
Current Biology
10 publications, 0.42%
Plant Journal
10 publications, 0.42%
Open Research Europe
10 publications, 0.42%
BMC Genomic Data
9 publications, 0.38%
Frontiers in Genetics
8 publications, 0.34%
Science
8 publications, 0.34%
F1000Research
8 publications, 0.34%
Frontiers in Plant Science
7 publications, 0.3%
Scientific Reports
7 publications, 0.3%
Genome Research
7 publications, 0.3%

Publishers

F1000 Research
1330 publications, 56.26%
Springer Nature
299 publications, 12.65%
Cold Spring Harbor Laboratory
236 publications, 9.98%
Oxford University Press
217 publications, 9.18%
Wiley
74 publications, 3.13%
Elsevier
64 publications, 2.71%
Frontiers Media S.A.
20 publications, 0.85%
MDPI
19 publications, 0.8%
American Association for the Advancement of Science (AAAS)
19 publications, 0.8%
Public Library of Science (PLoS)
12 publications, 0.51%
The Royal Society
6 publications, 0.25%
American Society for Microbiology
6 publications, 0.25%
eLife Sciences Publications
5 publications, 0.21%
Proceedings of the National Academy of Sciences (PNAS)
5 publications, 0.21%
GigaScience Press
4 publications, 0.17%
Research Square Platform LLC
4 publications, 0.17%
Scientific Societies
2 publications, 0.08%
Pensoft Publishers
2 publications, 0.08%
Maximum Academic Press
2 publications, 0.08%
The Company of Biologists
2 publications, 0.08%
Korean Society of Mycology
1 publication, 0.04%
American Physiological Society
1 publication, 0.04%
Institute of Cytology and Genetics SB RAS
1 publication, 0.04%
Association for Computing Machinery (ACM)
1 publication, 0.04%
S. Karger AG
1 publication, 0.04%
Taylor & Francis
1 publication, 0.04%
  • We do not take into account publications without a DOI.
  • Statistics recalculated weekly.

GOST
Guan D. et al. Identifying and removing haplotypic duplication in primary genome assemblies // Bioinformatics. 2020. Vol. 36. No. 9. pp. 2896-2898.
GOST (all authors)
Guan D., McCarthy S. A., Howe K., Durbin R. L., Wood J., Wang Y. Identifying and removing haplotypic duplication in primary genome assemblies // Bioinformatics. 2020. Vol. 36. No. 9. pp. 2896-2898.
RIS
TY - JOUR
DO - 10.1093/bioinformatics/btaa025
UR - https://doi.org/10.1093/bioinformatics/btaa025
TI - Identifying and removing haplotypic duplication in primary genome assemblies
T2 - Bioinformatics
AU - Guan, Dengfeng
AU - McCarthy, Shane A.
AU - Howe, Kerstin
AU - Durbin, Richard L.
AU - Wood, Jonathan
AU - Wang, Yadong
PY - 2020
DA - 2020/01/23
PB - Oxford University Press
SP - 2896-2898
IS - 9
VL - 36
PMID - 31971576
SN - 1367-4803
SN - 1367-4811
SN - 1460-2059
ER -
BibTeX
@article{2020_Guan,
author = {Dengfeng Guan and Shane A. McCarthy and Kerstin Howe and Richard L. Durbin and Jonathan Wood and Yadong Wang},
title = {Identifying and removing haplotypic duplication in primary genome assemblies},
journal = {Bioinformatics},
year = {2020},
volume = {36},
publisher = {Oxford University Press},
month = {jan},
url = {https://doi.org/10.1093/bioinformatics/btaa025},
number = {9},
pages = {2896--2898},
doi = {10.1093/bioinformatics/btaa025}
}
MLA
Guan, Dengfeng, et al. “Identifying and removing haplotypic duplication in primary genome assemblies.” Bioinformatics, vol. 36, no. 9, Jan. 2020, pp. 2896-2898. https://doi.org/10.1093/bioinformatics/btaa025.