Recent advances and challenges in <i>de novo</i> genome assembly

Yingxue Yang; Wenjie Du; Yanchun Li; Jiawei Lei; Weihua Pan; Yingxue Yang; Wenjie Du; Yanchun Li; Jiawei Lei; Weihua Pan

doi:10.48130/gcomm-0025-0015

2025 Volume 2

Article Contents

Next Previous

MINI REVIEW Open Access

Recent advances and challenges in de novo genome assembly

1.
State Key Laboratory of Genome and Multi-omics Technologies, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
2.
Institute of Emerging Agricultural Technology, Shenzhen University of Advanced Technology, Shenzhen 518107, China
^# Authors contributed equally: Yingxue Yang, Wenjie Du, Yanchun Li

More Information

Corresponding author: panweihua@caas.cn

Received: 24 March 2025
Revised: 01 July 2025
Accepted: 14 July 2025
Published online: 29 July 2025
Genomics Communications 2, Article number: e014 (2025) | Cite this article

Abstract

De novo genome assembly has entered the telomere-to-telomere (T2T) era, driven by breakthroughs in high-quality long-read sequencing technologies (PacBio HiFi and ONT UL), and advanced assembly algorithms. This review comprehensively examines the current state and challenges of genome assembly across various genomic contexts. We highlight significant achievements, including the construction of T2T reference genomes for multiple eukaryotic species, and the development of pangenomes incorporating diverse individual assemblies. However, several critical challenges persist: (1) the assembly of ultra-long, highly similar tandem repeats, particularly in rDNA regions; (2) haplotype-resolved assembly of complex polyploid genomes, especially autopolyploids; and (3) complete metagenome assembly and high-resolution binning. These challenges are compounded by the high costs and time-intensive manual curation required for current assembly workflows. To address these limitations, we identify key areas for methodological innovation, including: improved alignment algorithms for repetitive sequences, AI-driven assembly graph analysis, and enhanced metagenomic Hi-C binning techniques. The integration of these advancements will be crucial for achieving cost-effective, efficient, and scalable assembly of complete and accurate genomes across diverse biological contexts, enabling large-scale population studies and advancing our understanding of genomic complexity.
- Tandem repeats,
- Polyploid genomes,
- Metagenomes,
- De novo genome assembly
Rights and permissions
Copyright: © 2025 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1]	Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860−921 doi: 10.1038/35057062 CrossRef Google Scholar
[2]	Thuronyi BW, Koblan LW, Levy JM, Yeh WH, Zheng C, et al. 2019. Continuous evolution of base editors with expanded target compatibility and improved activity. Nature Biotechnology 37:1070−79 doi: 10.1038/s41587-019-0193-0 CrossRef Google Scholar
[3]	Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, et al. 2020. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology 38:1044−53 doi: 10.1038/s41587-020-0503-6 CrossRef Google Scholar
[4]	Cheng H, Concepcion GT, Feng X, Zhang H, Li H. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18:170−75 doi: 10.1038/s41592-020-01056-5 CrossRef Google Scholar
[5]	Rautiainen M, Nurk S, Walenz BP, Logsdon GA, Porubsky D, et al. 2023. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nature Biotechnology 41:1474−82 doi: 10.1038/s41587-023-01662-6 CrossRef Google Scholar
[6]	Kolmogorov M, Yuan J, Lin Y, Pevzner PA. 2019. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37:540−46 doi: 10.1038/s41587-019-0072-8 CrossRef Google Scholar
[7]	Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, et al. 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research 27:722−36 doi: 10.1101/gr.215087.116 CrossRef Google Scholar
[8]	Yu W, Luo H, Yang J, Zhang S, Jiang H, et al. 2024. Comprehensive assessment of 11 de novo HiFi assemblers on complex eukaryotic genomes and metagenomes. Genome Research 34:326−40 doi: 10.1101/gr.278232.123 CrossRef Google Scholar
[9]	Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, et al. 2022. The complete sequence of a human genome. Science 376:44−53 doi: 10.1126/science.abj6987 CrossRef Google Scholar
[10]	Yang C, Zhou Y, Song Y, Wu D, Zeng Y, et al. 2023. The complete and fully-phased diploid genome of a male Han Chinese. Cell Research 33:745−61 doi: 10.1038/s41422-023-00849-5 CrossRef Google Scholar
[11]	He Y, Chu Y, Guo S, Hu J, Li R, et al. 2023. T2T-YAO: a telomere-to-telomere assembled diploid reference genome for Han Chinese. Genomics Proteomics & Bioinformatics 21:1085−100 doi: 10.1016/j.gpb.2023.08.001 CrossRef Google Scholar
[12]	Shang L, He W, Wang T, Yang Y, Xu Q, et al. 2023. A complete assembly of the rice Nipponbare reference genome. Molecular Plant 16:1232−36 doi: 10.1016/j.molp.2023.08.003 CrossRef Google Scholar
[13]	Chen J, Wang Z, Tan K, Huang W, Shi J, et al. 2023. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics 55:1221−31 doi: 10.1038/s41588-023-01419-6 CrossRef Google Scholar
[14]	Chen W, Wang X, Sun J, Wang X, Zhu Z, et al. 2024. Two telomere-to-telomere gapless genomes reveal insights into Capsicum evolution and capsaicinoid biosynthesis. Nature Communications 15:4295 doi: 10.1038/s41467-024-48643-0 CrossRef Google Scholar
[15]	Huang G, Bao Z, Feng L, Zhai J, Wendel JF, et al. 2024. A telomere-to-telomere cotton genome assembly reveals centromere evolution and a Mutator transposon-linked module regulating embryo development. Nature Genetics 56:1953−63 doi: 10.1038/s41588-024-01877-6 CrossRef Google Scholar
[16]	Cai Y, Gao X, Mao J, Liu Y, Tong L, et al. 2024. Genome sequencing of 'Fuji' apple clonal varieties reveals genetic mechanism of the spur-type morphology. Nature Communications 15:10082 doi: 10.1038/s41467-024-54428-2 CrossRef Google Scholar
[17]	Wu H, Luo LY, Zhang YH, Zhang CY, Huang JH, et al. 2024. Telomere-to-telomere genome assembly of a male goat reveals variants associated with cashmere traits. Nature Communications 15:10041 doi: 10.1038/s41467-024-54188-z CrossRef Google Scholar
[18]	Yoo D, Rhie A, Hebbar P, Antonacci F, Logsdon GA, et al. 2025. Complete sequencing of ape genomes. Nature 641:401−18 doi: 10.1038/s41586-025-08816-3 CrossRef Google Scholar
[19]	Wang L, Zhang M, Li M, Jiang X, Jiao W, et al. 2023. A telomere-to-telomere gap-free assembly of soybean genome. Molecular Plant 16:1711−14 doi: 10.1016/j.molp.2023.08.012 CrossRef Google Scholar
[20]	Liao WW, Asri M, Ebler J, Doerr D, Haukness M, et al. 2023. A draft human pangenome reference. Nature 617:312−24 doi: 10.1038/s41586-023-05896-x CrossRef Google Scholar
[21]	Li N, He Q, Wang J, Wang B, Zhao J, et al. 2023. Super-pangenome analyses highlight genomic diversity and structural variation across wild and cultivated tomato species. Nature Genetics 55:852−60 doi: 10.1038/s41588-023-01340-y CrossRef Google Scholar
[22]	Jiang YF, Wang S, Wang CL, Xu RH, Wang WW, et al. 2023. Pangenome obtained by long-read sequencing of 11 genomes reveal hidden functional structural variants in pigs. iScience 26:106119 doi: 10.1016/j.isci.2023.106119 CrossRef Google Scholar
[23]	Bian P, Li J, Zhou S, Wang X, Gong M, et al. 2024. A graph-based goat pangenome reveals structural variations involved in domestication and adaptation. Molecular Biology and Evolution 41:msae251 doi: 10.1093/molbev/msae251 CrossRef Google Scholar
[24]	Guo L, Wang X, Ayhan DH, Rhaman MS, Yan M, et al. 2025. Super pangenome of Vitis empowers identification of downy mildew resistance genes for grapevine improvement. Nature Genetics 57:741−53 doi: 10.1038/s41588-025-02111-7 CrossRef Google Scholar
[25]	Erwin GS, Gürsoy G, Al-Abri R, Suriyaprakash A, Dolzhenko E, et al. 2023. Recurrent repeat expansions in human cancer genomes. Nature 613:96−102 doi: 10.1038/s41586-022-05515-1 CrossRef Google Scholar
[26]	Burren OS, Dhindsa RS, Deevi SVV, Wen S, Nag A, et al. 2024. Genetic architecture of telomere length in 462,666 UK Biobank whole-genome sequences. Nature Genetics 56:1832−40 doi: 10.1038/s41588-024-01884-7 CrossRef Google Scholar
[27]	Mastrorosa FK, Rozanski AN, Harvey WT, Knuth J, Garcia G, et al. 2024. Complete chromosome 21 centromere sequences from a Down syndrome family reveal size asymmetry and differences in kinetochore attachment. bioRxiv Preprint doi: 10.1101/2024.02.25.581464 CrossRef Google Scholar
[28]	McGranahan N, Rosenthal R, Hiley CT, Rowan AJ, Watkins TBK, et al. 2017. Allele-specific HLA loss and immune escape in lung cancer evolution. Cell 171:1259−1271.e11 doi: 10.1016/j.cell.2017.10.001 CrossRef Google Scholar
[29]	Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094−100 doi: 10.1093/bioinformatics/bty191 CrossRef Google Scholar
[30]	Li H. 2021. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37:4572−74 doi: 10.1093/bioinformatics/btab705 CrossRef Google Scholar
[31]	Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. 2022. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19:705−10 doi: 10.1038/s41592-022-01457-8 CrossRef Google Scholar
[32]	Yang J, Zhao X, Jiang H, Yang Y, Hou Y, et al. 2023. RAfilter: an algorithm for detecting and filtering false-positive alignments in repetitive genomic regions. Horticulture Research 10:uhac288 doi: 10.1093/hr/uhac288 CrossRef Google Scholar
[33]	Xu D, Song Y, Zhao X, Gong D, Yang Y, et al. 2022. RAviz: a visualization tool for detecting false-positive alignments in repetitive genomic regions. Horticulture Research 9:uhac161 doi: 10.1093/hr/uhac161 CrossRef Google Scholar
[34]	Wen H, Pan W. 2023. Reference-guided automatic assembly of genomic tandem repeats with only HiFi and Hi-C data enables population-level analysis. bioRxiv Preprint doi: 10.1101/2023.12.07.570710 CrossRef Google Scholar
[35]	Wick RR, Schultz MB, Zobel J, Holt KE. 2015. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31:3350−52 doi: 10.1093/bioinformatics/btv383 CrossRef Google Scholar
[36]	Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, et al. 2016. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Systems 3:99−101 doi: 10.1016/j.cels.2015.07.012 CrossRef Google Scholar
[37]	Zhang X, Zhang S, Zhao Q, Ming R, Tang H. 2019. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nature Plants 5:833−45 doi: 10.1038/s41477-019-0487-8 CrossRef Google Scholar
[38]	Zeng X, Yi Z, Zhang X, Du Y, Li Y, et al. 2024. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nature Plants 10:1184−200 doi: 10.1038/s41477-024-01755-3 CrossRef Google Scholar
[39]	Vrček L, Bresson X, Laurent T, Schmitz M, Šikić M. 2022. Learning to untangle genome assembly with graph convolutional networks. ArXiv Preprint doi: 10.48550/arXiv.2206.00668 CrossRef Google Scholar
[40]	Bao Z, Li C, Li G, Wang P, Peng Z, et al. 2022. Genome architecture and tetrasomic inheritance of autotetraploid potato. Molecular Plant 15:1211−26 doi: 10.1016/j.molp.2022.06.009 CrossRef Google Scholar
[41]	Chen LX, Anantharaman K, Shaiber A, Eren AM, Banfield JF. 2020. Accurate and complete genomes from metagenomes. Genome Research 30:315−33 doi: 10.1101/gr.258640.119 CrossRef Google Scholar
[42]	Kolmogorov M, Bickhart DM, Behsaz B, Gurevich A, Rayko M, et al. 2020. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nature Methods 17:1103−10 doi: 10.1038/s41592-020-00971-x CrossRef Google Scholar
[43]	Bickhart DM, Kolmogorov M, Tseng E, Portik DM, Korobeynikov A, et al. 2022. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nature Biotechnology 40:711−19 doi: 10.1038/s41587-021-01130-z CrossRef Google Scholar
[44]	Feng X, Cheng H, Portik D, Li H. 2022. Metagenome assembly of high-fidelity long reads with hifiasm-meta. Nature Methods 19:671−74 doi: 10.1038/s41592-022-01478-3 CrossRef Google Scholar
[45]	Benoit G, Raguideau S, James R, Phillippy AM, Chikhi R, et al. 2024. High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology 42:1378−83 doi: 10.1038/s41587-023-01983-6 CrossRef Google Scholar
[46]	Teng W, Chen M, Chen S, Xia T, Zhou Y, et al. 2024. Inappropriate application of mapping algorithms results in length-dependent gene abundances in metagenomic analysis. Genomics Communications 1:e007 doi: 10.48130/gcomm-0024-0007 CrossRef Google Scholar
[47]	Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, et al. 2014. Binning metagenomic contigs by coverage and composition. Nature Methods 11:1144−46 doi: 10.1038/nmeth.3103 CrossRef Google Scholar
[48]	Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165 doi: 10.7717/peerj.1165 CrossRef Google Scholar
[49]	Wu YW, Tang YH, Tringe SG, Simmons BA, Singer SW. 2014. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2:26 doi: 10.1186/2049-2618-2-26 CrossRef Google Scholar
[50]	Marbouty M, Cournac A, Flot JF, Marie-Nelly H, Mozziconacci J, et al. 2014. Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms. eLife 3:e03318 doi: 10.7554/eLife.03318 CrossRef Google Scholar
[51]	Lämke J, Bäurle I. 2017. Epigenetic and chromatin-based mechanisms in environmental stress adaptation and stress memory in plants. Genome Biology 18:124 doi: 10.1186/s13059-017-1263-6 CrossRef Google Scholar
[52]	Lu YX, Yang JB, Li CY, Tian YH, Chang RR, et al. 2024. Efficient and easy-to-use capturing three-dimensional metagenome interactions with GutHi-C. iMeta 3:e227 Google Scholar
[53]	DeMaere MZ, Darling AE. 2019. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biology 20:46 doi: 10.1186/s13059-019-1643-1 CrossRef Google Scholar
[54]	Du Y, Sun F. 2022. HiCBin: binning metagenomic contigs and recovering metagenome-assembled genomes using Hi-C contact maps. Genome Biology 23:63 doi: 10.1186/s13059-022-02626-w CrossRef Google Scholar
[55]	Baudry L, Foutel-Rodier T, Thierry A, Koszul R, Marbouty M. 2019. MetaTOR: a computational pipeline to recover high-quality metagenomic bins from mammalian gut proximity-ligation (meta3C) libraries. Frontiers in Genetics 10:753 doi: 10.3389/fgene.2019.00753 CrossRef Google Scholar

About this article

Cite this article

Yang Y, Du W, Li Y, Lei J, Pan W. 2025. Recent advances and challenges in de novo genome assembly. Genomics Communications 2: e014 doi: 10.48130/gcomm-0025-0015

Yang Y, Du W, Li Y, Lei J, Pan W. 2025. Recent advances and challenges in de novo genome assembly. Genomics Communications 2: e014 doi: 10.48130/gcomm-0025-0015

Figures(2)

Download PDF

Article Metrics

Article views(5548) PDF downloads(2256)

Other Articles By Authors

on this site
on Google Scholar

HTML

Repeat assembly

The assembly of genomic repeat regions is a key focus of T2T assembly. Recent studies have revealed a strong association between repetitive genomic regions and various critical diseases^[25]. For example, differences in centromere length between parental chromosomes have been identified as a major contributing factor to chromosomal triploidization in offspring, such as in Down Syndrome^[26,27]. Additionally, approximately 40% of non-small cell lung cancers exhibit loss of heterozygosity (LOH) mutations in HLA genes, which are closely associated with immune evasion by tumor cells^[28]. As mentioned, in recent years, significant progress has been made in addressing this challenge. Currently, not only can dispersed repetitive sequences, such as transposons, be easily resolved, but in most cases, long tandem repeats, such as centromeres and telomeres, can be assembled with high quality. However, there are still some special types of ultra-long tandem repeat sequences that remain difficult to assemble. One of the most typical examples is the assembly of the rDNA region, which often spans several megabases. The sequences of individual copies are highly similar, with even a single copy of length 45 kb containing only a few SNPs. This makes it extremely challenging to differentiate between the copies. Currently, almost all T2T assemblies (including the human T2T-CHM13) have rDNA sequences generated by duplicating a consensus sequence to create pseudo-sequences, rather than being assembled as true sequences^[9]. From a methodological perspective, the core challenge in addressing this issue lies in ensuring that reads from repeat regions are not erroneously aligned to similar copies of other repeats. This is a limitation of the current mainstream alignment tools, such as mini-map2^[29,30], and winnowmap2^[31]. While alignment-filtering tools, such as RAfilter^[32] and RAviz^[33], are effective for removing false-positive alignments in normal tandem and dispersed repeats, they struggle to resolve highly similar copies, like those found in the rDNA regions.

Moreover, assembling repetitive sequences remains a significant challenge due to the high financial and time costs, severely limiting T2T-level assembly of large-scale pangenomes and ultra-large-size genomes. On one hand, existing assembly pipelines often require multiple data types, such as HiFi, ONT UL, and Hi-C, to achieve a more complete reconstruction of centromeres, telomeres, and other repetitive regions. For example, haplotype-resolved human genome assemblies generated using only HiFi and Hi-C data tend to lack most centromeric sequences. However, since ONT UL sequencing is relatively expensive, integrating all these data types can substantially increase overall costs. Recently, TRFill algorithm has been developed to fill large gaps in tandem repeat regions within chromosome-level assemblies constructed using only HiFi and Hi-C data, partially addressing the issue^[34]. However, the algorithm is currently limited to tandem repeats and does not apply to other types of repetitive regions, highlighting the need for further methodological innovations in this area. On the other hand, existing assembly pipelines often involve highly time-consuming manual curation, with the most labor-intensive step being manual gap filling. This process can take weeks or even months, accounting for the majority of the assembly workflow and significantly reducing overall efficiency. Addressing this challenge requires methodological innovations in two key areas. A temporary solution is to develop software tools that assist with manual curation, making the process more accessible, lowering the technical barrier, and improving efficiency. Several tools, such as Bandage^[35], Juicebox^[36], Gap-Aid, and Gap-Graph, have been developed to aid different steps of genome assembly, including contig assembly, scaffolding, and gap filling, demonstrating promising results. However, the ultimate solution lies in further advancing automated assembly algorithms to eliminate the need for manual curation entirely.

As a case study, we analyze the assembly process of the human haploid T2T genome (T2T-CHM13), which is considered the highest-quality T2T assembly among eukaryotic genomes^[20]. In this study, researchers combined approximately 30× coverage of PacBio HiFi sequencing with 120× coverage of ONT UL sequencing, and developed dedicated assembly and polishing methods. This approach enabled the successful assembly of highly repetitive centromeric satellite arrays and segmental duplications with high sequence homology. Specifically, a high-resolution string graph was constructed using PacBio HiFi reads to generate the initial assembly. Most components of this assembly originated from single chromosomes and exhibited near-linear structures, although gaps and mis-joins remained in some complex regions. Except for five rDNA regions, these complex gaps were resolved using ONT UL reads. Due to the limitations of string graphs in handling highly dynamic regions, the researchers ultimately constructed sparse de Bruijn graphs to assemble each rDNA array. The final T2T-CHM13 haploid assembly spans 3.05 Gb and achieves an estimated quality value (QV) of 73.9, representing a highly complete human haploid genome. However, it includes only 219 complete copies of the approximately 400 rDNA repeat units (accounting for 54.8% of the total), with the remaining sequences comprising partial or pseudo-sequences generated by duplicating consensus repeat units. This highlights that assembling rDNA sequences remains a major unresolved challenge in the field of genome assembly.

Metagenome assembly

Metagenome assembly is another major challenge. Compared to polyploid genomes, metagenomes exhibit even greater complexity and diversity, making assembly significantly more difficult. Before 2019, only a few dozen complete microbial genomes had been successfully assembled from metagenomic samples^[41]. The advent of HiFi sequencing has dramatically transformed this landscape. With HiFi data, a single sample can now yield dozens or even hundreds of complete or nearly complete microbial genomes using state-of-the-art assemblers such as metaFlye^[42], hifiasm-meta, and metaMDBG^[43−45]. As a result, the goal of metagenome assembly has shifted from merely recovering a useful subset of sequences to accurately assembling entire genomes (similar to single-genome assembly). Nevertheless, in assemblies generated from HiFi reads, most microbial genomes still exist in a fragmented form^[43]. Assigning these contigs to their respective genomes (binning) remains a critical and challenging problem, although it is somewhat easier than binning contigs derived from short reads^[46]. Traditional methods based on sequence-composition similarity and coverage similarity can only achieve coarse-grained binning^[47−49], which is insufficient for high-resolution analysis. As a result, leveraging metagenomic Hi-C signals for binning has emerged as the most promising solution. As early as 2015, metagenomic Hi-C had already been applied to binning, but it did not gain widespread attention^[50,51]. This was primarily due to two reasons: first, the complexity of metagenomic Hi-C sequencing made it challenging to generate high-quality data; second, it was difficult to obtain reliable Hi-C signals on the short contigs produced by short reads, limiting its applicability. In recent years, significant advancements in technology have greatly improved the proportion of valid read pairs (signals)^[52], making it highly promising for use with the high-quality, long contigs generated from HiFi reads. Existing metagenomic binning tools based on Hi-C data, such as bin3C^[53], HiCBin^[54], and MetaTOR^[55], can produce relatively accurate binning results. However, they still face challenges, such as mistakenly clustering sequences from multiple similar genomes together. Therefore, there is an urgent need for algorithmic innovations to address this issue and develop higher-resolution binning algorithms and tools.

As a case study, we examine the assembly of a high-depth sheep (Ovis aries) gut metagenomic dataset to evaluate the application of third-generation sequencing in metagenome assembly^[43]. In this study, researchers combined high-accuracy PacBio HiFi sequencing with Hi-C technology. Compared to conventional short-read (second-generation) sequencing, this approach significantly improved the continuity and accuracy of the assembly. Leveraging the integration of HiFi and Hi-C data, the study successfully assembled 428 high-quality metagenome-assembled genomes (MAGs) (with > 90% completeness and < 10% contamination), including 44 complete circular bacterial genomes. Furthermore, Feng et al. developed hifiasm-meta, a metagenome assembler specifically optimized for HiFi data^[44]. Without relying on any auxiliary data, hifiasm-meta successfully assembled 379 near-complete MAGs and 279 complete circular genomes from the same sheep (Ovis aries) gut metagenomic sample. However, despite these advances, the hifiasm-meta assembly still contained a large proportion of low-quality contigs (> 90%). In the final assembly graph, these contigs were entangled due to either local or global sequence similarity, preventing effective resolution. Current methods still struggle to fully separate and accurately assemble strains with high intra-species similarity or complex structural variations. This remains a major bottleneck in achieving truly high-quality metagenome assemblies.

{{lists.name}}

Recent advances and challenges in de novo genome assembly

Abstract