Can long-read sequencing tackle the barriers that next-generation sequencing could not? A review

The large-scale heterogeneity of genetic diseases necessitates deeper examination of nucleotide sequence alterations, enhancing the discovery of new targeted drug attack points. The appearance of new sequencing techniques was essential for obtaining more interpretable genomic data. In contrast to previous short reads, longer read lengths provide better insight into potentially health-threatening genetic abnormalities. Long reads offer more accurate variant identification and genome assembly, marking advances in the study of nucleotide alterations. In this review, we introduce the historical background of sequencing technologies and show their benefits and limits as well. Furthermore, we highlight the differences between short- and long-read approaches, including their unique advances and difficulties in methodology and evaluation. Additionally, we provide a detailed description of the corresponding bioinformatics and the current applications.


Introduction
The complete genetic information of organisms is stored and transferred in single- and double-stranded ribonucleic (RNA) and deoxyribonucleic (DNA) acids [1]. The mystery behind rare genetic conditions, like chromosomal irregularities or unique sequence variation and mutation profiles in cancer, induced the need for molecular examination at deeper levels. For many years, only a deficient tool set was available to gain better insight into the genetic attributes of genomes. This encouraged the development of novel technologies, such as RNA and DNA sequencing methods. Driven by the technical and computational progress of the past 50 years, the features of sequence determination changed and evolved. In the early periods, only a few hundred bases were reachable in length; however, the emergence of long-read technologies allowed the reading of longer genomic sequences, even thousands of kilobases in length.
The timeline of the evolution of sequencing techniques can be divided into three main parts: first-generation (FGS), next-generation (NGS), and third-generation (TGS) sequencing. Before short-read NGS approaches became available, FGS techniques were the only tools capable of describing the nucleic acid sequence of different organisms. Thus, their main legacy is that they emphasized the need to develop novel sequencing methods to gain deeper knowledge regarding DNA and RNA sequences with repetitive regions, alternative bases, splicing variants, and telomeric regions. Later, NGS and mainly TGS methods were capable of opening closed doors for the detection of the listed alteration types, thereby exploring many causes of (and also curative solutions for) diseases. In our review, we strive to show FGS techniques from this point of view, without explaining their applications and attributes in more detail. In this scope, FGS methods were the pioneers of sequencing around the 1980s, including Sanger's chain-termination and Maxam-Gilbert's chemical modification-based assays. In the early times, these technologies allowed focusing on relatively small genomes with a few hundred base pairs (bp) in length [2]. Sanger's idea was to sequence the DNA strand by chain termination. Consequently, in this case, the DNA fragments were extended by DNA polymerases through the incorporation of nucleotides until a chain-terminating nucleotide was added [3]. Maxam and Gilbert provided a process during which the sequences of DNA fragments were determined using the combination of radiolabeling, chemical cleavage, and gel electrophoresis of nucleotides, with autoradiography serving as the detection method [4].
The second generation, namely NGS, includes pyrosequencing [5] and sequencing-by-synthesis [6] approaches. They have a feature in common: a DNA polymerase moves along the template DNA, and sequencing is performed by catalyzing the incorporation of deoxynucleotide triphosphates (dNTPs) into a new complementary DNA strand [7]. Pyrosequencing is a synthesis-based form in which a pyrophosphate is released when dNTPs are sequentially added to the end of a nascent DNA fragment [8]. Sequencing-by-synthesis reconstructs a nucleic acid chain from the emission spectra of fluorescently labeled nucleotides [6].
Although NGS provides more acceptable error rates and more sophisticated sequencing results than FGS, it has several weaknesses that should be mentioned. Read lengths are shorter than demanded, which is why these are referred to as short-read techniques nowadays. Consequently, their shortness limits the study of full-length transcript variants, centromeric and telomeric genomic regions, and gene fusions [9]. Additionally, they are unable to resolve repetitive regions of the genome, making genetic variations, including repeat expansion disorders and structural variants, challenging to identify [10]. Extreme guanine-cytosine (GC) content, sequences with multiple homologous elements in the genome, and the epigenetically modified bases of DNA and RNA, like N6-methyladenosine (6mA), 5-methylcytosine (5mC), and 5-hydroxymethylcytosine (5hmC), are challenging to characterize with NGS [11]. PCR amplification is essential, which results in higher costs and longer times in the overall sequencing and evaluation process, involves the use of large equipment and laborious experimental procedures, and expands the bioinformatics analysis with a data preprocessing step. To overcome these limitations, further sequencing techniques have been developed: the representatives of the TGS family, often referred to as long-read sequencing methods [12,13].
Many scientific papers describing the methodology, evaluation, and use of different sequencing assays become available yearly. Currently, long-read TGS and short-read NGS methods are used problem-specifically, either interchangeably or in combination. Although both methods have their own advantages and disadvantages, reviews contrasting the two are scarce among currently available publications. Encouraged by this, our goal is to provide a general comparison of long- and short-read techniques. In the present paper, we aimed to review the development of sequencing assays, presenting brief characteristics of FGS, NGS, and TGS, with special emphasis on the possibilities offered by TGS methods. We also detail the bioinformatics approaches along with the aspects considered during evaluation, as well as related clinical and biological applications.

Long-read sequencing
TGS provides more precise mapping of reads to reference genomes, promotes different variant detection methods, and offers new solutions for characterizing epigenetic diversity [14]. In contrast to NGS systems, the generated data are analyzed in real time, and PCR amplification steps are generally not required before sequencing, because natively isolated nucleic acid strands can be read as well. The longer sequenced reads are the consequence of improved sequencing chemistries [15,16]. The increased sequencing speed and accuracy during experiments and the higher-quality bioinformatics results also mark the effectiveness of the newly emerged technologies and the associated chemistry kits [17].
TGS technologies hold the potential to emerge as long-term applicable tools in the future. As they provide long-read sequencing of whole genomes, their usage in the field of genomics brings the chance of an increasingly accurate description of both human and non-human genetic diversity. Furthermore, improvements aimed at decreasing costs and analysis time could enable their application in routine diagnostics.

Nanopore sequencing
The nanopore sequencing (NS) method, distributed by Oxford Nanopore Technologies (ONT), is based on the detection of changes in the electric current flowing through nanopore proteins [16-19]. The alterations in the electric current produced in real time can be measured directly. During NS, double-stranded DNA (dsDNA) molecules are denatured, and a motor protein directs the single-stranded DNA (ssDNA) molecules through the nanochannels (pores) one after the other. The passage of an ssDNA molecule disturbs the electric current, which is detected by specific sensors. The deflections are distinct for all nucleotides, resulting in unique signatures for each base. The entire process happens inside a device-specific flow cell [20], which contains thousands of nanopore channels. The schematics of NS are presented in Figure 1A.
Since the release of the first ONT sequencing device, named MinION, in the mid-2010s, the continuous improvement of key factors like accuracy, read length, and sequencing throughput has been ongoing. The throughput is determined by the number of active pores on the flow cell and by the DNA/RNA translocation speed. To provide the maximal number of available active pores on the flow cells, their periodic revision is ensured [16,21]. The read length and the accuracy are highly dependent on the released version and quality of the sequencing chemistry, which in this case includes the traits of the nanopores and motor proteins; however, by introducing special adapters during translocation, an increase in accuracy can also be reached at higher sequencing speeds of ~420 bases per second, compared to the previous ~70 bases per second rate [22].
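The relationship between pore count, translocation speed, and throughput described above can be sketched as a back-of-the-envelope calculation. All numbers below (pore count, duty cycle) are illustrative assumptions, not ONT device specifications; only the ~70 and ~420 bases-per-second figures come from the text.

```python
# Rough nanopore throughput estimate: pores x speed x time.
# All values are illustrative assumptions, not ONT specifications.

def estimated_yield_bases(active_pores: int, bases_per_second: float,
                          run_hours: float, duty_cycle: float = 0.7) -> float:
    """Approximate total sequenced bases for one flow cell run.

    duty_cycle crudely models pores being idle between reads.
    """
    return active_pores * bases_per_second * run_hours * 3600 * duty_cycle

# A hypothetical flow cell with 1,500 active pores over a 24 h run:
old = estimated_yield_bases(1500, 70, 24)    # older ~70 bases/s chemistry
new = estimated_yield_bases(1500, 420, 24)   # newer ~420 bases/s chemistry
print(f"old chemistry: {old / 1e9:.1f} Gb, new chemistry: {new / 1e9:.1f} Gb")
```

The sixfold speed increase translates directly into a sixfold yield increase under this simplified model.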
NS reads are characterized by lengths from 10 kb up to 100 kb, which means more sequenced bases, more generated data, and increased computational resource needs compared to NGS. The large amount of data entails longer bioinformatics analysis time and more expensive hardware. However, due to the increased amount of information, a more accurate identification of alterations becomes available. As the most important disadvantage, higher error rates and read misclassification can be experienced on ONT platforms compared to NGS [23,24].

SMRT sequencing
Pacific Biosciences provided the first nanosensor-based technology in the early 2010s, relying on the single-molecule real-time (SMRT) sequencing model [25]. The key factor in this method is the detection of light emission when the DNA polymerase incorporates a nucleotide [26]. In more detail, SMRT sequencing is done by immobilizing the DNA polymerase in each well of a special silicon chip (SMRT Cell), with the DNA acting as the mobile molecule. DNA templates are presented as closed, circular molecules named SMRTbells, which are created by ligating hairpin adaptors to both ends of a target double-stranded DNA (dsDNA). The SMRT Cells contain four fluorescently labeled nucleotides with unique emission spectra. Zero-mode waveguides (ZMWs) are optical waveguides developed for rapid light sensing and provide the interface for the detection of light emitted upon the incorporation of the phosphate-labeled dNTPs [27,28]. The process of SMRT sequencing is illustrated in Figure 1B.
Compared to NGS, the precision of SMRT sequencing is lower, for example due to inaccuracies during base identification. However, in exchange for the higher error rates and costs per base, the technology grants a several-orders-of-magnitude increase in read length (up to a few Mbp, in contrast to the previous few hundred bp) and faster sequencing runs. The trade-off between the advantages and disadvantages of NGS and SMRT sequencing suggests the consideration of hybrid sequencing solutions in the future. Hybrid approaches combine different sequencing methods and are promising for overcoming their individual deficiencies [29].

Technical advances and difficulties of long reads
Following a brief historical and methodological overview of long-read approaches, we detail the technical background and give comprehensive knowledge regarding sequencing challenges and advances. Compared to NGS approaches, the main difficulties of TGS are the overall lower per-read accuracy and poorer read quality [30]. In contrast to short reads, long reads are much noisier. Prolonged lengths induce an increase in the number of bases and in reading time; both contribute to a higher probability of collecting false information, producing more noise and uncertainty [31]. The continuous variation of read length during a single run also increases the chance of inaccuracies. Due to the reasons listed above, the proper handling of errors cannot be emphasized enough, and problem-focused improvements are published continuously [32]. Although in the early times base-calling accuracy was around 85% (corresponding to an error rate of nearly 15%), nowadays almost 99% (SMRT) and 95% (NS) can be reached [31,33]. Error correction methods [34,35] provide a solution to resolve the inaccuracies and are divided into two groups: hybrid and non-hybrid approaches [36]. Hybrid methods take advantage of the high accuracy of short reads for correcting errors in the long reads, while non-hybrid methods perform self-correction with long reads using overlap information. The effectiveness of error correction methods is highly dependent on the sequencing coverage [36], that is, on the average number of times each base or locus of the genome has been sequenced.
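The self-correction idea behind non-hybrid methods can be illustrated with a toy majority vote. This sketch assumes the overlapping reads are already aligned to the same window, which sidesteps the hard parts (overlap detection, indel handling) that real correctors must solve:

```python
from collections import Counter

def self_correct(reads: list[str]) -> str:
    """Toy non-hybrid error correction: per-column majority vote over
    reads assumed to be pre-aligned to the same genomic window.
    Real tools must first compute overlaps and handle indels."""
    length = min(len(r) for r in reads)
    consensus = []
    for i in range(length):
        column = Counter(r[i] for r in reads)
        consensus.append(column.most_common(1)[0][0])
    return "".join(consensus)

reads = ["ACGTACGT", "ACGAACGT", "ACGTACCT"]  # each read carries one error
print(self_correct(reads))  # majority vote recovers ACGTACGT
```

With enough overlapping reads, independent random errors are outvoted at each position, which is why correction effectiveness depends so strongly on coverage.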
In SMRT devices, the read quality is proportional to the number of passes over the DNA fragment. For example, the reading accuracy is around 85%-87% for a 10 kb long sequence if it is passed only once [37]; however, with multiple passes it can be further improved, reaching 99%. In contrast, the quality of NS reads is independent of the number of reading repetitions and of the length of the nucleic acid sequences; it depends only on the ratchet rate per base through the nanopores. Fragments traverse only once, the median single-pass accuracy is around 95% [38], and read length depends only on the amount and quality of the high-molecular-weight input DNA. To reach maximal sequencing precision, companies focusing on long reads release chemistry, software, and hardware updates regularly [16,39].
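Why repeated passes boost SMRT accuracy can be seen with a simple probability model. This is a simplified independent-error, worst-case majority-vote calculation, not PacBio's actual circular consensus algorithm:

```python
from math import comb

def consensus_error(per_pass_error: float, passes: int) -> float:
    """Probability that a majority vote over independent passes is wrong,
    under a simplified model where each pass errs independently and every
    error produces the same wrong base (worst case for a vote).
    Ties (possible with an even number of passes) count as correct here."""
    p = per_pass_error
    # the majority is wrong when more than half of the passes are erroneous
    return sum(comb(passes, k) * p**k * (1 - p)**(passes - k)
               for k in range(passes // 2 + 1, passes + 1))

# Starting from a ~13% single-pass error rate (i.e., ~87% accuracy):
for n in (1, 5, 9):
    print(f"{n} passes: consensus error {consensus_error(0.13, n):.4f}")
```

Even this crude model shows the error rate dropping by roughly an order of magnitude after a handful of passes, consistent with single-pass accuracy of ~85%-87% rising toward 99% with multiple reads.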
Reference genomes are integral parts of sequencing assays as they provide organism-specific support during base order construction [40]. The progression of sequencing methods drove breakthroughs regarding the imprecision of reference genomes [41], variant identification, genomic assemblies, and other specialized data analyses in the field of genetics. The Genome Reference Consortium (GRC) released the current form of the human reference genome (GRCh38.p13) in 2013, with an origin tracing back to the Human Genome Project [42,43]. Despite the continuous improvement of the GRCh38.p13 genome over the last years, many problems remained unsolved due to the technical limitations of NGS short reads. The underrepresentation of repetitive sequences, the unsolved assembly gaps due to structural polymorphisms, and the unfinished polymorphic regions resulted in the need for further investigation. The 151 megabase pair (Mbp) long unknown sequence data distributed throughout the GRCh38.p13 genome turned out to be fundamental and included centromere and telomere regions, segmental duplications, amplicon gene arrays, and ribosomal DNA (rDNA) arrays, all highly affecting cellular processes [44]. Long-read sequencing proved to be the problem solver, marking the birth of the Telomere-to-Telomere (T2T) Consortium, which constructed a new and almost complete human reference genome, the T2T-CHM13 assembly [44]. In this cooperation, the advances of long-read techniques, including the multi-kilobase single-molecule reads of SMRT and the ultra-long reads of NS, were combined, providing evidence for the beneficial applications of hybrid sequencing methods. The T2T-CHM13 assembly resulted in a 3 billion-base-pair-long complete human haplotype, contributing to the recognition of almost 4,000 new genes, many of a protein-coding nature. In addition, T2T-CHM13 includes the gapless telomere-to-telomere assemblies of all 22 human autosomes and chromosome X, contains the corrected version of the 151 Mbp of previously unknown genomic sequence data, and has the chance to arise as the mainly applied reference genome in human genomics-related fields. The successful application of the combination of NS and SMRT reads as a hybrid solution in the T2T Consortium suggests that further development of sequencing methods can still be expected and that the effort to eliminate their limitations is continuous. Additional packages are listed on the webpage https://long-read-tools.org and can be found on bioinformatics-related pages.

Pathology & Oncology Research

Published by Frontiers

Bioinformatics of long reads
After exploring the scientific literature in detail, it became clear that sequencing techniques cannot address questions in genomics without bioinformatics. With the rise of new sequencing approaches, a new generation of bioinformatics tools emerged, compatible with the unique features of long reads and trying to overcome their biases. Like long reads themselves, their analysis also presents many opportunities and challenges. Increased read lengths particularly affect how aligners, assemblers, and variant callers store and analyze the data. Many software tools specialized for long-read sequencing data are provided by ONT and PacBio, with continuous maintenance [45,46]. Additional sources and packages are also available, as demonstrated in Table 1.
As a summary of bioinformatics steps, the following sections provide a brief general discussion of base calling, detection of base modifications, variant calling, genome assembly, and selected specialized evaluation possibilities, covering both long-read and NGS techniques and emphasizing their unique strengths.

Base calling
The first main step in bioinformatics analysis is always a process named base calling, during which the specific electric signals are translated into known nucleotides. Translation in this case means the conversion from electric signals to nucleic acid sequences [73]. Raw current and light pulse data and read information are stored in specific file formats. In the NGS system, the primary analysis of sequencing data is a critical step before base calling. These sequencing platforms have their own chemistry- or sensor-derived biases, which should be eliminated before or during base calling [74]. As a result of the pre-sequencing PCR amplification, many redundant PCR duplicates are present among aligned reads; these are marked and excluded in later analysis stages [75]. Considering the two long-read techniques, base calling means the conversion of fluorescent light pulses in SMRT devices, while during NS, it is the translation of current intensities into k-mers of bases. The alignment of sequencing reads to a reference sequence is a compulsory step after base calling in NGS bioinformatics; however, many TGS base callers [55-57] execute the alignment in parallel with base identification [55,56]. As a side note, we would like to emphasize the importance of quality checking of sequencing reads [47-54, 76, 77], preferably before and after every principal step, paying special attention to base calling and variant calling.
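The translation of current intensities into bases can be caricatured as a nearest-neighbour lookup against a signal model. Real nanopore base callers use neural networks over k-mer-level signals; the per-base current levels below are invented purely for illustration:

```python
# Toy "base calling": translate segmented current levels back into bases
# by nearest-neighbour lookup against a per-base model level.
# Real base callers use neural networks over k-mer signals; the levels
# below are invented for illustration, not measured values.

MODEL_LEVELS = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}  # picoamps, invented

def call_bases(segment_means: list[float]) -> str:
    """Map each segment's mean current to the base with the closest model level."""
    return "".join(
        min(MODEL_LEVELS, key=lambda b: abs(MODEL_LEVELS[b] - level))
        for level in segment_means
    )

print(call_bases([81.2, 124.0, 96.3, 109.1]))  # -> "ATCG"
```

The hard parts that this sketch omits, segmenting the raw trace into events and resolving overlapping k-mer contexts, are exactly what the neural-network base callers are trained to do.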

Epigenetic modifications: modified base calling
In addition to the traditional bases, adenine (A), thymine (T), uracil (U), guanine (G), and cytosine (C), DNA and RNA molecules can contain modified bases that differ from their canonical counterparts in nature and frequency and have distinct functional roles. In nucleic acids, the most frequently occurring modified bases are 6mA, 5mC, and 5hmC. Considering the location of 5mC and 5hmC in DNA, they are mostly observed at CpG dinucleotide sites. RNA modifications, including 6mA, are frequent in non-coding RNAs like ribosomal RNA (rRNA) and transfer RNA (tRNA), and also in coding mRNA. Modified DNA and RNA nucleotides play a key role in many biological processes, including development, aging, and cancer [78-80]. Their identification enables the analysis of open chromatin regions, the detection of DNA replication, and the measurement of RNA metabolism using base analogs [81-83].
The methylation signature is not preserved during PCR amplification, which is essential before NGS assays; thus, approaches have been developed to conserve the epigenetic information. These pretreatments rely on methylation-dependent enzymatic restriction, methyl-DNA enrichment, and direct bisulfite conversion [84]. In NGS base modification analysis, bisulfite-treated DNA requires specialized alignment to account for the C-to-T conversion. Encouraged by this, short-read alignment algorithms were implemented that can be configured for bisulfite-converted DNA alignment [85].
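The C-to-T accounting that bisulfite-aware aligners perform can be sketched with an in silico conversion: collapsing both the read and the reference into the same reduced alphabet before mapping. This is a minimal sketch of the idea, not any specific aligner's implementation:

```python
def bisulfite_convert(seq: str) -> str:
    """In silico C->T conversion of the forward strand, mimicking how
    bisulfite-aware aligners collapse the sequence space before mapping."""
    return seq.replace("C", "T")

reference = "ACGTCCGA"
read      = "ATGTCCGA"  # unmethylated C sequenced as T; methylated Cs remain C

# After conversion, read and reference collapse to the same string,
# so the bisulfite read maps cleanly despite the C/T mismatch:
print(bisulfite_convert(reference) == bisulfite_convert(read))  # True
```

After mapping in the reduced space, the methylation state is recovered by comparing the original, unconverted read back against the reference: positions where a reference C survived as C in the read were protected, i.e., methylated.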
The available NGS methods provided some form of identification of modified bases in nucleic acid sequences as well, but demonstrating the true landscape only became possible with TGS assays. The detection of modified bases in SMRT is based on the delay between fluorescence pulses [86]. NS relies on the recognition of signal shifts resulting from the different current flow through the nanopores [19,87]. Most TGS computational tools are capable of modified base detection from reference-aligned reads [34, 57-59, 62, 63, 88] and are based on machine learning models and statistical tests. Algorithms using neural networks show the highest performance, although statistics-based approaches are best suited for the identification of de novo modifications [34,89]. Owing to software development progress, long-read base callers became capable of calling modified bases directly [55,56]. The key is the application of specific base-calling configuration models indicating in their labels the names of the modified bases of interest [56].
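The statistics-based flavour of signal-shift detection can be illustrated with a one-sample z-test against the canonical signal model. The numbers are invented, and real detectors use far richer models; this only shows the principle of flagging a site whose observed currents deviate from the expected level:

```python
from statistics import mean

def site_shift_zscore(site_currents: list[float],
                      model_mean: float, model_sd: float) -> float:
    """Z-score of the observed mean current against the canonical k-mer
    model level; a large |z| hints at a modified base. Toy version of the
    statistics-based detectors mentioned above; all numbers are invented."""
    n = len(site_currents)
    return (mean(site_currents) - model_mean) / (model_sd / n ** 0.5)

# Canonical model: 100 pA with sd 2 (invented); observed currents shifted up:
z = site_shift_zscore([103.1, 102.6, 103.4, 102.9], model_mean=100.0, model_sd=2.0)
print(f"z = {z:.1f}")  # well above ~2, so the site would be flagged as modified
```

De novo detection works this way because no prior training on the specific modification is needed, only the canonical model to deviate from, which matches the text's observation that statistical approaches suit novel modifications best.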

Variant calling
Sequence variations can be grouped based on their somatic or germline nature. Germline variants are present in all cells of the body, including the germ cells, while somatic mutations arise during the lifetime. The standard pipeline of somatic mutation calling is the paired tumor-normal sequencing strategy [90]. It can provide the true somatic mutations by filtering the germline variants of the normal sample out of the tumor mutation data, according to known tissue-specific non-tumorous variant profiles. Germline and somatic groups also involve subtypes like structural variants, single nucleotide variations, short insertions/deletions, and copy number variations.
The shortest variations are single nucleotide polymorphisms, germline substitutions of single nucleotides at specific genomic positions. Copy number variation (CNV) is an alteration type describing the uniqueness among individual genomes, covering variations from a few bases to thousands of bases in the copy numbers of specific DNA segments. Structural variants (SVs) are large genomic alterations, like insertions, deletions, inversions, and translocations. They are typically longer than 50 bp, describing different combinations of DNA losses, gains, or rearrangements [91]. Alterations shorter than 50 bp but longer than a few bases are usually referred to as indels.
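The size conventions above translate into a simple classification rule. This is a deliberately minimal sketch (it ignores multi-nucleotide substitutions and symbolic SV alleles) that just encodes the thresholds from the text:

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify by allele lengths using the conventions above:
    SNV (1 bp substitution), indel (<50 bp size change), SV (>=50 bp).
    Deliberately ignores multi-nucleotide substitutions and symbolic alleles."""
    size = abs(len(ref) - len(alt))
    if size == 0 and len(ref) == 1:
        return "SNV"
    return "SV" if size >= 50 else "indel"

print(classify_variant("A", "G"))             # SNV
print(classify_variant("ACGT", "A"))          # indel (3 bp deletion)
print(classify_variant("A", "A" + "T" * 60))  # SV (60 bp insertion)
```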
The key aspect of variant calling is the choice of a robust variant caller, for NGS and TGS assays alike. To achieve optimal performance, prior fine-tuning considering the features of the input is needed; this is reached by training and pre-testing the variant callers using the characteristics of the datasets. The exclusion of redundant and duplicate reads from binary alignment map (.bam) files, the quality control of the .bam files, and the identification and reduction of false-positive variant calls caused by alignment artifacts are crucial steps in input preparation. The accuracy of variant calling can be validated with publicly available benchmarking datasets. The quality of the collected variants depends on the precision (and version) of the reference genome, and on the error rate and accuracy of the base and variant identification method. Sequencing coverage affects the sensitivity in a hidden manner, since the appropriateness of the variant caller input is highly dependent on the coverage [92]. We must also consider variant representation differences when searching for valid variations relative to the reference, excluding low-coverage biases. Appropriate post-filtering of the output data is often required; it protects against artifactual and false-positive calls [75].
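The post-filtering step can be sketched as a per-record check on a VCF line. The thresholds and the assumption that depth lives in the INFO `DP` field are illustrative, not a recommendation for any particular caller's output:

```python
def pass_filter(vcf_line: str, min_qual: float = 30.0, min_depth: int = 10) -> bool:
    """Minimal post-filter on one tab-separated VCF record: keep calls with
    sufficient quality (column 6, QUAL) and depth (INFO key DP).
    Thresholds and field assumptions are illustrative only."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = float(fields[5])
    info = dict(kv.split("=", 1) for kv in fields[7].split(";") if "=" in kv)
    depth = int(info.get("DP", 0))
    return qual >= min_qual and depth >= min_depth

record = "chr1\t12345\t.\tA\tG\t48.2\tPASS\tDP=22;AF=0.48"
print(pass_filter(record))  # True: QUAL 48.2 >= 30 and DP 22 >= 10
```

Real pipelines layer many more criteria on top (strand bias, mapping quality, panel-of-normals membership), but the shape is the same: reject records whose evidence falls below dataset-tuned thresholds.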
TGS variant callers [58-61, 88] are built upon de novo assembly, short-read alignment, or long-read mapping approaches. De novo assembly-based calling involves the alignment of the current assembly to another assembly or to a reference sequence, and the alterations can be identified by pointwise positional comparison. During short-read alignment, the presence of SVs induces the appearance of abnormally oriented and spaced reads replacing the organized paired-end form. Long-read mapping approaches can simply span repetitive and other problematic regions, showing an overall better performance [93].
Among the techniques used nowadays, long-read sequencing is the most suitable and most accurate variant calling approach, especially for the detection of SVs [94]. The special role of genetic variations, especially SVs, has been highlighted primarily in medicine and molecular biology, e.g., in neurological diseases [95,96], or during the detection of oncogene-specific variations in breast, prostate, or primary gastric tumors [97]. Although their importance is unquestionable, they have been understudied in the past. This issue arises from the fact that SVs can overlap or be nested, giving rise to complex patterns that are hard to identify with short-read approaches [93].

Genome assembly
Probably the most important benefit of long-read computational biology can be experienced in the field of genomic de novo assembly [64-68]. Assembly in this case means the comparison and joining of the read sequences to each other. Assembly construction is crucial to understand the impact of genomic diversity on health and disease [98]. In the last few years, the process has been simplified and the results have become more accurate due to improvements in bioinformatics routines [99]. Besides sequence construction, another important application of genomic assembly is reassembling and fixing the errors of former reference genomes (fungal, plant, animal, and human) [44]. Unfortunately, repetitive sequences with unresolved repeats are still problematic, causing confusion while joining assembled sequences, and linked sequences contain many gaps. To get rid of these, the scaffolding of sequences is a crucial step. Scaffolding means the proper ordering and orientation of assembled sequences using genetic markers, optical maps, or linked reads [100]. Assemblies of short and long reads are both in use, exploiting their respective advantages for different problems. Besides the success of their combination in the T2T Consortium, many other hybrid applications have been published recently [29,101], suggesting that accurate genomic assemblies benefit from both error-free short and long sequences.
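The "comparison and joining of reads" at the heart of assembly can be sketched as a greedy overlap merge. This toy assumes error-free reads and unique overlaps; real assemblers build overlap or string graphs precisely because repeats and sequencing errors break these assumptions:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a matching a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads: list[str]) -> str:
    """Toy greedy overlap assembly: repeatedly merge the read pair with
    the longest suffix-prefix overlap. Assumes error-free reads and no
    ambiguous repeats, which is exactly what real genomes violate."""
    reads = reads[:]
    while len(reads) > 1:
        olen, i, j = max((overlap(a, b), i, j)
                         for i, a in enumerate(reads)
                         for j, b in enumerate(reads) if i != j)
        if olen == 0:
            break  # no overlaps left; further joining would be arbitrary
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

print(greedy_assemble(["ACGTACG", "TACGGCA", "GGCATTT"]))  # -> ACGTACGGCATTT
```

Longer reads help precisely because they make the overlaps longer and more likely to be unique, which is why repeats that defeat short-read assemblers can be spanned outright by a single long read.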

Applications of long-read sequencing
Although the topic of long-read sequencing is quite recent, its successful application in several fields is well represented in the scientific literature, including cancer genomics, laboratory medicine, methylation studies, and rare genetic conditions.
In laboratory medicine, the currently applied diagnostic strategies involve the use of targeted NGS gene panels, exome sequencing, and genome sequencing. Targeted gene panels are somatic and hereditary disease-specific, with the ability to maximize coverage, sensitivity, and specificity for characteristic genes. They offer a favorable diagnostic yield at lower costs and with faster turnaround times than exome or genome sequencing [102]. The combination of whole-genome and long-read targeted sequencing has already been applied in hematology. Hematologic disorders, like hemophilia A, often involve the appearance of gene fusions and other pathologic events; thus, the characterization of fusion transcripts is often done by combining NGS and TGS assay-based methods [103]. Another laboratory medicine-related application of long reads is the characterization of the human leukocyte antigen (HLA) system. The HLA system contains the genes that encode key components of the adaptive immune system and accounts for major genetic differences among ethnic populations [104]. HLA-genotyping information is often derived from targeted exome and non-targeted genome sequence data [105].
In diploid genomes, chromosomal DNA has two haplotypes. These are combinations of alleles from multiple genetic loci on the same chromosome, including complex structural variants, one inherited from each parent. Distinguishing the maternal and paternal haplotypes allows the recognition of homozygous and heterozygous mutations in the human genome. Haplotypes within a diploid chromosome are determined by partitioning reads into two sets, one for each haplotype, such that the reads within each subset have a minimal number of errors compared to a consensus [106]. Resolving haplotypes helps to discover nested structural variations, inversions, and other complex rearrangements, and to study the interactions between variants in regulatory elements, aneuploidy, evolutionary processes, and drug resistance in viral infections.
The key concept in deriving haplotypes from sequencing reads is the phasing of heterozygous variants. Advancements in sequencing-associated computational tools, like reference-based phasing, de novo assembly, or strain-resolved metagenome assembly [107], carry the potential for near-complete reconstruction of the human haplotype structure.
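The read-partitioning formulation of phasing described above can be sketched with a greedy assignment: each read goes to the haplotype whose running consensus it contradicts least. Real phasers solve the minimum error correction (MEC) problem exactly or near-exactly; this is only a toy on error-free reads:

```python
def phase_reads(reads: list[dict[int, str]]) -> tuple[list[int], list[int]]:
    """Toy read partitioning for phasing. Each read is a mapping
    {heterozygous_site_index: observed_allele}; reads are assigned
    greedily to the haplotype consensus they disagree with least.
    Real phasers solve this as minimum error correction (MEC)."""
    consensus: list[dict[int, str]] = [{}, {}]
    groups: list[list[int]] = [[], []]
    for idx, read in enumerate(reads):
        # disagreements of this read with each haplotype's consensus so far
        mismatches = [sum(1 for site, allele in read.items()
                          if site in consensus[h] and consensus[h][site] != allele)
                      for h in (0, 1)]
        if mismatches[0] == mismatches[1]:
            # tie: prefer the emptier haplotype so conflicting reads separate
            h = 0 if len(groups[0]) <= len(groups[1]) else 1
        else:
            h = mismatches.index(min(mismatches))
        groups[h].append(idx)
        consensus[h].update(read)
    return groups[0], groups[1]

reads = [
    {0: "A", 1: "C"},   # haplotype 1
    {0: "G", 1: "T"},   # haplotype 2
    {1: "C", 2: "G"},   # overlaps read 0, extends haplotype 1 to site 2
    {1: "T", 2: "A"},   # overlaps read 1, extends haplotype 2 to site 2
]
print(phase_reads(reads))  # -> ([0, 2], [1, 3])
```

The example also shows why long reads phase so well: each read spans several heterozygous sites, so overlapping reads chain the sites into long phased blocks, whereas short reads rarely cover more than one heterozygous site at a time.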
The investigation of genomes containing segments with small allele fraction variants, and of rearrangements observed in regions of associated genes, is still challenging even for current long-read methods [108]. Sequencing techniques with higher depths and longer lengths are expected to appear. Regardless, many successful applications can be discussed already. The characterization of tumor genomes and transcriptomes through the analysis of mRNA expression, mutation detection, gene fusions, or chromosomal copy number alterations can highlight new markers of malignancy. With a better depiction of the genome-wide landscape and the extent of mutational processes, whole-genome long-read sequencing yields better treatment options in advanced thyroid [109] and other cancers [110]. Improvements in sequencing technologies allowed the recognition and description of long non-coding RNAs (lncRNAs). They are non-protein-coding nucleic acids with lengths greater than 200 nucleotides, characterized by high cell-type specificity [111]. LncRNAs are found to be key players in tumorigenesis and immune responses, and evidence supports their unique cellular functions in the tumor immune microenvironment [112]. Most studies related to lncRNAs relied on bulk RNA sequencing; however, the potential of scRNAseq can open new possibilities for understanding the cell type-specific functions of lncRNA genes [112].
The examination of abnormal RNA expression helps to understand the molecular mechanisms behind human cancer initiation, development, progression, and metastasis. RNA techniques include the classic bulk RNA (RNAseq), single-cell RNA (scRNAseq), spatial RNA (spRNAseq) [113], and direct RNA (DRS) [114] sequencing methods. Bulk RNAseq means the sequencing of mRNA-only or whole-transcriptome libraries with single-end short or paired-end longer approaches. scRNAseq procedures always include single-cell isolation and capture, cell lysis, reverse transcription, cDNA amplification, and library preparation [115]. spRNAseq combines the transcriptional analysis of bulk RNAseq with in situ hybridization, providing whole-transcriptome data with spatial information [113]. As a novelty, NS technology offers the direct sequencing of individual polyadenylated RNAs without the need for any amplification step [114].
Circulating cell-free DNA (cfDNA) in the blood of cancer patients can signal worsening tumor progression. Sequencing analyses revealed that tumor-derived cfDNA accounts for only a fraction of the total cfDNA, and this fraction varies with tumor burden [116]. Due to the low level and high fragmentation of cfDNA, its analysis is challenging. In the past few years, NGS techniques have served as suitable tools for this assay [117]; however, long reads will likely enable a deeper characterization of cfDNA, providing higher clinical sensitivity for cancer detection [117].
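The notion that tumor-derived cfDNA makes up only a small, burden-dependent fraction of total cfDNA can be illustrated with a toy calculation. This is a minimal sketch under a simplifying assumption (a clonal heterozygous somatic variant, so tumor fraction ≈ 2 × variant allele fraction); the function name is ours and does not correspond to any published tool.

```python
def estimate_tumor_fraction(variant_reads, total_reads):
    """Toy estimate of the tumor-derived cfDNA fraction from one
    somatic variant, assuming the variant is clonal and heterozygous:
    tumor fraction ~ 2 * VAF (illustrative simplification only)."""
    vaf = variant_reads / total_reads  # variant allele fraction
    return min(2 * vaf, 1.0)

# e.g., 25 variant-supporting reads out of 1000 cfDNA reads
print(estimate_tumor_fraction(25, 1000))  # 0.05
```

In practice, dedicated cfDNA pipelines integrate many variants, fragment-length features, and copy number signals rather than a single locus, but the toy formula conveys why low tumor burden translates into very small allele fractions.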
The clinical diagnosis of rare genetic disorders often requires the identification of CNVs or repeat variants. Long-read genome sequencing provides an improved opportunity for CNV detection and broadens the possibilities of gene- and variant-level annotation [118]. As an interesting example, primary mitochondrial diseases (PMD) comprise a group of rare genetic conditions characterized by impaired mitochondrial oxidative phosphorylation. The presence of mixed mitochondrial populations, termed heteroplasmy, and the fact that mitochondria carry their own genome of mitochondrial DNA (mtDNA), pose a challenge in identifying PMD. Long-read sequencing enables the entire mitochondrial genome to be sequenced in a single read, overcoming the obstacles mentioned above [119].
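The heteroplasmy described above is typically quantified per mtDNA position as the fraction of reads supporting the minor allele. The following is an illustrative sketch of that calculation, not a real variant caller; the function name is our own.

```python
from collections import Counter

def heteroplasmy_level(bases_at_position):
    """Toy heteroplasmy estimate at one mtDNA position: the fraction
    of reads carrying any allele other than the majority allele.
    Real callers additionally model base quality and strand bias."""
    counts = Counter(bases_at_position)
    total = sum(counts.values())
    major = counts.most_common(1)[0][1]  # read count of the majority allele
    return (total - major) / total

# 80 reads support 'A' and 20 reads support 'G' at one position
reads = ['A'] * 80 + ['G'] * 20
print(heteroplasmy_level(reads))  # 0.2
```

Because a single long read can span the whole ~16.6 kb mitochondrial genome, such per-position allele counts can also be phased along individual molecules, which short reads cannot provide.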
Using epigenetic alterations as biomarkers presents a unique opportunity for early cancer detection, monitoring, and prognosis. Methylation is the most widely studied epigenetic modification of nucleic acids, and its landscape in cancer tissues is evidently complex and highly variable. DNA methylation plays an important role in the regulation of gene expression. The methylation-associated transcriptional inactivation of genes involved in cell cycle control and damage repair suggests that aberrant nucleotide methylation is a hallmark of carcinogenesis [120,121]. NS provides the most precise detection and description of methylation landscapes [122]. Studies have shown that both 5mC and 5hmC play a role in the pathogenesis of pediatric cancer [123], while 6mA is highly upregulated in pancreatic tumors, although it occurs less frequently than 5mC [124]. Thus, the idea of using methylation as a biomarker for cancer detection readily suggests itself. Owing to its prognostic properties, DNA methylation has already been applied as a prognostic marker in several cancer types, including prostate, bladder, colorectal, non-small-cell lung, breast, ovarian, cervical, and liver malignancies [125,126].
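A common summary of nanopore modified-base output is the per-site methylation frequency: the share of reads at a position whose modification probability exceeds a threshold. The sketch below illustrates that aggregation step only; the function name and the 0.5 threshold are our assumptions, and thresholding conventions differ between tools.

```python
def site_methylation_frequency(mod_probs, threshold=0.5):
    """Toy per-site methylation frequency from per-read modification
    probabilities (e.g., P(5mC) emitted by a modified-base caller):
    the fraction of reads called methylated at the given threshold."""
    called = [p >= threshold for p in mod_probs]
    return sum(called) / len(called)

# four reads cover one CpG site with these P(5mC) values
print(site_methylation_frequency([0.9, 0.8, 0.1, 0.95]))  # 0.75
```

Downstream biomarker analyses then compare such frequencies between tumor and normal samples across candidate regions, rather than interpreting single sites in isolation.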
Although we have presented the potential of TGS long-read sequencing, its utilization in routine diagnostics is not yet widespread. NGS whole-exome and targeted sequencing techniques offer readily applicable results in routine diagnostics, including the detection of inborn disorders, cancer research and diagnostics, hematology, and neurological disorders [72,127-130]. Their instrumentation, the corresponding chemicals, and flow cells are more affordable, and the generated data are more targeted [131]. On the other hand, as long-read techniques offer a wider genomic picture and thus a deeper insight into nucleic acid traits, their introduction into routine examinations has begun [132-136], and their spread is expected in the near future.

Conclusion
In this review, we discussed the milestones of sequencing techniques, their progression, current applications, and future opportunities. We also provided a general comparison between short- and long-read assays, highlighting their strengths and drawbacks from various aspects, including methodology, data analysis, and applications. As introduced in the last chapter, the spread of long-read techniques has led to rapid progress in genomics-related areas. By expanding and refining sequencing routines, it becomes possible to explore the genetic complexity of biological systems in greater depth, facilitating radical future advances in the study of sequence variation.

TABLE 1
Summary of the most recent and commonly used long-read bioinformatics tools.

[58,59] — perform structural variant calling on noisy long-read data.

Clair3 [60] — a deep-neural-network-based variant caller, capable of haplotype-sensitive variant detection and of calling variants from sequencing data containing modified bases.

wf-human-variation, wf-somatic-variation — complex command-line-compatible workflows for NS variant detection. On demand, the separate or combined use of tumor and normal data is supported, with the production of detailed analysis reports.