Results 1 - 6 of 6
1.
bioRxiv ; 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-38562829

ABSTRACT

The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.
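The deviation from Hardy-Weinberg equilibrium mentioned in the abstract can be illustrated with a minimal sketch. It assumes the haplogroups are collapsed to two allele classes (for example H3 versus non-H3) and uses a standard chi-square goodness-of-fit statistic. The function name and the genotype counts are hypothetical and are not taken from the study, which may have used a different test.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for a biallelic
    Hardy-Weinberg test, given observed genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Hypothetical counts with an excess of heterozygotes (not data from the study).
stat = hwe_chi_square(10, 80, 10)
print(f"chi-square = {stat:.2f}")
```

With one degree of freedom, a statistic above roughly 3.84 corresponds to p < 0.05. An excess of heterozygotes relative to the expected counts is the pattern that would be consistent with heterozygote advantage.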

2.
bioRxiv ; 2024 Apr 20.
Article in English | MEDLINE | ID: mdl-38659906

ABSTRACT

Structural variants (SVs) contribute significantly to human genetic diversity and disease [1-4]. Previously, SVs have remained incompletely resolved by population genomics, with short-read sequencing facing limitations in capturing the whole spectrum of SVs at nucleotide resolution [5-7]. Here we leveraged nanopore sequencing [8] to construct an intermediate-coverage resource of 1,019 long-read genomes sampled within 26 human populations from the 1000 Genomes Project. By integrating linear and graph-based approaches for SV analysis via pangenome graph augmentation, we uncover 167,291 sequence-resolved SVs in these samples, considerably advancing SV characterization compared to population-wide short-read sequencing studies [3,4]. Our analysis details diverse SV classes (deletions, duplications, insertions, and inversions) at population scale. LINE-1 and SVA retrotransposition activities frequently mediate transductions [9,10] of unique sequences, with both mobile element classes transducing sequences at either the 3'- or 5'-end, depending on the source element locus. Furthermore, analyses of SV breakpoint junctions suggest that a continuum of homology-mediated rearrangement processes is integral to SV formation, and highlight evidence for SV recurrence involving repeat sequences. Our open-access dataset underscores the transformative impact of long-read sequencing in advancing the characterisation of polymorphic genomic architectures, and provides a resource for guiding variant prioritisation in future long-read sequencing-based disease studies.

3.
Bioinformatics ; 39(Suppl 1): i279-i287, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387146

ABSTRACT

MOTIVATION: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover > 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases. METHODS: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy. RESULTS: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the genome-in-a-bottle high-confidence variant calls for HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954 and recall = 0.822), GATK (precision = 0.888 and recall = 0.873) and DeepVariant (precision = 0.983 and recall = 0.861). ParascopyVC demonstrated a consistently higher accuracy (mean F1 = 0.947) than other callers (best F1 = 0.908) across seven human genomes. AVAILABILITY AND IMPLEMENTATION: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.
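To put the HG002 precision and recall figures above on a single scale, the sketch below computes the F1 score (the harmonic mean of precision and recall) for each caller. The precision and recall values are copied from the abstract; the function and variable names are illustrative and are not part of ParascopyVC.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) on the HG002 benchmark, as reported in the abstract.
hg002 = {
    "ParascopyVC": (0.991, 0.909),
    "FreeBayes": (0.954, 0.822),
    "GATK": (0.888, 0.873),
    "DeepVariant": (0.983, 0.861),
}

f1_by_caller = {name: f1_score(p, r) for name, (p, r) in hg002.items()}

# Print callers from highest to lowest F1.
for name, f1 in sorted(f1_by_caller.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 = {f1:.3f}")
```

These are single-genome values derived from the HG002 numbers only. They are separate from the mean F1 across seven genomes that the abstract reports.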


Subjects
DNA Copy Number Variations, Genomic Segmental Duplications, Humans, Whole Genome Sequencing, Benchmarking, Human Genome

4.
Nat Commun ; 13(1): 3221, 2022 06 09.
Article in English | MEDLINE | ID: mdl-35680869

ABSTRACT

The human genome contains hundreds of low-copy repeats (LCRs) that are challenging to analyze using short-read sequencing technologies due to extensive copy number variation and ambiguity in read mapping. Copy number and sequence variants in more than 150 duplicated genes that overlap LCRs have been implicated in monogenic and complex human diseases. We describe a computational tool, Parascopy, for estimating the aggregate and paralog-specific copy number of duplicated genes using whole-genome sequencing (WGS). Parascopy is an efficient method that jointly analyzes reads mapped to different repeat copies without the need for global realignment. It leverages multiple samples to mitigate sequencing bias and to identify reliable paralogous sequence variants (PSVs) that differentiate repeat copies. Analysis of WGS data for 2504 individuals from diverse populations showed that Parascopy is robust to sequencing bias, has higher accuracy compared to existing methods and enables prioritization of pathogenic copy number changes in duplicated genes.


Subjects
DNA Copy Number Variations, Human Genome, DNA Copy Number Variations/genetics, Human Genome/genetics, High-Throughput Nucleotide Sequencing/methods, Humans, Genomic Segmental Duplications, DNA Sequence Analysis/methods, Whole Genome Sequencing/methods

5.
Nucleic Acids Res ; 48(19): e114, 2020 11 04.
Article in English | MEDLINE | ID: mdl-33035301

ABSTRACT

The ability to characterize repetitive regions of the human genome is limited by the read lengths of short-read sequencing technologies. Although long-read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies can potentially overcome this limitation, long segmental duplications with high sequence identity pose challenges for long-read mapping. We describe a probabilistic method, DuploMap, designed to improve the accuracy of long-read mapping in segmental duplications. It analyzes reads mapped to segmental duplications using existing long-read aligners and leverages paralogous sequence variants (PSVs), which are sequence differences between paralogous sequences, to distinguish between multiple alignment locations. On simulated datasets, DuploMap increased the percentage of correctly mapped reads with high confidence for multiple long-read aligners, including Minimap2 (from 74.3% to 90.6%) and BLASR (from 82.9% to 90.7%), while maintaining high precision. Across multiple whole-genome long-read datasets, DuploMap aligned an additional 8-21% of the reads in segmental duplications with high confidence relative to Minimap2. Using DuploMap-aligned PacBio circular consensus sequencing reads, an additional 8.9 Mb of DNA sequence was mappable, variant calling achieved a higher F1 score, and 14,713 additional variants supported by linked-read data were identified. Finally, we demonstrate that a significant fraction of PSVs in segmental duplications overlaps with variants and adversely impacts short-read variant calling.
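The idea of using PSVs to choose between candidate alignment locations can be sketched as a toy posterior calculation. This is a simplified illustration under stated assumptions (a uniform prior over copies, independent PSVs, and a single fixed per-base error rate), not DuploMap's actual model. The function name, the PSV identifiers, and the alleles below are all hypothetical.

```python
import math


def psv_posteriors(read_alleles, copy_alleles, error_rate=0.01):
    """Posterior probability that a read originates from each repeat copy.

    read_alleles: dict mapping PSV id -> base observed in the read.
    copy_alleles: dict mapping copy name -> dict of PSV id -> reference base.
    Assumes a uniform prior over copies and independent PSVs.
    """
    log_liks = {}
    for copy, alleles in copy_alleles.items():
        ll = 0.0
        for psv, base in read_alleles.items():
            if psv not in alleles:
                continue  # PSV not defined for this copy: uninformative
            match = alleles[psv] == base
            ll += math.log(1.0 - error_rate if match else error_rate)
        log_liks[copy] = ll
    # Normalise in log space to avoid underflow with many PSVs.
    best = max(log_liks.values())
    weights = {c: math.exp(ll - best) for c, ll in log_liks.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical PSVs distinguishing two copies of a duplication.
copies = {
    "copy_A": {"psv1": "A", "psv2": "C", "psv3": "G"},
    "copy_B": {"psv1": "G", "psv2": "T", "psv3": "A"},
}
read = {"psv1": "A", "psv2": "C", "psv3": "G"}
posteriors = psv_posteriors(read, copies)
print(posteriors)
```

A read that covers no informative PSVs receives equal posteriors for every copy, which mirrors why highly identical duplications remain hard to map with confidence.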


Subjects
Human Genome, High-Throughput Nucleotide Sequencing/methods, Genomic Segmental Duplications, DNA Sequence Analysis/methods, Software, Algorithms, Genetic Databases, Datasets as Topic, Humans

6.
BMC Med Genomics ; 11(Suppl 1): 13, 2018 02 13.
Article in English | MEDLINE | ID: mdl-29504914

ABSTRACT

BACKGROUND: Cystic fibrosis (CF) is one of the most common life-threatening genetic disorders. Around 2000 variants in the CFTR gene have been identified, some proportion of which are known to be pathogenic, and 300 disease-causing mutations have been characterized in detail by the CFTR2 database. This diversity complicates analysis of the gene with conventional methods. METHODS: We conducted next-generation sequencing (NGS) in a cohort of 89 adult patients negative for p.Phe508del homozygosity. Complete clinical and demographic information was available for 84 patients. RESULTS: By combining MLPA with NGS, we identified disease-causing alleles in all the CF patients. Importantly, in 10% of cases, standard bioinformatics pipelines were inefficient in identifying causative mutations. Class IV-V mutations were observed in 38 (45%) cases, predominantly those with pancreatic-sufficient CF; the rest of the patients had Class I-III mutations. Diabetes was seen only in patients homozygous for class I-III mutations. We found that 12% of the patients were heterozygous for more than two pathogenic CFTR mutations. Two patients carried the p.[Arg1070Gln, Ser466*] complex allele, which was associated with milder pulmonary obstruction (FVC 107 and 109% versus 67%, CI 95%: 63-72%; FEV 90 and 111% versus 47%, CI 95%: 37-48%). The p.[Phe508del, Leu467Phe] complex allele was reported for the first time, observed in four patients (5%). CONCLUSION: NGS can be more informative than standard methods. Combined with its equivalent diagnostic performance, it can therefore be implemented in clinical practice, although careful validation is still required.


Subjects
Biomarkers/analysis, Cystic Fibrosis Transmembrane Conductance Regulator/deficiency, Cystic Fibrosis/genetics, Cystic Fibrosis/pathology, Genetic Association Studies, High-Throughput Nucleotide Sequencing/methods, Mutation, Adult, Cohort Studies, Cystic Fibrosis Transmembrane Conductance Regulator/genetics, Female, Humans, Male, Middle Aged, Young Adult