1.
bioRxiv ; 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38915693

ABSTRACT

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: We present the VCF Zarr specification, an encoding of the VCF data model using Zarr which makes retrieving subsets of the data much more efficient. Zarr is a cloud-native format for storing multi-dimensional data, widely used in scientific computing. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and calculation performance. We demonstrate the VCF Zarr format (and the vcf2zarr conversion utility) on a subset of the Genomics England aggV2 dataset comprising 78,195 samples and 59,880,903 variants, with a 5X reduction in storage and greater than 300X reduction in CPU usage in some representative benchmarks. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores.
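The contrast the abstract draws between row-wise and field-wise access can be illustrated with a minimal sketch. It uses plain Python lists rather than the Zarr library or the vcf2zarr utility, and every name and value in it (`rows`, `to_columns`, `alt_allele_count`, the toy genotypes) is a hypothetical chosen for illustration, not part of the VCF Zarr specification.

```python
# Toy data: 3 variants x 2 samples. All names and values are hypothetical.
# Row-wise layout: one record per variant, holding every field together.
rows = [
    {"pos": 101, "qual": 50.0, "genotypes": [(0, 1), (1, 1)]},
    {"pos": 205, "qual": 12.5, "genotypes": [(0, 0), (0, 1)]},
    {"pos": 333, "qual": 99.0, "genotypes": [(1, 1), (0, 0)]},
]


def to_columns(records):
    """Regroup row-wise records into one list per field (column-wise layout)."""
    return {
        "pos": [r["pos"] for r in records],
        "qual": [r["qual"] for r in records],
        "genotypes": [r["genotypes"] for r in records],
    }


columns = to_columns(rows)

# Row-wise: reading a single field still means visiting every full record.
mean_qual_rows = sum(r["qual"] for r in rows) / len(rows)

# Column-wise: the field is already stored on its own and can be read alone.
mean_qual_cols = sum(columns["qual"]) / len(columns["qual"])


def alt_allele_count(cols, sample_index):
    """Count non-reference alleles carried by one sample across all variants."""
    return sum(sum(gt[sample_index]) for gt in cols["genotypes"])
```

In this sketch both layouts give the same answer; the difference is only in how much data has to be touched to get it, which is the cost the abstract attributes to row-encoded VCF at biobank scale.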

2.
Science ; 380(6647): 849-855, 2023 05 26.
Article in English | MEDLINE | ID: mdl-37228217

ABSTRACT

Population genetic models only provide coarse representations of real-world ancestry. We used a pedigree compiled from 4 million parish records and genotype data from 2276 French and 20,451 French Canadian individuals to finely model and trace French Canadian ancestry through space and time. The loss of ancestral French population structure and the appearance of spatial and regional structure highlights a wide range of population expansion models. Geographic features shaped migrations, and we find enrichments for migration, genetic, and genealogical relatedness patterns within river networks across regions of Quebec. Finally, we provide a freely accessible simulated whole-genome sequence dataset with spatiotemporal metadata for 1,426,749 individuals reflecting intricate French Canadian population structure. Such realistic population-scale simulations provide opportunities to investigate population genetics at an unprecedented resolution.


Subject(s)
Datasets as Topic , Pedigree , Population , Humans , Alleles , Canada , Genetics, Population , Genotype , Quebec , France/ethnology , Population/genetics , Whole Genome Sequencing , Models, Genetic , Human Migration , Genetic Variation
3.
Am J Hum Genet ; 110(5): 741-761, 2023 05 04.
Article in English | MEDLINE | ID: mdl-37030289

ABSTRACT

The advent of large-scale genome-wide association studies (GWASs) has motivated the development of statistical methods for phenotype prediction with single-nucleotide polymorphism (SNP) array data. These polygenic risk score (PRS) methods use a multiple linear regression framework to infer joint effect sizes of all genetic variants on the trait. Among the subset of PRS methods that operate on GWAS summary statistics, sparse Bayesian methods have shown competitive predictive ability. However, most existing Bayesian approaches employ Markov chain Monte Carlo (MCMC) algorithms, which are computationally inefficient and do not scale favorably to higher dimensions, for posterior inference. Here, we introduce variational inference of polygenic risk scores (VIPRS), a Bayesian summary statistics-based PRS method that utilizes variational inference techniques to approximate the posterior distribution for the effect sizes. Our experiments with 36 simulation configurations and 12 real phenotypes from the UK Biobank dataset demonstrated that VIPRS is consistently competitive with the state-of-the-art in prediction accuracy while being more than twice as fast as popular MCMC-based approaches. This performance advantage is robust across a variety of genetic architectures, SNP heritabilities, and independent GWAS cohorts. In addition to its competitive accuracy on the "White British" samples, VIPRS showed improved transferability when applied to other ethnic groups, with up to 1.7-fold increase in R² among individuals of Nigerian ancestry for low-density lipoprotein (LDL) cholesterol. To illustrate its scalability, we applied VIPRS to a dataset of 9.6 million genetic markers, which conferred further improvements in prediction accuracy for highly polygenic traits, such as height.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Genome-Wide Association Study/methods , Bayes Theorem , Polymorphism, Single Nucleotide/genetics , Risk Factors , Genetic Predisposition to Disease
4.
Genet Epidemiol ; 45(6): 621-632, 2021 09.
Article in English | MEDLINE | ID: mdl-34157784

ABSTRACT

Linkage-Disequilibrium Score Regression (LDSC) is a popular framework for analyzing Genome-wide Association Studies (GWAS) summary statistics that allows for estimating single nucleotide polymorphism heritability, confounding, and functional enrichment of genetic variants with different annotations. Recent work has highlighted the influence of implicit and explicit assumptions of the model on the biological interpretation of the results. In this study, we explored a formulation of LDSC that replaces the r² measure of LD with a recently proposed unbiased estimator of the D² statistic. In addition to modest statistical difference across estimators, this derivation highlighted implicit and unrealistic assumptions about the relationship between allele frequency, effect size, and annotation status. We carry out a systematic comparison of alternative LDSC formulations by applying them to summary statistics from 47 GWAS traits. Our results show that commonly used models likely underestimate functional enrichment. These results highlight the importance of calibrating the LDSC model to achieve a more robust understanding of polygenic traits.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Linkage Disequilibrium , Models, Genetic , Polymorphism, Single Nucleotide
5.
Nucleic Acids Res ; 49(D1): D368-D372, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33245761

ABSTRACT

MoonProt 3.0 (http://moonlightingproteins.org) is an updated open-access database storing expert-curated annotations for moonlighting proteins. Moonlighting proteins have two or more physiologically relevant distinct biochemical or biophysical functions performed by a single polypeptide chain. Here, we describe an expansion in the database since our previous report in the Database Issue of Nucleic Acids Research in 2018. For this release, the number of proteins annotated has been expanded to over 500 proteins and dozens of protein annotations have been updated with additional information, including more structures in the Protein Data Bank, compared with version 2.0. The new entries include more examples from humans, plants and archaea, more proteins involved in disease and proteins with different combinations of functions. More kinds of information about the proteins and the species in which they have multiple functions has been added, including CATH and SCOP classification of structure, known and predicted disorder, predicted transmembrane helices, type of organism, relationship of the protein to disease, and relationship of organism to cause of disease.


Subject(s)
Databases, Protein , Proteins/chemistry , Humans , Molecular Sequence Annotation , SARS-CoV-2/metabolism
6.
Nucleic Acids Res ; 46(D1): D640-D644, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126295

ABSTRACT

MoonProt 2.0 (http://moonlightingproteins.org) is an updated, comprehensive and open-access database storing expert-curated annotations for moonlighting proteins. Moonlighting proteins contain two or more physiologically relevant distinct functions performed by a single polypeptide chain. Here, we describe developments in the MoonProt website and database since our previous report in the Database Issue of Nucleic Acids Research. For this V 2.0 release, we expanded the number of proteins annotated to 370 and modified several dozen protein annotations with additional or updated information, including more links to protein structures in the Protein Data Bank, compared with the previous release. The new entries include more examples from humans and several model organisms, more proteins involved in disease, and proteins with different combinations of functions. The updated web interface includes a search function using BLAST to enable users to search the database for proteins that share amino acid sequence similarity with a protein of interest. The updated website also includes additional background information about moonlighting proteins and an expanded list of links to published articles about moonlighting proteins.


Subject(s)
Databases, Protein , Amino Acid Sequence , Humans , Internet , Molecular Sequence Annotation , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Search Engine , Sequence Alignment , User-Computer Interface
7.
Nucleic Acids Res ; 43(Database issue): D277-82, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25324305

ABSTRACT

Moonlighting proteins comprise a class of multifunctional proteins in which a single polypeptide chain performs multiple biochemical functions that are not due to gene fusions, multiple RNA splice variants or pleiotropic effects. The known moonlighting proteins perform a variety of diverse functions in many different cell types and species, and information about their structures and functions is scattered in many publications. We have constructed the manually curated, searchable, internet-based MoonProt Database (http://www.moonlightingproteins.org) with information about the over 200 proteins that have been experimentally verified to be moonlighting proteins. The availability of this organized information provides a more complete picture of what is currently known about moonlighting proteins. The database will also aid researchers in other fields, including determining the functions of genes identified in genome sequencing projects, interpreting data from proteomics projects and annotating protein sequence and structural databases. In addition, information about the structures and functions of moonlighting proteins can be helpful in understanding how novel protein functional sites evolved on an ancient protein scaffold, which can also help in the design of proteins with novel functions.


Subject(s)
Databases, Protein , Proteins/chemistry , Proteins/physiology , Animals , Internet , Proteins/genetics