Results 1 - 20 of 22
1.
Front Genet ; 11: 82, 2020.
Article in English | MEDLINE | ID: mdl-32153642

ABSTRACT

Copy number variants are duplications and deletions of the genome that play an important role in phenotypic changes and human disease. Many software applications have been developed to detect copy number variants using either whole-genome sequencing or whole-exome sequencing data. However, there is poor agreement in the results from these applications. Simulated datasets containing copy number variants allow comprehensive comparisons of the operating characteristics of existing and novel copy number variant detection methods. Several software applications have been developed to simulate copy number variants and other structural variants in whole-genome sequencing data. However, none of the applications reliably simulate copy number variants in whole-exome sequencing data. We have developed and tested Simulator of Exome Copy Number Variants (SECNVs), a fast, robust and customizable software application for simulating copy number variants and whole-exome sequences from a reference genome. SECNVs is easy to install, implements a wide range of commands to customize simulations, can output multiple samples at once, and incorporates a pipeline to output rearranged genomes, short reads and BAM files in a single command. Variants generated by SECNVs are detected with high sensitivity and precision by tools commonly used to detect copy number variants. SECNVs is publicly available at https://github.com/YJulyXing/SECNVs.
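
As an aside for readers new to CNV simulation, the sketch below shows the basic rearrangement operation in a hypothetical `simulate_cnv` helper (illustrative only, not the SECNVs code); a real exome simulator must additionally respect exon boundaries and generate reads and alignments.

```python
import random

def simulate_cnv(reference, start, length, kind, copies=2):
    """Apply a deletion or tandem duplication to a reference sequence.

    Illustrative toy only (hypothetical helper, not the SECNVs code): a real
    exome CNV simulator must also respect exon boundaries, model GC content,
    and generate short reads and alignments from the rearranged genome.
    """
    segment = reference[start:start + length]
    if kind == "deletion":
        return reference[:start] + reference[start + length:]
    if kind == "duplication":
        return reference[:start + length] + segment * (copies - 1) + reference[start + length:]
    raise ValueError("kind must be 'deletion' or 'duplication'")

random.seed(1)
ref = "".join(random.choice("ACGT") for _ in range(200))
dup = simulate_cnv(ref, start=50, length=20, kind="duplication", copies=3)
print(len(ref), len(dup))  # 200 240: two extra copies of the 20 bp segment
```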

2.
Sci Rep ; 10(1): 2381, 2020 Feb 06.
Article in English | MEDLINE | ID: mdl-32024902

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

3.
Sci Rep ; 7(1): 16776, 2017 12 01.
Article in English | MEDLINE | ID: mdl-29196624

ABSTRACT

Dogs with X-linked hereditary nephropathy (XLHN) have a glomerular basement membrane defect that leads to progressive juvenile-onset renal failure. Their disease is analogous to Alport syndrome in humans, and they also serve as a good model of progressive chronic kidney disease (CKD). However, the gene expression profile that affects progression in this disease has only been partially characterized. To help fill this gap, we used RNA sequencing to identify differentially expressed genes (DEGs), over-represented pathways, and upstream regulators that contribute to kidney disease progression. Total RNA from kidney biopsies was isolated at 3 clinical time points from 3 males with rapidly progressing CKD, 3 males with slowly progressing CKD, and 2 age-matched controls. We identified 70 DEGs by comparing the rapid and slow groups at specific time points. Based on time course analysis, 1,947 DEGs were identified over the 3 time points, revealing upregulation of inflammatory pathways: integrin signaling, T cell activation, and chemokine and cytokine signaling pathways. T cell infiltration was verified by immunohistochemistry. TGF-β1 was identified as the primary upstream regulator. These results provide new insights into the underlying molecular mechanisms of disease progression in XLHN, and the identified DEGs are potential biomarkers and therapeutic targets translatable to all CKDs.


Subject(s)
Dog Diseases/pathology , Gene Regulatory Networks , Genetic Diseases, X-Linked/veterinary , Nephritis, Hereditary/veterinary , Renal Insufficiency, Chronic/etiology , Sequence Analysis, RNA/veterinary , Animals , Biopsy , Case-Control Studies , Disease Progression , Dog Diseases/genetics , Dogs , Gene Expression Profiling/veterinary , Gene Expression Regulation , Genetic Diseases, X-Linked/complications , Genetic Diseases, X-Linked/genetics , Genetic Diseases, X-Linked/pathology , Male , Molecular Sequence Annotation , Nephritis, Hereditary/complications , Nephritis, Hereditary/genetics , Nephritis, Hereditary/pathology , Renal Insufficiency, Chronic/genetics , Time Factors
4.
J Pers Med ; 4(1): 65-78, 2014 Mar 07.
Article in English | MEDLINE | ID: mdl-25562143

ABSTRACT

Knowledge of a patient's cardiac age, or "heart age", could prove useful to both patients and physicians for better encouraging lifestyle changes potentially beneficial for cardiovascular health. This may be particularly true for patients who exhibit symptoms but who test negative for cardiac pathology. We developed a statistical model, using a Bayesian approach, that predicts an individual's heart age based on his/her electrocardiogram (ECG). The model is tailored to healthy individuals, with no known risk factors, who are at least 20 years old and for whom a resting ~5 min 12-lead ECG has been obtained. We evaluated the model using a database of ECGs from 776 such individuals. Secondarily, we also applied the model to other groups of individuals who had received 5-min ECGs, including 221 with risk factors for cardiac disease, 441 with overt cardiac disease diagnosed by clinical imaging tests, and a smaller group of highly endurance-trained athletes. Model-related heart age predictions in healthy non-athletes tended to center around body age, whereas about three-fourths of the subjects with risk factors and nearly all patients with proven heart diseases had higher predicted heart ages than true body ages. The model also predicted somewhat higher heart ages than body ages in a majority of highly endurance-trained athletes, potentially consistent with possible fibrotic or other anomalies recently noted in such individuals.
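
A minimal sketch of the general idea, assuming a conjugate Gaussian (ridge-prior) linear regression on synthetic ECG-derived features; the published model's priors, features and fitting procedure are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: rows are healthy subjects, columns are
# ECG-derived features (e.g. heart rate, QRS duration); y is chronological age.
X = rng.normal(size=(776, 5))
y = 45 + X @ np.array([3.0, -2.0, 1.5, 0.5, -1.0]) + rng.normal(scale=8.0, size=776)

# Conjugate Bayesian linear regression with a Gaussian (ridge-like) prior.
Xd = np.column_stack([np.ones(len(X)), X])          # add intercept
tau2, sigma2 = 100.0, 64.0                          # assumed prior and noise variances
precision = Xd.T @ Xd / sigma2 + np.eye(Xd.shape[1]) / tau2
cov = np.linalg.inv(precision)
mean = cov @ Xd.T @ y / sigma2

# Posterior predictive "heart age" for a new ECG.
x_new = np.concatenate([[1.0], rng.normal(size=5)])
pred_mean = x_new @ mean
pred_sd = np.sqrt(sigma2 + x_new @ cov @ x_new)
print(f"predicted heart age {pred_mean:.1f} +/- {pred_sd:.1f} years")
```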

5.
BMC Bioinformatics ; 13 Suppl 16: S5, 2012.
Article in English | MEDLINE | ID: mdl-23176322

ABSTRACT

Shotgun proteomic data are affected by a variety of known and unknown systematic biases as well as high proportions of missing values. Typically, normalization is performed in an attempt to remove systematic biases from the data before statistical inference, sometimes followed by missing value imputation to obtain a complete matrix of intensities. Here we discuss several approaches to normalization and dealing with missing values, some initially developed for microarray data and some developed specifically for mass spectrometry-based data.
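
Two of the simpler approaches mentioned above can be sketched as follows (a toy illustration on synthetic log2 intensities, not the paper's recommended pipeline): aligning sample medians, and imputing left-censored values with half of each peptide's observed minimum.

```python
import numpy as np

def median_normalize(log_intensities):
    """Subtract each sample's median (columns = samples) so medians line up."""
    col_medians = np.nanmedian(log_intensities, axis=0)
    return log_intensities - col_medians + np.nanmean(col_medians)

def half_min_impute(log_intensities):
    """Replace missing intensities with half of each peptide's observed minimum,
    a crude stand-in for left-censored (below-detection) values."""
    filled = log_intensities.copy()
    for i in range(filled.shape[0]):
        row = filled[i]
        if np.isnan(row).any() and not np.isnan(row).all():
            row[np.isnan(row)] = np.nanmin(row) - 1.0  # minus 1 on log2 scale = half on raw scale
    return filled

rng = np.random.default_rng(42)
data = rng.normal(loc=20, scale=2, size=(1000, 6))
data[rng.random(data.shape) < 0.2] = np.nan        # ~20% missing values
print(half_min_impute(median_normalize(data)).shape)
```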


Subject(s)
Chromatography, Liquid/statistics & numerical data , Mass Spectrometry/statistics & numerical data , Proteomics/statistics & numerical data , Bias , Research Design
6.
Bioinformatics ; 28(18): 2404-6, 2012 Sep 15.
Article in English | MEDLINE | ID: mdl-22815360

ABSTRACT

MOTIVATION: The size and complex nature of mass spectrometry-based proteomics datasets motivate development of specialized software for statistical data analysis and exploration. We present DanteR, a graphical R package that features extensive statistical and diagnostic functions for quantitative proteomics data analysis, including normalization, imputation, hypothesis testing, interactive visualization and peptide-to-protein rollup. More importantly, users can easily extend the existing functionality by including their own algorithms under the Add-On tab. AVAILABILITY: DanteR and its associated user guide are available for download free of charge at http://omics.pnl.gov/software/. An updated binary of the DanteR package, together with a vignette, is available on our website. For Windows, a single click automatically installs DanteR along with the R programming environment. For Linux and Mac OS X, users must install R and then follow the instructions on the DanteR website for package installation. CONTACT: rds@pnnl.gov.


Subject(s)
Proteomics/methods , Software , Algorithms , Data Interpretation, Statistical , Mass Spectrometry , Proteins/metabolism
7.
Bioinformatics ; 28(15): 1998-2003, 2012 Aug 01.
Article in English | MEDLINE | ID: mdl-22628520

ABSTRACT

MOTIVATION: Protein abundance in quantitative proteomics is often based on observed spectral features derived from liquid chromatography mass spectrometry (LC-MS) or LC-MS/MS experiments. Peak intensities are largely non-normal in distribution. Furthermore, LC-MS-based proteomics data frequently have large proportions of missing peak intensities due to censoring mechanisms on low-abundance spectral features. Recognizing that the observed peak intensities detected with the LC-MS method are all positive, skewed and often left-censored, we propose using survival methodology to carry out differential expression analysis of proteins. Various standard statistical techniques, including non-parametric tests such as the Kolmogorov-Smirnov and Wilcoxon-Mann-Whitney rank-sum tests, and parametric survival and accelerated failure time (AFT) models with log-normal, log-logistic and Weibull distributions, were used to detect differentially expressed proteins. The statistical operating characteristics of each method are explored using both real and simulated datasets. RESULTS: Survival methods generally have greater statistical power than standard differential expression methods when the proportion of missing protein level data is 5% or more. In particular, the AFT models we consider consistently achieve greater statistical power than standard testing procedures, with the discrepancy widening as the proportion of missing values increases. AVAILABILITY: The testing procedures discussed in this article can all be performed using readily available software such as R. The R code is provided as supplemental material. CONTACT: ctekwe@stat.tamu.edu.
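
The contrast the paper draws can be sketched on synthetic data: standard nonparametric tests discard missing intensities, whereas a survival/AFT formulation treats them as left-censored observations. The snippet below (an assumption-laden toy, not the authors' code) runs the standard tests with SciPy and notes where the censored likelihood would enter.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical log-intensities for one protein in two conditions; np.nan marks
# peaks that fell below the detection limit (left-censored, not missing at random).
group_a = np.where(rng.random(12) < 0.2, np.nan, rng.normal(21.0, 1.0, 12))
group_b = np.where(rng.random(12) < 0.4, np.nan, rng.normal(20.0, 1.0, 12))

# Standard nonparametric tests can only use the observed values...
obs_a, obs_b = group_a[~np.isnan(group_a)], group_b[~np.isnan(group_b)]
print(stats.mannwhitneyu(obs_a, obs_b, alternative="two-sided"))
print(stats.ks_2samp(obs_a, obs_b))

# ...whereas a survival/AFT formulation keeps the censored observations: each
# missing peak contributes P(intensity < detection limit) to the likelihood
# instead of being dropped, which is where the extra power comes from.
```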


Subject(s)
Models, Statistical , Proteins/analysis , Proteomics/methods , Software , Statistics, Nonparametric , Chromatography, Liquid/methods , Computer Simulation , Diabetes Mellitus/metabolism , Humans , Likelihood Functions , Mass Spectrometry/methods , Tandem Mass Spectrometry/methods
8.
Bioinformatics ; 28(12): 1586-91, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22522136

ABSTRACT

MOTIVATION: Quantitative mass spectrometry-based proteomics involves statistical inference on protein abundance, based on the intensities of each protein's associated spectral peaks. However, typical MS-based proteomics datasets have substantial proportions of missing observations, due at least in part to censoring of low intensities. This complicates intensity-based differential expression analysis. RESULTS: We outline a statistical method for protein differential expression, based on a simple binomial likelihood. By modeling peak intensities as binary, in terms of 'presence/absence', we enable the selection of proteins not typically amenable to quantitative analysis; e.g. 'one-state' proteins that are present in one condition but absent in another. In addition, we present an analysis protocol that combines quantitative and presence/absence analysis of a given dataset in a principled way, resulting in a single list of selected proteins with a single associated false discovery rate. AVAILABILITY: All R code is available at: http://www.stat.tamu.edu/~adabney/share/xuan_code.zip.
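
A minimal sketch of a binomial presence/absence comparison (a generic likelihood-ratio test on detection counts, not necessarily the authors' exact model), showing how a 'one-state' protein is flagged:

```python
from scipy import stats

def presence_absence_lrt(present_a, total_a, present_b, total_b):
    """Likelihood-ratio test for a difference in detection probability between
    two conditions, using binomial likelihoods on presence/absence counts.
    Returns a p-value (chi-squared, 1 df). Illustrative only."""
    p_pool = (present_a + present_b) / (total_a + total_b)
    p_a, p_b = present_a / total_a, present_b / total_b

    def loglik(k, n, p):
        return stats.binom.logpmf(k, n, min(max(p, 1e-12), 1 - 1e-12))

    ll_null = loglik(present_a, total_a, p_pool) + loglik(present_b, total_b, p_pool)
    ll_alt = loglik(present_a, total_a, p_a) + loglik(present_b, total_b, p_b)
    return float(stats.chi2.sf(2 * (ll_alt - ll_null), df=1))

# A 'one-state' protein: detected in 9 of 10 runs in condition A, 0 of 10 in B.
print(presence_absence_lrt(9, 10, 0, 10))
```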


Subject(s)
Mass Spectrometry/methods , Proteins/analysis , Proteomics/methods , Computer Simulation , Humans , Likelihood Functions , Logistic Models
9.
Anal Chem ; 83(16): 6135-40, 2011 Aug 15.
Article in English | MEDLINE | ID: mdl-21692516

ABSTRACT

Current algorithms for quantifying peptide identification confidence in the accurate mass and time (AMT) tag approach assume that the AMT tags themselves have been correctly identified. However, there is uncertainty in the identification of AMT tags, because this is based on matching LC-MS/MS fragmentation spectra to peptide sequences. In this paper, we incorporate confidence measures for the AMT tag identifications into the calculation of probabilities for correct matches to an AMT tag database, resulting in a more accurate overall measure of identification confidence for the AMT tag approach. The method is referred to as Statistical Tools for AMT Tag Confidence (STAC). STAC additionally provides a uniqueness probability (UP) to help distinguish between multiple matches to an AMT tag, as well as a method to calculate an overall false discovery rate (FDR). STAC is freely available for download as both a command-line and a Windows graphical application.
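
As a generic illustration of turning per-match confidence into an error rate (not the STAC implementation), the expected FDR among matches accepted above a probability cutoff can be estimated as the mean of one minus the match probabilities:

```python
import numpy as np

def fdr_at_cutoff(match_probs, cutoff):
    """Estimate the FDR for matches accepted at or above a posterior-probability
    cutoff: the expected fraction of false matches is the average of
    (1 - probability) over the accepted matches. Generic calculation only."""
    accepted = match_probs[match_probs >= cutoff]
    return float(np.mean(1.0 - accepted)) if accepted.size else 0.0

rng = np.random.default_rng(3)
probs = rng.beta(8, 2, size=5000)       # hypothetical match confidence scores
for c in (0.5, 0.8, 0.95):
    print(c, round(fdr_at_cutoff(probs, c), 4))
```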


Subject(s)
Chromatography, Liquid/statistics & numerical data , Peptides/analysis , Proteomics/statistics & numerical data , Tandem Mass Spectrometry/statistics & numerical data , Algorithms , Chromatography, Liquid/standards , Databases, Protein , Models, Statistical , Peptides/chemistry , Probability , Proteomics/methods , Software , Tandem Mass Spectrometry/standards
10.
Ann Appl Stat ; 4(4): 1797-1823, 2010.
Article in English | MEDLINE | ID: mdl-21593992

ABSTRACT

Mass spectrometry-based proteomics has become the tool of choice for identifying and quantifying the proteome of an organism. Though recent years have seen a tremendous improvement in instrument performance and the computational tools used, significant challenges remain, and there are many opportunities for statisticians to make important contributions. In the most widely used "bottom-up" approach to proteomics, complex mixtures of proteins are first subjected to enzymatic cleavage, and the resulting peptide products are separated based on chemical or physical properties and analyzed using a mass spectrometer. The two fundamental challenges in the analysis of bottom-up MS-based proteomics are: (1) identifying the proteins that are present in a sample, and (2) quantifying the abundance levels of the identified proteins. Both of these challenges require knowledge of the biological and technological context that gives rise to observed data, as well as the application of sound statistical principles for estimation and inference. We present an overview of bottom-up proteomics and outline the key statistical issues that arise in protein identification and quantification.

11.
PLoS One ; 4(9): e7087, 2009 Sep 18.
Article in English | MEDLINE | ID: mdl-19763254

ABSTRACT

Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
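
The idea behind subject-level bootstrapping can be sketched with scikit-learn decision trees (a toy re-implementation of the concept, not the RF++ C++ code): each tree is grown on a bootstrap sample of subjects, keeping every replicate of a sampled subject together, and classifications are made per subject.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical replicated data: 20 subjects, 5 technical replicates each.
subjects = np.repeat(np.arange(20), 5)
labels_by_subject = np.array([i % 2 for i in range(20)])
y = labels_by_subject[subjects]
X = rng.normal(size=(100, 30)) + y[:, None] * 0.5

def subject_level_forest(X, y, subjects, n_trees=100):
    """Grow each tree on a bootstrap of *subjects* (keeping all of their
    replicates together) rather than of individual rows."""
    trees, unique_subjects = [], np.unique(subjects)
    for _ in range(n_trees):
        boot = rng.choice(unique_subjects, size=unique_subjects.size, replace=True)
        rows = np.concatenate([np.where(subjects == s)[0] for s in boot])
        trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[rows], y[rows]))
    return trees

def predict_subject(trees, X_subject):
    """Subject-level call: majority vote over trees and over the subject's replicates."""
    votes = np.mean([t.predict(X_subject) for t in trees])
    return int(votes >= 0.5)

forest = subject_level_forest(X, y, subjects)
print(predict_subject(forest, X[subjects == 3]))
```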


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Algorithms , Cluster Analysis , Computer Simulation , Models, Genetic , Models, Statistical , Software , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods
12.
Bioinformatics ; 25(19): 2573-80, 2009 Oct 01.
Article in English | MEDLINE | ID: mdl-19602524

ABSTRACT

MOTIVATION: LC-MS allows for the identification and quantification of proteins from biological samples. As with any high-throughput technology, systematic biases are often observed in LC-MS data, making normalization an important preprocessing step. Normalization models need to be flexible enough to capture biases of arbitrary complexity, while avoiding overfitting that would invalidate downstream statistical inference. Careful normalization of MS peak intensities would enable greater accuracy and precision in quantitative comparisons of protein abundance levels. RESULTS: We propose an algorithm, called EigenMS, that uses singular value decomposition to capture and remove biases from LC-MS peak intensity measurements. EigenMS is an adaptation of the surrogate variable analysis (SVA) algorithm of Leek and Storey, with the adaptations including (i) the handling of the widespread missing measurements that are typical in LC-MS, and (ii) a novel approach to preventing overfitting that facilitates the incorporation of EigenMS into an existing proteomics analysis pipeline. Using both large-scale calibration measurements and simulations, EigenMS is shown to perform well relative to existing alternatives. AVAILABILITY: The software has been made available in the open source proteomics platform DAnTE (Polpitiya et al., 2008) (http://omics.pnl.gov/software/), as well as in standalone software available at SourceForge (http://sourceforge.net).
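
A complete-data sketch of the SVD step, assuming synthetic log intensities and a known group structure; the published method additionally handles missing values and selects the number of bias trends automatically.

```python
import numpy as np

def svd_normalize(log_intensities, groups, n_bias=1):
    """Remove systematic bias trends with SVD, in the spirit of EigenMS/SVA:
    subtract group means so treatment effects are protected, take the top
    singular vectors of the residuals as bias trends, and remove them."""
    residuals = log_intensities.copy()
    for g in np.unique(groups):
        residuals[:, groups == g] -= residuals[:, groups == g].mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    bias = (u[:, :n_bias] * s[:n_bias]) @ vt[:n_bias, :]
    return log_intensities - bias

rng = np.random.default_rng(1)
# 500 peptides x 8 samples, plus a synthetic sample-dependent bias trend.
data = rng.normal(20, 1, size=(500, 8)) + np.outer(rng.normal(size=500), np.linspace(-1, 1, 8))
print(svd_normalize(data, groups=np.array([0, 0, 0, 0, 1, 1, 1, 1])).shape)
```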


Subject(s)
Computational Biology/methods , Proteins/chemistry , Proteome/analysis , Proteomics/methods , Databases, Protein , Mass Spectrometry , Software
13.
Bioinformatics ; 25(16): 2028-34, 2009 Aug 15.
Article in English | MEDLINE | ID: mdl-19535538

ABSTRACT

MOTIVATION: Quantitative mass spectrometry-based proteomics requires protein-level estimates and associated confidence measures. Challenges include the presence of low quality or incorrectly identified peptides and informative missingness. Furthermore, models are required for rolling peptide-level information up to the protein level. RESULTS: We present a statistical model that carefully accounts for informative missingness in peak intensities and allows unbiased, model-based, protein-level estimation and inference. The model is applicable to both label-based and label-free quantitation experiments. We also provide automated, model-based, algorithms for filtering of proteins and peptides as well as imputation of missing values. Two LC/MS datasets are used to illustrate the methods. In simulation studies, our methods are shown to achieve substantially more discoveries than standard alternatives. AVAILABILITY: The software has been made available in the open-source proteomics platform DAnTE (http://omics.pnl.gov/software/).
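
A simple peptide-to-protein rollup can be sketched as centering each peptide and taking per-protein medians (illustrative only; the paper's model-based rollup also addresses informative missingness and filtering):

```python
import numpy as np
import pandas as pd

def median_rollup(peptides, protein_col="protein"):
    """Roll peptide log-intensities up to the protein level: center each peptide
    on its own mean so peptides with different response levels are comparable,
    then take the median per protein and sample. A simple reference sketch."""
    samples = peptides.columns.drop(protein_col)
    scaled = peptides.copy()
    scaled[samples] = scaled[samples].sub(scaled[samples].mean(axis=1), axis=0)
    return scaled.groupby(protein_col)[list(samples)].median()

peps = pd.DataFrame({
    "protein": ["P1", "P1", "P2"],
    "s1": [20.1, 22.3, 18.0],
    "s2": [20.8, 23.0, np.nan],
})
print(median_rollup(peps))
```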


Subject(s)
Mass Spectrometry/methods , Proteins/analysis , Proteomics/methods , Databases, Protein , Models, Statistical , Proteome/analysis
14.
Anal Chem ; 80(3): 693-706, 2008 Feb 01.
Article in English | MEDLINE | ID: mdl-18163597

ABSTRACT

The high mass measurement accuracy and precision available with recently developed mass spectrometers is increasingly used in proteomics analyses to confidently identify tryptic peptides from complex mixtures of proteins, as well as post-translational modifications and peptides from nonannotated proteins. To take full advantage of high mass measurement accuracy instruments, it is necessary to limit systematic mass measurement errors. It is well known that errors in m/z measurements can be affected by experimental parameters that include, for example, outdated calibration coefficients, ion intensity, and temperature changes during the measurement. Traditionally, these variations have been corrected through the use of internal calibrants (well-characterized standards introduced with the sample being analyzed). In this paper, we describe an alternative approach where the calibration is provided through the use of a priori knowledge of the sample being analyzed. Such an approach has previously been demonstrated based on the dependence of systematic error on m/z alone. To incorporate additional explanatory variables, we employed multidimensional, nonparametric regression models, which were evaluated using several commercially available instruments. The applied approach is shown to remove any noticeable biases from the overall mass measurement errors and decreases the overall standard deviation of the mass measurement error distribution by 1.2- to 2-fold, depending on instrument type. Subsequent reduction of the random errors based on multiple measurements over consecutive spectra further improves accuracy and results in an overall decrease of the standard deviation by 1.8- to 3.7-fold. This new procedure will decrease the false discovery rates for peptide identifications using high-accuracy mass measurements.
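
In one dimension the recalibration idea reduces to fitting a smooth curve to the observed mass errors and subtracting it. The sketch below uses a LOWESS fit of error versus m/z on synthetic data; the paper's models are multidimensional (m/z, intensity, scan time) and were evaluated on real instruments.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(5)

# Hypothetical confidently matched peptides: the observed m/z error (ppm)
# drifts smoothly with m/z, plus random noise.
mz = rng.uniform(400, 2000, size=3000)
error_ppm = 2.0 * np.sin(mz / 600.0) + rng.normal(scale=1.0, size=3000)

# Nonparametric (LOWESS) fit of the systematic error as a function of m/z;
# subtracting the fit recalibrates the measurements.
systematic = lowess(error_ppm, mz, frac=0.2, return_sorted=False)
corrected = error_ppm - systematic
print(round(error_ppm.std(), 2), round(corrected.std(), 2))
```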


Subject(s)
Chromatography, Liquid/methods , Complex Mixtures/analysis , Mass Spectrometry/methods , Peptides/analysis , Regression Analysis , Trypsin/analysis , Algorithms , Calibration , Complex Mixtures/chemistry , False Positive Reactions , Models, Biological , Peptides/chemistry , Protein Processing, Post-Translational , Proteomics/methods , Reproducibility of Results , Sensitivity and Specificity , Trypsin/chemistry , Trypsin/metabolism
15.
PLoS One ; 2(10): e1002, 2007 Oct 03.
Article in English | MEDLINE | ID: mdl-17912341

ABSTRACT

Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
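
For context, the standard shrunken-centroid baseline that such methods are compared against can be run in a few lines (synthetic data; this is the per-gene univariate approach, not the joint feature selection proposed in the paper):

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical expression matrix: 200 samples x 500 genes, 20 informative genes.
X = rng.normal(size=(200, 500))
y = rng.integers(0, 2, size=200)
X[:, :20] += y[:, None] * 0.8

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline shrunken-centroid classifier: shrinkage removes uninformative genes
# one at a time, without considering how the selected subset works as a whole.
clf = NearestCentroid(shrink_threshold=0.2).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))
```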


Subject(s)
Data Interpretation, Statistical , Gene Expression Regulation, Neoplastic , Genetic Techniques , Genomics , Algorithms , Child , Discriminant Analysis , Gene Expression Profiling , Humans , Leukemia , Lymphoma/genetics , Models, Statistical , Models, Theoretical , Oligonucleotide Array Sequence Analysis , Pattern Recognition, Automated
16.
Genome Biol ; 8(3): R44, 2007.
Article in English | MEDLINE | ID: mdl-17391524

ABSTRACT

In normalizing two-channel expression arrays, the ANOVA approach explicitly incorporates the experimental design in its model, and the MA plot-based approach accounts for intensity-dependent biases. However, both approaches can lead to inaccurate normalization in fairly common scenarios. We propose a method called efficient Common Array Dye Swap (eCADS) for normalizing two-channel microarrays that accounts for both experimental design and intensity-dependent biases. Under reasonable experimental designs, eCADS preserves differential expression relationships and requires only a single array per sample pair.


Subject(s)
Microarray Analysis/statistics & numerical data , Models, Statistical , Research Design/standards , Microarray Analysis/standards , Research Design/statistics & numerical data
17.
Biostatistics ; 8(1): 128-39, 2007 Jan.
Article in English | MEDLINE | ID: mdl-16636140

ABSTRACT

A two-channel microarray measures the relative expression levels of thousands of genes from a pair of biological samples. In order to reliably compare gene expression levels between and within arrays, it is necessary to remove systematic errors that distort the biological signal of interest. The standard for accomplishing this is smoothing "MA-plots" to remove intensity-dependent dye bias and array-specific effects. However, MA methods require strong assumptions, which limit their general applicability. We review these assumptions and derive several practical scenarios in which they fail. The "dye-swap" normalization method has been much less frequently used because it requires two arrays per pair of samples. We show that a dye-swap is accurate under general assumptions, even under intensity-dependent dye bias, and that a dye-swap removes dye bias from a single pair of samples in general. Based on a flexible model of the relationship between mRNA amount and single-channel fluorescence intensity, we demonstrate the general applicability of a dye-swap approach. We then propose a common array dye-swap (CADS) method for the normalization of two-channel microarrays. We show that CADS removes both dye bias and array-specific effects, and preserves the true differential expression signal for every gene under the assumptions of the model.
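
The core property exploited here, that averaging over a dye swap cancels gene-specific dye bias regardless of its intensity dependence, can be checked numerically on a toy model (illustrative only, not the CADS estimator itself):

```python
import numpy as np

rng = np.random.default_rng(9)
n_genes = 1000
true_logratio = rng.normal(scale=0.5, size=n_genes)        # sample A vs sample B
dye_bias = 0.3 + 0.001 * rng.uniform(0, 400, size=n_genes)  # gene-specific dye effect

# Two arrays on the same sample pair with the dye assignment swapped:
# array 1 measures M1 = true + bias, array 2 measures M2 = -true + bias.
m_array1 = true_logratio + dye_bias + rng.normal(scale=0.1, size=n_genes)
m_array2 = -true_logratio + dye_bias + rng.normal(scale=0.1, size=n_genes)

# The dye-swap estimate (M1 - M2)/2 cancels the gene-specific dye bias exactly,
# whatever its dependence on intensity.
dye_swap_estimate = (m_array1 - m_array2) / 2.0
print(round(np.corrcoef(dye_swap_estimate, true_logratio)[0, 1], 3))
```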


Subject(s)
Data Interpretation, Statistical , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Computer Simulation , Fluorescent Dyes/chemistry , Gene Expression Profiling/methods , Humans , RNA, Messenger/chemistry , RNA, Messenger/genetics
18.
Genome Biol ; 7(3): 401, 2006.
Article in English | MEDLINE | ID: mdl-16563185

ABSTRACT

A response to "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset" by SE Choe, M Boutros, AM Michelson, GM Church and MS Halfon. Genome Biology 2005, 6:R16.


Subject(s)
Oligonucleotide Array Sequence Analysis/methods , Algorithms , Animals , Gene Expression Profiling/methods , RNA, Messenger/genetics , Reproducibility of Results , Research Design
19.
Bioinformatics ; 22(4): 507-8, 2006 Feb 15.
Article in English | MEDLINE | ID: mdl-16357033

ABSTRACT

EDGE (Extraction of Differential Gene Expression) is an open source, point-and-click software program for the significance analysis of DNA microarray experiments. EDGE can perform both standard and time course differential expression analysis. The functions are based on newly developed statistical theory and methods. This document introduces the EDGE software package.


Subject(s)
Algorithms , Computer Graphics , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Software , User-Computer Interface , Cluster Analysis
20.
Bioinformatics ; 22(1): 122-3, 2006 Jan 01.
Article in English | MEDLINE | ID: mdl-16269418

ABSTRACT

SUMMARY: ClaNC (classification to nearest centroids) is a simple and accurate method for classifying microarrays. This document introduces a point-and-click interface to the ClaNC methodology. The software is available as an R package. AVAILABILITY: ClaNC is freely available from http://students.washington.edu/adabney/clanc


Subject(s)
Oligonucleotide Array Sequence Analysis/methods , Algorithms , Cluster Analysis , Computer Graphics , Computer Simulation , Computers , Data Interpretation, Statistical , Gene Expression Profiling , Information Storage and Retrieval , Internet , Models, Statistical , Pattern Recognition, Automated , Programming Languages , Software , User-Computer Interface