Search | VHL Regional Portal

1.

Multiplexing the Identification of Microorganisms via Tandem Mass Tag Labeling Augmented by Interference Removal through a Novel Modification of the Expectation Maximization Algorithm.

Alves, Gelio; Ogurtsov, Aleksey Y; Porterfield, Harry; Maity, Tapan; Jenkins, Lisa M; Sacks, David B; Yu, Yi-Kuo.

J Am Soc Mass Spectrom ; 35(6): 1138-1155, 2024 Jun 05.

Article in English | MEDLINE | ID: mdl-38740383

ABSTRACT

Having fast, accurate, and broad spectrum methods for the identification of microorganisms is of paramount importance to public health, research, and safety. Bottom-up mass spectrometer-based proteomics has emerged as an effective tool for the accurate identification of microorganisms from microbial isolates. However, one major hurdle that limits the deployment of this tool for routine clinical diagnosis, and other areas of research such as culturomics, is the instrument time required for the mass spectrometer to analyze a single sample, which can take ∼1 h per sample, when using mass spectrometers that are presently used in most institutes. To address this issue, in this study, we employed, for the first time, tandem mass tags (TMTs) in multiplex identifications of microorganisms from multiple TMT-labeled samples in one MS/MS experiment. A difficulty encountered when using TMT labeling is the presence of interference in the measured intensities of TMT reporter ions. To correct for interference, we employed in the proposed method a modified version of the expectation maximization (EM) algorithm that redistributes the signal from ion interference back to the correct TMT-labeled samples. We have evaluated the sensitivity and specificity of the proposed method using 94 MS/MS experiments (covering a broad range of protein concentration ratios across TMT-labeled channels and experimental parameters), containing a total of 1931 true positive TMT-labeled channels and 317 true negative TMT-labeled channels. The results of the evaluation show that the proposed method has an identification sensitivity of 93-97% and a specificity of 100% at the species level.
Furthermore, as a proof of concept, using an in-house-generated data set composed of some of the most common urinary tract pathogens, we demonstrated that by using the proposed method the mass spectrometer time required per sample, using a 1 h LC-MS/MS run, can be reduced to 10 and 6 min when samples are labeled with TMT-6 and TMT-10, respectively. The proposed method can also be used along with Orbitrap mass spectrometers that have faster MS/MS acquisition rates, like the recently released Orbitrap Astral mass spectrometer, to further reduce the mass spectrometer time required per sample.

Subject(s)

Algorithms , Proteomics , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Proteomics/methods , Humans , Bacteria/isolation & purification , Bacteria/chemistry , Bacterial Proteins/analysis , Bacterial Proteins/chemistry , Bacterial Proteins/isolation & purification

2.

Systematic Assessment of Deep Learning-Based Predictors of Fragmentation Intensity Profiles.

Hamaneh, Mehdi B; Ogurtsov, Aleksey Y; Obolensky, Oleg I; Yu, Yi-Kuo.

J Proteome Res ; 23(6): 1983-1999, 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38728051

ABSTRACT

In recent years, several deep learning-based methods have been proposed for predicting peptide fragment intensities. This study aims to provide a comprehensive assessment of six such methods, namely Prosit, DeepMass:Prism, pDeep3, AlphaPeptDeep, Prosit Transformer, and the method proposed by Guan et al. To this end, we evaluated the accuracy of the predicted intensity profiles for close to 1.7 million precursors (including both tryptic and HLA peptides) corresponding to more than 18 million experimental spectra procured from 40 independent submissions to the PRIDE repository that were acquired for different species using a variety of instruments and different dissociation types/energies. Specifically, for each method, distributions of similarity (measured by Pearson's correlation and normalized angle) between the predicted and the corresponding experimental b and y fragment intensities were generated. These distributions were used to ascertain the prediction accuracy and rank the prediction methods for particular types of experimental conditions. The effect of variables like precursor charge, length, and collision energy on the prediction accuracy was also investigated. In addition to prediction accuracy, the methods were evaluated in terms of prediction speed. The systematic assessment of these six methods may help in choosing the right method for MS/MS spectra prediction for particular needs.
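The two similarity measures named in the abstract can be illustrated with a minimal sketch. It assumes one common definition of the normalized angle, 1 - 2*arccos(cosine similarity)/pi, which may differ in detail from the one used in the paper; the function names and the intensity values are hypothetical and not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def normalized_angle(x, y):
    """Assumed definition: 1 - 2*arccos(cosine similarity)/pi.

    Gives 1 for parallel vectors and 0 for orthogonal ones.
    """
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos = max(-1.0, min(1.0, dot / (nx * ny)))
    return 1.0 - 2.0 * math.acos(cos) / math.pi

# Hypothetical b/y fragment intensities for one precursor.
predicted = [0.10, 0.45, 1.00, 0.30, 0.05]
experimental = [0.12, 0.40, 1.00, 0.35, 0.00]
print(pearson(predicted, experimental))
print(normalized_angle(predicted, experimental))
```

Collecting either score over many precursors gives a distribution of the kind the abstract describes for ranking prediction methods.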

Subject(s)

Deep Learning , Humans , Peptide Fragments/chemistry , Peptide Fragments/analysis , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data , Proteomics/methods , Proteomics/statistics & numerical data

3.

Impact of Vaccination Rates and Gross Domestic Product on COVID-19 Pandemic Mortality Across United States.

Matveeva, Olga; Ogurtsov, Aleksey Y; Shabalina, Svetlana A.

medRxiv ; 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38313291

ABSTRACT

Objective: To investigate the relationship between vaccination rates and excess mortality during distinct waves of SARS-CoV-2 variant-specific infections, while considering a state's GDP per capita. Methods: We ranked U.S. states by vaccination rates and GDP and employed the CDC's excess mortality model for regression and odds ratio analysis. Results: Regression analysis reveals that both vaccination and GDP are significant factors related to mortality when considering the entire U.S. population. Notably, in wealthier states (with GDP above $65,000), excess mortality is primarily driven by slow vaccination rates, while in less affluent states, low GDP plays a major role. Odds ratio analysis demonstrates an almost twofold increase in mortality linked to the Delta and Omicron BA.1 virus variants in states with the slowest vaccination rates compared to those with the fastest (OR 1.8, 95% CI 1.7-1.9, p < 0.01). However, this gap disappeared in the post-Omicron BA.1 period. Conclusion: The interplay between slow vaccination and low GDP per capita drives high mortality.
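As a hedged illustration of how an odds ratio and its 95% confidence interval are commonly obtained, the sketch below uses the standard 2x2-table formula with a normal approximation on the log scale. The counts are hypothetical, and the study's own analysis, which is based on the CDC excess mortality model, may have been set up differently.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a, b: events / non-events in the exposed group (e.g. slowest vaccination)
    c, d: events / non-events in the reference group (e.g. fastest vaccination)
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) (Woolf's method).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(odds_ratio) - z * se)
    high = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, low, high

# Hypothetical counts, not data from the study.
print(odds_ratio_ci(180, 820, 100, 900))
```

An interval that excludes 1 indicates an association at the chosen confidence level.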

4.

Uncertainty-aware and interpretable evaluation of Cas9-gRNA and Cas12a-gRNA specificity for fully matched and partially mismatched targets with Deep Kernel Learning.

Kirillov, Bogdan; Savitskaya, Ekaterina; Panov, Maxim; Ogurtsov, Aleksey Y; Shabalina, Svetlana A; Koonin, Eugene V; Severinov, Konstantin V.

Nucleic Acids Res ; 50(2): e11, 2022 01 25.

Article in English | MEDLINE | ID: mdl-34791389

ABSTRACT

The choice of guide RNA (gRNA) for CRISPR-based gene targeting is an essential step in gene editing applications, but the prediction of gRNA specificity remains challenging. Lack of transparency and focus on point estimates of efficiency disregarding the information on possible error sources in the model limit the power of existing Deep Learning-based methods. To overcome these problems, we present a new approach, a hybrid of Capsule Networks and Gaussian Processes. Our method predicts the cleavage efficiency of a gRNA with a corresponding confidence interval, which allows the user to incorporate information regarding possible model errors into the experimental design. We provide the first utilization of uncertainty estimation in computational gRNA design, which is a critical step toward accurate decision-making for future CRISPR applications. The proposed solution demonstrates acceptable confidence intervals for most test sets and shows regression quality similar to existing models. We introduce a set of criteria for gRNA selection based on off-target cleavage efficiency and its variance and present a collection of pre-computed gRNAs for human chromosome 22. Using Neural Network Interpretation methods, we show that our model rediscovers an established biological factor underlying cleavage efficiency, the importance of the seed region in gRNA.

Subject(s)

CRISPR-Cas Systems , Deep Learning , Gene Editing , Gene Targeting , RNA, Guide, Kinetoplastida/genetics , Algorithms , Gene Editing/methods , Gene Targeting/methods , Genomics/methods , Humans , Neural Networks, Computer , Reproducibility of Results

5.

The genomic structure of a human chromosome 22 nucleolar organizer region determined by TAR cloning.

Kim, Jung-Hyun; Noskov, Vladimir N; Ogurtsov, Aleksey Y; Nagaraja, Ramaiah; Petrov, Nikolai; Liskovykh, Mikhail; Walenz, Brian P; Lee, Hee-Sheung; Kouprina, Natalay; Phillippy, Adam M; Shabalina, Svetlana A; Schlessinger, David; Larionov, Vladimir.

Sci Rep ; 11(1): 2997, 2021 02 04.

Article in English | MEDLINE | ID: mdl-33542373

ABSTRACT

The rDNA clusters and flanking sequences on human chromosomes 13, 14, 15, 21 and 22 represent large gaps in the current genomic assembly. The organization and the degree of divergence of the human rDNA units within an individual nucleolar organizer region (NOR) are only partially known. To address this lacuna, we previously applied transformation-associated recombination (TAR) cloning to isolate individual rDNA units from chromosome 21. That approach revealed an unexpectedly high level of heterogeneity in human rDNA, raising the possibility of corresponding variations in ribosome dynamics. We have now applied the same strategy to analyze an entire rDNA array end-to-end from a copy of chromosome 22. Sequencing of TAR isolates provided the entire NOR sequence, including proximal and distal junctions that may be involved in nucleolar function. Comparison of the newly sequenced rDNAs to reference sequence for chromosomes 22 and 21 revealed variants that are shared in human rDNA in individuals from different ethnic groups, many of them at high frequency. Analysis infers comparable intra- and inter-individual divergence of rDNA units on the same and different chromosomes, supporting the concerted evolution of rDNA units. The results provide a route to investigate further the role of rDNA variation in nucleolar formation and in the empirical associations of nucleoli with pathology.

Subject(s)

Chromosomes, Human, Pair 22/genetics , DNA, Ribosomal/genetics , Genome, Human/genetics , Nucleolus Organizer Region/genetics , Cell Nucleolus/genetics , Cloning, Molecular , Genetic Heterogeneity , Genomics , Humans , Molecular Sequence Annotation , Ribosomes/genetics

6.

RAId: Knowledge-Integrated Proteomics Web Service with Accurate Statistical Significance Assignment.

Ogurtsov, Aleksey Y; Alves, Gelio; Yu, Yi-Kuo.

Proteomics ; 19(14): e1800367, 2019 07.

Article in English | MEDLINE | ID: mdl-30908818

ABSTRACT

Mass spectrometry-based proteomics starts with identifications of peptides and proteins, which provide the bases for forming the next-level hypotheses whose "validations" are often employed for forming even higher level hypotheses and so forth. Scientifically meaningful conclusions are thus attainable only if the number of falsely identified peptides/proteins is accurately controlled. For this reason, RAId continued to be developed in the past decade. RAId employs rigorous statistics for peptides/proteins identification, hence assigning accurate P-values/E-values that can be used confidently to control the number of falsely identified peptides and proteins. The RAId web service is a versatile tool built to identify peptides and proteins from tandem mass spectrometry data. Not only recognizing various spectra file formats, the web service also allows four peptide scoring functions and choice of three statistical methods for assigning P-values/E-values to identified peptides. Users may upload their own protein database or use one of the available knowledge integrated organismal databases that contain annotated information such as single amino acid polymorphisms, post-translational modifications, and their disease associations. The web service also provides a friendly interface to display, sort using different criteria, and download the identified peptides and proteins. RAId web service is freely available at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid.

Subject(s)

Databases, Protein , Mass Spectrometry/methods , Proteomics/methods , Computational Biology

7.

Sequence characteristics define trade-offs between on-target and genome-wide off-target hybridization of oligoprobes.

Matveeva, Olga V; Ogurtsov, Aleksey Y; Nazipova, Nafisa N; Shabalina, Svetlana A.

PLoS One ; 13(6): e0199162, 2018.

Article in English | MEDLINE | ID: mdl-29928000

ABSTRACT

Off-target oligoprobe's interaction with partially complementary nucleotide sequences represents a problem for many bio-techniques. The goal of the study was to identify oligoprobe sequence characteristics that control the ratio between on-target and off-target hybridization. To understand the complex interplay between specific and genome-wide off-target (cross-hybridization) signals, we analyzed a database derived from genomic comparison hybridization experiments performed with an Affymetrix tiling array. The database included two types of probes with signals derived from (i) a combination of specific signal and cross-hybridization and (ii) genomic cross-hybridization only. All probes from the database were grouped into bins according to their sequence characteristics, where both hybridization signals were averaged separately. For selection of specific probes, we analyzed the following sequence characteristics: vulnerability to self-folding, nucleotide composition bias, numbers of G nucleotides and GGG-blocks, and occurrence of probe's k-mers in the human genome. Increases in bin ranges for these characteristics are simultaneously accompanied by a decrease in hybridization specificity, that is, the ratio between specific and cross-hybridization signals. However, both averaged hybridization signals exhibit growing trends along with an increase of probes' binding energy, where the hybridization specific signal increases significantly faster in comparison to the cross-hybridization. The same trend is evident for the S function, which serves as a combined evaluation of probe binding energy and occurrence of probe's k-mers in the genome. Application of S allows extracting a larger number of specific probes, as compared to using only binding energy. Thus, we showed that high values of specific and cross-hybridization signals are not mutually exclusive for probes with high values of binding energy and S.
In this study, the application of a new set of sequence characteristics allows detection of probes that are highly specific to their targets for array design and other bio-techniques that require selection of specific probes.

Subject(s)

Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis/methods , Base Sequence , Databases, Genetic , Genome , Humans

8.

Rapid Classification and Identification of Multiple Microorganisms with Accurate Statistical Significance via High-Resolution Tandem Mass Spectrometry.

Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y; Drake, Steven K; Gucek, Marjan; Sacks, David B; Yu, Yi-Kuo.

J Am Soc Mass Spectrom ; 29(8): 1721-1737, 2018 Aug.

Article in English | MEDLINE | ID: mdl-29873019

ABSTRACT

Rapid and accurate identification and classification of microorganisms is of paramount importance to public health and safety. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is complicating correct microbial identification even in a simple sample due to the large number of candidates present. To properly untwine candidate microbes in samples containing one or more microbes, one needs to go beyond apparent morphology or simple "fingerprinting"; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptide-centric representations of microbes to better separate them and by augmenting our earlier analysis method that yields accurate statistical significance. Here, we present an updated analysis workflow that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using 226 MS/MS publicly available data files (each containing from 2500 to nearly 100,000 MS/MS spectra) and 4000 additional MS/MS data files, that the updated workflow can correctly identify multiple microbes at the genus and often the species level for samples containing more than one microbe. We have also shown that the proposed workflow computes accurate statistical significances, i.e., E values for identified peptides and unified E values for identified microbes. Our updated analysis workflow MiCId, a freely available software for Microorganism Classification and Identification, is available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.

9.

Variation in human chromosome 21 ribosomal RNA genes characterized by TAR cloning and long-read sequencing.

Kim, Jung-Hyun; Dilthey, Alexander T; Nagaraja, Ramaiah; Lee, Hee-Sheung; Koren, Sergey; Dudekula, Dawood; Wood Iii, William H; Piao, Yulan; Ogurtsov, Aleksey Y; Utani, Koichi; Noskov, Vladimir N; Shabalina, Svetlana A; Schlessinger, David; Phillippy, Adam M; Larionov, Vladimir.

Nucleic Acids Res ; 46(13): 6712-6725, 2018 07 27.

Article in English | MEDLINE | ID: mdl-29788454

ABSTRACT

Despite the key role of the human ribosome in protein biosynthesis, little is known about the extent of sequence variation in ribosomal DNA (rDNA) or its pre-rRNA and rRNA products. We recovered ribosomal DNA segments from a single human chromosome 21 using transformation-associated recombination (TAR) cloning in yeast. Accurate long-read sequencing of 13 isolates covering ∼0.82 Mb of the chromosome 21 rDNA complement revealed substantial variation among tandem repeat rDNA copies, several palindromic structures and potential errors in the previous reference sequence. These clones revealed 101 variant positions in the 45S transcription unit and 235 in the intergenic spacer sequence. Approximately 60% of the 45S variants were confirmed in independent whole-genome or RNA-seq data, with 47 of these further observed in mature 18S/28S rRNA sequences. TAR cloning and long-read sequencing enabled the accurate reconstruction of multiple rDNA units and a new, high-quality 44 838 bp rDNA reference sequence, which we have annotated with variants detected from chromosome 21 of a single individual. The large number of variants observed reveal heterogeneity in human rDNA, opening up the possibility of corresponding variations in ribosome dynamics.

Subject(s)

Chromosomes, Human, Pair 21 , DNA, Ribosomal/chemistry , Genes, rRNA , Genetic Variation , Animals , Cell Line , Cloning, Molecular , DNA, Ribosomal/isolation & purification , DNA, Ribosomal Spacer/chemistry , Humans , Mice , Nucleic Acid Conformation , Nucleolus Organizer Region/chemistry , RNA, Ribosomal/chemistry , RNA, Ribosomal/metabolism , Sequence Analysis, DNA

10.

Adaptation of mRNA structure to control protein folding.

Faure, Guilhem; Ogurtsov, Aleksey Y; Shabalina, Svetlana A; Koonin, Eugene V.

RNA Biol ; 14(12): 1649-1654, 2017 12 02.

Article in English | MEDLINE | ID: mdl-28722509

ABSTRACT

Comparison of mRNA and protein structures shows that highly structured mRNAs typically encode compact protein domains suggesting that mRNA structure controls protein folding. This function is apparently performed by distinct structural elements in the mRNA, which implies 'fine tuning' of mRNA structure under selection for optimal protein folding. We find that, during evolution, changes in the mRNA folding energy follow amino acid replacements, reinforcing the notion of an intimate connection between the structures of a mRNA and the protein it encodes, and the double encoding of protein sequence and folding in the mRNA.

Subject(s)

Adaptation, Biological , Nucleic Acid Conformation , Protein Biosynthesis , Protein Folding , RNA, Messenger/chemistry , RNA, Messenger/genetics , Animals , Biological Evolution , Humans , RNA Stability , Selection, Genetic , Structure-Activity Relationship

11.

Optimization of signal-to-noise ratio for efficient microarray probe design.

Matveeva, Olga V; Nechipurenko, Yury D; Riabenko, Evgeniy; Ragan, Chikako; Nazipova, Nafisa N; Ogurtsov, Aleksey Y; Shabalina, Svetlana A.

Bioinformatics ; 32(17): i552-i558, 2016 09 01.

Article in English | MEDLINE | ID: mdl-27587674

ABSTRACT

MOTIVATION: Target-specific hybridization depends on oligo-probe characteristics that improve hybridization specificity and minimize genome-wide cross-hybridization. Interplay between specific hybridization and genome-wide cross-hybridization has been insufficiently studied, despite its crucial role in efficient probe design and in data analysis. RESULTS: In this study, we defined hybridization specificity as a ratio between oligo target-specific hybridization and oligo genome-wide cross-hybridization. A microarray database, derived from the Genomic Comparison Hybridization (GCH) experiment and performed using the Affymetrix platform, contains two different types of probes. The first type of oligo-probes does not have a specific target on the genome and their hybridization signals are derived from genome-wide cross-hybridization alone. The second type includes oligonucleotides that have a specific target on the genomic DNA and their signals are derived from specific and cross-hybridization components combined together in a total signal. A comparative analysis of hybridization specificity of oligo-probes, as well as their nucleotide sequences and thermodynamic features was performed on the database. The comparison has revealed that hybridization specificity was negatively affected by low stability of the fully-paired oligo-target duplex, stable probe self-folding, G-rich content, including GGG motifs, low sequence complexity and nucleotide composition symmetry. CONCLUSION: Filtering out the probes with defined 'negative' characteristics significantly increases specific hybridization and dramatically decreases genome-wide cross-hybridization. Selected oligo-probes have two times higher hybridization specificity on average, compared to the probes that were filtered from the analysis by applying suggested cutoff thresholds to the described parameters. A new approach for efficient oligo-probe design is described in our study.
CONTACT: shabalin@ncbi.nlm.nih.gov or olga.matveeva@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genome , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis , Signal-To-Noise Ratio , DNA Probes , Gene Expression Profiling , Genomics , Oligonucleotides , Sensitivity and Specificity

12.

Role of mRNA structure in the control of protein folding.

Faure, Guilhem; Ogurtsov, Aleksey Y; Shabalina, Svetlana A; Koonin, Eugene V.

Nucleic Acids Res ; 44(22): 10898-10911, 2016 12 15.

Article in English | MEDLINE | ID: mdl-27466388

ABSTRACT

Specific structures in mRNA modulate translation rate and thus can affect protein folding. Using the protein structures from two eukaryotes and three prokaryotes, we explore the connections between the protein compactness, inferred from solvent accessibility, and mRNA structure, inferred from mRNA folding energy (ΔG). In both prokaryotes and eukaryotes, the ΔG value of the most stable 30 nucleotide segment of the mRNA (ΔGmin) strongly, positively correlates with protein solvent accessibility. Thus, mRNAs containing exceptionally stable secondary structure elements typically encode compact proteins. The correlations between ΔG and protein compactness are much more pronounced in predicted ordered parts of proteins compared to the predicted disordered parts, indicative of an important role of mRNA secondary structure elements in the control of protein folding. Additionally, ΔG correlates with the mRNA length and the evolutionary rate of synonymous positions. The correlations are partially independent and were used to construct multiple regression models which explain about half of the variance of protein solvent accessibility. These findings suggest a model in which the mRNA structure, particularly exceptionally stable RNA structural elements, act as gauges of protein co-translational folding by reducing ribosome speed when the nascent peptide needs time to form and optimize the core structure.

Subject(s)

Protein Folding , RNA, Messenger/physiology , Animals , Base Composition , Humans , Kinetics , Linear Models , Models, Molecular , Nucleic Acid Conformation , Protein Biosynthesis , Protein Structure, Secondary , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , RNA Stability , RNA, Messenger/chemistry , Thermodynamics , Transcriptome

13.

Identification of Microorganisms by High Resolution Tandem Mass Spectrometry with Accurate Statistical Significance.

Alves, Gelio; Wang, Guanghui; Ogurtsov, Aleksey Y; Drake, Steven K; Gucek, Marjan; Suffredini, Anthony F; Sacks, David B; Yu, Yi-Kuo.

J Am Soc Mass Spectrom ; 27(2): 194-210, 2016 Feb.

Article in English | MEDLINE | ID: mdl-26510657

ABSTRACT

Correct and rapid identification of microorganisms is the key to the success of many important applications in health and safety, including, but not limited to, infection treatment, food safety, and biodefense. With the advance of mass spectrometry (MS) technology, the speed of identification can be greatly improved. However, the increasing number of microbes sequenced is challenging correct microbial identification because of the large number of choices present. To properly disentangle candidate microbes, one needs to go beyond apparent morphology or simple 'fingerprinting'; to correctly prioritize the candidate microbes, one needs to have accurate statistical significance in microbial identification. We meet these challenges by using peptidome profiles of microbes to better separate them and by designing an analysis method that yields accurate statistical significance. Here, we present an analysis pipeline that uses tandem MS (MS/MS) spectra for microbial identification or classification. We have demonstrated, using MS/MS data of 81 samples, each composed of a single known microorganism, that the proposed pipeline can correctly identify microorganisms at least at the genus and species levels. We have also shown that the proposed pipeline computes accurate statistical significances, i.e., E-values for identified peptides and unified E-values for identified microorganisms. The proposed analysis pipeline has been implemented in MiCId, a freely available software for Microorganism Classification and Identification. MiCId is available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.

Subject(s)

Bacteria/classification , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data , Bacteria/chemistry , Databases, Factual , Escherichia coli/classification , Peptides/analysis , Peptides/chemistry , Pseudomonas aeruginosa/classification , Software

14.

Evolution at protein ends: major contribution of alternative transcription initiation and termination to the transcriptome and proteome diversity in mammals.

Shabalina, Svetlana A; Ogurtsov, Aleksey Y; Spiridonov, Nikolay A; Koonin, Eugene V.

Nucleic Acids Res ; 42(11): 7132-44, 2014 Jun.

Article in English | MEDLINE | ID: mdl-24792168

ABSTRACT

Alternative splicing (AS), alternative transcription initiation (ATI) and alternative transcription termination (ATT) create the extraordinary complexity of transcriptomes and make key contributions to the structural and functional diversity of mammalian proteomes. Analysis of mammalian genomic and transcriptomic data shows that contrary to the traditional view, the joint contribution of ATI and ATT to the transcriptome and proteome diversity is quantitatively greater than the contribution of AS. Although the mean numbers of protein-coding constitutive and alternative nucleotides in gene loci are nearly identical, their distribution along the transcripts is highly non-uniform. On average, coding exons in the variable 5' and 3' transcript ends that are created by ATI and ATT contain approximately four times more alternative nucleotides than core protein-coding regions that diversify exclusively via AS. Short upstream exons that encompass alternative 5'-untranslated regions and N-termini of proteins evolve under strong nucleotide-level selection whereas in 3'-terminal exons that encode protein C-termini, protein-level selection is significantly stronger. The groups of genes that are subject to ATI and ATT show major differences in biological roles, expression and selection patterns.

Subject(s)

Evolution, Molecular , Protein Isoforms/genetics , Transcription Initiation, Genetic , Transcription Termination, Genetic , Animals , Genetic Variation , Humans , Mice , Proteome , Transcriptome

15.

Molecular Isotopic Distribution Analysis (MIDAs) with adjustable mass accuracy.

Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo.

J Am Soc Mass Spectrom ; 25(1): 57-70, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24254576

ABSTRACT

In this paper, we present Molecular Isotopic Distribution Analysis (MIDAs), a new software tool designed to compute molecular isotopic distributions with adjustable accuracies. MIDAs offers two algorithms, one polynomial-based and one Fourier-transform-based, both of which compute molecular isotopic distributions accurately and efficiently. The polynomial-based algorithm contains few novel aspects, whereas the Fourier-transform-based algorithm consists mainly of improvements to other existing Fourier-transform-based algorithms. We have benchmarked the performance of the two algorithms implemented in MIDAs with that of eight software packages (BRAIN, Emass, Mercury, Mercury5, NeutronCluster, Qmass, JFC, IC) using a consensus set of benchmark molecules. Under the proposed evaluation criteria, MIDAs's algorithms, JFC, and Emass compute with comparable accuracy the coarse-grained (low-resolution) isotopic distributions and are more accurate than the other software packages. For fine-grained isotopic distributions, we compared IC, MIDAs's polynomial algorithm, and MIDAs's Fourier transform algorithm. Among the three, IC and MIDAs's polynomial algorithm compute isotopic distributions that better resemble their corresponding exact fine-grained (high-resolution) isotopic distributions. MIDAs can be accessed freely through a user-friendly web-interface at http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/midas/index.html.
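The general idea behind a polynomial-based computation of a coarse-grained isotopic distribution can be sketched as follows. This is a generic textbook-style illustration and not MIDAs's actual algorithm: each element is treated as a polynomial whose coefficients are isotope abundances indexed by extra nucleons, and the molecular distribution is the product of those polynomials. The abundance values are approximate natural abundances, and all function names are hypothetical.

```python
# Approximate natural isotope abundances, keyed by extra nucleons
# relative to the lightest isotope (assumed values for illustration).
ISOTOPES = {
    "C": {0: 0.9893, 1: 0.0107},
    "H": {0: 0.999885, 1: 0.000115},
    "N": {0: 0.99636, 1: 0.00364},
    "O": {0: 0.99757, 1: 0.00038, 2: 0.00205},
}

def convolve(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def element_poly(symbol):
    """Coefficient list for one atom of the given element."""
    iso = ISOTOPES[symbol]
    poly = [0.0] * (max(iso) + 1)
    for shift, abundance in iso.items():
        poly[shift] = abundance
    return poly

def poly_pow(p, n):
    """Raise a polynomial to the n-th power by repeated squaring."""
    result = [1.0]
    base = p
    while n > 0:
        if n & 1:
            result = convolve(result, base)
        n >>= 1
        if n:
            base = convolve(base, base)
    return result

def coarse_isotopic_distribution(formula, max_peaks=6):
    """Probabilities of the M, M+1, M+2, ... nominal-mass peaks."""
    dist = [1.0]
    for symbol, count in formula.items():
        dist = convolve(dist, poly_pow(element_poly(symbol), count))
    return dist[:max_peaks]

# Glucose, C6H12O6.
print(coarse_isotopic_distribution({"C": 6, "H": 12, "O": 6}))
```

Peaks here are grouped by nominal mass only; a fine-grained (high-resolution) distribution of the kind discussed in the abstract would also need to track exact isotope masses.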

Subject(s)

Isotopes/chemistry , Mass Spectrometry/methods , Software , Algorithms , Internet , Molecular Weight , Proteomics

16.

Optimized models for design of efficient miR30-based shRNAs.

Matveeva, Olga V; Nazipova, Nafisa N; Ogurtsov, Aleksey Y; Shabalina, Svetlana A.

Front Genet ; 3: 163, 2012.

Article in English | MEDLINE | ID: mdl-22952469

ABSTRACT

Small hairpin RNAs (shRNAs) became an important research tool in cell biology. Reliable design of these molecules is essential for the needs of large functional genomics projects. To optimize the design of efficient shRNAs, we performed comparative, thermodynamic, and correlation analyses of ~18,000 miR30-based shRNAs with known functional efficiencies, derived from the Sensor Assay project (Fellmann et al., 2011). We identified features of the shRNA guide strand that significantly correlate with the silencing efficiency and performed multiple regression analysis, using 4/5 of the data for training purposes and 1/5 for cross validation. A model that included the position-dependent nucleotide preferences was predictive in the cross-validation data subset (R = 0.39). However, a model, which in addition to the nucleotide preferences included thermodynamic shRNA features such as a thermodynamic duplex stability and position-dependent thermodynamic profile (dinucleotide free energy) was performing better (R = 0.53). Software "miR_Scan" was developed based upon the optimized models. Calculated mRNA target secondary structure stability showed correlation with shRNA silencing efficiency but failed to improve the model. Correlation analysis demonstrates that our algorithm for identification of efficient miR30-based shRNA molecules performs better than approaches that were developed for design of chemically synthesized siRNAs (R(max) = 0.36).

17.

Assigning statistical significance to proteotypic peptides via database searches.

Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo.

J Proteomics ; 74(2): 199-211, 2011 Feb 01.

Article in English | MEDLINE | ID: mdl-21055489

ABSTRACT

Querying MS/MS spectra against a database containing only proteotypic peptides reduces data analysis time due to reduction of database size. Despite the speed advantage, this search strategy is challenged by issues of statistical significance and coverage. The former requires separating systematically significant identifications from less confident identifications, while the latter arises when the underlying peptide is not present, due to single amino acid polymorphisms (SAPs) or post-translational modifications (PTMs), in the proteotypic peptide libraries searched. To address both issues simultaneously, we have extended RAId's knowledge database to include proteotypic information, utilized RAId's statistical strategy to assign statistical significance to proteotypic peptides, and modified RAId's programs to allow for consideration of proteotypic information during database searches. The extended database alleviates the coverage problem since all annotated modifications, even those that occurred within proteotypic peptides, may be considered. Taking into account the likelihoods of observation, the statistical strategy of RAId provides accurate E-value assignments regardless of whether a candidate peptide is proteotypic or not. The advantage of including proteotypic information is evidenced by its superior retrieval performance when compared to regular database searches.

Subject(s)

Data Interpretation, Statistical , Databases, Protein , Peptides/analysis , Protein Hydrolysates/analysis , Proteomics/methods , Information Storage and Retrieval , Peptides/chemistry , Protein Hydrolysates/chemistry , Protein Hydrolysates/metabolism , Trypsin/metabolism

18.

RAId_aPS: MS/MS analysis with multiple scoring functions and spectrum-specific statistics.

Alves, Gelio; Ogurtsov, Aleksey Y; Yu, Yi-Kuo.

PLoS One ; 5(11): e15438, 2010 Nov 16.

Article in English | MEDLINE | ID: mdl-21103371

ABSTRACT

Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.

Subject(s)

Algorithms , Mass Spectrometry/methods , Peptides/analysis , Computational Biology/methods , Databases, Protein , Molecular Weight , Peptide Library , Peptides/chemistry , Proteomics/methods , Reproducibility of Results , Software
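
To make mode (i) in the abstract above more concrete, here is a minimal sketch of counting all possible peptides in a mass range with dynamic programming. It is not the RAId_aPS code. It assumes integer nominal residue masses, counts linear sequences of the 20 standard amino acids (so leucine and isoleucine count separately), and ignores terminal groups and modifications. The function names are hypothetical.

```python
# Nominal (integer) residue masses of the 20 standard amino acids, in Da.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def count_peptides_by_mass(max_mass):
    """counts[m] = number of residue sequences whose masses sum to exactly m.

    Each sequence of mass m ends in some residue of mass r, preceded by a
    sequence of mass m - r, so counts[m] is the sum of counts[m - r] over
    all residues. counts[0] = 1 stands for the empty sequence.
    """
    counts = [0] * (max_mass + 1)
    counts[0] = 1
    for m in range(1, max_mass + 1):
        counts[m] = sum(counts[m - r] for r in RESIDUE_MASS.values() if r <= m)
    return counts


def count_in_range(low, high):
    """Total number of sequences with summed residue mass in [low, high]."""
    counts = count_peptides_by_mass(high)
    return sum(counts[low:high + 1])


counts = count_peptides_by_mass(200)
print(counts[128])              # K, Q, GA, AG
print(count_in_range(57, 128))
```

The same recurrence can carry a score histogram per mass instead of a single count, which is the general idea behind building the score distribution of all possible peptides for an additive scoring function; that extension is not shown here.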

19.

Optimization of duplex stability and terminal asymmetry for shRNA design.

Matveeva, Olga V; Kang, Yibin; Spiridonov, Alexey N; Saetrom, Pål; Nemtsov, Vladimir A; Ogurtsov, Aleksey Y; Nechipurenko, Yury D; Shabalina, Svetlana A.

PLoS One ; 5(4): e10180, 2010 Apr 20.

Article in English | MEDLINE | ID: mdl-20422034

ABSTRACT

Prediction of efficient oligonucleotides for RNA interference presents a serious challenge, especially for the development of genome-wide RNAi libraries which encounter difficulties and limitations due to ambiguities in the results and the requirement for significant computational resources. Here we present a fast and practical algorithm for shRNA design based on the thermodynamic parameters. In order to identify shRNA and siRNA features universally associated with high silencing efficiency, we analyzed structure-activity relationships in thousands of individual RNAi experiments from publicly available databases (ftp://ftp.ncbi.nlm.nih.gov/pub/shabalin/siRNA/si_shRNA_selector/). Using this statistical analysis, we found free energy ranges for the terminal duplex asymmetry and for fully paired duplex stability, such that shRNAs or siRNAs falling in both ranges have a high probability of being efficient. When combined, these two parameters yield an approximately 72% success rate on shRNAs from the siRecords database, with the target RNA levels reduced to below 20% of the control. Two other parameters correlate well with silencing efficiency: the stability of target RNA and the antisense strand secondary structure. Both parameters also correlate with the short RNA duplex stability; as a consequence, adding these parameters to our prediction scheme did not substantially improve classification accuracy. To test the validity of our predictions, we designed 83 shRNAs with optimal terminal asymmetry, and experimentally verified that small shifts in duplex stability strongly affected silencing efficiency. We showed that shRNAs with short fully paired stems could be successfully selected by optimizing only two parameters: terminal duplex asymmetry and duplex stability of the hypothetical cleavage product, which also relates to the specificity of mRNA target recognition. Our approach performs at the level of the best currently utilized algorithms that take into account prediction of the secondary structure of the target and antisense RNAs, but at significantly lower computational costs. Based on this study, we created the si-shRNA Selector program that predicts both highly efficient shRNAs and functional siRNAs (ftp://ftp.ncbi.nlm.nih.gov/pub/shabalin/siRNA/si_shRNA_selector/).

Subject(s)

Algorithms , Drug Design , RNA Stability , RNA, Small Interfering/chemistry , Databases, Nucleic Acid , Gene Silencing , Nucleic Acid Conformation , RNA Interference , Thermodynamics

20.

Distinct patterns of expression and evolution of intronless and intron-containing mammalian genes.

Shabalina, Svetlana A; Ogurtsov, Aleksey Y; Spiridonov, Alexey N; Novichkov, Pavel S; Spiridonov, Nikolay A; Koonin, Eugene V.

Mol Biol Evol ; 27(8): 1745-9, 2010 Aug.

Article in English | MEDLINE | ID: mdl-20360214

ABSTRACT

Comparison of expression levels and breadth and evolutionary rates of intronless and intron-containing mammalian genes shows that intronless genes are expressed at lower levels, tend to be tissue specific, and evolve significantly faster than spliced genes. By contrast, monomorphic spliced genes that are not subject to detectable alternative splicing and polymorphic alternatively spliced genes show similar, statistically indistinguishable patterns of expression and evolution. Alternative splicing is most common in ancient genes, whereas intronless genes appear to have relatively recent origins. These results imply tight coupling between different stages of gene expression, in particular, transcription, splicing, and nucleocytosolic transport of transcripts, and suggest that formation of intronless genes is an important route of evolution of novel tissue-specific functions in animals.

Subject(s)

Biological Evolution , Introns , Mammals/genetics , Animals , Humans , Molecular Sequence Data , RNA Splicing
