ABSTRACT
Many data sets exhibit well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here we introduce a framework for similarity search based on characterizing a data set's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the data set is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND (3700x BLASTX)), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve 'compressive omics,' and the general theory can be readily applied to data science problems outside of biology. Source code: http://gems.csail.mit.edu.
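The idea of searching over covering hyperspheres can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`build_cover`, `search`), the greedy clustering, and the choice of Euclidean distance are assumptions made for illustration only. Points are grouped into clusters of a fixed radius; a query is first compared against the cluster centers (coarse search), and only clusters that could contain a hit, by the triangle inequality, are scanned (fine search).

```python
import math

def euclidean(a, b):
    # Any true metric would do; Euclidean distance is an illustrative choice.
    return math.dist(a, b)

def build_cover(points, cluster_radius, dist=euclidean):
    """Greedily assign each point to the first center within cluster_radius."""
    centers = []   # one representative point per cluster
    members = []   # members[i] holds the points assigned to centers[i]
    for p in points:
        for i, c in enumerate(centers):
            if dist(p, c) <= cluster_radius:
                members[i].append(p)
                break
        else:
            # No existing center is close enough: p starts a new cluster.
            centers.append(p)
            members.append([p])
    return centers, members

def search(query, radius, centers, members, cluster_radius, dist=euclidean):
    """Return all indexed points within `radius` of `query`."""
    hits = []
    for c, group in zip(centers, members):
        # Coarse step: by the triangle inequality, a point within `radius`
        # of the query must belong to a cluster whose center lies within
        # radius + cluster_radius of the query.
        if dist(query, c) <= radius + cluster_radius:
            # Fine step: scan only the members of candidate clusters.
            hits.extend(p for p in group if dist(query, p) <= radius)
    return hits

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    centers, members = build_cover(pts, cluster_radius=1.0)
    print(search((0.0, 0.05), 0.5, centers, members, cluster_radius=1.0))
```

In this sketch the coarse step costs one distance computation per cluster, so the work grows with the number of covering spheres rather than with the number of points, provided few clusters pass the coarse filter.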
ABSTRACT
It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. AVAILABILITY: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/.
ABSTRACT
Using a partially purified bovine brain extract, our lab identified three novel endogenous acyl amino acids in mammalian tissues. The presence of numerous amino acids in the body and their ability to form amides with several saturated and unsaturated fatty acids indicated the potential existence of a large number of heretofore unidentified acyl amino acids. Reports of several additional acyl amino acids that activate G-protein-coupled receptors (e.g., N-arachidonoyl glycine, N-arachidonoyl serine) and transient receptor potential channels (e.g., N-arachidonoyl dopamine, N-acyl taurines) suggested that some or many novel acyl amino acids could serve as signaling molecules. Here, we used a targeted lipidomics approach including specific enrichment steps, nano-LC/MS/MS, high-throughput screening of the datasets with a potent search algorithm based on fragment ion analysis, and quantification using the multiple reaction monitoring mode in Analyst software to measure the biological levels of acyl amino acids in rat brain. We successfully identified 50 novel endogenous acyl amino acids present at 0.2 to 69 pmol g⁻¹ wet rat brain.
Subjects
Amino Acids/analysis, Brain/metabolism, Chromatography, High Pressure Liquid/methods, Tandem Mass Spectrometry/methods, Animals, Cattle, Lipid Metabolism, Male, Peroxisome Proliferator-Activated Receptors/metabolism, Rats, Rats, Sprague-Dawley, Receptors, G-Protein-Coupled/agonists, Solid Phase Extraction, gamma-Aminobutyric Acid/metabolism
ABSTRACT
Great effort has been devoted to characterizing signaling lipids in the central nervous system. This has led to a search for novel strategies to characterize hitherto unknown lipid compositions. Here we developed two methods, one for identification and one for quantification, for N-acyl amino acids, a novel lipid family. The identification method consists of a series of purification steps followed by nano-LC/MS/MS and high-throughput screening of the datasets with a potent search algorithm based on fragment ion analysis. Good-quality MS/MS spectra can be obtained with 150 fmol of targeted lipids on column with our nano-LC/MS/MS. More than one thousand mass spectra generated using the information-dependent acquisition mode of Analyst QS software can be analyzed in 1 min using our home-built software. The quantification method used the multiple reaction monitoring mode in Analyst software to measure the endogenous levels of N-acyl amino acids in rat brain. Using these two methods, we were able to identify and quantify 11 previously reported N-acyl amino acids with endogenous levels ranging from 0.26 to 333 pmol g⁻¹ wet rat brain.
Subjects
Amino Acids/chemistry, Brain Chemistry, Chromatography, Liquid/methods, Lipids/chemistry, Tandem Mass Spectrometry/methods, Animals, Male, Rats, Rats, Sprague-Dawley, Software
ABSTRACT
The discovery of endogenous fatty acyl amides such as N-arachidonoyl ethanolamide (anandamide), N-oleoyl ethanolamide (OEA), and N-arachidonoyl dopamine (NADA) as important signaling molecules in the central and peripheral nervous systems has led us to pursue other unidentified signaling molecules. Until recently, technical challenges, particularly those associated with lipid purification and chemical analysis, have hindered the identification of low-abundance signaling lipids. Improvements in chromatography and mass spectrometry (MS), such as miniaturization of high-performance liquid chromatography components, hybridization of multistage mass spectrometers and time-of-flight technology, and the development of electrospray ionization (ESI) and of information-dependent acquisition, now permit rapid identification of novel, low-abundance signaling lipids.