Search | VHL Regional Portal

1.

Nephele: genotyping via complete composition vectors and MapReduce.

Colosimo, Marc E; Peterson, Matthew W; Mardis, Scott; Hirschman, Lynette.

Source Code Biol Med ; 6: 13, 2011 Aug 18.

Article in English | MEDLINE | ID: mdl-21851626

ABSTRACT

BACKGROUND: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because they increase in computational complexity quickly with the number of sequences. RESULTS: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours. CONCLUSIONS: We conclude that using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome scale sequence coverage.
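
The idea of representing each sequence as a vector of its k-mers and comparing vectors instead of alignments can be illustrated with a minimal Python sketch. This is not Nephele's code and not the complete composition vector algorithm itself: it uses plain normalized k-mer frequencies and a cosine distance, and the function names `kmer_vector` and `cosine_distance`, the default `k=3`, and the DNA alphabet are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed, alphabet-ordered layout.

    Simplified stand-in for a composition vector (assumption): raw relative
    frequencies only, with no background correction.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only k-mers made of alphabet symbols contribute to the total.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def cosine_distance(u, v):
    """Distance in [0, 1] for non-negative vectors; 0 means same composition."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = kmer_vector("ACGTACGTACGT")
    b = kmer_vector("ACGTACGAACGT")
    print(cosine_distance(a, b))
```

A pairwise distance matrix built this way could then be handed to a clustering or tree-building step; because each vector is computed independently of the others, the vector-building stage is the kind of work that parallelizes naturally across compute nodes.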

2.

Controlled vocabularies for microbial virulence factors.

Korves, Tonia; Colosimo, Marc E.

Trends Microbiol ; 17(7): 279-85, 2009 Jul.

Article in English | MEDLINE | ID: mdl-19577471

ABSTRACT

Knowledge about pathogenesis is increasing dramatically, and most of this information is stored in the scientific literature or in sequence databases. This information can be made more accessible by the use of ontologies or controlled vocabularies. Recently, several ontologies, controlled vocabularies and databases have been developed or adapted for virulence factors and their roles in pathogenesis. Here, we discuss these systems, how they are being used in research and the challenges that remain for developing and applying ontologies for virulence factors.

Subject(s)

Computational Biology/methods , Computational Biology/standards , Virulence Factors/genetics , Virulence Factors/physiology , Vocabulary, Controlled

3.

Evaluating the automatic mapping of human gene and protein mentions to unique identifiers.

Morgan, Alexander A; Wellner, Benjamin; Colombe, Jeffrey B; Arens, Robert; Colosimo, Marc E; Hirschman, Lynette.

Pac Symp Biocomput ; : 281-91, 2007.

Article in English | MEDLINE | ID: mdl-17990499

ABSTRACT

We have developed a challenge task for the second BioCreAtIvE (Critical Assessment of Information Extraction in Biology) that requires participating systems to provide lists of the EntrezGene (formerly LocusLink) identifiers for all human genes and proteins mentioned in a MEDLINE abstract. We are distributing 281 annotated abstracts and another 5,000 noisily annotated abstracts along with a gene name lexicon to participants. We have performed a series of baseline experiments to better characterize this dataset and form a foundation for participant exploration.

Subject(s)

Databases, Genetic , Databases, Protein , MEDLINE , Computational Biology , Genome, Human , Genomics/statistics & numerical data , Humans , Proteomics/statistics & numerical data

4.

TreeViewJ: An application for viewing and analyzing phylogenetic trees.

Peterson, Matthew W; Colosimo, Marc E.

Source Code Biol Med ; 2: 7, 2007 Oct 31.

Article in English | MEDLINE | ID: mdl-17974028

ABSTRACT

BACKGROUND: Phylogenetic trees are widely used to visualize evolutionary relationships between different organisms or samples of the same organism. A variety of both free and commercial tree visualization software is available, but limitations in these programs often require researchers to use multiple programs for analysis, annotation, and the production of publication-ready images. RESULTS: We present TreeViewJ, a Java tool for visualizing, editing and analyzing phylogenetic trees. The software allows researchers to color and change the width of branches that they wish to highlight, and add names to nodes. If collection dates are available for taxa, the software can map them onto a timeline, and sort the tree in ascending or descending date order. CONCLUSION: TreeViewJ is a tool for researchers to visualize, edit, "decorate," and produce publication-ready images of phylogenetic trees. It is open-source, released under a GPL license, and available at http://treeviewj.sourceforge.net.

5.

Do you do text?

Blaschke, Christian; Yeh, Alexander; Camon, Evelyn; Colosimo, Marc; Apweiler, Rolf; Hirschman, Lynette; Valencia, Alfonso.

Bioinformatics ; 21(23): 4199-200, 2005 Dec 01.

Article in English | MEDLINE | ID: mdl-16195360

Subject(s)

Computational Biology/methods , Databases, Factual , Information Storage and Retrieval , Algorithms , Animals , Artificial Intelligence , Bayes Theorem , Computers , Data Interpretation, Statistical , Databases, Protein , Humans , Internet , Mice , Models, Genetic , Software

6.

The UNC-3 Olf/EBF protein represses alternate neuronal programs to specify chemosensory neuron identity.

Kim, Kyuhyung; Colosimo, Marc E; Yeung, Helen; Sengupta, Piali.

Dev Biol ; 286(1): 136-48, 2005 Oct 01.

Article in English | MEDLINE | ID: mdl-16143323

ABSTRACT

Neuronal identities are specified by the combinatorial functions of activators and repressors of gene expression. Members of the well-conserved Olf/EBF (O/E) transcription factor family have been shown to play important roles in neuronal and non-neuronal development and differentiation. O/E proteins are highly expressed in the olfactory epithelium, and O/E binding sites have been identified upstream of olfactory genes. However, the roles of O/E proteins in sensory neuron development are unclear. Here we show that the O/E protein UNC-3 is required for subtype specification of the ASI chemosensory neurons in Caenorhabditis elegans. UNC-3 promotes an ASI identity by directly repressing the expression of alternate neuronal programs and by activating expression of ASI-specific genes including the daf-7 TGF-beta gene. Our results indicate that UNC-3 is a critical component of the transcription factor code that integrates cell-intrinsic developmental programs with external signals to specify sensory neuronal identity and suggest models for O/E protein functions in other systems.

Subject(s)

Caenorhabditis elegans Proteins/metabolism , Caenorhabditis elegans/cytology , Caenorhabditis elegans/metabolism , Chemoreceptor Cells/metabolism , Transcription Factors/metabolism , Animals , Animals, Genetically Modified , Base Sequence , Binding Sites/genetics , Caenorhabditis elegans/genetics , Caenorhabditis elegans/growth & development , Caenorhabditis elegans Proteins/genetics , DNA, Helminth/genetics , DNA, Helminth/metabolism , Gene Expression Regulation, Developmental , Genes, Helminth , Mutation , Promoter Regions, Genetic , Transcription Factors/genetics

7.

Overview of BioCreAtIvE task 1B: normalized gene lists.

Hirschman, Lynette; Colosimo, Marc; Morgan, Alexander; Yeh, Alexander.

BMC Bioinformatics ; 6 Suppl 1: S11, 2005.

Article in English | MEDLINE | ID: mdl-15960823

ABSTRACT

BACKGROUND: Our goal in BioCreAtIvE has been to assess the state of the art in text mining, with emphasis on applications that reflect real biological applications, e.g., the curation process for model organism databases. This paper summarizes the BioCreAtIvE task 1B, the "Normalized Gene List" task, which was inspired by the gene list supplied for each curated paper in a model organism database. The task was to produce the correct list of unique gene identifiers for the genes and gene products mentioned in sets of abstracts from three model organisms (Yeast, Fly, and Mouse). RESULTS: Eight groups fielded systems for three data sets (Yeast, Fly, and Mouse). For Yeast, the top scoring system (out of 15) achieved 0.92 F-measure (harmonic mean of precision and recall); for Mouse and Fly, the task was more difficult, due to larger numbers of genes, more ambiguity in the gene naming conventions (particularly for Fly), and complex gene names (for Mouse). For Fly, the top F-measure was 0.82 out of 11 systems and for Mouse, it was 0.79 out of 16 systems. CONCLUSION: This assessment demonstrates that multiple groups were able to perform a real biological task across a range of organisms. The performance was dependent on the organism, and specifically on the naming conventions associated with each organism. These results hold out promise that the technology can provide partial automation of the curation process in the near future.

Subject(s)

Computational Biology/methods , Databases, Bibliographic/classification , Genes , Information Storage and Retrieval/methods , Terminology as Topic , Animals , Computational Biology/standards , Databases, Bibliographic/standards , Drosophila/genetics , Mice/genetics , Saccharomyces cerevisiae/genetics

8.

Data preparation and interannotator agreement: BioCreAtIvE task 1B.

Colosimo, Marc E; Morgan, Alexander A; Yeh, Alexander S; Colombe, Jeffrey B; Hirschman, Lynette.

BMC Bioinformatics ; 6 Suppl 1: S12, 2005.

Article in English | MEDLINE | ID: mdl-15960824

ABSTRACT

BACKGROUND: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset. RESULTS: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement) but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key. CONCLUSION: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in a paper, containing only a fraction of genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis.

Subject(s)

Computational Biology/methods , Databases, Factual/classification , Writing , Animals , Computational Biology/standards , Databases, Factual/standards , Information Storage and Retrieval/classification , Information Storage and Retrieval/standards

9.

BioCreAtIvE task 1A: gene mention finding evaluation.

Yeh, Alexander; Morgan, Alexander; Colosimo, Marc; Hirschman, Lynette.

BMC Bioinformatics ; 6 Suppl 1: S2, 2005.

Article in English | MEDLINE | ID: mdl-15960832

ABSTRACT

BACKGROUND: The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of work on text mining this literature, but comparing this work is hard because of a lack of standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task, which can be used as a building block for other text mining tasks. The task makes use of data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI). RESULTS: 15 teams took part in task 1A. A number of teams achieved scores over 80% F-measure (balanced precision and recall). The teams that tried to use their task 1A systems to help on other BioCreAtIvE tasks reported mixed results. CONCLUSION: The 80% plus F-measure results are good, but still somewhat lag the best scores achieved in some other domains such as newswire, due in part to the complexity and length of gene names, compared to person or organization names in newswire.

Subject(s)

Computational Biology/methods , Genes , Internationality , Markov Chains

10.

Identification of thermosensory and olfactory neuron-specific genes via expression profiling of single neuron types.

Colosimo, Marc E; Brown, Adam; Mukhopadhyay, Saikat; Gabel, Christopher; Lanjuin, Anne E; Samuel, Aravinthan D T; Sengupta, Piali.

Curr Biol ; 14(24): 2245-51, 2004 Dec 29.

Article in English | MEDLINE | ID: mdl-15620651

ABSTRACT

Most C. elegans sensory neuron types consist of a single bilateral pair of neurons, and respond to a unique set of sensory stimuli. Although genes required for the development and function of individual sensory neuron types have been identified in forward genetic screens, these approaches are unlikely to identify genes that when mutated result in subtle or pleiotropic phenotypes. Here, we describe a complementary approach to identify sensory neuron type-specific genes via microarray analysis using RNA from sorted AWB olfactory and AFD thermosensory neurons. The expression patterns of subsets of these genes were further verified in vivo. Genes identified by this analysis encode 7-transmembrane receptors, kinases, and nuclear factors including dac-1, which encodes a homolog of the highly conserved Dachshund protein. dac-1 is expressed in a subset of sensory neurons including the AFD neurons and is regulated by the TTX-1 OTX homeodomain protein. On thermal gradients, dac-1 mutants fail to suppress a cryophilic drive but continue to track isotherms at the cultivation temperature, representing the first genetic separation of these AFD-mediated behaviors. Expression profiling of single neuron types provides a rapid, powerful, and unbiased method for identifying neuron-specific genes whose functions can then be investigated in vivo.

Subject(s)

Caenorhabditis elegans Proteins/genetics , Caenorhabditis elegans/genetics , Gene Expression Profiling , Nerve Tissue Proteins/genetics , Neurons/metabolism , Olfactory Nerve/metabolism , Thermosensing/genetics , Amino Acid Sequence , Animals , Caenorhabditis elegans/metabolism , Caenorhabditis elegans Proteins/metabolism , Cells, Cultured , Flow Cytometry , Membrane Proteins/genetics , Molecular Sequence Data , Nerve Tissue Proteins/metabolism , Oligonucleotide Array Sequence Analysis , Phylogeny , Reverse Transcriptase Polymerase Chain Reaction , Sequence Alignment

11.

Gene name identification and normalization using a model organism database.

Morgan, Alexander A; Hirschman, Lynette; Colosimo, Marc; Yeh, Alexander S; Colombe, Jeff B.

J Biomed Inform ; 37(6): 396-410, 2004 Dec.

Article in English | MEDLINE | ID: mdl-15542014

ABSTRACT

Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.

Subject(s)

Abstracting and Indexing/methods , Computational Biology/methods , Databases, Genetic , Information Storage and Retrieval/methods , Algorithms , Animals , Artificial Intelligence , Biology/methods , Computers , Databases, Bibliographic , Drosophila , MEDLINE , Names , Natural Language Processing , Software
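
The balanced F-measure quoted in the abstract of this entry is the harmonic mean of precision and recall, so the reported figures can be checked directly. The short Python sketch below is only an illustration: the function name `f_measure` and the optional `beta` weighting parameter are assumptions, not anything taken from the paper; the precision and recall values are the ones stated in the abstract.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta=1.0 gives the balanced harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Filtered pattern matching: precision 50%, recall 72% (abstract reports 0.59).
    print(round(f_measure(0.50, 0.72), 2))
    # HMM tagger plus filters: precision 88%, recall 61% (abstract reports 0.72).
    print(round(f_measure(0.88, 0.61), 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, the unfiltered pattern-matching system (recall 95%, precision 2%) scores only about 0.04 despite its high recall.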

12.

The divergent orphan nuclear receptor ODR-7 regulates olfactory neuron gene expression via multiple mechanisms in Caenorhabditis elegans.

Colosimo, Marc E; Tran, Susan; Sengupta, Piali.

Genetics ; 165(4): 1779-91, 2003 Dec.

Article in English | MEDLINE | ID: mdl-14704165

ABSTRACT

Nuclear receptors regulate numerous critical biological processes. The C. elegans genome is predicted to encode approximately 270 nuclear receptors of which >250 are unique to nematodes. ODR-7 is the only member of this large divergent family whose functions have been defined genetically. ODR-7 is expressed in the AWA olfactory neurons and specifies AWA sensory identity by promoting the expression of AWA-specific signaling genes and repressing the expression of an AWC-specific olfactory receptor gene. To elucidate the molecular mechanisms of action of a divergent nuclear receptor, we have identified residues and domains required for different aspects of ODR-7 function in vivo. ODR-7 utilizes an unexpected diversity of mechanisms to regulate the expression of different sets of target genes. Moreover, these mechanisms are distinct in normal and heterologous cellular contexts. The odr-7 ortholog in the closely related nematode C. briggsae can fully substitute for all ODR-7-mediated functions, indicating conservation of function across 25-120 million years of divergence.

Subject(s)

Caenorhabditis elegans Proteins/physiology , Caenorhabditis elegans/genetics , Gene Expression Regulation , Nerve Tissue Proteins/physiology , Neurons/metabolism , Olfactory Nerve/cytology , Receptors, Odorant/physiology , Amino Acid Sequence , Animals , Animals, Genetically Modified , Caenorhabditis elegans/metabolism , Conserved Sequence , DNA, Helminth/genetics , DNA, Helminth/metabolism , Genes, Helminth , Genetic Variation , Mitogen-Activated Protein Kinases/metabolism , Molecular Sequence Data , Mutation , Phenotype , Protein Structure, Tertiary , Sequence Homology, Amino Acid , Signal Transduction
