Search | VHL Regional Portal

Phylotastic: Improving Access to Tree-of-Life Knowledge With Flexible, on-the-Fly Delivery of Trees.

Nguyen, Van D; Nguyen, Thanh H; Tayeen, Abu Saleh Md; Laughinghouse, H Dail; Sánchez-Reyes, Luna L; Wiggins, Jodie; Pontelli, Enrico; Mozzherin, Dmitry; O'Meara, Brian; Stoltzfus, Arlin.

Evol Bioinform Online ; 16: 1176934319899384, 2020.

Article in English | MEDLINE | ID: mdl-32372858

ABSTRACT

A comprehensive phylogeny of species, i.e., a tree of life, has potential uses in a variety of contexts, including research, education, and public policy. Yet, accessing the tree of life typically requires special knowledge, complex software, or long periods of training. The Phylotastic project aims to make it as easy to get a phylogeny of species as it is to get driving directions from mapping software. In prior work, we presented a design for an open system to validate and manage taxon names, find phylogeny resources, extract subtrees matching a user's taxon list, scale trees to time, and integrate related resources such as species images. Here, we report the implementation of a set of tools that together represent a robust, accessible system for on-the-fly delivery of phylogenetic knowledge. This set of tools includes a web portal to execute several customizable workflows to obtain species phylogenies (scaled by geologic time and decorated with thumbnail images); more than 30 underlying web services (accessible via a common registry); and code toolkits in R and Python (allowing others to develop custom applications using Phylotastic services). The Phylotastic system, accessible via http://www.phylotastic.org, provides a unique resource to access the current state of phylogenetic knowledge, useful for a variety of cases in which a tree extracted quickly from online resources (as distinct from a tree custom-made from character data) is sufficient, as it is for many casual uses of trees identified here.

"gnparser": a powerful parser for scientific names based on Parsing Expression Grammar.

Mozzherin, Dmitry Y; Myltsev, Alexander A; Patterson, David J.

BMC Bioinformatics ; 18(1): 279, 2017 May 26.

Article in English | MEDLINE | ID: mdl-28549446

ABSTRACT

BACKGROUND: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is a part of a scientific name and may also differ significantly. To match all possible variations of a name we need to divide them into their elements and classify each element according to its role. We refer to this as 'parsing' the name. Parsing categorizes a name's elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for the automatic data exchange within the context of "Big Data" in biology. RESULTS: We introduce Global Names Parser (gnparser). It is a Java tool written in the Scala language (a language for the Java Virtual Machine) to parse scientific names. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures as in the names of hybrids. gnparser performs with ≈99% accuracy and processes 30 million name-strings/hour per CPU thread. The gnparser library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command line application, as a socket server, a web-app or as a RESTful HTTP-service. It is released under an open-source MIT license. CONCLUSIONS: Global Names Parser (gnparser) is a fast, high precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information.

Subject(s)

User-Computer Interface , Biodiversity , Informatics , Internet , Terminology as Topic
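
The two-stage matching described in the gnparser abstract (group names by their stable elements, then refine by their varying elements) can be illustrated with a minimal Python sketch. This is not gnparser and does not use its API or its Parsing Expression Grammar; the regular expression, the function names (split_name, group_by_canonical, authorship_compatible, match), and the compatibility rule are all hypothetical simplifications that only handle plain binomials followed by an optional authorship.

```python
import re
from collections import defaultdict

# Hypothetical, heavily simplified pattern: genus, species epithet,
# then whatever remains is treated as authorship and year.
NAME_RE = re.compile(r"^\s*([A-Z][a-z]+)\s+([a-z-]+)\s*(.*)$")


def split_name(name_string):
    """Split a name-string into a stable part (canonical binomial)
    and a variable part (normalized authorship tokens)."""
    m = NAME_RE.match(name_string)
    if m is None:
        return None
    genus, epithet, rest = m.groups()
    canonical = f"{genus} {epithet}"
    # Drop punctuation, lowercase, and tokenize the variable part.
    authorship = re.sub(r"[^\w\s]", "", rest).lower().split()
    return canonical, authorship


def group_by_canonical(name_strings):
    """Stage 1: group name-strings by their stable element."""
    groups = defaultdict(list)
    for s in name_strings:
        parsed = split_name(s)
        if parsed is not None:
            groups[parsed[0]].append((s, parsed[1]))
    return dict(groups)


def authorship_compatible(a, b):
    """Stage 2 (assumed rule): authorships are compatible if one is
    missing, or if years do not conflict and the first author tokens
    agree up to abbreviation."""
    if not a or not b:
        return True
    years_a = {t for t in a if t.isdigit()}
    years_b = {t for t in b if t.isdigit()}
    if years_a and years_b and years_a != years_b:
        return False
    authors_a = [t for t in a if not t.isdigit()]
    authors_b = [t for t in b if not t.isdigit()]
    if not authors_a or not authors_b:
        return True
    x, y = authors_a[0], authors_b[0]
    return x.startswith(y) or y.startswith(x)


def match(query, candidates):
    """Return the candidates that match the query after both stages."""
    parsed = split_name(query)
    if parsed is None:
        return []
    canonical, authorship = parsed
    groups = group_by_canonical(candidates)
    return [s for s, a in groups.get(canonical, [])
            if authorship_compatible(authorship, a)]
```

For example, match("Homo sapiens L.", ["Homo sapiens Linnaeus, 1758", "Homo sapiens", "Homo sapiens Smith, 1900"]) keeps the first two candidates and rejects the third, because only the authorship differs in a way the assumed rule treats as a conflict.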

Challenges with using names to link digital biodiversity information.

Patterson, David; Mozzherin, Dmitry; Shorthouse, David Peter; Thessen, Anne.

Biodivers Data J ; (4): e8080, 2016.

Article in English | MEDLINE | ID: mdl-27346955

SeaBase: a multispecies transcriptomic resource and platform for gene network inference.

Fischer, Antje H L; Mozzherin, Dmitry; Eren, A Murat; Lans, Kristen D; Wilson, Nathan; Cosentino, Carlo; Smith, Joel.

Integr Comp Biol ; 54(2): 250-63, 2014 Jul.

Article in English | MEDLINE | ID: mdl-24907201

ABSTRACT

Marine and aquatic animals are extraordinarily useful as models for identifying mechanisms of development and evolution, regeneration, resistance to cancer, longevity and symbiosis, among many other areas of research. This is due to the great diversity of these organisms and their wide-ranging capabilities. Genomics tools are essential for taking advantage of these "free lessons" of nature. However, genomics and transcriptomics are challenging in emerging model systems. Here, we present SeaBase, a tool for helping to meet these needs. Specifically, SeaBase provides a platform for sharing and searching transcriptome data. More importantly, SeaBase will support a growing number of tools for inferring gene network mechanisms. The first dataset available on SeaBase is a developmental transcriptomic profile of the sea anemone Nematostella vectensis (Anthozoa, Cnidaria). Additional datasets are currently being prepared and we are aiming to expand SeaBase to include user-supplied data for any number of marine and aquatic organisms, thereby supporting many potentially new models for gene network studies. SeaBase can be accessed online at: http://seabase.core.cli.mbl.edu.

Subject(s)

Aquatic Organisms/genetics , Databases as Topic , Gene Regulatory Networks , Transcriptome , Animals , Genomics , Humans , Sea Anemones/genetics

The taxonomic name resolution service: an online tool for automated standardization of plant names.

Boyle, Brad; Hopkins, Nicole; Lu, Zhenyuan; Raygoza Garay, Juan Antonio; Mozzherin, Dmitry; Rees, Tony; Matasci, Naim; Narro, Martha L; Piel, William H; McKay, Sheldon J; Lowry, Sonya; Freeland, Chris; Peet, Robert K; Enquist, Brian J.

BMC Bioinformatics ; 14: 16, 2013 Jan 16.

Article in English | MEDLINE | ID: mdl-23324024

ABSTRACT

BACKGROUND: The digitization of biodiversity data is leading to the widespread application of taxon names that are superfluous, ambiguous or incorrect, resulting in mismatched records and inflated species numbers. The ultimate consequences of misspelled names and bad taxonomy are erroneous scientific conclusions and faulty policy decisions. The lack of tools for correcting this 'names problem' has become a fundamental obstacle to integrating disparate data sources and advancing the progress of biodiversity science. RESULTS: The TNRS, or Taxonomic Name Resolution Service, is an online application for automated and user-supervised standardization of plant scientific names. The TNRS builds upon and extends existing open-source applications for name parsing and fuzzy matching. Names are standardized against multiple reference taxonomies, including the Missouri Botanical Garden's Tropicos database. Capable of processing thousands of names in a single operation, the TNRS parses and corrects misspelled names and authorities, standardizes variant spellings, and converts nomenclatural synonyms to accepted names. Family names can be included to increase match accuracy and resolve many types of homonyms. Partial matching of higher taxa combined with extraction of annotations, accession numbers and morphospecies allows the TNRS to standardize taxonomy across a broad range of active and legacy datasets. CONCLUSIONS: We show how the TNRS can resolve many forms of taxonomic semantic heterogeneity, correct spelling errors and eliminate spurious names. As a result, the TNRS can aid the integration of disparate biological datasets. Although the TNRS was developed to aid in standardizing plant names, its underlying algorithms and design can be extended to all organisms and nomenclatural codes. The TNRS is accessible via a web interface at http://tnrs.iplantcollaborative.org/ and as a RESTful web service and application programming interface. Source code is available at https://github.com/iPlantCollaborativeOpenSource/TNRS/.

Subject(s)

Plants/classification , Software , Algorithms , Classification/methods , Databases, Factual , Internet , Names , User-Computer Interface
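
The kind of standardization the TNRS abstract describes (correcting misspellings by fuzzy matching, then converting synonyms to accepted names) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not call the TNRS web service or reproduce its algorithms, the ACCEPTED and SYNONYMS tables are toy data chosen for the example, and the resolve function and its cutoff value are hypothetical. Fuzzy matching here uses difflib.get_close_matches from the Python standard library.

```python
import difflib

# Toy reference data for illustration only (not a real taxonomy source).
ACCEPTED = ["Quercus robur", "Acer saccharum", "Pinus sylvestris"]
SYNONYMS = {"Acer saccharophorum": "Acer saccharum"}


def resolve(name, cutoff=0.8):
    """Return the accepted name for a possibly misspelled or synonymous
    name-string, or None if nothing in the reference is close enough."""
    # Collapse repeated whitespace before matching.
    cleaned = " ".join(name.split())
    reference = ACCEPTED + list(SYNONYMS)
    # Step 1: fuzzy-match against accepted names and known synonyms.
    hits = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    if not hits:
        return None
    # Step 2: convert a synonym to its accepted name; accepted names pass through.
    return SYNONYMS.get(hits[0], hits[0])
```

With this toy data, resolve("Quercus robar") yields "Quercus robur", resolve("Acer saccharophorum") yields "Acer saccharum", and a string with no close match yields None. A real service would also parse authorities and use family names to break ties, as the abstract notes.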

Applications of natural language processing in biodiversity science.

Thessen, Anne E; Cui, Hong; Mozzherin, Dmitry.

Adv Bioinformatics ; 2012: 391574, 2012.

Article in English | MEDLINE | ID: mdl-22685456

ABSTRACT

Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.

A single amino acid change (E85K) in human PCNA that leads, relative to wild type, to enhanced DNA synthesis by DNA polymerase delta past nucleotide base lesions (TLS) as well as on unmodified templates.

Fisher, Paul A; Moutsiakis, Demetrius L; McConnell, Maeve; Miller, Holly; Mozzherin, Dmitry Ju.

Biochemistry ; 43(50): 15915-21, 2004 Dec 21.

Article in English | MEDLINE | ID: mdl-15595847

ABSTRACT

Human proliferating cell nuclear antigen (hPCNA) containing a single amino acid substitution at position 85, that of lysine for glutamate (E85K), was compared to wild-type (wt) hPCNA for its ability to promote DNA synthesis by purified DNA polymerase delta (pol delta) both on unmodified templates and past chemically defined template base lesions (translesion synthesis; TLS). Significant enhancement (up to 4-5-fold or greater) was seen but depended both on the exact PCNA/pol delta ratio tested and on the specific nature of the template (e.g., unmodified versus lesion-containing; chemical nature of the template base lesion). These results suggest that human PCNA, either mutated to contain lysine (K) at position 85 or bearing similar primary mutations, would promote more secondary mutagenesis in cells and/or tissues where PCNA is normally expressed at low levels relative to pol delta. Over an entire lifetime, such secondary mutagenesis could be biomedically significant.

Subject(s)

DNA Damage , DNA Polymerase III/physiology , DNA Replication , Mutagenesis , Proliferating Cell Nuclear Antigen/genetics , Amino Acid Substitution , DNA Polymerase III/metabolism , Glutamic Acid/genetics , Humans , Lysine/genetics , Point Mutation/genetics , Proliferating Cell Nuclear Antigen/metabolism , Templates, Genetic

Site-specific mutagenesis of Drosophila proliferating cell nuclear antigen enhances its effects on calf thymus DNA polymerase delta.

Mozzherin, Dmitry Ju; McConnell, Maeve; Miller, Holly; Fisher, Paul A.

BMC Biochem ; 5: 13, 2004 Aug 13.

Article in English | MEDLINE | ID: mdl-15310391

ABSTRACT

BACKGROUND: We and others have shown four distinct and presumably related effects of mammalian proliferating cell nuclear antigen (PCNA) on DNA synthesis catalyzed by mammalian DNA polymerase delta (pol delta). In the presence of homologous PCNA, pol delta exhibits 1) increased absolute activity; 2) increased processivity of DNA synthesis; 3) stable binding of synthetic oligonucleotide template-primers (t1/2 of the pol delta·PCNA·template-primer complex ≥2.5 h); and 4) enhanced synthesis of DNA opposite and beyond template base lesions. This last effect is potentially mutagenic in vivo. Biochemical studies performed in parallel with in vivo genetic analyses, would represent an extremely powerful approach to investigate further, both DNA replication and repair in eukaryotes. RESULTS: Drosophila PCNA, although highly similar in structure to mammalian PCNA (e.g., it is >70% identical to human PCNA in amino acid sequence), can only substitute poorly for either calf thymus or human PCNA (approximately 10% as well) in affecting calf thymus pol delta. However, by mutating one or only a few amino acids in the region of Drosophila PCNA thought to interact with pol delta, all four effects can be enhanced dramatically. CONCLUSIONS: Our results therefore suggest that all four above effects depend at least in part on the PCNA-pol delta interaction. Moreover unlike mammals, Drosophila offers the potential for immediate in vivo genetic analyses. Although it has proven difficult to obtain sufficient amounts of homologous pol delta for parallel in vitro biochemical studies, by altering Drosophila PCNA using site-directed mutagenesis as suggested by our results, in vitro biochemical studies may now be performed using human and/or calf thymus pol delta preparations.

Subject(s)

DNA Polymerase III/metabolism , Drosophila Proteins/physiology , Proliferating Cell Nuclear Antigen/physiology , Amino Acid Sequence , Amino Acid Substitution , Animals , Cattle , DNA/metabolism , DNA Polymerase III/chemistry , DNA Replication , Drosophila Proteins/chemistry , Drosophila Proteins/genetics , Humans , Models, Molecular , Molecular Sequence Data , Mutagenesis, Site-Directed , Proliferating Cell Nuclear Antigen/chemistry , Proliferating Cell Nuclear Antigen/genetics , Protein Binding , Protein Conformation , Protein Interaction Mapping , Protein Structure, Tertiary , Species Specificity , Thymus Gland/enzymology
