Search | VHL Regional Portal

1.

The semantics of Chemical Markup Language (CML) for computational chemistry : CompChem.

Phadungsukanan, Weerapong; Kraft, Markus; Townsend, Joe A; Murray-Rust, Peter.

J Cheminform ; 4(1): 15, 2012 Aug 07.

Article in English | MEDLINE | ID: mdl-22870956

ABSTRACT

: This paper introduces a subdomain chemistry format for storing computational chemistry data called CompChem. It has been developed based on the design, concepts and methodologies of Chemical Markup Language (CML) by adding computational chemistry semantics on top of the CML Schema. The format allows a wide range of ab initio quantum chemistry calculations of individual molecules to be stored. These calculations include, for example, single point energy calculation, molecular geometry optimization, and vibrational frequency analysis. The paper also describes the supporting infrastructure, such as processing software, dictionaries, validation tools and database repositories. In addition, some of the challenges and difficulties in developing common computational chemistry dictionaries are discussed. The uses of CompChem are illustrated by two practical applications.

2.

Note on naive Bayes based on binary descriptors in cheminformatics.

Townsend, Joe A; Glen, Robert C; Mussa, Hamse Y.

J Chem Inf Model ; 52(10): 2494-500, 2012 Oct 22.

Article in English | MEDLINE | ID: mdl-22900941

ABSTRACT

A plethora of articles on naive Bayes classifiers, where the chemical compounds to be classified are represented by binary-valued (absent or present type) descriptors, have appeared in the cheminformatics literature over the past decade. The principal goal of this paper is to describe how a naive Bayes classifier based on binary descriptors (NBCBBD) can be employed as a feature selector in an efficient manner suitable for cheminformatics. In the process, we point out a fact well documented in other disciplines that NBCBBD is a linear classifier and is therefore intrinsically suboptimal for classifying compounds that are nonlinearly separable in their binary descriptor space. We investigate the performance of the proposed algorithm on classifying a subset of the MDDR data set, a standard molecular benchmark data set, into active and inactive compounds.

Subject(s)

Algorithms , Biological Products/chemistry , Enzyme Inhibitors/chemistry , Bayes Theorem , Biological Products/pharmacology , Enzyme Inhibitors/pharmacology , Humans , Informatics , Models, Molecular , Structure-Activity Relationship

3.

The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age.

Adams, Sam; de Castro, Pablo; Echenique, Pablo; Estrada, Jorge; Hanwell, Marcus D; Murray-Rust, Peter; Sherwood, Paul; Thomas, Jens; Townsend, Joe.

J Cheminform ; 3: 38, 2011 Oct 14.

Article in English | MEDLINE | ID: mdl-21999363

ABSTRACT

Computational Quantum Chemistry has developed into a powerful, efficient, reliable and increasingly routine tool for exploring the structure and properties of small to medium sized molecules. Many thousands of calculations are performed every day, some offering results which approach experimental accuracy. However, in contrast to other disciplines, such as crystallography, or bioinformatics, where standard formats and well-known, unified databases exist, this QC data is generally destined to remain locally held in files which are not designed to be machine-readable. Only a very small subset of these results will become accessible to the wider community through publication.In this paper we describe how the Quixote Project is developing the infrastructure required to convert output from a number of different molecular quantum chemistry packages to a common semantically rich, machine-readable format and to build respositories of QC results. Such an infrastructure offers benefits at many levels. The standardised representation of the results will facilitate software interoperability, for example making it easier for analysis tools to take data from different QC packages, and will also help with archival and deposition of results. The repository infrastructure, which is lightweight and built using Open software components, can be implemented at individual researcher, project, organisation or community level, offering the exciting possibility that in future many of these QC results can be made publically available, to be searched and interpreted just as crystallography and bioinformatics results are today.Although we believe that quantum chemists will appreciate the contribution the Quixote infrastructure can make to the organisation and and exchange of their results, we anticipate that greater rewards will come from enabling their results to be consumed by a wider community. As the respositories grow they will become a valuable source of chemical data for use by other disciplines in both research and education.The Quixote project is unconventional in that the infrastructure is being implemented in advance of a full definition of the data model which will eventually underpin it. We believe that a working system which offers real value to researchers based on tools and shared, searchable repositories will encourage early participation from a broader community, including both producers and consumers of data. In the early stages, searching and indexing can be performed on the chemical subject of the calculations, and well defined calculation meta-data. The process of defining more specific quantum chemical definitions, adding them to dictionaries and extracting them consistently from the results of the various software packages can then proceed in an incremental manner, adding additional value at each stage.Not only will these results help to change the data management model in the field of Quantum Chemistry, but the methodology can be applied to other pressing problems related to data in computational and experimental science.

4.

CMLLite: a design philosophy for CML.

Townsend, Joe A; Murray-Rust, Peter.

J Cheminform ; 3(1): 39, 2011 Oct 14.

Article in English | MEDLINE | ID: mdl-21999395

ABSTRACT

CMLLite is a collection of definitions and processes which provide strong and flexible validation for a document in Chemical Markup Language (CML). It consists of an updated CML schema (schema3), conventions specifying rules in both human and machine-understandable forms and a validator available both online and offline to check conformance. This article explores the rationale behind the changes which have been made to the schema, explains how conventions interact and how they are designed, formulated, implemented and tested, and gives an overview of the validation service.

5.

The semantic architecture of the World-Wide Molecular Matrix (WWMM).

Murray-Rust, Peter; Adams, Sam E; Downing, Jim; Townsend, Joe A; Zhang, Yong.

J Cheminform ; 3(1): 42, 2011 Oct 14.

Article in English | MEDLINE | ID: mdl-21999475

ABSTRACT

The World-Wide Molecular Matrix (WWMM) is a ten year project to create a peer-to-peer (P2P) system for the publication and collection of chemical objects, including over 250, 000 molecules. It has now been instantiated in a number of repositories which include data encoded in Chemical Markup Language (CML) and linked by URIs and RDF. The technical specification and implementation is now complete. We discuss the types of architecture required to implement nodes in the WWMM and consider the social issues involved in adoption.

6.

The semantics of Chemical Markup Language (CML): dictionaries and conventions.

Murray-Rust, Peter; Townsend, Joe A; Adams, Sam E; Phadungsukanan, Weerapong; Thomas, Jens.

J Cheminform ; 3: 43, 2011 Oct 14.

Article in English | MEDLINE | ID: mdl-21999509

ABSTRACT

The semantic architecture of CML consists of conventions, dictionaries and units. The conventions conform to a top-level specification and each convention can constrain compliant documents through machine-processing (validation). Dictionaries conform to a dictionary specification which also imposes machine validation on the dictionaries. Each dictionary can also be used to validate data in a CML document, and provide human-readable descriptions. An additional set of conventions and dictionaries are used to support scientific units. All conventions, dictionaries and dictionary elements are identifiable and addressable through unique URIs.

7.

Ami - The chemist's amanuensis.

Brooks, Brian J; Thorn, Adam L; Smith, Matthew; Matthews, Peter; Chen, Shaoming; O'Steen, Ben; Adams, Sam E; Townsend, Joe A; Murray-Rust, Peter.

J Cheminform ; 3: 45, 2011 Oct 14.

Article in English | MEDLINE | ID: mdl-21999587

ABSTRACT

The Ami project was a six month Rapid Innovation project sponsored by JISC to explore the Virtual Research Environment space. The project brainstormed with chemists and decided to investigate ways to facilitate monitoring and collection of experimental data.A frequently encountered use-case was identified of how the chemist reaches the end of an experiment, but finds an unexpected result. The ability to replay events can significantly help make sense of how things progressed. The project therefore concentrated on collecting a variety of dimensions of ancillary data - data that would not normally be collected due to practicality constraints. There were three main areas of investigation: 1) Development of a monitoring tool using infrared and ultrasonic sensors; 2) Time-lapse motion video capture (for example, videoing 5 seconds in every 60); and 3) Activity-driven video monitoring of the fume cupboard environs.The Ami client application was developed to control these separate logging functions. The application builds up a timeline of the events in the experiment and around the fume cupboard. The videos and data logs can then be reviewed after the experiment in order to help the chemist determine the exact timings and conditions used.The project experimented with ways in which a Microsoft Kinect could be used in a laboratory setting. Investigations suggest that it would not be an ideal device for controlling a mouse, but it shows promise for usages such as manipulating virtual molecules.

8.

SPECTRa-T: machine-based data extraction and semantic searching of chemistry e-theses.

Downing, Jim; Harvey, Matt J; Morgan, Peter B; Murray-Rust, Peter; Rzepa, Henry S; Stewart, Diana C; Tonge, Alan P; Townsend, Joe A.

J Chem Inf Model ; 50(2): 251-61, 2010 Feb 22.

Article in English | MEDLINE | ID: mdl-20088574

ABSTRACT

The SPECTRa-T project has developed text-mining tools to extract named chemical entities (NCEs), such as chemical names and terms, and chemical objects (COs), e.g., experimental spectral assignments and physical chemistry properties, from electronic theses (e-theses). Although NCEs were readily identified within the two major document formats studied, only the use of structured documents enabled identification of chemical objects and their association with the relevant chemical entity (e.g., systematic chemical name). A corpus of theses was analyzed and it is shown that a high degree of semantic information can be extracted from structured documents. This integrated information has been deposited in a persistent Resource Description Framework (RDF) triple-store that allows users to conduct semantic searches. The strength and weaknesses of several document formats are reviewed.

Subject(s)

Academic Dissertations as Topic , Chemistry/education , Data Mining/methods , Software , Databases, Factual , Electronic Data Processing , False Positive Reactions

9.

Chemical documents: machine understanding and automated information extraction.

Townsend, Joe A; Adams, Sam E; Waudby, Christopher A; de Souza, Vanessa K; Goodman, Jonathan M; Murray-Rust, Peter.

Org Biomol Chem ; 2(22): 3294-300, 2004 Nov 21.

Article in English | MEDLINE | ID: mdl-15534707

ABSTRACT

Automatically extracting chemical information from documents is a challenging task, but an essential one for dealing with the vast quantity of data that is available. The task is least difficult for structured documents, such as chemistry department web pages or the output of computational chemistry programs, but requires increasingly sophisticated approaches for less structured documents, such as chemical papers. The identification of key units of information, such as chemical names, makes the extraction of useful information from unstructured documents possible.

Subject(s)

Chemistry/methods , Electronic Data Processing/methods , Software , Internet , Terminology as Topic

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL