Results 1 - 16 of 16
1.
J Cheminform ; 10(1): 59, 2018 Dec 06.
Article in English | MEDLINE | ID: mdl-30523437

ABSTRACT

Chemical named entity recognition (NER) has traditionally been dominated by conditional random fields (CRF)-based approaches, but given the success of the artificial neural network techniques known as "deep learning", we decided to examine them as an alternative to CRFs. We present here several chemical named entity recognition systems. The first system translates the traditional CRF-based idioms into a deep learning framework, using rich per-token features and neural word embeddings, and producing a sequence of tags using bidirectional long short-term memory (LSTM) networks, a type of recurrent neural net. The second system eschews the rich feature set, and even tokenisation, in favour of character labelling using neural character embeddings and multiple LSTM layers. The third system is an ensemble that combines the results of the first two systems. Our original BioCreative V.5 competition entry was placed in the top group with the highest F scores, and subsequent work using transfer learning has achieved a final F score of 90.33% on the test data (precision 91.47%, recall 89.21%).
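
The reported figures can be cross-checked with a short calculation. The sketch below assumes the quoted F score is the balanced F1 measure (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F score" is the balanced F1 measure.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, in percent.
f1 = f_score(91.47, 89.21)
print(round(f1, 2))  # 90.33
```

Under that assumption the three quoted numbers are mutually consistent.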

2.
J Biomed Semantics ; 2 Suppl 5: S11, 2011 Oct 06.
Article in English | MEDLINE | ID: mdl-22166494

ABSTRACT

BACKGROUND: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of the GSC is time-consuming and costly, and the final corpus consists at most of a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions, the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO) and species (SPE). This corpus has been used for the First CALBC Challenge, which asked participants to annotate the corpus with their text processing solutions. RESULTS: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE in comparison to the identification of entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. Data sets from participants who did not use the SSC-I annotated data for training were used to generate the harmonised Silver Standard Corpus II (SSC-II). The performances of the participants' solutions were again measured against the SSC-II. The performances of the annotation solutions again showed better results for DISO and SPE in comparison to CHED and PRGE. CONCLUSIONS: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' annotation solutions in comparison to the SSC-I.

3.
J Chem Inf Model ; 51(3): 739-53, 2011 Mar 28.
Article in English | MEDLINE | ID: mdl-21384929

ABSTRACT

We have produced an open source, freely available algorithm (Open Parser for Systematic IUPAC Nomenclature, OPSIN) that interprets the majority of organic chemical nomenclature in a fast and precise manner. This has been achieved using an approach based on a regular grammar. This grammar is used to guide tokenization, a potentially difficult problem in chemical names. From the parsed chemical name, an XML parse tree is constructed that is operated on in a stepwise manner until the structure has been reconstructed from the name. Results from OPSIN on various computer generated name/structure pair sets are presented. These show exceptionally high precision (99.8%+) and, when using general organic chemical nomenclature, high recall (98.7-99.2%). This software can serve as the basis for future open source developments of chemical name interpretation.
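
To illustrate the general idea of grammar-guided tokenization, the following is a toy sketch only: it is not OPSIN's grammar or code. The token classes, the tiny vocabulary and the function name `tokenise` are all hypothetical, chosen to show how a regular expression can split a simple name into labelled parts.

```python
import re

# Hypothetical, deliberately tiny vocabulary; real nomenclature is far larger.
# Alternatives are tried left to right, so "methyl" (substituent) is listed
# before "meth" (stem) to avoid a premature short match.
TOKEN = re.compile(
    r"(?P<locant>\d+)"
    r"|(?P<hyphen>-)"
    r"|(?P<substituent>chloro|bromo|methyl)"
    r"|(?P<stem>meth|eth|prop|but)"
    r"|(?P<saturation>an|en|yn)"
    r"|(?P<suffix>ol|al|one|e)"
)


def tokenise(name: str) -> list[tuple[str, str]]:
    """Split a chemical name into (token class, text) pairs, left to right."""
    tokens, pos = [], 0
    while pos < len(name):
        match = TOKEN.match(name, pos)
        if match is None:
            raise ValueError(f"cannot tokenise {name!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenise("2-chloroethanol"))
```

A name outside the toy vocabulary raises `ValueError`, which mirrors the point in the abstract that tokenization is where chemical name parsing can fail.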


Subject(s)
Terminology as Topic , Models, Molecular
4.
Ecol Appl ; 20(1): 263-77, 2010 Jan.
Article in English | MEDLINE | ID: mdl-20349846

ABSTRACT

Hybridization and introgression between introduced and native salmonids threaten the continued persistence of many inland cutthroat trout species. Environmental models have been developed to predict the spread of introgression, but few studies have assessed the role of propagule pressure. We used an extensive set of fish stocking records and geographic information system (GIS) data to produce a spatially explicit index of potential propagule pressure exerted by introduced rainbow trout in the Upper Kootenay River, British Columbia, Canada. We then used logistic regression and the information-theoretic approach to test the ability of a set of environmental and spatial variables to predict the level of introgression between native westslope cutthroat trout and introduced rainbow trout. Introgression was assessed using between four and seven co-dominant, diagnostic nuclear markers at 45 sites in 31 different streams. The best model for predicting introgression included our GIS propagule pressure index and an environmental variable that accounted for the biogeoclimatic zone of the site (r² = 0.62). This model was 1.4 times more likely to explain introgression than the next-best model, which consisted of only the propagule pressure index variable. We created a composite model based on the model-averaged results of the seven top models that included environmental, spatial, and propagule pressure variables. The propagule pressure index had the highest importance weight (0.995) of all variables tested and was negatively related to sites with no introgression. This study used an index of propagule pressure and demonstrated that propagule pressure had the greatest influence on the level of introgression between a native and introduced trout in a human-induced hybrid zone.
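
A statement such as "1.4 times more likely" is commonly an evidence ratio between model weights. The abstract does not name the criterion used; the sketch below assumes the information-theoretic approach means Akaike's information criterion (AIC), and the AIC values are hypothetical, picked only so that the ratio comes out at 1.4.

```python
import math


def akaike_weights(aic_values: list[float]) -> list[float]:
    """Convert AIC scores into normalised Akaike weights."""
    best = min(aic_values)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical AIC scores: the second model is worse by 2*ln(1.4), about 0.67.
aics = [100.0, 100.0 + 2 * math.log(1.4)]
weights = akaike_weights(aics)
evidence_ratio = weights[0] / weights[1]
print(round(evidence_ratio, 2))  # 1.4
```

Under this assumption, an evidence ratio of 1.4 corresponds to an AIC difference well below 2, which is why model averaging over several top models is a natural next step.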


Subject(s)
Rivers , Trout/physiology , Alleles , Animals , British Columbia , Conservation of Natural Resources , Ecosystem , Models, Biological , Population Dynamics , Trout/genetics
5.
J Bioinform Comput Biol ; 8(1): 163-79, 2010 Feb.
Article in English | MEDLINE | ID: mdl-20183881

ABSTRACT

The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. The generation of this corpus requires that the annotations from different automatic annotation systems be harmonized. In the first phase, the annotation systems from five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced from those different systems were integrated in a single harmonized corpus ("silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization--formal boundary reconciliation and semantic matching of named entities. Finally, all submissions of the participants were evaluated against that silver standard corpus. We found that species and disease annotations are better standardized amongst the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Parts of it will be made available later on for a public challenge. We expect that we can improve corpus building activities both in terms of the numbers of named entity classes being covered, as well as the size of the corpus in terms of annotated documents.
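
As a rough illustration of harmonising annotations by voting, the sketch below keeps an annotation when at least a chosen number of systems agree on it. This is a simplification and not the CALBC procedure: it requires exact boundary and group matches, whereas the abstract describes boundary reconciliation and semantic matching. The `harmonise` function, the `min_votes` threshold and the sample offsets are hypothetical.

```python
from collections import Counter

# (start offset, end offset, semantic group); hypothetical representation.
Annotation = tuple[int, int, str]


def harmonise(system_outputs: list[set[Annotation]], min_votes: int = 2) -> set[Annotation]:
    """Keep annotations proposed by at least `min_votes` systems."""
    votes: Counter = Counter()
    for annotations in system_outputs:
        # Each system's output is a set, so it votes at most once per annotation.
        votes.update(annotations)
    return {annotation for annotation, n in votes.items() if n >= min_votes}


systems = [
    {(0, 7, "PRGE"), (15, 22, "DISO")},
    {(0, 7, "PRGE"), (30, 35, "SPE")},
    {(0, 7, "PRGE"), (15, 22, "DISO")},
]
silver = harmonise(systems)
print(sorted(silver))  # [(0, 7, 'PRGE'), (15, 22, 'DISO')]
```

Raising `min_votes` trades coverage for agreement: with `min_votes=3` only the annotation shared by all three toy systems survives.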


Subject(s)
Computational Biology/standards , Data Mining/standards , Cooperative Behavior , Data Mining/statistics & numerical data , Databases, Factual/statistics & numerical data , Unified Medical Language System
6.
BMC Bioinformatics ; 9 Suppl 11: S4, 2008 Nov 19.
Article in English | MEDLINE | ID: mdl-19025690

ABSTRACT

BACKGROUND: Chemical named entities represent an important facet of biomedical text. RESULTS: We have developed a system to use character-based n-grams, Maximum Entropy Markov Models and rescoring to recognise chemical names and other such entities, and to make confidence estimates for the extracted entities. An adjustable threshold allows the system to be tuned to high precision or high recall. At a threshold set for balanced precision and recall, we were able to extract named entities at an F score of 80.7% from chemistry papers and 83.2% from PubMed abstracts. Furthermore, we were able to achieve 57.6% and 60.3% recall at 95% precision, and 58.9% and 49.1% precision at 90% recall. CONCLUSION: These results show that chemical named entities can be extracted with good performance, and that the properties of the extraction can be tuned to suit the demands of the task.
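
The effect of an adjustable confidence threshold can be sketched as follows. The candidate list, the gold-standard count and the function name are hypothetical and do not come from the paper; the sketch only shows how raising the threshold trades recall for precision.

```python
def precision_recall(scored, n_gold, threshold):
    """Precision and recall when keeping candidates at or above `threshold`.

    scored: list of (confidence, is_correct) pairs for extracted candidates.
    n_gold: number of entities in the gold standard.
    """
    kept = [is_correct for confidence, is_correct in scored if confidence >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical candidates: (confidence estimate, matches a gold entity?)
candidates = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]

print(precision_recall(candidates, n_gold=4, threshold=0.8))  # (1.0, 0.5)
print(precision_recall(candidates, n_gold=4, threshold=0.5))  # (0.75, 0.75)
```

In this toy data the lower threshold gives balanced precision and recall, while the higher one favours precision, which is the kind of tuning the abstract describes.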


Subject(s)
Computational Biology/methods , Information Storage and Retrieval/methods , Algorithms , Models, Chemical , Models, Statistical , Natural Language Processing , Software , Terminology as Topic
7.
J Am Chem Soc ; 130(33): 10834-5, 2008 Aug 20.
Article in English | MEDLINE | ID: mdl-18646752

ABSTRACT

A simple water-soluble naphthalenedithiol building block is converted quantitatively into a series of octameric [2]-catenanes, composed of two interlocked molecular squares. When this mixture is re-equilibrated in the presence of an adamantyl ammonium guest, the catenanes disassemble into their macrocyclic components that bind the guest with nanomolar affinity in water.


Subject(s)
Catenanes/chemistry , Catenanes/chemical synthesis , Combinatorial Chemistry Techniques/methods , Sulfhydryl Compounds/chemistry , Sulfhydryl Compounds/chemical synthesis , Chromatography, High Pressure Liquid/methods , Cyclization , Magnetic Resonance Spectroscopy/methods , Models, Molecular , Molecular Structure , Particle Size , Solubility , Water/chemistry
8.
Chemistry ; 14(7): 2153-66, 2008.
Article in English | MEDLINE | ID: mdl-18081129

ABSTRACT

Herein we describe an extensive study of the response of a set of closely related dynamic combinatorial libraries (DCLs) of macrocyclic receptors to the introduction of a focused range of guest molecules. We have determined the amplification of two sets of diastereomeric receptors induced by a series of neutral and cationic guests, including biologically relevant compounds such as acetylcholine and morphine. The host-guest binding affinities were investigated using isothermal titration calorimetry. The resulting dataset enabled a detailed analysis of the relationship between the amplification of selected receptors and host-guest Gibbs binding energies, giving insight into the factors affecting the design, simulation and interpretation of DCL experiments. In particular, two questions were addressed: Is amplification by a given guest selective for the best receptor? And does the best guest induce the largest amplification of a given receptor? Our experimental results and computer simulations showed that the relative levels of amplification of hosts by a guest are well-correlated with their relative affinities, and simulations have confirmed previous observations that amplification can be selective for the best receptor when only modest amounts of guest are used. In contrast, the correlation between guest binding and the extent of amplification of a given receptor across a wide range of guests tends to be poorer, because every guest has its own unique set of affinities for competing receptors in the DCL. This implies that the results of screening a DCL for selective receptors by comparing the response of the mixture to two different guests should be interpreted with caution. DCLs are complex mixtures in which all compounds are connected through a set of equilibria. Obtaining quantitative information about all host-guest binding constants from such systems will require the explicit and simultaneous consideration of all of the main equilibria within a DCL.


Subject(s)
Combinatorial Chemistry Techniques , Macrocyclic Compounds/chemistry , Small Molecule Libraries/chemistry , Binding Sites , Computer Simulation , Macrocyclic Compounds/chemical synthesis , Models, Chemical , Molecular Structure , Reproducibility of Results , Small Molecule Libraries/chemical synthesis , Stereoisomerism , Water/chemistry
11.
Ecology ; 87(7): 1722-32, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16922322

ABSTRACT

Forest fire occurrence is affected by multiple controls that operate at local to regional scales. At the spatial scale of forest stands, regional climatic controls may be obscured by local controls (e.g., stochastic ignitions, topography, and fuel loads), but the long-term role of such local controls is poorly understood. We report here stand-scale (<100 ha) fire histories of the past 5000 years based on the analysis of sediment charcoal at two lakes 11 km apart in southeastern British Columbia. The two lakes are today located in similar subalpine forests, and they likely have experienced the same late-Holocene climatic changes because of their close proximity. We evaluated two independent properties of fire history: (1) fire-interval distribution, a measure of the overall incidence of fire, and (2) fire synchroneity, a measure of the co-occurrence of fire (here, assessed at centennial to millennial time scales due to the resolution of sediment records). Fire-interval distributions differed between the sites prior to, but not after, 2500 yr before present. When the entire 5000-yr period is considered, no statistical synchrony between fire-episode dates existed between the two sites at any temporal scale, but for the last 2500 yr marginal levels of synchrony occurred at centennial scales. Each individual fire record exhibited little coherency with regional climate changes. In contrast, variations in the composite record (average of both sites) matched variations in climate evidenced by late-Holocene glacial advances. This was probably due to the increased sample size and spatial extent represented by the composite record (up to 200 ha) plus increased regional climatic variability over the last several millennia, which may have partially overridden local, non-climatic controls. 
We conclude that (1) over past millennia, neighboring stands with similar modern conditions may have experienced different fire intervals and asynchronous patterns in fire episodes, likely because local controls outweighed the synchronizing effect of climate; (2) the influence of climate on fire occurrence is more strongly expressed when climatic variability is relatively great; and (3) multiple records from a region are essential if climate-fire relations are to be reliably described.


Subject(s)
Climate , Ecosystem , Fires/history , British Columbia , Geologic Sediments , History, 15th Century , History, 16th Century , History, 17th Century , History, 18th Century , History, 19th Century , History, 20th Century , History, Ancient , History, Medieval , Time Factors , Trees/physiology
12.
J Am Chem Soc ; 127(25): 8902-3, 2005 Jun 29.
Article in English | MEDLINE | ID: mdl-15969538

ABSTRACT

A high-affinity, induced-fit receptor for NMe4I was discovered using dynamic combinatorial chemistry. The addition of the guest to a dynamic combinatorial library made using a racemic mixture of chiral building blocks caused the strong and highly diastereoselective amplification of the receptor at the expense of other library components. The receptor and its mode of binding were characterized by NMR, ITC, and re-equilibration experiments, from which it was deduced that the receptor probably forms a folded four-stave barrel shape on binding of the guest.


Subject(s)
Combinatorial Chemistry Techniques/methods , Disulfides/chemistry , Heterocyclic Compounds/chemistry , Macrocyclic Compounds/chemistry , Cyclization , Disulfides/chemical synthesis , Heterocyclic Compounds/chemical synthesis , Macrocyclic Compounds/chemical synthesis , Models, Molecular , Molecular Structure , Stereoisomerism , Thermodynamics
13.
J Am Chem Soc ; 127(26): 9390-2, 2005 Jul 06.
Article in English | MEDLINE | ID: mdl-15984865

ABSTRACT

Dynamic combinatorial chemistry is a powerful tool for the discovery of strong binders (synthetic receptors or ligands) because binding causes a shift in the equilibrium of library members toward those that bind well. Ideally, the best binders are selectively amplified. However, theoretical studies predict this is not always the case. This paper describes the first quantitative experimental evidence proving that, under special circumstances, the preferential amplification of suboptimal synthetic receptors can indeed occur. Our results also demonstrate that reducing the amount of guest in the library can rectify such undesirable behavior and ensures selective amplification of the fittest receptor.

14.
Chemistry ; 10(13): 3139-43, 2004 Jul 05.
Article in English | MEDLINE | ID: mdl-15224322

ABSTRACT

We present a versatile computer model of diverse dynamic combinatorial libraries, and examine how molecular recognition between library members and a template can be used to amplify the best binders. The correlation between host-guest binding and amplification was examined for a set of 50 libraries with >300 components each over a wide range of template and building block concentrations. Depending on these concentrations correlations vary from poor (when using a large excess of template) to good (for very dilute libraries and/or substoichiometric template concentrations), highlighting the need to choose the experimental conditions for dynamic combinatorial libraries thoughtfully.

15.
Org Lett ; 6(11): 1825-7, 2004 May 27.
Article in English | MEDLINE | ID: mdl-15151424

ABSTRACT

Using simple computer simulations of model dynamic combinatorial libraries, we show that the best binders can be amplified to useful concentrations in libraries containing 10 to 10^6 compounds.

16.
Chronic Dis Can ; 23(3): 111-9, 2002.
Article in English | MEDLINE | ID: mdl-12443567

ABSTRACT

An age-stratified population-based random digit dial (RDD) telephone survey determined awareness and prevalence of prostate-specific antigen (PSA) testing among Alberta men aged 40-74 years, and assessed the role of indications for PSA testing in explaining patterns of PSA testing. The sample of 1984 men (participation rate 65%) with no history of prostate cancer was divided into three age strata: 40-49, 50-59, and 60-74 years. Awareness of PSA tests was low, with fewer than half of the men indicating they had ever heard of PSA tests. The percentage of men who had ever had PSA testing was 4.5%, 13.1%, and 22.2% in the three age strata respectively. PSA testing was strongly associated with having at least one clinical indication for PSA testing (prevalence 21.8%, 26.9%, and 42.2% respectively). PSA testing rates were very low among men who had no clinical indications for PSA testing, suggesting infrequent PSA screening prior to the survey. PSA testing patterns in this population-based sample were consistent with Alberta clinical practice guidelines.


Subject(s)
Health Knowledge, Attitudes, Practice , Mass Screening/statistics & numerical data , Prostate-Specific Antigen , Prostatic Neoplasms/prevention & control , Adult , Age Distribution , Aged , Alberta , Humans , Logistic Models , Male , Middle Aged , Socioeconomic Factors