1.
J Optim Theory Appl ; 141(2): 429-443, 2009 May.
Article in English | MEDLINE | ID: mdl-29456266

ABSTRACT

The pool adjacent violators (PAV) algorithm is an efficient technique for the class of isotonic regression problems with complete ordering. The algorithm yields a stepwise isotonic estimate which approximates the function and assigns maximum likelihood to the data. However, if one has reasons to believe that the data were generated by a continuous function, a smoother estimate may provide a better approximation to that function. In this paper, we consider the formulation which assumes that the data were generated by a continuous monotonic function obeying the Lipschitz condition. We propose a new algorithm, the Lipschitz pool adjacent violators (LPAV) algorithm, which approximates that function; we prove the convergence of the algorithm and examine its complexity.
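
As a point of reference for the abstract above, the classical PAV step can be sketched in a few lines. This is a minimal least-squares version for a complete ordering, not the Lipschitz variant (LPAV) the paper proposes; the function name `pav` and the optional weights argument are illustrative assumptions.

```python
def pav(y, w=None):
    """Nondecreasing least-squares fit to y by pooling adjacent violators.

    Classical PAV only; the smoothing/Lipschitz constraint of LPAV is
    NOT implemented here.
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool backwards while the last two blocks violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the blocks back into a stepwise estimate, one value per input.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

For example, `pav([1, 3, 2, 4])` pools the violating pair (3, 2) into a single step at 2.5, which illustrates the stepwise shape of the estimate that the abstract contrasts with a smoother fit.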

2.
Comput Biol Chem ; 28(5-6): 387-91, 2004 Dec.
Article in English | MEDLINE | ID: mdl-15556479

ABSTRACT

Ontology construction requires an understanding of the meaning and usage of its encoded concepts. While definitions found in dictionaries or glossaries may be adequate for many concepts, the actual usage in expert writing could be a better source of information for many others. The goal of this paper is to describe an automated procedure for finding definitional content in expert writing. The approach uses machine learning on phrasal features to learn when sentences in a book contain definitional content, as determined by their similarity to glossary definitions provided in the same book. The end result is not a concise definition of a given concept, but for each sentence, a predicted probability that it contains information relevant to a definition. The approach is evaluated automatically for terms with explicit definitions, and manually for terms with no available definition.


Subjects
Computational Biology , Information Storage and Retrieval/methods , Bayes Theorem , Software Validation
3.
Comput Biol Chem ; 28(2): 97-107, 2004 Apr.
Article in English | MEDLINE | ID: mdl-15130538

ABSTRACT

Gene and protein names follow few, if any, true naming conventions and are subject to great variation in different occurrences of the same name. This gives rise to two important problems in natural language processing. First, can one locate the names of genes or proteins in free text, and second, can one determine when two names denote the same gene or protein? The first of these problems is a special case of the problem of named entity recognition, while the second is a special case of the problem of automatic term recognition (ATR). We study the second problem, that of gene or protein name variation. Here we describe a system which, given a query gene or protein name, identifies related gene or protein names in a large list. The system is based on a dynamic programming algorithm for sequence alignment in which the mutation matrix is allowed to vary under the control of a fully trainable hidden Markov model.


Subjects
Genes , Information Storage and Retrieval/methods , Proteins , Terminology as Topic , Algorithms , Databases as Topic , Markov Chains , Natural Language Processing , Pattern Recognition, Automated
4.
Bioinformatics ; 20(14): 2320-1, 2004 Sep 22.
Article in English | MEDLINE | ID: mdl-15073016

ABSTRACT

SUMMARY: We present a part-of-speech tagger that achieves over 97% accuracy on MEDLINE citations. AVAILABILITY: Software, documentation and a corpus of 5700 manually tagged sentences are available at ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz


Subjects
Abstracting and Indexing/methods , Documentation/methods , MEDLINE , Natural Language Processing , Periodicals as Topic , Semantics , Software , Biology/methods , Information Storage and Retrieval/methods , Medicine/methods , Vocabulary, Controlled
5.
Comput Biol Chem ; 27(1): 77-84, 2003 Feb.
Article in English | MEDLINE | ID: mdl-12798042

ABSTRACT

We present a formulation of the Needleman-Wunsch type algorithm for sequence alignment in which the mutation matrix is allowed to vary under the control of a hidden Markov process. The fully trainable model is applied to two problems in bioinformatics: the recognition of related gene/protein names and the alignment and scoring of homologous proteins.


Subjects
Computational Biology , Markov Chains , Models, Statistical , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Animals , Computational Biology/methods , Computational Biology/statistics & numerical data , Databases, Genetic , Genes/genetics , Humans , MEDLINE/statistics & numerical data , Mice , Sequence Homology, Nucleic Acid
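
The Needleman-Wunsch recurrence that this entry builds on can be sketched as follows. This is the textbook global alignment with a fixed scoring scheme; the hidden Markov control of the mutation matrix described in the abstract is not modelled, and the `match`, `mismatch` and `gap` values are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b.

    Fixed substitution scores only; a trainable, state-dependent
    mutation matrix as in the paper is NOT implemented.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # substitute/match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

Applied to name variants such as `needleman_wunsch("P53", "TP53")`, the alignment pays one gap for the leading "T" and matches the rest, which is the kind of variation the name-matching application has to score.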
6.
Comput Chem ; 25(4): 411-22, 2001 Jul.
Article in English | MEDLINE | ID: mdl-11459355

ABSTRACT

The determination of a protein's structure from the knowledge of its linear chain is one of the important problems that remains as a bottleneck in interpreting the rapidly increasing repository of genetic sequence data. One approach to this problem that has shown promise and given a measure of success is threading. In this approach contact energies between different amino acids are first determined by statistical methods applied to known structures. These contact energies are then applied to a sequence whose structure is to be determined by threading it through various known structures and determining the total threading energy for each candidate structure. That structure that yields the lowest total energy is then considered the leading candidate among all the structures tested. Additional information is often needed in order to support the results of threading studies, as it is well known in the field that the contact potentials used are not sufficiently sensitive to allow definitive conclusions. Here, we investigate the hypothesis that the environment of an amino acid residue realized as all those residues not local to it on the chain but sufficiently close spatially can supply information predictive of the type of that residue that is not adequately reflected in the individual contact energies. We present evidence that confirms this hypothesis and suggests a high order cooperativity between the residues that surround a given residue and how they interact with it. We suggest a possible application to threading.


Subjects
Amino Acids/chemistry , Proteins/chemistry , Algorithms , Bayes Theorem , Computer Simulation , Mathematics , Protein Conformation , Sequence Homology, Amino Acid
7.
Proc AMIA Symp ; : 319-23, 2001.
Article in English | MEDLINE | ID: mdl-11825203

ABSTRACT

For computational purposes documents or other objects are most often represented by a collection of individual attributes that may be strings or numbers. Such attributes are often called features and success in solving a given problem can depend critically on the nature of the features selected to represent documents. Feature selection has received considerable attention in the machine learning literature. In the area of document retrieval we refer to feature selection as indexing. Indexing has not traditionally been evaluated by the same methods used in machine learning feature selection. Here we show how indexing quality may be evaluated in a machine learning setting and apply this methodology to results of the Indexing Initiative at the National Library of Medicine.


Subjects
Abstracting and Indexing/methods , Electronic Data Processing , Subject Headings , Algorithms , Artificial Intelligence , Bayes Theorem , National Library of Medicine (U.S.) , Unified Medical Language System , United States
8.
Proc AMIA Symp ; : 17-21, 2000.
Article in English | MEDLINE | ID: mdl-11079836

ABSTRACT

The objective of NLM's Indexing Initiative (IND) is to investigate methods whereby automated indexing methods partially or completely substitute for current indexing practices. The project will be considered a success if methods can be designed and implemented that result in retrieval performance that is equal to or better than the retrieval performance of systems based principally on humanly assigned index terms. We describe the current state of the project and discuss our plans for the future.


Subjects
Abstracting and Indexing/methods , MEDLINE , Natural Language Processing , Subject Headings , Unified Medical Language System , Information Storage and Retrieval , National Library of Medicine (U.S.) , United States
9.
Proc AMIA Symp ; : 918-22, 2000.
Article in English | MEDLINE | ID: mdl-11080018

ABSTRACT

We are concerned with the rating of new documents that appear in a large database (MEDLINE) and are candidates for inclusion in a small specialty database (REBASE). The requirement is to rank the new documents as nearly as possible in order of decreasing potential to be added to the smaller database, so as to improve the coverage of the smaller database without increasing the effort of those who manage this specialty database. To perform this ranking task we have considered several machine learning approaches based on the naïve Bayesian algorithm. We find that adaptive boosting outperforms naïve Bayes, but that a new form of boosting which we term staged Bayesian retrieval outperforms adaptive boosting. Staged Bayesian retrieval involves two stages of Bayesian retrieval and we further find that if the second stage is replaced by a support vector machine we again obtain a significant improvement over the strictly Bayesian approach.


Subjects
Algorithms , Artificial Intelligence , Databases as Topic , Information Storage and Retrieval/methods , Bayes Theorem , Classification , MEDLINE
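
The baseline in this entry, ranking candidate documents with naïve Bayes, might look roughly like the sketch below. It covers only the plain naïve Bayes baseline under simplifying assumptions (binary term presence, Laplace smoothing, scores summed over present terms only); adaptive boosting, staged Bayesian retrieval and the support vector machine stage are not shown, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels):
    """Return per-term log-odds weights from tokenized docs and 0/1 labels."""
    pos, neg = Counter(), Counter()
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for tokens, label in zip(docs, labels):
        # Count each term once per document (binary presence).
        (pos if label else neg).update(set(tokens))
    weights = {}
    for term in set(pos) | set(neg):
        # Laplace-smoothed estimates of P(term present | class).
        p = (pos[term] + 1) / (n_pos + 2)
        q = (neg[term] + 1) / (n_neg + 2)
        weights[term] = math.log(p / q)
    return weights


def score(tokens, weights):
    """Sum the log-odds of known terms present in the document."""
    return sum(weights.get(t, 0.0) for t in set(tokens))


def rank(candidates, weights):
    """Order candidate documents by decreasing naive Bayes score."""
    return sorted(candidates, key=lambda toks: score(toks, weights), reverse=True)
```

The class prior is omitted because it adds the same constant to every document and so does not change the ranking.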
10.
Article in English | MEDLINE | ID: mdl-10977093

ABSTRACT

The immense volume of data resulting from DNA microarray experiments, accompanied by an increase in the number of publications discussing gene-related discoveries, presents a major data analysis challenge. Current methods for genome-wide analysis of expression data typically rely on cluster analysis of gene expression patterns. Clustering indeed reveals potentially meaningful relationships among genes, but cannot explain the underlying biological mechanisms. In an attempt to address this problem, we have developed a new approach for utilizing the literature in order to establish functional relationships among genes on a genome-wide scale. Our method is based on revealing coherent themes within the literature, using a similarity-based search in document space. Content-based relationships among abstracts are then translated into functional connections among genes. We describe preliminary experiments applying our algorithm to a database of documents discussing yeast genes. A comparison of the produced results with well-established yeast gene functions demonstrates the effectiveness of our approach.


Subjects
DNA/genetics , Oligonucleotide Array Sequence Analysis , Sequence Analysis, DNA/methods , Animals , Humans
11.
J Am Med Inform Assoc ; 7(5): 499-511, 2000.
Article in English | MEDLINE | ID: mdl-10984469

ABSTRACT

PURPOSE: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material. METHOD: The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database. MEASUREMENTS: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings. RESULT: The authors find that each of the six scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable both to word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases. CONCLUSION: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.


Subjects
Abstracting and Indexing , Hypermedia , Information Storage and Retrieval/methods , Natural Language Processing , Unified Medical Language System , MEDLINE , Statistics as Topic , Vocabulary, Controlled
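
The 11-point average precision named under MEASUREMENTS can be computed from a ranked list as in the sketch below. It assumes the common interpolated definition (at each recall level 0.0, 0.1, ..., 1.0, take the highest precision reached at that recall or beyond); the abstract does not spell out which variant was used, and the function name is illustrative.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant=None):
    """Interpolated 11-point average precision for a ranked list.

    ranked_relevance: booleans in rank order, True where the ranked item
    is in the gold standard (for instance, a phrase found in UMLS).
    total_relevant: number of gold-standard items overall; defaults to the
    number of True entries in the list.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    # Record (recall, precision) each time a relevant item is retrieved.
    points = []
    hits = 0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            points.append((hits / total_relevant, hits / position))
    total = 0.0
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at recall >= level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        total += max(reachable, default=0.0)
    return total / 11
```

A ranking that places every gold-standard phrase ahead of every other phrase scores 1.0, which matches the abstract's criterion that a good method ranks the UMLS phrases high.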
12.
Comput Chem ; 24(1): 33-42, 2000 Jan.
Article in English | MEDLINE | ID: mdl-10642878

ABSTRACT

Classical information theory concerns itself with communication through a noisy channel and how much one can infer about the channel input from a knowledge of the channel output. Because the channel is noisy the input and output are only related statistically and the rate of information transmission is a statistical concept with little meaning for the individual symbol used in transmission. Here we develop a more intuitive notion of information that is concerned with asking the right questions--that is, with finding those questions whose answer conveys the most information. We call this confirmatory information. In the first part of the paper we develop the general theory, show how it relates to classical information theory, and how in the special case of search problems it allows us to quantify the efficacy of information transmission regarding individual events. That is, confirmatory information measures how well a search for items having certain observable properties retrieves items having some unobserved property of interest. Thus confirmatory information facilitates a useful analysis of search problems and contrasts with classical information theory, which quantifies the efficiency of information transmission but is indifferent to the nature of the particular information being transmitted. The last part of the paper presents several examples where confirmatory information is used to quantify protein structural properties in a search setting.


Subjects
Information Theory , Proteins/chemistry , Protein Structure, Secondary , Protein Structure, Tertiary
13.
Proc AMIA Symp ; : 176-80, 1999.
Article in English | MEDLINE | ID: mdl-10566344

ABSTRACT

At the National Library of Medicine (NLM), a variety of biomedical vocabularies are found in data pertinent to its mission. In addition to standard medical terminology, there are specialized vocabularies including that of chemical nomenclature. Normal language tools including the lexically based ones used by the Unified Medical Language System (UMLS) to manipulate and normalize text do not work well on chemical nomenclature. In order to improve NLM's capabilities in chemical text processing, two approaches to the problem of recognizing chemical nomenclature were explored. The first approach was a lexical one and consisted of analyzing text for the presence of a fixed set of chemical segments. The approach was extended with general chemical patterns and also with terms from NLM's indexing vocabulary, MeSH, and the NLM SPECIALIST lexicon. The second approach applied Bayesian classification to n-grams of text via two different methods. The single lexical method and two statistical methods were tested against data from the 1999 UMLS Metathesaurus. One of the statistical methods had an overall classification accuracy of 97%.


Subjects
Inorganic Chemicals/classification , Natural Language Processing , Organic Chemicals/classification , Vocabulary, Controlled , Algorithms , Bayes Theorem , Classification/methods , Evaluation Studies as Topic , Subject Headings , Terminology as Topic , Unified Medical Language System
14.
Fold Des ; 3(1): 51-65, 1998.
Article in English | MEDLINE | ID: mdl-9502320

ABSTRACT

BACKGROUND: A common approach to the protein folding problem involves computer simulation of folding using lattice models of amino acid sequences. Key factors for good performance in such models are the correct choice of the temperature and the average interaction energy between residues. In order to push the lattice approach to its limit it is important to have a method to adjust these parameters for optimal folding that is not limited by our ability to successfully simulate folding in a reasonable time. RESULTS: In this study, we adopt a simple cubic-lattice model and present a method for calculating the free energy of a chain as a function of the number of native contacts. This does not require that we are able to fold the sequence by simulation and it provides a method of estimating the folding transition temperature. For a given set of parameters, the free energy analysis also allows an estimate of foldability. By applying the method to sequences with 27 and 125 residues, we show that optimal folding occurs near the folding transition temperature and at either zero or small negative average interaction energy. We find ourselves able to fold only 125-mers that have significant short-range native contacts. CONCLUSIONS: A free energy analysis during unfolding is a useful tool for the study of foldability and should be applicable to a variety of folding models. In this way we are able to fold some 125-mer designed sequences and our results confirm the finding that short-range contacts contribute to foldability.


Subjects
Protein Folding , Markov Chains , Models, Chemical , Protein Denaturation
15.
Comput Biol Med ; 26(3): 209-22, 1996 May.
Article in English | MEDLINE | ID: mdl-8725772

ABSTRACT

The biological literature presents a difficult challenge to information processing in its complexity, diversity, and in its sheer volume. Much of the diversity resides in its technical terminology, which has also become voluminous. In an effort to deal more effectively with this large vocabulary and improve information processing, a method of focus has been developed which allows one to classify terms based on a measure of their importance in describing the content of the documents in which they occur. The measurement is called the strength of a term and is a measure of how strongly the term's occurrences correlate with the subjects of documents in the database. If term occurrences are random then there will be no correlation and the strength will be zero, but if for any subject, the term is either always present or never present its strength will be one. We give here a new, information theoretical interpretation of term strength, review some of its uses in focusing the processing of documents for information retrieval and describe new results obtained in document categorization.


Subjects
Abstracting and Indexing , Information Storage and Retrieval , Molecular Biology , Subject Headings , Algorithms , Bayes Theorem , Bias , Humans , Least-Squares Analysis , MEDLINE , Reproducibility of Results
16.
Biopolymers ; 38(4): 447-59, 1996 Apr.
Article in English | MEDLINE | ID: mdl-8867208

ABSTRACT

Given a probability distribution from which the energy spectrum of a random peptide is to be sampled, we derive a general expression for the probability that such a peptide will fold to a unique native state and for the probability distribution of the native energy. This latter result allows us to localize the energy of folding based on model parameters and is one advantage of our formulation. Evidence from both the lattice theory of proteins and protein threading experiments suggests that the energy spectrum for the compact states of a peptide chain is Gaussian in form. For this reason we have derived from the more general framework the specific formulas that apply in the Gaussian case, where one requires only the number of states and the variance of the Gaussian distribution in order to apply the theory. This simplicity allows us to perform calculations that we compare with calculations previously made by others based on statistical thermodynamics. We find qualitative agreement, but a significant correction to prior estimates of folding probability derived from the Gaussian assumption is necessary.


Subjects
Peptides/chemistry , Protein Folding , Mathematical Computing
17.
Proc Biol Sci ; 245(1312): 7-11, 1991 Jul 22.
Article in English | MEDLINE | ID: mdl-1682931

ABSTRACT

We examine a model evolutionary space consisting of genotypes mapped to their corresponding phenotypes. This mapping is derived from a lattice model for proteins which, despite its highly idealized nature, has been shown to share general properties with real proteins. Large evolutionary networks are observed, with genotypes corresponding to non-lethal phenotypes linked by unit mutational steps. Neutral mutations are necessary for traversing the evolutionary networks, and even one neutral mutation in a genotype can change the phenotypes attainable by a unit mutational step.


Subjects
Biological Evolution , Models, Genetic , Proteins/genetics , Genotype , Mutation , Phenotype , Protein Conformation , Proteins/chemistry , Selection, Genetic
18.
J Acoust Soc Am ; 80(1): 133-45, 1986 Jul.
Article in English | MEDLINE | ID: mdl-3745659

ABSTRACT

A mathematical model of cochlear processing is developed to account for the nonlinear dependence of frequency selectivity on intensity in inner hair cell and auditory nerve fiber responses. The model describes the transformation from acoustic stimulus to intracellular hair cell potentials in the cochlea. It incorporates a linear formulation of basilar membrane mechanics and subtectorial fluid-cilia displacement coupling, and a simplified description of the inner hair cell nonlinear transduction process. The analysis at this stage is restricted to low-frequency single tones. The computed responses to single tone inputs exhibit the experimentally observed nonlinear effects of increasing intensity such as the increase in the bandwidth of frequency selectivity and the downward shift of the best frequency. In the model, the first effect is primarily due to the saturating effect of the hair cell nonlinearity. The second results from the combined effects of both the nonlinearity and of the inner hair cell low-pass transfer function. In contrast to these shifts along the frequency axis, the model does not exhibit intensity dependent shifts of the spatial location along the cochlea of the peak response for a given single tone. The observed shifts therefore do not contradict an intensity invariant tonotopic code.


Subjects
Cochlea/physiology , Models, Biological , Acoustic Stimulation , Basilar Membrane/physiology , Cilia/physiology , Hair Cells, Auditory, Inner/physiology , Hearing , Humans , Mathematics , Models, Neurological , Nerve Fibers/physiology , Tectorial Membrane/physiology , Vestibulocochlear Nerve/physiology
19.
Mol Biol Evol ; 2(5): 434-47, 1985 Sep.
Article in English | MEDLINE | ID: mdl-3870870

ABSTRACT

The internal consistency of the PAM matrix model of protein evolution is here investigated. The 1 PAM matrix has been constructed from amino acid replacements observed in closely related sequences. Such replacements are of two types, those that do not require an intermediate amino acid replacement and those that do. The second type of replacement must generally be produced by a repetition of the first. This allows data on the first type to be used in predicting data on the second type so that some elements of the 1 PAM matrix may be used to predict others. A discrepancy of more than two orders of magnitude is found between the predictions and the data when this is carried out. This is partly accounted for by an error in constructing the matrix. However, it also seems necessary that the basic model be modified. Several possibilities are considered. One of these is to incorporate a site-dependent spectrum of mutabilities associated with each amino acid.


Subjects
Biological Evolution , Proteins/genetics , Amino Acid Sequence , Biometry , Models, Genetic
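
The consistency check described in this entry rests on a Markov property: a replacement that needs an intermediate step should occur at the rate given by applying the one-step matrix twice. The sketch below shows that arithmetic on a made-up three-state matrix. The numbers are not PAM data, and indexing rows by the original residue is an assumption of this example rather than the convention of the paper.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, steps):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step replacement probabilities for states A, B, C.
# A -> C has no direct route (probability 0) and must pass through B.
TOY_MATRIX = [
    [0.99, 0.01, 0.00],
    [0.01, 0.98, 0.01],
    [0.00, 0.01, 0.99],
]

# Two-step prediction for A -> C: P(A -> B) * P(B -> C) = 0.01 * 0.01.
two_step = mat_pow(TOY_MATRIX, 2)
predicted_a_to_c = two_step[0][2]
```

Comparing a predicted entry such as `predicted_a_to_c` with the corresponding observed frequency is the type of test in which the abstract reports a discrepancy of more than two orders of magnitude.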
20.
J Mol Evol ; 21(2): 161-7, 1984.
Article in English | MEDLINE | ID: mdl-6442990

ABSTRACT

We examined the codon usages in well-conserved and less-well-conserved regions of vertebrate protein genes and found them to be similar. Despite this similarity, there is a statistically significant decrease in codon bias in the less-well-conserved regions. Our analysis suggests that although those codon changes initially fixed under amino acid replacements tend to follow the overall codon usage pattern, they also reduce the bias in codon usage. This decrease in codon bias leads one to predict that the rate of change of synonymous codons should be greater in those regions that are less well conserved at the amino acid level than in the better-conserved regions. Our analysis supports this prediction. Furthermore, we demonstrate a significantly elevated rate of change of synonymous codons among the adjacent codons 5' to amino acid replacement positions. This provides further support for the idea that there are contextual constraints on the choice of synonymous codons in eukaryotes.


Subjects
Cell Physiological Phenomena , Codon , Eukaryotic Cells/physiology , Genetic Code , Proteins/genetics , RNA, Messenger , Selection, Genetic , Genes