Results 1 - 12 of 12
1.
Patterns (N Y) ; 5(3): 100924, 2024 Mar 08.
Article in English | MEDLINE | ID: mdl-38487799

ABSTRACT

Combining classification systems potentially improves predictive accuracy, but outcomes have proven impossible to predict. Similar to improving binary classification with fusion, fusing ranking systems most commonly increases Pearson or Spearman correlations with a target when the input classifiers are "sufficiently good" (generalized as "accuracy") and "sufficiently different" (generalized as "diversity"), but the individual and joint quantitative influence of these factors on the final outcome remains unknown. We resolve these issues. Building on our previous empirical work establishing the DIRAC (DIversity of Ranks and ACcuracy) framework, which accurately predicts the outcome of fusing binary classifiers, we demonstrate that the DIRAC framework similarly explains the outcome of fusing ranking systems. Specifically, precise geometric representation of diversity and accuracy as angle-based distances within rank-based combinatorial structures (permutahedra) fully captures their synergistic roles in rank approximation, uncouples them from the specific metrics of a given problem, and represents them as generally as possible.
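The angle-based view of rank accuracy and diversity described above can be illustrated with a minimal sketch. It relies on one standard fact: the cosine of the angle between two rank vectors, measured from the centroid of the permutahedron, equals their Spearman correlation. The function names and the toy scores are hypothetical, and this is not claimed to be the DIRAC definition of either quantity.

```python
import math

def to_ranks(scores):
    """Rank 1 = highest score. Assumes distinct scores (no tie handling)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def angle_between_rankings(ranks_a, ranks_b):
    """Angle in radians between two rank vectors, measured from the centroid."""
    center = (len(ranks_a) + 1) / 2  # mean rank, shared by every permutation
    a = [r - center for r in ranks_a]
    b = [r - center for r in ranks_b]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine)

# Hypothetical toy data: a target ranking and two ranking systems.
target = [1, 2, 3, 4, 5]
system_a = to_ranks([0.9, 0.7, 0.8, 0.2, 0.1])
system_b = to_ranks([0.8, 0.9, 0.3, 0.4, 0.1])

accuracy_a = angle_between_rankings(system_a, target)   # smaller = more accurate
accuracy_b = angle_between_rankings(system_b, target)
diversity = angle_between_rankings(system_a, system_b)  # larger = more diverse
print(accuracy_a, accuracy_b, diversity)
```

Because every permutation of 1..n lies at the same distance from the centroid, the angle alone captures how far apart two rankings are, independently of the metric used in a specific problem.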

2.
Patterns (N Y) ; 3(2): 100415, 2022 Feb 11.
Article in English | MEDLINE | ID: mdl-35199065

ABSTRACT

Combining classifier systems potentially improves predictive accuracy, but outcomes have proven impossible to predict. Classification most commonly improves when the classifiers are "sufficiently good" (generalized as "accuracy") and "sufficiently different" (generalized as "diversity"), but the individual and joint quantitative influence of these factors on the final outcome remains unknown. We resolve these issues. Beginning with simulated data, we develop the DIRAC framework (DIversity of Ranks and ACcuracy), which accurately predicts the outcome of both score-based fusions originating from exponentially modified Gaussian distributions and rank-based fusions, which are inherently distribution independent. DIRAC was validated using biological dual-energy X-ray absorption and magnetic resonance imaging data. The DIRAC framework is domain independent and has expected utility in far-ranging areas such as clinical biomarker development/personalized medicine, clinical trial enrollment, insurance pricing, portfolio management, and sensor optimization.
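The difference between score-based and rank-based fusion of two binary classifiers can be sketched as follows. Fusion by simple averaging, min-max normalization, and the hand-picked toy scores are all assumptions made for illustration. None of this is taken from the paper, and the sketch does not implement the DIRAC prediction itself.

```python
def normalize(scores):
    """Min-max scale scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def ranks_of(scores):
    """Rank 1 = highest score. Assumes distinct scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

def score_fusion(a, b):
    """Average of normalized scores (higher = more likely positive)."""
    return [(x + y) / 2 for x, y in zip(normalize(a), normalize(b))]

def rank_fusion(a, b):
    """Average of ranks (lower = more likely positive); uses no score values."""
    return [(x + y) / 2 for x, y in zip(ranks_of(a), ranks_of(b))]

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ordered correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: two imperfect classifiers that err on different items.
labels = [1, 1, 1, 0, 0, 0]
classifier_a = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
classifier_b = [0.6, 0.2, 0.9, 0.1, 0.5, 0.3]

fused_scores = score_fusion(classifier_a, classifier_b)
fused_ranks = rank_fusion(classifier_a, classifier_b)
print(auc(classifier_a, labels), auc(classifier_b, labels))
print(auc(fused_scores, labels))
print(auc([-r for r in fused_ranks], labels))  # negate: a low rank is a high score
```

Rank fusion discards the score distributions entirely, which is why the abstract calls it distribution independent. Score fusion keeps the magnitudes and therefore depends on how the scores are distributed.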

3.
Sensors (Basel) ; 22(3)2022 Jan 29.
Article in English | MEDLINE | ID: mdl-35161807

ABSTRACT

Combinatorial fusion algorithm (CFA) is a machine learning and artificial intelligence (ML/AI) framework for combining multiple scoring systems using the rank-score characteristic (RSC) function and cognitive diversity (CD). When measuring the relevance of a publication or document with respect to the 17 Sustainable Development Goals (SDGs) of the United Nations, a classification scheme is used. However, this classification process is a challenging task due to the overlapping goals and contextual differences of those diverse SDGs. In this paper, we use CFA to combine a topic model classifier (Model A) and a semantic link classifier (Model B) to improve the precision of the classification process. We characterize and analyze each of the individual models using the RSC function and CD between Models A and B. We evaluate the classification results from combining the models using a score combination and a rank combination, when compared to the results obtained from human experts. In summary, we demonstrate that the combination of Models A and B can improve classification precision only if these individual models perform well and are diverse.
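A minimal sketch of the two CFA ingredients named above, the rank-score characteristic (RSC) function and cognitive diversity (CD), is given below. It assumes min-max normalized scores and defines CD as the root-mean-square difference between the two RSC functions. Normalization conventions vary in the CFA literature, so the exact constants here are an assumption, and the model scores are hypothetical.

```python
import math

def rank_score_characteristic(scores):
    """Return [f(1), ..., f(n)], where f(i) is the normalized score at rank i."""
    lo, hi = min(scores), max(scores)
    normalized = [(s - lo) / (hi - lo) for s in scores]
    return sorted(normalized, reverse=True)  # rank 1 holds the highest score

def cognitive_diversity(scores_a, scores_b):
    """Root-mean-square gap between the RSC functions of two scoring systems."""
    f_a = rank_score_characteristic(scores_a)
    f_b = rank_score_characteristic(scores_b)
    squared_gaps = sum((x - y) ** 2 for x, y in zip(f_a, f_b))
    return math.sqrt(squared_gaps / len(f_a))

# Hypothetical scores assigned by two models to the same four documents.
model_a = [0.9, 0.8, 0.7, 0.1]  # scores fall off slowly, then sharply
model_b = [0.9, 0.3, 0.2, 0.1]  # scores fall off sharply after the top item

print(rank_score_characteristic(model_a))
print(rank_score_characteristic(model_b))
print(cognitive_diversity(model_a, model_b))
```

The RSC function depends only on how a system distributes its scores across ranks, not on which item receives which score, so CD compares scoring behavior rather than agreement on individual items.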


Subjects
Artificial Intelligence; Sustainable Development; Global Health; Humans; Machine Learning; United Nations
4.
J Chem Inf Model ; 61(4): 1593-1602, 2021 04 26.
Article in English | MEDLINE | ID: mdl-33797887

ABSTRACT

Combinatorial fusion analysis (CFA) is an approach for combining multiple scoring systems using the rank-score characteristic function and cognitive diversity measure. One example is to combine diverse machine learning models to achieve better prediction quality. In this work, we apply CFA to the synthesis of metal halide perovskites containing organic ammonium cations via inverse temperature crystallization. Using a data set generated by high-throughput experimentation, four individual models (support vector machines, random forests, weighted logistic classifier, and gradient boosted trees) were developed. We characterize each of these scoring systems and explore 66 possible combinations of the models. When measured by the precision on predicting crystal formation, the majority of the combination models improves the individual model results. The best combination models outperform the best individual models by 3.9 percentage points in precision. In addition to improving prediction quality, we demonstrate how the fusion models can be used to identify mislabeled input data and address issues of data quality. In particular, we identify example cases where all single models and all fusion models do not give the correct prediction. Experimental replication of these syntheses reveals that these compositions are sensitive to modest temperature variations across the different locations of the heating element that can hinder or enhance the crystallization process. In summary, we demonstrate that model fusion using CFA can not only identify a previously unconsidered influence on reaction outcome but also be used as a form of quality control for high-throughput experimentation.


Subjects
Machine Learning; Support Vector Machine; Calcium Compounds; Oxides; Titanium
5.
Brain Inform ; 3(1): 63-72, 2016 Mar.
Article in English | MEDLINE | ID: mdl-27747600

ABSTRACT

There are many situations in which a joint decision, based on the observations or decisions of multiple individuals, is desired. The challenge is determining when a combined decision is better than each of the individual systems, along with choosing the best way to perform the combination. It has been shown that the diversity between systems plays a role in the performance of their fusion. This study involved several pairs of people, each viewing an event and reporting an observation, along with their confidence level. Each observer is treated as a visual perception system, and hence an associated scoring system is created based on the observer's confidence. A diversity rank-score function on a set of observation pairs is calculated using the notion of cognitive diversity between two scoring systems in the combinatorial fusion analysis framework. The resulting diversity rank-score function graph provides a powerful visualization tool for the diversity variation among a set of system pairs, helping to identify which system pairs are most likely to show improved performance with combination.

6.
PLoS One ; 8(3): e59484, 2013.
Article in English | MEDLINE | ID: mdl-23544073

ABSTRACT

BACKGROUND: Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on short-read genome assembly using idealized (error-free) reads at different lengths. METHODOLOGY/PRINCIPAL FINDINGS: We define a metric H(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats for the reconstruction of the whole genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de novo assembly coverage of a genome, and a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH(k)<1% are different for various organisms and are independent of the genome size. For example, in order to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for the sequencing of the human genome (3.2×10^9 bp) and 320 bp for the sequencing of the fruit fly genome (1.8×10^8 bp). We also calculated the ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and the entropy of sequencing reads provides a measurement for the repeat structures. CONCLUSIONS/SIGNIFICANCE: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation on the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large eukaryotic genomes. This approach may be useful to tune the sequencing parameters to achieve better genome assemblies when a closely related genome is already available.
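The abstract does not spell out how H(k) and ΔH(k) are computed, so the following is only a sketch under stated assumptions: H(k) is taken to be the Shannon entropy of the distribution of all error-free reads (k-mers) of length k, and ΔH(k) is taken to be the loss relative to the maximum entropy reached when every read is unique. The function names and the toy sequence are hypothetical, and the paper's exact definitions may differ.

```python
import math
from collections import Counter

def read_entropy(genome, k):
    """Assumed H(k): Shannon entropy (bits) of the k-mer distribution."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def relative_entropy_loss(genome, k):
    """Assumed Delta-H(k): fractional loss versus the all-reads-unique maximum.

    Requires at least two read positions, i.e. len(genome) - k + 1 >= 2.
    """
    positions = len(genome) - k + 1
    h_max = math.log2(positions)  # entropy if no read occurs twice
    return (h_max - read_entropy(genome, k)) / h_max

# Hypothetical toy sequence containing the tandem repeat ACGACGACG.
genome = "ACGACGACGTT"
for k in (2, 5, 9):
    print(k, relative_entropy_loss(genome, k))
```

In this toy case the loss shrinks as k grows, because longer reads span the repeat and become unique. That mirrors the abstract's point that a minimal read length exists below which repeats cause a severe loss of information.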


Subjects
Entropy; Genome/genetics; Repetitive Sequences, Nucleic Acid/genetics; Sequence Analysis, DNA/methods; Animals; Bacteria/genetics; Base Pairing/genetics; Base Sequence; Chromosomes/genetics; Chromosomes, Artificial, Bacterial/genetics; Humans; Prokaryotic Cells/metabolism
7.
BMC Genomics ; 13 Suppl 8: S12, 2012.
Article in English | MEDLINE | ID: mdl-23282014

ABSTRACT

BACKGROUND: Due to the recent rapid development in ChIP-seq technologies, which use high-throughput next-generation DNA sequencing to identify the targets of Chromatin Immunoprecipitation, there is an increasing amount of sequencing data being generated that provides us with greater opportunity to analyze genome-wide protein-DNA interactions. In particular, we are interested in evaluating and enhancing computational and statistical techniques for locating protein binding sites. Many peak detection systems have been developed; in this study, we utilize the following six: CisGenome, MACS, PeakSeq, QuEST, SISSRs, and TRLocator. RESULTS: We define two methods to merge and rescore the regions of two peak detection systems and analyze the performance based on average precision and coverage of transcription start sites. The results indicate that ChIP-seq peak detection can be improved by fusion using score or rank combination. CONCLUSION: Our method of combination and fusion analysis would provide a means for generic assessment of available technologies and systems and assist researchers in choosing an appropriate system (or fusion method) for analyzing ChIP-seq data. This analysis offers an alternate approach for increasing true positive rates, while decreasing false positive rates and hence improving the ChIP-seq peak identification process.
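The abstract does not describe the two merge-and-rescore methods, so the sketch below shows only one plausible way to merge the regions of two peak callers and rescore them by score combination. The assumptions are that peaks are closed intervals on a single chromosome, that scores are already normalized to [0, 1], and that a region reported by only one system receives a score of 0 from the other. The function name and the example peaks are hypothetical.

```python
def merge_peaks(peaks_a, peaks_b):
    """Merge overlapping peaks from two systems and rescore by averaging.

    Each peak is a (start, end, score) tuple. Returns merged regions as
    (start, end, fused_score), sorted from highest to lowest fused score.
    """
    tagged = [(s, e, sc, 0) for s, e, sc in peaks_a]
    tagged += [(s, e, sc, 1) for s, e, sc in peaks_b]
    tagged.sort()  # sweep left to right by start coordinate

    merged = []
    for start, end, score, system in tagged:
        if merged and start <= merged[-1]["end"]:
            region = merged[-1]  # overlaps the current region: extend it
            region["end"] = max(region["end"], end)
        else:
            region = {"start": start, "end": end, "scores": [0.0, 0.0]}
            merged.append(region)
        # Keep the best score each system assigned inside this region.
        region["scores"][system] = max(region["scores"][system], score)

    fused = [(r["start"], r["end"], sum(r["scores"]) / 2) for r in merged]
    return sorted(fused, key=lambda region: -region[2])

# Hypothetical peak calls from two detection systems.
system_a = [(100, 200, 0.9), (500, 600, 0.4)]
system_b = [(150, 250, 0.7), (900, 950, 0.8)]

for region in merge_peaks(system_a, system_b):
    print(region)
```

A rank combination would follow the same merge step but average each system's rank of the region instead of its score.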


Subjects
Algorithms; Chromatin Immunoprecipitation; Antibodies/immunology; DNA/metabolism; High-Throughput Nucleotide Sequencing; Humans; Protein Binding; Proteins/immunology; Proteins/metabolism
9.
Int J Comput Biol Drug Des ; 4(3): 274-89, 2011.
Article in English | MEDLINE | ID: mdl-21778560

ABSTRACT

Ligand-based in silico drug screening is useful for lead discovery, in particular for those targets without structures. Here, we have developed LigSeeSVM, a ligand-based screening tool using data fusion and Support Vector Machines (SVMs). We used Atom Pair (AP) structure descriptors and Physicochemical (PC) descriptors of compounds to generate SVM-AP and SVM-PC models. The two models were then combined using rank-based data fusion to create the LigSeeSVM model. LigSeeSVM was evaluated on five data sets. Experimental results show that LigSeeSVM performs better than other ligand-based virtual screening approaches. We believe that LigSeeSVM is useful for discovering lead compounds.


Subjects
Algorithms; Artificial Intelligence; Combinatorial Chemistry Techniques/methods; Computational Biology/methods; Drug Discovery/methods; Databases, Factual; ROC Curve; Receptors, Estrogen/agonists; Receptors, Estrogen/antagonists & inhibitors; Thymidine Kinase/metabolism
10.
IEEE Trans Nanobioscience ; 6(2): 186-96, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17695755

ABSTRACT

The classification of protein structures is essential for their function determination in bioinformatics. At present, a reasonably high rate of prediction accuracy has been achieved in classifying proteins into four classes in the SCOP database according to their primary amino acid sequences. However, for further classification into fine-grained folding categories, especially when the number of possible folding patterns as those defined in the SCOP database is large, it is still quite a challenge. In our previous work, we have proposed a two-level classification strategy called hierarchical learning architecture (HLA) using neural networks and two indirect coding features to differentiate proteins according to their classes and folding patterns, which achieved an accuracy rate of 65.5%. In this paper, we use a combinatorial fusion technique to facilitate feature selection and combination for improving predictive accuracy in protein structure classification. When applying various criteria in combinatorial fusion to the protein fold prediction approach using neural networks with HLA and the radial basis function network (RBFN), the resulting classification has an overall prediction accuracy rate of 87% for four classes and 69.6% for 27 folding categories. These rates are significantly higher than the accuracy rate of 56.5% previously obtained by Ding and Dubchak. Our results demonstrate that data fusion is a viable method for feature selection and combination in the prediction and classification of protein structure.


Subjects
Algorithms; Models, Chemical; Models, Molecular; Pattern Recognition, Automated/methods; Proteins/chemistry; Proteins/ultrastructure; Sequence Analysis, Protein/methods; Artificial Intelligence; Computer Simulation; Databases, Protein; Information Storage and Retrieval/methods; Proteins/classification; Reproducibility of Results; Sensitivity and Specificity
11.
Nucleic Acids Res ; 34(22): 6379-91, 2006.
Article in English | MEDLINE | ID: mdl-17130169

ABSTRACT

The identification of regulatory elements recognized by transcription factors and chromatin remodeling factors is essential to studying the regulation of gene expression. When no auxiliary data, such as orthologous sequences or expression profiles, are used, the accuracy of most tools for motif discovery is strongly influenced by the motif degeneracy and the lengths of sequence. Since suitable auxiliary data may not always be available, more work must be conducted to enhance tool performance to identify transcription elements in the metazoan. A non-alignment-based algorithm, MotifSeeker, is proposed to enhance the accuracy of discovering degenerate motifs. MotifSeeker utilizes the property that variable sites of transcription elements are usually position-specific to reduce exposure to noise. Consequently, the efficiency and accuracy of motif identification are improved. Using data fusion, the ranking process integrates two measures of motif significance, resulting in a more robust significance measure. Testing results for the synthetic data reveal that the accuracy of MotifSeeker is less sensitive to the motif degeneracy and the length of input sequences. Furthermore, MotifSeeker has been tested on a well-known benchmark [M. Tompa, N. Li, T.L. Bailey, G.M. Church, B. De Moor, E. Eskin, A.V. Favorov, M.C. Frith, Y. Fu, W.J. Kent, et al. (2005) Nat. Biotechnol., 23, 137-144], yielding a correlation coefficient of 0.262, which compares favorably with those of other tools. The high applicability of MotifSeeker to biological data is further demonstrated experimentally on regulons of Saccharomyces cerevisiae and liver-specific genes with experimentally verified regulatory elements.


Subjects
Algorithms; Promoter Regions, Genetic; Sequence Analysis, DNA/methods; Binding Sites; Computational Biology/methods; Humans; Liver/metabolism; Regulon; Saccharomyces cerevisiae/genetics; Transcription Factors/metabolism
12.
J Chem Inf Model ; 45(4): 1134-46, 2005.
Article in English | MEDLINE | ID: mdl-16045308

ABSTRACT

MOTIVATION: Virtual screening of molecular compound libraries is a potentially powerful and inexpensive method for the discovery of novel lead compounds for drug development. The major weakness of virtual screening, the inability to consistently identify true positives (leads), is likely due to our incomplete understanding of the chemistry involved in ligand binding and the subsequently imprecise scoring algorithms. It has been demonstrated that combining multiple scoring functions (consensus scoring) improves the enrichment of true positives. Previous efforts at consensus scoring have largely focused on empirical results, but they have yet to provide a theoretical analysis that gives insight into real features of combinations and data fusion for virtual screening. RESULTS: We demonstrate that combining multiple scoring functions improves the enrichment of true positives only if (a) each of the individual scoring functions has relatively high performance and (b) the individual scoring functions are distinctive. Notably, these two prediction variables are previously established criteria for the performance of data fusion approaches using either rank or score combinations. This work, thus, establishes a potential theoretical basis for the probable success of data fusion approaches to improve yields in in silico screening experiments. Furthermore, it is similarly established that the second criterion (b) can, in at least some cases, be functionally defined as the area between the rank versus score plots generated by the two (or more) algorithms. Because rank-score plots are independent of the performance of the individual scoring function, this establishes a second theoretically defined approach to determining the likely success of combining data from different predictive algorithms. This approach is, thus, useful in practical settings in the virtual screening process when the performance of at least two individual scoring functions (such as in criterion a) can be estimated as having a high likelihood of having high performance, even if no training sets are available. We provide initial validation of this theoretical approach using data from five scoring systems with two evolutionary docking algorithms on four targets, thymidine kinase, human dihydrofolate reductase, and estrogen receptors of antagonists and agonists. Our procedure is computationally efficient, able to adapt to different situations, and scalable to a large number of compounds as well as to a greater number of combinations. Results of the experiment show a fairly significant improvement (vs single algorithms) in several measures of scoring quality, specifically "goodness-of-hit" scores, false positive rates, and "enrichment". This approach (available online at http://gemdock.life.nctu.edu.tw/dock/download.php) has practical utility for cases where the basic tools are known or believed to be generally applicable, but where specific training sets are absent.


Subjects
Algorithms; Databases, Factual; Drug Design; Ligands; Enzyme Inhibitors/chemistry; Enzyme Inhibitors/pharmacology; Folic Acid Antagonists/chemistry; Folic Acid Antagonists/pharmacology; Protein Binding; Receptors, Estrogen/antagonists & inhibitors; Tetrahydrofolate Dehydrogenase/drug effects; Thymidine Kinase/antagonists & inhibitors