Results 1 - 20 of 52
1.
Bioinform Adv ; 3(1): vbad095, 2023.
Article in English | MEDLINE | ID: mdl-37485423

ABSTRACT

Motivation: Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields. Results: We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution's benefits to biomedical search. Availability and implementation: A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.

2.
J Am Med Inform Assoc ; 29(11): 1879-1889, 2022 10 07.
Article in English | MEDLINE | ID: mdl-35923089

ABSTRACT

OBJECTIVE: Abnormalities in impulse propagation and cardiac repolarization are frequent in hypertrophic cardiomyopathy (HCM), leading to abnormalities in 12-lead electrocardiograms (ECGs). Computational ECG analysis can identify electrophysiological and structural remodeling and predict arrhythmias. This requires accurate ECG segmentation. It is unknown whether current segmentation methods, developed using datasets containing annotations for mostly normal heartbeats, perform well in HCM. Here, we present a segmentation method to effectively identify ECG waves across 12-lead HCM ECGs. METHODS: We develop (1) a web-based tool that permits manual annotations of P, P', QRS, R', S', T, T', U, J, epsilon waves, QRS complex slurring, and atrial fibrillation by 3 experts and (2) an easy-to-implement segmentation method that effectively identifies ECG waves in normal and abnormal heartbeats. Our method was tested on 131 12-lead HCM ECGs and 2 public ECG sets to evaluate its performance in non-HCM ECGs. RESULTS: Over the HCM dataset, our method obtained a sensitivity of 99.2% and 98.1% and a positive predictive value of 92% and 95.3% when detecting QRS complex and T-offset, respectively, significantly outperforming a state-of-the-art segmentation method previously employed for HCM analysis. Over public ECG sets, it significantly outperformed 3 state-of-the-art methods when detecting P-onset and peak, T-offset, and QRS-onset and peak regarding the positive predictive value and segmentation error. It performed at a level similar to other methods in other tasks. CONCLUSION: Our method accurately identified ECG waves in the HCM dataset, outperforming a state-of-the-art method, and performed comparably to other methods on normal/non-HCM ECG sets.
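
The sensitivity and positive predictive value reported above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the greedy matching rule, the tolerance of 10 samples, and the example sample indices are all assumptions made for illustration.

```python
def match_detections(annotated, detected, tolerance):
    """Pair each annotated sample index with the nearest unused detection
    that lies within `tolerance` samples (greedy matching, an assumption)."""
    unused = sorted(detected)
    true_positives = 0
    for ref in sorted(annotated):
        candidates = [d for d in unused if abs(d - ref) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - ref))
            unused.remove(best)
            true_positives += 1
    false_negatives = len(annotated) - true_positives
    false_positives = len(unused)  # detections that matched no annotation
    return true_positives, false_positives, false_negatives


def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def positive_predictive_value(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical T-offset positions (sample indices), not taken from the paper.
annotated = [100, 400, 700, 1000]
detected = [102, 395, 720, 1001, 1300]
tp, fp, fn = match_detections(annotated, detected, tolerance=10)
print(sensitivity(tp, fn), positive_predictive_value(tp, fp))
```

In this toy case three of four annotations are matched and two detections are spurious, so sensitivity and positive predictive value differ even though both are computed from the same matching.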


Subjects
Cardiomyopathy, Hypertrophic , Cardiomyopathy, Hypertrophic/diagnosis , Electrocardiography/methods , Humans , Predictive Value of Tests
3.
Front Artif Intell ; 5: 832909, 2022.
Article in English | MEDLINE | ID: mdl-35757296

ABSTRACT

This work proposes a domain-informed neural network architecture for experimental particle physics, using particle interaction localization with the time-projection chamber (TPC) technology for dark matter research as an example application. A key feature of the signals generated within the TPC is that they allow localization of particle interactions through a process called reconstruction (i.e., inverse-problem regression). While multilayer perceptrons (MLPs) have emerged as a leading contender for reconstruction in TPCs, such a black-box approach does not reflect prior knowledge of the underlying scientific processes. This paper looks anew at neural network-based interaction localization and encodes prior detector knowledge, in terms of both signal characteristics and detector geometry, into the feature encoding and the output layers of a multilayer (deep) neural network. The resulting neural network, termed Domain-informed Neural Network (DiNN), limits the receptive fields of the neurons in the initial feature encoding layers in order to account for the spatially localized nature of the signals produced within the TPC. This aspect of the DiNN, which has similarities with the emerging area of graph neural networks in that the neurons in the initial layers only connect to a handful of neurons in their succeeding layer, significantly reduces the number of parameters in the network in comparison to an MLP. In addition, in order to account for the detector geometry, the output layers of the network are modified using two geometric transformations to ensure the DiNN produces localizations within the interior of the detector. The end result is a neural network architecture that has 60% fewer parameters than an MLP, but that still achieves similar localization performance and provides a path to future architectural developments with improved performance because of their ability to encode additional domain knowledge into the architecture.

4.
Database (Oxford) ; 2022, 2022 05 18.
Article in English | MEDLINE | ID: mdl-35616099

ABSTRACT

The discovery of drug-drug interactions (DDIs) that have a translational impact among in vitro pharmacokinetics (PK), in vivo PK and clinical outcomes depends largely on the quality of the annotated corpus available for text mining. We have developed a new DDI corpus based on an annotation scheme that builds upon and extends previous ones, where an abstract is fragmented and each fragment is then annotated along eight dimensions, namely, focus, polarity, certainty, evidence, directionality, study type, interaction type and mechanism. The guideline for defining these dimensions has undergone refinement during the annotation process. Our DDI corpus comprises 900 positive DDI abstracts and 750 that are not directly relevant to DDI. The abstracts in the corpus are separated into eight categories of DDI or non-DDI evidence: DDI with pharmacokinetic (PK) mechanism, in vivo DDI PK, DDI clinical, drug-nutrition interaction, single drug, not drug related, in vitro pharmacodynamic (PD) and case report. Seven annotators, three with drug-interaction research experience and four with less such experience, independently annotated the DDI corpus, with two researchers independently annotating each abstract. After two rounds of annotations with additional training in between, agreement improved from (0.79, 0.96, 0.86, 0.70, 0.91, 0.65, 0.78, 0.90) to (0.93, 0.99, 0.96, 0.94, 0.95, 0.93, 0.96, 0.97) for focus, certainty, evidence, study type, interaction type, mechanisms, polarity and direction, respectively. The novice-level annotators improved from 0.83 to 0.96, while the expert-level annotators maintained high performance with some improvement, from 0.90 to 0.96. In summary, we achieved 96% agreement among each pair of annotators with regard to the eight dimensions. The annotated corpus is now available to the community for inclusion in their text-mining pipelines.
Database URL https://github.com/zha204/DDI-Corpus-Database/tree/master/DDI%20corpus.
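
The abstract reports pairwise agreement per annotation dimension without naming the statistic. As a minimal sketch, assuming simple percent agreement (with Cohen's kappa shown alongside as a common chance-corrected alternative), agreement between two annotators on one dimension could be computed as follows. The labels and annotator lists are hypothetical.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of fragments on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


# Hypothetical "polarity" labels for eight fragments from two annotators.
annotator_1 = ["positive", "positive", "negative", "positive",
               "negative", "negative", "positive", "negative"]
annotator_2 = ["positive", "positive", "negative", "negative",
               "negative", "negative", "positive", "negative"]
print(percent_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

Percent agreement is easy to interpret but can look high when one label dominates, which is why a chance-corrected measure is often reported next to it.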


Subjects
Data Mining , Data Mining/methods , Databases, Factual , Drug Interactions , Humans
6.
Bioinformatics ; 37(Suppl_1): i468-i476, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252939

ABSTRACT

MOTIVATION: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results. RESULTS: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to improved overall classification performance.
AVAILABILITY AND IMPLEMENTATION: Source code and the list of PMIDs of the publications in our datasets are available upon request.
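
The first integration method described above, combining image-derived and textual features into one larger vector, can be sketched roughly as below. This is only an illustration under stated assumptions: the abstract says Figure-words are based on class labels of subfigures but does not define them further, so the count-vector encoding, the class names in `IMAGE_CLASSES`, and the tiny embedding vectors are all hypothetical.

```python
from collections import Counter

# Hypothetical subfigure classes; the paper's actual taxonomy is not given here.
IMAGE_CLASSES = ["gel", "graph", "microscopy", "other"]


def figure_word_vector(subfigure_labels, classes=IMAGE_CLASSES):
    """Count how often each subfigure class label occurs in a document."""
    counts = Counter(subfigure_labels)
    return [float(counts[c]) for c in classes]


def concatenate_features(figure_vec, caption_vec, title_abstract_vec):
    """Early fusion: join the three feature vectors into one document vector."""
    return list(figure_vec) + list(caption_vec) + list(title_abstract_vec)


# Hypothetical inputs: predicted subfigure labels and toy 3-dimensional embeddings.
labels = ["gel", "gel", "graph"]
caption_embedding = [0.1, -0.3, 0.5]
title_abstract_embedding = [0.2, 0.0, -0.1]
document_vector = concatenate_features(
    figure_word_vector(labels), caption_embedding, title_abstract_embedding
)
print(document_vector)
```

The second method named in the abstract, meta-classification, would instead train one classifier per information source and feed their outputs to a final classifier.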


Subjects
Biomedical Research , Databases, Factual
7.
CJC Open ; 3(6): 801-813, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34169259

ABSTRACT

BACKGROUND: Hypertrophic cardiomyopathy (HCM) patients have a high incidence of atrial fibrillation (AF) and increased stroke risk, even with low CHA2DS2-VASc (congestive heart failure, hypertension, age, diabetes, previous stroke/transient ischemic attack) scores. Hence, there is a need to understand the pathophysiology of AF/stroke in HCM. In this retrospective study, we develop and apply a data-driven, machine learning-based method to identify AF cases, and clinical/imaging features associated with AF, using electronic health record data. METHODS: HCM patients with documented paroxysmal/persistent/permanent AF (n = 191) were considered AF cases, and the remaining patients in sinus rhythm (n = 640) were tagged as No-AF. We evaluated 93 clinical variables; the most informative variables useful for distinguishing AF from No-AF cases were selected based on the 2-sample t test and the information gain criterion. RESULTS: We identified 18 highly informative variables that are positively (n = 11) and negatively (n = 7) correlated with AF in HCM. Next, patient records were represented via these 18 variables. Data imbalance resulting from the relatively low number of AF cases was addressed via a combination of oversampling and undersampling strategies. We trained and tested multiple classifiers under this sampling approach, showing effective classification. Specifically, an ensemble of logistic regression and naïve Bayes classifiers, trained based on the 18 variables and corrected for data imbalance, proved most effective for separating AF from No-AF cases (sensitivity = 0.74, specificity = 0.70, C-index = 0.80). CONCLUSIONS: Our model (HCM-AF-Risk Model) is the first machine learning-based method for identification of AF cases in HCM. This model demonstrates good performance, addresses data imbalance, and suggests that AF is associated with a more severe cardiac HCM phenotype.
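
The information gain criterion mentioned in the methods can be sketched for a discrete variable as the reduction in label entropy after splitting on that variable. This is a generic textbook formulation, not the study's code; the variable values and the four toy patient labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting
    the records by the value of one discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: one informative and one uninformative binary variable.
labels = ["AF", "AF", "No-AF", "No-AF"]
informative = [1, 1, 0, 0]
uninformative = [1, 0, 1, 0]
print(information_gain(informative, labels), information_gain(uninformative, labels))
```

Variables would then be ranked by gain; continuous clinical variables would first need to be discretized, a step this sketch leaves out.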



8.
Ann Biomed Eng ; 49(2): 573-584, 2021 Feb.
Article in English | MEDLINE | ID: mdl-32779056

ABSTRACT

Prostate cancer (PCa) is a common, serious form of cancer in men that is still prevalent despite ongoing developments in diagnostic oncology. Current detection methods lead to high rates of inaccurate diagnosis. We present a method to directly model and exploit temporal aspects of temporal enhanced ultrasound (TeUS) for tissue characterization, which improves malignancy prediction. We employ a probabilistic-temporal framework, namely, hidden Markov models (HMMs), for modeling TeUS data obtained from PCa patients. We distinguish malignant from benign tissue by comparing the respective log-likelihood estimates generated by the HMMs. We analyze 1100 TeUS signals acquired from 12 patients. Our results show improved malignancy identification compared to previous results, demonstrating over 85% accuracy and AUC of 0.95. Incorporating temporal information directly into the models leads to improved tissue differentiation in PCa. We expect our method to generalize and be applied to other types of cancer in which temporal-ultrasound can be recorded.
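
The classification step described above, comparing log-likelihoods under a benign and a malignant HMM, can be sketched with the scaled forward algorithm. This is an illustrative sketch only: it assumes the TeUS signals have been discretized into symbols 0 and 1, and every model parameter below is made up rather than learned from data.

```python
import math


def log_likelihood(observations, start, transition, emission):
    """Scaled forward algorithm: log P(observations | model) for a discrete HMM."""
    if not observations:
        raise ValueError("observations must be non-empty")
    n_states = len(start)
    alpha = [start[s] * emission[s][observations[0]] for s in range(n_states)]
    scale = sum(alpha)
    log_prob = math.log(scale)
    alpha = [a / scale for a in alpha]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][symbol]
            for s in range(n_states)
        ]
        scale = sum(alpha)  # rescaling avoids numerical underflow on long signals
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob


def classify(observations, benign_model, malignant_model):
    """Label a signal by whichever model assigns it the higher log-likelihood."""
    benign_ll = log_likelihood(observations, *benign_model)
    malignant_ll = log_likelihood(observations, *malignant_model)
    return "malignant" if malignant_ll > benign_ll else "benign"


# Hypothetical two-state models: (start, transition, emission).
benign_model = ([0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.8, 0.2], [0.6, 0.4]])
malignant_model = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.3, 0.7], [0.1, 0.9]])
print(classify([1, 1, 1, 1], benign_model, malignant_model))
print(classify([0, 0, 0, 0], benign_model, malignant_model))
```

In practice the two models would each be trained on labeled signals of one tissue type before being used this way.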


Subjects
Models, Theoretical , Prostate/diagnostic imaging , Prostatic Neoplasms/diagnosis , Humans , Male , Markov Chains , Ultrasonography
9.
Ann Biomed Eng ; 48(12): 3025, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32901381

ABSTRACT

The authors have noted an omission in the original acknowledgements. The correct acknowledgements are as follows: Acknowledgements: This work was partially supported by Grants from NSERC Discovery to Hagit Shatkay and Parvin Mousavi, NSERC and CIHR CHRP to Parvin Mousavi and NIH R01 LM012527, NIH U54 GM104941, NSF IIS EAGER #1650851 & NSF HDR #1940080 to Hagit Shatkay.

10.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32294192

ABSTRACT

Gathering information from the scientific literature is essential for biomedical research, as much knowledge is conveyed through publications. However, the large and rapidly increasing publication rate makes it impractical for researchers to quickly identify all and only those documents related to their interest. As such, automated biomedical document classification attracts much interest. Such classification is critical in the curation of biological databases, because biocurators must scan through a vast number of articles to identify pertinent information within documents most relevant to the database. This is a slow, labor-intensive process that can benefit from effective automation. We present a document classification scheme aiming to identify papers containing information relevant to a specific topic, among a large collection of articles, for supporting the biocuration classification task. Our framework is based on a meta-classification scheme we have introduced before; here we incorporate into it features gathered from figure captions, in addition to those obtained from titles and abstracts. We trained and tested our classifier over a large imbalanced dataset, originally curated by the Gene Expression Database (GXD). GXD collects all the gene expression information in the Mouse Genome Informatics (MGI) resource. As part of the MGI literature classification pipeline, GXD curators identify MGI-selected papers that are relevant for GXD. The dataset consists of ~60 000 documents (5469 labeled as relevant; 52 866 as irrelevant), gathered throughout 2012-2016, in which each document is represented by the text of its title, abstract and figure captions. Our classifier attains precision 0.698, recall 0.784, f-measure 0.738 and Matthews correlation coefficient 0.711, demonstrating that the proposed framework effectively addresses the high imbalance in the GXD classification task.
Moreover, our classifier's performance is significantly improved by utilizing information from image captions compared to using titles and abstracts alone; this observation clearly demonstrates that image captions provide substantial information for supporting biomedical document classification and curation. Database URL.
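
The four measures reported above follow from a binary confusion matrix by standard definitions. The sketch below uses those textbook formulas; the confusion-matrix counts are hypothetical, since the abstract reports only the resulting scores.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F-measure and Matthews correlation coefficient."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return precision, recall, f_measure(precision, recall), mcc


# Hypothetical counts for an imbalanced test set (10 relevant, 90 irrelevant).
precision, recall, f1, mcc = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(precision, recall, f1, mcc)

# Harmonic mean of the rounded precision and recall quoted in the abstract.
print(f_measure(0.698, 0.784))
```

The harmonic mean of the quoted precision and recall comes out near 0.7385, in line with the reported f-measure of 0.738 once rounding of the inputs is allowed for. Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it informative on imbalanced data such as this set.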


Subjects
Biomedical Research/statistics & numerical data , Computational Biology/methods , Data Curation/methods , Databases, Factual , Animals , Biomedical Research/classification , Biomedical Research/methods , Computational Biology/classification , Data Mining/methods , Humans , Internet
11.
Clin Pharmacol Ther ; 107(4): 886-902, 2020 04.
Article in English | MEDLINE | ID: mdl-31863452

ABSTRACT

Clinical translation of drug-drug interaction (DDI) studies is limited, and knowledge gaps across different types of DDI evidence make it difficult to consolidate and link them to clinical consequences. Consequently, we developed information retrieval (IR) models to retrieve DDI and drug-gene interaction (DGI) evidence from 25 million PubMed abstracts and classify DDI evidence into in vitro pharmacokinetic (PK), clinical PK, and clinical pharmacodynamic (PD) studies for US Food and Drug Administration (FDA) approved and withdrawn drugs. Additionally, information extraction models were developed to extract DDI-pairs and DGI-pairs from the IR-retrieved abstracts. An overlapping analysis identified 986 unique DDI-pairs across all 3 types of evidence. Another 2,157 and 13,012 DDI-pairs and 3,173 DGI-pairs were identified from known clinical PK/PD DDI, clinical PD DDI, and DGI evidence, respectively. By integrating DDI and DGI evidence, we discovered 119 and 18 new pharmacogenetic hypotheses associated with CYP3A and CYP2D6, respectively. Some of this DGI evidence can also aid us in understanding DDI mechanisms.
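
The overlapping analysis mentioned above amounts to finding drug pairs that occur in every evidence type. A minimal sketch of that idea follows, assuming pairs are unordered and compared case-insensitively; the drug names are placeholders, not data from the study.

```python
def normalize_pair(drug_a, drug_b):
    """Treat (A, B) and (B, A) as the same pair, ignoring letter case."""
    return frozenset((drug_a.lower(), drug_b.lower()))


def pairs_in_all(*evidence_sets):
    """Return the drug pairs present in every supplied evidence collection."""
    if not evidence_sets:
        return set()
    normalized = [{normalize_pair(a, b) for a, b in pairs} for pairs in evidence_sets]
    return set.intersection(*normalized)


# Placeholder pairs for three evidence types (hypothetical).
in_vitro_pk = [("DrugA", "DrugB"), ("DrugC", "DrugD")]
clinical_pk = [("drugb", "druga"), ("DrugE", "DrugF")]
clinical_pd = [("DrugA", "DrugB")]
shared = pairs_in_all(in_vitro_pk, clinical_pk, clinical_pd)
print(shared)
```

Real drug mentions would additionally need mapping to a common vocabulary before comparison, which this sketch does not attempt.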


Subjects
Data Mining/methods , Drug Interactions/physiology , Knowledge Discovery/methods , Pharmacogenetics/methods , Translational Research, Biomedical/methods , United States Food and Drug Administration , Data Mining/trends , Humans , Pharmacogenetics/trends , Translational Research, Biomedical/trends , United States , United States Food and Drug Administration/trends
12.
Bioinformatics ; 35(21): 4381-4388, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30949681

ABSTRACT

MOTIVATION: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective captions as a source of knowledge. Notably, an essential step underlying such mining is the extraction of figures and captions from publications. While several PDF parsing tools that extract information from such documents are publicly available, they attempt to identify images by analyzing the PDF encoding and structure and the complex graphical objects embedded within. As such, they often incorrectly identify figures and captions in scientific publications, whose structure is often non-trivial. The extraction of figures, captions and figure-caption pairs from biomedical publications is thus neither well-studied nor yet well-addressed. RESULTS: We introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike existing methods, we first separate text from graphical content, and then utilize layout information to effectively detect and extract figures and captions. We generate files containing the figures and their associated captions and provide those as output to the end-user. We test our system both over a public dataset of computer science documents previously used by others, and over two newly collected sets of publications focusing on the biomedical domain. Our experiments and results comparing PDFigCapX to other state-of-the-art systems show a significant improvement in performance, and demonstrate the effectiveness and robustness of our approach. AVAILABILITY AND IMPLEMENTATION: Our system is publicly available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX. The two new datasets are available at: https://www.eecis.udel.edu/~compbio/PDFigCapX/Downloads.


Subjects
Publications , Data Mining
13.
Am J Cardiol ; 123(10): 1681-1689, 2019 05 15.
Article in English | MEDLINE | ID: mdl-30952382

ABSTRACT

Clinical risk stratification for sudden cardiac death (SCD) in hypertrophic cardiomyopathy (HC) employs rules derived from American College of Cardiology Foundation/American Heart Association (ACCF/AHA) guidelines or the HCM Risk-SCD model (C-index ∼0.69), which utilize a few clinical variables. We assessed whether data-driven machine learning methods that consider a wider range of variables can effectively identify HC patients with ventricular arrhythmias (VAr) that lead to SCD. We scanned the electronic health records of 711 HC patients for sustained ventricular tachycardia or ventricular fibrillation. Patients with ventricular tachycardia or ventricular fibrillation (n = 61) were tagged as VAr cases and the remaining (n = 650) as non-VAr. The 2-sample t test and information gain criterion were used to identify the most informative clinical variables that distinguish VAr from non-VAr; patient records were reduced to include only these variables. Data imbalance stemming from the low number of VAr cases was addressed by applying a combination of over- and undersampling strategies. We trained and tested multiple classifiers under this sampling approach, showing effective classification. We evaluated 93 clinical variables, of which 22 proved predictive of VAr. The ensemble of logistic regression and naïve Bayes classifiers, trained based on these 22 variables and corrected for data imbalance, was most effective in separating VAr from non-VAr cases (sensitivity = 0.73, specificity = 0.76, C-index = 0.83). Our method (HCM-VAr-Risk Model) identified 12 new predictors of VAr, in addition to 10 established SCD predictors. In conclusion, this is the first application of machine learning for identifying HC patients with VAr, using clinical attributes. Our model demonstrates good performance (C-index) compared with currently employed SCD prediction algorithms, while addressing imbalance inherent in clinical data.
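
The combination of over- and undersampling described above can be sketched as random undersampling of the majority class plus random oversampling (with replacement) of the minority class toward a common size. The abstract does not specify the exact strategy, so the random sampling scheme, the target size of 200, and the fixed seed are assumptions; only the class counts (650 and 61) come from the text.

```python
import random


def rebalance(majority, minority, target_size, seed=0):
    """Undersample the majority class and oversample the minority class
    so that both approach `target_size` records."""
    if not minority:
        raise ValueError("minority class must contain at least one record")
    rng = random.Random(seed)
    if len(majority) > target_size:
        undersampled = rng.sample(majority, target_size)  # without replacement
    else:
        undersampled = list(majority)
    oversampled = list(minority)  # keep every original minority record
    while len(oversampled) < target_size:
        oversampled.append(rng.choice(minority))  # duplicate at random
    return undersampled, oversampled


# Record identifiers standing in for patient records.
non_var_records = list(range(650))
var_records = list(range(650, 711))
under, over = rebalance(non_var_records, var_records, target_size=200)
print(len(under), len(over))
```

Resampling should be applied to training folds only, so that duplicated minority records never leak into the data used for evaluation.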


Subjects
Electronic Health Records , Machine Learning , Registries , Risk Assessment/methods , Tachycardia, Ventricular/diagnosis , Cardiomyopathy, Hypertrophic , Echocardiography, Stress , Electrocardiography , Female , Humans , Magnetic Resonance Imaging, Cine/methods , Male , Middle Aged , Predictive Value of Tests , Prognosis , Reproducibility of Results , Retrospective Studies , Risk Factors , Tachycardia, Ventricular/etiology
14.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-31032839

ABSTRACT

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.


Subjects
Databases, Nucleic Acid , Information Dissemination , Polymorphism, Single Nucleotide , Programming Languages
15.
BMC Med Inform Decis Mak ; 18(Suppl 4): 125, 2018 12 12.
Article in English | MEDLINE | ID: mdl-30537962

ABSTRACT

BACKGROUND: Chronic Kidney Disease (CKD) is one of several conditions that affect a growing percentage of the US population; the disease is accompanied by multiple co-morbidities, and is hard to diagnose in and of itself. In its advanced forms it carries severe outcomes and can lead to death. It is thus important to detect the disease as early as possible, which can help devise effective intervention and treatment plans. Here we investigate ways to utilize information available in electronic health records (EHRs) from regular office visits of more than 13,000 patients, in order to distinguish among several stages of the disease. While clinical data stored in EHRs provide valuable information for risk-stratification, one of the major challenges in using them arises from data imbalance. That is, records associated with a more severe condition are typically under-represented compared to those associated with a milder manifestation of the disease. To address imbalance, we propose and develop a sampling-based ensemble approach, hierarchical meta-classification, aiming to stratify CKD patients into severity stages, using simple quantitative non-text features gathered from standard office visit records. METHODS: The proposed hierarchical meta-classification method frames the multiclass classification task as a hierarchy of two subtasks. The first is binary classification, separating records associated with the majority class from those associated with all minority classes combined, using meta-classification. The second subtask separates the records assigned to the combined minority classes into the individual constituent classes. RESULTS: The proposed method identifies a significant proportion of patients suffering from the more advanced stages of the condition, while also correctly identifying most of the less severe cases, maintaining high sensitivity, specificity and F-measure (≥ 93%).
Our results show that the high level of performance attained by our method is preserved even when the size of the training set is significantly reduced, demonstrating the stability and generalizability of our approach. CONCLUSION: We present a new approach to perform classification while addressing data imbalance, which is inherent in the biomedical domain. Our model effectively identifies severity stages of CKD patients, using information readily available in office visit records within the realistic context of high data imbalance.
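
The two-subtask hierarchy described in the methods can be sketched as a routing function: a first-stage decision separates the majority class from the combined minority classes, and only records routed to the minority side reach a second-stage classifier. The sketch below shows the control flow only. The threshold rules stand in for trained classifiers, and the `severity_score` feature and the stage labels are hypothetical.

```python
def hierarchical_predict(record, majority_vs_rest, minority_classifier, majority_label):
    """Stage 1 separates the majority class from all minority classes combined;
    stage 2 assigns a specific minority class to the remaining records."""
    if majority_vs_rest(record) == "majority":
        return majority_label
    return minority_classifier(record)


# Toy stand-ins for the trained stage-1 and stage-2 classifiers (assumptions).
def toy_stage_one(record):
    return "majority" if record["severity_score"] < 0.5 else "minority"


def toy_stage_two(record):
    return "stage_4" if record["severity_score"] < 0.8 else "stage_5"


for score in (0.2, 0.6, 0.9):
    label = hierarchical_predict(
        {"severity_score": score}, toy_stage_one, toy_stage_two, majority_label="stage_3"
    )
    print(score, label)
```

One consequence of this design is that the second-stage classifier is trained and applied on a much less imbalanced subset than the full dataset.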


Subjects
Electronic Health Records , Machine Learning , Office Visits , Renal Insufficiency, Chronic/classification , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Sensitivity and Specificity , Severity of Illness Index
16.
New Phytol ; 220(3): 851-864, 2018 11.
Article in English | MEDLINE | ID: mdl-30020552

ABSTRACT

Little is known about the characteristics and function of reproductive phased, secondary, small interfering RNAs (phasiRNAs) in the Poaceae, despite the availability of significant genomic resources, experimental data, and a growing number of computational tools. We utilized machine-learning methods to identify sequence-based and positional features that distinguish phasiRNAs in rice and maize from other small RNAs (sRNAs). We developed Random Forest classifiers that can distinguish reproductive phasiRNAs from other sRNAs in complex sets of sequencing data, utilizing sequence-based features (k-mers) and features describing position-specific sequence biases. The classification performance attained is > 80% in accuracy, sensitivity, specificity, and positive predictive value. Feature selection identified important features in both ends of phasiRNAs. We demonstrated that phasiRNAs have strand specificity and position-specific nucleotide biases potentially influencing AGO sorting; we also predicted targets to infer functions of phasiRNAs, and computationally assessed their sequence characteristics relative to other sRNAs. Our results demonstrate that machine-learning methods effectively identify phasiRNAs despite the lack of characteristic features typically present in precursor loci of other small RNAs, such as sequence conservation or structural motifs. The 5'-end features we identified provide insights into AGO-phasiRNA interactions. We describe a hypothetical model of competition for AGO loading between phasiRNAs of different nucleotide compositions.
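
The two feature families named above, k-mer content and position-specific nucleotide identity, can be sketched as simple encoders whose outputs would be fed to a classifier such as a Random Forest. The encodings, the choice of k = 2, and the example sequence are assumptions for illustration and are not taken from the study.

```python
from collections import Counter
from itertools import product

RNA_ALPHABET = "ACGU"


def kmer_counts(sequence, k):
    """Count every possible k-mer over A/C/G/U (zero if absent).
    k-mers containing other characters, such as N, are ignored."""
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ("".join(p) for p in product(RNA_ALPHABET, repeat=k))
    return {kmer: observed[kmer] for kmer in all_kmers}


def positional_one_hot(sequence, position):
    """One-hot encode the nucleotide at a given 0-based position."""
    return {f"pos{position}_{base}": int(sequence[position] == base) for base in RNA_ALPHABET}


# Hypothetical short RNA fragment.
example = "ACGUAC"
features = kmer_counts(example, k=2)
features.update(positional_one_hot(example, position=0))  # 5'-end nucleotide
print(features["AC"], features["pos0_A"])
```

Position-specific features at the 5' end are of particular interest here because the abstract links 5'-end nucleotide biases to AGO sorting.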


Subjects
Poaceae/genetics , RNA, Plant/metabolism , RNA, Small Interfering/metabolism , Base Composition/genetics , Nucleotides/genetics , Reproduction
17.
IEEE Trans Biomed Eng ; 65(8): 1798-1809, 2018 08.
Article in English | MEDLINE | ID: mdl-29989922

ABSTRACT

OBJECTIVES: Temporal enhanced ultrasound (TeUS) is a new ultrasound-based imaging technique that provides tissue-specific information. Recent studies have shown the potential of TeUS for improving tissue characterization in prostate cancer diagnosis. We study the temporal properties of TeUS (temporal order and length) and present a new framework to assess their impact on tissue information. METHODS: We utilize a probabilistic modeling approach using hidden Markov models (HMMs) to capture the temporal signatures of malignant and benign tissues from TeUS signals of nine patients. We model signals of benign and malignant tissues (284 and 286 signals, respectively) in their original temporal order as well as under order permutations. We then compare the resulting models using the Kullback-Leibler divergence and assess their performance differences in characterization. Moreover, we train HMMs using TeUS signals of different durations and compare their model performance when differentiating tissue types. RESULTS: Our findings demonstrate that models of order-preserved signals perform statistically significantly better (85% accuracy) in tissue characterization compared to models of order-altered signals (62% accuracy). The performance degrades as more changes in signal order are introduced. Additionally, models trained on shorter sequences perform as accurately as models of longer sequences. CONCLUSION: The work presented here strongly indicates that temporal order has substantial impact on TeUS performance; thus, it plays a significant role in conveying tissue-specific information. Furthermore, shorter TeUS signals can relay sufficient information to accurately distinguish between tissue types. SIGNIFICANCE: Understanding the impact of TeUS properties facilitates its adoption in diagnostic procedures and provides insights on improving its acquisition.
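
The Kullback-Leibler divergence used above to compare models is, for two discrete distributions P and Q, the sum over outcomes of P(x) log(P(x)/Q(x)). The sketch below implements only that discrete form. Comparing two HMMs in practice usually requires an approximation (for example, averaging log-likelihood differences over sampled sequences), and the abstract does not say which was used; the distributions here are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same outcomes.
    Requires q[i] > 0 wherever p[i] > 0."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i <= 0:
                raise ValueError("q must be positive wherever p is positive")
            total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical emission distributions from two models.
order_preserved = [0.5, 0.5]
order_permuted = [0.25, 0.75]
print(kl_divergence(order_preserved, order_permuted))
print(kl_divergence(order_permuted, order_preserved))
```

The two printed values differ because the divergence is not symmetric, so the direction of comparison needs to be stated whenever it is reported.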


Subjects
Image Interpretation, Computer-Assisted/methods , Prostate/diagnostic imaging , Prostatic Neoplasms/diagnostic imaging , Ultrasonography/methods , Humans , Male , Markov Chains , Sensitivity and Specificity , Stochastic Processes
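
The abstract above compares models using the Kullback-Leibler divergence but does not say how it was computed between HMMs. As a minimal sketch of the quantity itself, the following computes KL(p || q) for two discrete distributions; the function name `kl_divergence` and the example distributions are assumptions for illustration only and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    Illustrative only: p and q are lists of probabilities over the same
    outcomes. This is not the HMM-to-HMM comparison used in the paper.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    # Hypothetical distributions; the divergence is not symmetric.
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

Because the divergence is asymmetric, swapping the arguments generally gives a different value, which matters when deciding which model plays the role of the reference.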
18.
J Biomed Inform ; 82: 31-40, 2018 06.
Article in English | MEDLINE | ID: mdl-29655947

ABSTRACT

Patients associated with multiple co-occurring health conditions often face aggravated complications and less favorable outcomes. Co-occurring conditions are especially prevalent among individuals suffering from kidney disease, an increasingly widespread condition affecting 13% of the general population in the US. This study aims to identify and characterize patterns of co-occurring medical conditions in patients employing a probabilistic framework. Specifically, we apply topic modeling in a non-traditional way to find associations across SNOMED-CT codes assigned and recorded in the EHRs of >13,000 patients diagnosed with kidney disease. Unlike most prior work on topic modeling, we apply the method to codes rather than to natural language. Moreover, we quantitatively evaluate the topics, assessing their tightness and distinctiveness, and also assess the medical validity of our results. Our experiments show that each topic is succinctly characterized by a few highly probable and unique disease codes, indicating that the topics are tight. Furthermore, inter-topic distance between each pair of topics is typically high, illustrating distinctiveness. Last, most coded conditions grouped together within a topic are indeed reported to co-occur in the medical literature. Notably, our results uncover a few indirect associations among conditions that have hitherto not been reported as correlated in the medical literature.


Subjects
Comorbidity , Kidney Diseases/complications , Medical Informatics/methods , Systematized Nomenclature of Medicine , Aged , Aged, 80 and over , Electronic Health Records , Female , Humans , International Classification of Diseases , Kidney Diseases/epidemiology , Male , Middle Aged , Models, Statistical , Probability , Reproducibility of Results , United States
19.
Bioinformatics ; 34(7): 1192-1199, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29040394

ABSTRACT

Motivation: Images convey essential information in biomedical publications. As such, there is a growing interest within the bio-curation and the bio-databases communities to store images within publications as evidence for biomedical processes and for experimental results. However, many of the images in biomedical publications are compound images consisting of multiple panels, where each individual panel potentially conveys a different type of information. Segmenting such images into constituent panels is an essential first step toward utilizing images. Results: In this article, we develop a new compound image segmentation system, FigSplit, which is based on Connected Component Analysis. To overcome shortcomings typically manifested by existing methods, we develop a quality assessment step for evaluating and modifying segmentations. Two methods are proposed to re-segment the images if the initial segmentation is inaccurate. Experimental results show the effectiveness of our method compared with other methods. Availability and implementation: The system is publicly available for use at: https://www.eecis.udel.edu/~compbio/FigSplit. The code is available upon request. Contact: shatkay@udel.edu. Supplementary information: Supplementary data are available online at Bioinformatics.


Subjects
Computational Biology/methods , Pattern Recognition, Automated , Software , Algorithms , Computer Graphics
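
The abstract above names Connected Component Analysis as the basis of FigSplit without giving details. As a rough sketch of the general idea only, the following labels 4-connected foreground regions in a small binary mask and returns one bounding box per region, which is one plausible way to propose panel candidates. The function name `component_boxes`, the mask layout, and the choice of 4-connectivity are assumptions for illustration and do not describe FigSplit's actual pipeline or its quality assessment step.

```python
from collections import deque


def component_boxes(mask):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions.

    mask is a list of rows containing 0 (background) or 1 (foreground).
    Boxes are reported in row-major order of each region's first pixel.
    """
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


if __name__ == "__main__":
    # Hypothetical mask with three separated "panels".
    figure = [
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0],
    ]
    print(component_boxes(figure))
```

In a real compound figure the mask would come from thresholding the image, and touching or overlapping panels would merge into one region, which is the kind of failure a later re-segmentation step would need to handle.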
20.
Database (Oxford) ; 2017(1)2017 01 01.
Article in English | MEDLINE | ID: mdl-28365740

ABSTRACT

The Gene Expression Database (GXD) is a comprehensive online database within the Mouse Genome Informatics resource, aiming to provide available information about endogenous gene expression during mouse development. The information stems primarily from many thousands of biomedical publications that database curators must go through and read. Given the very large number of biomedical papers published each year, automatic document classification plays an important role in biomedical research. Specifically, an effective and efficient document classifier is needed for supporting the GXD annotation workflow. We present here an effective yet relatively simple classification scheme, which uses readily available tools while employing feature selection, aiming to assist curators in identifying publications relevant to GXD. We examine the performance of our method over a large manually curated dataset, consisting of more than 25 000 PubMed abstracts, of which about half are curated as relevant to GXD while the other half as irrelevant to GXD. In addition to text from title-and-abstract, we also consider image captions, an important information source that we integrate into our method. We apply a captions-based classifier to a subset of about 3300 documents, for which the full text of the curated articles is available. The results demonstrate that our proposed approach is robust and effectively addresses the GXD document classification. Moreover, using information obtained from image captions clearly improves performance, compared to title and abstract alone, affirming the utility of image captions as a substantial evidence source for automatically determining the relevance of biomedical publications to a specific subject area. Database URL: www.informatics.jax.org.


Subjects
Data Curation , Data Mining/methods , Databases, Genetic , Gene Expression Regulation , Animals , Mice