2.
Learn Health Syst ; 7(2): e10339, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37066097

ABSTRACT

Introduction: Enterprise data warehouses (EDWs) serve as foundational infrastructure in a modern learning health system, housing clinical and other system-wide data and making it available for research, strategic, and quality improvement purposes. Building on a longstanding partnership between Northwestern University's Galter Health Sciences Library and the Northwestern Medicine Enterprise Data Warehouse (NMEDW), an end-to-end clinical research data management (cRDM) program was created to enhance clinical data workforce capacity and further expand related library-based services for the campus. Methods: The training program covers topics such as clinical database architecture, clinical coding standards, and translation of research questions into queries for proper data extraction. Here we describe this program, including partners and motivations, technical and social components, integration of FAIR principles into clinical data research workflows, and the long-term implications for this work to serve as a blueprint of best practice workflows for clinical research to support library and EDW partnerships at other institutions. Results: This training program has enhanced the partnership between our institution's health sciences library and clinical data warehouse to provide support services for researchers, resulting in more efficient training workflows. Through instruction on best practices for preserving and sharing outputs, researchers are given the tools to improve the reproducibility and reusability of their work, which has positive effects for the researchers as well as for the university. All training resources have been made publicly available so that those who support this critical need at other institutions can build on our efforts. Conclusions: Library-based partnerships to support training and consultation offer an important vehicle for clinical data science capacity building in learning health systems. 
The cRDM program launched by Galter Library and the NMEDW is an example of this type of partnership and builds on a strong foundation of past collaboration, expanding the scope of clinical data support services and training on campus.

3.
PLoS Comput Biol ; 18(8): e1010397, 2022 08.
Article in English | MEDLINE | ID: mdl-35921268

ABSTRACT

The National Institutes of Health (NIH) Policy for Data Management and Sharing (DMS Policy) recognizes the NIH's role as a key steward of United States biomedical research and information and seeks to enhance that stewardship through systematic recommendations for the preservation and sharing of research data generated by funded projects. The policy is effective as of January 2023. The recommendations include a requirement for the submission of a Data Management and Sharing Plan (DMSP) with funding applications, and while no strict template was provided, the NIH has released supplemental draft guidance on elements to consider when developing a plan. This article provides 10 key recommendations for creating a DMSP that is both maximally compliant and effective.


Subject(s)
Biomedical Research , Data Management , National Institutes of Health (U.S.) , United States
4.
Methods Inf Med ; 58(2-03): 71-78, 2019 09.
Article in English | MEDLINE | ID: mdl-31514208

ABSTRACT

OBJECTIVES: The quality of hospital discharge care and patient factors (health and sociodemographic) impact the rates of unplanned readmissions. This study aims to measure the effects of controlling for the patient factors when using readmission rates to quantify the weighted edges between health care providers in a collaboration network. This improved understanding may inform strategies to reduce hospital readmissions and facilitate quality-improvement initiatives. METHODS: We extracted 4 years of patient, provider, and activity data related to cardiology discharge workflow. A Weibull model was developed to predict the risk of unplanned 30-day readmission. A provider-patient bipartite network was used to connect providers by shared patient encounters. We built collaboration networks and calculated the Shared Positive Outcome Ratio (SPOR) to quantify the relationship between providers by the relative rate of patient outcomes, using both risk-adjusted readmission rates and unadjusted readmission rates. The effect of risk adjustment on the calculation of the SPOR metric was quantified using a permutation test and descriptive statistics. RESULTS: Comparing the collaboration networks consisting of 2,359 provider pairs, we found that SPOR values computed with risk-adjusted outcomes differ significantly from those computed with unadjusted readmission as the outcome measure (p-value = 0.025). The two networks classified the same provider pairs as high-scoring 51.5% of the time, and the same provider pairs as low-scoring 85.6% of the time. The observed differences in patient demographics and disease characteristics between high-scoring and low-scoring provider pairs were reduced by applying the risk-adjusted model. The risk-adjusted model reduced the average variation across each individual's SPOR-scored provider connections. CONCLUSIONS: Risk adjusting unplanned readmission in a collaboration network has an effect on SPOR-weighted edges, especially on classifying high-scoring SPOR provider pairs. 
The risk-adjusted model reduces the variance of providers' connections and balances shared patient characteristics between low- and high-scoring provider pairs. This indicates that the risk-adjusted SPOR edges better measure the impact of collaboration on readmissions by accounting for patients' risk of readmission.


Subject(s)
Cooperative Behavior , Health Personnel , Humans , Outcome Assessment, Health Care , Patient Readmission , Risk Factors
5.
Sci Rep ; 7(1): 2626, 2017 05 25.
Article in English | MEDLINE | ID: mdl-28572625

ABSTRACT

Community detection involves grouping the nodes of a network such that nodes in the same community are more densely connected to each other than to the rest of the network. Previous studies have focused mainly on identifying communities in networks using node connectivity. However, each node in a network may be associated with many attributes. Identifying communities in networks combining node attributes has become increasingly popular in recent years. Most existing methods operate on networks with attributes of binary, categorical, or numerical type only. In this study, we introduce kNN-enhance, a simple and flexible community detection approach that uses node attribute enhancement. This approach adds the k Nearest Neighbor (kNN) graph of node attributes to alleviate the sparsity and the noise effect of an original network, thereby strengthening the community structure in the network. We use two testing algorithms, kNN-nearest and kNN-Kmeans, to partition the newly generated, attribute-enhanced graph. Our analyses of synthetic and real world networks have shown that the proposed algorithms achieve better performance compared to existing state-of-the-art algorithms. Further, the algorithms are able to deal with networks containing different combinations of binary, categorical, or numerical attributes and could be easily extended to the analysis of massive networks.
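
The attribute-enhancement step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes purely numerical attributes and Euclidean distance, and the names `knn_edges`, `enhance`, and the toy data are hypothetical. It only shows the idea of adding a kNN graph built from node attributes to the original edge set before any partitioning is done.

```python
import math


def knn_edges(attributes, k):
    """Return undirected kNN edges over numeric attribute vectors.

    Assumption: Euclidean distance; the abstract does not state the metric.
    """
    edges = set()
    for u, vec_u in attributes.items():
        # Rank every other node by distance to u; break ties by node name.
        others = sorted(
            (v for v in attributes if v != u),
            key=lambda v: (math.dist(vec_u, attributes[v]), v),
        )
        for v in others[:k]:
            edges.add(frozenset((u, v)))
    return edges


def enhance(original_edges, attributes, k):
    """Union of the original topology and the attribute kNN graph."""
    return {frozenset(e) for e in original_edges} | knn_edges(attributes, k)


# Hypothetical toy data: two attribute clusters and a sparse, noisy topology.
attributes = {
    "a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.1),
    "x": (5.0, 5.0), "y": (5.1, 5.0), "z": (5.0, 5.1),
}
original_edges = [("a", "b"), ("c", "x"), ("y", "z")]
enhanced = enhance(original_edges, attributes, k=1)
```

In this toy case the added kNN edges all fall inside the two attribute clusters, so the enhanced graph is denser within clusters than the original, which is the effect the abstract describes. Mixed binary or categorical attributes would need a different distance function.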

6.
BMC Med Genomics ; 10(Suppl 1): 26, 2017 05 24.
Article in English | MEDLINE | ID: mdl-28589854

ABSTRACT

BACKGROUND: Complex diseases involve many genes, and these genes are often associated with several different illnesses. Disease similarity measurement can be based on shared genotype or phenotype. Quantifying relationships between genes can reveal previously unknown connections and form a reference base for therapy development and drug repurposing. METHODS: Here we introduce a method to measure disease similarity that incorporates the uniqueness of shared genes. For each disease pair, we calculated the uniqueness score and constructed disease similarity matrices using OMIM and Disease Ontology annotation. RESULTS: Using the Disease Ontology-based matrix, we identified several interesting connections between cancer and other diseases and conditions such as malaria, along with studies to support our findings. We also found several high-scoring pairwise relationships for which there was little or no literature support, highlighting potentially interesting connections warranting additional study. CONCLUSIONS: We developed a co-occurrence matrix based on gene uniqueness to examine the relationships between diseases from OMIM and DORIF data. Our similarity matrix can be used to identify potential disease relationships and to motivate further studies investigating the causal mechanisms in diseases.


Subject(s)
Computational Biology/methods , Disease/genetics , Gene Ontology , Databases, Genetic , Molecular Sequence Annotation
7.
Drug Saf ; 40(11): 1075-1089, 2017 11.
Article in English | MEDLINE | ID: mdl-28643174

ABSTRACT

The goal of pharmacovigilance is to detect, monitor, characterize and prevent adverse drug events (ADEs) with pharmaceutical products. This article is a comprehensive structured review of recent advances in applying natural language processing (NLP) to electronic health record (EHR) narratives for pharmacovigilance. We review methods of varying complexity and problem focus, summarize the current state-of-the-art in methodology advancement, discuss limitations and point out several promising future directions. The ability to accurately capture both semantic and syntactic structures in clinical narratives becomes increasingly critical to enable efficient and accurate ADE detection. Significant progress has been made in algorithm development and resource construction since 2000. Since 2012, statistical analysis and machine learning methods have gained traction in automation of ADE mining from EHR narratives. Current state-of-the-art methods for NLP-based ADE detection from EHRs show promise regarding their integration into production pharmacovigilance systems. In addition, integrating multifaceted, heterogeneous data sources has shown promise in improving ADE detection and has become increasingly adopted. On the other hand, challenges and opportunities remain across the frontier of NLP application to EHR-based pharmacovigilance, including proper characterization of ADE context, differentiation between off- and on-label drug-use ADEs, recognition of the importance of polypharmacy-induced ADEs, better integration of heterogeneous data sources, creation of shared corpora, and organization of shared-task challenges to advance the state-of-the-art.


Subject(s)
Adverse Drug Reaction Reporting Systems/standards , Drug-Related Side Effects and Adverse Reactions/diagnosis , Electronic Health Records/standards , Natural Language Processing , Pharmacovigilance , Humans
8.
J Am Med Inform Assoc ; 24(2): 288-294, 2017 Mar 01.
Article in English | MEDLINE | ID: mdl-27589944

ABSTRACT

OBJECTIVE: Using Failure Mode and Effects Analysis (FMEA) as an example quality improvement approach, our objective was to evaluate whether secondary use of orders, forms, and notes recorded by the electronic health record (EHR) during daily practice can enhance the accuracy of process maps used to guide improvement. We examined discrepancies between expected and observed activities and individuals involved in a high-risk process and devised diagnostic measures for understanding discrepancies that may be used to inform quality improvement planning. METHODS: Inpatient cardiology unit staff developed a process map of discharge from the unit. We matched activities and providers identified on the process map to EHR data. Using four diagnostic measures, we analyzed discrepancies between expectation and observation. RESULTS: EHR data showed that 35% of activities were completed by unexpected providers, including providers from 12 categories not identified as part of the discharge workflow. The EHR also revealed sub-components of process activities not identified on the process map. Additional information from the EHR was used to revise the process map and show differences between expectation and observation. CONCLUSION: Findings suggest EHR data may reveal gaps in process maps used for quality improvement and identify characteristics about workflow activities that can identify perspectives for inclusion in an FMEA. Organizations with access to EHR data may be able to leverage clinical documentation to enhance process maps used for quality improvement. While focused on FMEA protocols, findings from this study may be applicable to other quality activities that require process maps.


Subject(s)
Cardiology Service, Hospital/organization & administration , Electronic Health Records , Healthcare Failure Mode and Effect Analysis , Quality Improvement , Documentation/methods , Humans , Patient Discharge
9.
PLoS One ; 11(10): e0163861, 2016.
Article in English | MEDLINE | ID: mdl-27706199

ABSTRACT

Shared patient encounters form the basis of collaborative relationships, which are crucial to the success of complex and interdisciplinary teamwork in healthcare. Quantifying the strength of these relationships using shared risk-adjusted patient outcomes provides insight into interactions that occur between healthcare providers. We developed the Shared Positive Outcome Ratio (SPOR), a novel parameter that quantifies the concentration of positive outcomes between a pair of healthcare providers over a set of shared patient encounters. We constructed a collaboration network using hospital emergency department patient data from electronic health records (EHRs) over a three-year period. Based on an outcome indicating patient satisfaction, we used this network to assess pairwise collaboration and evaluate the SPOR. By comparing this network of 574 providers and 5,615 relationships to a set of networks based on randomized outcomes, we identified 295 (5.2%) pairwise collaborations having significantly higher patient satisfaction rates. Our results show extreme high- and low-scoring relationships over a set of shared patient encounters and quantify high variability in collaboration between providers. We identified 29 top performers in terms of patient satisfaction. Providers in the high-scoring group had both a greater average number of associated encounters and a higher percentage of total encounters with positive outcomes than those in the low-scoring group, implying that more experienced individuals may be able to collaborate more successfully. Our study shows that a healthcare collaboration network can be structurally evaluated to characterize the collaborative interactions that occur between healthcare providers in a hospital setting.
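
The comparison against networks built from randomized outcomes can be sketched as a permutation test. The abstract does not give the SPOR formula, so the statistic below is a plain stand-in (the share of positive outcomes among a pair's shared encounters), not SPOR itself. The function names, the toy encounters, and the default of 1000 permutations are all assumptions made for illustration.

```python
import random


def pair_positive_rate(encounters, outcomes, pair):
    """Share of positive outcomes among encounters involving both providers.

    Stand-in statistic only; this is NOT the published SPOR definition.
    """
    shared = [i for i, team in enumerate(encounters) if pair <= team]
    return sum(outcomes[i] for i in shared) / len(shared)


def permutation_p_value(encounters, outcomes, pair, n_perm=1000, seed=0):
    """One-sided p-value: how often shuffled outcomes score >= observed."""
    rng = random.Random(seed)
    observed = pair_positive_rate(encounters, outcomes, pair)
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize outcomes across all encounters
        if pair_positive_rate(encounters, shuffled, pair) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Hypothetical toy data: each encounter is the set of providers involved,
# and 1 marks a positive outcome.
encounters = [frozenset(t) for t in (
    {"p1", "p2"}, {"p1", "p2"}, {"p1", "p2"}, {"p1", "p3"},
    {"p2", "p3"}, {"p1", "p3"}, {"p2", "p3"}, {"p3", "p4"},
)]
outcomes = [1, 1, 1, 0, 0, 0, 1, 0]
pair = frozenset({"p1", "p2"})
observed = pair_positive_rate(encounters, outcomes, pair)
p_value = permutation_p_value(encounters, outcomes, pair)
```

A pair is flagged when its observed score is rarely matched under shuffling. In a full network the same test would be run for every pair, which raises a multiple-comparison question that this sketch does not address.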


Subject(s)
Patient Care Team/organization & administration , Patient Satisfaction/statistics & numerical data , Clinical Decision-Making , Cooperative Behavior , Electronic Health Records , Emergency Service, Hospital , Health Personnel , Humans , Models, Theoretical , User-Computer Interface
10.
Circ Cardiovasc Qual Outcomes ; 9(6): 670-678, 2016 11.
Article in English | MEDLINE | ID: mdl-28051772

ABSTRACT

BACKGROUND: The nature of teamwork in healthcare is complex and interdisciplinary, and provider collaboration based on shared patient encounters is crucial to its success. Characterizing the intensity of working relationships with risk-adjusted patient outcomes supplies insight into provider interactions in a hospital environment. METHODS AND RESULTS: We extracted 4 years of patient, provider, and activity data for encounters in an inpatient cardiology unit from Northwestern Medicine's Enterprise Data Warehouse. We then created a provider-patient network to identify healthcare providers who jointly participated in patient encounters and calculated satisfaction rates for provider-provider pairs. We demonstrated the application of a novel parameter, the shared positive outcome ratio, a measure that assesses the strength of a patient-sharing relationship between 2 providers based on risk-adjusted encounter outcomes. We compared an observed collaboration network of 334 providers and 3453 relationships to 1000 networks with shared positive outcome ratio scores based on randomized outcomes and found 188 collaborative relationships between pairs of providers that showed significantly higher than expected patient satisfaction ratings. A group of 22 providers performed exceptionally in terms of patient satisfaction. Our results indicate high variability in collaboration scores across the network and highlight our ability to identify relationships with both higher and lower than expected scores across a set of shared patient encounters. CONCLUSIONS: Satisfaction rates seem to vary across different teams of providers. Team collaboration can be quantified using a composite measure of collaboration across provider pairs. 
Tracking provider pair outcomes over a sufficient set of shared encounters may inform quality improvement strategies such as optimizing team staffing, identifying characteristics and practices of high-performing teams, developing evidence-based team guidelines, and redesigning inpatient care processes.


Subject(s)
Cardiology Service, Hospital/organization & administration , Cardiovascular Diseases/therapy , Delivery of Health Care, Integrated/organization & administration , Medical Staff, Hospital/organization & administration , Nursing Staff, Hospital/organization & administration , Patient Care Team/organization & administration , Process Assessment, Health Care/organization & administration , Cardiovascular Diseases/diagnosis , Cooperative Behavior , Data Mining/methods , Databases, Factual , Humans , Inpatients , Interdisciplinary Communication , Logistic Models , Patient Satisfaction , Quality Improvement/standards , Quality Indicators, Health Care/organization & administration , Retrospective Studies , Risk Factors , Treatment Outcome
11.
IET Syst Biol ; 9(4): 128-34, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26243828

ABSTRACT

The authors investigated the regulatory network motifs and corresponding motif positions of cancer-related genes. First, they mapped disease-related genes to a transcription factor regulatory network. Next, they calculated statistically significant motifs and subsequently identified positions within these motifs that were enriched in cancer-related genes. Potential mechanisms of these motifs and positions are discussed. These results could be used to identify other disease- and cancer-related genes and could also suggest mechanisms for how these genes relate to co-occurring diseases.


Subject(s)
Chromosome Mapping/methods , Genes, Neoplasm/genetics , Genetic Predisposition to Disease/genetics , Multigene Family/genetics , Neoplasms/genetics , Transcription Factors/genetics , Base Sequence , Gene Expression Regulation, Neoplastic/genetics , Humans , Molecular Sequence Data , Nucleotide Motifs
12.
BMC Med Genomics ; 8 Suppl 2: S9, 2015.
Article in English | MEDLINE | ID: mdl-26043920

ABSTRACT

BACKGROUND: In recent years, high-throughput protein interaction identification methods have generated a large amount of data. When combined with the results from other in vivo and in vitro experiments, a complex set of relationships between biological molecules emerges. The growing popularity of network analysis and data mining has allowed researchers to recognize indirect connections between these molecules. Due to the interdependent nature of network entities, evaluating proteins in this context can reveal relationships that may not otherwise be evident. METHODS: We examined the human protein interaction network as it relates to human illness using the Disease Ontology. After calculating several topological metrics, we trained an alternating decision tree (ADTree) classifier to identify disease-associated proteins. Using a bootstrapping method, we created a tree to highlight conserved characteristics shared by many of these proteins. Subsequently, we reviewed a set of non-disease-associated proteins that were misclassified by the algorithm with high confidence and searched for evidence of a disease relationship. RESULTS: Our classifier was able to predict disease-related genes with 79% area under the receiver operating characteristic (ROC) curve (AUC), which indicates the tradeoff between sensitivity and specificity and is a good predictor of how a classifier will perform on future data sets. We found that a combination of several network characteristics including degree centrality, disease neighbor ratio, eccentricity, and neighborhood connectivity help to distinguish between disease- and non-disease-related proteins. Furthermore, the ADTree allowed us to understand which combinations of strongly predictive attributes contributed most to protein-disease classification. In our post-processing evaluation, we found several examples of potential novel disease-related proteins and corresponding literature evidence. 
In addition, we showed that first- and second-order neighbors in the PPI network could be used to identify likely disease associations. CONCLUSIONS: We analyzed the human protein interaction network and its relationship to disease and found that both the number of interactions with other proteins and the disease relationship of neighboring proteins helped to determine whether a protein had a relationship to disease. Our classifier predicted many proteins with no annotated disease association to be disease-related, which indicated that these proteins have network characteristics that are similar to disease-related proteins and may therefore have disease associations not previously identified. By performing a post-processing step after the prediction, we were able to identify evidence in literature supporting this possibility. This method could provide a useful filter for experimentalists searching for new candidate protein targets for drug repositioning and could also be extended to include other network and data types in order to refine these predictions.
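
Two of the network characteristics named above can be computed with a few lines of code. This sketch assumes that "disease neighbor ratio" means the fraction of a protein's direct neighbors that carry a disease annotation, which is a plausible reading rather than the paper's stated definition. The graph, the protein labels, and the function names are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) interaction pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degree_centrality(adj, node):
    """Degree divided by the maximum possible degree (n - 1)."""
    return len(adj[node]) / (len(adj) - 1)


def disease_neighbor_ratio(adj, node, disease_proteins):
    """Fraction of direct neighbors annotated as disease-associated.

    Assumed definition; the abstract names the metric without defining it.
    """
    neighbors = adj[node]
    if not neighbors:
        return 0.0
    return len(neighbors & disease_proteins) / len(neighbors)


# Hypothetical toy interaction network with two disease-annotated proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
disease = {"B", "C"}
adj = build_adjacency(edges)
features = {
    n: (degree_centrality(adj, n), disease_neighbor_ratio(adj, n, disease))
    for n in sorted(adj)
}
```

Feature tuples like these would then be passed to a classifier. The ADTree training and bootstrapping steps are outside the scope of this sketch.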


Subject(s)
Computational Biology/methods , Data Mining , Disease/genetics , Genetic Predisposition to Disease , Protein Interaction Maps , Algorithms , Genetic Association Studies , Humans , ROC Curve
13.
J Am Med Inform Assoc ; 22(2): 299-311, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25710558

ABSTRACT

OBJECTIVE: To visualize and describe collaborative electronic health record (EHR) usage for hospitalized patients with heart failure. MATERIALS AND METHODS: We identified records of patients with heart failure and all associated healthcare provider record usage through queries of the Northwestern Medicine Enterprise Data Warehouse. We constructed a network by equating access and updates of a patient's EHR to a provider-patient interaction. We then considered shared patient record access as the basis for a second network that we termed the provider collaboration network. We calculated network statistics, the modularity of provider interactions, and provider cliques. RESULTS: We identified 548 patient records accessed by 5113 healthcare providers in 2012. The provider collaboration network had 1504 nodes and 83 998 edges. We identified 7 major provider collaboration modules. Average clique size was 87.9 providers. We used a graph database to demonstrate an ad hoc query of our provider-patient network. DISCUSSION: Our analysis suggests a large number of healthcare providers across a wide variety of professions access records of patients with heart failure during their hospital stay. This shared record access tends to take place not only in a pairwise manner but also among large groups of providers. CONCLUSION: EHRs encode valuable interactions, implicitly or explicitly, between patients and providers. Network analysis provided strong evidence of multidisciplinary record access of patients with heart failure across teams of 100+ providers. Further investigation may lead to clearer understanding of how record access information can be used to strategically guide care coordination for patients hospitalized for heart failure.
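
The two-step network construction described in the methods (provider-patient interactions first, then a provider collaboration network based on shared record access) amounts to a bipartite projection. The sketch below is an illustration under assumptions: the access log format, the provider and patient labels, and the choice of "number of shared patients" as edge weight are hypothetical and not taken from the study.

```python
from collections import defaultdict
from itertools import combinations


def project_providers(access_log):
    """Project (provider, patient) access pairs onto provider-provider edges.

    Edge weight = number of patients whose record both providers accessed
    (an assumed weighting; the abstract does not specify one).
    """
    providers_by_patient = defaultdict(set)
    for provider, patient in access_log:
        providers_by_patient[patient].add(provider)  # repeats collapse here

    weights = defaultdict(int)
    for providers in providers_by_patient.values():
        for u, v in combinations(sorted(providers), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical access log: one row per record access or update.
access_log = [
    ("nurse_1", "pt_A"), ("md_1", "pt_A"), ("pharm_1", "pt_A"),
    ("nurse_1", "pt_B"), ("md_1", "pt_B"),
    ("md_2", "pt_C"),
    ("nurse_1", "pt_A"),  # repeated access to the same record
]
collaboration = project_providers(access_log)
```

A provider who never shares a patient with anyone (here `md_2`) gets no edges, which is one reason a collaboration network can have fewer nodes than the provider-patient network it is derived from.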


Subject(s)
Data Display , Electronic Health Records/statistics & numerical data , Heart Failure , Pattern Recognition, Automated , Data Mining , Health Personnel , Hospitalization , Humans , User-Computer Interface
14.
PLoS One ; 9(1): e86044, 2014.
Article in English | MEDLINE | ID: mdl-24475069

ABSTRACT

ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of short (l, d) motifs (l ≤ 10, d ≤ 2) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.


Subject(s)
Binding Sites , Chromatin Immunoprecipitation , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Nucleotide Motifs , Algorithms , Animals , DNA-Binding Proteins/metabolism , Embryonic Stem Cells , Mice , Position-Specific Scoring Matrices , Sensitivity and Specificity , Transcription Factors/metabolism
15.
BMC Bioinformatics ; 14: 227, 2013 Jul 17.
Article in English | MEDLINE | ID: mdl-23865838

ABSTRACT

BACKGROUND: Identification of transcription factor binding sites (also called 'motif discovery') in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. 
CONCLUSIONS: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/.
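
The (l, d) definition used above can be made concrete with a short check: a window of length l is an instance of a motif if it differs from the motif in at most d positions. The sketch below is a brute-force scan for instances of a known motif, not the TFBSGroup algorithm (which discovers unknown motifs through community detection). The sequences and function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, d):
    """Return (position, l-mer) for each window within d mismatches of motif."""
    l = len(motif)
    return [
        (i, sequence[i:i + l])
        for i in range(len(sequence) - l + 1)
        if hamming(sequence[i:i + l], motif) <= d
    ]


# Hypothetical toy sequences; under ZOMOPS a sequence may contain
# zero, one, or several instances of the motif.
sequences = ["TTACGTACGG", "GGGGGGGG", "ACGTTCGTAAGT"]
motif = "ACGT"  # l = 4
hits = [find_instances(s, motif, d=1) for s in sequences]
```

Here the second sequence contains no instance while the other two contain several, which is the situation the OOPS constraint cannot represent.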


Subject(s)
Algorithms , Promoter Regions, Genetic , Sequence Analysis, DNA/methods , Transcription Factors/metabolism , Animals , Binding Sites , DNA/chemistry , Escherichia coli K12/genetics , Escherichia coli K12/metabolism , Mice , Nucleotide Motifs
16.
PLoS Comput Biol ; 6(5): e1000755, 2010 May 27.
Article in English | MEDLINE | ID: mdl-20523742

ABSTRACT

Through combinatorial regulation, regulators partner with each other to control common targets, and this allows a small number of regulators to govern many targets. One interesting question is how, given this combinatorial regulation, the number of regulators scales with the number of targets. Here, we address this question by building and analyzing co-regulation (co-transcription and co-phosphorylation) networks that describe partnerships between regulators controlling common genes. We carry out analyses across five diverse species, ranging from Escherichia coli to human. These reveal many properties of partnership networks, such as the absence of a classical power-law degree distribution despite the existence of nodes with many partners. We also find that the number of co-regulatory partnerships follows an exponential saturation curve in relation to the number of targets. (For E. coli and Bacillus subtilis, only the beginning linear part of this curve is evident due to arrangement of genes into operons.) To gain intuition into the saturation process, we relate the biological regulation to more commonplace social contexts where a small number of individuals can form an intricate web of connections on the internet. Indeed, we find that the size of partnership networks saturates even as the complexity of their output increases. We also present a variety of models to account for the saturation phenomenon. In particular, we develop a simple analytical model to show how new partnerships are acquired with an increasing number of target genes; with certain assumptions, it reproduces the observed saturation. Then, we build a more general simulation of network growth and find agreement with a wide range of real networks. Finally, we perform various down-sampling calculations on the observed data to illustrate the robustness of our conclusions.
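
One common functional form for an exponential saturation curve is given here as an illustration. The abstract does not state the fitted equation, so the symbols are assumptions: P is the number of co-regulatory partnerships, N the number of targets, P_max the saturation level, and N_0 a characteristic scale.

```latex
P(N) = P_{\max}\left(1 - e^{-N/N_0}\right),
\qquad
P(N) \approx \frac{P_{\max}}{N_0}\,N \quad \text{for } N \ll N_0 .
```

Under this assumed form, the small-N approximation is linear in N, which would be consistent with the remark that only the initial linear part of the curve is visible for E. coli and Bacillus subtilis.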


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Models, Genetic , Models, Statistical , Transcription, Genetic , Animals , Computer Simulation , Escherichia coli/genetics , Gene Expression Regulation , Humans , Mice , Operon , Phosphorylation , Rats , Yeasts/genetics
17.
Nucleic Acids Res ; 38(Web Server issue): W431-5, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20478832

ABSTRACT

Nucleic acid-binding proteins are involved in a great number of cellular processes. Understanding the mechanisms underlying these proteins first requires the identification of specific residues involved in nucleic acid binding. Prediction of NA-binding residues can provide practical assistance in the functional annotation of NA-binding proteins. Predictions can also be used to expedite mutagenesis experiments, guiding researchers to the correct binding residues in these proteins. Here, we present a method for the identification of amino acid residues involved in DNA- and RNA-binding using sequence-based attributes. The method used in this work combines the C4.5 algorithm with bootstrap aggregation and cost-sensitive learning. Our DNA-binding model achieved 79.1% accuracy, while the RNA-binding model reached an accuracy of 73.2%. The NAPS web server is freely available at http://proteomics.bioengr.uic.edu/NAPS.


Subject(s)
DNA-Binding Proteins/chemistry , RNA-Binding Proteins/chemistry , Software , Algorithms , Binding Sites , Internet , Reproducibility of Results , Sequence Analysis, Protein
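
The combination of bootstrap aggregation (bagging) and cost-sensitive learning named in the abstract above can be sketched as follows. This is not the NAPS implementation: a one-level decision stump stands in for a full C4.5 tree to keep the example short, and the function names, the cost parameters `cost_pos` and `cost_neg`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, cost_pos=1.0, cost_neg=1.0):
    """Fit a one-level stump (stand-in for a C4.5 tree) by minimum total cost.

    Returns (feature_index, threshold, polarity), where `polarity` is the
    label predicted when the feature value is >= threshold.
    """
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for polarity in (1, 0):
                cost = 0.0
                for r, y in zip(rows, labels):
                    pred = polarity if r[f] >= t else 1 - polarity
                    if pred != y:
                        # A missed positive costs cost_pos, a false alarm cost_neg.
                        cost += cost_pos if y == 1 else cost_neg
                if best is None or cost < best[0]:
                    best = (cost, f, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, t, polarity = stump
    return polarity if row[f] >= t else 1 - polarity


def fit_bagged(rows, labels, n_models=25, cost_pos=1.0, cost_neg=1.0, seed=0):
    """Train one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                cost_pos, cost_neg))
    return models


def predict_bagged(models, row):
    """Majority vote over the bagged stumps."""
    votes = Counter(predict_stump(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy one-feature data: label 1 marks a (hypothetical) binding residue.
rows = [[0.1], [0.2], [0.3], [0.4], [0.7], [0.8], [0.9], [1.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagged(rows, labels)
print(predict_bagged(models, [1.0]), predict_bagged(models, [0.1]))
```

Raising `cost_pos` relative to `cost_neg` pushes each stump toward thresholds that miss fewer positives at the price of more false alarms, which is the usual reason to use cost-sensitive learning when binding residues are the minority class.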
18.
Article in English | MEDLINE | ID: mdl-19163536

ABSTRACT

CpG island (CpGI) methylation is an epigenetic modification that occurs in eukaryotes and is based on the addition of a methyl group to the number 5 carbon of the pyrimidine ring of cytosine. When methylation of a CpGI occurs, the associated gene (if any) is not expressed [1]. Aberrant methylation is thought to be a causative agent in disease [2] and drug sensitivity [3], [4]. In this work, we have predicted the methylation status of CpGIs in human chromosome 21 using sequence patterns. These patterns showed a significantly different distribution between methylated and unmethylated islands in a previous work [5]. Using C4.5 with bagging and cost-sensitive learning, we achieved 85.6% accuracy, 82.8% sensitivity, and 86.4% specificity. We then constructed 1000 alternating decision trees using a bootstrapping method and analyzed the nodes that were conserved between the trees. This allowed us to find specific combinations of sequence patterns that distinguished between methylated and unmethylated CpGIs. Analysis of these characteristics offers some insight into the conditions that permit or prevent methylation.


Subject(s)
Chromosomes, Human, Pair 21 , CpG Islands/genetics , Algorithms , DNA Methylation , Decision Support Techniques , Gene Expression , Gene Silencing , Genome, Human , Humans , Models, Statistical , Models, Theoretical , Promoter Regions, Genetic , Reproducibility of Results
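
The three performance figures quoted in the abstract above are linked by a simple identity: accuracy is the class-weighted average of sensitivity and specificity. The sketch below computes the three metrics from confusion-matrix counts and inverts that identity. It assumes that "positive" means methylated and that all three figures come from the same evaluation set; neither point is stated in the abstract, and the function names are hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


def implied_positive_fraction(accuracy, sensitivity, specificity):
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p.

    Undefined when sensitivity equals specificity (division by zero).
    """
    return (specificity - accuracy) / (specificity - sensitivity)


# Reported figures from the abstract, expressed as fractions.
p = implied_positive_fraction(0.856, 0.828, 0.864)
print(round(p, 3))
```

Under those assumptions the reported numbers imply that roughly a fifth of the evaluated islands were in the positive class. Because the published figures are rounded, this is only an approximate consistency check, not a result from the paper.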
19.
Article in English | MEDLINE | ID: mdl-18003133

ABSTRACT

Gene regulation requires specific protein-DNA interactions. Detecting the short and variable DNA sequences in gene promoter regions to which transcription factors (TF) bind is a difficult challenge in bioinformatics. Here we have developed two-body and three-body interaction potentials that are able to assess protein-DNA interaction and achieve a higher level of specificity in the recognition of TF-binding sites. The potentials were calculated using experimentally characterized 3-D structures of protein-DNA complexes. We implemented two approaches in order to evaluate the potentials. Using the first method, we calculated the Z-score of the potential energy of a true TF-binding sequence when compared to 50,000 randomly generated DNA sequences. The second method allowed us to take advantage of the ability of statistical potentials to recognize novel TF-binding sites within the promoter region of genes. We found that the three-body potential, which takes into account the interaction between a DNA base and a protein residue with regard to the effect of a neighboring DNA base, had a better average Z-score than that of the two-body potential. This neighbor effect suggests that the local conformation of DNA does play a critical role in specific residue-base recognition. In all cases, the potentials developed here outperformed published results. The two sets of potentials were tested further by applying them in genome-scale TF-binding site prediction for the CRP protein in E. coli. Out of the 142 cases, 28% of the true binding sites ranked first (i.e.


Subject(s)
DNA/chemistry , Proteins/chemistry , Binding Sites , DNA/genetics , DNA Replication , Protein Binding , Proteins/genetics , Reproducibility of Results
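
The first evaluation method in the abstract above, scoring a true binding sequence against 50,000 random DNA sequences, reduces to a Z-score calculation. The sketch below shows only that calculation. The energy function is a hypothetical placeholder (it counts matches to an arbitrary illustrative motif) and is not the two-body or three-body potential from the paper. It also assumes that a lower energy means a more favourable interaction, so a good potential gives the true site a strongly negative Z-score.

```python
import random
import statistics


def z_score(true_energy, decoy_energies):
    """Z-score of the true site's energy relative to the decoy distribution."""
    mean = statistics.fmean(decoy_energies)
    sd = statistics.pstdev(decoy_energies)
    return (true_energy - mean) / sd


def random_dna(length, rng):
    """Uniformly random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def toy_energy(seq, site="TGTGA"):
    """Hypothetical placeholder potential: -1 for each position matching `site`."""
    return -sum(1 for a, b in zip(seq, site) if a == b)


rng = random.Random(0)
true_site = "TGTGA"  # illustrative motif, not data from the paper
decoys = [toy_energy(random_dna(len(true_site), rng)) for _ in range(50_000)]
print(round(z_score(toy_energy(true_site), decoys), 2))
```

Replacing `toy_energy` with a structure-derived potential leaves the rest of the calculation unchanged; comparing the average Z-score across many known sites is then one way to rank competing potentials.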
20.
Ann Biomed Eng ; 35(6): 1043-52, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17436108

ABSTRACT

A protein's function depends in large part on interactions with other molecules. With an increasing number of protein structures becoming available every year, a corresponding structural annotation approach identifying such interactions grows more expedient. At the same time, machine learning has gained popularity in bioinformatics, providing robust annotation of genes and proteins without sequence homology. Here we have developed a general machine learning protocol to identify proteins that bind DNA and membrane. In general, there is no theory or even rule of thumb for picking the best machine learning algorithm. Thus, we systematically compared several classification algorithms known to perform well. The boosted tree classifier gave the best performance, achieving 93% and 88% accuracy in discriminating non-homologous proteins that bind membrane and DNA, respectively, significantly outperforming all previously published works. We also attempted to address the importance of the attributes in function prediction and the relationships between relevant attributes. A graphical model based on boosted trees is applied to study the important features in discriminating DNA-binding proteins. In summary, the current protocol identified physical features important in DNA and membrane binding, rather than annotating function through sequence similarity.


Subject(s)
Algorithms , Artificial Intelligence , DNA-Binding Proteins/chemistry , Membrane Proteins/chemistry , Models, Chemical , Pattern Recognition, Automated/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , DNA-Binding Proteins/classification , DNA-Binding Proteins/ultrastructure , Membrane Proteins/classification , Membrane Proteins/ultrastructure , Models, Molecular , Molecular Sequence Data , Sequence Alignment/methods , Sequence Homology, Amino Acid , Structure-Activity Relationship