Results 1 - 7 of 7
1.
PLoS One ; 17(8): e0271970, 2022.
Article in English | MEDLINE | ID: mdl-35921272

ABSTRACT

We formulate and apply a novel paradigm for characterization of genome data quality, which quantifies the effects of intentional degradation of quality. The rationale is that the higher the initial quality, the more fragile the genome and the greater the effects of degradation. We demonstrate that this phenomenon is ubiquitous, and that quantified measures of degradation can be used for multiple purposes, illustrated by outlier detection. We focus on identifying outliers that may be problematic with respect to data quality, but might also be true anomalies or even attempts to subvert the database.
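The degradation paradigm can be sketched in a few lines: corrupt a sequence at a known rate and measure how much a simple quality signal drops. Here the signal is k-mer overlap with the original, an invented stand-in for the paper's actual quality measures; the function names and parameters are likewise illustrative assumptions, not the authors' method.

```python
import random

def kmer_set(seq, k=4):
    # All distinct k-mers of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def degrade(seq, rate, rng):
    # Intentionally corrupt bases at the given rate with random substitutions.
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)

def degradation_effect(seq, rate=0.05, k=4, trials=20, seed=0):
    # Mean loss of k-mer overlap with the original after intentional
    # degradation; an illustrative stand-in for the paper's quality measures.
    rng = random.Random(seed)
    original = kmer_set(seq, k)
    loss = 0.0
    for _ in range(trials):
        surviving = original & kmer_set(degrade(seq, rate, rng), k)
        loss += 1 - len(surviving) / max(len(original), 1)
    return loss / trials
```

An information-rich sequence is fragile under this measure, while a low-complexity repeat barely moves; genomes whose degradation profile deviates sharply from the corpus median are then candidate outliers, in the spirit of the abstract.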


Subject(s)
Genome , Base Sequence , Databases, Factual
2.
ArXiv ; 2021 Sep 13.
Article in English | MEDLINE | ID: mdl-34545333

ABSTRACT

Specified Certainty Classification (SCC) is a paradigm for classifiers whose outputs carry uncertainties, typically in the form of Bayesian posterior probabilities. By allowing the classifier output to be less precise than one of a set of atomic decisions, SCC allows all decisions to achieve a specified level of certainty, and it provides insight into classifier behavior by examining all decisions that are possible. Our primary illustration is read classification for reference-guided genome assembly, but we demonstrate the breadth of SCC by also analyzing COVID-19 vaccination data.
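One plausible reading of the SCC output rule can be sketched directly: given a Bayesian posterior over atomic labels, return the smallest set of labels whose combined posterior mass meets the specified certainty, so the decision is atomic when the classifier is confident and coarser when it is not. The function name and the greedy construction are assumptions for illustration, not the paper's algorithm.

```python
def scc_decision(posterior, certainty=0.95):
    # Greedily take labels in decreasing order of posterior probability until
    # the accumulated mass reaches the specified certainty level.
    ranked = sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)
    chosen, mass = [], 0.0
    for label, prob in ranked:
        chosen.append(label)
        mass += prob
        if mass >= certainty:
            break
    return frozenset(chosen)
```

A confidently classified read yields an atomic decision such as {'chr1'}, while an ambiguous read yields a coarser set such as {'chr1', 'chr2'} that still carries the specified certainty.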

3.
PLoS One ; 14(9): e0221459, 2019.
Article in English | MEDLINE | ID: mdl-31550255

ABSTRACT

Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used exact and inexact matching methods. We compared the weights assigned to record pairs and evaluated how matching weights corresponded to a gold standard, medical record number. Deduplicated datasets contained 69,523 inpatient and 176,154 outpatient records. Linkage runs blocking on DOB produced weights ranging in number from 8 for exact matching to 64,273 for inexact matching; linkage runs blocking on YOB produced 8 to 916,806 weights. Exact matching matched record pairs with identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group, but algorithms differentially prioritized certain variables. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04-93.36%) and positive predictive value (PPV) (range 86.67-97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV of highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across the 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching. Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.
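The core mechanics the study compares (blocking, field-wise agreement weights, and a decision threshold) can be sketched as follows. The record layout, the ±1 agreement weights, and the threshold are invented toy stand-ins for the frequency-based weights real linkage packages compute.

```python
def block_key(rec, mode="dob"):
    # Blocking: only records sharing this key are ever compared.
    return rec["dob"] if mode == "dob" else rec["dob"][:4]  # full DOB vs. YOB

def match_weight(a, b, fields=("first", "last", "gender", "dob")):
    # Toy agreement weight: +1 for each agreeing field, -1 for each disagreement.
    return sum(1 if a[f] == b[f] else -1 for f in fields)

def link(left, right, mode="dob", threshold=2):
    # Index the right-hand records by block, then score only within-block pairs.
    blocks = {}
    for rec in right:
        blocks.setdefault(block_key(rec, mode), []).append(rec)
    pairs = []
    for rec in left:
        for cand in blocks.get(block_key(rec, mode), []):
            weight = match_weight(rec, cand)
            if weight >= threshold:
                pairs.append((rec["id"], cand["id"], weight))
    return pairs
```

Because a YOB key admits every record born in the same year, it generates far more candidate pairs (and weights) than a full-DOB key, mirroring the 8-versus-916,806 weight counts reported above.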


Subject(s)
Algorithms , Electronic Health Records/statistics & numerical data , Medical Record Linkage/methods , Software , Databases, Factual/statistics & numerical data , Electronic Health Records/standards , Humans , Medical Record Linkage/standards
4.
BMC Med Inform Decis Mak ; 14: 108, 2014 Dec 05.
Article in English | MEDLINE | ID: mdl-25476843

ABSTRACT

BACKGROUND: For researchers and public health agencies, the complexity of high-dimensional spatio-temporal data in surveillance for large reporting networks presents numerous challenges, which include low signal-to-noise ratios, spatial and temporal dependencies, and the need to characterize uncertainties. Central to the problem in the context of disease outbreaks is a decision structure that requires trading off false positives for delayed detections. METHODS: In this paper we apply a previously developed Bayesian hierarchical model to a data set from the Indiana Public Health Emergency Surveillance System (PHESS) containing three years of emergency department visits for influenza-like illness and respiratory illness. Among issues requiring attention were selection of the underlying network (too few nodes attenuate important structure, while too many nodes impose barriers to both modeling and computation); ensuring that confidentiality protections in the data do not impede modeling of important day-of-week effects; and evaluating the performance of the model. RESULTS: Our results show that the model captures salient spatio-temporal dynamics that are present in public health surveillance data sets, and that it appears to detect both "annual" and "atypical" outbreaks in a timely, accurate manner. We present maps that help make model output accessible and comprehensible to public health authorities. We use an illustrative family of decision rules to show how output from the model can be used to inform false positive-delayed detection tradeoffs. CONCLUSIONS: The advantages of our methodology for addressing the complicated issues of real-world surveillance data applications are three-fold. First, we can easily incorporate additional covariate information and spatio-temporal dynamics in the data. Second, we furnish a unified framework to provide uncertainties associated with each parameter. Third, we are able to handle multiplicity issues by using a Bayesian approach. The urgent need to quickly and effectively monitor the health of the public makes our methodology a potentially plausible and useful surveillance approach for health professionals.
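The false positive/delayed detection tradeoff driven by an illustrative decision rule can be sketched directly: alarm whenever the model's posterior outbreak probability crosses a threshold, then score the rule by its false alarms before onset and its delay after onset. The posterior series, onset day, and thresholds below are invented.

```python
def alarm_days(posterior_by_day, threshold):
    # Alarm on any day the posterior outbreak probability exceeds the threshold.
    return [day for day, p in enumerate(posterior_by_day) if p > threshold]

def detection_delay(alarms, onset_day):
    # Days from true onset to the first alarm at or after onset (None if missed).
    later = [d for d in alarms if d >= onset_day]
    return later[0] - onset_day if later else None

def false_positives(alarms, onset_day):
    # Alarms raised before the true onset.
    return sum(1 for d in alarms if d < onset_day)

posterior = [0.10, 0.20, 0.15, 0.60, 0.80, 0.95]  # invented daily outbreak probabilities
onset = 3                                         # invented true onset day
```

On this invented series, a lax threshold of 0.1 detects the outbreak with zero delay but raises two pre-onset false alarms, while a strict threshold of 0.9 raises none but detects two days late, which is exactly the tradeoff the abstract describes.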


Subject(s)
Disease Outbreaks/statistics & numerical data , Emergency Service, Hospital/statistics & numerical data , Influenza, Human/epidemiology , Population Surveillance/methods , Spatio-Temporal Analysis , Bayes Theorem , Humans , Indiana/epidemiology , Markov Chains , Models, Biological , Normal Distribution , Organizational Case Studies , Respiratory Tract Diseases/epidemiology
5.
Stat Med ; 31(19): 2123-36, 2012 Aug 30.
Article in English | MEDLINE | ID: mdl-22388709

ABSTRACT

Reliable surveillance models are an important tool in public health because they aid in mitigating disease outbreaks, identify where and when disease outbreaks occur, and predict future occurrences. Although many statistical models have been devised for surveillance purposes, none are able to simultaneously achieve the important practical goals of good sensitivity and specificity, proper use of covariate information, inclusion of spatio-temporal dynamics, and transparent support to decision-makers. In an effort to achieve these goals, this paper proposes a spatio-temporal conditional autoregressive hidden Markov model with an absorbing state. The model performs well in both a large simulation study and in an application to influenza/pneumonia fatality data.
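A minimal, single-region sketch of the hidden-Markov component (ignoring the spatial conditional-autoregressive structure entirely) shows why the absorbing state matters: once the chain enters the outbreak state it cannot leave, so the filtered outbreak probability only has to be pushed up once by the data. The two states, Poisson means, transition probabilities, and counts are all invented.

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

def forward_filter(counts, trans, means):
    # Forward recursion for P(state_t | counts_{1..t}) with Poisson emissions.
    belief = np.array([1.0, 0.0])         # start in the endemic state
    history = []
    for y in counts:
        pred = belief @ trans             # one-step state prediction
        lik = np.array([poisson_pmf(y, m) for m in means])
        belief = pred * lik               # Bayes update on today's count
        belief /= belief.sum()
        history.append(belief.copy())
    return np.array(history)

# State 1 (outbreak) is absorbing: its row of the transition matrix is [0, 1].
trans = np.array([[0.95, 0.05],
                  [0.00, 1.00]])
means = [5.0, 20.0]                       # endemic vs. outbreak mean daily counts
```

Feeding in a count series that jumps from endemic to elevated levels drives the filtered outbreak probability from near 0 to near 1, where the absorbing state then keeps it.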


Subject(s)
Disease Outbreaks/statistics & numerical data , Influenza, Human/epidemiology , Population Surveillance/methods , Space-Time Clustering , Bayes Theorem , Computer Simulation , Humans , Markov Chains , Poisson Distribution , Syndrome , United States/epidemiology
6.
Stat Med ; 27(19): 3805-16, 2008 Aug 30.
Article in English | MEDLINE | ID: mdl-18366144

ABSTRACT

Propensity score matching is often used in observational studies to create treatment and control groups with similar distributions of observed covariates. Typically, propensity scores are estimated using logistic regressions that assume linearity between the logistic link and the predictors. We evaluate the use of generalized additive models (GAMs) for estimating propensity scores. We compare logistic regressions and GAMs in terms of balancing covariates using simulation studies with artificial and genuine data. We find that, when the distributions of covariates in the treatment and control groups overlap sufficiently, using GAMs can improve overall covariate balance, especially for higher-order moments of distributions. When the distributions in the two groups overlap insufficiently, GAM more clearly reveals this fact than logistic regression does. We also demonstrate via simulation that matching with GAMs can result in larger reductions in bias when estimating treatment effects than matching with logistic regression.
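The contrast between a linear logistic propensity model and a GAM-style one can be sketched with a hand-rolled spline expansion: the same logistic likelihood, fitted once on the raw covariate and once on a truncated-power cubic spline basis. The data-generating process, knot locations, and the gradient-ascent fitter are invented for illustration; a real analysis would use an established GAM package.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spline_basis(x, knots):
    # Truncated-power cubic spline basis: a GAM-style expansion of one covariate.
    cols = [x, x**2, x**3] + [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    B = np.column_stack(cols)
    B = (B - B.mean(axis=0)) / B.std(axis=0)   # standardize for stable fitting
    return np.column_stack([np.ones_like(x), B])

def fit_logistic(X, t, iters=5000, lr=0.5):
    # Maximum-likelihood logistic regression via plain gradient ascent.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w += lr * X.T @ (t - sigmoid(X @ w)) / len(t)
    return w

def log_lik(X, t, w):
    p = sigmoid(X @ w)
    return float(np.sum(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)))

# Invented example: treatment assignment strongly nonlinear in the covariate.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
t = (rng.uniform(size=500) < sigmoid(2 * np.sin(2 * x))).astype(float)

X_lin = np.column_stack([np.ones_like(x), (x - x.mean()) / x.std()])
X_gam = spline_basis(x, knots=[-1.0, 0.0, 1.0])
w_lin = fit_logistic(X_lin, t)
w_gam = fit_logistic(X_gam, t)
```

Because the spline basis contains the linear term, the GAM-style fit can only improve the likelihood; the improvement is large here because assignment is nonlinear in the covariate, which is the regime in which the abstract reports better covariate balance from GAMs.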


Subject(s)
Logistic Models , Randomized Controlled Trials as Topic/methods , Analysis of Variance , Computer Simulation , Confounding Factors, Epidemiologic , Humans , Observation , Treatment Outcome
7.
J Comput Aided Mol Des ; 19(9-10): 739-47, 2005.
Article in English | MEDLINE | ID: mdl-16267693

ABSTRACT

We present a method for performing statistically valid linear regressions on the union of distributed chemical databases that preserves confidentiality of those databases. The method employs secure multi-party computation to share local sufficient statistics necessary to compute least squares estimators of regression coefficients, error variances and other quantities of interest. We illustrate our method with an example containing four companies' rather different databases.
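The scheme can be sketched in a few lines: each party computes its local sufficient statistics (X'X and X'y), the parties securely sum them, and the pooled least-squares coefficients follow from the normal equations. The additive-masking "secure" summation below is a toy stand-in for real secure multi-party computation (where mask generation is itself distributed), and the function names are invented.

```python
import numpy as np

def local_stats(X, y):
    # Each company computes sufficient statistics locally; raw records never leave.
    return X.T @ X, X.T @ y

def masked_sum(shares, rng):
    # Toy secure summation: every party adds a random mask, and the masks are
    # constructed to cancel, so only the total is ever revealed.
    masks = [rng.normal(size=np.shape(shares[0])) for _ in shares[:-1]]
    masks.append(-sum(masks))
    return sum(s + m for s, m in zip(shares, masks))

def secure_regression(parties, seed=0):
    # Solve the pooled normal equations from securely summed statistics.
    rng = np.random.default_rng(seed)
    stats = [local_stats(X, y) for X, y in parties]
    XtX = masked_sum([s[0] for s in stats], rng)
    Xty = masked_sum([s[1] for s in stats], rng)
    return np.linalg.solve(XtX, Xty)
```

The resulting coefficients equal those of an ordinary regression on the pooled data, which is the paper's point: the union of the databases is analyzed without any single database being disclosed.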


Subject(s)
Databases, Factual , Models, Chemical , Algorithms , Computer Security , Least-Squares Analysis , Linear Models , Organic Chemicals/chemistry , Regression Analysis , Solubility , Water