Search | VHL Regional Portal

Creating diagnostic scores using data-adaptive regression: An application to prediction of 30-day mortality among stroke victims in a rural hospital in India.

Birkner, Merrill D; Kalantri, Sp; Solao, Vaishali; Badam, Priya; Joshi, Rajnish; Goel, Ashish; Pai, Madhukar; Hubbard, Alan E.

Ther Clin Risk Manag ; 3(3): 475-84, 2007 Jun.

Article in English | MEDLINE | ID: mdl-18488068

ABSTRACT

Developing diagnostic scores for prediction of clinical outcomes uses medical knowledge regarding which variables are most important and empirical/statistical learning to find the functional form of these covariates that provides the most accurate prediction (eg, highest specificity and sensitivity). Given the variables chosen by the clinician as most relevant or available due to limited resources, the job is a purely statistical one: which model, among competitors, provides the most accurate prediction of clinical outcomes, where accuracy is relative to some loss function. An optimal algorithm for choosing a model (1) provides a flexible sequence of models that can 'twist and bend' to fit the data and (2) uses a validation procedure that optimally balances bias/variance by choosing models of the right size (complexity). We propose a solution to creating diagnostic scores that, given the available variables, will appropriately trade off model complexity with variability of estimation; the algorithm uses a combination of machine learning, logistic regression (POLYCLASS) and cross-validation. As an example, we apply the procedure to data collected from stroke victims in a rural clinic in India, where the outcome of interest is death within 30 days. A quick and accurate diagnosis of stroke is important for immediate resuscitation. Equally important is giving patients and their families an indication of the prognosis. Accurate predictions of clinical outcomes made soon after the onset of stroke can also help in choosing appropriate supportive treatment. Severity scores have been created in developed nations (for instance, Guy's Prognostic Score, Canadian Neurological Score, and the National Institute of Health Stroke Scale). However, we propose a method for developing scores appropriate to local settings in possibly very different medical circumstances.
Specifically, we used a freely available and easy-to-use exploratory regression technique (POLYCLASS) to predict 30-day mortality following stroke in a rural Indian population and compared its accuracy with that of these existing stroke scales; POLYCLASS predicted more accurately than the existing scores (sensitivity and specificity of 90% and 76%, respectively). This method can easily be extended to different clinical settings and to different disease outcomes. In addition, the software and algorithms used are open-source (free) and we provide the code in the appendix.
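The sensitivity and specificity quoted in this abstract can be illustrated with a minimal sketch. It assumes a model has already produced a predicted 30-day mortality risk for each patient; the function name, the 0.5 threshold, and the toy data are hypothetical and are not taken from the article or from POLYCLASS.

```python
def sensitivity_specificity(y_true, risk, threshold):
    """Return (sensitivity, specificity) for a risk score cut at `threshold`.

    y_true: 1 = died within 30 days, 0 = survived (assumed coding).
    risk:   predicted probability of death from any fitted model.
    """
    tp = fn = tn = fp = 0
    for y, r in zip(y_true, risk):
        predicted_death = r >= threshold
        if y == 1 and predicted_death:
            tp += 1
        elif y == 1:
            fn += 1
        elif not predicted_death:
            tn += 1
        else:
            fp += 1
    # Guard against an empty class so the ratio is always defined.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical toy data, for illustration only.
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
risks = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
print(sensitivity_specificity(outcomes, risks, 0.5))
```

In a cross-validated setting, these quantities would be computed on held-out patients rather than on the data used to fit the model.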

Comorbidity was associated with neurologic and psychiatric diseases: a general practice-based controlled study.

Nuyen, Jasper; Schellevis, François G; Satariano, William A; Spreeuwenberg, Peter M; Birkner, Merrill D; van den Bos, Geertrudis A M; Groenewegen, Peter P.

J Clin Epidemiol ; 59(12): 1274-84, 2006 Dec.

Article in English | MEDLINE | ID: mdl-17098570

ABSTRACT

BACKGROUND AND OBJECTIVE: To comprehensively examine comorbidity in unselected cohorts of patients with depression, stroke, multiple sclerosis (MS), Parkinson's disease/parkinsonism (PD/PKM), dementia, migraine, and epilepsy. METHODS: This cross-sectional study used morbidity data recorded by Dutch general practitioners. Index disease cohort sizes ranged from 241 patients with MS to 6,641 patients with lifetime depression. Thirty somatic and seven psychiatric disease categories were examined to determine whether they were comorbid with the index diseases by performing comparisons with age- and gender-matched control cohorts. Identified comorbidities were classified as either "possible" or "highly probable" comorbidity. RESULTS: An extensive range of 26 disease categories was found to be comorbid with lifetime depression. The comorbidity profile of stroke was also wide, including 21 disease categories. The comorbidity patterns of migraine and epilepsy each comprised 11 disease categories. Those concerning MS, PD/PKM, and dementia included a small number of disease categories. CONCLUSION: This study provides comprehensive knowledge of the occurrence of somatic and psychiatric comorbidity in general populations of patients with depression, stroke, MS, PD/PKM, dementia, migraine, and epilepsy. The implications of the findings for clinical practice and research are discussed.

Subject(s)

Family Practice , Mental Disorders/epidemiology , Nervous System Diseases/epidemiology , Aged , Aged, 80 and over , Case-Control Studies , Cohort Studies , Comorbidity , Cross-Sectional Studies , Dementia/epidemiology , Depression/epidemiology , Epilepsy/epidemiology , Female , Humans , Male , Middle Aged , Migraine Disorders/epidemiology , Multiple Sclerosis/epidemiology , Netherlands/epidemiology , Parkinson Disease/epidemiology , Retrospective Studies , Stroke/epidemiology

Issues of processing and multiple testing of SELDI-TOF MS proteomic data.

Birkner, Merrill D; Hubbard, Alan E; van der Laan, Mark J; Skibola, Christine F; Hegedus, Christine M; Smith, Martyn T.

Stat Appl Genet Mol Biol ; 5: Article11, 2006.

Article in English | MEDLINE | ID: mdl-16646865

ABSTRACT

A new data filtering method for SELDI-TOF MS proteomic spectra data is described. We examined technical repeats (2 per subject) of intensity versus m/z (mass/charge) of bone marrow cell lysate for two groups of childhood leukemia patients: acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). As others have noted, the type of data processing as well as experimental variability can have a disproportionate impact on the list of "interesting" proteins (see Baggerly et al. (2004)). We propose a list of processing and multiple testing techniques: 1) correction for background drift; 2) filtering using smooth regression and cross-validated bandwidth selection; 3) peak finding; and 4) methods to correct for multiple testing (van der Laan et al. (2005)). The result is a list of proteins (indexed by m/z) where average expression is significantly different among disease (or treatment, etc.) groups. The procedures are intended to provide a sensible and statistically driven algorithm, which we argue provides a list of proteins that have a significant difference in expression. Given no sources of unmeasured bias (such as confounding of experimental conditions with disease status), proteins found to be statistically significant using this technique have a low probability of being false positives.
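Step 2 of this pipeline (smooth regression with cross-validated bandwidth selection) can be sketched as follows. The abstract does not say which smoother or cross-validation scheme the authors used, so this is only an illustration under stated assumptions: a Gaussian-kernel Nadaraya-Watson smoother with leave-one-out cross-validation. All function names are hypothetical.

```python
import math


def kernel_smooth(xs, ys, x0, h, skip=None):
    """Nadaraya-Watson estimate at x0 with a Gaussian kernel of bandwidth h.

    `skip` is an index left out of the fit (used for leave-one-out CV).
    Returns None if every kernel weight underflows to zero.
    """
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    return num / den if den > 0 else None


def loo_cv_error(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    total = 0.0
    for i in range(len(xs)):
        pred = kernel_smooth(xs, ys, xs[i], h, skip=i)
        if pred is None:  # bandwidth too small to predict this point
            return math.inf
        total += (ys[i] - pred) ** 2
    return total / len(xs)


def select_bandwidth(xs, ys, candidates):
    """Pick the candidate bandwidth with the smallest leave-one-out error."""
    return min(candidates, key=lambda h: loo_cv_error(xs, ys, h))
```

Here `xs` would play the role of m/z values and `ys` the measured intensities; the selected bandwidth is then used to smooth the full spectrum before peak finding.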

Subject(s)

Leukemia, Myeloid/metabolism , Neoplasm Proteins/metabolism , Precursor Cell Lymphoblastic Leukemia-Lymphoma/metabolism , Proteomics/methods , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Acute Disease , Algorithms , Bone Marrow Cells/metabolism , Child , Data Interpretation, Statistical , Humans , Probability

Empirical Bayes and resampling based multiple testing procedure controlling tail probability of the proportion of false positives.

van der Laan, Mark J; Birkner, Merrill D; Hubbard, Alan E.

Stat Appl Genet Mol Biol ; 4: Article29, 2005.

Article in English | MEDLINE | ID: mdl-16646847

ABSTRACT

Simultaneously testing a collection of null hypotheses about a data generating distribution based on a sample of independent and identically distributed observations is a fundamental and important statistical problem involving many applications. In this article we propose a new re-sampling based multiple testing procedure asymptotically controlling the probability that the proportion of false positives among the set of rejections exceeds q at level alpha, where q and alpha are user supplied numbers. The procedure involves 1) specifying a conditional distribution for a guessed set of true null hypotheses, given the data, which asymptotically is degenerate at the true set of null hypotheses, and 2) specifying a generally valid null distribution for the vector of test-statistics proposed in Pollard & van der Laan (2003), and generalized in our subsequent articles Dudoit, van der Laan, & Pollard (2004), van der Laan, Dudoit, & Pollard (2004), and van der Laan, Dudoit, & Pollard (2004b). Ingredient 1) is established by fitting the empirical Bayes two component mixture model (Efron (2001b)) to the data to obtain an upper bound for marginal posterior probabilities of the null being true, given the data. We establish the finite sample rationale behind our proposal, and prove that this new multiple testing procedure asymptotically controls the desired tail probability for the proportion of false positives under general data generating distributions. In addition, we provide simulation studies establishing that this method is generally more powerful in finite samples than our previously proposed augmentation multiple testing procedure (van der Laan, Dudoit, & Pollard (2004b)) and competing procedures from the literature. Finally, we illustrate our methodology with a data analysis.
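The quantities named in this abstract can be written out in standard notation. The symbols below ($V_n$, $R_n$, $p_0$, $f_0$, $f_1$) are notation assumed here for illustration and may differ from the article's own; the bound shown is the usual one obtained by replacing the unknown null proportion with 1.

```latex
% Error criterion: tail probability of the proportion of false positives.
% V_n = number of false rejections, R_n = total number of rejections.
\[
  \limsup_{n \to \infty} \Pr\!\left( \frac{V_n}{R_n} > q \right) \le \alpha
\]

% Two-component mixture for a test statistic T with null proportion p_0,
% null density f_0 and alternative density f_1:
\[
  f(t) = p_0 f_0(t) + (1 - p_0) f_1(t)
\]

% Marginal posterior probability of the null, and its upper bound (p_0 \le 1):
\[
  \Pr(H_0 \mid T = t) = \frac{p_0 f_0(t)}{f(t)} \le \min\!\left\{ 1, \frac{f_0(t)}{f(t)} \right\}
\]
```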

Multiple testing and data adaptive regression: an application to HIV-1 sequence data.

Birkner, Merrill D; Sinisi, Sandra E; van der Laan, Mark J.

Stat Appl Genet Mol Biol ; 4: Article8, 2005.

Article in English | MEDLINE | ID: mdl-16646861

ABSTRACT

Analysis of viral strand sequence data and viral replication capacity could potentially lead to biological insights regarding the replication ability of HIV-1. Determining specific target codons on the viral strand will facilitate the manufacturing of target-specific antiretrovirals. Various algorithmic and analysis techniques can be applied to this problem. In this paper, we apply two techniques to a data set consisting of 317 patients, each with 282 sequenced protease and reverse transcriptase codons. The first is a set of recently developed multiple testing procedures used to find codons that have significant univariate associations with the replication capacity of the virus. A single-step multiple testing procedure (Pollard and van der Laan 2003) was used to control the family-wise error rate (FWER) at the five percent alpha level, and augmentation multiple testing procedures were applied to control the generalized family-wise error rate (gFWER) or the tail probability of the proportion of false positives (TPPFP). We also applied a data-adaptive multiple regression algorithm to obtain a prediction of viral replication capacity based on an entire mutant/non-mutant sequence profile. This is a loss-based, cross-validated Deletion/Substitution/Addition regression algorithm (Sinisi and van der Laan 2004), which builds candidate estimators in the prediction of a univariate outcome by minimizing an empirical risk. These methods are two separate techniques with distinct goals used to analyze this structure of viral data.
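The idea of augmenting an FWER-controlling rejection set can be sketched in a few lines. Two assumptions are made here and should be read as such. First, a plain Bonferroni cut-off stands in for the resampling-based single-step procedure used in the paper. Second, the augmentation rules are taken to be the commonly described ones: add the next k most significant hypotheses for gFWER(k), and for TPPFP(q) add as many as keep the added fraction of the rejection set at or below q. Function names and inputs are hypothetical.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Indices rejected by Bonferroni (a simple FWER-controlling stand-in)."""
    m = len(pvalues)
    return {i for i, p in enumerate(pvalues) if p <= alpha / m}


def _augment(pvalues, rejected, extra):
    """Add the `extra` most significant hypotheses not yet rejected."""
    remaining = sorted(
        (i for i in range(len(pvalues)) if i not in rejected),
        key=lambda i: pvalues[i],
    )
    return set(rejected) | set(remaining[:extra])


def gfwer_augmentation(pvalues, rejected, k):
    """Allow up to k false positives: reject k additional hypotheses."""
    return _augment(pvalues, rejected, k)


def tppfp_augmentation(pvalues, rejected, q):
    """Reject the largest number of extras m with m / (m + r0) <= q."""
    r0 = len(rejected)
    extra = 0
    while (extra + 1 <= len(pvalues) - r0
           and (extra + 1) <= q * (extra + 1 + r0)):
        extra += 1
    return _augment(pvalues, rejected, extra)
```

With three initial rejections and q = 0.25, for example, one additional hypothesis is rejected, since 1/4 does not exceed q but 2/5 does.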

Spatial and temporal variability in schistosome cercarial density detected by mouse bioassays in village irrigation ditches in Sichuan, China.

Spear, Robert C; Zhong, Bo; Mao, Yong; Hubbard, Alan; Birkner, Merrill; Remais, Justin; Qiu, Dongchuan.

Am J Trop Med Hyg ; 71(5): 554-7, 2004 Nov.

Article in English | MEDLINE | ID: mdl-15569783

ABSTRACT

A mouse bioassay was used monthly over the infection season of 2001 to determine the temporal and spatial variability of schistosome cercarial density in irrigation ditches in five villages in southwestern Sichuan Province in the People's Republic of China. Analysis of variance showed that approximately half of the variability was due to the village and site within the village, with little contribution from air temperature, weekly average rainfall, or the month within the infection season in which the bioassay was performed. The location-specific variability in these data suggests that epidemiologic studies will generally have low power to detect the influence of water-contact intensity on human parasite burden without taking account of variations in cercarial density at sites of water contact.
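The statement that about half of the variability is attributable to location can be read as a ratio of sums of squares. The sketch below is a simplified one-way version, assuming a single grouping factor; the study's analysis also involved sites nested within villages and other covariates, which this does not reproduce. The function name and example values are hypothetical.

```python
def fraction_of_variance_between_groups(groups):
    """Between-group sum of squares divided by total sum of squares.

    groups: dict mapping a group label (e.g. a village) to a list of
    measurements (e.g. bioassay counts). One-way layout only.
    """
    values = [v for obs in groups.values() for v in obs]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
        for obs in groups.values()
    )
    return ss_between / ss_total if ss_total else float("nan")


# Hypothetical measurements for two locations, for illustration only.
example = {"village_A": [1.0, 3.0], "village_B": [5.0, 7.0]}
print(fraction_of_variance_between_groups(example))
```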

Subject(s)

Schistosoma/physiology , Schistosomiasis/epidemiology , Schistosomiasis/parasitology , Water Microbiology , Animals , China/epidemiology , Disease Vectors , Humans , Mice/parasitology , Population Density , Rain , Schistosomiasis/etiology , Schistosomiasis/transmission , Seasons , Snails/parasitology , Temperature , Therapeutic Irrigation

Factors influencing the transmission of Schistosoma japonicum in the mountains of Sichuan Province of China.

Spear, Robert C; Seto, Edmund; Liang, Song; Birkner, Merrill; Hubbard, Alan; Qiu, Dongchuan; Yang, Changhong; Zhong, Bo; Xu, Fashen; Gu, Xueguang; Davis, George M.

Am J Trop Med Hyg ; 70(1): 48-56, 2004 Jan.

Article in English | MEDLINE | ID: mdl-14971698

ABSTRACT

Twenty villages in the Anning River Valley of southwestern Sichuan, China, were surveyed for Schistosoma japonicum infections in humans and domestic animals. Also surveyed were human water contact patterns, snail populations, cercarial risk in irrigation systems, and agricultural land use. Few animals were infected, while village prevalence of infection in humans ranged from 3% to 68% and average village eggs per gram of stool ranged from 0 to 110. Except for occupation and education, individual characteristics were not strong determinants of infection intensity within a village. Differences in human infection intensity between these villages are strongly associated with crop type, with low-intensity villages principally growing rice, in contrast to villages devoting more land to vegetables and tobacco. Cercarial risk in village irrigation systems is associated with snail density and human infection intensity through the use of manure-based fertilizer. Some of the agricultural and environmental factors associated with infection risk can be quantified using remote sensing technology.

Subject(s)

Endemic Diseases , Schistosoma japonicum/isolation & purification , Schistosomiasis japonica/transmission , Snails/parasitology , Water/parasitology , Adolescent , Adult , Age Factors , Agriculture , Animals , Child , Child, Preschool , China/epidemiology , Cross-Sectional Studies , Environment , Feces/parasitology , Female , Humans , Male , Mice , Middle Aged , Parasite Egg Count , Prevalence , Schistosomiasis japonica/epidemiology , Schistosomiasis japonica/parasitology , Sex Factors
