Search | VHL Regional Portal

Finding reproducible cluster partitions for the k-means algorithm.

Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Chambers, Simon J.

BMC Bioinformatics ; 14 Suppl 1: S8, 2013.

Article in English | MEDLINE | ID: mdl-23369085

ABSTRACT

K-means clustering is widely used for exploratory data analysis. While its dependence on initialisation is well known, it is common practice to assume that the partition with the lowest sum-of-squares (SSQ) total, i.e. within-cluster variance, is both reproducible under repeated initialisations and also the closest that k-means can provide to true structure, when applied to synthetic data. We show that this is generally the case for small numbers of clusters, but for values of k that are still of theoretical and practical interest, similar values of SSQ can correspond to markedly different cluster partitions. This paper extends stability measures previously presented in the context of finding optimal values of cluster number, into a component of a 2-d map of the local minima found by the k-means algorithm, from which not only can values of k be identified for further analysis but, more importantly, it is made clear whether the best SSQ is a suitable solution or whether obtaining a consistently good partition requires further application of the stability index. The proposed method is illustrated by application to five synthetic datasets replicating a real world breast cancer dataset with varying data density, and a large bioinformatics dataset.

Subject(s)

Algorithms , Breast Neoplasms , Cardiotocography , Cluster Analysis , Computational Biology/methods , Female , Humans , Reproducibility of Results
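
The abstract above contrasts the SSQ of a k-means solution with the reproducibility of the partition it corresponds to. A minimal sketch of that comparison follows, assuming plain Lloyd-style k-means and using the Rand index as a stand-in agreement measure; the paper's own stability index is not given in the abstract, and the function names `kmeans` and `rand_index` and the toy data are illustrative only.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed, max_iter=100):
    """Lloyd-style k-means from one random initialisation.

    Returns (labels, ssq), where ssq is the within-cluster sum of squares.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # initial centres drawn from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable: a local minimum
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centre where it was
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    ssq = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, ssq


def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree (same/different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
runs = [kmeans(points, k=2, seed=s) for s in range(10)]
best_labels, best_ssq = min(runs, key=lambda r: r[1])

# Pair each run's SSQ with its agreement to the best-SSQ partition.
for labels, ssq in runs:
    print(round(ssq, 3), round(rand_index(labels, best_labels), 3))
```

Plotting each run's SSQ against its agreement with the other runs gives a rough two-dimensional picture of the local minima: runs with similar SSQ but low mutual agreement indicate that the lowest-SSQ partition alone may not be reproducible.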

A methodology to identify consensus classes from clustering algorithms applied to immunohistochemical data from breast cancer patients.

Soria, Daniele; Garibaldi, Jonathan M; Ambrogi, Federico; Green, Andrew R; Powe, Des; Rakha, Emad; Macmillan, R Douglas; Blamey, Roger W; Ball, Graham; Lisboa, Paulo J G; Etchells, Terence A; Boracchi, Patrizia; Biganzoli, Elia; Ellis, Ian O.

Comput Biol Med ; 40(3): 318-30, 2010 Mar.

Article in English | MEDLINE | ID: mdl-20106472

ABSTRACT

Single clustering methods have often been used to elucidate clusters in high dimensional medical data, even though reliance on a single algorithm is known to be problematic. In this paper, we present a methodology to determine a set of 'core classes' by using a range of techniques to reach consensus across several different clustering algorithms, and to ascertain the key characteristics of these classes. We apply the methodology to immunohistochemical data from breast cancer patients. In doing so, we identify six core classes, of which several may be novel sub-groups not previously emphasised in literature.

Subject(s)

Algorithms , Breast Neoplasms/metabolism , Cluster Analysis , Female , Humans , Immunohistochemistry

Partial logistic artificial neural network for competing risks regularized with automatic relevance determination.

Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Arsene, Corneliu T C; Aung, M S Hane; Eleuteri, Antonio; Taktak, Azzam F G; Ambrogi, Federico; Boracchi, Patrizia; Biganzoli, Elia.

IEEE Trans Neural Netw ; 20(9): 1403-16, 2009 Sep.

Article in English | MEDLINE | ID: mdl-19628458

ABSTRACT

Time-to-event analysis is important in a wide range of applications from clinical prognosis to risk modeling for credit scoring and insurance. In risk modeling, it is sometimes required to make a simultaneous assessment of the hazard arising from two or more mutually exclusive factors. This paper applies Bayesian regularization, with the standard approximation of the evidence, to an existing neural network model for competing risks (PLANNCR) in order to implement automatic relevance determination (PLANNCR-ARD). The theoretical framework for the model is described and its application is illustrated with reference to local and distal recurrence of breast cancer, using the data set of Veronesi (1995).

Subject(s)

Automation/methods , Logistic Models , Neural Networks, Computer , Risk , Adolescent , Adult , Aged , Algorithms , Bayes Theorem , Breast Neoplasms/diagnosis , Computer Simulation , Databases, Factual , Female , Follow-Up Studies , Humans , Middle Aged , Neoplasm Recurrence, Local/diagnosis , Nonlinear Dynamics , Probability , Proportional Hazards Models , Survival Analysis , Time Factors , Young Adult

How to find simple and accurate rules for viral protease cleavage specificities.

Rögnvaldsson, Thorsteinn; Etchells, Terence A; You, Liwen; Garwicz, Daniel; Jarman, Ian; Lisboa, Paulo J G.

BMC Bioinformatics ; 10: 149, 2009 May 16.

Article in English | MEDLINE | ID: mdl-19445713

ABSTRACT

BACKGROUND: Proteases of human pathogens are becoming increasingly important drug targets, hence it is necessary to understand their substrate specificity and to interpret this knowledge in practically useful ways. New methods are being developed that produce large amounts of cleavage information for individual proteases and some have been applied to extract cleavage rules from data. However, the hitherto proposed methods for extracting rules have been neither easy to understand nor very accurate. To be practically useful, cleavage rules should be accurate, compact, and expressed in an easily understandable way. RESULTS: A new method is presented for producing cleavage rules for viral proteases with seemingly complex cleavage profiles. The method is based on orthogonal search-based rule extraction (OSRE) combined with spectral clustering. It is demonstrated on substrate data sets for human immunodeficiency virus type 1 (HIV-1) protease and hepatitis C (HCV) NS3/4A protease, showing excellent prediction performance for both HIV-1 cleavage and HCV NS3/4A cleavage, agreeing with observed HCV genotype differences. New cleavage rules (consensus sequences) are suggested for HIV-1 and HCV NS3/4A cleavages. The practical usability of the method is also demonstrated by using it to predict the location of an internal cleavage site in the HCV NS3 protease and to correct the location of a previously reported internal cleavage site in the HCV NS3 protease. The method is fast to converge and yields accurate rules, on par with previous results for HIV-1 protease and better than previous state-of-the-art for HCV NS3/4A protease. Moreover, the rules are fewer and simpler than previously obtained with rule extraction methods. CONCLUSION: A rule extraction methodology by searching for multivariate low-order predicates yields results that significantly outperform existing rule bases on out-of-sample data, and are also more transparent to expert users. The approach yields rules that are easy to use and useful for interpreting experimental data.

Subject(s)

Data Interpretation, Statistical , Peptide Hydrolases/chemistry , Peptide Hydrolases/metabolism , Protease Inhibitors/chemistry , Proteomics/methods , Amino Acid Sequence , Catalytic Domain , Cluster Analysis , Computer Simulation , Databases, Protein , HIV Protease/chemistry , HIV Protease/genetics , HIV Protease/metabolism , Humans , Peptide Hydrolases/genetics , ROC Curve , Reproducibility of Results , Serine Endopeptidases/chemistry , Serine Endopeptidases/genetics , Serine Endopeptidases/metabolism , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/genetics , Viral Nonstructural Proteins/metabolism , Viral Proteins/chemistry , Viral Proteins/genetics , Viral Proteins/metabolism

An integrated framework for risk profiling of breast cancer patients following surgery.

Jarman, Ian H; Etchells, Terence A; Martín, Jose D; Lisboa, Paulo J G.

Artif Intell Med ; 42(3): 165-88, 2008 Mar.

Article in English | MEDLINE | ID: mdl-18242967

ABSTRACT

OBJECTIVE: An integrated decision support framework is proposed for clinical oncologists making prognostic assessments of patients with operable breast cancer. The framework may be delivered over a web interface. It comprises a triangulation of prognostic modelling, visualisation of historical patient data and an explanatory facility to interpret risk group assignments using empirically derived Boolean rules expressed directly in clinical terms. METHODS AND MATERIALS: The prognostic inferences in the interface are validated in a multicentre longitudinal cohort study by modelling retrospective data from 917 patients recruited at Christie Hospital, Wilmslow between 1983 and 1989 and predicting for 931 patients recruited in the same centre during 1990-1993. There were also 291 patients recruited between 1984 and 1998 at the Clatterbridge Centre for Oncology and the Linda McCartney Centre, Liverpool, UK. RESULTS AND CONCLUSIONS: There are three novel contributions relating this paper to breast cancer cases. First, the widely used Nottingham prognostic index (NPI) is enhanced with additional clinical features from which prognostic assessments can be made more specific for patients in need of adjuvant treatment. This is shown with a cross matching of the NPI and a new prognostic index which also provides a two-dimensional visualisation of the complete patient database by risk of negative outcome. Second, a principled rule-extraction method, orthogonal search rule extraction, generates readily interpretable explanations of risk group allocations derived from a partial logistic artificial neural network with automatic relevance determination (PLANN-ARD). Third, 95% confidence intervals for individual predictions of survival are obtained by Monte Carlo sampling from the PLANN-ARD model.

Subject(s)

Breast Neoplasms/surgery , Decision Support Systems, Clinical , Decision Support Techniques , Mastectomy , Patient Selection , Adult , Algorithms , Artificial Intelligence , Breast Neoplasms/mortality , Confidence Intervals , Female , Health Status Indicators , Humans , Internet , Middle Aged , Models, Biological , Monte Carlo Method , Neural Networks, Computer , Prognosis , Reproducibility of Results , Retrospective Studies , Risk Assessment , Treatment Outcome , User-Computer Interface

Time-to-event analysis with artificial neural networks: an integrated analytical and rule-based study for breast cancer.

Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Hane Aung, M S; Chabaud, Sylvie; Bachelot, Thomas; Perol, David; Gargi, Thérèse; Bourdès, Valérie; Bonnevay, Stéphane; Négrier, Sylvie.

Neural Netw ; 21(2-3): 414-26, 2008.

Article in English | MEDLINE | ID: mdl-18304780

ABSTRACT

This paper presents an analysis of censored survival data for breast cancer specific mortality and disease-free survival. There are three stages to the process, namely time-to-event modelling, risk stratification by predicted outcome and model interpretation using rule extraction. Model selection was carried out using the benchmark linear model, Cox regression, but risk staging was derived with Cox regression and with Partial Logistic Regression Artificial Neural Networks regularised with Automatic Relevance Determination (PLANN-ARD). This analysis compares the two approaches, showing the benefit of using the neural network framework especially for patients at high risk. The neural network model also results in a smooth model of the hazard without the need for limiting assumptions of proportionality. The model predictions were verified using out-of-sample testing with the mortality model also compared with two other prognostic models called TNG and the NPI rule model. Further verification was carried out by comparing marginal estimates of the predicted and actual cumulative hazards. It was also observed that doctors seem to treat mortality and disease-free models as equivalent, so a further analysis was performed to observe if this was the case. The analysis was extended with automatic rule generation using Orthogonal Search Rule Extraction (OSRE). This methodology translates analytical risk scores into the language of the clinical domain, enabling direct validation of the operation of the Cox or neural network model. This paper extends the existing OSRE methodology to data sets that include continuous-valued variables.

Subject(s)

Breast Neoplasms/mortality , Breast Neoplasms/therapy , Neural Networks, Computer , Numerical Analysis, Computer-Assisted , Pattern Recognition, Automated/methods , Cohort Studies , Disease-Free Survival , Humans , Logistic Models , Models, Biological , Neoplasm Staging , Predictive Value of Tests , Proportional Hazards Models , Reproducibility of Results , Risk Assessment , Time Factors

Development of a rule based prognostic tool for HER 2 positive breast cancer patients.

Lisboa, Paulo J G; Etchells, Terence A; Jarman, Ian H; Aung, M S Hane; Chabaud, Sylvie; Bachelot, Thomas; Perol, David; Gargi, Thérèse; Bourdès, Valérie; Bonnevay, Stéphane; Négrier, Sylvie.

Annu Int Conf IEEE Eng Med Biol Soc ; 2007: 5416-9, 2007.

Article in English | MEDLINE | ID: mdl-18003233

ABSTRACT

A three-stage development process for the production of a hierarchical rule-based prognosis tool is described. The application for this tool is specific to breast cancer patients that have a positive expression of the HER 2 gene. The first stage is the development of a Bayesian classification neural network to classify for cancer specific mortality. Secondly, low-order Boolean rules are extracted from this model using an Orthogonal Search based Rule Extraction (OSRE) algorithm. Further to these rules, additional information is gathered from the Kaplan-Meier survival estimates of the population, stratified by the categorizations of the input variables. Finally, expert knowledge is used to further simplify the rules and to rank them hierarchically in the form of a decision tree. The resulting decision tree groups all observations into specific categories by clinical profile and by event rate. The practical clinical value of this decision support tool will in future be tested by external validation with additional data from other clinical centres.

Subject(s)

Algorithms , Breast Neoplasms/metabolism , Breast Neoplasms/mortality , Proportional Hazards Models , Receptor, ErbB-2/metabolism , Risk Assessment/methods , Survival Analysis , Female , France/epidemiology , Humans , Incidence , Logistic Models , Prognosis , Reproducibility of Results , Risk Factors , Sensitivity and Specificity , Software , Survival Rate
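
The abstract above refers to Kaplan-Meier survival estimates stratified by the categories of an input variable. As a rough illustration only, the standard product-limit estimator can be sketched as below; the function names `kaplan_meier` and `stratified_km` and the example records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times  : follow-up time for each subject
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival) pairs, one per distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # events and censored subjects leave the risk set
    return curve


def stratified_km(records):
    """One Kaplan-Meier curve per category.

    records : iterable of (category, time, event) tuples
    """
    groups = defaultdict(lambda: ([], []))
    for category, time, event in records:
        groups[category][0].append(time)
        groups[category][1].append(event)
    return {cat: kaplan_meier(t, e) for cat, (t, e) in groups.items()}


# Hypothetical records: (category of an input variable, months, event flag).
records = [
    ("low", 12, 0), ("low", 30, 1), ("low", 45, 0),
    ("high", 5, 1), ("high", 9, 1), ("high", 20, 0),
]
for category, curve in stratified_km(records).items():
    print(category, curve)
```

Comparing the curves across categories gives the kind of event-rate information that, according to the abstract, supplements the extracted rules.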

Orthogonal search-based rule extraction (OSRE) for trained neural networks: a practical and efficient approach.

Etchells, Terence A; Lisboa, Paulo J G.

IEEE Trans Neural Netw ; 17(2): 374-84, 2006 Mar.

Article in English | MEDLINE | ID: mdl-16566465

ABSTRACT

There is much interest in rule extraction from neural networks and a plethora of different methods have been proposed for this purpose. We discuss the merits of pedagogical and decompositional approaches to rule extraction from trained neural networks, and show that some currently used methods for binary data comply with a theoretical formalism for extraction of Boolean rules from continuously valued logic. This formalism is extended into a generic methodology for rule extraction from smooth decision surfaces fitted to discrete or quantized continuous variables independently of the analytical structure of the underlying model, and in a manner that is efficient even for high input dimensions. This methodology is then tested with Monks' data, for which exact rules are obtained, and with Wisconsin's breast cancer data, where a small number of high-order rules are identified whose discriminatory performance can be directly visualized.

Subject(s)

Algorithms , Decision Support Techniques , Models, Theoretical , Neural Networks, Computer , Numerical Analysis, Computer-Assisted , Pattern Recognition, Automated/methods , Artificial Intelligence , Computer Simulation

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL