Results 1 - 9 of 9
1.
Stud Health Technol Inform ; 316: 1800-1804, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176840

ABSTRACT

Missing values (NA) often occur in cancer research, which may be due to reasons such as data protection, data loss, or missing follow-up data. Such incomplete patient information can have an impact on prediction models and other data analyses. Imputation methods are a tool for dealing with NA. Cancer data is often presented in an ordered categorical form, such as tumour grading and staging, which requires special methods. This work compares mode imputation, k nearest neighbour (knn) imputation, and, in the context of Multiple Imputation by Chained Equations (MICE), a logistic regression model with proportional odds (mice_polr) and random forest (mice_rf) on a real-world prostate cancer dataset provided by the Cancer Registry of Rhineland-Palatinate in Germany. Our dataset contains relevant information for the risk classification of patients and the time between date of diagnosis and date of death. For the imputation comparison, we use Rubin's (1974) Missing Completely At Random (MCAR) mechanism to remove 10%, 20%, 30%, and 50% of the observations. The results are evaluated and ranked based on the accuracy per patient. Of the four methods, mice_rf performs significantly best for each percentage of NA, followed by knn, while mice_polr performs significantly worst. Furthermore, our findings indicate that the accuracy of imputation methods increases with a lower number of categories, a relatively even proportion of patients in the categories, or a majority of patients in a particular category.
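The MCAR removal and a baseline imputation step described in this abstract can be sketched as follows. The data, the 20% removal rate, and the mode-imputation baseline are illustrative assumptions for a single ordinal column, not the registry dataset or the study's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ordinal column (e.g. tumour grading 1-4), stand-in data only.
grades = rng.integers(1, 5, size=1000)

# MCAR: every observation has the same probability of being removed.
mask = rng.random(grades.size) < 0.20  # remove ~20% of observations
observed = grades.astype(float)
observed[mask] = np.nan

# Mode imputation: replace every NA with the most frequent observed category.
vals, counts = np.unique(observed[~np.isnan(observed)], return_counts=True)
mode = vals[np.argmax(counts)]
imputed = np.where(np.isnan(observed), mode, observed)

# Evaluate accuracy on the artificially removed cells only, since the
# true values are known there.
accuracy = np.mean(imputed[mask] == grades[mask])
print(round(float(accuracy), 3))
```

The same masking-and-scoring loop would be repeated for knn and the MICE variants to rank the methods, as the abstract describes.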


Subject(s)
Prostatic Neoplasms , Male , Humans , Germany , Registries , Data Interpretation, Statistical
2.
Stud Health Technol Inform ; 316: 690-694, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176889

ABSTRACT

BACKGROUND: Urothelial Bladder Cancer (UBC) is a common cancer with a high risk of recurrence, which is influenced by the TNM classification, grading, age, and other factors. Recent studies demonstrate reliable and accurate recurrence prediction using Machine Learning (ML) algorithms that even outperform traditional approaches. However, most ML algorithms cannot process categorical input features, which must first be encoded into numerical values. Choosing the appropriate encoding strategy has a significant impact on the prediction quality. OBJECTIVE: We investigate the impact of encoding strategies for ordinal features on the prediction quality of ML algorithms. METHOD: We compare three different encoding strategies, namely one-hot, ordinal, and entity embedding, in predicting the 2-year recurrence in UBC patients using an artificial neural network. We use ordered categorical and numerical data of UBC patients provided by the Cancer Registry Rhineland-Palatinate. RESULTS: We show superior prediction quality using entity embedding encoding with 84.6% precision, an overall accuracy of 73.8%, and 68.9% AUC on testing data over 100 epochs after 30 runs compared to one-hot and ordinal encoding. CONCLUSION: We confirm the superiority of entity embedding encoding as it could provide a more detailed and accurate representation of ordinal features in numerical scales. This can lead to enhanced generalizability, resulting in significantly improved prediction quality.
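The three encoding strategies compared in this abstract can be illustrated on a single ordered feature. The T-stage categories and the random embedding table below are assumptions for the sketch; in the study the embedding vectors are learned jointly with the network:

```python
import numpy as np

# Hypothetical ordered categories (e.g. T stage); stand-in sample only.
stages = ["T1", "T2", "T3", "T4"]
sample = ["T2", "T4", "T1"]

# Ordinal encoding: one integer per category, preserving the order.
order = {s: i for i, s in enumerate(stages)}
ordinal = np.array([order[s] for s in sample])

# One-hot encoding: one binary column per category; the order is lost.
one_hot = np.eye(len(stages))[ordinal]

# Entity embedding: each category maps to a dense vector. The table here
# is random; a trained network would learn it from the data.
emb_dim = 2
embedding_table = np.random.default_rng(0).normal(size=(len(stages), emb_dim))
embedded = embedding_table[ordinal]

print(ordinal.tolist())   # [1, 3, 0]
print(one_hot.shape)      # (3, 4)
print(embedded.shape)     # (3, 2)
```

The embedding dimension is a tunable choice; one-hot width grows with the number of categories, while the embedding width does not.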


Subject(s)
Machine Learning , Neoplasm Recurrence, Local , Urinary Bladder Neoplasms , Humans , Neural Networks, Computer , Algorithms
3.
Int J Med Inform ; 186: 105414, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38531255

ABSTRACT

BACKGROUND: Urothelial bladder cancer (UBC) is characterized by a high recurrence rate, which is predicted by scoring systems. However, recent studies show the superiority of Machine Learning (ML) models. Nevertheless, these ML approaches are rarely used in medical practice because most of them are black-box models that cannot adequately explain how a prediction is made. OBJECTIVE: We investigate the global feature importance of different ML models. By providing information on the most relevant features, we can facilitate the use of ML in everyday medical practice. DESIGN, SETTING, AND PARTICIPANTS: The data is provided by the cancer registry Rhineland-Palatinate gGmbH, Germany. It consists of numerical and categorical features of 1,944 patients with UBC. We retrospectively predict 2-year recurrence through ML models using Support Vector Machine, Gradient Boosting, and Artificial Neural Network. We then determine the global feature importance using performance-based Permutation Feature Importance (PFI) and variance-based Feature Importance Ranking Measure (FIRM). RESULTS: We show reliable recurrence prediction of UBC with 82.02% to 83.89% F1-Score, 83.95% to 84.49% Precision, and an overall performance of 69.20% to 70.82% AUC on testing data, depending on the model. Gradient Boosting performs best among all black-box models with an average F1-Score (83.89%), AUC (70.82%), and Precision (83.95%). Furthermore, we show consistency across PFI and FIRM by identifying the same features as relevant across the different models. These features are exclusively therapeutic measures and are consistent with findings from both medical research and clinical trials. CONCLUSIONS: We confirm the superiority of ML black-box models in predicting UBC recurrence compared to more traditional logistic regression.
In addition, we present an approach that increases the explanatory power of black-box models by identifying the underlying influence of input features, thus facilitating the use of ML in clinical practice and therefore providing improved recurrence prediction through the application of black-box models.
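The performance-based Permutation Feature Importance used in this study can be sketched as follows. The toy data and the trivial "model" are assumptions standing in for the UBC features and the trained classifiers; the shuffle-and-score loop is the general PFI idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only the first of three features carries signal.
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

def predict(X):
    # Pretend fitted model: thresholds the first feature only.
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

baseline = accuracy(y, predict(X))

# PFI: shuffle one feature at a time and record the drop in performance
# relative to the unshuffled baseline.
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importance.append(baseline - accuracy(y, predict(Xp)))

print([round(v, 3) for v in importance])
```

Only the informative feature shows a large performance drop; the uninformative ones score near zero, which is how PFI ranks global feature relevance.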


Subject(s)
Biomedical Research , Carcinoma, Transitional Cell , Urinary Bladder Neoplasms , Humans , Urinary Bladder Neoplasms/diagnosis , Urinary Bladder Neoplasms/epidemiology , Urinary Bladder , Retrospective Studies
4.
Int J Med Inform ; 185: 105387, 2024 May.
Article in English | MEDLINE | ID: mdl-38428200

ABSTRACT

BACKGROUND: Cancer registries link a large number of electronic health records reported by medical institutions to already registered records of the matching individual and tumor. Records are automatically linked using deterministic and probabilistic approaches; machine learning is rarely used. Records that cannot be matched automatically with sufficient accuracy are typically processed manually. For application, it is important to know how well record linkage approaches match real-world records and how much manual effort is required to achieve the desired linkage quality. We study the task of linking reported records to the matching registered tumor in cancer registries. METHODS: We compare the tradeoff between linkage quality and manual effort of five machine learning methods (logistic regression, random forest, gradient boosting, neural network, and a stacked method) to a deterministic baseline. The record linkage methods are compared in a two-class setting (no-match/match) and a three-class setting (no-match/undecided/match). A cancer registry collected and linked the dataset consisting of categorical variables matching 145,755 reported records with 33,289 registered tumors. RESULTS: In the two-class setting, the gradient boosting, neural network, and stacked models have higher accuracy and F1 score (accuracy: 0.968-0.978, F1 score: 0.983-0.988) than the deterministic baseline (accuracy: 0.964, F1 score: 0.980) when the same records are manually processed (0.89% of all records). In the three-class setting, these three machine learning methods can automatically process all reported records and still have higher accuracy and F1 score than the deterministic baseline. The linkage quality of the machine learning methods studied, except for the neural network, increases as the number of manually processed records increases.
CONCLUSION: Machine learning methods can significantly improve linkage quality and reduce the manual effort required by medical coders to match tumor records in cancer registries compared to a deterministic baseline. Our results help cancer registries estimate how linkage quality increases as more records are manually processed.
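The three-class setting described here amounts to thresholding a classifier's match probability, with the middle band routed to manual review. The threshold values below are illustrative assumptions, not the ones used in the study:

```python
# Sketch of the three-class decision (no-match / undecided / match) on top
# of a probabilistic record-linkage classifier. Widening the undecided band
# sends more records to medical coders but raises automatic linkage quality.
def classify(p_match, lower=0.2, upper=0.8):
    if p_match >= upper:
        return "match"
    if p_match <= lower:
        return "no-match"
    return "undecided"  # routed to manual review

print([classify(p) for p in (0.05, 0.5, 0.95)])
```

Sweeping `lower` and `upper` traces exactly the tradeoff curve between linkage quality and manual effort that the study reports.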


Subject(s)
Electronic Health Records , Neoplasms , Humans , Medical Record Linkage/methods , Neoplasms/epidemiology , Registries , Databases, Factual
5.
Evol Comput ; : 1-32, 2024 Jan 26.
Article in English | MEDLINE | ID: mdl-38271633

ABSTRACT

Genetic Programming (GP) often uses large training sets and requires all individuals to be evaluated on all training cases during selection. Random down-sampled lexicase selection evaluates individuals on only a random subset of the training cases, allowing more individuals to be explored with the same number of program executions. However, sampling randomly can exclude important cases from the down-sample for a number of generations, while cases that measure the same behavior (synonymous cases) may be overused. In this work, we introduce Informed Down-Sampled Lexicase Selection. This method leverages population statistics to build down-samples that contain more distinct and therefore informative training cases. Through an empirical investigation across two different GP systems (PushGP and Grammar-Guided GP), we find that informed down-sampling significantly outperforms random down-sampling on a set of contemporary program synthesis benchmark problems. Through an analysis of the created down-samples, we find that important training cases are included in the down-sample consistently across independent evolutionary runs and systems. We hypothesize that this improvement can be attributed to the ability of Informed Down-Sampled Lexicase Selection to maintain more specialist individuals over the course of evolution, while still benefiting from reduced per-evaluation costs.
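Plain lexicase selection, which both down-sampling variants build on, can be sketched as follows. The error-matrix layout and the toy programs are assumptions for the sketch; down-sampling simply restricts the case list to a subset before this loop runs:

```python
import random

def lexicase_select(population, errors, rng=random.Random(0)):
    """Select one individual by lexicase selection.

    errors[i][c] is the error of individual i on training case c.
    """
    candidates = list(range(len(population)))
    cases = list(range(len(errors[0])))
    rng.shuffle(cases)  # random case order for each selection event
    for c in cases:
        best = min(errors[i][c] for i in candidates)
        # Keep only candidates that are elite on this case.
        candidates = [i for i in candidates if errors[i][c] == best]
        if len(candidates) == 1:
            break
    return population[rng.choice(candidates)]

pop = ["prog_a", "prog_b", "prog_c"]
errs = [[0, 5], [1, 0], [2, 2]]  # prog_a wins case 0, prog_b wins case 1
print(lexicase_select(pop, errs))
```

Because the case order is shuffled per selection event, specialists that excel on a few cases (here `prog_a` or `prog_b`) can win, which is the behavior the abstract's hypothesis about maintaining specialists refers to.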

6.
BMC Med Res Methodol ; 23(1): 125, 2023 05 24.
Article in English | MEDLINE | ID: mdl-37226114

ABSTRACT

BACKGROUND: Cancer registries collect patient-specific information about cancer diseases. The collected information is verified and made available to clinical researchers, physicians, and patients. When processing information, cancer registries verify that the patient-specific records they collect are plausible. This means that the collected information about a particular patient makes medical sense. METHODS: Unsupervised machine learning approaches can detect implausible electronic health records without human guidance. Therefore, this article investigates two unsupervised anomaly detection approaches, a pattern-based approach (FindFPOF) and a compression-based approach (autoencoder), to identify implausible electronic health records in cancer registries. Unlike most existing work that analyzes synthetic anomalies, we compare the performance of both approaches and a baseline (random selection of records) on a real-world dataset. The dataset contains 21,104 electronic health records of patients with breast, colorectal, and prostate tumors. Each record consists of 16 categorical variables describing the disease, the patient, and the diagnostic procedure. The samples identified by FindFPOF, the autoencoder, and a random selection (a total of 785 different records) are evaluated in a real-world scenario by medical domain experts. RESULTS: Both anomaly detection methods are good at detecting implausible electronic health records. First, domain experts identified [Formula: see text] of 300 randomly selected records as implausible. With FindFPOF and the autoencoder, [Formula: see text] of the proposed 300 records in each sample were implausible. This corresponds to a precision of [Formula: see text] for FindFPOF and the autoencoder. Second, for 300 randomly selected records that were labeled by domain experts, the sensitivity of the autoencoder was [Formula: see text] and the sensitivity of FindFPOF was [Formula: see text].
Both anomaly detection methods had a specificity of [Formula: see text]. Third, FindFPOF and the autoencoder suggested samples with a different distribution of values than the overall dataset. For example, both anomaly detection methods suggested a higher proportion of colorectal records, the tumor localization with the highest percentage of implausible records in a randomly selected sample. CONCLUSIONS: Unsupervised anomaly detection can significantly reduce the manual effort of domain experts to find implausible electronic health records in cancer registries. In our experiments, the manual effort was reduced by a factor of approximately 3.5 compared to evaluating a random sample.
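The compression-based approach described here scores each record by its reconstruction error and proposes the worst-reconstructed records for expert review. The sketch below uses a column-mean "reconstruction" as a stand-in for a trained autoencoder, and toy one-hot data standing in for the 16 categorical variables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hot-encoded records; stand-ins for the registry's variables.
X = rng.integers(0, 2, size=(200, 8)).astype(float)

# Stand-in "autoencoder": reconstruct every record as the column means.
# A trained autoencoder reconstructs typical records much better than
# atypical ones, which is what the anomaly score exploits.
reconstruction = np.tile(X.mean(axis=0), (X.shape[0], 1))

# Anomaly score: per-record reconstruction error (mean squared error).
scores = np.mean((X - reconstruction) ** 2, axis=1)

# Propose the k records with the highest error for domain-expert review.
k = 10
proposed = np.argsort(scores)[-k:]
print(len(proposed))
```

Reviewing only the `k` proposed records instead of a random sample is what produces the roughly 3.5-fold reduction in manual effort the abstract reports.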


Subject(s)
Colorectal Neoplasms , Physicians , Prostatic Neoplasms , Male , Humans , Electronic Health Records , Registries
7.
Evol Comput ; 30(1): 51-74, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-34428302

ABSTRACT

Linear Genetic Programming (LGP) represents programs as sequences of instructions and has a Directed Acyclic Graph (DAG) dataflow. The results of instructions are stored in registers that can be used as arguments by other instructions. Instructions that are disconnected from the main part of the program are called noneffective instructions, or structural introns. They also appear in other DAG-based GP approaches like Cartesian Genetic Programming (CGP). This article studies four hypotheses on the role of structural introns: noneffective instructions (1) serve as evolutionary memory, where evolved information is stored and later used in search, (2) preserve population diversity, (3) allow neutral search, where structural introns increase the number of neutral mutations and improve performance, and (4) serve as genetic material to enable program growth. We study different variants of LGP controlling the influence of introns for symbolic regression, classification, and digital circuits problems. We find that there is (1) evolved information in the noneffective instructions that can be reactivated and that (2) structural introns can promote programs with higher effective diversity. However, both effects have no influence on LGP search performance. On the other hand, allowing mutations to not only be applied to effective but also to noneffective instructions (3) increases the rate of neutral mutations and (4) contributes to program growth by making use of the genetic material available as structural introns. This comes along with a significant increase of LGP performance, which makes structural introns important for LGP.
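Separating effective instructions from structural introns, which the analyses in this abstract rely on, is a backward pass over the instruction list. The tuple representation and the output register `r0` below are assumptions for the sketch:

```python
# Sketch of detecting structural introns in a linear GP program: walk the
# instruction list backwards and keep instructions whose target register
# is needed to compute the output register. Everything else is an intron.
def effective_instructions(program, output_regs=("r0",)):
    """program: list of (target, op, (src1, src2)) tuples."""
    needed = set(output_regs)
    effective = []
    for instr in reversed(program):
        target, _op, srcs = instr
        if target in needed:
            effective.append(instr)
            needed.discard(target)   # this write satisfies the need
            needed.update(srcs)      # but its sources become needed
    return list(reversed(effective))

prog = [
    ("r1", "add", ("r2", "r3")),  # feeds r0 below -> effective
    ("r4", "mul", ("r1", "r1")),  # never reaches r0 -> structural intron
    ("r0", "sub", ("r1", "r2")),  # writes the output register
]
print(len(effective_instructions(prog)))  # 2
```

Applying mutation to the full list versus only the effective slice is precisely the experimental knob the article uses to test hypotheses (3) and (4).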


Subject(s)
Algorithms , Biological Evolution
8.
Evol Comput ; 11(4): 381-415, 2003.
Article in English | MEDLINE | ID: mdl-14629864

ABSTRACT

This paper discusses how the use of redundant representations influences the performance of genetic and evolutionary algorithms. Representations are redundant if the number of genotypes exceeds the number of phenotypes. A distinction is made between synonymously and non-synonymously redundant representations. Representations are synonymously redundant if the genotypes that represent the same phenotype are very similar to each other. Non-synonymously redundant representations do not allow genetic operators to work properly and result in a lower performance of evolutionary search. When using synonymously redundant representations, the performance of selectorecombinative genetic algorithms (GAs) depends on the modification of the initial supply. We have developed theoretical models for synonymously redundant representations showing that the necessary population size to solve a problem and the number of generations scale as O(2^kr/r), where kr is the order of redundancy and r is the number of genotypic building blocks (BBs) that represent the optimal phenotypic BB. As a result, uniformly redundant representations do not change the behavior of GAs. Only by increasing r, which means overrepresenting the optimal solution, does GA performance increase. Therefore, non-uniformly redundant representations can only be used advantageously if a priori information exists regarding the optimal solution. The validity of the proposed theoretical concepts is illustrated for the binary trivial voting mapping and the real-valued link-biased encoding. Our empirical investigations show that the developed population sizing and time to convergence models allow an accurate prediction of the empirical results.


Subject(s)
Algorithms , Biological Evolution , Computational Biology , Models, Genetic , Genotype , Phenotype , Population Density , Time Factors
9.
Evol Comput ; 10(1): 75-97, 2002.
Article in English | MEDLINE | ID: mdl-11911783

ABSTRACT

When using genetic and evolutionary algorithms for network design, choosing a good representation scheme for the construction of the genotype is important for algorithm performance. One of the most common representation schemes for networks is the characteristic vector representation. However, when encoding trees with characteristic vectors and applying crossover and mutation, invalid individuals occur that are either under- or over-specified. When constructing the offspring or repairing the invalid individuals that do not represent a tree, it is impossible to distinguish between the importance of the links that should be used. These problems can be overcome by transferring the concept of random keys from scheduling and ordering problems to the encoding of trees. This paper investigates the performance of a simple genetic algorithm (SGA) using network random keys (NetKeys) for the one-max tree and a real-world problem. The comparison between the network random keys and the characteristic vector encoding shows that despite the effects of stealth mutation, which favors the characteristic vector representation, selectorecombinative SGAs with NetKeys have some advantages for small and easy optimization problems. With more complex problems, SGAs with network random keys significantly outperform SGAs using characteristic vectors. This paper shows that random keys can be used for the encoding of trees, and that genetic algorithms using network random keys are able to solve complex tree problems much faster than when using the characteristic vector. Users should therefore be encouraged to use network random keys for the representation of trees.
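The key property of NetKeys is that any real-valued key vector decodes to a valid tree, so crossover and mutation can never produce under- or over-specified individuals. A decoder in that spirit can be sketched as follows; the greedy highest-key-first construction is an illustrative assumption, and details of the original encoding may differ:

```python
import random

def decode_netkeys(n_nodes, keys):
    """Decode keys {(u, v): float}, u < v, into a spanning tree.

    Edges are considered in order of decreasing key and added greedily
    whenever they do not close a cycle (union-find cycle check), so the
    relative key values express the importance of the links.
    """
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (u, v) in sorted(keys, key=keys.get, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # edge does not close a cycle
            parent[ru] = rv
            tree.append((u, v))
        if len(tree) == n_nodes - 1:
            break
    return tree

rng = random.Random(0)
n = 5
keys = {(u, v): rng.random() for u in range(n) for v in range(u + 1, n)}
tree = decode_netkeys(n, keys)
print(len(tree))  # a spanning tree on 5 nodes has 4 edges
```

Because every key vector yields exactly n-1 cycle-free edges, standard real-valued crossover and mutation on the keys always produce valid trees, which is the repair-free property the paper exploits.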


Subject(s)
Algorithms , Biological Evolution , Evolution, Molecular , Models, Genetic , Random Allocation , Research Design