Results 1 - 14 of 14
1.
bioRxiv ; 2024 May 09.
Article in English | MEDLINE | ID: mdl-38766268

ABSTRACT

Recent advances in cytometry technology have enabled high-throughput data collection with multiple single-cell protein expression measurements. The significant biological and technical variance between samples in cytometry has long posed a formidable challenge during the gating process, especially for the initial gates, which deal with unpredictable events such as debris and technical artifacts. Even with the same experimental machine and protocol, the target population, as well as the cell population that needs to be excluded, may vary across different measurements. To address this challenge and mitigate the labor-intensive manual gating process, we propose a deep learning framework, UNITO, to rigorously identify the hierarchical cytometric subpopulations. The UNITO framework transforms a cell-level classification task into an image-based semantic segmentation problem. For reproducibility purposes, the framework was applied to three independent cohorts and successfully detected initial gates that were required to identify single cellular events as well as subsequent cell gates. We validated the UNITO framework by comparing its results with previous automated methods and the consensus of at least four experienced immunologists. UNITO outperformed existing automated methods and differed from the human consensus by no more than any individual human did. Most critically, the UNITO framework functions as a fully automated pipeline after training and does not require human hints or prior knowledge. Unlike existing multi-channel classification or clustering pipelines, UNITO can reproduce a contour similar to that of manual gating for each intermediate gate, which improves interpretability and allows post hoc visual inspection. Beyond acting as a pioneering framework that uses image segmentation for auto-gating, UNITO gives a fast and interpretable way to assign cell subtype membership, and its speed is not affected by the number of cells in each sample.
The pre-gating and gating inference takes approximately 2 minutes per sample using our pre-defined nine-gate system, and the framework can also adapt to any sequential prediction with different configurations.
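The step from cell-level classification to image segmentation can be illustrated with a minimal sketch. It is not the UNITO implementation: the function names, the square pixel grid, and the hand-made mask standing in for a trained segmentation network are all assumptions made for illustration.

```python
def cells_to_image(xs, ys, bins, lo, hi):
    """Rasterize two per-cell measurements into a bins x bins count image.

    Also returns, for every cell, the pixel it landed in, so that a
    pixel-level mask can later be mapped back to cell-level labels.
    """
    width = (hi - lo) / bins
    image = [[0] * bins for _ in range(bins)]
    pixels = []
    for x, y in zip(xs, ys):
        # Clamp so that out-of-range values and x == hi stay inside the grid.
        i = min(max(int((x - lo) / width), 0), bins - 1)
        j = min(max(int((y - lo) / width), 0), bins - 1)
        image[i][j] += 1
        pixels.append((i, j))
    return image, pixels


def mask_to_cell_labels(mask, pixels):
    """Give every cell the label of the pixel it was rasterized into."""
    return [bool(mask[i][j]) for i, j in pixels]


# Four hypothetical cells measured on two channels scaled to [0, 1].
xs = [0.1, 0.6, 0.2, 0.9]
ys = [0.1, 0.2, 0.7, 0.9]
image, pixels = cells_to_image(xs, ys, bins=4, lo=0.0, hi=1.0)

# Stand-in for the network output (assumption): a gate covering the
# low/low quadrant of the image.
mask = [[i < 2 and j < 2 for j in range(4)] for i in range(4)]
in_gate = mask_to_cell_labels(mask, pixels)
```

In this sketch the image has a fixed size regardless of how many cells are rasterized into it, which is consistent with the abstract's remark that inference speed does not depend on the cell count.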

2.
bioRxiv ; 2024 Feb 06.
Article in English | MEDLINE | ID: mdl-38370767

ABSTRACT

Single-cell technologies have emerged as a transformative technology enabling high-dimensional characterization of cell populations at an unprecedented scale. The data's innate complexity and voluminous nature pose significant computational and analytical challenges, especially in comparative studies delineating cellular architectures across various biological conditions (i.e., generation of sample-level distance matrices). Optimal Transport (OT) is a mathematical tool that captures the intrinsic structure of data geometrically and has been applied to many bioinformatics tasks. In this paper, we propose QOT (Quantized Optimal Transport), a new method that enables efficient computation of a sample-level distance matrix from large-scale single-cell omics data through a quantization step. We apply our algorithm to real-world single-cell genomics and pathomics datasets, aiming to extrapolate cell-level insights to inform sample-level categorizations. Our empirical study shows that QOT outperforms OT-based algorithms in terms of accuracy and robustness when obtaining a distance matrix at the sample level from high-throughput single-cell measures. Moreover, the sample-level distance matrix could be used in downstream analysis (e.g., to uncover the trajectory of disease progression), highlighting its usefulness in biomedical informatics and data science.
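The idea of quantizing each sample before computing optimal transport can be sketched in one dimension. This is a simplified illustration, not the QOT algorithm: the equal-width binning is a swapped-in quantizer, the data are one-dimensional so that the transport cost has a closed form, and all function names are hypothetical.

```python
from collections import Counter


def quantize(values, lo, hi, n_bins):
    """Summarize a sample as {bin centre: probability mass}.

    Assumes every value lies in [lo, hi]; equal-width bins are used here
    purely for illustration.
    """
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # v == hi goes to last bin
        counts[idx] += 1
    total = len(values)
    return {lo + (idx + 0.5) * width: c / total for idx, c in counts.items()}


def wasserstein_1d(p, q):
    """Exact 1-D optimal transport cost with ground cost |x - y|.

    In one dimension the cost equals the area between the two cumulative
    distribution functions.
    """
    support = sorted(set(p) | set(q))
    cost, cdf_gap = 0.0, 0.0
    for left, right in zip(support, support[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        cost += abs(cdf_gap) * (right - left)
    return cost


def sample_distance_matrix(samples, lo, hi, n_bins):
    """Pairwise transport cost between the quantized summaries of all samples."""
    summaries = [quantize(s, lo, hi, n_bins) for s in samples]
    n = len(summaries)
    return [[wasserstein_1d(summaries[i], summaries[j]) for j in range(n)]
            for i in range(n)]
```

Because transport is computed between a handful of weighted bin centres rather than between all cells, the cost of each pairwise comparison depends on the number of bins and not on the number of cells.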

3.
J Hum Hypertens ; 37(10): 898-906, 2023 10.
Article in English | MEDLINE | ID: mdl-36528682

ABSTRACT

The study characterises vascular phenotypes of hypertensive patients utilising machine learning approaches. Newly diagnosed and treatment-naïve primary hypertensive patients without co-morbidities (aged 18-55, n = 73), and matched normotensive controls (n = 79) were recruited (NCT04015635). Blood pressure (BP) and BP variability were determined using 24 h ambulatory monitoring. Vascular phenotyping included SphygmoCor® measurement of pulse wave velocity (PWV), pulse wave analysis-derived augmentation index (PWA-AIx), and central BP; EndoPAT™-2000® provided reactive hyperaemia index (LnRHI) and augmentation index adjusted to a heart rate of 75 bpm. Ultrasound was used to analyse flow-mediated dilatation and carotid intima-media thickness (CIMT). In addition to standard statistical methods to compare normotensive and hypertensive groups, machine learning techniques including biclustering explored hypertensive phenotypic subgroups. We report that arterial stiffness (PWV, PWA-AIx, EndoPAT-2000-derived AI@75) and central pressures were greater in incident hypertension than normotension. Endothelial function, percent nocturnal dip, and CIMT did not differ between groups. The vascular phenotype of white-coat hypertension imitated sustained hypertension with elevated arterial stiffness and central pressure, whereas masked hypertension demonstrated values similar to normotension. Machine learning revealed three distinct hypertension clusters, representing 'arterially stiffened', 'vaso-protected', and 'non-dipper' patients. Key clustering features were nocturnal and central BP, percent dipping, and arterial stiffness measures. We conclude that untreated patients with primary hypertension demonstrate early arterial stiffening rather than endothelial dysfunction or CIMT alterations. Phenotypic heterogeneity in nocturnal and central BP, percent dipping, and arterial stiffness observed early in the course of disease may have implications for risk stratification.


Subject(s)
Hypertension , Vascular Stiffness , Humans , Carotid Intima-Media Thickness , Pulse Wave Analysis , Blood Pressure Monitoring, Ambulatory , Hypertension/diagnosis , Blood Pressure/physiology , Phenotype
4.
Sci Adv ; 8(47): eabl4747, 2022 11 25.
Article in English | MEDLINE | ID: mdl-36417520

ABSTRACT

Understanding the strengths and weaknesses of machine learning (ML) algorithms is crucial to determine their scope of application. Here, we introduce the Diverse and Generative ML Benchmark (DIGEN), a collection of synthetic datasets for comprehensive, reproducible, and interpretable benchmarking of ML algorithms for classification of binary outcomes. The DIGEN resource consists of 40 mathematical functions that map continuous features to binary targets for creating synthetic datasets. These 40 functions were found using a heuristic algorithm designed to maximize the diversity of performance among multiple popular ML algorithms, thus providing a useful test suite for evaluating and comparing new methods. Access to the generative functions facilitates understanding of why a method performs poorly compared to other algorithms, thus providing ideas for improvement.
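A generative function of the kind described, mapping continuous features to a binary target, can be sketched as follows. The rule below is invented for illustration and is not one of the 40 DIGEN functions; the function names, the standard normal features, and the seed handling are likewise assumptions.

```python
import math
import random


def example_rule(x):
    """Hypothetical generative function (NOT taken from DIGEN)."""
    return 1 if math.sin(3.0 * x[0]) + x[1] * x[2] > 0.0 else 0


def make_dataset(rule, n_rows, n_features, seed):
    """Draw continuous features and label each row with the given rule."""
    rng = random.Random(seed)  # a fixed seed keeps the dataset reproducible
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_rows)]
    y = [rule(row) for row in X]
    return X, y


X, y = make_dataset(example_rule, n_rows=200, n_features=5, seed=0)
```

Because the rule is known, a poorly performing classifier can be inspected against the exact decision boundary, which is the kind of interpretability the abstract attributes to having access to the generative functions.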

5.
J Clin Med ; 11(7)2022 Apr 06.
Article in English | MEDLINE | ID: mdl-35407664

ABSTRACT

The COVID-19 pandemic has sparked a barrage of primary research and reviews. We investigated the publishing process and the waste of time and resources, and assessed the methodological quality of reviews on artificial intelligence techniques to diagnose COVID-19 in medical images. We searched nine databases from inception until 1 September 2020. Two independent reviewers performed all steps of identification, extraction, and methodological credibility assessment of records. Out of 725 records, 22 reviews analysing 165 primary studies met the inclusion criteria. This review covers 174,277 participants in total, including 19,170 diagnosed with COVID-19. The methodological credibility of all eligible studies was rated as critically low: 95% of papers had significant flaws in reporting quality. On average, 7.24 (range: 0-45) new papers were included in each subsequent review, and 14% of studies did not take any new paper into consideration. Almost three-quarters of the studies included less than 10% of available studies. More than half of the reviews did not comment on the previously published reviews at all. Much wasted time and resources could be avoided by referring to previous reviews and following methodological guidelines. Such information chaos is alarming. It is high time to draw conclusions from what we experienced and prepare for future pandemics.

6.
Adv Neural Inf Process Syst ; 2021(DB1): 1-16, 2021 Dec.
Article in English | MEDLINE | ID: mdl-38715933

ABSTRACT

Many promising approaches to symbolic regression have been presented in recent years, yet progress in the field continues to suffer from a lack of uniform, robust, and transparent benchmarking standards. We address this shortcoming by introducing an open-source, reproducible benchmarking platform for symbolic regression. We assess 14 symbolic regression methods and 7 machine learning methods on a set of 252 diverse regression problems. Our assessment includes both real-world datasets with no known model form as well as ground-truth benchmark problems. For the real-world datasets, we benchmark the ability of each method to learn models with low error and low complexity relative to state-of-the-art machine learning methods. For the synthetic problems, we assess each method's ability to find exact solutions in the presence of varying levels of noise. Under these controlled experiments, we conclude that the best performing methods for real-world regression combine genetic algorithms with parameter estimation and/or semantic search drivers. When tasked with recovering exact equations in the presence of noise, we find that several approaches perform similarly. We provide a detailed guide to reproducing this experiment and contributing new methods, and encourage other researchers to collaborate with us on a common and living symbolic regression benchmark.

7.
BioData Min ; 12: 14, 2019.
Article in English | MEDLINE | ID: mdl-31320928

ABSTRACT

BACKGROUND: The principal line of investigation in Genome Wide Association Studies (GWAS) is the identification of main effects, that is, individual Single Nucleotide Polymorphisms (SNPs) which are associated with the trait of interest, independent of other factors. A variety of methods have been proposed to this end, mostly statistical in nature and differing in assumptions and type of model employed. Moreover, for a given model, there may be multiple choices for the SNP genotype encoding. As an alternative to statistical methods, machine learning methods are often applicable. Typically, for a given GWAS, a single approach is selected and utilized to identify potential SNPs of interest. Even when multiple GWAS are combined through meta-analyses within a consortium, each GWAS is typically analyzed with a single approach and the resulting summary statistics are then utilized in meta-analyses. RESULTS: In this work we use as case studies a Type 2 Diabetes (T2D) and a breast cancer GWAS to explore a diversity of applicable approaches spanning different methods and encoding choices. We assess similarity of these approaches based on the derived ranked lists of SNPs and, for each GWAS, we identify a subset of representative approaches that we use as an ensemble to derive a union list of top SNPs. Among these are SNPs which are identified by multiple approaches as well as several SNPs identified by only one or a few of the less frequently used approaches. The latter include SNPs from established loci and SNPs which have other supporting lines of evidence in terms of their potential relevance to the traits. CONCLUSIONS: Not every main effect analysis method is suitable for every GWAS, but for each GWAS there are typically multiple applicable methods and encoding options.
We suggest a workflow for a single GWAS, extensible to multiple GWAS from consortia, where representative approaches are selected among a pool of suitable options, to yield a more comprehensive set of SNPs, potentially including SNPs that would typically be missed with the most popular analyses, but that could provide additional valuable insights for follow-up.
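The ensemble step, comparing approaches by their ranked SNP lists and taking the union of their top hits, can be sketched as below. The method names and SNP identifiers are placeholders, and the Jaccard overlap of top-k sets is an assumed similarity measure that the abstract does not specify.

```python
def top_k_union(ranked_lists, k):
    """Map each SNP in any method's top k to the methods that support it."""
    support = {}
    for method, snps in ranked_lists.items():
        for snp in snps[:k]:
            support.setdefault(snp, []).append(method)
    return support


def jaccard_top_k(a, b, k):
    """Overlap of two ranked lists, measured on their top-k sets."""
    sa, sb = set(a[:k]), set(b[:k])
    return len(sa & sb) / len(sa | sb)


# Placeholder rankings from three hypothetical analysis approaches.
ranked = {
    "method_a": ["snp1", "snp2", "snp3", "snp4"],
    "method_b": ["snp2", "snp1", "snp5", "snp3"],
    "method_c": ["snp6", "snp2", "snp7", "snp1"],
}
support = top_k_union(ranked, k=2)
only_one_method = sorted(s for s, m in support.items() if len(m) == 1)
```

Here `only_one_method` collects the SNPs that a single approach ranks highly, the kind of candidate the abstract notes would be missed when only the most popular analysis is run.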

8.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31251324

ABSTRACT

Biclustering is a technique of discovering local similarities within data. For many years the complexity of the methods and parallelization issues limited its application to big data problems. With the development of novel scalable methods, biclustering has finally started to close this gap. In this paper we discuss the caveats of biclustering and present its current challenges and guidelines for practitioners. We also try to explain why biclustering may soon become one of the standards for big data analytics.


Subject(s)
Big Data , Genomics/methods , Sequence Analysis, DNA/methods , Cluster Analysis , Data Mining/methods , Genome, Human , Genomics/standards , Humans , Sequence Alignment/methods , Sequence Alignment/standards , Sequence Analysis, DNA/standards , Software
9.
Bioinformatics ; 35(17): 3181-3183, 2019 09 01.
Article in English | MEDLINE | ID: mdl-30649199

ABSTRACT

MOTIVATION: In this paper, we present an open source package with the latest release of Evolutionary-based BIClustering (EBIC), a next-generation biclustering algorithm for mining genetic data. The major contribution of this paper is full support for multiple graphics processing units (GPUs), which makes it possible to run large genomic data mining analyses efficiently. Multiple enhancements to the first release of the algorithm include integration with R and Bioconductor, and an option to exclude missing values from the analysis. RESULTS: Evolutionary-based BIClustering was applied to datasets of different sizes, including a large DNA methylation dataset with 436 444 rows. For the largest dataset we observed over 6.6-fold speedup in computation time on a cluster of eight GPUs compared to running the method on a single GPU. This demonstrates the high scalability of the method. AVAILABILITY AND IMPLEMENTATION: The latest version of EBIC can be downloaded from http://github.com/EpistasisLab/ebic. Installation and usage instructions are also available online. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
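The reported speedup can be put in context with the standard definition of parallel efficiency, speedup divided by the number of processing units. The helper below is a generic illustration and is not part of the EBIC package.

```python
def parallel_efficiency(speedup, n_units):
    """Fraction of ideal linear scaling achieved: speedup / number of units."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return speedup / n_units


# Figures from the abstract: over 6.6-fold speedup on eight GPUs,
# so the efficiency is at least this value.
efficiency = parallel_efficiency(6.6, 8)
```

With these figures the efficiency is a little above four fifths of ideal linear scaling.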


Subject(s)
Data Analysis , Software , Algorithms , DNA Methylation , Genomics
10.
Bioinformatics ; 34(24): 4302-4304, 2018 12 15.
Article in English | MEDLINE | ID: mdl-29939213

ABSTRACT

Motivation: Biclustering is an unsupervised technique for simultaneously clustering the rows and columns of an input matrix. Although multiple biclustering algorithms have been proposed, UniBic remains one of the most accurate methods developed so far. Results: In this paper we introduce a Bioconductor package called runibic with a parallel implementation of UniBic. For convenience, the algorithm was reimplemented, parallelized and wrapped within an R package. The package includes: (i) a parallel version that is a couple of times faster than the original sequential algorithm, (ii) much more efficient memory management, (iii) modularity, which allows new methods to be built on top of the provided one and (iv) integration with modern Bioconductor packages such as SummarizedExperiment, ExpressionSet and biclust. Availability and implementation: The package is implemented in R and is available from Bioconductor (starting from version 3.6) at the following URL http://bioconductor.org/packages/runibic with installation instructions and tutorial. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Software , Cluster Analysis , Computational Biology , Gene Expression
11.
Bioinformatics ; 34(21): 3719-3726, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29790909

ABSTRACT

Motivation: Biclustering algorithms are commonly used for gene expression data analysis. However, accurate identification of meaningful structures is very challenging, and state-of-the-art methods are incapable of discovering different patterns of high biological relevance with high accuracy. Results: In this paper, a novel biclustering algorithm based on evolutionary computation, a sub-field of artificial intelligence, is introduced. The method, called EBIC, aims to detect order-preserving patterns in complex data. EBIC is capable of discovering multiple complex patterns with unprecedented accuracy in real gene expression datasets. It is also one of the very few biclustering methods designed for parallel environments with multiple graphics processing units. We demonstrate that EBIC greatly outperforms state-of-the-art biclustering methods, in terms of recovery and relevance, on both synthetic and genetic datasets. EBIC also yields results over 12 times faster than the most accurate reference algorithms. Availability and implementation: EBIC source code is available on GitHub at https://github.com/EpistasisLab/ebic. Supplementary information: Supplementary data are available at Bioinformatics online.
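What an order-preserving pattern means can be shown with a small check: a set of rows and columns forms such a pattern when every selected row ranks the selected columns in the same order. This sketch only tests a given candidate and does not search for biclusters, so it is not the EBIC algorithm; the function names are hypothetical and ties within a row are assumed not to occur.

```python
def column_order(row, cols):
    """Selected columns sorted by their value in this row (assumes no ties)."""
    return tuple(sorted(cols, key=lambda c: row[c]))


def is_order_preserving(matrix, rows, cols):
    """True if all selected rows induce the same ordering of the columns."""
    orders = {column_order(matrix[r], cols) for r in rows}
    return len(orders) == 1


# Toy expression matrix: rows 0 and 1 differ in scale but share one ordering.
matrix = [
    [1.0, 5.0, 3.0],
    [10.0, 50.0, 30.0],
    [2.0, 1.0, 0.0],
]
shared = is_order_preserving(matrix, rows=[0, 1], cols=[0, 1, 2])
broken = is_order_preserving(matrix, rows=[0, 1, 2], cols=[0, 1, 2])
```

The first two rows qualify even though their magnitudes differ tenfold, which is why order-preserving patterns can capture co-regulated genes that a distance-based clustering would separate.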


Subject(s)
Algorithms , Artificial Intelligence , Cluster Analysis , Gene Expression Profiling , Software
12.
Pac Symp Biocomput ; 23: 123-132, 2018.
Article in English | MEDLINE | ID: mdl-29218875

ABSTRACT

Electronic Health Records (EHRs) contain a wealth of patient data useful to biomedical researchers. At present, both the extraction of data and the methods for analysis are frequently designed to work with a single snapshot of a patient's record. Health care providers often perform and record actions in small batches over time. By extracting these care events, a sequence can be formed providing a trajectory for a patient's interactions with the health care system. These care events also offer a basic heuristic for the level of attention a patient receives from health care providers. We show that it is possible to learn meaningful embeddings from these care events using two deep learning techniques, unsupervised autoencoders and long short-term memory networks. We compare these methods to traditional machine learning methods, which require a point-in-time snapshot to be extracted from an EHR.
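Forming a sequence of care events from batched actions can be sketched by grouping timestamps that lie close together. The gap-based rule, the threshold, and the function name are assumptions made for illustration; the abstract does not state how the authors delimit a care event.

```python
def group_care_events(timestamps, max_gap):
    """Group action timestamps into care events.

    A new event starts whenever the gap to the previous action exceeds
    max_gap (same time unit as the timestamps).
    """
    events = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] <= max_gap:
            events[-1].append(t)
        else:
            events.append([t])
    return events


# Hypothetical action times in minutes since admission.
actions = [0, 5, 8, 100, 103, 500]
events = group_care_events(actions, max_gap=10)
n_events = len(events)  # crude proxy for how often the patient was attended to
```

The resulting list of events is the kind of ordered sequence that a sequence model could consume, in contrast to a single point-in-time snapshot.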


Subject(s)
Critical Care/statistics & numerical data , Machine Learning/statistics & numerical data , Computational Biology/methods , Databases, Factual/statistics & numerical data , Electronic Health Records/statistics & numerical data , Female , Humans , Male , Supervised Machine Learning/statistics & numerical data , Unsupervised Machine Learning/statistics & numerical data
13.
Pac Symp Biocomput ; 23: 460-471, 2018.
Article in English | MEDLINE | ID: mdl-29218905

ABSTRACT

With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide an exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated, particularly regarding the robustness of AutoML in identifying and adjusting for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-based Pipeline Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML, using default parameters, showed a potential lack of sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT; an association not identified in parallel clinical metabolic profiling endeavors. While warranting independent replication, our tandem rank-accuracy measure suggests homocysteine to be the metabolite feature with the largest effect, and corresponding priority for further translational clinical research.
Residual training and adjustment for a potential confounding effect by BMI only slightly modified the suggested association. Increased homocysteine is thought to be associated with vitamin B12 deficiency - evaluation for potential clinical relevance is suggested. While considerations for clinical metabolic profiling are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
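Adjusting a metabolite feature for a confounder such as BMI can be sketched by regressing the feature on the confounder and keeping the residuals. This single-covariate least-squares version is an assumption made for illustration; the abstract does not give the exact procedure, and the function name and numbers are hypothetical.

```python
def residualize(feature, confounder):
    """Remove the linear effect of one confounder from a feature.

    Fits feature = intercept + slope * confounder by ordinary least squares
    and returns the residuals, which are uncorrelated with the confounder.
    """
    n = len(feature)
    mean_x = sum(confounder) / n
    mean_y = sum(feature) / n
    sxx = sum((x - mean_x) ** 2 for x in confounder)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(confounder, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(confounder, feature)]


# Hypothetical BMI values and metabolite levels for five participants.
bmi = [21.0, 24.5, 27.0, 30.5, 33.0]
metabolite = [8.1, 9.0, 9.4, 10.9, 11.2]
adjusted = residualize(metabolite, bmi)
```

The adjusted values would then replace the raw feature as input to the learning pipeline, so that any remaining association cannot be explained by a linear BMI effect.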


Subject(s)
Homocysteine/blood , Hypoglycemic Agents/adverse effects , Metabolome , Metformin/adverse effects , Supervised Machine Learning/statistics & numerical data , Bias , Body Mass Index , Case-Control Studies , Computational Biology/methods , Diabetes Mellitus, Type 2/blood , Diabetes Mellitus, Type 2/drug therapy , Humans , Metabolomics/statistics & numerical data , Risk Factors , Translational Research, Biomedical
14.
BioData Min ; 10: 36, 2017.
Article in English | MEDLINE | ID: mdl-29238404

ABSTRACT

BACKGROUND: The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. RESULTS: The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. From this study, we find that existing benchmarks lack the diversity to properly benchmark machine learning algorithms, and there are several gaps in benchmarking problems that still need to be considered. CONCLUSIONS: This work represents another important step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.
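Comparing datasets by meta-features can be illustrated with a few simple descriptors. The particular meta-features chosen here are assumptions, since the abstract does not list the ones the authors compute, and the function name is hypothetical.

```python
from collections import Counter


def meta_features(X, y):
    """Compute a few simple descriptors of a classification dataset."""
    counts = Counter(y)
    return {
        "n_instances": len(X),
        "n_features": len(X[0]) if X else 0,
        "n_classes": len(counts),
        # Share of rows in the largest class; 1 / n_classes means balanced.
        "majority_fraction": max(counts.values()) / len(y) if y else 0.0,
    }


# Tiny hypothetical dataset: four rows, two features, two classes.
X = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0], [0.3, 1.0]]
y = [0, 0, 0, 1]
summary = meta_features(X, y)
```

Collecting such summaries for every dataset in a suite gives a table in which gaps, for example a shortage of highly imbalanced or very wide datasets, become visible.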
