Results 1 - 6 of 6
1.
BMC Med Inform Decis Mak ; 24(1): 49, 2024 Feb 14.
Article in English | MEDLINE | ID: mdl-38355504

ABSTRACT

BACKGROUND: Unsupervised clustering and outlier detection are important in medical research to understand the distributional composition of a collective of patients. A number of clustering methods exist, including methods for high-dimensional data after dimension reduction. Clustering and outlier detection may, however, become less robust or contradictory if multiple high-dimensional data sets per patient exist. Such a scenario arises when the focus is on 3-D data of multiple organs per patient, and a high-dimensional feature matrix per organ is extracted. METHODS: We use principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE) and multiple co-inertia analysis (MCIA) combined with bagplots to study the distribution of multi-organ 3-D data taken by computed tomography scans. After point-set registration of multiple organs from two public data sets, several hundred shape features are extracted per organ. While PCA and t-SNE can only be applied to each organ individually, MCIA can project the data of all organs into the same low-dimensional space. RESULTS: Here, MCIA is the only approach with which data of all organs can be projected into the same low-dimensional space. We studied how frequently (i.e., by how many organs) a patient was classified as belonging to the inner or outer 50% of the population, or as an outlier. Outliers could only be detected with MCIA and PCA. MCIA and t-SNE were more robust than PCA in judging the distributional location of a patient. CONCLUSIONS: MCIA is more appropriate and robust in judging the distributional location of a patient in the case of multiple high-dimensional data sets per patient. It is still advisable to apply PCA or t-SNE in parallel to MCIA to study the location of individual organs.


Subject(s)
Algorithms , Tomography, X-Ray Computed , Humans , Cluster Analysis , Principal Component Analysis
2.
Genet Sel Evol ; 55(1): 78, 2023 Nov 09.
Article in English | MEDLINE | ID: mdl-37946104

ABSTRACT

BACKGROUND: The ever-increasing availability of high-density genomic markers in the form of single nucleotide polymorphisms (SNPs) enables genomic prediction, i.e. the inference of phenotypes based solely on genomic data, in the field of animal and plant breeding, where it has become an important tool. However, given the limited number of individuals, the abundance of variables (SNPs) can reduce the accuracy of prediction models due to overfitting or irrelevant SNPs. Feature selection can help to reduce the number of irrelevant SNPs and increase the model performance. In this study, we investigated an incremental feature selection approach based on ranking the SNPs according to the results of a genome-wide association study that we combined with random forest as a prediction model, and we applied it to several animal and plant datasets. RESULTS: Applying our approach to different datasets yielded a wide range of outcomes, i.e. from a substantial increase in prediction accuracy in a few cases to minor improvements when only a fraction of the available SNPs were used. Compared with models using all available SNPs, our approach was able to achieve comparable performance with a considerably reduced number of SNPs in several cases. Our approach showcased state-of-the-art efficiency and performance while having a faster computation time. CONCLUSIONS: The results of our study suggest that our incremental feature selection approach has the potential to improve prediction accuracy substantially. However, this gain seems to depend on the genomic data used. Even for datasets where the number of markers is smaller than the number of individuals, feature selection may still increase the performance of the genomic prediction. Our approach is implemented in R and is available at https://github.com/FelixHeinrich/GP_with_IFS/ .


Subject(s)
Genome-Wide Association Study , Models, Genetic , Humans , Animals , Genome-Wide Association Study/methods , Genome , Genomics/methods , Phenotype
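
Illustrative sketch (not the authors' code): the incremental feature selection idea in the abstract above can be outlined in a few lines of Python. This is a minimal sketch under loud assumptions: the absolute Pearson correlation stands in for the GWAS-based SNP ranking, and a leave-one-out nearest-neighbour predictor stands in for the random forest. All function names are hypothetical; the authors' actual R implementation is in the repository linked in the abstract.

```python
from math import sqrt


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def rank_markers(X, y):
    """Rank marker indices by association with the phenotype (assumed
    stand-in for a GWAS ranking), strongest first."""
    n_markers = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: scores[j], reverse=True)


def loo_nn_score(X, y, markers):
    """Negative leave-one-out mean squared error of a 1-nearest-neighbour
    predictor restricted to the given markers (assumed stand-in for the
    random forest used in the study)."""
    errors = []
    for i, row in enumerate(X):
        others = [k for k in range(len(X)) if k != i]
        nearest = min(
            others,
            key=lambda k: sum((row[j] - X[k][j]) ** 2 for j in markers),
        )
        errors.append((y[nearest] - y[i]) ** 2)
    return -sum(errors) / len(errors)


def incremental_selection(X, y, step=1):
    """Add markers in ranked order, `step` at a time, and keep the prefix
    of the ranking with the best validation score."""
    ranking = rank_markers(X, y)
    best_k, best_score = 0, float("-inf")
    for k in range(step, len(ranking) + 1, step):
        score = loo_nn_score(X, y, ranking[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranking[:best_k], best_score
```

Calling incremental_selection(X, y) with X as a list of per-individual marker rows and y as the phenotype values returns the selected marker indices and their score. In a real analysis the ranking and the model evaluation would have to be nested inside the same cross-validation folds to avoid optimistic bias.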
3.
Front Immunol ; 14: 1134371, 2023.
Article in English | MEDLINE | ID: mdl-36926332

ABSTRACT

Introduction: Naturally attenuated Langat virus (LGTV) and highly pathogenic tick-borne encephalitis virus (TBEV) share antigenically similar viral proteins and are grouped together in the same flavivirus serocomplex. In the early 1970s, this encouraged the use of LGTV as a potential live attenuated vaccine against tick-borne encephalitis (TBE) until cases of encephalitis were reported among vaccinees. Previously, we have shown in a mouse model that immunity induced against LGTV protects mice against lethal TBEV challenge infection. However, the immune correlates of this protection have not been studied. Methods: We used the strategy of adoptive transfer of either serum or T cells from LGTV-infected mice into naïve recipient mice and challenged them with a lethal dose of TBEV. Results: We show that mouse infection with LGTV induced both cross-reactive antibodies and T cells against TBEV. To identify correlates of protection, we monitored disease progression in the recipient mice for 16 days post infection. Serum from LGTV-infected mice efficiently protected recipients from developing severe disease. On the other hand, adoptive transfer of T cells from LGTV-infected mice failed to provide protection. Histopathological investigation of infected brains suggested a possible role of microglia and T cells in inflammatory processes within the brain. Discussion: Our data provide key information regarding the immune correlates of protection induced by LGTV infection of mice, which may help to design better vaccines against TBEV.


Subject(s)
Encephalitis Viruses, Tick-Borne , Encephalitis, Tick-Borne , Flavivirus Infections , Mice , Animals , Antibodies , Brain , Vaccines, Attenuated
4.
Genes (Basel) ; 14(2)2023 02 01.
Article in English | MEDLINE | ID: mdl-36833313

ABSTRACT

Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. Hence, an accuracy that is either too weak or too optimistic is reported, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical usage. We estimate classifier performances in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample and evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Taking into account the fact that there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more diverse picture of a classifier's performance and prevents reporting models that later turn out not to be applicable for clinical diagnoses.


Subject(s)
Gene Expression Profiling , Transcriptome , Probability , Research Design
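
Illustrative sketch (not the authors' code): the bootstrap estimate of a per-sample outlier probability mentioned in the abstract above could look roughly as follows in Python. The abstract does not name the two detection methods used, so the detector here is an assumption: a robust z-score based on the median and the median absolute deviation (MAD), applied to one summary value per sample. The function names and the cutoff of 3.5 are hypothetical choices.

```python
import random
from statistics import median


def robust_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (median/MAD based) exceeds cutoff.
    This detector is an assumed placeholder, not one named in the abstract."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate resample: no usable scale estimate, flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]


def bootstrap_outlier_probability(values, n_boot=200, seed=0):
    """For each sample, the fraction of bootstrap resamples containing it
    in which the detector flagged it as an outlier."""
    rng = random.Random(seed)
    n = len(values)
    drawn = [0] * n
    flagged = [0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        flags = robust_outliers([values[i] for i in idx])
        # Copies of the same sample share one value and hence one flag.
        seen = dict(zip(idx, flags))
        for i, is_outlier in seen.items():
            drawn[i] += 1
            flagged[i] += is_outlier
    # A sample never drawn has no estimate; it is reported as 0.0 here.
    return [flagged[i] / drawn[i] if drawn[i] else 0.0 for i in range(n)]
```

Samples whose estimated probability exceeds a chosen threshold would then be removed before the classifier is re-evaluated, so that performance can be reported both with and without them.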
5.
Int J Mol Sci ; 23(5)2022 Feb 24.
Article in English | MEDLINE | ID: mdl-35269624

ABSTRACT

To better understand the molecular basis of respiratory diseases of viral origin, high-throughput gene-expression data are frequently taken by means of DNA microarray or RNA-seq technology. Such data can also be useful to classify infected individuals by molecular signatures in the form of machine-learning models with genes as predictor variables. Early diagnosis of patients by molecular signatures could also contribute to better treatments. An approach that has rarely been considered for machine-learning models in the context of transcriptomics is data augmentation. For other data types it has been shown that augmentation can improve classification accuracy and prevent overfitting. Here, we compare three strategies for data augmentation of DNA microarray and RNA-seq data from two selected studies on respiratory diseases of viral origin. The first study involves samples of patients with either viral or bacterial origin of the respiratory disease, the second study involves patients with either SARS-CoV-2 or another respiratory virus as disease origin. Specifically, we reanalyze these public datasets to study whether patient classification by transcriptomic signatures can be improved when adding artificial data for training of the machine-learning models. Our comparison reveals that augmentation of transcriptomic data can improve the classification accuracy and that fewer genes are necessary as explanatory variables in the final models. We also report genes from our signatures that overlap with signatures presented in the original publications of our example data. Due to strict selection criteria, the molecular role of these genes in the context of respiratory infectious diseases is underlined.


Subject(s)
COVID-19/genetics , Gene Expression Profiling/methods , Machine Learning , Neural Networks, Computer , RNA-Seq/methods , Transcriptome/genetics , Algorithms , COVID-19/classification , COVID-19/virology , Gene Ontology , Humans , Reproducibility of Results , SARS-CoV-2/physiology
6.
Genes (Basel) ; 12(11)2021 10 31.
Article in English | MEDLINE | ID: mdl-34828361

ABSTRACT

Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network to classify test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.


Subject(s)
Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Viruses/classification , Algorithms , Databases, Genetic , Genome, Viral , Machine Learning , Metagenomics , Neural Networks, Computer , Viruses/genetics
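
Illustrative sketch (not the authors' code): one common way to correct estimated class frequencies with conditional class probabilities, as described in the abstract above, is an expectation-maximisation style update. Whether the study used this scheme is an assumption. Here conditional[i][j] is taken to be the probability that a read of true class i is predicted as class j (estimated on labelled auxiliary reads), and observed[j] is the share of test reads predicted as class j. The function name and parameters are hypothetical.

```python
def correct_class_frequencies(observed, conditional, n_iter=500):
    """Estimate true class frequencies from predicted-label frequencies.

    observed:    predicted-label shares in the test set (sums to 1)
    conditional: rows are true classes, columns predicted classes;
                 each row sums to 1
    """
    n_true = len(conditional)
    n_pred = len(observed)
    est = [1.0 / n_true] * n_true  # start from a uniform prior
    for _ in range(n_iter):
        # Predicted-label shares implied by the current estimate.
        implied = [
            sum(est[i] * conditional[i][j] for i in range(n_true))
            for j in range(n_pred)
        ]
        # Multiplicative EM update; the estimate stays on the simplex.
        est = [
            est[i]
            * sum(
                observed[j] * conditional[i][j] / implied[j]
                for j in range(n_pred)
                if implied[j] > 0
            )
            for i in range(n_true)
        ]
    return est
```

For two viral orders with conditional = [[0.8, 0.2], [0.3, 0.7]] and observed shares [0.75, 0.25], the update converges to approximately [0.9, 0.1], the frequencies that reproduce the observed shares under those error rates.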