Results 1 - 7 of 7
1.
Genomics & Informatics ; : e2-2020.
Article in English | WPRIM | ID: wpr-898402

ABSTRACT

In this paper, we propose a new approach to detecting outliers in a set of segmented genomes of the flu virus, a data set with a heterogeneous set of sequences. The approach has three computational phases: feature extraction, which maps each genome into feature space; an alignment-free distance measure, which gives the distance between any two segmented genomes; and a mapping into distance space, where the resulting distance values are analyzed. The approach is implemented in supervised and unsupervised learning modes. The experiments show that it detects outliers among segmented flu virus genomes robustly.
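The three phases described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: each genome is assumed to arrive as a list of feature vectors (one per segment), the distance between two genomes is a stand-in (Euclidean distance between segment centroids) rather than the paper's alignment-free measure, and the median-plus-MAD threshold with the hypothetical parameter `k` is one simple unsupervised choice.

```python
import math
import statistics


def centroid(vectors):
    """Average the per-segment feature vectors of one genome."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def stand_in_distance(genome_a, genome_b):
    """Stand-in distance (an assumption): Euclidean distance between centroids."""
    return math.dist(centroid(genome_a), centroid(genome_b))


def distance_profile(genomes):
    """Map each genome into distance space: its mean distance to all others."""
    n = len(genomes)
    return [
        statistics.mean(
            stand_in_distance(genomes[i], genomes[j]) for j in range(n) if j != i
        )
        for i in range(n)
    ]


def flag_outliers(genomes, k=3.0):
    """Flag genomes whose score exceeds median + k * MAD (hypothetical rule)."""
    scores = distance_profile(genomes)
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    return [i for i, s in enumerate(scores) if s > med + k * mad]
```

Calling `flag_outliers` on a list of genomes returns the indices of the genomes that sit unusually far from the rest in distance space; a supervised variant would instead compare each genome against labeled reference groups.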

2.
Genomics & Informatics ; : e7-2020.
Article in English | WPRIM | ID: wpr-898397

ABSTRACT

In this paper, we present a few technical notes on the distance distribution paradigm for the Mosaab-metric, using 1-, 2-, and 3-gram feature extraction techniques to analyze composite data points in high-dimensional feature spaces. This technical analysis will help specialists in bioinformatics and biotechnology explore in depth the biodiversity of the influenza virus genome as a composite data point. Various technical examples are presented. In addition, the integrated statistical learning pipeline for processing segmented influenza virus genomes is illustrated as a sequential-parallel computational pipeline.
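As a rough illustration of n-gram feature extraction on a segmented genome, the sketch below turns each segment into a normalized n-gram frequency vector, so that the genome becomes a set of vectors (a composite data point). The function names, the `ACGT` alphabet, and the normalization are assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols are ignored


def ngram_features(sequence, n):
    """Return the normalized frequency of every n-gram over ALPHABET."""
    keys = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts[k] for k in keys)
    return [counts[k] / total if total else 0.0 for k in keys]


def composite_features(segments, n):
    """Map a segmented genome (list of sequences) to a list of feature vectors."""
    return [ngram_features(segment, n) for segment in segments]
```

With `n` set to 1, 2, or 3, each segment maps to a vector of length 4, 16, or 64 respectively, which gives a sense of how quickly the feature space grows with the n-gram size.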

3.
Genomics & Informatics ; : e2-2020.
Article in English | WPRIM | ID: wpr-890698

ABSTRACT

In this paper, we propose a new approach to detecting outliers in a set of segmented genomes of the flu virus, a data set with a heterogeneous set of sequences. The approach has three computational phases: feature extraction, which maps each genome into feature space; an alignment-free distance measure, which gives the distance between any two segmented genomes; and a mapping into distance space, where the resulting distance values are analyzed. The approach is implemented in supervised and unsupervised learning modes. The experiments show that it detects outliers among segmented flu virus genomes robustly.

4.
Genomics & Informatics ; : e7-2020.
Article in English | WPRIM | ID: wpr-890693

ABSTRACT

In this paper, we present a few technical notes on the distance distribution paradigm for the Mosaab-metric, using 1-, 2-, and 3-gram feature extraction techniques to analyze composite data points in high-dimensional feature spaces. This technical analysis will help specialists in bioinformatics and biotechnology explore in depth the biodiversity of the influenza virus genome as a composite data point. Various technical examples are presented. In addition, the integrated statistical learning pipeline for processing segmented influenza virus genomes is illustrated as a sequential-parallel computational pipeline.

5.
Genomics & Informatics ; : e4-2019.
Article in English | WPRIM | ID: wpr-763799

ABSTRACT

In this paper, we propose a window-based mechanism visualization approach as an alternative way to measure how serious the differences are among data insights extracted from a composite biodata point. The approach is based on two components: an undirected graph and the Mosaab-metric space. A significant application of this approach is visualizing the segmented genome of a virus. We use the Influenza and Ebola viruses as examples to demonstrate the robustness of the approach and to conduct comparisons. The approach can give researchers deep insight into the information structures extracted from a segmented genome as a composite biodata point and, consequently, help them capture segmented genetic variation and diversity (variants) in composite data points.

Subject(s)
Ebolavirus, Genetic Variation, Genome, Human Influenza
6.
Genomics & Informatics ; : 39-2019.
Article in English | WPRIM | ID: wpr-785802

ABSTRACT

Analyzing patterns in data points embedded in linear and non-linear feature spaces is considered one of the common research problems across research areas such as data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., a set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric D(ij)(γ₁) to any linear or non-linear feature space. We prove that D(ij)(γ₁) is a metric under any linear or non-linear feature transformation function. We show the sufficiency and efficiency of using the decision rule δ(Ξi) (i.e., the mean of D(ij)(γ₁)) in classifying heterogeneous sets of biosequences, compared with the decision rules min(Ξi) and median(Ξi). We analyze the impact of linear and non-linear transformation functions on classifying and clustering collections of heterogeneous sets of biosequences. We also show empirically how the length of a sequence in a simulated heterogeneous sequence set affects the classification and clustering results in linear and non-linear feature spaces. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and non-linear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, empirical conclusions and scientific evidence are drawn from the experiments to support the theory presented in this paper.
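The three decision rules compared in this abstract (mean, min, and median) can be illustrated with a small sketch. It assumes, as an interpretation not confirmed by the abstract, that Ξi is the collection of distances from a query composite data point to the members of class i and that the query is assigned to the class with the smallest summarized distance. The distances are taken as given; the eigenvalue-based metric itself is not implemented here, and the names `RULES` and `classify` are hypothetical.

```python
import statistics

# Candidate summaries of the distances from a query to one class.
RULES = {
    "mean": statistics.mean,
    "min": min,
    "median": statistics.median,
}


def classify(distances_by_class, rule="mean"):
    """Assign the query to the class whose summarized distance is smallest."""
    summarize = RULES[rule]
    return min(
        distances_by_class,
        key=lambda label: summarize(distances_by_class[label]),
    )
```

A single unusually close member can dominate the min rule, whereas the mean and median rules reflect the class as a whole, which is one way the rules can disagree on the same query.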


Subject(s)
Classification, Cluster Analysis, Data Mining, Dataset, Machine Learning, Multivariate Analysis
7.
Genomics & Informatics ; : e39-2019.
Article in English | WPRIM | ID: wpr-830122

ABSTRACT

Analyzing patterns in data points embedded in linear and non-linear feature spaces is considered one of the common research problems across research areas such as data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., a set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric D(ij)(γ₁) to any linear or non-linear feature space. We prove that D(ij)(γ₁) is a metric under any linear or non-linear feature transformation function. We show the sufficiency and efficiency of using the decision rule δ(Ξi) (i.e., the mean of D(ij)(γ₁)) in classifying heterogeneous sets of biosequences, compared with the decision rules min(Ξi) and median(Ξi). We analyze the impact of linear and non-linear transformation functions on classifying and clustering collections of heterogeneous sets of biosequences. We also show empirically how the length of a sequence in a simulated heterogeneous sequence set affects the classification and clustering results in linear and non-linear feature spaces. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and non-linear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, empirical conclusions and scientific evidence are drawn from the experiments to support the theory presented in this paper.
