1.
Sci Rep; 13(1): 17783, 2023 Oct 18.
Article in English | MEDLINE | ID: mdl-37853092

ABSTRACT

Many companies now prefer to store their data on multiple data centers with replication. Data that spans several data centers ensures the fastest possible response times for customers and employees who are geographically separated, and it also protects the information from loss if a single data center experiences a disaster. However, the amount of data is increasing at a rapid pace, which creates challenges in storage, analysis, and various processing tasks. In this paper, we propose and design a geographically distributed data management framework to manage the massive data stored and distributed among geo-distributed data centers. The goal of the proposed framework is to enable efficient use of the distributed data blocks for various data analysis tasks. The architecture of the proposed framework is composed of a grid of geo-distributed data centers connected to a data controller (DCtrl). The DCtrl is responsible for organizing and managing the block replicas across the geo-distributed data centers. We use the BDMS system as the storage system installed on the distributed data centers. BDMS stores a big data file as a set of random sample data blocks, each being a random sample of the whole data file. The DCtrl then distributes these data blocks across multiple data centers with replication. To analyze a big data file distributed under the proposed framework, any data center can randomly select a sample of the data blocks replicated from the other data centers. We use simulation results to demonstrate the performance of the proposed framework in big data analysis across geo-distributed data centers.
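The block-placement idea lends itself to a short sketch. The following is a minimal illustration, assuming a shuffle-and-split construction of the random sample blocks and a round-robin replica policy; the function names and the placement policy are illustrative assumptions, not the paper's actual implementation.

```python
import random

def make_random_sample_blocks(records, num_blocks):
    """Shuffle records and split them into num_blocks blocks,
    so each block approximates a random sample of the file."""
    shuffled = records[:]
    random.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]

def place_replicas(num_blocks, data_centers, replication=3):
    """Round-robin placement of each block's replicas on distinct
    geo-distributed data centers (a stand-in for the DCtrl's policy)."""
    placement = {}
    for b in range(num_blocks):
        start = b % len(data_centers)
        placement[b] = [data_centers[(start + k) % len(data_centers)]
                        for k in range(replication)]
    return placement

blocks = make_random_sample_blocks(list(range(1000)), num_blocks=10)
print(place_replicas(10, ["dc-eu", "dc-us", "dc-asia", "dc-af"]))
```

Under this scheme, any single data center can serve an approximate analysis of the whole file from whichever replicated blocks it holds, which is the property the framework exploits.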

2.
J Supercomput; 1-33, 2023 May 31.
Article in English | MEDLINE | ID: mdl-37359333

ABSTRACT

High-quality data are crucial for decision-making support and evidence-based healthcare, particularly when the relevant knowledge is lacking. For public health practitioners and researchers, the reporting of COVID-19 data needs to be accurate and easily available. Each nation has a system in place for reporting COVID-19 data, although the efficacy of these systems has not been thoroughly evaluated, and the current COVID-19 pandemic has exposed widespread flaws in data quality. We propose a data quality model (a canonical data model, four adequacy levels, and Benford's law) to assess the quality issues in the COVID-19 data reporting carried out by the World Health Organization (WHO) in the six Central African Economic and Monetary Community (CEMAC) region countries between March 6, 2020, and June 22, 2022, and we suggest potential solutions. The data quality adequacy levels can be interpreted as indicators of dependability and of the sufficiency of big dataset inspection. The model effectively identified the quality of the entry data for big dataset analytics. Its future development requires scholars and institutions from all sectors to deepen their understanding of its core concepts, improve its integration with other data processing technologies, and broaden the scope of its applications.
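The Benford's-law component of such a quality check is easy to illustrate. The sketch below compares the first-digit distribution of reported counts against Benford's expected frequencies using a chi-square statistic; the input series and any decision threshold are illustrative assumptions, not the paper's data.

```python
import math
from collections import Counter

# Benford's law: P(first digit = d) = log10(1 + 1/d) for d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(n):
    s = str(abs(int(n))).lstrip("0")
    return int(s[0]) if s else None

def benford_chi_square(values):
    """Chi-square distance between observed and expected first digits."""
    digits = [d for d in (first_digit(v) for v in values) if d is not None]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())

daily_cases = [120, 95, 1300, 210, 18, 740, 2600, 33, 410, 152]
print(benford_chi_square(daily_cases))  # compare against a chi2 critical value, 8 df
```

A large statistic flags a reporting series whose digit distribution departs from the Benford pattern, which is one of the signals the proposed model combines with its adequacy levels.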

3.
Knowl Inf Syst; 65(5): 2017-2042, 2023.
Article in English | MEDLINE | ID: mdl-36683607

ABSTRACT

An obvious defect of the extreme learning machine (ELM) is that its prediction performance is sensitive to the random initialization of input-layer weights and hidden-layer biases. To make ELM insensitive to random initialization, GPRELM adopts the simple and effective strategy of integrating Gaussian process regression into ELM. However, kernel-based GPRELM (kGPRELM) suffers from a serious overfitting problem. In this paper, we investigate the theoretical reasons for the overfitting of kGPRELM and propose a correlation-based GPRELM (cGPRELM), which uses a correlation coefficient to measure the similarity between two different hidden-layer output vectors. cGPRELM reduces the likelihood that the covariance matrix becomes an identity matrix when the number of hidden-layer nodes is increased, effectively controlling overfitting. Furthermore, cGPRELM works well for improper initialization intervals where ELM and kGPRELM fail to provide good predictions. Experimental results on real classification and regression data sets demonstrate the feasibility and superiority of cGPRELM: it not only achieves better generalization performance but also has lower computational complexity.
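The core idea, replacing a standard kernel with a correlation coefficient computed between hidden-layer output vectors, can be sketched as follows. This is a minimal illustration assuming a sigmoid ELM hidden layer and a Pearson correlation shifted to [0, 1]; the paper's exact correlation form and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_hidden(X, n_hidden=100):
    """Standard ELM hidden layer: random weights/biases, sigmoid activation."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def corr_covariance(H):
    """Pairwise Pearson correlation between hidden-output rows,
    shifted to [0, 1] so it can serve as a covariance matrix."""
    return (np.corrcoef(H) + 1.0) / 2.0

# GP-style regression with the correlation covariance (noise term for stability)
X = rng.normal(size=(50, 5))
y = X[:, 0] + 0.1 * rng.normal(size=50)
H = elm_hidden(X)
K = corr_covariance(H)
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(y)), y)
y_fit = K @ alpha
print(np.mean((y_fit - y) ** 2))
```

Because the correlation between two hidden-output vectors does not collapse toward zero as the hidden layer widens, the covariance matrix stays away from the identity, which is the mechanism the abstract credits for controlling overfitting.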

4.
Article in English | MEDLINE | ID: mdl-35130148

ABSTRACT

Optimization methods for solving the normalized cut model usually involve three steps: problem relaxation, problem solving, and post-processing. However, these methods are problematic in both performance, since they do not directly solve the original problem, and efficiency, since they usually depend on time-consuming eigendecomposition and on k-means (or spectral rotation) for post-processing. In this paper, we propose a fast optimization method to speed up the classical normalized cut clustering process, in which an auxiliary variable is introduced and alternately updated along with the cluster indicator matrix. The new method is faster than the conventional three-step optimization methods because it solves the normalized cut problem in one step. Theoretical analysis reveals that the new method monotonically decreases the normalized cut objective function and converges in a finite number of iterations. Moreover, we propose efficient methods for adjusting the two regularization parameters. Extensive experimental results show the superior performance of the new method, which is also faster than the existing methods for solving the normalized cut.
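For reference, the classical (relaxed) normalized cut model that such three-step methods target is usually written as the trace minimization below, where A is the affinity matrix, D its degree matrix, L = D - A the graph Laplacian, and F the scaled cluster indicator matrix; the paper's one-step formulation itself is not reproduced here.

```latex
\min_{F}\; \operatorname{Tr}\!\left(F^{\top} L F\right)
\quad \text{s.t.} \quad F^{\top} D F = I,
\qquad L = D - A .
```

The conventional pipeline solves this relaxation by eigendecomposition of L and then recovers a discrete partition with k-means or spectral rotation, which are exactly the two costly stages the proposed method avoids.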

5.
Entropy (Basel); 22(1), 2020 Jan 18.
Article in English | MEDLINE | ID: mdl-33285894

ABSTRACT

Event-based social networks (EBSNs) are widely used to create online social groups and organize offline events for users. Activeness and loyalty are crucial characteristics of these online social groups, determining whether a group grows or becomes inactive in a specific time frame. However, there has been little research on these concepts that clarifies the persistence of groups in event-based social networks. In this paper, we study the problem of group activeness and user loyalty to provide a novel insight into online social networks. First, we analyze the structure of EBSNs and generate features from the crawled datasets. Second, we define the concepts of group activeness and user loyalty based on a series of time windows and propose a method to measure group activeness. In this method, we first compute the ratio of the number of events in two consecutive time windows. We then develop an association matrix to assign an activeness label to each group after several consecutive time windows. Similarly, we measure user loyalty in terms of attended events gathered in time windows and treat loyalty as a contributing feature of group activeness. Finally, three well-known machine learning techniques are used to verify the activeness labels and to generate features for each group. As a consequence, we also identify a small set of highly correlated features that yields higher accuracy than using all the features.
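The event-count ratio at the heart of the activeness measure can be sketched directly. In the toy example below, the window length and timestamps are illustrative; the paper's association matrix for assigning the final labels is not reproduced.

```python
def window_counts(event_times, window, start, end):
    """Count events per fixed-length time window."""
    counts = []
    t = start
    while t < end:
        counts.append(sum(1 for e in event_times if t <= e < t + window))
        t += window
    return counts

def activeness_ratios(counts):
    """Ratio of event counts between consecutive windows."""
    return [b / a if a else float("inf") for a, b in zip(counts, counts[1:])]

events = [1, 3, 4, 9, 10, 11, 12, 20, 21]    # event timestamps (days)
counts = window_counts(events, window=7, start=0, end=28)
print(counts, activeness_ratios(counts))
```

Ratios above 1 indicate a group becoming more active from one window to the next; a run of such ratios over several windows is what the association matrix aggregates into a label.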

6.
IEEE Trans Neural Netw Learn Syst; 31(3): 725-736, 2020 Mar.
Article in English | MEDLINE | ID: mdl-31094694

ABSTRACT

Although many spectral clustering algorithms have been proposed during the past decades, they are not scalable to large-scale data due to their high computational complexity. In this paper, we propose a novel spectral clustering method for large-scale data, namely large-scale balanced min cut (LABIN). A new model is proposed that extends the self-balanced min-cut (SBMC) model with an anchor-based strategy, and a fast spectral rotation with linear time complexity is proposed to solve the new model. Extensive experimental results show the superior performance of our proposed method in comparison with state-of-the-art methods, including SBMC.

7.
IEEE Trans Neural Netw Learn Syst; 29(10): 4593-4606, 2018 Oct.
Article in English | MEDLINE | ID: mdl-29990068

ABSTRACT

In data mining, objects are often represented by a set of features, where each feature of an object has only one value. However, in reality, some features can take on multiple values, for instance, a person with several job titles, hobbies, and email addresses. These features can be referred to as set-valued features and are often treated with dummy features when using existing data mining algorithms to analyze data with set-valued features. In this paper, we propose an SV-k-modes algorithm that clusters categorical data with set-valued features. In this algorithm, a distance function is defined between two objects with set-valued features, and a set-valued mode representation of cluster centers is proposed. We develop a heuristic method to update cluster centers in the iterative clustering process and an initialization algorithm to select the initial cluster centers. The convergence and complexity of the SV-k-modes algorithm are analyzed. Experiments are conducted on both synthetic data and real data from five different applications. The experimental results have shown that the SV-k-modes algorithm performs better when clustering real data than do three other categorical clustering algorithms and that the algorithm is scalable to large data.
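The abstract does not give the distance function itself, but a Jaccard-style per-feature set dissimilarity is one natural instantiation of a distance between set-valued objects; the sketch below uses it purely for illustration, not as the paper's definition.

```python
def set_feature_distance(a, b):
    """Jaccard dissimilarity between two set-valued feature values."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def object_distance(x, y):
    """Sum of per-feature set dissimilarities between two objects."""
    return sum(set_feature_distance(a, b) for a, b in zip(x, y))

# Two people described by set-valued features: job titles and hobbies.
person1 = [{"engineer", "manager"}, {"chess", "running"}]
person2 = [{"engineer"}, {"running", "cooking"}]
print(object_distance(person1, person2))   # 0.5 + 2/3
```

A set-valued mode for a cluster center would then be, per feature, a set chosen to minimize this distance over the cluster's members, which is the role of the heuristic center-update step described above.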

8.
IEEE Trans Neural Netw Learn Syst; 29(12): 6362-6373, 2018 Dec.
Article in English | MEDLINE | ID: mdl-29994271

ABSTRACT

Most feature selection methods first compute a similarity matrix, either by assigning a fixed value to pairs of objects in the whole data or within a class, or by computing the similarity between two objects from the original data. The similarity matrix is then held constant in the subsequent feature selection process. However, similarities computed from the original data may be unreliable, because they are affected by noise features, and the local structure within classes cannot be recovered if the similarities between all pairs of objects in a class are equal. In this paper, we propose a novel local adaptive projection (LAP) framework. Instead of computing fixed similarities before performing feature selection, LAP simultaneously learns an adaptive similarity matrix and a projection matrix W with an iterative method. In each iteration, the similarity matrix is computed from the projected distances with the learned W, and W is computed with the learned similarity matrix. Therefore, LAP can learn a better projection matrix by weakening the effect of noise features through the adaptive similarity matrix. A supervised feature selection with LAP (SLAP) method and an unsupervised feature selection with LAP (ULAP) method are proposed. Experimental results on eight data sets show the superiority of SLAP over seven supervised feature selection methods and of ULAP over five unsupervised feature selection methods.
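The alternating scheme can be sketched at a high level. The updates below, a Gaussian similarity on projected distances and a projection taken from the smallest eigenvectors of X^T L X, are common stand-ins for this kind of adaptive-similarity learning and are assumptions, not the paper's exact update rules.

```python
import numpy as np

def update_similarity(X, W, sigma=1.0):
    """Similarity from distances in the projected space (assumed Gaussian form)."""
    Z = X @ W                                    # project the data
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma**2))
    return S / S.sum(axis=1, keepdims=True)      # row-normalize

def update_projection(X, S, k):
    """Projection from the current similarities via the graph Laplacian."""
    A = (S + S.T) / 2
    L = np.diag(A.sum(axis=1)) - A
    M = X.T @ L @ X
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, :k]                           # smallest-eigenvalue directions

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
W = np.eye(10, 3)
for _ in range(5):                               # alternate the two updates
    S = update_similarity(X, W)
    W = update_projection(X, S, k=3)
```

The point of the alternation is that noisy features contribute less to the projected distances over the iterations, so the similarity matrix and the projection reinforce each other rather than being fixed up front.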

9.
ScientificWorldJournal; 2015: 471371, 2015.
Article in English | MEDLINE | ID: mdl-25879059

ABSTRACT

Random forests (RFs) have been widely used as a powerful classification method. However, with randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting, which gives RFs poor accuracy on high-dimensional data. In addition, the feature selection process in RFs is biased in favor of multivalued features. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features when learning RFs for high-dimensional data. We first remove uninformative features using p-value assessment, then select a subset of unbiased features based on statistical measures. This feature subset is partitioned into two subsets, and a feature-weighting sampling technique is used to sample features from the two subsets for building trees. This approach generates more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs with the proposed approach outperform existing random forests in both accuracy and AUC.
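The filter-then-weighted-sampling pipeline can be illustrated roughly as follows, using a one-way ANOVA p-value per feature and p-value-based sampling weights as stand-ins; the paper's own statistical measures and its two-subset partition are not reproduced.

```python
import numpy as np
from scipy.stats import f_oneway

def feature_p_values(X, y):
    """Per-feature p-value for class-wise mean differences (ANOVA stand-in)."""
    groups = [X[y == c] for c in np.unique(y)]
    return np.array([f_oneway(*[g[:, j] for g in groups]).pvalue
                     for j in range(X.shape[1])])

def sample_features(p_values, n_select, alpha=0.05, rng=None):
    """Drop non-significant features, then sample candidates with weights."""
    rng = rng or np.random.default_rng()
    informative = np.where(p_values < alpha)[0]
    weights = 1.0 - p_values[informative]        # smaller p => larger weight
    weights /= weights.sum()
    return rng.choice(informative, size=min(n_select, len(informative)),
                      replace=False, p=weights)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = rng.integers(0, 2, size=100)
X[:, 0] += 2 * y                                 # make feature 0 informative
p = feature_p_values(X, y)
print(sample_features(p, n_select=5, rng=rng))   # candidate split features
```

Each tree in the forest would draw its candidate split features this way instead of uniformly, which is how the approach steers node splitting toward informative features in high dimensions.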

10.
Int J Data Min Bioinform; 7(1): 1-21, 2013.
Article in English | MEDLINE | ID: mdl-23437512

ABSTRACT

This paper proposes a new analytical process built around a soft subspace clustering method, a changing-window technique, and a series of post-processing strategies to enhance the identification and characterization of local gene expression patterns. The proposed method can be conducted interactively, facilitating the exploration and analysis of local gene expression patterns in real applications. Experimental results show that the proposed method is effective in identifying and characterizing functional gene groups in terms of both local expression similarities and the biological coherence of the genes in a cluster.


Subject(s)
Algorithms, Gene Expression Profiling/methods, Gene Expression, Cluster Analysis, Oligonucleotide Array Sequence Analysis/methods
11.
IEEE Trans Pattern Anal Mach Intell; 29(3): 503-7, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17224620

ABSTRACT

This correspondence describes extensions to the k-modes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in [4], [12] that allows the k-modes paradigm to obtain clusters with strong intra-similarity and to efficiently cluster large categorical data sets. The main aim of this paper is to rigorously derive the updating formula of the k-modes clustering algorithm with the new dissimilarity measure and to prove the convergence of the algorithm under the optimization framework.


Subject(s)
Algorithms, Artifacts, Artificial Intelligence, Cluster Analysis, Information Storage and Retrieval/methods, Pattern Recognition, Automated/methods, Numerical Analysis, Computer-Assisted, Reproducibility of Results, Sensitivity and Specificity
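The k-modes building blocks that entry 11 analyzes, the simple matching dissimilarity and the per-attribute mode update, are easy to state concretely; the minimal sketch below covers only these classical pieces, not the modified dissimilarity measure derived in the paper.

```python
from collections import Counter

def matching_dissimilarity(x, z):
    """Number of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, z))

def cluster_mode(objects):
    """Per-attribute most frequent category among the cluster's members."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*objects)]

cluster = [("red", "small", "round"),
           ("red", "large", "round"),
           ("blue", "small", "round")]
z = cluster_mode(cluster)                          # ['red', 'small', 'round']
print(z, matching_dissimilarity(cluster[0], z))    # dissimilarity 0
```

The paper's contribution is to show rigorously that, with the modified dissimilarity, updating centers to such modes still monotonically decreases the clustering objective, so the iterative process converges.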
12.
IEEE Trans Pattern Anal Mach Intell; 27(5): 657-68, 2005 May.
Article in English | MEDLINE | ID: mdl-15875789

ABSTRACT

This paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step is introduced into the k-means clustering process to iteratively update variable weights based on the current partition of the data, and a formula for the weight calculation is proposed. A convergence theorem for the new clustering process is given. The variable weights produced by the algorithm measure the importance of the variables in clustering and can be used for variable selection in data mining applications involving large and complex real data. Experimental results on both synthetic and real data show that the new algorithm outperforms the standard k-means type algorithms in recovering clusters in data.


Subject(s)
Algorithms, Artificial Intelligence, Cluster Analysis, Information Storage and Retrieval/methods, Models, Statistical, Pattern Recognition, Automated/methods, Computer Simulation, Numerical Analysis, Computer-Assisted, Reproducibility of Results, Sensitivity and Specificity, Signal Processing, Computer-Assisted
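For entry 12, a weight update of the following general form is standard in weighted k-means type clustering, with m variables, weighting exponent β > 1, partition matrix entries u_{i,l}, and cluster centers z_l; it is given here as a reference sketch, and the paper's exact formula and conditions appear in the text.

```latex
D_j = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{i,l}\, d\!\left(x_{i,j}, z_{l,j}\right),
\qquad
w_j = \frac{1}{\sum_{t=1}^{m} \left( D_j / D_t \right)^{1/(\beta-1)}} .
```

Here D_j is the within-cluster dispersion of variable j under the current partition, so variables that separate the clusters well (small D_j) receive large weights, which is what makes the weights usable for variable selection.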