Results 1 - 7 of 7
1.
F1000Res ; 11: 406, 2022.
Article in English | MEDLINE | ID: mdl-36531254

ABSTRACT

Introduction: Air pollution in cities across the world has been steadily increasing in recent years. The rising trend in particulate matter (PM 2.5) is a threat because it can lead to serious consequences such as the worsening of asthma and cardiovascular disease. The metric used to measure air quality is the air pollutant index (API). In Malaysia, machine learning (ML) techniques for PM 2.5 have received less attention, as the focus has been on predicting other air pollutants. To fill this research gap, this study focuses on accurately predicting PM 2.5 concentrations in the smart cities of Malaysia by comparing supervised ML techniques, which helps to mitigate the pollutant's adverse effects. Methods: In this paper, ML models for forecasting PM 2.5 concentrations were investigated on Malaysian air quality datasets from 2017 to 2018. The dataset was preprocessed by data cleaning and normalization. Next, it was reduced to an informative dataset with location and time factors in the feature extraction process. The dataset was fed into three supervised ML classifiers: random forest (RF), artificial neural network (ANN) and long short-term memory (LSTM). Finally, their outputs were evaluated using the confusion matrix and compared to identify the best model for the accurate prediction of PM 2.5. Results: Overall, the experimental results show that an accuracy of 97.7% was obtained by the RF model, compared with 61.14% for ANN and 61.77% for LSTM in predicting PM 2.5. Discussion: RF performed well compared with ANN and LSTM on the given data with minimal features. RF was able to reach good accuracy because the model learns from random samples, using decision trees with a majority vote on the predictions.
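
As an illustration of the random forest step described above (a minimal sketch, not the authors' exact pipeline), the code below trains a RandomForestClassifier on a pre-cleaned table and derives accuracy from the confusion matrix. The file path and the column names station_id, hour, month and pm25_class are hypothetical placeholders.

# Sketch: RF classification of PM 2.5 categories, accuracy taken from the confusion matrix.
# Assumes a preprocessed CSV with hypothetical columns: station_id, hour, month, pm25_class.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("malaysia_air_quality_2017_2018.csv")   # hypothetical path
X = df[["station_id", "hour", "month"]]                   # location and time features
y = df["pm25_class"]                                      # categorical PM 2.5 level

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

cm = confusion_matrix(y_test, rf.predict(X_test))
accuracy = cm.trace() / cm.sum()    # correct predictions on the diagonal / all predictions
print(f"RF accuracy: {accuracy:.4f}")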


Subject(s)
Air Pollutants , Air Pollution , Particulate Matter , Environmental Monitoring/methods , Air Pollutants/adverse effects , Air Pollution/adverse effects , Air Pollution/analysis , Machine Learning
2.
F1000Res ; 10: 901, 2021.
Article in English | MEDLINE | ID: mdl-34858590

ABSTRACT

Introduction: Unauthorized access to data is one of the most significant privacy issues that hinder most industries from adopting big data technologies. Even though specific processes and structures have been put in place to deal with access authorization and identity management for large databases, the scalability requirements are far beyond the capabilities of traditional databases. Hence, most researchers are looking into other solutions, such as big data management. Methods: In this paper, we first study the strengths and weaknesses of implementing cryptography and blockchain for identity management and authorization control in big data, focusing on the healthcare domain. Subsequently, we propose a privacy-preserving, decentralized data access and sharing system that ensures adequate data access management on the blockchain. In addition, we design a blockchain framework to resolve the privacy issues of the decentralized data access and sharing system by implementing a public key infrastructure model that utilizes a signature cryptography algorithm (elliptic curve and signcryption). Lastly, we compare the proposed blockchain model to previous techniques to assess its performance. Results: We evaluated the blockchain on four performance metrics: throughput, latency, scalability, and security. The proposed blockchain model was tested using a sample of 5,000 patients and 500,000 observations. The performance evaluation results further show that the proposed model achieves higher throughput and lower latency than existing approaches when the workload varies up to 10,000 transactions. Discussion: This research reviews the importance of blockchains, as they provide vast possibilities to individuals, companies, and governments.
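
The elliptic-curve signature step in such a public key infrastructure can be sketched with the Python cryptography package as shown below. This shows plain ECDSA signing and verification of an access request only; it is not the authors' signcryption construction or blockchain layer, and the request payload is a made-up example.

# Sketch: signing and verifying a data-access request with an elliptic-curve key pair.
# Illustrative only; the system described above additionally uses signcryption on a blockchain.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

private_key = ec.generate_private_key(ec.SECP256R1())    # identity key of the requesting party
public_key = private_key.public_key()

request = b"patient:12345 grant-read observations to:clinic-42"   # hypothetical payload
signature = private_key.sign(request, ec.ECDSA(hashes.SHA256()))

try:
    public_key.verify(signature, request, ec.ECDSA(hashes.SHA256()))
    print("access request authenticated")
except InvalidSignature:
    print("signature rejected")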


Subject(s)
Blockchain , Algorithms , Delivery of Health Care , Health Facilities , Humans
3.
F1000Res ; 10: 937, 2021.
Article in English | MEDLINE | ID: mdl-34868563

ABSTRACT

Background: A recommender system captures user preferences and behaviour to provide relevant recommendations to the user. A hybrid model-based recommender system requires a pre-trained data model to generate recommendations for a user. An ontology helps to represent semantic information and relationships, modelling the expressivity and linkage among the data. Methods: We enhanced the accuracy of the matrix factorization model by utilizing an ontology to enrich the information of the user-item matrix, integrating item-based and user-based collaborative filtering techniques. In particular, the combination of enriched data, which consists of semantic similarity together with rating patterns, helps to reduce the cold start problem in the model-based recommender system. When a new user or item first enters the system, we have the user demographics or item profile linked to our ontology. Thus, semantic similarity can be calculated during the item-based and user-based collaborative filtering process. The item-based and user-based filtering processes are used to predict the unknown ratings of the original matrix. Results: Experimental evaluations were carried out on the MovieLens 100k dataset to demonstrate the accuracy of our proposed approach compared to the baseline method using (i) Singular Value Decomposition (SVD) and (ii) a combination of the item-based collaborative filtering technique with SVD. Experimental results demonstrated that our proposed method reduced the data sparsity from 0.9542% to 0.8435%. In addition, our proposed method achieved better accuracy, with a Root Mean Square Error (RMSE) of 0.9298, compared to the baseline method (RMSE: 0.9642) and the existing method (RMSE: 0.9492). Conclusions: Our proposed method enhanced the dataset information by integrating user-based and item-based collaborative filtering techniques. The experimental results show that our system reduced the data sparsity and achieved better accuracy compared to the baseline and existing methods.
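
A minimal numeric sketch of the general idea of blending ontology-derived semantic similarity with rating-based similarity to fill an unknown rating is given below. The toy matrices, the alpha weight and the semantic_sim values are invented for illustration; they are not the paper's actual formulation.

# Sketch: predict an unknown rating by mixing rating-based and semantic item similarity.
# All numbers are toy values; the paper's exact weighting scheme is not reproduced here.
import numpy as np

ratings = np.array([                # rows = users, columns = items, 0 = unknown rating
    [5, 3, 0],
    [4, 0, 2],
    [1, 5, 4],
], dtype=float)

semantic_sim = np.array([           # item-item similarity derived from an ontology (toy values)
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.3],
    [0.7, 0.3, 1.0],
])

def rating_sim(a, b):
    """Cosine similarity over co-rated entries only."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

def predict(user, item, alpha=0.5):
    """Item-based prediction weighted by a mix of rating and semantic similarity."""
    scores, weights = 0.0, 0.0
    for j in range(ratings.shape[1]):
        if j == item or ratings[user, j] == 0:
            continue
        sim = alpha * rating_sim(ratings[:, item], ratings[:, j]) + (1 - alpha) * semantic_sim[item, j]
        scores += sim * ratings[user, j]
        weights += abs(sim)
    return scores / weights if weights else 0.0

print(round(predict(user=0, item=2), 2))    # fill the missing rating for user 0, item 2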


Subject(s)
Algorithms
4.
F1000Res ; 10: 881, 2021.
Article in English | MEDLINE | ID: mdl-34900233

ABSTRACT

A knowledge graph (KG) publishes a machine-readable representation of knowledge on the Web. Structured data in a knowledge graph is published using the Resource Description Framework (RDF), where knowledge is represented as a triple (subject, predicate, object). Due to the presence of erroneous, outdated or conflicting data in a knowledge graph, the quality of its facts cannot be guaranteed. Therefore, the provenance of knowledge can assist in building trust in these knowledge graphs. In this paper, we provide an analysis of two popular, general knowledge graphs, Wikidata and YAGO4, with regard to their representation of provenance and context data. Since RDF does not support metadata for providing provenance and contextualization, an alternative method, RDF reification, is employed by most knowledge graphs. The trustworthiness of facts in a knowledge graph can be enhanced by the addition of metadata such as the source of information and the location and time of the fact's occurrence. Wikidata employs qualifiers to attach metadata to facts, while YAGO4 collects metadata from Wikidata qualifiers. RDF reification increases the volume of data, as several statements are required to represent a single fact. However, facts in Wikidata and YAGO4 can be fetched without using reification. Another limitation for applications that use provenance data is that not all facts in these knowledge graphs are annotated with provenance data. Structured data in a knowledge graph is noisy; therefore, the reliability of data in knowledge graphs can be increased by provenance data. To the best of our knowledge, this is the first paper that investigates the method and the extent of metadata addition in two prominent KGs, Wikidata and YAGO4.
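
To make the reification overhead concrete, the sketch below encodes a single fact plus its source using standard RDF reification with rdflib. The example fact, the ex: namespace and the source URL are placeholders, and Wikidata's qualifier mechanism is a different, property-based encoding that is not shown here.

# Sketch: one fact annotated with provenance via standard RDF reification (rdflib).
# The fact, namespace and source URL are placeholders, not data from Wikidata or YAGO4.
from rdflib import Graph, Namespace, RDF, URIRef

EX = Namespace("http://example.org/")
g = Graph()

# The base fact: a single triple.
g.add((EX.Alan_Turing, EX.educatedAt, EX.Princeton_University))

# Reification: several extra statements just to attach one piece of provenance.
stmt = EX.fact1
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Alan_Turing))
g.add((stmt, RDF.predicate, EX.educatedAt))
g.add((stmt, RDF.object, EX.Princeton_University))
g.add((stmt, EX.source, URIRef("http://example.org/some-citation")))

print(len(g))    # 6 triples stored to convey one provenance-annotated fact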


Subject(s)
Metadata , Pattern Recognition, Automated , Empirical Research , Research Design
5.
F1000Res ; 10: 988, 2021.
Article in English | MEDLINE | ID: mdl-36071889

ABSTRACT

Background: Customer churn prediction (CCP) refers to detecting which customers are likely to cancel the services provided by a service provider, for example, internet services. The class imbalance problem (CIP) in machine learning occurs when there is a large difference between the number of samples in the positive class and the negative class. It is one of the major obstacles in CCP, as it deteriorates performance in the classification process. Utilizing data sampling techniques (DSTs) helps to resolve the CIP to some extent. Methods: In this paper, we review the effect of using DSTs on algorithmic fairness, i.e., we investigate whether the results pose any discrimination between male and female groups and compare the results before and after using DSTs. Three real-world datasets with unequal balancing rates were prepared, and four ubiquitous DSTs were applied to them. Six popular classification techniques were utilized in the classification process. Both classifier performance and algorithmic fairness were evaluated with established metrics. Results: The results indicated that the Random Forest classifier outperforms the other classifiers in all three datasets and that using the SMOTE and ADASYN techniques causes more discrimination against the female group. The rate of unintentional discrimination seems to be higher in the original data of extremely unbalanced datasets under the following classifiers: Logistic Regression, LightGBM, and XGBoost. Conclusions: Algorithmic fairness has become a broadly studied area in recent years, yet there is very little systematic study on the effect of using DSTs on algorithmic fairness. This study presents important findings to further the use of algorithmic fairness in CCP research.
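
A small sketch of the kind of check performed in this study (not the authors' exact protocol) is shown below: SMOTE rebalances the training data, a classifier is fitted, and the positive-prediction rates of the male and female groups are compared. The dataset, the column names gender and churn, and the simple parity gap used as a fairness metric are illustrative assumptions.

# Sketch: oversample with SMOTE, then compare predicted churn rates by gender group.
# Dataset, column names and the statistical-parity-style gap are illustrative assumptions.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("churn.csv")                      # hypothetical churn dataset
X = df.drop(columns=["churn"])                     # 'gender' kept as a 0/1 feature column
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
pred = clf.predict(X_test)

# Gap in predicted churn rate between the two groups: larger gap, more disparate treatment.
male_rate = pred[X_test["gender"] == 1].mean()
female_rate = pred[X_test["gender"] == 0].mean()
print(f"predicted churn rate gap (male - female): {male_rate - female_rate:.3f}")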


Subject(s)
Machine Learning , Female , Humans , Logistic Models , Male
6.
F1000Res ; 10: 907, 2021.
Article in English | MEDLINE | ID: mdl-35106138

ABSTRACT

Background: As the eXtensible Markup Language (XML) is the standard for the exchange of data over the World Wide Web, it is important to ensure that an XML database is capable of supporting not only efficient query processing but also frequent data update operations over dynamically changing Web content. Most existing XML annotation is based on a labeling scheme that identifies the hierarchical position of each XML node. This computation is costly, as any update causes the whole XML tree to be re-labelled, an impact most visible on large datasets. Therefore, a robust labeling scheme that avoids re-labeling is crucial. Method: Here, we present ORD-GAP (named after Order Gap), a robust and persistent XML labeling scheme that supports dynamic updates. ORD-GAP assigns unique identifiers with gaps between XML nodes, from which the level, Parent-Child (P-C), Ancestor-Descendant (A-D) and sibling relationships can easily be identified. ORD-GAP adopts the OrdPath labeling scheme for any future insertion. Results: We demonstrate that ORD-GAP is robust under dynamic updates and have implemented it for three use cases: (i) left-most, (ii) in-between and (iii) right-most insertion. Experimental evaluations on the DBLP dataset demonstrated that ORD-GAP outperformed existing approaches such as ORDPath and ME Labeling with respect to database storage size, data loading time and query retrieval. On average, ORD-GAP has the best storage and query retrieval times. Conclusion: The main contributions of this paper are: (i) a robust labeling scheme named ORD-GAP that assigns a gap between nodes to support future insertion, and (ii) an efficient mapping scheme, built upon the ORD-GAP labeling scheme, to transform XML into a relational database (RDB) effectively.
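
The core idea of leaving gaps between labels so that an insertion needs no re-labeling can be sketched in a few lines. This is a simplified illustration of gap labeling in general, not the published ORD-GAP algorithm, and it omits the level/P-C/A-D encoding and the OrdPath fallback.

# Sketch: document-order labels with a fixed gap so that a new node fits between
# existing neighbours without re-labeling. Simplified illustration only.
import xml.etree.ElementTree as ET

GAP = 100

def label_in_order(root):
    """Assign gapped integer labels to all nodes in document (pre-)order."""
    labels, current = {}, 0
    for node in root.iter():          # Element.iter() walks the tree in document order
        current += GAP
        labels[node] = current
    return labels

def label_between(left_label, right_label):
    """Label for a node inserted between two siblings, reusing the unused gap."""
    if right_label - left_label < 2:
        raise ValueError("gap exhausted: fall back to an OrdPath-style extension")
    return (left_label + right_label) // 2

doc = ET.fromstring("<dblp><article/><book/></dblp>")
labels = label_in_order(doc)
print({node.tag: lab for node, lab in labels.items()})        # {'dblp': 100, 'article': 200, 'book': 300}
article, book = list(doc)
print(label_between(labels[article], labels[book]))           # 250: in-between insertion, no re-label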


Subject(s)
Language , Programming Languages , Databases, Factual , Humans
7.
F1000Res ; 10: 1079, 2021.
Article in English | MEDLINE | ID: mdl-38550618

ABSTRACT

In recent years, Recommender System (RS) research has covered a wide variety of Artificial Intelligence techniques, ranging from traditional Matrix Factorization (MF) to complex Deep Neural Networks (DNN). Traditional Collaborative Filtering (CF) recommendation methods such as MF have limited learning capabilities, as they only consider the linear combination of user and item vectors. To learn non-linear relationships, methods like Neural Collaborative Filtering (NCF) incorporate DNNs into CF. However, CF methods still suffer from cold start and data sparsity. This paper proposes an improved hybrid RS, namely Neural Matrix Factorization++ (NeuMF++), for effectively learning user and item features to improve recommendation accuracy and alleviate cold start and data sparsity. NeuMF++ is obtained by incorporating effective latent representations into NeuMF via Stacked Denoising Autoencoders (SDAE), and can also be seen as the fusion of GMF++ and MLP++. NeuMF is an NCF framework that combines GMF (Generalized Matrix Factorization) and MLP (Multilayer Perceptrons); it achieves state-of-the-art results due to the integration of GMF linearity and MLP non-linearity. Concurrently, incorporating latent representations has shown tremendous improvement in GMF and MLP, resulting in GMF++ and MLP++. The latent representation obtained through the SDAEs' latent space allows NeuMF++ to learn user and item features effectively, significantly enhancing its learning capability. However, sharing feature extraction between GMF++ and MLP++ in NeuMF++ might hinder its performance; allowing GMF++ and MLP++ to learn separate features provides more flexibility and greatly improves performance. Experiments performed on a real-world dataset demonstrate that NeuMF++ achieves an outstanding test root-mean-square error of 0.8681. In future work, NeuMF++ can be extended by introducing other auxiliary information such as text or images, and different neural network building blocks can be integrated to form a more robust recommendation model.
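
The fusion of a linear GMF branch and a non-linear MLP branch that NeuMF++ builds on can be sketched as a bare NeuMF skeleton in PyTorch, as below. This is illustrative only: NeuMF++ additionally feeds SDAE latent features into each branch, which is not shown, and the embedding and layer sizes are arbitrary choices.

# Sketch: the bare NeuMF fusion of a GMF branch and an MLP branch in PyTorch.
# Skeleton only; NeuMF++ also injects SDAE latent representations into each branch.
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_gmf = nn.Embedding(n_users, dim)    # embeddings for the linear (GMF) branch
        self.item_gmf = nn.Embedding(n_items, dim)
        self.user_mlp = nn.Embedding(n_users, dim)    # separate embeddings for the MLP branch
        self.item_mlp = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim // 2), nn.ReLU(),
        )
        self.out = nn.Linear(dim + dim // 2, 1)       # fuse both branches into one score

    def forward(self, user, item):
        gmf = self.user_gmf(user) * self.item_gmf(item)                                    # linear interaction
        mlp = self.mlp(torch.cat([self.user_mlp(user), self.item_mlp(item)], dim=-1))      # non-linear interaction
        return self.out(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)

model = NeuMF(n_users=1000, n_items=1700)
score = model(torch.tensor([3]), torch.tensor([42]))
print(score.shape)    # torch.Size([1]): one predicted rating/interaction score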
