1.
Sensors (Basel) ; 24(13)2024 Jun 29.
Article in English | MEDLINE | ID: mdl-39001017

ABSTRACT

The transition to smart manufacturing heightens the complexity of the machinery and equipment used in modern collaborative manufacturing, presenting significant risks associated with equipment failures. The core ambition of smart manufacturing is to elevate automation through the integration of state-of-the-art technologies, including artificial intelligence (AI), the Internet of Things (IoT), machine-to-machine (M2M) communication, cloud technology, and big data analytics. This technological evolution underscores the need for advanced predictive maintenance strategies that detect equipment anomalies before they escalate into costly downtime. Addressing this need, our research presents an end-to-end platform that merges the organizational capabilities of data warehousing with the computational efficiency of Apache Spark. The system manages voluminous time-series sensor data, leverages big data analytics for the seamless creation of machine learning models, and uses an Apache Spark-powered engine to process streaming data for fault detection in real time. This platform represents a significant step forward in smart manufacturing, offering a proactive maintenance model that enhances operational reliability and sustainability in the digital manufacturing era.
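
As an illustration, a minimal Spark Structured Streaming sketch of this kind of fault-detection loop might look as follows; the Kafka topic, broker address, field names, and static alert thresholds are all illustrative assumptions, not details from the paper:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PredictiveMaintenance").getOrCreate()

# Read raw sensor events from Kafka and parse the JSON payload.
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensors")          # hypothetical topic name
            .load()
            .select(F.from_json(
                F.col("value").cast("string"),
                "machine_id STRING, ts TIMESTAMP, vibration DOUBLE, temperature DOUBLE"
            ).alias("r"))
            .select("r.*"))

# Stand-in for a trained anomaly model: flag readings outside a static envelope.
alerts = readings.where((F.col("vibration") > 4.5) | (F.col("temperature") > 90.0))

(alerts.writeStream.outputMode("append")
 .format("console")
 .start()
 .awaitTermination())
```

In a full platform, the static thresholds above would be replaced by a model trained offline on the warehoused sensor history and loaded into the streaming job.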

2.
BMC Bioinformatics ; 24(1): 403, 2023 Oct 27.
Article in English | MEDLINE | ID: mdl-37891497

ABSTRACT

BACKGROUND: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, requiring the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for parallel processing. RESULTS: In this paper we present SeQual-Stream, a streaming tool that performs multiple quality control operations on genomic datasets in a fast, distributed and scalable way. Our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results show significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability. CONCLUSION: Our solution provides a more scalable and higher-performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream .
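
A stream-style quality-control filter of this flavor can be sketched as follows; the one-record-per-line TSV layout, paths, and mean-quality threshold are assumptions for illustration (SeQual-Stream itself parses real FASTQ/FASTA files and offers many more operations):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StreamQC").getOrCreate()

# Picks up new files as they land in the directory (e.g., while copied to HDFS).
lines = spark.readStream.text("hdfs:///incoming/reads/")

parts = F.split(F.col("value"), "\t")
reads = lines.select(parts.getItem(0).alias("id"),
                     parts.getItem(1).alias("seq"),
                     parts.getItem(2).alias("qual"))

# Mean Phred+33 quality: average ASCII code of the quality string, minus 33.
mean_q = F.expr(
    "aggregate(transform(split(qual, ''), c -> ascii(c)), 0, "
    "(acc, x) -> acc + x) / length(qual) - 33")

passed = reads.where(mean_q >= 20)

(passed.writeStream.format("parquet")
 .option("path", "hdfs:///filtered/reads/")
 .option("checkpointLocation", "hdfs:///chk/qc/")
 .start()
 .awaitTermination())
```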


Subject(s)
Genomics , Software , Genomics/methods , Genome , Base Sequence , Algorithms , High-Throughput Nucleotide Sequencing/methods
3.
Heliyon ; 9(2): e13339, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36820038

ABSTRACT

The agriculture sector in Egypt faces several problems, such as climate change, water storage, and yield variability. The comprehensive capabilities of Big Data (BD) can help tackle the uncertainty of the food supply that arises from several factors, such as soil erosion, water pollution, climate change, socio-cultural growth, governmental regulations, and market fluctuations. Crop identification and monitoring play a vital role in modern agriculture. Although several machine learning models have been used to identify crops, the performance of ensemble learning has not been investigated extensively. The massive volume of satellite imagery constitutes a big data problem, requiring the proposed solution to be deployed with big data technologies to manage, store, analyze, and visualize satellite data. In this paper, we develop a weighted voting mechanism for improving crop classification performance at a large scale, based on ensemble learning and a big data schema. Built upon Apache Spark, the popular Big Data framework, the proposed approach was tested on El Salheya, Ismailia governorate. The proposed ensemble approach improved performance by 6.5%, 1.9%, 4.4%, 4.9%, and 4.7% in precision, recall, F-score, Overall Accuracy (OA), and Matthews correlation coefficient (MCC), respectively. Our findings confirm that the proposed crop identification approach generalizes to large-scale settings.
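
The weighted-voting idea can be illustrated with a small sketch; the weights (per-model validation accuracies) and probability values below are made up, and the paper's actual weighting scheme may differ:

```python
import numpy as np

def weighted_vote(prob_stacks: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """prob_stacks: (n_models, n_samples, n_classes) class probabilities.
    Returns the predicted class per sample under accuracy-weighted soft voting."""
    w = weights / weights.sum()                      # normalize the weights
    blended = np.tensordot(w, prob_stacks, axes=1)   # (n_samples, n_classes)
    return blended.argmax(axis=1)

# Three base classifiers, two samples, three crop classes (toy values).
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.4, 0.4, 0.2], [0.1, 0.7, 0.2]],
    [[0.5, 0.2, 0.3], [0.3, 0.3, 0.4]],
])
acc = np.array([0.91, 0.88, 0.85])                   # hypothetical accuracies
print(weighted_vote(probs, acc))                     # -> [0 1]
```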

4.
Heliyon ; 9(2): e13368, 2023 Feb.
Article in English | MEDLINE | ID: mdl-36852030

ABSTRACT

Advances in high-throughput and digital technologies have driven the adoption of big data for handling complex tasks in the life sciences. However, the shift to big data has confronted researchers with technical and infrastructural challenges in storing, sharing, and analysing it. Such tasks require distributed computing systems and algorithms that ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article aims to help life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.

5.
Entropy (Basel) ; 25(2)2023 Jan 31.
Article in English | MEDLINE | ID: mdl-36832627

ABSTRACT

Multiobjective clustering algorithms based on particle swarm optimization (PSO) have been applied successfully in some applications. However, existing algorithms are implemented on a single machine and cannot be directly parallelized on a cluster, which makes it difficult for them to handle large-scale data. With the development of distributed parallel computing frameworks, data parallelism has been proposed; however, increased parallelism leads to the problem of unbalanced data distribution, which affects the clustering result. In this paper, we propose a parallel multiobjective PSO weighted average clustering algorithm based on Apache Spark (Spark-MOPSO-Avg). First, the entire dataset is divided into multiple partitions and cached in memory using the distributed, memory-based computing of Apache Spark. The local fitness value of each particle is calculated in parallel from the data in its partition. Once the calculation is complete, only particle information is transmitted; there is no need to transfer large numbers of data objects between nodes, reducing network communication and thus the algorithm's running time. Second, a weighted average of the local fitness values is computed to mitigate the effect of unbalanced data distribution on the results. Experimental results show that Spark-MOPSO-Avg achieves low information loss under data parallelism, losing about 1% to 9% of accuracy, while effectively reducing the algorithm's time overhead. It shows good execution efficiency and parallel computing capability on a Spark distributed cluster.
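
The core trick, evaluating a particle's fitness per partition and combining the local values as a size-weighted average so that only small (sum, count) pairs cross the network, can be sketched as follows; the compactness objective and all names are illustrative:

```python
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("MOPSO-Avg-sketch").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(np.random.rand(10000, 4).tolist(), numSlices=8)
centroids = np.random.rand(3, 4)      # one particle = one candidate set of centroids

def local_fitness(rows):
    pts = np.array(list(rows))
    if pts.size == 0:
        return iter([])
    # Sum of distances to the nearest centroid (a compactness objective).
    d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    return iter([(d.min(axis=1).sum(), len(pts))])

# Only one (sum, count) pair per partition is shipped back to the driver.
pairs = data.mapPartitions(local_fitness).collect()
total, count = map(sum, zip(*pairs))
fitness = total / count               # size-weighted average of partition means
print(fitness)
```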

6.
Cluster Comput ; 26(3): 1949-1983, 2023.
Article in English | MEDLINE | ID: mdl-36105649

ABSTRACT

Extant sequential wrapper-based feature subset selection (FSS) algorithms are not scalable and yield poor performance when applied to big datasets. To circumvent these challenges, we propose parallel and distributed hybrid evolutionary algorithm (EA) based wrappers under Apache Spark. We propose two hybrid EAs based on Binary Differential Evolution (BDE) and Binary Threshold Accepting (BTA): (i) Parallel Binary Differential Evolution and Threshold Accepting (PB-DETA), where BDE and BTA work in tandem in every iteration, and (ii) its ablation variant, Parallel Binary Threshold Accepting and Differential Evolution (PB-TADE). Here, BTA is invoked to enhance the search capability and avoid premature convergence of BDE. For comparison, we also parallelized two state-of-the-art algorithms, adaptive DE (ADE) and permutation-based DE (DE-FSPM), naming them PB-ADE and P-DE-FSPM, respectively. Throughout, logistic regression (LR) is employed to compute the fitness function, namely the area under the receiver operating characteristic curve (AUC). The effectiveness of the proposed algorithms is tested on five big datasets of varying dimensionality. Notably, PB-TADE turned out to be statistically significantly better than the rest. All the algorithms exhibited repeatability. The proposed parallel model attained speedups of 2.2-2.9. We also report the feature subsets with high AUC and the least cardinality.
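
The wrapper fitness evaluation described here (LR trained on a candidate feature subset, scored by AUC) might be sketched in PySpark as below; the column names, train/test split, and binary-mask encoding are assumptions:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler

def auc_fitness(df, feature_cols, mask, label_col="label"):
    """Fitness of a binary mask: AUC of LR trained on the selected features."""
    chosen = [c for c, bit in zip(feature_cols, mask) if bit == 1]
    if not chosen:
        return 0.0
    assembled = VectorAssembler(inputCols=chosen, outputCol="features").transform(df)
    train, test = assembled.randomSplit([0.8, 0.2], seed=42)
    model = LogisticRegression(labelCol=label_col).fit(train)
    preds = model.transform(test)
    return BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC").evaluate(preds)

# Each EA iteration would score many candidate masks, e.g.:
# auc_fitness(df, ["f1", "f2", "f3", "f4"], [1, 0, 1, 1])
```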

7.
BMC Bioinformatics ; 23(1): 464, 2022 Nov 07.
Article in English | MEDLINE | ID: mdl-36344928

ABSTRACT

BACKGROUND: In recent years, huge improvements have been made in the context of sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance. RESULTS: In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results show significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart. CONCLUSION: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC's performance scales with the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Algorithms , DNA/genetics
8.
Sensors (Basel) ; 22(20)2022 Oct 20.
Article in English | MEDLINE | ID: mdl-36298351

ABSTRACT

As computer networks and the massive amount of communication taking place on them grow, the damage that can be done by network intrusions grows in tandem. An effective and scalable intrusion detection system (IDS) is needed to address the potential damage that accompanies this growth. A great deal of contemporary research on near-real-time IDS focuses on applying machine learning classifiers to labeled network intrusion datasets, but these datasets must remain current with respect to the network intrusions they represent. This paper focuses on a newly created dataset, UWF-ZeekData22, built from Zeek Connection Logs collected using the Security Onion 2 network security monitor and labeled using MITRE ATT&CK framework TTPs. Due to the volume of data, Spark, within a big data framework, was used to run many well-known classifiers (naïve Bayes, random forest, decision tree, support vector classifier, gradient boosted trees, and logistic regression) to classify the reconnaissance and discovery tactics in this dataset. In addition to the performance of these classifiers under Spark, scalability and response time were also analyzed.
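
Training one of the listed classifiers on such a labeled flow dataset follows the standard Spark ML pattern; the schema, feature columns, and paths below are placeholders rather than the actual UWF-ZeekData22 layout:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("ZeekIDS").getOrCreate()
flows = spark.read.parquet("hdfs:///uwf-zeekdata22/flows.parquet")

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="tactic", outputCol="label"),   # MITRE ATT&CK tactic
    VectorAssembler(inputCols=["duration", "orig_bytes", "resp_bytes",
                               "orig_pkts", "resp_pkts"], outputCol="features"),
    RandomForestClassifier(numTrees=100),
])

train, test = flows.randomSplit([0.8, 0.2], seed=7)
model = pipeline.fit(train)
f1 = MulticlassClassificationEvaluator(metricName="f1").evaluate(model.transform(test))
print(f"F1 on held-out flows: {f1:.3f}")
```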


Subject(s)
Big Data , Machine Learning , Bayes Theorem , Logistic Models
9.
PeerJ Comput Sci ; 8: e1047, 2022.
Article in English | MEDLINE | ID: mdl-36092011

ABSTRACT

Social media platforms such as Twitter, YouTube, Instagram and Facebook are leading sources of large datasets nowadays. Twitter's data is among the most reliable due to its privacy policy. Tweets have been used for sentiment analysis and to identify meaningful information within datasets. Our study focuses on the distance learning domain in Saudi Arabia by analyzing Arabic tweets about distance learning. This work proposes a model for analyzing people's feedback using a Twitter dataset in the distance learning domain. The proposed model is based on Apache Spark to manage the large dataset. It uses the Twitter API to collect tweets as raw data, which are stored on the Apache Spark server. A regex-based preprocessing technique removes retweets, links, hashtags, English words and numbers, usernames, and emojis from the dataset. A Logistic Regression model is then trained on the pre-processed data and used to predict the sentiment of the tweets. Finally, a Flask application was built for sentiment analysis of the Arabic tweets. The proposed model gives better results than the various techniques it was compared against, obtaining Accuracy, F1 Score, Precision, and Recall scores of 91%, 90%, 90%, and 89%, respectively, on test data.
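
The regex-based cleanup step might look roughly like the following; the patterns are illustrative approximations of the removals listed above, not the authors' exact expressions:

```python
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

NOISE = re.compile(
    r"RT\s+@\w+:"          # retweet prefix
    r"|https?://\S+"       # links
    r"|[@#]\w+"            # usernames and hashtags
    r"|[A-Za-z0-9]+"       # English words and numbers
    "|[\U0001F300-\U0001FAFF\u2600-\u27BF]"  # common emoji ranges
)

@udf(returnType=StringType())
def clean_tweet(text):
    # Strip the noise, then collapse leftover whitespace.
    return re.sub(r"\s+", " ", NOISE.sub(" ", text or "")).strip()

spark = SparkSession.builder.appName("ArabicSentiment").getOrCreate()
tweets = spark.read.json("hdfs:///tweets/distance_learning.json")  # assumed path
cleaned = tweets.withColumn("clean_text", clean_tweet("text"))
```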

10.
Neural Comput Appl ; 34(22): 20365-20378, 2022.
Article in English | MEDLINE | ID: mdl-35912366

ABSTRACT

The Covid-19 pandemic is a deadly epidemic that continues to affect the whole world. It has dragged countries into a global crisis and caused the collapse of some health systems. Therefore, many technologies are needed to slow the spread of the Covid-19 epidemic and produce solutions. In this context, systems supported by artificial intelligence, machine learning and deep learning have been developed to alleviate the burden on the health system. In this study, a new Internet of Medical Things (IoMT) framework is proposed for the detection and early prevention of Covid-19 infection. In the proposed IoMT framework, a Covid-19 scenario consisting of various numbers of sensors is created in the Riverbed Modeler simulation software. The health data produced in this scenario are analyzed in real time with Apache Spark, and disease prediction is made. To provide more accurate results for Covid-19 disease prediction, Random Forest and Gradient Boosted Tree (GBT) ensemble learning classifiers, both built from Decision Tree classifiers, are compared. In addition, throughput, end-to-end delay, and the Apache Spark data processing performance of heterogeneous nodes with different priorities are analyzed in the Covid-19 scenario. The MongoDB NoSQL database is used in the IoMT framework to store the big health data produced in real time for use in subsequent processes. Experimental results show that the GBT classifier has the best performance, with 95.70% training accuracy, 95.30% test accuracy and an area under the curve (AUC) of 0.970. Moreover, the promising real-time performance of the wireless body area network (WBAN) simulation scenario and Apache Spark shows that they can be used for the early detection of Covid-19 disease.
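
The GBT-versus-Random-Forest comparison rests on standard Spark ML components; a hedged sketch of the GBT side, with placeholder vital-sign features and paths, is:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

spark = SparkSession.builder.appName("IoMT-Covid").getOrCreate()
vitals = spark.read.parquet("hdfs:///iomt/vitals.parquet")  # assumed 0/1 "label"

data = VectorAssembler(
    inputCols=["temperature", "spo2", "heart_rate", "resp_rate"],  # placeholders
    outputCol="features").transform(vitals)
train, test = data.randomSplit([0.8, 0.2], seed=1)

preds = GBTClassifier(maxIter=50).fit(train).transform(test)
acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(preds)
auc = BinaryClassificationEvaluator().evaluate(preds)  # areaUnderROC by default
print(f"accuracy={acc:.3f}  AUC={auc:.3f}")
```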

11.
Sensors (Basel) ; 22(15)2022 Aug 08.
Article in English | MEDLINE | ID: mdl-35957487

ABSTRACT

Apache Spark is a popular open-source distributed data processing framework that can efficiently process massive amounts of data. It provides more than 180 configuration parameters for users to select manually according to their own experience. However, due to the large number of parameters and the inherent correlations between them, manual tuning is very tedious. To move beyond tuning by personal experience, we designed and implemented a reinforcement-learning-based Spark configuration parameter optimizer. First, we trained a Spark application performance prediction model with deep neural networks, and verified the accuracy and effectiveness of the model from multiple perspectives. Second, to improve the efficiency of searching for better configuration parameters, we improved the Q-learning algorithm, automatically setting start and end states in each training iteration, which effectively remedies the agent's otherwise poor exploration of better configurations. Lastly, with the default configuration as the baseline, experimental results show that the optimized configuration gained average performance improvements of 47%, 43%, 31%, and 45% for four different types of Spark applications, indicating that our optimizer can efficiently find better configuration parameters and improve the performance of various Spark applications.
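
A heavily simplified sketch of the underlying idea, Q-learning over discretized configurations with a performance predictor supplying the reward, is given below; the parameter grid, reward, and state/action design are toy stand-ins for the paper's richer formulation:

```python
import random

PARAM_GRID = {
    "spark.executor.memory": ["2g", "4g", "8g"],
    "spark.executor.cores": [2, 4, 8],
    "spark.sql.shuffle.partitions": [100, 200, 400],
}
# An action sets one parameter to one of its candidate values.
ACTIONS = [(p, i) for p in PARAM_GRID for i in range(len(PARAM_GRID[p]))]
PARAMS = list(PARAM_GRID)

def predict_runtime(cfg):
    """Stand-in for the trained DNN performance predictor (seconds)."""
    return random.uniform(50, 150)

Q, alpha, gamma, eps = {}, 0.5, 0.9, 0.2
state = tuple(0 for _ in PARAMS)               # indices into each value list

for step in range(200):
    if random.random() < eps:                  # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
    idx = PARAMS.index(action[0])
    nxt = state[:idx] + (action[1],) + state[idx + 1:]
    cfg = {p: PARAM_GRID[p][i] for p, i in zip(PARAMS, nxt)}
    reward = -predict_runtime(cfg)             # faster predicted run => higher reward
    best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    state = nxt
```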


Subject(s)
Algorithms , Neural Networks, Computer
12.
Cluster Comput ; 25(2): 1355-1372, 2022.
Article in English | MEDLINE | ID: mdl-35068996

ABSTRACT

Distributed denial of service (DDoS) is an immense threat to Internet-based applications and their resources. It floods the victim system with a large number of network packets, making the victim's resources unavailable to legitimate users. It is therefore considered a dangerous attack on Internet-based applications and their resources. Several security approaches have been proposed in the literature to protect against this type of threat, yet the frequency and strength of DDoS attacks are increasing day by day. Furthermore, most traditional and distributed-processing-framework-based DDoS attack detection systems analyze network flows in offline batch processing and thus fail to classify network flows in real time. This paper proposes a novel Spark Streaming and Kafka-based distributed classification system, named SSK-DDoS, for classifying different types of DDoS attacks and legitimate network flows. The classification approach is implemented using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on the Spark streaming platform to classify streams in real time. Incoming streams are consumed from a Kafka topic and preprocessed by extracting and formulating features, then classified into seven groups: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. Further, the SSK-DDoS system stores the formulated features, together with their predicted class, in HDFS, which helps retrain the distributed classification approach on new sets of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that SSK-DDoS efficiently classifies network flows into the seven classes and stores the formulated features with the predicted value of each incoming network flow in HDFS.
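
The deployment pattern, consuming flows from Kafka, classifying them with a pre-trained MLlib model, and persisting features plus predictions to HDFS, can be sketched with Structured Streaming as follows; the topic name, schema, and paths are illustrative, not the paper's:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("SSK-DDoS-sketch").getOrCreate()
model = PipelineModel.load("hdfs:///models/ddos_classifier")  # trained offline

flows = (spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "network-flows")
         .load()
         .select(F.from_json(F.col("value").cast("string"),
                             "duration DOUBLE, pkt_rate DOUBLE, syn_ratio DOUBLE")
                 .alias("f"))
         .select("f.*"))

scored = model.transform(flows)   # pipeline handles assembling + classification

# Persist features with the predicted class for later retraining.
(scored.select("duration", "pkt_rate", "syn_ratio", "prediction")
 .writeStream.format("parquet")
 .option("path", "hdfs:///ddos/scored/")
 .option("checkpointLocation", "hdfs:///chk/ddos/")
 .start()
 .awaitTermination())
```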

13.
Gigascience ; 11(1)2022 01 12.
Article in English | MEDLINE | ID: mdl-35022699

ABSTRACT

BACKGROUND: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. FINDINGS: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. CONCLUSIONS: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.


Subject(s)
High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Exome Sequencing , Whole Genome Sequencing
14.
Sensors (Basel) ; 21(24)2021 Dec 18.
Article in English | MEDLINE | ID: mdl-34960550

ABSTRACT

The ever-increasing pace of IoT deployment is opening the door to concrete implementations of smart city applications, enabling large-scale sensing and the modeling of (near-)real-time digital replicas of physical processes and environments. Such a digital replica could serve as the basis of a decision support system, providing insights into possible optimizations of resources in a smart city scenario. In this article, we extend a prior work, presenting a detailed proof-of-concept implementation of a Digital Twin solution for the Urban Facility Management (UFM) process. The Interactive Planning Platform for City District Adaptive Maintenance Operations (IPPODAMO) is a distributed geographical system fed by heterogeneous data sources originating from different urban data providers. The data are subject to continuous refinement and algorithmic processing, used to quantify and build synthetic indexes measuring the activity level inside an area of interest. IPPODAMO takes into account potential interference from other stakeholders in the urban environment, enabling informed scheduling of operations aimed at minimizing interference and operational costs.

15.
Article in English | MEDLINE | ID: mdl-34639450

ABSTRACT

Coronavirus disease (COVID-19), caused by a recently discovered coronavirus, spreads rapidly from person to person and has proven challenging to detect and cure at an early stage all over the world. Patients showing symptoms of COVID-19 are overcrowding hospitals, which has become a significant challenge. Deep learning's contribution to big data medical research has been enormously beneficial, offering new avenues and possibilities for illness diagnosis techniques. To counteract the COVID-19 outbreak, researchers must create classifiers distinguishing between COVID-positive and COVID-negative X-ray images. In this paper, the Apache Spark system is utilized as a big data framework, and a Deep Transfer Learning (DTL) method is applied using three Convolutional Neural Network (CNN) architectures (InceptionV3, ResNet50, and VGG19) on COVID-19 chest X-ray images. For the two-class task (COVID-19 versus normal X-ray images), all three models achieved 100 percent accuracy. For the three-class task (COVID/normal/pneumonia), detection accuracy was 97 percent for the InceptionV3 model and 98.55 percent for both the ResNet50 and VGG19 models.


Subject(s)
COVID-19 , Deep Learning , Big Data , Humans , SARS-CoV-2 , X-Rays
16.
Gigascience ; 10(9)2021 09 07.
Article in English | MEDLINE | ID: mdl-34494101

ABSTRACT

BACKGROUND: Recently, many new deep learning-based variant-calling methods, such as DeepVariant, have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as the big data framework loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely for distributing/scheduling data among loosely coupled applications, or using I/O-based storage for the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To achieve this, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations. RESULTS: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters. CONCLUSIONS: We show the feasibility and easy scalability of our approach in achieving high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
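
The Arrow-backed transfer pattern the workflow exploits is exposed in PySpark as pandas UDFs, which move whole column batches between the JVM and Python via Arrow rather than pickling row by row; the toy quality metric and file paths below are assumptions, not the authors' code:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("ArrowStages").getOrCreate()

@pandas_udf("double")
def mean_base_quality(qual: pd.Series) -> pd.Series:
    # Operates on an Arrow record batch: mean Phred+33 score per read.
    return qual.map(lambda q: sum(map(ord, q)) / len(q) - 33.0)

reads = spark.read.parquet("hdfs:///aligned/reads.parquet")
reads = reads.withColumn("mean_q", mean_base_quality("qual"))

# Spark built-in functions cover other stages, e.g. coordinate sorting:
reads.sortWithinPartitions("contig", "pos").write.parquet("hdfs:///sorted/")
```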


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Algorithms , Big Data , High-Throughput Nucleotide Sequencing/methods , Workflow
17.
Front Psychol ; 12: 686610, 2021.
Article in English | MEDLINE | ID: mdl-34295289

ABSTRACT

Intelligent big data analysis is an evolving pattern in the age of big data science and artificial intelligence (AI). Analysis of organized data has been very successful, but analyzing human behavior using social media data is challenging. Social media data comprises vast, unstructured data sources that can include likes, comments, tweets, shares, and views. Analytics on such data has become a challenging task for companies, such as Dailymotion, that have billions of daily users and vast numbers of comments, likes, and views. Social media data is created in significant amounts and at a tremendous pace, and the volume that must be stored, sorted, and processed to support careful study and decision-making is very high. This article proposes an architecture that uses a big data analytics mechanism to efficiently and logically process huge social media datasets. The proposed architecture is composed of three layers. The main objective of the project is to demonstrate Apache Spark's parallel processing and distributed framework technologies together with other storage and processing mechanisms. Social media data generated by Dailymotion is used to demonstrate the benefits of this architecture. The project utilizes Dailymotion's application programming interface (API), incorporating functions to fetch and view information; an API key is generated to fetch public channel data in the form of text files. The Hive storage mechanism is used with Apache Spark for efficient data processing. The effectiveness of the proposed architecture is also highlighted.

18.
Sensors (Basel) ; 21(12)2021 Jun 17.
Article in English | MEDLINE | ID: mdl-34204451

ABSTRACT

Large amounts of georeferenced data streams arrive daily at stream processing systems, attributable to the overabundance of affordable IoT devices. In addition, practitioners wish to exploit Internet of Things (IoT) data streams for strategic decision-making. However, mobility data are highly skewed and their arrival rates fluctuate, which poses an extra challenge for data stream processing systems that must meet pre-specified latency and accuracy goals. In this paper, we propose ApproxSSPS, a system for approximate processing of geo-referenced mobility data at scale with quality-of-service guarantees. We focus on stateful aggregations (e.g., means, counts) and top-N queries. ApproxSSPS features a controller that interactively learns latency statistics and calculates sampling rates that meet latency and/or accuracy targets. An overarching trait of ApproxSSPS is its ability to strike a plausible balance between latency and accuracy targets. We evaluate ApproxSSPS on Apache Spark Structured Streaming with real mobility data and compare it against a state-of-the-art online adaptive processing system. Our extensive experiments show that ApproxSSPS can fulfill latency and accuracy targets under varying parameter configurations and load intensities (i.e., transient peaks in data loads versus slowly arriving streams). Moreover, our results show that ApproxSSPS outperforms the baseline by significant margins. In short, ApproxSSPS is a novel spatial data stream processing system that can deliver accurate results in a timely manner by dynamically adjusting the limits on data samples.
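
The controller's feedback idea, lowering the sampling rate when observed latency exceeds the target and raising it when there is headroom, can be reduced to a toy skeleton; the multiplicative step and simulated latencies are illustrative, and ApproxSSPS itself learns latency statistics rather than applying a fixed rule:

```python
def adjust_rate(rate, observed_latency_ms, target_ms,
                step=0.1, min_rate=0.05, max_rate=1.0):
    """Multiplicative-adjustment skeleton for a latency-driven sampler."""
    if observed_latency_ms > target_ms:   # falling behind: sample less
        rate *= (1.0 - step)
    else:                                 # headroom: sample more (helps accuracy)
        rate *= (1.0 + step)
    return max(min_rate, min(max_rate, rate))

rate = 1.0
for latency in [800, 950, 1300, 1250, 1100, 900]:   # simulated batch latencies (ms)
    rate = adjust_rate(rate, latency, target_ms=1000)
    print(f"latency={latency}ms -> sampling rate={rate:.2f}")
```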


Subject(s)
Algorithms , Internet of Things , Cities
19.
Gigascience ; 10(4)2021 04 05.
Article in English | MEDLINE | ID: mdl-33822938

ABSTRACT

BACKGROUND: The role of synonymous single-nucleotide variants in human health and disease is poorly understood, yet evidence suggests that this class of "silent" genetic variation plays multiple regulatory roles in both transcription and translation. One mechanism by which synonymous codons direct and modulate the translational process is through alteration of the elaborate structure formed by single-stranded mRNA molecules. While tools to computationally predict the effect of non-synonymous variants on protein structure are plentiful, analogous tools to systematically assess how synonymous variants might disrupt mRNA structure are lacking. RESULTS: We developed novel software using a parallel processing framework for large-scale generation of secondary RNA structures and folding statistics for the transcriptome of any species. Focusing our analysis on the human transcriptome, we calculated 5 billion RNA-folding statistics for 469 million single-nucleotide variants in 45,800 transcripts. By considering the impact of all possible synonymous variants globally, we discover that synonymous variants predicted to disrupt mRNA structure have significantly lower rates of incidence in the human population. CONCLUSIONS: These findings support the hypothesis that synonymous variants may play a role in genetic disorders due to their effects on mRNA structure. To evaluate the potential pathogenic impact of synonymous variants, we provide RNA stability, edge distance, and diversity metrics for every nucleotide in the human transcriptome and introduce a "Structural Predictivity Index" (SPI) to quantify structural constraint operating on any synonymous variant. Because no single RNA-folding metric can capture the diversity of mechanisms by which a variant could alter secondary mRNA structure, we generated a SUmmarized RNA Folding (SURF) metric to provide a single measurement to predict the impact of secondary structure altering variants in human genetic studies.


Subject(s)
Protein Biosynthesis , RNA Stability , Codon , Humans , Nucleotides , RNA, Messenger/genetics , RNA, Messenger/metabolism
20.
Comput Biol Chem ; 92: 107454, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33684695

ABSTRACT

This paper introduces a kernel-based fuzzy clustering approach to deal with non-linearly separable problems by applying a kernel Radial Basis Function (RBF), which maps the input data space non-linearly into a high-dimensional feature space. Discovering clusters in high-dimensional genomics data is extremely challenging for bioinformatics researchers performing genome analysis. To support investigations in bioinformatics, specifically genomic clustering, we propose high-dimensional kernelized fuzzy clustering algorithms based on the Apache Spark framework for clustering Single Nucleotide Polymorphism (SNP) sequences. The paper proposes Kernelized Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (KSRSIO-FCM), which inherently uses another proposed algorithm, Kernelized Scalable Literal Fuzzy c-Means (KSLFCM). Both approaches fully adapt to the Apache Spark cluster framework through a localized sub-clustering Resilient Distributed Dataset (RDD) method. Additionally, we propose a preprocessing approach that generates numeric feature vectors for huge SNP sequences and scales by executing on an Apache Spark cluster; it is applied to real-world SNP datasets taken from open internet repositories of two different plant species, soybean and rice. Comparison of the proposed scalable kernelized fuzzy clustering results with similar works shows significant improvement in time and space complexity, Silhouette index, and Davies-Bouldin index. Exhaustive experiments on various SNP datasets show the effectiveness of the proposed KSRSIO-FCM in comparison with KSLFCM and other scalable clustering algorithms, i.e., SRSIO-FCM and SLFCM.
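
The kernel trick at the heart of these algorithms is compact: with an RBF kernel, K(x, x) = 1, so the feature-space distance between a point and a prototype is ||φ(x) − φ(v)||² = 2(1 − K(x, v)), which plugs directly into the fuzzy c-means membership update. A small single-node NumPy sketch with made-up data follows (the paper's versions distribute this computation over Spark RDDs):

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2), broadcast over arrays."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def kfcm_memberships(X, V, m=2.0, gamma=0.5):
    """X: (n, d) points; V: (c, d) prototypes. Returns (n, c) memberships."""
    d2 = 2.0 * (1.0 - rbf(X[:, None, :], V[None, :, :], gamma))  # (n, c)
    d2 = np.maximum(d2, 1e-12)                 # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))             # standard FCM update on d^2
    return inv / inv.sum(axis=1, keepdims=True)

X = np.random.rand(6, 4)        # e.g. numeric vectors encoding SNP features
V = X[np.random.choice(len(X), 2, replace=False)]
U = kfcm_memberships(X, V)
print(U.round(3))               # each row sums to 1 across the two clusters
```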


Subject(s)
Algorithms , Cluster Analysis , Fuzzy Logic , Polymorphism, Single Nucleotide/genetics , Computational Biology , Databases, Genetic , Humans