Optimizing performance of parallel computing platforms for large-scale genome data analysis
Noor, Sumaiya ; Awan, Hamid Hussain ; Hashmi, Amber Sarwar ; Saeed, Aamir ; Khan, Salman ; AlQahtani, Salman A.
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
With rapid technological advances and Next-Generation Sequencing (NGS), the volume of discovered genome data (i.e., DNA, RNA, and protein sequences) is increasing exponentially, growing every month. Timely analysis of genome datasets is essential for understanding biological activity and for drug development. However, because of the vast number of sequences and the complex structure of genome datasets, their storage and timely analysis are becoming challenging for traditional analysis techniques. Distributed and cluster computing platforms are becoming significant for big data analytics and are now required in computational biology. This paper presents a computational model named Sprak-Pi-DNN, based on a parallel deep neural network, for the timely classification of large RNA sequences into piRNAs and non-piRNAs. The proposed model takes advantage of parallel and distributed computing platforms. The performance of Sprak-Pi-DNN was evaluated extensively in two parts. In the first part, we compare the performance of two widely used cluster-based big data analytics platforms (Apache Hadoop and Apache Spark) on the same benchmark dataset, using computation time, speedup, and scalability as computational metrics. The second part assesses the proposed model’s effectiveness using accuracy, specificity, sensitivity, and the Matthews correlation coefficient. We employed the PseKNC algorithm for feature extraction with varied K-sizes. The experimental results revealed that Apache Spark achieved shorter execution times than Apache Hadoop. Moreover, the evaluation results in both cases showed that the proposed model improved computation speedup without affecting accuracy.
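The evaluation metrics named in the abstract have standard textbook definitions, which can be sketched as follows. This is a minimal illustration only, not the authors' code: the function names `classification_metrics` and `speedup` and the confusion-matrix counts in the usage example are hypothetical, and speedup is assumed to be the ratio of a baseline run time to the parallel run time.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and MCC from confusion counts.

    tp/tn/fp/fn are true positives, true negatives, false positives, and
    false negatives (here, "positive" would mean piRNA). Hypothetical helper.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


def speedup(t_baseline, t_parallel):
    """Speedup as baseline time divided by parallel time (assumed definition)."""
    return t_baseline / t_parallel


if __name__ == "__main__":
    # Illustrative counts and timings only; not results from the paper.
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
    print(speedup(t_baseline=120.0, t_parallel=30.0))
```

Reporting the Matthews correlation coefficient alongside accuracy is useful when the piRNA and non-piRNA classes are imbalanced, because it uses all four confusion-matrix counts.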
Citation
S. Noor, H. H. Awan, A. S. Hashmi, A. Saeed, S. Khan, and S. A. AlQahtani, “Optimizing performance of parallel computing platforms for large-scale genome data analysis,” Computing, vol. 107, no. 3, pp. 1–22, Mar. 2025, doi: 10.1007/s00607-025-01441-y.
Source
Computing
Keywords
Hadoop, Big Data, Spark, Cluster Computing, Distributed Computing, Bioinformatics, Genome
Publisher
Springer Nature
