Publication

Title: Alignment-free viral sequence classification at scale
Authors: van Zyl D, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier J, Riley C, Winters A, Naranbhai V, Made F, Karim S, Otwombe K, Abimiku A, Osawe S, Onyemata J, Dakum P, Murtala-Ibrahim F, Andrew N, Musa A, Adenekan T, Ewerem K, Etuk V, de Oliveira T, Baxter C, Wilkinson E, Tegally H, Poongavanan J, Parker M, Silva D, Xavier J, Stafford K, Charurat M, Blanco N, O’Connor T, Fitzpatrick M, Sajadi M, Lawal O, Xiong C, Luo W, Wu X.
Journal: BMC Genomics, 26:389 (2025)

Abstract

Background

The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets.

Results

We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively.
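
To illustrate what a word-based AF feature vector can look like, the following is a minimal sketch of normalized k-mer counting. The abstract does not name the six techniques used, so the choice of k-mer frequencies, the function name `kmer_frequency_vector`, and the default `k=3` are assumptions made here for illustration only, not the authors' implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper for illustration; not taken from the paper.
    """
    sequence = sequence.upper()
    # Fixed word order, so every genome maps to a vector of the same length.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    valid = set(alphabet)
    # Slide a window of width k; skip windows with ambiguous bases such as N.
    counts = Counter(
        sequence[i:i + k]
        for i in range(len(sequence) - k + 1)
        if set(sequence[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Normalize by the number of valid windows so genome length cancels out.
    return [counts[w] / total if total else 0.0 for w in words]
```

Each genome becomes a vector of length 4^k regardless of its length, with no alignment step. Vectors of this kind could then be passed, with their lineage labels, to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`.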

Conclusion

Despite the high class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing than alignment-based methods and the ability to classify sequences using modest computational resources.


KRISP has been created by the coordinated effort of the University of KwaZulu-Natal (UKZN), the Technology Innovation Agency (TIA) and the South African Medical Research Council (SAMRC).


Location: K-RITH Tower Building
Nelson R Mandela School of Medicine, UKZN
719 Umbilo Road, Durban, South Africa.
Director: Prof. Tulio de Oliveira