Implementation of Naïve Bayes and Gini Index for Spam Email Classification

  • Fikri Rozan Imadudin, Telkom University
  • Danang Triantoro Murdiansyah
  • Adiwijaya
Keywords: Complete Gini-Index Text, Multinomial Naïve Bayes, Email Classification

Abstract

Email remains a widely used medium for exchanging information, yet it continues to suffer from a persistent problem: spam. Spam is email that can pollute, damage, or disturb the recipient. In this study, we evaluate the performance and accuracy of Multinomial Naïve Bayes (MNNB) combined with Complete Gini-Index Text (GIT) feature selection for spam email filtering. The classifiers were evaluated using 6-fold cross-validation. We found that, with GIT feature selection and only 80,000 features, the average result exceeds that of Multinomial Naïve Bayes without feature selection by 0.39%. Feature selection also speeds up classification and removes features that are less relevant to the target categories.
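
The pipeline described in the abstract (feature selection, a Multinomial Naïve Bayes classifier, and 6-fold cross-validation) can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation. The scoring function is a simplified Gini-index score, the sum over classes of P(t|c)^2 * P(c|t)^2 computed from document frequencies, and may differ from the exact Complete GIT formula. The toy messages, the function names, the Laplace smoothing parameter alpha, and the interleaved fold assignment are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def gini_scores(docs, labels):
    """Simplified Gini-index score per term: sum_c P(t|c)^2 * P(c|t)^2.

    Assumption: a stand-in for the paper's Complete GIT, not its exact formula.
    """
    classes = sorted(set(labels))
    docs_per_class = Counter(labels)
    df = {c: Counter() for c in classes}  # per-class document frequency
    for tokens, y in zip(docs, labels):
        df[y].update(set(tokens))
    vocab = set().union(*(df[c].keys() for c in classes))
    scores = {}
    for t in vocab:
        total = sum(df[c][t] for c in classes)
        scores[t] = sum(
            (df[c][t] / docs_per_class[c]) ** 2 * (df[c][t] / total) ** 2
            for c in classes
        )
    return scores


def select_top_k(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = gini_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_mnb(docs, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing over the selected vocab."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in vocab)
    loglik = {}
    for c in classes:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return prior, loglik


def predict(model, tokens):
    """Return the class with the highest log-posterior; unknown terms are ignored."""
    prior, loglik = model
    return max(
        prior,
        key=lambda c: prior[c] + sum(loglik[c][t] for t in tokens if t in loglik[c]),
    )


def cross_validate(docs, labels, k_features, n_folds=6):
    """Mean accuracy over n_folds; features are selected on training data only."""
    accuracies = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(docs), n_folds))
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [y for i, y in enumerate(labels) if i not in test_idx]
        vocab = select_top_k(train_docs, train_labels, k_features)
        model = train_mnb(train_docs, train_labels, vocab)
        correct = sum(predict(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy corpus: even indices are spam, odd indices are ham.
raw = [
    "win free prize now", "meeting report attached today",
    "free win cash offer", "project meeting report please",
    "claim free prize win", "report for meeting tomorrow",
    "win big free offer", "meeting notes and report",
    "free entry win now", "weekly report meeting agenda",
    "you win free cash", "please review meeting report",
]
docs = [text.split() for text in raw]
labels = ["spam" if i % 2 == 0 else "ham" for i in range(len(raw))]

mean_acc = cross_validate(docs, labels, k_features=4)
print(f"mean 6-fold accuracy with 4 features: {mean_acc:.2f}")
```

Selecting features inside each fold, rather than once on the full corpus, keeps the held-out messages from influencing which terms are retained. Restricting the vocabulary also shrinks the per-class likelihood tables, which is one way feature selection can speed up classification.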

Author Biography

Fikri Rozan Imadudin, Telkom University
Student

Published
2021-05-03
How to Cite
Imadudin, F. R., Murdiansyah, D. T., & Adiwijaya. (2021). Implementation of Naïve Bayes and Gini Index for Spam Email Classification. Indonesia Journal on Computing (Indo-JC), 6(1), 85-94. https://doi.org/10.34818/INDOJC.2021.6.1.452
Section
Computational and Simulation