Toxic Comment Classification on Social Media Using Support Vector Machine and Chi Square Feature Selection

  • Nadhia Salsabila Azzahra Telkom University
  • Danang Triantoro Murdiansyah Telkom University
  • Kemas M. Lhaksmana Telkom University
Keywords: text classification, toxic comment, social media, support vector machine

Abstract

Social media use continues to grow, and its ease of access and familiarity make it easier for irresponsible users to engage in unethical behavior such as spreading hatred, defamation, radicalism, pornography, and so on. Although regulations govern activity on social media, they are not yet working effectively. In this study, we classify toxic comments containing such unethical content using a support vector machine (SVM), with TF-IDF for feature extraction and Chi Square for feature selection. In our experiments, the best performance, an F1-score of 76.57%, was obtained by the SVM model with a linear kernel, without Chi Square feature selection, and with stemming and stopword removal applied.
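
To make the Chi Square feature selection step concrete, the following is a minimal sketch, not the authors' implementation. It assumes binary labels (1 = toxic, 0 = non-toxic), whitespace tokenization, and the common 2x2 term/class contingency form of the statistic; the function names `chi_square_scores` and `select_top_k` and the toy comments are hypothetical and chosen only for illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term with the chi-square statistic of its 2x2 term/class table.

    Assumption: labels are 1 (toxic) or 0 (non-toxic), and a term counts
    once per document (presence/absence), not by raw frequency.
    """
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos

    # Document frequency of each term within each class.
    df_pos, df_neg = Counter(), Counter()
    for text, label in zip(docs, labels):
        terms = set(text.lower().split())
        (df_pos if label == 1 else df_neg).update(terms)

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # toxic documents containing the term
        b = df_neg[term]   # non-toxic documents containing the term
        c = n_pos - a      # toxic documents without the term
        d = n_neg - b      # non-toxic documents without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        # A zero denominator means the term (or a class) carries no contrast.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up comments used only to exercise the functions.
docs = ["you are stupid", "have a nice day", "stupid idea", "nice work today"]
labels = [1, 0, 1, 0]

scores = chi_square_scores(docs, labels)
top_terms = select_top_k(scores, 2)
print(top_terms)
```

In a full pipeline, only the selected terms would be kept as TF-IDF features before training the SVM. The abstract reports that, in the authors' experiments, the best F1-score was obtained without this selection step.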

Published
2021-07-02
How to Cite
Azzahra, N., Murdiansyah, D., & Lhaksmana, K. (2021). Toxic Comment Classification on Social Media Using Support Vector Machine and Chi Square Feature Selection. International Journal on Information and Communication Technology (IJoICT), 7(1), 64-76. https://doi.org/10.21108/ijoict.v7i1.552
Section
Intelligence System