Overcoming Data Imbalance Problems in Sexual Harassment Classification with SMOTE

Aji Gautama Putrada (Telkom University); Irfan Dwi Wijaya; Dita Oktaria

doi:10.21108/ijoict.v8i1.622

Keywords: SMOTE, QPQ, HWE, sexual harassment, data imbalance, text analysis, TF-IDF, naïve Bayes, KNN

Abstract

Delivering justice with the help of artificial intelligence is an active research area. Machine learning combined with natural language processing (NLP) can classify accounts of sexual harassment into two types: quid pro quo (QPQ) and hostile work environment (HWE). However, the classes in such datasets are often imbalanced. Data imbalance can degrade a classifier's performance because the classifier tends to favor the majority class. This study implements and evaluates the synthetic minority over-sampling technique (SMOTE) to improve QPQ and HWE classification on a sexual harassment experience dataset. The term frequency-inverse document frequency (TF-IDF) method weights the terms of each document before classification. We then compare naïve Bayes with K-Nearest Neighbor (KNN) in classifying sexual harassment experiences. The comparison shows that the naïve Bayes classifier outperforms the KNN classifier in classifying QPQ and HWE, with AUC values of 0.95 and 0.92, respectively. The evaluation also shows that applying SMOTE to the naïve Bayes classifier raises the precision of the minority class from 74% to 90%.
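
The two preprocessing steps named in the abstract, TF-IDF weighting followed by SMOTE oversampling of the minority class, can be illustrated with a minimal sketch. It uses only the Python standard library and is not the authors' implementation: the toy tokens, the generic `majority`/`minority` labels, the IDF variant `log(N/df) + 1`, the neighbour count `k`, and the helper names `tfidf` and `smote` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, dense TF-IDF vectors) for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # One common IDF variant (an assumption; the paper may use another).
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical toy corpus: five majority-class and two minority-class documents.
docs = [
    ["meeting", "comment", "joke"],
    ["comment", "joke", "office"],
    ["office", "joke", "meeting"],
    ["comment", "office", "meeting"],
    ["joke", "comment", "comment"],
    ["promotion", "condition", "offer"],
    ["offer", "condition", "meeting"],
]
labels = ["majority"] * 5 + ["minority"] * 2

vocab, X = tfidf(docs)
minority = [x for x, y in zip(X, labels) if y == "minority"]
n_majority = labels.count("majority")

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=n_majority - len(minority))
X_balanced = X + synthetic
y_balanced = labels + ["minority"] * len(synthetic)
print(Counter(y_balanced))
```

Each synthetic vector lies on the line segment between two real minority vectors, so SMOTE adds new points rather than duplicating existing ones. In an actual evaluation, oversampling would normally be applied to the training split only, so that the test set keeps the original class distribution; the balanced vectors would then be passed to the naïve Bayes or KNN classifier.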
