Overcoming Data Imbalance Problems in Sexual Harassment Classification with SMOTE

  • Aji Gautama Putrada, Telkom University
  • Irfan Dwi Wijaya
  • Dita Oktaria
Keywords: SMOTE, QPQ, HWE, sexual harassment, data imbalance, text analysis, TF-IDF, naïve Bayes, KNN

Abstract

Delivering justice with the help of artificial intelligence is an active research area. Machine learning with natural language processing (NLP) can classify sexual harassment experiences into two types: quid pro quo (QPQ) and hostile work environment (HWE). However, the classes in sexual harassment datasets are often imbalanced. Data imbalance can degrade a classifier's performance because the classifier tends to favor the majority class. This study implements and evaluates the synthetic minority over-sampling technique (SMOTE) to improve QPQ and HWE classification on a sexual harassment experience dataset. The term frequency-inverse document frequency (TF-IDF) method provides the document weighting used in the classification process. We then compare naïve Bayes with K-Nearest Neighbor (KNN) in classifying sexual harassment experiences. The comparison shows that the naïve Bayes classifier outperforms the KNN classifier in classifying QPQ and HWE, with AUC values of 0.95 and 0.92, respectively. The evaluation results show that applying SMOTE to the naïve Bayes classifier increases the precision of the minority class from 74% to 90%.
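
The over-sampling idea behind SMOTE can be illustrated with a minimal sketch. This is a simplified illustration, not the authors' implementation: it assumes the minority-class documents are already represented as numeric vectors (for example TF-IDF weights), and the function name `smote`, its parameters, and the sample vectors are all hypothetical. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified sketch: base samples are drawn at random, whereas the original
    SMOTE formulation generates a fixed number of points per minority sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean), excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate: new = base + gap * (neighbour - base), gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical TF-IDF-like vectors for the minority class (made-up values).
minority_vectors = [[0.0, 0.2, 0.9], [0.1, 0.3, 0.8], [0.0, 0.1, 0.7]]
new_points = smote(minority_vectors, n_new=4)
```

Because every synthetic vector is an interpolation between two real minority vectors, the new points stay inside the region already occupied by the minority class rather than duplicating existing samples. Over-sampling of this kind is normally applied to the training split only, so that evaluation metrics are computed on real data.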

Published
2022-08-02
How to Cite
Putrada, A. G., Wijaya, I. D., & Oktaria, D. (2022). Overcoming Data Imbalance Problems in Sexual Harassment Classification with SMOTE. International Journal on Information and Communication Technology (IJoICT), 8(1), 20-29. https://doi.org/10.21108/ijoict.v8i1.622
Section
Theoretical Computer Science & Statistic