Machine Learning-Based Hate Speech Detection in the Kazakh Language

Bolatbek, Milana; Mussiraliyeva, Shynar; Sagynay, Moldir

dc.contributor.author	Bolatbek, Milana
dc.contributor.author	Mussiraliyeva, Shynar
dc.contributor.author	Sagynay, Moldir
dc.date.accessioned	2026-01-02T10:28:10Z
dc.date.available	2026-01-02T10:28:10Z
dc.date.issued	2025-12-30
dc.identifier.issn	1731-7533
dc.identifier.uri	http://hdl.handle.net/11089/57150
dc.description.abstract	Modern text data processing and classification methods require extensive use of machine learning and neural networks. Categorizing text into different classes has become a crucial task in many fields. This paper presents a multi-class text classification model utilizing a Modified TF-IDF (MTF-IDF) approach in combination with Long Short-Term Memory (LSTM) neural networks, XGBoost, and MLPClassifier algorithms. Additionally, the study explores the integration of TF-IDF and CountVectorizer (MTF-IDF) methods for text vectorization, aiming to enhance classification efficiency. The research findings indicate that the LSTM model achieved the highest accuracy rate of 89%, demonstrating superior performance. The MLPClassifier model achieved 85% accuracy, while XGBoost obtained 81% accuracy. Moreover, the integration of TF-IDF and MTF-IDF methods significantly improved the detection of rare but essential words, enhancing the overall performance of the models. This study is dedicated to addressing the problem of automated detection of harmful content in the Kazakh language. Hate speech in the digital space refers to any online material that harms individuals or communities through aggression, manipulation, discrimination, or the intentional spread of socially damaging narratives. The results provide a solid foundation for future research aimed at the early identification and mitigation of hate speech in the digital space, contributing to a safer online environment.	en
dc.language.iso	en
dc.publisher	Wydawnictwo Uniwersytetu Łódzkiego	pl
dc.relation.ispartofseries	Research in Language	en
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0
dc.subject	hate speech	en
dc.subject	bullying	en
dc.subject	violent extremism	en
dc.subject	TF-IDF	en
dc.subject	MTF-IDF	en
dc.subject	LSTM	en
dc.subject	MLPClassifier	en
dc.title	Machine Learning-Based Hate Speech Detection in the Kazakh Language	en
dc.type	Article
dc.page.number	311-319
dc.contributor.authorAffiliation	Bolatbek, Milana - Al-Farabi Kazakh National University	en
dc.contributor.authorAffiliation	Mussiraliyeva, Shynar - Al-Farabi Kazakh National University	en
dc.contributor.authorAffiliation	Sagynay, Moldir - Al-Farabi Kazakh National University	en
dc.references	Gorwa R., Binns R., Katzenbach C. Algorithmic content moderation: Technical and political challenges in the automation of platform governance // Big Data & Society. – 2020. – Vol. 7. – No. 1. – P. 2053951719897945.	en
dc.references	Barakhin V. B. et al. Methods for detecting destructive information // Journal of Physics: Conference Series. – IOP Publishing, 2019. – Vol. 1405. – No. 1. – P. 012004.	en
dc.references	Kumisbekov S. K., Sabitov S. M., Akimzhanova M. T. Issues of preventing cyberbullying at the present stage // Bulletin of the Karaganda University “Law Series”. – 2022. – Vol. 105. – No. 1. – Pp. 85–95.	en
dc.references	Alqahtani A. F., Ilyas M. A Machine Learning Ensemble Model for the Detection of Cyberbullying // arXiv preprint arXiv:2402.12538. – 2024.	en
dc.references	Li J. R., Mao Y. F., Yang K. Improvement and application of TF*IDF algorithm // Information Computing and Applications: Second International Conference, ICICA 2011, Qinhuangdao, China, October 28-31, 2011. Proceedings 2. – Springer Berlin Heidelberg, 2011. – Pp. 121–127.	en
dc.references	Fan H., Qin Y. Research on text classification based on improved tf-idf algorithm // 2018 International Conference on Network, Communication, Computer Engineering (NCCE 2018). – Atlantis Press, 2018. – Pp. 501–506.	en
dc.references	Shakil M. H., Alam M. G. R. Toxic Voice Classification Implementing CNN-LSTM & Employing Supervised Machine Learning Algorithms Through Explainable AI-SHAP // 2022 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET). – IEEE, 2022. – Pp. 1–6.	en
dc.references	Schnitzler K. et al. Using Twitter™ to influence research: discussing strategies, opportunities, and challenges // International Journal of Nursing Studies. – 2016. – Vol. 59. – Pp. 15–26.	en
dc.references	Bolatbek M. et al. Kazakh Language Dataset for Hate Speech Detection on Social Media Text // 2024 IEEE 9th International Conference on Computational Intelligence and Applications (ICCIA). – IEEE, 2024. – Pp. 94–98.	en
dc.contributor.authorEmail	Bolatbek, Milana - bolatbek.milana@gmail.com
dc.contributor.authorEmail	Mussiraliyeva, Shynar - mussiraliyevash@gmail.com
dc.contributor.authorEmail	Sagynay, Moldir - sagynaymoldir11@gmail.com
dc.identifier.doi	10.18778/1731-7533.23.19
dc.relation.volume	23
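
Illustrative sketch (not part of the record): the abstract describes combining TF-IDF with CountVectorizer-style counts for text vectorization but does not define the MTF-IDF formula. The Python below is one possible reading, assuming the combination is a simple concatenation of raw term counts with L2-normalised TF-IDF weights. The function names, the smoothed IDF variant, and the toy English corpus are all assumptions made for illustration; they are not the authors' implementation or data.

```python
import math
from collections import Counter


def fit_vocabulary(docs):
    """Map each whitespace token to a column index (sorted for determinism)."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vector(doc, vocab):
    """Raw term counts, the quantity a count vectorizer produces."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    return [float(counts.get(tok, 0)) for tok in vocab]


def smoothed_idf(docs, vocab):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, one common smoothed variant."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each term once per document
    return [math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab]


def combined_features(doc, vocab, idf):
    """Concatenate raw counts with L2-normalised TF-IDF weights (assumed scheme)."""
    counts = count_vector(doc, vocab)
    weighted = [c * w for c, w in zip(counts, idf)]
    norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
    return counts + [x / norm for x in weighted]


# Hypothetical toy corpus with placeholder tokens, not the Kazakh dataset.
docs = ["spam spam ham", "ham eggs", "rare eggs eggs"]
vocab = fit_vocabulary(docs)
idf = smoothed_idf(docs, vocab)
features = [combined_features(d, vocab, idf) for d in docs]
```

Under this reading, each document vector has twice the vocabulary size: the first half keeps absolute frequencies, while the second half up-weights terms that occur in few documents, which is the property the abstract credits with better detection of rare but essential words. Such vectors could then be passed to any of the classifiers named in the abstract.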

Files in this item

Name: 311-319_Bolatbek_et_al_.pdf
Size: 368.1KB
Format: PDF

This item appears in the following collections

Research in Language (2025) vol. 23 [24]

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0