Azerbaijani young talent: From computer science to AI

Comparative Analysis of Machine Learning Algorithms for Sentiment Analysis of Text in Azerbaijani and English.

 Stories Matter (photo credit: PEXELS)

Introduction

We are pleased to present Mehdi Rasul, a talented teenager excelling in his 11th-grade studies within the A-Level program at LANDAU School in Azerbaijan. From a young age, he has been interested in Artificial Intelligence and Machine Learning, especially in Natural Language Processing and deep learning technologies. Over the years, Mehdi has become proficient in programming languages such as Python and JavaScript and in libraries such as NumPy. His exceptional academic performance is a testament to his unwavering quest for knowledge. Beyond the classroom, Mehdi's accomplishments extend to a diverse range of extracurricular activities, including community service, internships in various AI-related projects, active participation in MUN conferences, membership in the school newsletter, and the completion of the IT Essentials program at Cisco Networking Academy. In 2021 and 2022, Mr. Rasul won the competitions hosted by the "CyberMath Academy," and in 2022 he made a notable appearance in the American Mathematics Olympiad.

Mehdi's research project has already been featured on different platforms and published in esteemed scientific journals in both Azerbaijan and Switzerland. His research compares diverse machine learning algorithms for sentiment analysis of Azerbaijani and English texts. The study's findings underscore the pivotal role of national language corpora in achieving precise outcomes in natural language processing, particularly in sentiment analysis.

The article is provided below for your perusal.

Predicting the sentiment of text in different business spheres has been a challenging problem across languages. Sentiment analysis, also known as opinion mining, is an active area of research in natural language processing (NLP) and computational linguistics. It uses text analysis and classification methods to identify and extract subjective information such as opinions, emotions, and attitudes from text data. Sentiment analysis has many applications in business, politics, and social media monitoring. By applying sentiment analysis to textual data, companies can evaluate customer feedback, monitor their reputation, and forecast future user behavior, which can lead to better performance, greater efficiency, and increased profits.

CREDIT: Mehdi Rasul

Studies have shown that machine learning algorithms, particularly those using supervised learning and deep learning, produce satisfactory results in automating textual sentiment analysis. Recent research illustrates that sentiment analysis can be automated by combining computer science and mathematics, and that accuracy improves in the process.

A well-defined English language corpus for model training helps build powerful and highly accurate models for English texts. Given the extensive array of libraries offering diverse natural language processing techniques for English, an automated system for analyzing texts and extracting meaningful insights or summaries from input paragraphs is relatively easy to create. The abundance of computational linguistics libraries for English, combined with numerous natural language processing tools such as spelling correction, grammar checking, text generation, and word tokenization, enables the construction of highly accurate sentiment analysis models.

Although sentiment analysis is widely applied to major languages such as English, research on under-resourced languages like Azerbaijani is still relatively limited. Azerbaijani is the official language of the Republic of Azerbaijan and has distinct linguistic characteristics. However, it still lacks the labeled datasets and the sophisticated language corpus needed to train machine learning models for sentiment analysis. Building sentiment analysis for Azerbaijani is therefore uniquely challenging, because there are no built-in NLP techniques or language corpora designed specifically for the language.

This study presents a comprehensive comparison of machine learning algorithms, evaluated on how effectively they classify sentiment in Azerbaijani. The same algorithms were tested on the English version of the dataset to assess the importance of a built-in language corpus and the maturity of the NLP techniques currently available for Azerbaijani.

Recent studies have shown that machine learning techniques, such as supervised and deep learning algorithms, have contributed to significant advances in improving and automating sentiment analysis of textual data. Machine learning is also used to determine the sentiment of texts in politics. In 2016, Heredia et al. used political tweets to predict the 2016 U.S. election. The researchers collected three million location-based tweets related to Donald Trump and Hillary Clinton, trained a deep convolutional neural network (CNN) on them to predict the election results, and attained an accuracy score of 84%.

Sentiment analysis of texts in Azerbaijani is limited by the lack of labeled datasets and a language corpus. For other languages, the picture is different: in 2013, research by Neethu and Rajasree on sentiment analysis of Twitter using machine learning algorithms produced satisfactory results in classifying tweets into positive and negative classes.

Generally, machine learning algorithms perform well in classifying texts in both Azerbaijani and English. The major algorithms used in text classification problems are Support Vector Machines, Naïve Bayes, and Logistic Regression, which help to identify patterns in both the type and the sentiment of a text. Decision trees are supervised machine learning algorithms used for classification and regression tasks. Their goal is to create a model that predicts the value of a target variable by applying decision rules learned from the input features during training.
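To make the idea of decision rules learned during training more concrete, the following is a minimal sketch of a depth-one decision tree (a "decision stump") on word-presence features. It is not the model used in the study: the four reviews and labels are made up, the function names `train_stump` and `predict` are hypothetical, and the split is chosen by plain training accuracy rather than the Gini or entropy criteria that full decision tree implementations typically use.

```python
def train_stump(texts, labels):
    """Pick the single word whose presence best separates the two classes."""
    docs = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))
    best = None  # (accuracy, word, label_if_present, label_if_absent)
    for word in vocab:
        # Try both orientations of the rule "word present -> label".
        for present, absent in (("pos", "neg"), ("neg", "pos")):
            preds = [present if word in d else absent for d in docs]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, word, present, absent)
    return best


def predict(stump, text):
    """Apply the learned rule to a new text."""
    _, word, present, absent = stump
    return present if word in text.lower().split() else absent


# Made-up training data, for illustration only.
texts = ["good film", "good acting", "bad film", "bad plot"]
labels = ["pos", "pos", "neg", "neg"]
stump = train_stump(texts, labels)
```

On this toy data the learned rule is "if the review contains `bad`, predict negative; otherwise predict positive". A real decision tree would chain many such rules over a much larger vocabulary.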

In this research, various machine learning algorithms were used to predict the sentiment of movie reviews in both English and Azerbaijani, and the study reports the results achieved with each technique. TF-IDF and Bag of Words (BOW, also known as Count Vectorizer) were used to extract features from the texts, and these features were tested with different models, including Logistic Regression, Naïve Bayes, SVM, Decision Tree, Random Forest, AdaBoost, and XGBoost.
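As a rough illustration of the difference between the two feature extraction methods, here is a minimal sketch in plain Python. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); the study's actual settings are not given in this article, and libraries such as scikit-learn (`CountVectorizer`, `TfidfVectorizer`) use smoothed variants. The three reviews are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using tf * log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    vectors = []
    for vec in counts:
        total = sum(vec) or 1  # guard against empty documents
        vectors.append(
            [(c / total) * math.log(n_docs / df[j]) for j, c in enumerate(vec)]
        )
    return vocab, vectors


# Made-up reviews in English and Azerbaijani, for illustration only.
reviews = ["great film great cast", "boring film", "əla film"]
vocab, bow = bag_of_words(reviews)
_, weighted = tf_idf(reviews)
```

BOW simply counts words, so "film" gets a count of 1 in every review. TF-IDF gives "film" a weight of zero because it appears in all three documents and therefore carries no information for telling them apart, while rarer words such as "great" keep a positive weight.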

For the Azerbaijani version of the dataset, Logistic Regression and SVM produced better results than the other models when the TF-IDF feature extraction approach was used. The decision tree, by contrast, reached only 64% accuracy. With BOW feature extraction, Logistic Regression produced 84% accuracy, while SVM reached 82%.

The English version of the same dataset was also modeled, this time with preprocessing techniques that were not applied to the Azerbaijani texts, such as stemming and stopword removal. Overfitting was much less of an issue for the models trained on the English dataset. The highest score was attained by Logistic Regression with BOW feature extraction. Additionally, the SVM and Naïve Bayes algorithms performed well with TF-IDF feature extraction and achieved an accuracy score of 85%. The research indicates that the results are similar for the models in both languages.
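The preprocessing steps mentioned above, stopword removal and stemming, can be sketched roughly as follows. This is a toy version under stated assumptions: the stopword list and the three-suffix stemmer are invented for illustration and are far cruder than what established libraries provide (NLTK, for example, ships an English stopword list and a Porter stemmer). The article does not say which tools the study used.

```python
# Hypothetical, heavily abbreviated stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "this"}

# Toy suffix list; real stemmers apply many more rules.
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip basic punctuation, drop stopwords, then stem."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [stem(t) for t in tokens if t and t not in STOPWORDS]


cleaned = preprocess("The acting was amazing and the scenes looked stunning!")
# cleaned == ["act", "amaz", "scene", "look", "stunn"]
```

Steps like these shrink the vocabulary and merge word variants before feature extraction. No comparable ready-made resources exist for Azerbaijani, which is why these steps were applied only to the English texts.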

Python has rich libraries for building language models in English, including stopword lists and tools for word tokenization, stemming, and lemmatization. The English language corpus is well developed, which helps achieve better results. Azerbaijani, by contrast, lacks such a corpus, which makes it harder to build models that generalize well. Additional resources should be dedicated to researching and creating a language corpus for Azerbaijani, which would help attain greater accuracy.

This article was written in cooperation with Mehdi Rasul