For one of our projects, we were asked to extract user posts and messages from social media sites such as Facebook and Twitter and to analyze the sentiment of each one. The idea of sentiment analysis is to take the text content of a user's post and determine from it whether the content is positive, negative, or neutral.
Since no implementation of Malay sentiment analysis was available at the time, our team was tasked with developing an AI-based analytics tool for analyzing the sentiment of social media posts and messages.
To achieve this, our first strategy was to make use of existing open source algorithms and code available for English sentiment analysis. We found several open source implementations that can readily analyze English text. We tested a number of them and found that the best were those that use deep learning as the core algorithm.
Our next step was to reuse the same code and retrain the algorithms on Malay text. This seemed an ideal approach at first, but the task turned out to be non-trivial.
This motivated us to develop Malay-language sentiment analysis from scratch. We realized that the existing deep learning models were not fit for purpose because we did not have millions of examples. So we started with non-neural-network approaches and only simple text features such as Bag-of-Words, TF-IDF, and the number of words in each message. We then extended these with features extracted using the Malay NLTK. We further employed feature engineering techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to remove or introduce features that could improve accuracy. Subsequently, we also applied semi-supervised techniques based on cosine similarity (with normalized TF and TF-IDF) and Jaccard similarity. However, these approaches turned out not to be sufficient for good accuracy.
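As a rough illustration of the similarity measures mentioned above, the sketch below computes TF-IDF cosine similarity and Jaccard similarity between short messages using only the Python standard library. The tokenizer, the smoothed IDF formula, and the three example messages are assumptions made for illustration; they are not taken from our production code.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real social media text needs more care.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant) keeps very common terms non-zero.
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    # Both vectors are L2-normalized, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def jaccard(a, b):
    # Shared tokens divided by all distinct tokens in the two messages.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Hypothetical messages, invented for this example.
docs = [
    "filem ni memang best",    # positive
    "filem ni memang teruk",   # negative
    "cerita ni best sangat",   # positive
]
vecs = tfidf_vectors(docs)
print("pos vs neg:", cosine(vecs[0], vecs[1]), jaccard(docs[0], docs[1]))
print("pos vs pos:", cosine(vecs[0], vecs[2]), jaccard(docs[0], docs[2]))
```

In this toy data the first positive message is closer to the negative one than to the other positive message, simply because they share more surface words. That is the kind of behavior that limits purely similarity-based approaches.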
Understand complex linguistic structures. At first glance, the following statement seems negative: "Kadang2 kalau watak Rihanna tuh tak ade pun tak apa. tak memberikan sebarang kesan...". By default, our sentiment analysis program categorizes it as highly negative. However, the full sentence ends as follows: "... tak memberikan sebarang kesan sbb citer nih mmg dah best.", and its last six words imply that the statement is in fact positive. Our sentiment analytics would still classify such a statement as negative, because most of its words weigh towards negative sentiment. Failing to handle such statements has a significant impact on accuracy, as we found that a high percentage of the statements posted by users on Twitter and Facebook are of this kind. Therefore, we cannot rely on a simple weighted ratio over the number of words. To resolve this issue, we had to apply different weighted ratios to words that are more prominent in the Positive, Negative, or Neutral category. The TF-IDF approach, however, does not take this into account. We therefore introduced several new features, such as a positive/negative/neutral ratio for each word or combination of words. For example, "suka" is classified as positive since it has a high positive weight ratio, but the combination of "tak" and "suka", which makes up "tak suka", is classified as negative based on its high negative weight ratio. The algorithm must take into account word combinations and not just single words. Further, going back to our previous example, the algorithm must give significantly higher weight to "mmg dah best" than to the remaining words, "Kadang2 kalau watak Rihanna tuh tak ade pun tak apa. tak memberikan sebarang kesan", despite their higher word count.
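The per-word and per-combination ratios described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual feature code: the training sentences are invented, only unigrams and bigrams are counted, and the rule that a known bigram overrides its component words (with a weight equal to its length) is an arbitrary choice made for the example.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")

def ngrams(tokens, n_max=2):
    # Yield unigrams and bigrams as tuples.
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def class_ratios(labeled_texts):
    """Map each n-gram to the share of its occurrences under each label."""
    counts = defaultdict(Counter)
    for text, label in labeled_texts:
        for gram in ngrams(text.lower().split()):
            counts[gram][label] += 1
    return {g: {lab: c[lab] / sum(c.values()) for lab in LABELS}
            for g, c in counts.items()}

def score(text, ratios):
    """Sum label ratios; a known bigram takes precedence over single words."""
    tokens = text.lower().split()
    totals = Counter()
    i = 0
    while i < len(tokens):
        bigram = tuple(tokens[i:i + 2])
        if len(bigram) == 2 and bigram in ratios:
            gram, step = bigram, 2
        else:
            gram, step = (tokens[i],), 1
        for label, ratio in ratios.get(gram, {}).items():
            # Assumed weighting: longer combinations count for more.
            totals[label] += ratio * len(gram)
        i += step
    # Fall back to neutral when no known word or combination was found.
    return totals.most_common(1)[0][0] if totals else "neutral"

# Hypothetical labeled examples, invented for this sketch.
train = [
    ("saya suka filem ni", "positive"),
    ("suka sangat", "positive"),
    ("memang suka", "positive"),
    ("tak suka langsung", "negative"),
    ("filem ni tayang esok", "neutral"),
]
ratios = class_ratios(train)
print(ratios[("suka",)])            # mostly positive
print(ratios[("tak", "suka")])      # entirely negative
print(score("suka", ratios), score("tak suka", ratios))
```

Here "suka" on its own leans positive, while "tak suka" is scored from the bigram's ratio and comes out negative, which is the distinction that single-word TF-IDF features miss.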
Identify short forms, misspellings, and slang. Short forms, misspellings, and slang are very common in social media posts. In fact, we found that over 70% of the text was written in short forms or contained important words that were misspelled. Consider the following statement, where most words are either misspelled or written in short form: "filem mcm ni wajib tgk wyg...nmpk bmutu lakonan n phayatan..ptot filem2 mcm ni pecah panggung..bukan mcm ombak rindu tu...lol". Even in our small set of Malay examples, we observed that most people like to write in short forms and that they often misspell the very words that contribute most to the sentiment. Consider another statement: "nyesal ak tgk citer ni...........sanggup ak datang ke kl nk tgk citer ni". Here "aku" is misspelled as "ak", and "tgk" is a short form of "tengok". The same applies to "cerita", which is written as "citer". Our analytics must therefore understand that "aku" and "ak", or "tgk" and "tengok", have the same meaning. Unfortunately, the Malay examples in the training data were not sufficient for the machine to learn that these words mean the same thing. To resolve this issue, we implemented our own Malay spell checker. We extracted each word from the dataset and generated thousands of possible artificial misspellings. We then trained the machine to learn those misspelled words. For example, the word "yg" returns "yang" and the word "giler" returns "gila", which proved useful in helping our machine identify that those words have the same meaning. Applying this alone improved the accuracy tremendously.
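To make the idea of artificial misspellings concrete, here is a simplified sketch. It swaps in a plain lookup table for the trained spell checker described above, and the three corruption rules (vowel dropping, single-character deletion, adjacent swaps), the small vocabulary, and the function names are all assumptions chosen for illustration. The curated entries reuse the short forms quoted in the text.

```python
VOWELS = set("aeiou")

def misspellings(word):
    """Generate a few artificial variants of a correctly spelled word."""
    if not word:
        return set()
    variants = set()
    # Drop every vowel after the first letter, e.g. "tengok" -> "tngk".
    variants.add(word[0] + "".join(c for c in word[1:] if c not in VOWELS))
    # Delete one character at a time, e.g. "aku" -> "ak".
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])
    # Swap adjacent characters, e.g. "yang" -> "yagn".
    for i in range(len(word) - 1):
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    variants.discard(word)
    variants.discard("")
    return variants

def build_lookup(vocabulary, curated=None):
    """Map each unambiguous variant back to its correctly spelled word."""
    vocab = set(vocabulary)
    lookup, ambiguous = {}, set()
    for word in sorted(vocab):
        for variant in misspellings(word):
            if variant in vocab:
                continue  # a valid word is never treated as a misspelling
            if variant in lookup and lookup[variant] != word:
                ambiguous.add(variant)  # two words produce the same variant
            lookup[variant] = word
    for variant in ambiguous:
        del lookup[variant]
    # Hand-curated short forms take priority over generated ones.
    lookup.update(curated or {})
    return lookup

def normalize(text, lookup):
    return " ".join(lookup.get(tok, tok) for tok in text.lower().split())

# Toy vocabulary and curated short forms for this sketch.
vocabulary = ["aku", "kau", "tengok", "cerita", "yang", "gila"]
curated = {"tgk": "tengok", "citer": "cerita", "yg": "yang", "giler": "gila"}
lookup = build_lookup(vocabulary, curated)
print(normalize("nyesal ak tgk citer ni", lookup))
```

Note that rule-generated variants alone do not cover forms such as "yg" or "citer", which is why the sketch also accepts a curated mapping; a learned model would be expected to generalize further than this table does.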
Differentiate sentiments in both English and Malay. Another issue we faced was mixed-language content, where some posts were written in English and some in Malay. When both languages were mixed together for training, the accuracy on unseen text messages turned out to be very low. We therefore decided to train separate models for English and Malay to avoid noise in the data. The strategy is to detect the language before any prediction is made: our analytics first detects which language a message is written in and, once the language has been identified, feeds the message to the corresponding sentiment analysis.
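The detect-then-route flow can be sketched as below. The hint-word lists, the "unknown" fallback, and the two placeholder models are assumptions made for this example; a real language detector would be considerably more robust than counting a handful of common words.

```python
# Small, hand-picked hint words (assumed for illustration only).
MALAY_HINTS = {"yang", "dan", "tak", "ni", "itu", "saya", "aku",
               "dengan", "untuk", "sangat"}
ENGLISH_HINTS = {"the", "and", "is", "this", "that", "i", "with",
                 "for", "very", "not"}

def detect_language(text):
    """Guess the language by counting hint words; ties are 'unknown'."""
    tokens = text.lower().split()
    malay = sum(t in MALAY_HINTS for t in tokens)
    english = sum(t in ENGLISH_HINTS for t in tokens)
    if malay == english:
        return "unknown"
    return "malay" if malay > english else "english"

def classify(text, models):
    """Route the message to the model trained for its detected language."""
    lang = detect_language(text)
    model = models.get(lang)
    if model is None:
        return lang, None  # no prediction when the language is unclear
    return lang, model(text)

# Placeholder models standing in for separately trained classifiers.
models = {
    "malay": lambda t: "positive" if "best" in t.split() else "neutral",
    "english": lambda t: "positive" if "good" in t.split() else "neutral",
}

print(classify("filem ni memang best", models))
print(classify("this movie is very good", models))
print(classify("lol", models))
```

Keeping detection as a separate step means each sentiment model only ever sees text in the language it was trained on.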
Capture different hashtags and keywords related to the domain. When examining a list of misclassified messages, we realized that our sentiment analysis was not able to identify statements made in a specific context. Not all text is the same: some industries have lingo and jargon that might mean something else in another context. For example, in the context of Malay movie reviews, terms such as "bermutu lakonan", "karyawan", and "CGI" are very common. In the context of public transport, on the other hand, "perkhidmatan", "pengangkutan", "KLI Express", "LRT", and so on are commonly used. To solve this issue, we added crafted features to our machine learning algorithms so that they can learn information for the different domains. To ensure that the crafted features contain keywords relevant to each domain, we also developed our own custom word embeddings for the Malay language, for which we extracted related words and vocabulary from online articles in Berita Harian, Utusan, and other Malaysian websites.
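One way to picture how word embeddings can feed domain features is the sketch below: seed keywords are expanded with their nearest neighbors in embedding space, and each message then gets one feature per domain. The two-dimensional vectors are made up purely for the example and are not taken from our embeddings; the threshold and function names are likewise assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_lexicon(seeds, embeddings, threshold=0.8):
    """Add every word whose vector is close to at least one seed word."""
    expanded = set(seeds)
    for word, vec in embeddings.items():
        if any(s in embeddings and cosine(vec, embeddings[s]) >= threshold
               for s in seeds):
            expanded.add(word)
    return expanded

def domain_features(text, lexicons):
    """Fraction of tokens in the message that belong to each domain lexicon."""
    tokens = text.lower().split()
    n = len(tokens) or 1
    return {f"domain_{name}": sum(t in lex for t in tokens) / n
            for name, lex in lexicons.items()}

# Invented toy vectors; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "lakonan": (0.9, 0.1),
    "pelakon": (0.8, 0.2),
    "filem": (0.85, 0.15),
    "perkhidmatan": (0.1, 0.9),
    "lrt": (0.2, 0.8),
    "tren": (0.15, 0.85),
}

lexicons = {
    "movie": expand_lexicon({"lakonan"}, TOY_EMBEDDINGS),
    "transport": expand_lexicon({"perkhidmatan"}, TOY_EMBEDDINGS),
}
print(lexicons)
print(domain_features("lakonan pelakon filem ni mantap", lexicons))
```

The resulting per-domain values can be appended to the other features, giving the classifier a signal about which context a message belongs to.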
Finally, using our newly crafted features from the custom word embeddings as well as the automated spell checker, we explored different machine learning techniques, including state-of-the-art Gradient Boosting, Random Forest, and deep learning models. For deep learning, our team explored almost all of the recently implemented models, such as Content Attention Model, Recurrent Attention Network on Memory, Aspect Level Sentiment Classification, Interactive Attention Networks, and several others. We then combined our best models into many different ensemble variations to improve the accuracy further. In the end, after exploring and experimenting with many different techniques and algorithms, we were able to achieve the accuracy we were aiming for.
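As one possible way to combine models, the sketch below uses weighted soft voting: each model's class probabilities are averaged and the highest average wins. Soft voting is an assumption here, since the text above does not specify which ensembling scheme was used, and the three probability dictionaries are invented stand-ins for real model outputs.

```python
from collections import Counter

def soft_vote(probabilities, weights=None):
    """Average per-class probabilities across models and pick the top class."""
    weights = weights or [1.0] * len(probabilities)
    totals = Counter()
    for probs, w in zip(probabilities, weights):
        for label, p in probs.items():
            totals[label] += w * p
    total_weight = sum(weights)
    averaged = {label: s / total_weight for label, s in totals.items()}
    return max(averaged, key=averaged.get), averaged

# Hypothetical outputs of three models for a single message.
gradient_boosting = {"positive": 0.6, "negative": 0.3, "neutral": 0.1}
random_forest = {"positive": 0.4, "negative": 0.5, "neutral": 0.1}
neural_network = {"positive": 0.7, "negative": 0.2, "neutral": 0.1}

label, averaged = soft_vote([gradient_boosting, random_forest, neural_network])
print(label, averaged)
```

Different ensemble variations can then be tried by changing which models are included or by passing `weights`, for example to favor the model with the best validation accuracy.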
A stable production version of our Malay sentiment analysis is now available here.