In Table 10, the similarities are fairly large because we used all the similarities between a cluster and a term. In this section, the terms used in producing a coefficient are restricted to those within the cluster, as shown in (33).
$$r_{i,j} = \begin{cases} g(C_i, w_j), & \text{for } w_j \in C_i,\\ 0, & \text{for } w_j \notin C_i. \end{cases} \qquad (33)$$
To generate the cluster-term relation matrix, we can use the following equation:

$$R = ((M\,S) \circ M) \circ N \qquad (34)$$

The result of using R is shown in Table 12. Alternatively, we might be able to use simply R = M; that result is shown in Table 13.
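As a minimal illustration of Eq. (33), the restricted coefficients can be computed by masking a precomputed cluster-term similarity matrix with the cluster-membership indicator. The matrix `G` below (holding the g(C_i, w_j) values) and all of its entries are hypothetical, not data from the paper:

```python
import numpy as np

# Hypothetical data: 2 clusters over 4 terms.
# G[i, j] = g(C_i, w_j): similarity between cluster i and term j,
# assumed precomputed by whatever similarity g the method uses.
G = np.array([[0.9, 0.8, 0.2, 0.1],
              [0.3, 0.2, 0.7, 0.6]])

# M[i, j] = 1 if term w_j belongs to cluster C_i, else 0.
M = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])

# Eq. (33): keep g(C_i, w_j) only for terms inside the cluster,
# zero elsewhere -- an element-wise (Hadamard) mask.
R = G * M

assert R[0, 0] == 0.9   # w_0 is in C_0, so the similarity is kept
assert R[0, 2] == 0.0   # w_2 is not in C_0, so the coefficient is zeroed
```

The same masking idea underlies Eq. (34), where the element-wise product with M restricts the coefficients to cluster members.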
Table 9 Words in each concept by WordNet concepts

Concept | Words in concept
Whole | animal, boat, claw, engine, fur, leg, mouse, number, pet, road, seat, tail, thing, vehicle, water, wheel
Piece | claw, leg, seat, tail
Engine | engine
Substance | fur, water
Component | fur, place, water
Form | leg, tail
Solid | leg
Property | leg, number, place
Event | leg, number, pet, place, road, thing
State | mouse, pet, place, thing
Individual | mouse, pet, tail
definite_quantity | number
Signaling | number
social_group | number, people
Indication | number
Collection | number
People | people
Citizenry | people
Masses | people
Location | place, seat
Knowledge | place, seat, thing
auditory_communication | place
written_communication | place
body_of_water | sea, water
indefinite_quantity | sea
Activity | sea
Portion | seat
Point | tail
Content | thing
Something | thing
Stuff | thing
Thing | thing
Substance | vehicle, water
Vehicle | vehicle
Fluid | water
Quality | wheel
Table 10 Similarities by word-clustering with WordNet (7 clusters)
‘dog’ ‘cat’ ‘horse’ ‘car’ ‘bus’
‘cat’ 0.904
‘horse’ 0.918 0.678
‘car’ 0.789 0.963 0.519
‘bus’ 0.960 0.778 0.979 0.657
‘ship’ 0.884 0.968 0.695 0.961 0.809
Table 11 Cluster members with WordNet (7 clusters)
Cluster no. Terms
1 Fur, number, place, road, sea, seat, thing, water
2 People
3 Animal, mouse, pet, tail
4 Boat, claw, engine, leg, vehicle, wheel
5 Be, carry, guard, keep, use
6 Drive, ride, travel
7 Catch, hunt, pull
Table 12 Similarities by word-clustering with R (7 clusters)
‘dog’ ‘cat’ ‘horse’ ‘car’ ‘bus’
‘cat’ 0.797
‘horse’ 0.811 0.523
‘car’ 0.344 0.402 0.403
‘bus’ 0.491 0.264 0.829 0.553
‘ship’ 0.301 0.323 0.550 0.858 0.817
Table 13 Similarities by word-clustering with M (7 clusters)
‘dog’ ‘cat’ ‘horse’ ‘car’ ‘bus’
‘cat’ 0.740
‘horse’ 0.876 0.519
‘car’ 0.322 0.451 0.363
‘bus’ 0.596 0.329 0.827 0.578
‘ship’ 0.289 0.379 0.457 0.892 0.773
These results are quite good from our point of view, because the similarities among the animals are high, as are those among the transporters (including ‘horse’), with the exception of the similarity between ‘horse’ and ‘car’.
7 Conclusion
Using word-clustering based on the semantic distance derived from WordNet, we transformed the term-document matrix into a cluster-document matrix. This transformation is a dimension reduction of the document vectors. We demonstrated this method by calculating the similarities of small documents.
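The dimension reduction described above can be sketched as a single matrix product; the sizes and the binary cluster-term relation matrix used here are illustrative assumptions, not the paper's actual data:

```python
import numpy as np

# Hypothetical sizes: 6 terms, 3 documents, 2 clusters.
rng = np.random.default_rng(0)
D = rng.random((6, 3))                     # term-document matrix (n x d)
R = np.array([[1, 1, 1, 0, 0, 0],          # cluster-term relation matrix (k x n);
              [0, 0, 0, 1, 1, 1]], float)  # binary membership, for simplicity

# Cluster-document matrix: each document becomes a k-dimensional vector.
C = R @ D
assert C.shape == (2, 3)   # dimension reduced from 6 terms to 2 clusters
```

Document similarities can then be computed between the k-dimensional columns of C instead of the original n-dimensional term vectors.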
Unfortunately, we have no quantitative evaluation, but the results of the experiments suggest that this method is advantageous for a small set of documents. To show that the method is also applicable to a large set, further examination is needed.
We can also use Word2Vec [10, 11] to calculate the distance between words. Although Word2Vec needs a large collection of documents to train a model, the use of a pre-trained model can be separated from training it; for example, we can use a pre-trained GloVe model [12]. The use of Word2Vec and its verification are left for future work.
References
1. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Informat. Sci. 41(6), 391–407 (1990)
2. Moravec, P., Kolovrat, M., Snášel, V.: LSI vs. Wordnet Ontology in Dimension Reduction for Information Retrieval. Dateso, 18–26 (2004)
3. Miller, G.: WordNet: An Electronic Lexical Database. MIT Press (1998)
4. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic relatedness. Computat. Linguist. 32(1), 13–47. MIT Press (2006)
5. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pp. 113–138 (1994)
6. https://dictionary.cambridge.org/
7. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. New Methods in Language Processing, 154–164 (2013). http://www.cis.uni-muenchen.de/%7Eschmid/tools/TreeTagger/
8. Bond, F., Baldwin, T., Fothergill, R., Uchimoto, K.: Japanese SemCor: a sense-tagged corpus of Japanese. In: Proceedings of the 6th Global WordNet Conference (GWC 2012), pp. 56–63 (2012). http://compling.hss.ntu.edu.sg/wnja/index.en.html
9. Tian, Y., Lo, D.: A comparative study on the effectiveness of part-of-speech tagging techniques on bug reports. In: 2015 IEEE 22nd International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 570–574 (2015)
10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 (2013)
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. Advanc. Neural Informat. Process. Syst. 26, 3111–3119 (2013)
12. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014). https://nlp.stanford.edu/projects/glove/
Language
Thiri Marlar Swe and Phyu Hninn Myint
Abstract In recent years, social media has emerged and become widely used all over the world. Social media users express their emotions by posting statuses to share with friends. Analyzing people's emotions has become popular in many application areas, so many researchers have proposed emotion detection systems based on a lexicon-based approach. Researchers create emotion lexicons in their own languages to apply in emotion detection systems. No such lexicon is available for detecting the emotions of Myanmar social media users; therefore, a new word-emotion lexicon based on the Myanmar language needs to be created. This paper describes the creation of a Myanmar word-emotion lexicon, M-Lexicon, which covers six basic emotions: happiness, sadness, fear, anger, surprise, and disgust. Facebook statuses written in Myanmar are collected and segmented, and the words in M-Lexicon are obtained by applying a stop-word removal process. Finally, matrices, Term Frequency–Inverse Document Frequency (TF-IDF), and unity-based normalization are used in the lexicon creation. Experiments show that over 70% of the words in M-Lexicon are correctly associated with the six basic emotions.
Keywords Word-emotion lexicon · M-Lexicon · TF-IDF · Unity-based normalization
1 Introduction
Social media has become the most popular communication medium among users. People spend a lot of time on social media posting texts, images, and videos to express their emotions and to get up-to-date news and other information of interest. The opinions and emotions of social media users become important in many situations such as

T. M. Swe · P. H. Myint (B)
University of Computer Studies, Mandalay, Myanmar
e-mail: phyuhninmyint@ucsm.edu.mm

T. M. Swe
e-mail: thirimarlarswe@gmail.com
© Springer Nature Switzerland AG 2020
R. Lee (ed.), Big Data, Cloud Computing, and Data Science Engineering, Studies in Computational Intelligence 844, https://doi.org/10.1007/978-3-030-24405-7_11
elections, medical treatment, etc. In order to identify a user's emotional situation from social media, many researchers have proposed text-based emotion detection systems using a wide range of techniques. Some researchers propose and create emotion lexicons to detect emotional status from text in their own languages, such as English, French, Arabic, Chinese, and Korean.
For Myanmar, there is no emotion lexicon that can be applied to detect emotion words in Myanmar social media users' statuses. Language translation tools and existing emotion lexicons in other languages can be used for the Myanmar language, but this is not sufficient for a practical system. In addition, an emotion lexicon can be built manually, but that is time-consuming. Therefore, a new word-emotion lexicon for the Myanmar language, namely M-Lexicon, is created programmatically. This lexicon contains six basic emotions: happiness (joy), sadness, fear, anger, surprise, and disgust, as defined by Ekman [1–3].
In this paper, the creation process of M-Lexicon is presented. The lexicon is created with words from a social medium, Facebook. Although Facebook statuses come in different forms, such as text, video, and image statuses, two kinds of status are used in the lexicon creation process: Myanmar text-only statuses and statuses with feelings.

Before creating M-Lexicon, the Facebook status text is segmented in order to extract words from it. The segmentation process has two phases: syllable segmentation and syllable merging. The segmented status may contain stop words, which are unnecessary for the creation of M-Lexicon, so stop words also need to be removed from the segmented status. The emotion lexicon is then created using matrices and the preprocessed words, or terms. The Term Frequency–Inverse Document Frequency (TF-IDF) scheme is applied in the matrix computation process, and finally unity-based normalization is performed on the matrix values to adjust them to the range [0, 1].
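A minimal sketch of the TF-IDF and unity-based (min-max) normalization steps, applied to a toy two-emotion corpus; the words, labels, and counts are invented for illustration and are not from the paper's data:

```python
import math

# Toy "documents": one bag of (already segmented, stop-word-filtered)
# words per emotion category.  Contents are purely illustrative.
docs = {
    "happiness": ["smile", "win", "smile"],
    "sadness":   ["cry", "lose"],
}

vocab = sorted({w for words in docs.values() for w in words})
n_docs = len(docs)

# TF-IDF: tf = raw count in the document, idf = log(N / df),
# where df is the number of documents containing the term.
df = {w: sum(w in words for words in docs.values()) for w in vocab}
tfidf = {
    emo: {w: words.count(w) * math.log(n_docs / df[w]) for w in vocab}
    for emo, words in docs.items()
}

# Unity-based (min-max) normalization: rescale all scores into [0, 1].
scores = [v for row in tfidf.values() for v in row.values()]
lo, hi = min(scores), max(scores)
norm = {
    emo: {w: (v - lo) / (hi - lo) for w, v in row.items()}
    for emo, row in tfidf.items()
}

assert all(0.0 <= v <= 1.0 for row in norm.values() for v in row.values())
```

After normalization, each word carries a score in [0, 1] per emotion category, which is the kind of value an entry of the word-emotion lexicon can hold.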
The rest of this paper is organized as follows. Section 2 presents the related works. Section 3 describes the Myanmar language processing that needs to be performed before M-Lexicon creation. Section 4 explains how to create M-Lexicon. Section 5 discusses the experimental results, followed by the conclusion in Sect. 6.
2 Related Works
Emotion lexicons are constructed with different methodologies and for different languages. Some researchers have proposed automatic methods, while others generate lexicons manually or translate existing lexicons into their own languages.
Bandhakavi et al. [4] proposed a set of methods to automatically generate a word-emotion lexicon from a corpus of emotional tweets. They used term frequencies for learning lexicons, along with more sophisticated methods based on conceptual models of how tweets are generated. Their methods obtain significantly better scores. However, they work only on short text representations and do not achieve better results in other domains, such as blogs and discussion forums, where the texts are larger than tweets.
Many researchers have used automatic methods to assist the lexicon-building procedure. Their works commonly rank words by calculating similarity values against a set of seed words; the most similar words are added to the final lexicon or used as additional seed words. Xu et al. [5] adopted a graph-based algorithm to build Chinese emotion lexicons, incorporating multiple resources to improve the quality of the lexicons and save human effort. They used the graph-based algorithm to rank words automatically according to a few seed emotion words. Although they built lexicons for five emotions (happiness, anger, sadness, fear, surprise), the resulting lexicons are small and their quality is limited.
Do et al. [6] presented a method to automatically build fine-grained emotion lexicons for the Korean language from an annotated Twitter corpus, without using other existing lexical resources. Their proposed method, weighted tweet frequency (weighted TwF), aggregates tweets with the same emotion label into one document, producing six documents as a result. They then calculate the weighted TwF value for each term that appears in each of the six documents; the higher the TwF value, the stronger the corresponding emotion. Their classification performance can be improved by adding several fine-grained features, as they suggested.
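The aggregation step behind tweet frequency can be sketched as follows. This is a simplified, unweighted variant; the labels and texts are invented, and the actual weighted TwF of Do et al. adds a weighting scheme on top of this aggregation:

```python
from collections import Counter, defaultdict

# Toy annotated tweets: (emotion label, text).  Purely illustrative.
tweets = [
    ("joy", "great day great win"),
    ("joy", "happy happy day"),
    ("sad", "bad day"),
]

# Step 1: aggregate all tweets sharing a label into one document per emotion.
docs = defaultdict(list)
for label, text in tweets:
    docs[label].append(text.split())

# Step 2: tweet frequency per term -- in how many of the label's tweets
# does the term occur at least once?
twf = {
    label: Counter(w for words in tweet_lists for w in set(words))
    for label, tweet_lists in docs.items()
}

assert twf["joy"]["day"] == 2   # 'day' occurs in both joy tweets
assert twf["sad"]["bad"] == 1
```

A term with a high frequency within one emotion's aggregated document, relative to the others, is taken as a signal of that emotion.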
Mohammad et al. [7] utilized crowdsourcing to build a large, high-quality word-emotion and word-polarity association lexicon quickly and inexpensively. Their lexicon, the NRC word-emotion association lexicon, also known as EmoLex, assigns 14,182 words to eight emotion categories [8] and two sentiments (positive and negative). Their lexicon mostly concerns annotations for English words. Each word in the lexicon is assigned to one or more emotion categories, annotated manually by Mechanical Turk users.
3 Myanmar Language
This section gives an introduction to the Myanmar language and explains the word-segmentation and stop-word removal processes. These processes need to be performed before creating M-Lexicon; the detailed lexicon creation process is described in the next section.