VIETNAM NATIONAL UNIVERSITY, HANOI
Advisors: PhD Kim Dinh Thai, PhD Ha Manh Hung
Team leader: Nguyen Duc Quang Anh
Ha Noi, April 15, 2024
INFORMATION OF THE GROUP LEADER
- Program: Applied Information Technology
- Address: Nam Thanh, Ninh Binh
- Phone no./Email: 0948396862 / anhnd@vnuis.edu.vn
II Academic results
III Other achievements
- Semi-finalist at the 25th Euréka Student Scientific Research Award
- Certificate of Merit from the Youth Union for achievements in Youth Month 2023
- Exchange delegate at the 9th Global Young Parliamentarians Conference
- Certificate of Merit for contributions in Learning and Scientific Research in 2022 - 2023
- Certificate of outstanding contributions in the two competitions FUTURE BANKER 2023 and THE TECH TALENT HUNT 2023, organized by the Center for Forecasting and Human Resource Development - Vietnam National University
On behalf of the Speech Recognition group, we would like to express our deep thanks to PhD Kim Dinh Thai, PhD Ha Manh Hung, and Mr. Chinh. Thank you for guiding us throughout the research process. Thanks to the knowledge and experience that our teachers and Mr. Chinh shared, we have learned a great deal during the study period.
The teachers are not only experts in their field; they are also companions, always ready to help us overcome challenging difficulties throughout the study process. We must also mention Mr. Chinh: although we began as strangers, he accompanied us throughout the entire study period. We appreciate the dedication and care that everyone has shown us, and we wish our teachers and Mr. Chinh health, happiness, and success in life.
Thank you for everything you have done for us; we look forward to working with you on many future projects.
Students: Nguyen Thi Thu Hien, Nguyen Duc Quang Anh
TABLE OF CONTENTS
IV.3.1 Dealing with missing data 40
IV.4.3 One Dimensional Attention CNN-LSTM Model 45
LIST OF FIGURES
Figure 1: Spectrum of a vowel in five different emotions 13
Figure 2: A flat diagram of the two-dimensional arousal-valence emotional model
Figure 4: Convolutional Neural Network model 32
Figure 13: VNEMOS Emotions Data length Distribution 40
Figure 17: Web Voice Recognition Receiver and Analysis Interface 51
Figure 24: Examples of emotion classification results 55
LIST OF TABLES
Table 1: Statistics on the number of people using antidepressants in Taiwan from the Ministry of Health and Welfare Bureau of Statistics 19
Table 2: Corpus characteristics: 404 agent-client dialogs of around 10 hours, 19k speaker turns
Table 3: Details of Emotional Oriya speech database 26
Table 4: Characteristics of other popular Emotional Speech Databases 26
Table 6: Key functional overview of deep learning architecture 34
Table 7: Two Dimensional CNN Model network parameters 43
Table 10: 2D CNN Model Confusion matrix classifying in EmoDB 47
Table 11: 2D CNN Model Confusion matrix classifying in VNEMOS 47
Table 12: 1D CNN-LSTM Model Confusion matrix classifying in EmoDB 47
Table 13: 1D CNN-LSTM Model Confusion matrix classifying in VNEMOS 47
Table 14: 1D Attention CNN-LSTM Model Confusion matrix
LIST OF ABBREVIATIONS
ANN Artificial Neural Network
DAIC-WOZ Distress Analysis Interview Corpus Wizard-of-Oz
EECE Electronic, Electrical, and Computer Engineering
HUMAINE Human-Machine Interaction Network on Emotion
IEMOCAP Interactive Emotional Dyadic Motion Capture
LPCCs Linear Prediction Cepstral Coefficients
LSTM Long Short-Term Memory
MFCCs Mel Frequency Cepstral Coefficients
MLP Multi-layer Perceptron
NLP Natural language processing
RNN Recurrent Neural Network
PHQ-8 Eight-item Patient Health Questionnaire depression scale
ReLU Rectified Linear Unit
SER Speech Emotion Recognition
It is complicated to determine whether a person is depressed because the symptoms of depression are not apparent. However, the voice can be one of the factors from which we can recognize signs of depression: people who are depressed express discomfort and sadness; they may speak slowly and tremulously, and lose emotion in their voices. In this research, deep learning is applied to detect emotions. By analyzing the audio signal of speech, our deep learning models can detect different emotional states. The research delves into the computational steps involved in implementing a deep learning architecture, focusing on a model structured on the Convolutional Neural Network (CNN), and introduces a Vietnamese speech emotion dataset. The dataset is the result of researching, testing, and filtering 250 emotional segments from movies, series, and live shows, divided equally among five basic human emotional states: anger, happiness, sadness, neutral, and fear. VNEMOS is approximately 30 minutes long in total. In experiments, the model achieved impressive recognition performance with a highest accuracy of 89%.
Keywords
Depression, CNN, Deep Learning, Vietnamese, SER
Nguyễn Đức Quang Anh 22070306 AIT2022A Applied Information Technology 2nd year
Nguyễn Quỳnh Chi 23070464 ISEL2023A Industrial Systems Engineering and Logistics 1st year
Nguyễn Thị Thu Hiền 22070073 AIT2022B Applied Information Technology 2nd year
Vũ Quân 22071105 AIT2022B Applied Information Technology 2nd year
Đỗ Xuân Minh Đức 22070047 AIT2022A Applied Information Technology 2nd year
3 Structure
The first chapter, "Introduction," serves as the foundation for the entire research. In this section, we provide the rationale for why recognizing depression tendencies from speech signals is an important topic. Additionally, this chapter raises research questions, clearly identifies the motivation and objectives of the study, and introduces the research methodology we employ to achieve our objectives.
In the following chapter, we apply the psychological basis of depression and the cycle model to voice recognition. We investigate the psychological components of depression, such as symptoms and classification, as well as the cycle model; this serves as the theoretical foundation for our research. We discuss what speech recognition is and how we utilize deep learning to detect early indicators of depression through speech, and we investigate speech recognition technologies, ranging from artificial neural networks to recurrent neural networks, as well as other related research in the field.
In Chapter 3, we explain our study methodology and give an overview of the deep learning methods we use in our research, together with some related reference methods.
In Chapter 4, we explain how the architecture and models are used, culminating in the model training process, and describe the phases of data gathering, introducing the datasets that we use. There are two main datasets: EmoDB and VNEMOS.
Moving on to Chapter 5, we present the research results and discuss noteworthy findings based on the previously defined research questions and assumptions. We also examine how the model is applied in real time: we designed a web application that displays the findings, provides a percentage table showing how the input maps onto the five main categories of emotions, and presents tables and charts of the input sound in the most plain and intuitive way.
Finally, Chapter 6 focuses on the discussion of the factors identified in the results section. We propose methods for model improvement where necessary and summarize the key points of the study, along with its limitations. Furthermore, we suggest future research directions.
4 Contributions
(1) In this research we have developed a Vietnamese speech emotion dataset named VNEMOS, containing approximately 30 minutes of audio with five emotion classes (anger, happiness, sadness, neutral, and anxiety), on which we achieved an accuracy of 89%. The dataset is a subset of a larger Vietnamese speech emotion dataset named PhoEmo, which we will introduce in the near future; detailed information about the dataset is available at: "bit.ly/VNEMOS"
(2) Our research created and tested three CNN-based models: a Two Dimensional CNN structure, a One Dimensional CNN-LSTM structure, and a One Dimensional Attention CNN-LSTM structure. Our Two Dimensional CNN model achieved state-of-the-art accuracy among the available CNN-based Speech Emotion Recognition structures considered in our recent research.
(3) We designed an intuitive web application using the streamlit Python library to deploy the model, and tested it outside the lab to identify emotions from a speaker's speech signal, from which we can assess whether they show signs of pre-depression through their cognitive behavior (a minimal sketch of such a page follows this list).
(4) This research has been submitted to the ICDV Conference as: Quang-Anh N.D, Quynh-Chi Nguyen, Thu Hien Nguyen Thi, Quan Vu, Minh-Duc D.X, Duc-Chinh Nguyen, Hung-Ha Manh, Thai Kim Dinh (2024). "VNEMOS: Vietnamese Speech Emotion Inference Using Deep Neural Networks." ICDV 2024, 9th International Conference on Integration, Design and Verification.
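The listing below is a minimal sketch of a streamlit page of the kind described in contribution (3). The trained model call is stubbed out, and the uniform placeholder scores are illustrative assumptions, not the project's actual code.

import numpy as np
import pandas as pd
import streamlit as st

EMOTIONS = ["anger", "happiness", "sadness", "neutral", "fear"]

st.title("Speech Emotion Recognition Demo")
uploaded = st.file_uploader("Upload a speech recording", type=["wav"])

if uploaded is not None:
    st.audio(uploaded)  # play back the input sound
    # probs = model.predict(preprocess(uploaded))  # hypothetical trained model
    probs = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))  # placeholder scores
    st.bar_chart(pd.DataFrame({"probability": probs}, index=EMOTIONS))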
CHAPTER I INTRODUCTION
I.1 Rationale of the Study
Emotions are mysteries of the mind: a jumble of thoughts, feelings, and reactions involving a complicated interplay of cognitive, physiological, and behavioral elements. According to LeDoux's dual-pathway model [1], emotions are processed through both a fast, subcortical route involving the amygdala and a slower, cortical route through higher-order brain regions. The James-Lange theory [2] posits that physiological responses precede emotional experiences, suggesting a bidirectional relationship between bodily states and emotional states. Furthermore, Damasio's somatic marker hypothesis [3] underscores the role of bodily signals in influencing decision-making and emotional experiences. Collectively, these theories provide a comprehensive understanding of the neural mechanisms governing emotions. In accordance with the research in [4], basic emotions can be described along two dimensions, known as Arousal and Valence, as illustrated in Figure 1. An individual's level of Arousal is determined by his or her sense of calmness or excitement, whereas the level of Valence is determined by positive and negative feelings. Joy and Relaxation are two positive emotions depicted in Figure 1; their positions on this plane indicate that both are positive in nature, but Joy has a higher arousal level than Relaxation.
Speech, as an informational signal, consists mainly of information regarding the message to be conveyed, the emotional content of the message, and the speaker's characteristics, as well as the language information. In order to produce sound units in different emotions, the vocal tract takes on unique shapes that represent emotion-specific characteristics. Emotion-specific knowledge at the suprasegmental level is characterized by unique patterns of duration, pitch, and energy. In order to capture emotional vocal-tract information, spectral features such as mel frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), and their derivatives are used. Basic prosodic features such as pitch, duration, voice quality, and energy are derived from framewise parameters [5]. According to research, basic emotions are communicated through vocalization, and nonverbal cues in speech allow them to be recognized regardless of cultural background [6]. A number of properties are associated with
voices conveying happiness, including a higher mean pitch, higher voice intensity, a faster speech rate, higher pitch variability, and more high-frequency energy [7]. Voices with a sad tone exhibit a lower mean pitch, lower intensity, and a lower speech rate, in addition to a narrower pitch range and reduced high-frequency energy [8]. An angry voice is characterized by a high mean pitch and intensity, a wide range of pitch variation, a faster rate of speech, and more high-frequency energy [9]. Typically, fearful voices are characterized by high pitch and intensity with jittery, uneven vocal qualities and large pitch variation, while disgusted voices show lower intensity and downward shifts in pitch [8]. By carefully analyzing acoustic cues such as pitch, speech rate, and voice quality, one can reliably quantify the degree to which emotions can be recognized from vocal expressions [10]. Using neuroimaging techniques, distinct patterns of brain activity can be identified for different emotions conveyed through voice, providing objective evidence of neural correlates: a more dramatic activation of the temporal voice areas and orbitofrontal cortex is observed in response to an angry voice as compared to a sad voice [11], and fearful voices activate the amygdala, insula, and temporal voice areas more than happy voices [12]. The convergence of perceptual studies, acoustic measurements, neuroimaging, and cross-cultural research has the potential to enhance our understanding of the vocal expression of basic emotions, and computational analysis of large datasets of vocal expressions promises improved automated emotion recognition [13]. A comprehensive understanding necessitates the integration of evidence derived from various disciplines, including psychology, neuroscience, signal processing, and anthropology.
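As a concrete illustration of the spectral features mentioned above, the following is a minimal sketch of MFCC extraction, assuming the librosa library; the file name and parameter values are illustrative, not the exact settings used in this research.

import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load one utterance and return its MFCC matrix (n_mfcc x frames)."""
    signal, sr = librosa.load(path, sr=16000)  # resample to 16 kHz
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

# Averaging over time gives one fixed-length feature vector per utterance:
# features = extract_mfcc("utterance.wav").mean(axis=1)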
The new generation of machines is gradually replacing human beings in real work to perform tasks more efficiently and at lower cost in the era of modernization 5.0. Over the past several decades, a great deal of research has been undertaken on the recognition and analysis of human emotions in neuroscience, psychology, cognitive science, and computer science. Emotions play an important role in developing models for human emotion recognition and in developing AI with emotions or human-like behavior. This helps society develop toward one in which AI machines support more people, reduce the amount of work a person has to endure, and help reduce failure. Throughout human cognitive development, behavior, and social interactions, emotions play a crucial role in decision-making processes and communication patterns. According to some studies, the proportion of people suffering from depression in Vietnam tended to increase up to 2015, when it stood at 3%; since the COVID-19 epidemic became complicated in 2019, the rate has been higher [14]. It is estimated that almost 900,000 people die each year from depression. Therefore, the prevention and treatment of depression is an extremely important topic in today's society. Recognizing whether a person is depressed can be a big challenge, as the symptoms of depression may not be obvious. However, voice is one of the factors that we can observe to detect some signs of depression. When people are depressed, their voices can express discomfort, sadness, and feelings of fatigue. They may speak slowly, tremble, lose emotion in their voices, or have no energy to talk; some people may have a quieter or unclear voice, because they feel tired and do not want to communicate. In addition, people with depression may complain, self-stimulate, blame themselves, and show a loss of interest and motivation in their voices. They may also tend to avoid conversations, restrict communication, and stay away from social activities. This study aims to analyze a person's mood from voice information through deep learning, and ultimately to use the resulting deep learning model to infer which individuals may be depressed or tend toward depression, for treatment and prevention as early as possible.
Emotion databases play a crucial role in the development of emotion recognition models. It has always been important for people to communicate in a language, especially through speech, since language has played and continues to play a fundamental role in bringing people together. Distorted speech signals associated with different emotional states have been observed in studies in which speakers utter words under different mental states. Our research uses a method which employs audio parameters from distorted speech waves, each representing different emotions, as feature vectors; these vectors are then subjected to pattern recognition techniques with the aim of deriving the desired solution [15, 16]. Realizing the advantages of speech signals has encouraged researchers to search for effective ways to integrate speech into computer systems, since employing machines to process human emotions has the potential to enhance the performance of interactive systems. The development of verbal emotion detectors will be key to unlocking unparalleled interactive experiences between humans and computers, posing a new challenge to the research community. Our endeavor is to foster authentic interactions between humans and computers, thereby enabling computers to perceive emotions and engage in communication. Our aim is to facilitate emotion recognition, although the development of natural databases can be costly and is often constrained by limited resources; conversely, processed databases are easier to develop. This suggests the need to create an emotion dataset suitable for a specific cultural context like Vietnam [17]. A suitable dataset will help accurately capture and analyze the emotions of individual Vietnamese speakers. Moreover, such datasets are imperative resources, serving as the foundation for the development of applications ranging from emotional computing systems to virtual agents [18]. By addressing the unique emotions of Vietnamese speakers, these initiatives not only advance scientific understanding but also facilitate the development of more inclusive and effective technologies that cater to a diverse range of user groups. Emotion recognition in human-computer interaction is very important; however, most existing datasets are mainly in English, so creating an emotion recognition dataset in Vietnamese will help researchers study emotion recognition in this language [19]. In this study, we build a Vietnamese speech emotion dataset from movies and live shows. This Vietnamese Emotion Dataset is developed to solve the problem of a lack of data for research on emotion recognition in Vietnamese.
I.2 Research Questions
To further develop the current speech model that recognizes depression, the study focused on these three basic research questions:
(1) How can voice recognition be used to assess and track depression indicators, e.g., tone, speaking speed, distress, etc., of the speaker?
(2) How can we build a voice recognition model capable of recognizing signs of depression in the voice, and thus assess the diagnosis of the symptoms of the disease?
(3) How can we create a Vietnamese Speech Emotion Dataset?
CHAPTER II OVERVIEW
II.1 Psychological Foundations
The word "Melancholy" was first used by the ancient Greeks to describe feelingintensely sad and despair, to be more familiar in modern times, this word can be called
"Depress" The mental illness of depression, considered to be one of the most prevalentand serious mental illnesses, severely impairs an individual’s quality of life, markingthem with feelings of sadness, hopelessness, and loss of interest in enjoyable activities.The diagnosis of schizophrenia requires at least five or more of these symptomsoccurring for at least two weeks: depressed mood, diminished interest/pleasure,significant weight loss or gain, insomnia, hypersomnia, psychomotor agitation, fatigue,feelings of worthlessness, diminished concentration, and recurrent thoughts of suicide ordeath [20] Various factors are involved in the origins and causes of depression, includinggenetic, biological, environmental, and psychological factors [21] With a lifetimeprevalence of around 15-20%, depression contributes significantly to the global burden ofdisease and affects individuals of all ethnicities, ages, and genders [22] The WorldHealth Organization (WHO) reports that depression causes more than 300 million peoplearound the world to suffer from disabilities [23] Depression disorder is a commonpsychiatric disorder in Vietnam as well as in the world, it affects the quality of life of thesick person, in the severity of the patient can be affected to life due to the high risk ofsuicide In typical seizures, there are manifestations of inhibition of the entire mentalactivity Patients with a sad and pleasant mood, reduced interest in pleasure, feeling darkfuture, slow thinking, difficult association, self-confidence reduced, often paranoid guilt,leading to patients with thoughts of suicide and thoughts of death occurred in more thantwo-thirds of these patients, this is the leading cause of suicides, 10-15% suicide, in 2006
in Europe there were at least 59,000 successful suicide patients and this risk is presentthroughout the pathological process According to the study of the mental health program
by the American Veterans Fund in Vietnam (VVAF), the rates of depression and anxietyare the most common problems, in Da Nang there are 18.3% of adults with the diseaseand most depressive disorders are treated with chemotherapy But as we already know,antidepressants have many side effects such as headaches, dizziness, anxiety, breastfatigue, hand tremors, constipation some medications when taken in the early stagesmay increase the risk of suicide in patients In the meantime, some studies have shownthat when antidepressants and cognitive behavioral therapy are combined in the treatment
of depression, good results are more than 71%, if the treatment is only 43%, and theanti-depressant alone is more than 61% Behavioral activation therapy is a simple,time-effective, and cost-saving therapy This is a widely used therapy in the U.S., which
Trang 18is part of cognitive behavioral therapy, based on the underlying theory of behavior andthe current evidence to constitute behaviors that can trigger the mechanism of cognitionally-behavioral change in clinical depression From that we found that in Vietnam weneed to apply this behavioral activation therapy in combination with medication topatients to overcome some of the above disadvantages Da Nang City with the primarysupport of the American Veterans Fund in Vietnam, in the public health care programfrom 2009 to 2011 We have applied behavioral activation therapy in combination withantidepressants to treat depressed patients in 5 communities in the city Accompanyingthe community research, we conducted at the hospital with the application of behavioralactivation therapy for all patients with depression when examined at the clinic with thetheme: “Study of clinical characteristics and treatment of depression patients withbehavioral activation Therapy at Da Nang City Psychiatric Hospital” In Vietnam there is
no research work on this subject Depression is one of the most common mental illnesses
in the world, affecting millions of people each year Another example: according tostatistics from the Ministry of Health and Welfare [24], the number of people takingantidepressants in Taiwan is increasing year by year over time As of Taiwan’s 2017thyear, statistics have exceeded 1.3 million people And at the same time, the percentage ofyoung people under the age of 30 continues to slow down, as in Table 1 below, and it hasexceeded 10% in 2016 This percentage suggests that it means that the population withdepressive depression in Taiwan is slowing down, and is tending to grow younger.Although the number of depressed patients is increasing, on the other hand, the numbers
of people willing to receive treatment continue to increase This is a positive sign thatpeople’s awareness of their mental health status is increasing Because in the past,depression was considered a sensitive and less-mentioned issue, but now it has become atopic of concern So, to overcome the symptoms of depression we have used theknowledge of deep learning to create a system that can infer whether a person is prone todepression or depression, thus to apply to our lives The most direct and convenientassessment method to detect depression or depression in a person’s voice is to use voiceanalysis technology to assess emotions To achieve high accuracy in assessment, it isnecessary to collect characteristics of voice information through complex pre-processingprocesses or in coordination with other information such as images, facial expressioninformation and many other factors Voice analysis technology is developing rapidly tohelp detect and evaluate emotions from voice Furthermore, the use of diverse data from avariety of sources is also a way to increase the accuracy and efficiency of voice analysissystems
Taiwan | 2018 | 2017 | 2016 | 2015 | 2014
Sum | 1,397,197 | 1,330,204 | 1,273,561 | 1,121,659 | 1,194,395

Table 1: Statistics on the number of people using antidepressants in Taiwan, from the Ministry of Health and Welfare Bureau of Statistics [24]
Even though effective treatments are available, less than half of those affected worldwide receive them [23]. Depression can be alleviated by improving access to evidence-based interventions. Clinical studies have demonstrated the efficacy of Cognitive Behavioral Therapy (CBT) and pharmacological treatments for treating acute depressive episodes and preventing relapses, with CBT showing particular promise in maintaining long-term benefits [25]. Depressive disorders call for profound compassion toward those suffering from their relentless and debilitating effects, and significant personal and societal costs will need to be alleviated through ongoing research and improved services. Although melancholy has perennially plagued humankind, hope persists in the progress made and the promise of better understanding and caring for those in its thrall. According to current statistics, up to 80% of the world's population will experience depression at some point in their life.
● Causes: regarding the cause, there is still no truly clear answer.
● Endogenous: there are many theories involving genetics from parents or family members, the surrounding environment, society, and autoimmune factors.
● Stress-induced depression: due to pressure from many sides, such as family, work, or children, or unexpected events with severe consequences, such as the loss of loved ones or loss of money.
● Depression can also occur after illnesses or injuries that cause serious direct damage to the brain.
● Loss of concentration: although it is normal to forget the name of a person or a task to do, if this happens often and concentration declines, reducing work efficiency, it could be due to depression. Being depressed leads to more error-prone and more difficult decisions.
● Changes in sleep: sleep disturbance can be considered one of the main symptoms; some people sleep too much, others too little.
● Changes in appetite: some people may eat more than usual, while others lose interest even in favorite foods. Either way, a dramatic change in appetite and weight (more than 5% of body weight in a month) could be a sign of depression.
● Irritability, agitation, and moodiness: increased irritability, agitation, and moodiness are also recognizable signs. Even little things can become upsetting, such as loud noises or long waits, sometimes accompanied by anger or self-harming thoughts.
Depression classification and analysis:
Depression may be divided into major depressive disorder, persistent depressive disorder, and seasonal affective disorder. Major depression has the most serious symptoms: the patient falls into a depressed emotional state for a long time (possibly more than two weeks), with persistently negative thoughts, depressed feelings, emptiness, worthlessness, eating disorders, loss of concentration, and slow reactions; such patients may think of death or suicide as a way to free themselves, and in the end about 3.4% of patients actually die by suicide [26]. Persistent depressive disorder is basically similar to major depressive disorder, but it is usually milder and lasts longer than major depression [27]; it can take more than 2 years to be diagnosed, and the disease can last for more than 10 years or even accompany the patient for life. Persistent depression also carries a high risk of developing into major depression: studies have shown that 79% of patients can develop major depressive complications, a condition known as double depression. The last type is seasonal depression, which, as the name suggests, mainly occurs in the cold season and is more common in high-latitude areas, with a higher incidence when the weather is cold, since time in the sun is reduced and more sun exposure can reduce the risk of the disease [28]. The World Health Organization (WHO) has listed depression as one of the three most serious diseases affecting the world, comparable in severity to cardiovascular diseases or AIDS. Psychosis in Taiwan, according to statistics, is a severe or persistent psychological disorder and tends to affect increasingly younger people; this may be due to changes in society and environment, with the young population under great pressure and possibly having problems with family, friends, and outside relationships. Since depression is not just a psychological problem but can be associated with personality and physical condition, it can be treated with medication accompanied by psychotherapy; with proper care and treatment, 70-85% of patients can significantly improve their condition [29].
However, with the advancement of artificial intelligence (AI), it is possible to apply this technology to treat depression more effectively. One potential application of AI in the treatment of depression is the automatic assessment of patients' symptoms: AI systems can analyze data on symptoms of depression, including changes in a patient's mood, eating habits, and sleep, to help diagnose the disease and provide appropriate treatment. In addition, AI can assist health professionals in managing depression patients; AI systems can monitor patient health indicators, give advice on treatments, and support health care decisions. This enhances the quality of care and minimizes the risks associated with errors in the care process. Finally, AI can also help in the early detection of depression: AI systems can analyze data from high-risk patients to detect early signs of depression and warn patients and health professionals. In short, the use of AI in the treatment of depression can bring many benefits to patients and healthcare professionals, and applications of AI in the treatment of depression are being researched and developed to help improve healthcare quality [30]. Through this research, we aim to obtain useful information for diagnosing depression and to detect signs of depression more accurately and quickly. Using deep learning techniques, this research assesses the probability that an individual has a tendency toward depression by analyzing the frequency and amplitude of the voice. Our objective is to identify the speaker's specific emotional characteristics and then determine whether they are likely to experience depression based on those characteristics. Deep learning allows us to assess and analyze the emotional characteristics of a voice in a more comprehensive and accurate manner; we evaluate specific emotional characteristics of the human voice, including speed, volume, rhythm, and tone.
II.2 Period Model
When traditional machine learning methods are used to recognize voice emotions, the most common approach is the use of classifiers. The core concept is to build a classification model from a large number of known samples and assign each input to the right category; the model can then classify correctly when new data appears. Common classifiers include the Support Vector Machine (SVM), Gaussian Mixture Models (GMM), and the Hidden Markov Model (HMM). According to research in [31], basic emotions can be described by a two-dimensional space spanned by Arousal and Valence, as in Figure 2 below. Arousal represents the intensity of calm or excitement, while Valence represents positive and negative emotional tendencies. For example, the two positive emotions in the figure are Joy and Relaxation: from their locations on this plane, we can see that both are positive emotions, but Joy has a higher level of arousal than Relaxation.

Figure 2: A flat diagram of the two-dimensional arousal-valence emotional model

The Period model is a deep learning model used to analyze voice signals and detect depression. It is based on the cyclical nature of the voice, detecting changes in volume and frequency. The model uses multi-layer analysis techniques to decompose voice signals into different frequency components, then applies algorithms that extract information from the vocal cycles to detect variations in the volume and frequency of the voice. Applications of the Period model in the detection of depression include: analyzing the frequency components of the voice to identify voice characteristics associated with depression, detecting variations in the volume and frequency of the voice to track changes in the mood of a depressed patient, and using algorithms that extract information from the vocal cycle to predict the patient's depression.
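To make the classifier-based approach described at the start of this section concrete, the following is a minimal sketch, assuming scikit-learn: an SVM fitted on pre-computed per-utterance feature vectors. The random arrays are placeholders standing in for real features and emotion labels.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector per utterance (e.g. averaged MFCCs); y: emotion labels.
rng = np.random.default_rng(0)
X = rng.random((100, 40))                                # placeholder features
y = rng.choice(["anger", "joy", "sadness"], size=100)    # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # scale, then SVM
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))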
II.3 An Overview of Emotion Datasets
Depending on how emotions are expressed, emotion datasets can be categorized into three types [32-34]: simulated/acted, elicited, and natural. Of the three, acted datasets account for the majority of those collected; this category is the largest, and recording is usually carried out by professional actors. Typically, three emotion corpora are used in the emotion recognition problem: the Berlin Emotion Speech Database (EmoDB), the Chinese Emotions Dataset, and the Danish Emotion Corpus; each corpus is collected from the expressions most widely used in emotion recognition problems. EmoDB is a database of emotional speech recordings containing 533 tracks from ten actors (five men, five women) describing seven emotional states, including anger, boredom, anxiety, fear, disgust, and neutral feelings [35]. There are fifty linguistically neutral sentences in the CASIA Chinese emotion corpus describing six emotions: anger, fear, happiness, neutrality, sadness, and surprise. This corpus contains two hours of spontaneous emotional segments extracted from 219 speakers in films, television plays, and talk shows; because of the large number of speakers, it is an outstanding addition to existing emotional databases [36]. The Danish emotional speech corpus contains semantically neutral utterances recorded by four subjects experiencing four different emotions: anger, happiness, surprise, and sadness [37]. Even though these datasets are widely utilized, they have certain limitations. The emotions are recorded by actors who do not act within an appropriate context, which makes the authenticity of the recordings suspect: speaking can be poor and somewhat awkward, and emotions are expressed in separate sentences that are not connected in any way. In comparison with data recorded in real life, emotions recorded in the laboratory appear unnatural.
A number of other common speech databases are also available [38], the most commonly used of which are IEMOCAP and DAIC-WOZ, both sampled at 16 kHz. The IEMOCAP database [39] contains various types of information: audiovisual recordings of the actors' facial expressions and movements, together with verbatim transcripts, were created at the University of Southern California's SAIL Laboratory, and approximately 12 hours of audiovisual data are included. The current version of the database includes recorded data from ten participants and covers six emotions: anger, excitement, frustration, happiness, sadness, and neutrality. A relatively recent interview voice database, DAIC-WOZ is a benchmark interview voice dataset for depression detection developed at the University of Southern California, in which a virtual agent, Ellie, serves as a virtual psychologist interviewing participants. The dataset currently comprises approximately 200 interviews, with recording times ranging from 7 to 33 minutes (on average 16 minutes), consisting of 189 clinical interviews between an interviewer and a patient; 30 of the 107 interviews in the training set and 12 of the 35 interviews in the development set are classified as depressed [40, 41]. Along with audio recordings, accompanying data include verbatim transcripts of responses, timestamps for the beginning and end of sentences, facial expressions, eye-contact information, etc. In the DAIC-WOZ database, emotions are classified fairly simply, and depression status is determined by the score on the PHQ-8 scale [42].
Natural databases include video clips and audio clips from TV programs, radio stations, or call centers; these recordings are called spontaneous speech recordings. Examples of real-life situations include cockpit recordings under unusual conditions, conversations between patients and doctors, emotional conversations in public places, and others [43]. Morrison et al.'s Chinese emotion dataset and the French medical emergency corpus [44] both obtained data through call centers. Morrison et al. used an automated telephone system to collect data from Chinese-speaking consumers; such systems have speech recognition units that process user requests based on spoken language. Voice emotion recognition systems can be used to handle calls based on the perceived urgency of the call: if the automated system detects that a caller appears confused or angry, it may route the call to an operator for assistance, and the system can monitor voicemail messages in the switchboard and prioritize them according to sentiment [45]. The French medical emergency corpus, derived from real-life call center conversation records, provides insight into emotion recognition. A set of naturally occurring dialogues was recorded in a real-world call center; approximately 20 hours of transcribed data are included in the corpus, comprising 404 agent-caller dialogs from six different agents and 404 callers, and about 10% of the transcribed speech data are omitted due to overlapping voices [46]. A medical advice service is provided through this call center. These data are used in accordance with ethical conventions and agreements, ensuring caller anonymity and privacy as well as non-propagation of the corpus. An overview of the corpus's characteristics is provided in Table 2.
#clients | 404 (152 Male, 266 Female)

Table 2: Corpus characteristics: 404 agent-client dialogs of around 10 hours, 19k speaker turns [45]
We must also mention the EmoTV dataset, a collection of 51 video clips recorded from French television channels as part of the HUMAINE project. Various topics (politics, religion, sports, etc.) are covered in the interviews, which were conducted in France. The French corpus includes 48 subjects; the total duration is 12 minutes and the lexicon size is 800 words (total number of words: 2500) [47]. Additionally, there are datasets such as "Baby Ears" [48], collected using two types of experimental data: in the first experiment, audio data was collected from parents conversing with their children; in the second, adult listeners judged whether each utterance expressed approval, attention seeking, or prohibition, in addition to rating the message's strength. Rao & Koolagaudi's dataset is quite different from the others in that it includes both natural and acted elements [49]. Their Hindi speech corpus was collected from five different geographical regions of India (central, eastern, western, northern, and southern), representing the five dialects of Hindi. To obtain speech data, five men and five women spoke each dialect in isolation; random questions were posed to each speaker asking them to describe their childhood, hometown, and childhood memories, and each dialect's recordings lasted approximately 1-1.5 hours. The acted part of the database consists of 10 professional artists (5 males and 5 females) from All India Radio (AIR) Varanasi, India, recorded for this purpose; in creating this database, eight emotions were taken into account: anger, disgust, fear, joy, neutrality, sadness, sarcasm, and surprise. Beyond the datasets mentioned above, several other popular datasets cover a wide variety of languages, such as the stress dataset of Scherer et al., in which approximately 100 speakers (25 native German speakers, 16 native English speakers, and 59 native French speakers) were recorded using a computer induction tool; each speaker read a total of 100 sentences and gave some spontaneous responses [48]. There is also the INTERFACE dataset, which captures speech in multiple languages (English, Spanish, French, Slovenian) [50], and the "you stupid tin box" dataset, which focuses on speech produced by children in English and German. The German data was collected from 51 children (ages 10-13; 21 males, 30 females) [50] from two different institutions ('Mont' and 'Ohm'). Thirty native English speakers, aged between 4 and 14, participated in the E1 and E27 data collections; the recordings were undertaken in a specially designed multimedia studio at CETADL, located within the Department of Electronic, Electrical, and Computer Engineering (EECE). The total duration of those recordings is approximately 8.5 hours, corresponding to just over 1.5 hours of speech once silences, pauses, and 'babble' have been removed. In addition to the databases mentioned above, there are also elicited databases. Different from natural and acted databases, an elicited speech emotion database is one where emotions are created in artificial emotional situations, without any prior knowledge about the speaker [34]. The Chinese elicited dataset of Yuan et al. consists of a total of 288 sentences collected from nine speakers describing anger, fear, happiness, and sadness [52]. For each sentence, four listeners were asked to rate the emotion type, choosing among neutral, anger, fear, happiness, or sadness.
Emotion Classes | Anger, Sadness, Astonishment, Fear, Happiness, Neutral
Speakers | 23 male, 12 female (ages 22 to 58)
Sampling Rate | 16000 Hz, 16-bit, 1 channel

Table 3: Details of the Emotional Oriya speech database [53]
The Department of Computer Science and Applications of Utkal University of Orissa has also developed the Oriya emotional speech dataset. There are 35 speakers, aged 22 to 58, speaking Oriya, who come from different Oriya-speaking areas of the state of Orissa [53]. Based on their emotional state, speakers read the text in a very natural voice; the recorded speech is monitored, and the speaker re-reads an emotional sentence if the desired emotion is not conveyed when the recording takes place [54, 55]. Oriya drama scripts are used for the texts, including "MO PEHENKALI BAJAIDE" (/mo pehenkAli bajaide/), "EDUNIA CHIDIA KHANA" (/e duniA chidiA khanA/), etc. In a laboratory environment containing significant noise, the data is digitized at 16000 Hz. A summary of the Oriya emotion corpus is provided in Table 3.

A dataset for emotion recognition through sound in Vietnamese has not previously been developed or researched anywhere in the world. Having recognized this gap, we decided to contribute to this field by building the first Vietnamese emotion dataset incorporating a wide range of emotions. This represents both an important step forward for artificial intelligence research and development and a step toward exploring the unique emotional characteristics of the Vietnamese language. The properties of other prominent emotional speech databases are listed in Table 4 below.

Dataset | Language | Type | Description | Emotions
Berlin EmoDB [21] | German | Acted | 533 tracks from ten actors (five men, five women) | 7 emotional states: anger, boredom, anxiety, fear, disgust, neutral
CASIA Chinese emotion corpus [22] | Mandarin | Acted | Two hours of spontaneous emotional segments extracted from 219 speakers from films, television plays and talk shows | 6 emotional states: anger, fear, happiness, sadness, neutrality, surprise
The Danish emotional speech corpus [23] | Danish | Acted | Four subjects experiencing four different emotions | anger, happiness, surprise, sadness
IEMOCAP Database | English | Acted | Audiovisual recordings of the actors' facial expressions, movements, and verbatim manuscripts; approximately 12 hours of audiovisual data from ten participants | anger, excitement, frustration, happiness, sadness, neutral
DAIC-WOZ Database | English | - | Approximately 200 interviews, with recording times ranging from 7 to 33 minutes (on average 16 minutes) | neutral, sadness, anxiety
Morrison et al. [29] | Mandarin | Natural | 388 utterances, 11 speakers | anger, neutral
Baby Ears dataset [34] | English | Natural | 509 utterances, 12 actors (6 males, 6 females) | approval, attention, prohibition
French medical emergency corpus [32] | French | Natural | 10-hour subset comprising dialogs (6 different agents, 404 callers) | anger, fear, joy, sadness, disgust, surprise
Rao & Koolagaudi [35] | Hindi | Natural and acted | 5 females and 5 males, sentences uttered based on their memories | anger, disgust, fear, happy, neutral, sadness, surprise, sarcastic
EmoTV [33] | French | Natural | 51 video clips recorded from French TV channels containing interviews on 24 different topics; 48 subjects, total duration 12 min (average length 14 s per clip), lexicon size 800 words (total number of words: 2500) | anger, disgust, fear, embarrassment, sadness
Scherer et al. [36] | English and German | Natural | 100 native speakers | stress and load level
INTERFACE [37] | Slovenian, English, Spanish, French | Acted | English 186 utterances, Slovenian 190 utterances, Spanish 184 utterances, French 175 utterances | anger, disgust, fear, joy, slow neutral, fast neutral, surprise, sadness
You stupid tin box [38] | German, English | Elicited | 51 children | anger, boredom, joy, surprise
Yuan et al. [39] | Chinese | Elicited | 9 native speakers | anger, fear, joy, neutral, sadness
Oriya emotional corpus [40] | Oriya | Elicited | 35 speakers (23 male, 12 female), texts taken from various Oriya drama scripts | anger, sadness, astonishment, fear, happiness, neutral
VNEMOS (Ours) | Vietnamese | Natural and acted | 250 segments, approximately 30 minutes, from 27 movies, movie series and live shows | anger, sadness, happiness, neutral, anxiety

Table 4: Characteristics of other popular Emotional Speech Datasets
II.4 Related Work
In 2014, E. Yuncu used a binary decision tree consisting of SVM classifiers to classify seven emotions on the EmoDB database [56], reaching a highest accuracy of 82.9%. In 2016, XingChan Ma used a neural network combining CNN and LSTM [57], which achieved a recognition rate of 68% on DAIC-WOZ. In 2018, Haytham used the IEMOCAP database with a CNN-RNN structure [58] and achieved a recognition rate of 64.78%; in the same year, S. Tripathi used a three-layer LSTM architecture for emotion discrimination on IEMOCAP [59] and achieved a recognition rate of 71.04%. In 2023, Toyoshima et al. proposed a speech emotion recognition model based on a multi-input deep neural network that simultaneously learns two audio features, a spectrogram and GeMAPS features, using a CNN-DNN architecture, achieving a 61.49% accuracy rate [60]. The comparison of related works is outlined in Table 5.
Table 5: Comparison of related works
B: anger, happiness, sadness, neutral
C: anger, happiness, neutral, sadness, silence
D: anger, boredom, disgust, fear, happiness, sadness, neutral
CHAPTER III BACKGROUND
III.1 What is speech recognition?
Speech recognition, also known as automatic speech recognition (ASR), is the capacity of a program to transform human speech into written text. While voice recognition and speech recognition are frequently conflated, speech recognition focuses on translating speech from a verbal format to a text format, whereas voice recognition solely aims at identifying a particular user's voice.
Key characteristics of effective speech recognition: there are numerous speech recognition software packages and devices available, but the most modern ones employ artificial intelligence and machine learning. To interpret and analyze human speech, they incorporate syntax, vocabulary, the structure of audio, and signals from voices. They are designed to learn as they go, adapting their responses with each interaction. The best systems also allow organizations to customize and adapt the technology to their specific requirements, everything from language and nuances of speech to brand recognition. For example:
● Language weighting: improve precision by weighting specific words that are spoken frequently, beyond terms already in the base vocabulary.
● Speaker labeling: output a transcription that cites or tags each speaker's contributions to a multi-participant conversation.
● Acoustics training: attend to the acoustic side of the business; train the system to adapt to an acoustic environment and to speaker styles (voice pitch, volume, etc.).
● Profanity filtering: use filters to identify certain words or phrases and sanitize speech output.
Figure 3: Voice recognition
Speech recognition algorithms
The vagaries of human speech have made development challenging; speech recognition is considered one of the most complex areas of computer science, involving linguistics, mathematics, and statistics. Speech recognizers are made up of a few components: the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.
Speech recognition technology is evaluated on its accuracy rate, i.e., word error rate (WER), and on its speed. A number of factors can impact the word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity, meaning an error rate on par with that of two humans speaking, has long been the goal of speech recognition systems. Various algorithms and computation techniques are used to convert speech into text and to improve transcription accuracy. Below are brief explanations of some of the most commonly used methods:
- Natural language processing (NLP): while NLP [61] is not necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both speech and text. Many mobile devices incorporate speech recognition into their systems to conduct voice search (e.g., Siri) or to provide more accessibility around texting.
- Hidden Markov models (HMM): hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit (words, syllables, sentences, etc.) in the sequence. These labels create a mapping with the provided input, allowing the model to determine the most appropriate label sequence.
- N-grams: this is the simplest type of language model, which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words; for example, "order the pizza" is a trigram or 3-gram and "please order the pizza" is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition and accuracy (see the counting sketch after this list).
- Neural networks: primarily leveraged for deep learning [62] algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold), and an output. If the output value exceeds a given threshold, it "fires", or activates, the node, passing data to the next layer in the network. Neural networks [63] learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent. While neural networks tend to be more accurate and can accept more data, this comes at a performance-efficiency cost, as they tend to be slower to train than traditional language models.
- Speaker diarization: speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
- Multi-layer perceptron (MLP): a perceptron can be considered a fully connected layer of a traditional artificial neural network (ANN), and the term MLP is used loosely to mean any traditional feed-forward ANN, or sometimes only networks consisting of multiple layers of perceptrons. The term "multi-layer perceptron" does not refer to a single perceptron that has multiple layers [64]; instead, it denotes multiple perceptrons organized into layers, so an alternative name is "multi-layer perceptron network." Furthermore, the "perceptron" in MLP is not a perceptron in the strictest sense: a perceptron is officially a special case of an artificial neuron using a threshold activation function such as the Heaviside step function, whereas an MLP neuron can use arbitrary activation functions. A perceptron performs binary classification, while an MLP neuron can freely perform classification or regression, depending on its activation function. The term "multi-layer perceptron" is thus applied without taking into account the nature of the nodes/layers, which may consist of arbitrarily defined artificial neurons rather than perceptrons in the strict sense.
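To make the N-gram idea from the list above concrete, the following is a minimal counting sketch on a toy corpus; a real language model would be trained on a large corpus and use smoothing.

from collections import Counter

corpus = "please order the pizza please order the salad".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_next("order", "the", "pizza"))  # 0.5 in this toy corpus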
III.2 Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of neural network architecture used mainly in computer vision and image processing. It is designed to learn the characteristic features of image data automatically from training data. CNNs use convolutional layers to extract features automatically from image data by applying filters that scan across the input images; these filters help identify features such as edges, corners, or other simple structures. Pooling layers are then used to reduce the size of the extracted feature maps, which reduces computational complexity and increases the efficiency of the network. Finally, fully connected layers connect the extracted features to the output layers to predict the labels of the input image, allowing CNNs to classify, identify, or segment images. Figure 4 below shows a typical basic CNN. CNNs have achieved success in many fields, such as object recognition in photos, facial recognition, biomedical image processing, and applications in autonomous vehicles, artificial intelligence, and many other areas. Thanks to their automatic learning capabilities and scalability, CNNs have become an important tool in computer vision and image processing.
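As an illustration of the convolution, pooling, and fully connected pipeline just described, the following is a minimal sketch assuming the Keras API; the input shape and class count are illustrative assumptions (a small spectrogram-like patch and five emotion classes), not the exact architecture evaluated in this research.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),          # e.g. a small spectrogram patch
    layers.Conv2D(16, 3, activation="relu"),  # filters learn local features
    layers.MaxPooling2D(),                    # shrink the feature maps
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # fully connected layer
    layers.Dense(5, activation="softmax"),    # e.g. five emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()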