VIETNAM NATIONAL UNIVERSITY, HANOI
Advisors: PhD Kim Dinh Thai, PhD Ha Manh Hung
Team leader: Nguyen Duc Quang Anh
Ha Noi, April 15, 2024
INFORMATION OF THE GROUP LEADER
- Program: Applied Information Technology
- Address: Nam Thanh, Ninh Binh
- Phone no./Email: 0948396862 / anhnd@vnuis.edu.vn
II Academic results
III Other achievements
- Semi-finalist at the 25th Euréka Student Scientific Research Award
- Certificate of Merit from the Youth Union for achievements in Youth Month 2023
- Exchange delegate at the 9th Global Young Parliamentarians Conference
- Certificate of Merit for contributions in Learning and Scientific Research in 2022 - 2023
- Certificate of outstanding contributions in the two competitions FUTURE BANKER 2023 and THE TECH TALENT HUNT 2023, organized by the Center for Forecasting and Human Resource Development - Vietnam National University
On behalf of the Speech Recognition group, we would like to express our deep thanks to PhD Kim Dinh Thai, PhD Ha Manh Hung, and Mr. Chinh. Thank you for guiding us throughout the research process. Thanks to the knowledge and experience that our teachers and Mr. Chinh shared, we have learned a great deal during the study period.
The teachers are not only experts in their field; they are also companions, always ready to help us overcome challenging difficulties throughout the study process. We must also mention Mr. Chinh: although we began as strangers, he accompanied us throughout the entire study period. We appreciate the dedication and care that everyone has shown us, and we wish our teachers and Mr. Chinh health, happiness, and success in life.
Thank you for everything you have done for us; we look forward to working with you on many future projects.
Students: Nguyen Thi Thu Hien, Nguyen Duc Quang Anh
TABLE OF CONTENTS
IV.3.1 Dealing with missing data 40
IV.4.3 One Dimensional Attention CNN-LSTM Model 45
LIST OF FIGURES
Figure 1: Spectrum of a vowel in five different emotions 13
Figure 2: A flat diagram of the two-dimensional arousal-valence emotional model
Figure 4: Convolutional Neural Network model 32
Figure 13: VNEMOS Emotions Data length Distribution 40
Figure 17: Web Voice Recognition Receiver and Analysis Interface 51
Figure 24: Examples of emotion classification results 55
LIST OF TABLES
Table 1: Statistics on the number of people using antidepressants in Taiwan from the Ministry of Health and Welfare Bureau of Statistics 19
Table 2: Corpus characteristics: 404 agent-client dialogs of around 10 hours, 19k speaker turns
Table 3: Details of Emotional Oriya speech database 26
Table 4: Characteristics of other popular Emotional Speech Databases 26
Table 6: Key functional overview of deep learning architecture 34
Table 7: Two Dimensional CNN Model network parameters 43
Table 10: 2D CNN Model Confusion matrix classifying in EmoDB 47
Table 11: 2D CNN Model Confusion matrix classifying in VNEMOS 47
Table 12: 1D CNN-LSTM Model Confusion matrix classifying in EmoDB 47
Table 13: 1D CNN-LSTM Model Confusion matrix classifying in VNEMOS 47
Table 14: 1D Attention CNN-LSTM Model Confusion matrix
LIST OF ABBREVIATIONS
ANN Artificial Neural Network
DAIC-WOZ Distress Analysis Interview Corpus Wizard-of-Oz
EECE Electronic, Electrical, and Computer Engineering
HUMAINE Human-Machine Interaction Network on Emotion
IEMOCAP Interactive Emotional Dyadic Motion Capture
LPCCs Linear Prediction Cepstral Coefficients
LSTM Long Short-Term Memory
MFCCs Mel Frequency Cepstral Coefficients
MLP Multi-layer Perceptron
NLP Natural language processing
RNN Recurrent Neural Network
PHQ-8 Eight-item Patient Health Questionnaire depression scale
ReLU Rectified Linear Unit
SER Speech Emotion Recognition
It is complicated to determine whether a person is depressed because the symptoms of depression are not apparent. However, the voice can be one of the factors from which we can recognize signs of depression: people who are depressed express discomfort and sadness; they may speak slowly and tremulously, and lose emotion in their voices. In this research, deep learning is applied to detect emotions. By analyzing the audio signal of speech, our deep learning models can detect different emotional states. The research delves into the computational steps involved in implementing a deep learning architecture, focusing on a model structured on the Convolutional Neural Network (CNN), and introduces a Vietnamese speech emotion dataset. The dataset is the result of researching, testing, and filtering 250 emotional segments from movies, series, and live shows, divided equally among five basic human emotional states: anger, happiness, sadness, neutral, and fear. VNEMOS is approximately 30 minutes long in total. In experiments, the model achieved impressive recognition performance with a highest accuracy of 89%.
Keywords
Depression, CNN, Deep Learning, Vietnamese, SER
Nguyễn Đức Quang Anh 22070306 AIT2022A Applied Information Technology 2nd year
Nguyễn Quỳnh Chi 23070464 ISEL2023A Industrial Systems Engineering and Logistics 1st year
Nguyễn Thị Thu Hiền 22070073 AIT2022B Applied Information Technology 2nd year
Vũ Quân 22071105 AIT2022B Applied Information Technology 2nd year
Đỗ Xuân Minh Đức 22070047 AIT2022A Applied Information Technology 2nd year
3 Structure
The first chapter, "Introduction," serves as the foundation for the entire research. In this section, we provide the rationale for why recognizing depression tendencies from speech signals is an important topic. Additionally, this chapter raises research questions, clearly identifies the motivation and objectives of the study, and introduces the research methodology we employ to achieve our objectives.
In the following chapter, we apply the psychological basis of depression and the cycle model to voice recognition. We investigate the psychological components of depression, such as symptoms and classification, as well as the cycle model; this serves as the theoretical foundation for our research. We discuss what speech recognition is and how we utilize deep learning to detect early indicators of depression through speech, and we investigate speech recognition technologies, ranging from artificial neural networks to recurrent neural networks, as well as other related research in the field.
In Chapter 3, we explain our study methodology and give an overview of the deep learning methods we use in our research, together with some related reference methods.
In Chapter 4, we explain how the architecture and models are used, culminating in the model training process, and describe the phases of data gathering, introducing the datasets that we use. There are two main datasets: EmoDB and VNEMOS.
Moving on to Chapter 5, we present the research results and discuss noteworthy findings based on the previously defined research questions and assumptions. We also examine how the model is applied in real time: we designed a web application that displays the findings, provides a percentage table showing how the input maps onto the five main categories of emotions, and presents tables and charts of the input sound in the most plain and intuitive way.
Finally, Chapter 6 focuses on the discussion of the factors identified in the results section. We propose methods for model improvement where necessary and summarize the key points of the study, along with its limitations. Furthermore, we suggest future research directions.
4 Contributions
(1) In this research we have developed a Vietnamese speech emotion dataset named VNEMOS, containing approximately 30 minutes of audio with five emotion classes (anger, happiness, sadness, neutral, and anxiety), on which we achieved an accuracy of 89%. The dataset is a subset of a larger Vietnamese speech emotion dataset named PhoEmo, which we will introduce in the near future; detailed information about the dataset is available at: "bit.ly/VNEMOS"
(2) Our research created and tested three CNN-based models: a Two Dimensional CNN structure, a One Dimensional CNN-LSTM structure, and a One Dimensional Attention CNN-LSTM structure. Our Two Dimensional CNN model achieved state-of-the-art accuracy among the available CNN-based Speech Emotion Recognition structures considered in our recent research.
(3) We designed an intuitive web application using the streamlit Python library to deploy the model, and tested it outside the lab to identify emotions from a speaker's speech signal, from which we can assess whether they show signs of pre-depression through their cognitive behavior (a minimal sketch of such a page follows this list).
(4) This research has been submitted to the ICDV Conference as: Quang-Anh N.D, Quynh-Chi Nguyen, Thu Hien Nguyen Thi, Quan Vu, Minh-Duc D.X, Duc-Chinh Nguyen, Hung-Ha Manh, Thai Kim Dinh (2024). "VNEMOS: Vietnamese Speech Emotion Inference Using Deep Neural Networks." ICDV 2024, 9th International Conference on Integration, Design and Verification.
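The listing below is a minimal sketch of a streamlit page of the kind described in contribution (3). The trained model call is stubbed out, and the uniform placeholder scores are illustrative assumptions, not the project's actual code.

import numpy as np
import pandas as pd
import streamlit as st

EMOTIONS = ["anger", "happiness", "sadness", "neutral", "fear"]

st.title("Speech Emotion Recognition Demo")
uploaded = st.file_uploader("Upload a speech recording", type=["wav"])

if uploaded is not None:
    st.audio(uploaded)  # play back the input sound
    # probs = model.predict(preprocess(uploaded))  # hypothetical trained model
    probs = np.full(len(EMOTIONS), 1.0 / len(EMOTIONS))  # placeholder scores
    st.bar_chart(pd.DataFrame({"probability": probs}, index=EMOTIONS))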
CHAPTER I INTRODUCTION
I.1 Rationale of the Study
Emotions are mysteries of the mind: a jumble of thoughts, feelings, and reactions involving a complicated interplay of cognitive, physiological, and behavioral elements. According to LeDoux's dual-pathway model [1], emotions are processed through both a fast, subcortical route involving the amygdala and a slower, cortical route through higher-order brain regions. The James-Lange theory [2] posits that physiological responses precede emotional experiences, suggesting a bidirectional relationship between bodily states and emotional states. Furthermore, Damasio's somatic marker hypothesis [3] underscores the role of bodily signals in influencing decision-making and emotional experiences. Collectively, these theories provide a comprehensive understanding of the neural mechanisms governing emotions. In accordance with the research in [4], basic emotions can be described along two dimensions, known as Arousal and Valence, as illustrated in Figure 1. An individual's level of Arousal is determined by his or her sense of calmness or excitement, whereas the level of Valence is determined by positive and negative feelings. Joy and Relaxation are two positive emotions depicted in Figure 1; their positions on this plane indicate that both are positive in nature, but Joy has a higher arousal level than Relaxation.
Speech, as an informational signal, consists mainly of information regarding the message to be conveyed, the emotional content of the message, and the speaker's characteristics, as well as the language information. In order to produce sound units in different emotions, the vocal tract takes on unique shapes that represent emotion-specific characteristics. Emotion-specific knowledge at the suprasegmental level is characterized by unique patterns of duration, pitch, and energy. In order to capture emotional vocal-tract information, spectral features such as mel frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), and their derivatives are used. Basic prosodic features such as pitch, duration, voice quality, and energy are derived from framewise parameters [5]. According to research, basic emotions are communicated through vocalization, and nonverbal cues in speech allow them to be recognized regardless of cultural background [6]. A number of properties are associated with
voices conveying happiness, including a higher mean pitch, higher voice intensity, a faster speech rate, higher pitch variability, and more high-frequency energy [7]. Voices with a sad tone exhibit a lower mean pitch, lower intensity, and a lower speech rate, in addition to a narrower pitch range and reduced high-frequency energy [8]. An angry voice is characterized by a high mean pitch and intensity, a wide range of pitch variation, a faster rate of speech, and more high-frequency energy [9]. Typically, fearful voices are characterized by high pitch and intensity with jittery, uneven vocal qualities and large pitch variation, while disgusted voices show lower intensity and downward shifts in pitch [8]. By carefully analyzing acoustic cues such as pitch, speech rate, and voice quality, one can reliably quantify the degree to which emotions can be recognized from vocal expressions [10]. Using neuroimaging techniques, distinct patterns of brain activity can be identified for different emotions conveyed through voice, providing objective evidence of neural correlates: a more dramatic activation of the temporal voice areas and orbitofrontal cortex is observed in response to an angry voice as compared to a sad voice [11], and fearful voices activate the amygdala, insula, and temporal voice areas more than happy voices [12]. The convergence of perceptual studies, acoustic measurements, neuroimaging, and cross-cultural research has the potential to enhance our understanding of the vocal expression of basic emotions, and computational analysis of large datasets of vocal expressions promises improved automated emotion recognition [13]. A comprehensive understanding necessitates the integration of evidence derived from various disciplines, including psychology, neuroscience, signal processing, and anthropology.
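As a concrete illustration of the spectral features mentioned above, the following is a minimal sketch of MFCC extraction, assuming the librosa library; the file name and parameter values are illustrative, not the exact settings used in this research.

import librosa
import numpy as np

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load one utterance and return its MFCC matrix (n_mfcc x frames)."""
    signal, sr = librosa.load(path, sr=16000)  # resample to 16 kHz
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

# Averaging over time gives one fixed-length feature vector per utterance:
# features = extract_mfcc("utterance.wav").mean(axis=1)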
The new generation of machines is gradually replacing human beings in real work to perform tasks more efficiently and at lower cost in the era of modernization 5.0. Over the past several decades, a great deal of research has been undertaken on the recognition and analysis of human emotions in neuroscience, psychology, cognitive science, and computer science. Emotions play an important role in developing models for human emotion recognition and in developing AI with emotions or human-like behavior. This helps society develop toward one in which AI machines support more people, reduce the amount of work a person has to endure, and help reduce failure. Throughout human cognitive development, behavior, and social interactions, emotions play a crucial role in decision-making processes and communication patterns. According to some studies, the proportion of people suffering from depression in Vietnam tended to increase up to 2015, when it stood at 3%; since the COVID-19 epidemic became complicated in 2019, the rate has been higher [14]. It is estimated that almost 900,000 people die each year from depression. Therefore, the prevention and treatment of depression is an extremely important topic in today's society. Recognizing whether a person is depressed can be a big challenge, as the symptoms of depression may not be obvious. However, voice is one of the factors that we can observe to detect some signs of depression. When people are depressed, their voices can express discomfort, sadness, and feelings of fatigue. They may speak slowly, tremble, lose emotion in their voices, or have no energy to talk; some people may have a quieter or unclear voice, because they feel tired and do not want to communicate. In addition, people with depression may complain, self-stimulate, blame themselves, and show a loss of interest and motivation in their voices. They may also tend to avoid conversations, restrict communication, and stay away from social activities. This study aims to analyze a person's mood from voice information through deep learning, and ultimately to use the resulting deep learning model to infer which individuals may be depressed or tend toward depression, for treatment and prevention as early as possible.
Emotion databases play a crucial role in the development of emotion recognition models. It has always been important for people to communicate in a language, especially through speech, since language has played and continues to play a fundamental role in bringing people together. Distorted speech signals associated with different emotional states have been observed in studies in which speakers utter words under different mental states. Our research uses a method which employs audio parameters from distorted speech waves, each representing different emotions, as feature vectors; these vectors are then subjected to pattern recognition techniques with the aim of deriving the desired solution [15, 16]. Realizing the advantages of speech signals has encouraged researchers to search for effective ways to integrate speech into computer systems, since employing machines to process human emotions has the potential to enhance the performance of interactive systems. The development of verbal emotion detectors will be key to unlocking unparalleled interactive experiences between humans and computers, posing a new challenge to the research community. Our endeavor is to foster authentic interactions between humans and computers, thereby enabling computers to perceive emotions and engage in communication. Our aim is to facilitate emotion recognition, although the development of natural databases can be costly and is often constrained by limited resources; conversely, processed databases are easier to develop. This suggests the need to create an emotion dataset suitable for a specific cultural context like Vietnam [17]. A suitable dataset will help accurately capture and analyze the emotions of individual Vietnamese speakers. Moreover, such datasets are imperative resources, serving as the foundation for the development of applications ranging from emotional computing systems to virtual agents [18]. By addressing the unique emotions of Vietnamese speakers, these initiatives not only advance scientific understanding but also facilitate the development of more inclusive and effective technologies that cater to a diverse range of user groups. Emotion recognition in human-computer interaction is very important; however, most existing datasets are mainly in English, so creating an emotion recognition dataset in Vietnamese will help researchers study emotion recognition in this language [19]. In this study, we build a Vietnamese speech emotion dataset from movies and live shows. This Vietnamese Emotion Dataset is developed to solve the problem of a lack of data for research on emotion recognition in Vietnamese.
I.2 Research Questions
To further develop the current speech model that recognizes depression, the study focused on these three basic research questions:
(1) How can voice recognition be used to assess and track depression indicators, e.g., tone, speaking speed, distress, etc., of the speaker?
(2) How can we build a voice recognition model capable of recognizing signs of depression in the voice, and thus assess the diagnosis of the symptoms of the disease?
(3) How can we create a Vietnamese Speech Emotion Dataset?
CHAPTER II OVERVIEW
II.1 Psychological Foundations
The word "Melancholy" was first used by the ancient Greeks to describe feelingintensely sad and despair, to be more familiar in modern times, this word can be called
"Depress" The mental illness of depression, considered to be one of the most prevalentand serious mental illnesses, severely impairs an individual’s quality of life, markingthem with feelings of sadness, hopelessness, and loss of interest in enjoyable activities.The diagnosis of schizophrenia requires at least five or more of these symptomsoccurring for at least two weeks: depressed mood, diminished interest/pleasure,significant weight loss or gain, insomnia, hypersomnia, psychomotor agitation, fatigue,feelings of worthlessness, diminished concentration, and recurrent thoughts of suicide ordeath [20] Various factors are involved in the origins and causes of depression, includinggenetic, biological, environmental, and psychological factors [21] With a lifetimeprevalence of around 15-20%, depression contributes significantly to the global burden ofdisease and affects individuals of all ethnicities, ages, and genders [22] The WorldHealth Organization (WHO) reports that depression causes more than 300 million peoplearound the world to suffer from disabilities [23] Depression disorder is a commonpsychiatric disorder in Vietnam as well as in the world, it affects the quality of life of thesick person, in the severity of the patient can be affected to life due to the high risk ofsuicide In typical seizures, there are manifestations of inhibition of the entire mentalactivity Patients with a sad and pleasant mood, reduced interest in pleasure, feeling darkfuture, slow thinking, difficult association, self-confidence reduced, often paranoid guilt,leading to patients with thoughts of suicide and thoughts of death occurred in more thantwo-thirds of these patients, this is the leading cause of suicides, 10-15% suicide, in 2006
in Europe there were at least 59,000 successful suicide patients and this risk is presentthroughout the pathological process According to the study of the mental health program
by the American Veterans Fund in Vietnam (VVAF), the rates of depression and anxietyare the most common problems, in Da Nang there are 18.3% of adults with the diseaseand most depressive disorders are treated with chemotherapy But as we already know,antidepressants have many side effects such as headaches, dizziness, anxiety, breastfatigue, hand tremors, constipation some medications when taken in the early stagesmay increase the risk of suicide in patients In the meantime, some studies have shownthat when antidepressants and cognitive behavioral therapy are combined in the treatment
of depression, good results are more than 71%, if the treatment is only 43%, and theanti-depressant alone is more than 61% Behavioral activation therapy is a simple,time-effective, and cost-saving therapy This is a widely used therapy in the U.S., which
Trang 18is part of cognitive behavioral therapy, based on the underlying theory of behavior andthe current evidence to constitute behaviors that can trigger the mechanism of cognitionally-behavioral change in clinical depression From that we found that in Vietnam weneed to apply this behavioral activation therapy in combination with medication topatients to overcome some of the above disadvantages Da Nang City with the primarysupport of the American Veterans Fund in Vietnam, in the public health care programfrom 2009 to 2011 We have applied behavioral activation therapy in combination withantidepressants to treat depressed patients in 5 communities in the city Accompanyingthe community research, we conducted at the hospital with the application of behavioralactivation therapy for all patients with depression when examined at the clinic with thetheme: “Study of clinical characteristics and treatment of depression patients withbehavioral activation Therapy at Da Nang City Psychiatric Hospital” In Vietnam there is
no research work on this subject Depression is one of the most common mental illnesses
in the world, affecting millions of people each year Another example: according tostatistics from the Ministry of Health and Welfare [24], the number of people takingantidepressants in Taiwan is increasing year by year over time As of Taiwan’s 2017thyear, statistics have exceeded 1.3 million people And at the same time, the percentage ofyoung people under the age of 30 continues to slow down, as in Table 1 below, and it hasexceeded 10% in 2016 This percentage suggests that it means that the population withdepressive depression in Taiwan is slowing down, and is tending to grow younger.Although the number of depressed patients is increasing, on the other hand, the numbers
of people willing to receive treatment continue to increase This is a positive sign thatpeople’s awareness of their mental health status is increasing Because in the past,depression was considered a sensitive and less-mentioned issue, but now it has become atopic of concern So, to overcome the symptoms of depression we have used theknowledge of deep learning to create a system that can infer whether a person is prone todepression or depression, thus to apply to our lives The most direct and convenientassessment method to detect depression or depression in a person’s voice is to use voiceanalysis technology to assess emotions To achieve high accuracy in assessment, it isnecessary to collect characteristics of voice information through complex pre-processingprocesses or in coordination with other information such as images, facial expressioninformation and many other factors Voice analysis technology is developing rapidly tohelp detect and evaluate emotions from voice Furthermore, the use of diverse data from avariety of sources is also a way to increase the accuracy and efficiency of voice analysissystems
Taiwan | 2018 | 2017 | 2016 | 2015 | 2014
Sum | 1,397,197 | 1,330,204 | 1,273,561 | 1,121,659 | 1,194,395

Table 1: Statistics on the number of people using antidepressants in Taiwan, from the Ministry of Health and Welfare Bureau of Statistics [24]
Even though effective treatments are available, less than half of those affected worldwide receive them [23]. Depression can be alleviated by improving access to evidence-based interventions. Clinical studies have demonstrated the efficacy of Cognitive Behavioral Therapy (CBT) and pharmacological treatments for treating acute depressive episodes and preventing relapses, with CBT showing particular promise in maintaining long-term benefits [25]. Depressive disorders call for profound compassion toward those suffering from their relentless and debilitating effects, and significant personal and societal costs will need to be alleviated through ongoing research and improved services. Although melancholy has perennially plagued humankind, hope persists in the progress made and the promise of better understanding and caring for those in its thrall. According to current statistics, up to 80% of the world's population will experience depression at some point in their life.
● Causes: regarding the cause, there is still no truly clear answer.
● Endogenous: there are many theories involving genetics from parents or family members, the surrounding environment, society, and autoimmune factors.
● Stress-induced depression: due to pressure from many sides, such as family, work, or children, or unexpected events with severe consequences, such as the loss of loved ones or loss of money.
● Depression can also occur after illnesses or injuries that cause serious direct damage to the brain.
● Loss of concentration: although it is normal to forget the name of a person or a task to do, if this happens often and concentration declines, reducing work efficiency, it could be due to depression. Being depressed leads to more error-prone and more difficult decisions.
● Changes in sleep: sleep disturbance can be considered one of the main symptoms; some people sleep too much, others too little.
● Changes in appetite: some people may eat more than usual, while others lose interest even in favorite foods. Either way, a dramatic change in appetite and weight (more than 5% of body weight in a month) could be a sign of depression.
● Irritability, agitation, and moodiness: increased irritability, agitation, and moodiness are also recognizable signs. Even little things can become upsetting, such as loud noises or long waits, sometimes accompanied by anger or self-harming thoughts.
Depression classification and analysis:
Depression may be divided into major depressive disorder, persistent depressive disorder, and seasonal affective disorder. Major depression has the most serious symptoms: the patient falls into a depressed emotional state for a long time (possibly more than two weeks), with persistently negative thoughts, depressed feelings, emptiness, worthlessness, eating disorders, loss of concentration, and slow reactions; such patients may think of death or suicide as a way to free themselves, and in the end about 3.4% of patients actually die by suicide [26]. Persistent depressive disorder is basically similar to major depressive disorder, but it is usually milder and lasts longer than major depression [27]; it can take more than 2 years to be diagnosed, and the disease can last for more than 10 years or even accompany the patient for life. Persistent depression also carries a high risk of developing into major depression: studies have shown that 79% of patients can develop major depressive complications, a condition known as double depression. The last type is seasonal depression, which, as the name suggests, mainly occurs in the cold season and is more common in high-latitude areas, with a higher incidence when the weather is cold, since time in the sun is reduced and more sun exposure can reduce the risk of the disease [28]. The World Health Organization (WHO) has listed depression as one of the three most serious diseases affecting the world, comparable in severity to cardiovascular diseases or AIDS. Psychosis in Taiwan, according to statistics, is a severe or persistent psychological disorder and tends to affect increasingly younger people; this may be due to changes in society and environment, with the young population under great pressure and possibly having problems with family, friends, and outside relationships. Since depression is not just a psychological problem but can be associated with personality and physical condition, it can be treated with medication accompanied by psychotherapy; with proper care and treatment, 70-85% of patients can significantly improve their condition [29].
However, with the advancement of artificial intelligence (AI), it is possible to apply this technology to treat depression more effectively. One potential application of AI in the treatment of depression is the automatic assessment of patients' symptoms: AI systems can analyze data on symptoms of depression, including changes in a patient's mood, eating habits, and sleep, to help diagnose the disease and provide appropriate treatment. In addition, AI can assist health professionals in managing depression patients; AI systems can monitor patient health indicators, give advice on treatments, and support health care decisions. This enhances the quality of care and minimizes the risks associated with errors in the care process. Finally, AI can also help in the early detection of depression: AI systems can analyze data from high-risk patients to detect early signs of depression and warn patients and health professionals. In short, the use of AI in the treatment of depression can bring many benefits to patients and healthcare professionals, and applications of AI in the treatment of depression are being researched and developed to help improve healthcare quality [30]. Through this research, we aim to obtain useful information for diagnosing depression and to detect signs of depression more accurately and quickly. Using deep learning techniques, this research assesses the probability that an individual has a tendency toward depression by analyzing the frequency and amplitude of the voice. Our objective is to identify the speaker's specific emotional characteristics and then determine whether they are likely to experience depression based on those characteristics. Deep learning allows us to assess and analyze the emotional characteristics of a voice in a more comprehensive and accurate manner; we evaluate specific emotional characteristics of the human voice, including speed, volume, rhythm, and tone.
II.2 Period Model
When traditional machine learning methods are used to recognize voice emotions, the most common approach is the use of classifiers. The core concept is to build a classification model from a large number of known samples and assign each input to the right category; the model can then classify correctly when new data appears. Common classifiers include the Support Vector Machine (SVM), Gaussian Mixture Models (GMM), and the Hidden Markov Model (HMM). According to research in [31], basic emotions can be described by a two-dimensional space spanned by Arousal and Valence, as in Figure 2 below. Arousal represents the intensity of calm or excitement, while Valence represents positive and negative emotional tendencies. For example, the two positive emotions in the figure are Joy and Relaxation: from their locations on this plane, we can see that both are positive emotions, but Joy has a higher level of arousal than Relaxation.

Figure 2: A flat diagram of the two-dimensional arousal-valence emotional model

The Period model is a deep learning model used to analyze voice signals and detect depression. It is based on the cyclical nature of the voice, detecting changes in volume and frequency. The model uses multi-layer analysis techniques to decompose voice signals into different frequency components, then applies algorithms that extract information from the vocal cycles to detect variations in the volume and frequency of the voice. Applications of the Period model in the detection of depression include: analyzing the frequency components of the voice to identify voice characteristics associated with depression, detecting variations in the volume and frequency of the voice to track changes in the mood of a depressed patient, and using algorithms that extract information from the vocal cycle to predict the patient's depression.
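To make the classifier-based approach described at the start of this section concrete, the following is a minimal sketch, assuming scikit-learn: an SVM fitted on pre-computed per-utterance feature vectors. The random arrays are placeholders standing in for real features and emotion labels.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one feature vector per utterance (e.g. averaged MFCCs); y: emotion labels.
rng = np.random.default_rng(0)
X = rng.random((100, 40))                                # placeholder features
y = rng.choice(["anger", "joy", "sadness"], size=100)    # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # scale, then SVM
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))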
II.3 An Overview of Emotion Datasets
Depending on how emotions are expressed, emotion datasets can be categorized into three types [32-34]: simulated/acted, elicited, and natural. Of the three, acted datasets account for the majority of those collected; this category is the largest, and recording is usually carried out by professional actors. Typically, three emotion corpora are used in the emotion recognition problem: the Berlin Emotion Speech Database (EmoDB), the Chinese Emotions Dataset, and the Danish Emotion Corpus; each corpus is collected from the expressions most widely used in emotion recognition problems. EmoDB is a database of emotional speech recordings containing 533 tracks from ten actors (five men, five women) describing seven emotional states, including anger, boredom, anxiety, fear, disgust, and neutral feelings [35]. There are fifty linguistically neutral sentences in the CASIA Chinese emotion corpus describing six emotions: anger, fear, happiness, neutrality, sadness, and surprise. This corpus contains two hours of spontaneous emotional segments extracted from 219 speakers in films, television plays, and talk shows; because of the large number of speakers, it is an outstanding addition to existing emotional databases [36]. The Danish emotional speech corpus contains semantically neutral utterances recorded by four subjects experiencing four different emotions: anger, happiness, surprise, and sadness [37]. Even though these datasets are widely utilized, they have certain limitations. The emotions are recorded by actors who do not act within an appropriate context, which makes the authenticity of the recordings suspect: speaking can be poor and somewhat awkward, and emotions are expressed in separate sentences that are not connected in any way. In comparison with data recorded in real life, emotions recorded in the laboratory appear unnatural.
A number of other common speech databases are also available [38], the most commonly used of which are IEMOCAP and DAIC-WOZ, both sampled at 16 kHz. The IEMOCAP database [39] contains various types of information: audiovisual recordings of the actors' facial expressions and movements, together with verbatim transcripts, were created at the University of Southern California's SAIL Laboratory, and approximately 12 hours of audiovisual data are included. The current version of the database includes recorded data from ten participants and covers six emotions: anger, excitement, frustration, happiness, sadness, and neutrality. A relatively recent interview voice database, DAIC-WOZ is a benchmark interview voice dataset for depression detection developed at the University of Southern California, in which a virtual agent, Ellie, serves as a virtual psychologist interviewing participants. The dataset currently comprises approximately 200 interviews, with recording times ranging from 7 to 33 minutes (on average 16 minutes), consisting of 189 clinical interviews between an interviewer and a patient; 30 of the 107 interviews in the training set and 12 of the 35 interviews in the development set are classified as depressed [40, 41]. Along with audio recordings, accompanying data include verbatim transcripts of responses, timestamps for the beginning and end of sentences, facial expressions, eye-contact information, etc. In the DAIC-WOZ database, emotions are classified fairly simply, and depression status is determined by the score on the PHQ-8 scale [42].
Natural databases include video clips and audio clips from TV programs, radio stations, or call centers; these recordings are called spontaneous speech recordings. Examples of real-life situations include cockpit recordings under unusual conditions, conversations between patients and doctors, emotional conversations in public places, and others [43]. Morrison et al.'s Chinese emotion dataset and the French medical emergency corpus [44] both obtained data through call centers. Morrison et al. used an automated telephone system to collect data from Chinese-speaking consumers; such systems have speech recognition units that process user requests based on spoken language. Voice emotion recognition systems can be used to handle calls based on the perceived urgency of the call: if the automated system detects that a caller appears confused or angry, it may route the call to an operator for assistance, and the system can monitor voicemail messages in the switchboard and prioritize them according to sentiment [45]. The French medical emergency corpus, derived from real-life call center conversation records, provides insight into emotion recognition. A set of naturally occurring dialogues was recorded in a real-world call center; approximately 20 hours of transcribed data are included in the corpus, comprising 404 agent-caller dialogs from six different agents and 404 callers, and about 10% of the transcribed speech data are omitted due to overlapping voices [46]. A medical advice service is provided through this call center. These data are used in accordance with ethical conventions and agreements, ensuring caller anonymity and privacy as well as non-propagation of the corpus. An overview of the corpus's characteristics is provided in Table 2.
#clients | 404 (152 Male, 266 Female)

Table 2: Corpus characteristics: 404 agent-client dialogs of around 10 hours, 19k speaker turns [45]
We must also mention the EmoTV dataset, a collection of 51 video clips recorded from French television channels as part of the HUMAINE project. Various topics (politics, religion, sports, etc.) are covered in the interviews, which were conducted in France. The French corpus includes 48 subjects; the total duration is 12 minutes and the lexicon size is 800 words (total number of words: 2500) [47]. Additionally, there are datasets such as "Baby Ears" [48], collected using two types of experimental data: in the first experiment, audio data was collected from parents conversing with their children; in the second, adult listeners judged whether each utterance expressed approval, attention seeking, or prohibition, in addition to rating the message's strength. Rao & Koolagaudi's dataset is quite different from the others in that it includes both natural and acted elements [49]. Their Hindi speech corpus was collected from five different geographical regions of India (central, eastern, western, northern, and southern), representing the five dialects of Hindi. To obtain speech data, five men and five women spoke each dialect in isolation; random questions were posed to each speaker asking them to describe their childhood, hometown, and childhood memories, and each dialect's recordings lasted approximately 1-1.5 hours. The acted part of the database consists of 10 professional artists (5 males and 5 females) from All India Radio (AIR) Varanasi, India, recorded for this purpose; in creating this database, eight emotions were taken into account: anger, disgust, fear, joy, neutrality, sadness, sarcasm, and surprise. Beyond the datasets mentioned above, several other popular datasets cover a wide variety of languages, such as the stress dataset of Scherer et al., in which approximately 100 speakers (25 native German speakers, 16 native English speakers, and 59 native French speakers) were recorded using a computer induction tool; each speaker read a total of 100 sentences and gave some spontaneous responses [48]. There is also the INTERFACE dataset, which captures speech in multiple languages (English, Spanish, French, Slovenian) [50], and the "you stupid tin box" dataset, which focuses on speech produced by children in English and German. The German data was collected from 51 children (ages 10-13; 21 males, 30 females) [50] from two different institutions ('Mont' and 'Ohm'). Thirty native English speakers, aged between 4 and 14, participated in the E1 and E27 data collections; the recordings were undertaken in a specially designed multimedia studio at CETADL, located within the Department of Electronic, Electrical, and Computer Engineering (EECE). The total duration of those recordings is approximately 8.5 hours, corresponding to just over 1.5 hours of speech once silences, pauses, and 'babble' have been removed. In addition to the databases mentioned above, there are also elicited databases. Different from natural and acted databases, an elicited speech emotion database is one where emotions are created in artificial emotional situations, without any prior knowledge about the speaker [34]. The Chinese elicited dataset of Yuan et al. consists of a total of 288 sentences collected from nine speakers describing anger, fear, happiness, and sadness [52]. For each sentence, four listeners were asked to rate the emotion type, choosing among neutral, anger, fear, happiness, or sadness.
Emotion Classes | Anger, Sadness, Astonishment, Fear, Happiness, Neutral
Speakers | 23 male, 12 female (ages 22 to 58)
Sampling Rate | 16000 Hz, 16-bit, 1 channel

Table 3: Details of the Emotional Oriya speech database [53]
The Department of Computer Science and Applications of Utkal University of Orissa has also developed the Oriya emotional speech dataset. There are 35 speakers, aged 22 to 58, speaking Oriya, who come from different Oriya-speaking areas of the state of Orissa [53]. Based on their emotional state, speakers read the text in a very natural voice; the recorded speech is monitored, and the speaker re-reads an emotional sentence if the desired emotion is not conveyed when the recording takes place [54, 55]. Oriya drama scripts are used for the texts, including "MO PEHENKALI BAJAIDE" (/mo pehenkAli bajaide/), "EDUNIA CHIDIA KHANA" (/e duniA chidiA khanA/), etc. In a laboratory environment containing significant noise, the data is digitized at 16000 Hz. A summary of the Oriya emotion corpus is provided in Table 3.

A dataset for emotion recognition through sound in Vietnamese has not previously been developed or researched anywhere in the world. Having recognized this gap, we decided to contribute to this field by building the first Vietnamese emotion dataset incorporating a wide range of emotions. This represents both an important step forward for artificial intelligence research and development and a step toward exploring the unique emotional characteristics of the Vietnamese language. The properties of other prominent emotional speech databases are listed in Table 4 below.

Dataset | Language | Type | Description | Emotions
Berlin EmoDB [21] | German | Acted | 533 tracks from ten actors (five men, five women) | 7 emotional states: anger, boredom, anxiety, fear, disgust, neutral
CASIA Chinese emotion corpus [22] | Mandarin | Acted | Two hours of spontaneous emotional segments extracted from 219 speakers from films, television plays and talk shows | 6 emotional states: anger, fear, happiness, sadness, neutrality, surprise
The Danish emotional speech corpus [23] | Danish | Acted | Four subjects experiencing four different emotions | anger, happiness, surprise, sadness
IEMOCAP Database | English | Acted | Audiovisual recordings of the actors' facial expressions, movements, and verbatim manuscripts; approximately 12 hours of audiovisual data from ten participants | anger, excitement, frustration, happiness, sadness, neutral
DAIC-WOZ Database | English | - | Approximately 200 interviews, with recording times ranging from 7 to 33 minutes (on average 16 minutes) | neutral, sadness, anxiety
Morrison et al. [29] | Mandarin | Natural | 388 utterances, 11 speakers | anger, neutral
Baby Ears dataset [34] | English | Natural | 509 utterances, 12 actors (6 males, 6 females) | approval, attention, prohibition
French medical emergency corpus [32] | French | Natural | 10-hour subset comprising dialogs (6 different agents, 404 callers) | anger, fear, joy, sadness, disgust, surprise
Rao & Koolagaudi [35] | Hindi | Natural and acted | 5 females and 5 males, sentences uttered based on their memories | anger, disgust, fear, happy, neutral, sadness, surprise, sarcastic
EmoTV [33] | French | Natural | 51 video clips recorded from French TV channels containing interviews on 24 different topics; 48 subjects, total duration 12 min (average length 14 s per clip), lexicon size 800 words (total number of words: 2500) | anger, disgust, fear, embarrassment, sadness
Scherer et al. [36] | English and German | Natural | 100 native speakers | stress and load level
INTERFACE [37] | Slovenian, English, Spanish, French | Acted | English 186 utterances, Slovenian 190 utterances, Spanish 184 utterances, French 175 utterances | anger, disgust, fear, joy, slow neutral, fast neutral, surprise, sadness
You stupid tin box [38] | German, English | Elicited | 51 children | anger, boredom, joy, surprise
Yuan et al. [39] | Chinese | Elicited | 9 native speakers | anger, fear, joy, neutral, sadness
Oriya emotional corpus [40] | Oriya | Elicited | 35 speakers (23 male, 12 female), texts taken from various Oriya drama scripts | anger, sadness, astonishment, fear, happiness, neutral
VNEMOS (Ours) | Vietnamese | Natural and acted | 250 segments, approximately 30 minutes, from 27 movies, movie series and live shows | anger, sadness, happiness, neutral, anxiety

Table 4: Characteristics of other popular Emotional Speech Datasets
II.4 Related Work
In 2014, E. Yuncu used a binary decision tree consisting of SVM classifiers to classify seven emotions on the EmoDB database [56], reaching a highest accuracy of 82.9%. In 2016, XingChan Ma used a neural network combining CNN and LSTM [57], which achieved a recognition rate of 68% on DAIC-WOZ. In 2018, Haytham used the IEMOCAP database with a CNN-RNN structure [58] and achieved a recognition rate of 64.78%; in the same year, S. Tripathi used a three-layer LSTM architecture for emotion discrimination on IEMOCAP [59] and achieved a recognition rate of 71.04%. In 2023, Toyoshima et al. proposed a speech emotion recognition model based on a multi-input deep neural network that simultaneously learns two audio features, a spectrogram and GeMAPS features, using a CNN-DNN architecture, achieving a 61.49% accuracy rate [60]. The comparison of related works is outlined in Table 5.
Table 5: Comparison of related works
B: anger, happiness, sadness, neutral
C: anger, happiness, neutral, sadness, silence
D: anger, boredom, disgust, fear, happiness, sadness, neutral
CHAPTER III BACKGROUND
III.1 What is speech recognition?
Speech recognition, also known as automatic speech recognition (ASR), is the capacity of a program to transform human speech into written text. While voice recognition and speech recognition are frequently conflated, speech recognition focuses on translating speech from a verbal format to a text format, whereas voice recognition solely aims at identifying a particular user's voice.
Key characteristics of effective speech recognition: there are numerous speech recognition software packages and devices available, but the most modern ones employ artificial intelligence and machine learning. To interpret and analyze human speech, they incorporate syntax, vocabulary, the structure of audio, and signals from voices. They are designed to learn as they go, adapting their responses with each interaction. The best systems also allow organizations to customize and adapt the technology to their specific requirements, everything from language and nuances of speech to brand recognition. For example:
● Language weighting: improve precision by weighting specific words that are spoken frequently, beyond terms already in the base vocabulary.
● Speaker labeling: output a transcription that cites or tags each speaker's contributions to a multi-participant conversation.
● Acoustics training: attend to the acoustic side of the business; train the system to adapt to an acoustic environment and to speaker styles (voice pitch, volume, etc.).
● Profanity filtering: use filters to identify certain words or phrases and sanitize speech output.
Figure 3: Voice recognition
Speech recognition algorithms
The vagaries of human speech have made development challenging; speech recognition is considered one of the most complex areas of computer science, involving linguistics, mathematics, and statistics. Speech recognizers are made up of a few components: the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.
Speech recognition technology is evaluated on its accuracy rate, i.e., word error rate (WER), and on its speed. A number of factors can impact the word error rate, such as pronunciation, accent, pitch, volume, and background noise. Reaching human parity, meaning an error rate on par with that of two humans speaking, has long been the goal of speech recognition systems. Various algorithms and computation techniques are used to convert speech into text and to improve transcription accuracy. Below are brief explanations of some of the most commonly used methods:
- Natural language processing (NLP): while NLP [61] is not necessarily a specific algorithm used in speech recognition, it is the area of artificial intelligence that focuses on the interaction between humans and machines through language, both speech and text. Many mobile devices incorporate speech recognition into their systems to conduct voice search (e.g., Siri) or to provide more accessibility around texting.
- Hidden Markov models (HMM): hidden Markov models build on the Markov chain model, which stipulates that the probability of a given state hinges on the current state, not its prior states. While a Markov chain model is useful for observable events, such as text inputs, hidden Markov models allow us to incorporate hidden events, such as part-of-speech tags, into a probabilistic model. They are utilized as sequence models within speech recognition, assigning labels to each unit (words, syllables, sentences, etc.) in the sequence. These labels create a mapping with the provided input, allowing the model to determine the most appropriate label sequence.
- N-grams: this is the simplest type of language model, which assigns probabilities to sentences or phrases. An N-gram is a sequence of N words; for example, "order the pizza" is a trigram or 3-gram and "please order the pizza" is a 4-gram. Grammar and the probability of certain word sequences are used to improve recognition and accuracy (see the counting sketch after this list).
- Neural networks: primarily leveraged for deep learning [62] algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold), and an output. If the output value exceeds a given threshold, it "fires", or activates, the node, passing data to the next layer in the network. Neural networks [63] learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent. While neural networks tend to be more accurate and can accept more data, this comes at a performance-efficiency cost, as they tend to be slower to train than traditional language models.
- Speaker diarization: speaker diarization algorithms identify and segment speech by speaker identity. This helps programs better distinguish individuals in a conversation and is frequently applied in call centers to distinguish customers from sales agents.
- Multi-layer perceptron (MLP): a perceptron can be considered a fully connected layer of a traditional artificial neural network (ANN), and the term MLP is used loosely to mean any traditional feed-forward ANN, or sometimes only networks consisting of multiple layers of perceptrons. The term "multi-layer perceptron" does not refer to a single perceptron that has multiple layers [64]; instead, it denotes multiple perceptrons organized into layers, so an alternative name is "multi-layer perceptron network." Furthermore, the "perceptron" in MLP is not a perceptron in the strictest sense: a perceptron is officially a special case of an artificial neuron using a threshold activation function such as the Heaviside step function, whereas an MLP neuron can use arbitrary activation functions. A perceptron performs binary classification, while an MLP neuron can freely perform classification or regression, depending on its activation function. The term "multi-layer perceptron" is thus applied without taking into account the nature of the nodes/layers, which may consist of arbitrarily defined artificial neurons rather than perceptrons in the strict sense.
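To make the N-gram idea from the list above concrete, the following is a minimal counting sketch on a toy corpus; a real language model would be trained on a large corpus and use smoothing.

from collections import Counter

corpus = "please order the pizza please order the salad".split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_next(w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_next("order", "the", "pizza"))  # 0.5 in this toy corpus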
III.2 Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of neural network architecture used mainly in computer vision and image processing. It is designed to learn the characteristic features of image data automatically from training data. CNNs use convolutional layers to extract features automatically from image data by applying filters that scan across the input images; these filters help identify features such as edges, corners, or other simple structures. Pooling layers are then used to reduce the size of the extracted feature maps, which reduces computational complexity and increases the efficiency of the network. Finally, fully connected layers connect the extracted features to the output layers to predict the labels of the input image, allowing CNNs to classify, identify, or segment images. Figure 4 below shows a typical basic CNN. CNNs have achieved success in many fields, such as object recognition in photos, facial recognition, biomedical image processing, and applications in autonomous vehicles, artificial intelligence, and many other areas. Thanks to their automatic learning capabilities and scalability, CNNs have become an important tool in computer vision and image processing.
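As an illustration of the convolution, pooling, and fully connected pipeline just described, the following is a minimal sketch assuming the Keras API; the input shape and class count are illustrative assumptions (a small spectrogram-like patch and five emotion classes), not the exact architecture evaluated in this research.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),          # e.g. a small spectrogram patch
    layers.Conv2D(16, 3, activation="relu"),  # filters learn local features
    layers.MaxPooling2D(),                    # shrink the feature maps
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),      # fully connected layer
    layers.Dense(5, activation="softmax"),    # e.g. five emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()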