VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF SOCIAL SCIENCES AND HUMANITIES
FACULTY OF ENGLISH LINGUISTICS AND LITERATURE
THE APPLICATION OF AUTOMATIC SPEECH RECOGNITION TO STUDENTS’ NEW WORD
NGUYỄN THÚY NGA, Ph.D
HO CHI MINH CITY, OCTOBER 2021
ACKNOWLEDGEMENTS
Firstly, I would like to express my deepest gratitude to my thesis supervisor, Dr. Nguyễn Thúy Nga, for her precious guidance, strong support, helpful criticism, and instructive comments on my work. Without her help and encouragement, this M.A. thesis would not have been successfully accomplished.
Next, I would like to thank Assoc. Prof. Dr. Tô Minh Thanh for her teaching of pronunciation knowledge, which directly related to my thesis. I am also grateful to the other lecturers of the master's programme in TESOL at the Faculty of English Linguistics and Literature of VNUHCM - University of Social Sciences and Humanities for their devoted guidance on my academic journey.
Then, I would like to thank Ms. Ngô Thị Lan Hương, Director of English for New Generation Language Centre in Bien Hoa City, and all of the participants who facilitated the process of accomplishing my thesis.
Last but not least, I would love to send my thanks to my beloved family, especially my wife, who absolutely believed in me and encouraged me to complete the programme.
STATEMENT OF ORIGINALITY
I hereby certify that this thesis entitled
“THE APPLICATION OF AUTOMATIC SPEECH RECOGNITION TO
Submitted in terms of the Statements of Requirements for Theses in Master’s Programmes issued by the Higher Degree Committee, is my own work.
This thesis has not been submitted for the award of any degree or diploma in any other institution.
Ho Chi Minh City, 2021
PHẠM HƯNG THỊNH
RETENTION OF USE
I hereby state that I, Phạm Hưng Thịnh, being a candidate for the degree of Master of Arts in TESOL, accept the requirements of the university relating to the retention and use of Master’s Theses deposited in the University Library.
I agree that the original of my Master’s Thesis deposited in the University Library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the library for the care, loan, and reproduction of theses.
Ho Chi Minh City, 2021
PHẠM HƯNG THỊNH
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
STATEMENT OF ORIGINALITY
RETENTION OF USE
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1 Background to the study
1.2 Statement of the problem
1.3 Aims of the study
1.4 Research questions
1.5 Rationale of the study
1.6 Scope of the research
1.7 Structure of the thesis
1.8 Summary of the chapter
CHAPTER 2: LITERATURE REVIEW
2.1 Definition of terms
2.1.1 Pronunciation
2.1.2 Word
2.1.3 Phoneme (sound)
2.1.4 Consonant
2.2 The application of technology in language teaching and learning
2.2.1 Mobile Assisted Language Learning (MALL)
2.2.2 Automatic speech recognition (ASR)
2.2.3 ELSA Speak
2.3 Attitude
2.3.1 Affect
2.3.2 Behavioural intention
2.4 Young learners
2.5 Review of previous studies
2.6 Conceptual framework
2.7 Summary of the chapter
CHAPTER 3: METHODOLOGY
3.1 Research questions
3.2 Research design
3.3 Research site
3.4 Participants
3.4.1 Starters class (main class)
3.4.2 Movers class (point of reference class)
3.5 Research tools
3.5.1 Test
3.5.2 Questionnaire
3.5.3 Interview
3.7 Procedure of data analysis
3.8 Validity
3.9 Reliability
3.10 Summary of the chapter
CHAPTER 4: RESULTS AND DISCUSSION
4.1 Research question 1: the effects of ASR (ELSA Speak) on students’ new word pronunciation
4.1.1 Perception
4.1.2 Production
4.2 Research question 2: students’ attitudes on the application of ASR (ELSA Speak) in their pronunciation learning
4.2.1 Emotion
4.2.2 Behavioural intention
4.3 Discussion of results
4.4 Findings
4.5 Summary of the chapter
CHAPTER 5: CONCLUSION
5.1 Conclusions
5.2 Suggestions
5.3 Limitations of the study
5.4 Recommendations for further study
5.5 Summary of the chapter
REFERENCES
APPENDICES
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
APPENDIX F
APPENDIX G
APPENDIX H
APPENDIX I
APPENDIX J
LIST OF ABBREVIATIONS
A(B) Attitude towards a Behavior
AI Artificial Intelligence
A(O) Attitude towards an Object
ASR Automatic Speech Recognition
CALL Computer Assisted Language Learning
CAPT Computer Assisted Pronunciation Training
CEFR Common European Framework of Reference for Languages
EFL English as a Foreign Language
ESL English as a Second Language
IT Information Technology
IVI iFlytek Voice Input
MALL Mobile Assisted Language Learning
TALL Technology Assisted Language Learning
TAM Technology Acceptance Model
TPB Theory of Planned Behavior
TRA Theory of Reasoned Action
LIST OF TABLES
Table 2.1 Frequency of Common Problematic Final Sounds
Table 2.2 Summary of Previous Studies
Table 3.1 General View on Pretest and Posttest
Table 3.2 Format of a Pretest/Posttest
Table 3.3 Comparison between the two Questionnaires
Table 3.4 Interpretation Scale of Cohen’s Kappa
Table 4.1 Result of Perception of Starters Class
Table 4.2 Result of Perception of Movers Class
Table 4.3 Result of Inter-Rater Reliability in Cohen’s Kappa
Table 4.4 Result of Production of Starters Class (main class)
Table 4.5 Result of Production of Movers Class (point of reference class)
Table 4.6 Summary of the Results of Research Question 1
Table 4.7 Result of Question 2 in the Questionnaire for two Classes
LIST OF FIGURES
Figure 2.1 Final Version of Technology Acceptance Model
Figure 2.2 Common Problematic Final Sounds
Figure 2.3 Conceptual Framework of the Research
Figure 3.1 Model of Research Design
Figure 3.2 Cyclical AR model based on Kemmis and McTaggart
ABSTRACT
While teaching English at a foreign language centre in Bien Hoa city, the researcher realized that many students had difficulties pronouncing new words by themselves despite having been instructed face-to-face. This observation is shared by Celce-Murcia, Brinton, and Goodwin (1996), who note that students are dependent on their teacher to learn new sounds. Therefore, the researcher decided to conduct research on the issue. It was an action research study which employed a mixed methods design with 15 participants from two classes and covered two main points: how ASR (ELSA Speak) affects students’ new word pronunciation and what attitudes students show towards the use of ASR (ELSA Speak) in their new word pronunciation. The main focus of the research was on four final alveolar sounds (/t/, /d/, /s/, /z/). It was found that the technology helped participants improve their pronunciation of the four target final sounds, although the improvement was generally not statistically significant. Another finding was that participants showed positive attitudes towards the use of the technology in their pronunciation learning. In addition, suggestions, limitations of the research, and recommendations for further studies were also mentioned.
CHAPTER 1: INTRODUCTION
This chapter introduced the background to the study, the statement of the problem, the aims of the study, the research questions, the rationale, and the scope of the study to provide readers with general information about the research.
1.1 Background to the study
The English language has become an important part of everyday life, not only around the world but also in Vietnam, especially in today's increasingly integrated world. It has become more and more popular at school, at work, on the internet, etc. As a result, the demand to learn English has been growing steadily, and more and more people are trying to learn English at home, at school, at language centres, etc. However, during the process of learning English, many Vietnamese learners find the language difficult, especially its pronunciation. Avery and Ehrlich (1995) explain that the difficulties in pronunciation may come from the differences between the sound systems of the two languages: the number of syllables in a word, the existence of consonant clusters, and the number of consonants. They also point out some errors that Vietnamese learners usually make when pronouncing English, including errors with final sounds.
1.2 Statement of the problem
The issue of students’ pronunciation could also be seen in the researcher’s classroom. While teaching a young learner class of English at a language centre in Bien Hoa city, the researcher noticed an issue: students depend too much on the teacher’s pronunciation when they learn new words in the lessons. The teacher pronounces new words several times for them to repeat. However, after a while, their pronunciation drifts away from the teacher’s model, especially in the final consonant sounds. Although some hard-working students do try to pronounce the words, it seems hard for them to do it well. Because of the limited lesson time, the researcher cannot always spend time helping them with their new word pronunciation. Moreover, students have little chance to get feedback from others on their pronunciation, which makes them worried about whether their pronunciation is accurate. This issue may lead to language anxiety. MacIntyre (2007, p.565) describes language anxiety as “the worry and usually negative emotional reaction aroused when learning or using an L2”. High language anxiety is likely to have a negative impact on learners’ performance in learning an L2 and to lead to avoidance of interaction. This may make it even more difficult for students to pronounce new words in English.
In this case, teachers need to find ways to teach pronunciation to their students, adapt methods to fit students and their needs, and help them practice effectively to overcome any problems they might have (Celce-Murcia et al., 1996). Therefore, the researcher believed that something should be done to help students practice pronouncing new words with the researcher’s support. The target was not to train them to become experts in pronunciation but to help them develop their own way of practicing and getting feedback when pronouncing new words. Moreover, the researcher hoped that, through the findings, some contributions could be made to the theory of teaching and learning English pronunciation. As a result, the researcher decided to do research on the issue.
1.3 Aims of the study
The research was conducted to discover the effects of automatic speech recognition (ASR) technology on students’ new word pronunciation, especially final alveolar sounds, not only in the classroom but also in other places and at different times. The application ELSA Speak, which is built on ASR technology, was suggested as it is now very popular, with more than 13 million users across the globe. Moreover, feedback (students’ emotion and behavioural intention to use the technology) was expected to help improve its future use. Finally, from what was found in the study, the researcher hoped to make a contribution to the theory and practice of teaching and learning English pronunciation.
Research question 2: What are students’ attitudes towards the use of automatic speech recognition (ELSA Speak) in their pronunciation learning?
Sub-questions:
- What is students’ emotion towards automatic speech recognition (ELSA Speak) in their pronunciation learning?
- What is students’ behavioural intention to use automatic speech recognition (ELSA Speak) in their pronunciation learning?
1.5 Rationale of the study
There might be many ways to help learners develop their new word pronunciation, including technology. Celce-Murcia et al. (1996) suggest 10 techniques and materials for teaching pronunciation, including recording learners’ production. This technique might be one of the best ways to apply technology to learning pronunciation. Nowadays, students are getting more and more familiar with using technology to learn, owing to the economic growth and the rise in living standards of the country in previous years. Electronic devices like smartphones, computers, laptops, and tablets can be found in the majority of families. Therefore, applying technology to help students seems to be in line with the trend. However, which kind of technology should be used and how to use it were difficult questions to answer.
With the development of technology, Artificial Intelligence (AI) is being used more and more. Many AI technologies, like big data, image recognition, and ASR, are considered to be applicable to English language teaching and learning. In terms of pronunciation, ASR is thought to be one of the most relevant technologies. It can be applied to many tasks related to pronunciation, like looking for information on the Internet, practicing word pronunciation, and giving feedback on one’s pronunciation. The function of giving automatic feedback appears to be useful, as early automatic feedback may help learners avoid fossilizing wrong pronunciation habits (Eskenazi, 1999). Consequently, applications and software have been introduced on the basis of ASR, like Google Search by Voice, Siri, Pronunciation Power, Rosetta Stone, ELSA Speak, etc.
Also, there have been a number of studies on the use of ASR in teaching and learning languages in general and English in particular. Most of them have been conducted in other countries around the world. However, as far as the researcher could find, studies on the same topic in the context of Vietnam were scarce. Many of the existing studies were conducted in Western countries on languages like English, Italian, and Dutch, which are members of the Indo-European language family (“Indo-European Language Family,” n.d.). This means that, to a certain extent, these languages have something in common. Some other studies were on languages of different areas like the Middle East and East Asia, which may have different features from Vietnamese. Vietnamese, by contrast, belongs to the Austro-Asiatic language family (“Austro-Asiatic Language Family,” n.d.), which may lead to more differences (or difficulties) when Vietnamese students try to learn English. Therefore, a study should be done to contribute to the use of ASR in teaching and learning English pronunciation in the context of Vietnam.
The contribution of this study could be in two different aspects: theory and practice.
In theory, the researcher hoped, through the findings of this study in the context of Vietnam, to provide some evidence on the effects of applying ASR to teaching and learning English pronunciation. Also, students’ feedback on the use of ASR may help other teachers or researchers anticipate feedback from their own learners.
On the practical side, the findings of the study were hoped to offer a solution to the issue of students’ pronunciation. They may suggest an effective way to teach and learn English pronunciation for other teachers and students. In addition, developers of applications or software can look at users’ feedback to make appropriate adjustments to their products.
1.6 Scope of the research
The scope of the study was limited to the pronunciation of new English words (isolated words only). The research focused on only one aspect of word pronunciation: the segmental aspect (consonants and vowels). Within the segmental aspect, the researcher paid special attention to four final alveolar sounds (/t/, /d/, /s/, /z/). The limited time available to conduct the research and the need to keep up with the time requirements of the syllabus prevented the researcher from going further. Also, the age and workload of the students might have limited how far they could go within such a fixed schedule.
1.7 Structure of the thesis
This thesis was divided into five chapters, each with its own role:
Chapter one provided a general view of the issue and the background which led to the study. It also presented the research questions and the scope of the study.
Chapter two presented the theoretical background to support the study. Operational terms were also defined to make clear what the researcher meant in the study. Moreover, previous studies related to the study were reviewed to find the gap and support the research. Finally, a conceptual framework was given as a guide for conducting the research.
Chapter three described the specific way the research was conducted, including the method, the participants, the site, the tools, the data collection procedure, the data analysis procedure, validity, and reliability.
Chapter four analyzed the data to produce results, which were then compared and contrasted with those of the previous studies in the literature review to arrive at the findings of the study.
Chapter five presented the conclusions of the study. It also presented suggestions, limitations of the study, and recommendations for future studies.
1.8 Summary of the chapter
The researcher presented the general background of the study, including the research site, the statement of the problem, the aims and the rationale of the study, and the research questions. Finally, the scope of the study was limited to four final alveolar sounds (/t/, /d/, /s/, and /z/) within isolated new words. In the next chapter, literature related to the study was introduced.
CHAPTER 2: LITERATURE REVIEW
In this chapter, the researcher reviewed the related literature. First, definitions of some operational terms were introduced on the basis of theoretical literature. Then, some related studies were reviewed. Finally, the research gap and the conceptual framework of the study were presented.
2.1 Definition of terms
Before turning to studies related to the issue, the researcher had to define some terms, as these definitions may influence the way readers understand what the researcher meant in the study.
2.1.1 Pronunciation
The term “pronunciation” is difficult to define as there are different points of view on it. Pronunciation is defined as “the way in which a particular person pronounces the words of a language” (“pronunciation,” n.d.). Unfortunately, it is not that simple. There are many levels of pronunciation: from the smallest level (sound or phoneme) through syllable, word, and phrase to sentence level or above. However, due to the scope of the study (isolated words only), the term “pronunciation” could be understood as the pronunciation of isolated words.
However, how should words be defined and understood within the scope of this research? The answers to these questions were given in the next definition.
2.1.2 Word
A word is defined as “a single unit of language that means something and can be spoken or written” (“word,” n.d.). In terms of pronunciation and the scope of the study, the features “meaningful” and “spoken” of this definition were employed. As a result, the term “word” or “isolated word”, in this study, could be understood as “a single unit of language which means something and can be spoken”.
In addition, according to Fromkin, Rodman, and Hyams (2011), words are combinations of one or more syllables, and each syllable includes one or more phonemes. Hence, when considering pronunciation at word level, we would take phonemes into consideration.
2.1.3 Phoneme (sound)
Richards and Schmidt (2010, p.432) define a phoneme as “the smallest unit of sound in a language which can distinguish two words”. This view is also shared by Roach (2009). In agreement, Ladefoged and Johnson (2011, p.34) add that “when two sounds can be used to differentiate words, they are said to belong to different phonemes.” Therefore, based on the agreement among these authors on the characteristics of sound and distinctness or differentiation, a “phoneme” in this study referred to the smallest unit of sound that can be used to distinguish or differentiate two words in a language.
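The definition above can be illustrated with a minimal sketch in Python. It is not part of the original study: the function name `is_minimal_pair` and the broad transcriptions are hypothetical choices made for illustration only. The sketch treats two words as being distinguished by a single phoneme when their phoneme sequences have the same length and differ in exactly one position, as in pairs that contrast only in their final alveolar sound.

```python
def is_minimal_pair(a, b):
    """Return True if two phoneme sequences differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


# Hypothetical broad transcriptions, chosen only for illustration.
bet = ["b", "ɛ", "t"]
bed = ["b", "ɛ", "d"]
bus = ["b", "ʌ", "s"]
buzz = ["b", "ʌ", "z"]

print(is_minimal_pair(bet, bed))   # True: only final /t/ vs /d/ differs
print(is_minimal_pair(bus, buzz))  # True: only final /s/ vs /z/ differs
print(is_minimal_pair(bet, bus))   # False: two positions differ
```

In this sketch, /t/ and /d/ count as different phonemes because replacing one with the other changes the word.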
Also, it is stated that “the sounds of all languages fall into two classes: consonants and vowels” (Fromkin et al., 2011, p.195). However, due to the nature of this research (focusing on consonants only), more explanations were given on consonants.
2.1.4 Consonant
There are many definitions of consonants. A consonant is defined as “a speech sound made by completely or partly stopping the flow of air through the mouth or nose” (“consonant,” n.d.). Also, Finegan (1994, p.34) claims that “consonants are sounds produced by partially or completely blocking air in its passage from the lung to the vocal tract”. From the same point of view, Fromkin et al. (2011, p.195) claim that “consonants are sounds produced with some restriction or closure in the vocal tract that impedes the flow of air from the lungs”. What these points of view have in common is the element of partly or completely blocking the air on its way from the lungs to the vocal tract. Consequently, the term “consonant”, in the scope of this study, could be understood as a sound that is produced by partly or completely blocking or stopping the air on its way from the lungs to the vocal tract.
So far, the definition of consonants has already been stated. However, there are still other aspects of consonants that should be taken into account: place of articulation, manner of articulation, and voicing.
2.1.4.1 Place of articulation
Fromkin et al. (2011) classify consonants according to two aspects: the position in the vocal tract and the airflow restriction. The position in the vocal tract can, in other words, be understood as the place of articulation. It refers to the movement of the tongue and lips that leads to the production of different consonants, including bilabials, labiodentals, interdentals, alveolars, palatals, velars, and glottals.
Among these places of articulation, the research only focused on alveolar sounds, which were defined below.
Alveolar sound
There have been a number of definitions for alveolar sounds. An alveolar sound is understood as “a speech sound made with the tongue touching the part of the mouth behind the upper front teeth” (“alveolar,” n.d.). Also, Fromkin et al. (2011, p.570) define an alveolar sound as “A sound produced by raising the tongue to the alveolar ridge, e.g., [s], [t], [n]”. In a more specific view, Roach (2009, p.9) states that:
The alveolar ridge is between the top front teeth and hard palate. You can feel its shape with your tongue. Its surface is really much rougher than it feels, and is covered with little ridges. You can only see these if you have a mirror small enough to go inside your mouth, such as those used by dentists. Sounds made with tongue touching here (such as /t/, /d/, /n/) are called alveolar. (Roach, 2009, p.9)
Among these views on alveolar sounds, the definition by Roach (2009) was adopted because it is the most specific in terms of the physical shape of the mouth, which makes it easier for readers to picture. There are six alveolar sounds in total: /t/, /d/, /s/, /z/, /n/, and /l/ (Roach, 2009).
2.1.4.2 Manner of articulation
Fromkin et al. (2011) also classify consonants according to whether the airstream from the lungs to the outside of the mouth and the nose is partially or completely blocked. This is called the manner of articulation, and consonants are divided accordingly into stops, nasals, fricatives, affricates, glides, and liquids. From another point of view, Richards and Schmidt (2010, p.351) define manner of articulation as “the way in which a speech sound is produced by the speech organs”. Of the two definitions, the one by Fromkin et al. was adopted as it is more specific in terms of the physical movement of the airstream and the way it is changed (partly blocked or completely blocked).
2.1.4.3 Voicing
Apart from the classification of consonants by place and manner of articulation introduced by Fromkin et al., Roach (2009) identifies consonants according to whether the vocal folds vibrate or not. The vibration of the vocal folds can differ from one consonant to another in three ways: intensity (high or low), frequency (rapid or less rapid), and quality (harsh, breathy, murmured, or creaky). This way of identifying consonants is called voice or voicing. In addition, Richards and Schmidt (2010, p.630) classify voice into two categories: voiced and voiceless. They explain that speech sounds produced with the vocal cords vibrating are called “voiced”, which can be felt by touching the neck in the region of the larynx, while those produced without the vocal cords vibrating are called “voiceless”. Among these ways of defining voice or voicing, the definition by Richards and Schmidt was employed as it is easier to understand physically by identifying the vibration of the vocal cords inside the neck.
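The three aspects discussed above (place of articulation, manner of articulation, and voicing) can be summarized for the four target final sounds in a short Python sketch. It is an illustration only and not part of the original study: the class `Consonant`, the list `TARGET_FINALS`, and the function `voicing_pairs` are hypothetical names, and the feature values follow the standard description of English /t/, /d/, /s/, and /z/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    symbol: str
    place: str    # place of articulation
    manner: str   # manner of articulation
    voiced: bool  # True if the vocal cords vibrate


# The four final sounds targeted in the study.
TARGET_FINALS = [
    Consonant("t", "alveolar", "stop", False),
    Consonant("d", "alveolar", "stop", True),
    Consonant("s", "alveolar", "fricative", False),
    Consonant("z", "alveolar", "fricative", True),
]


def voicing_pairs(consonants):
    """Pair consonants that share place and manner but differ in voicing."""
    pairs = []
    for a in consonants:
        for b in consonants:
            if (not a.voiced and b.voiced
                    and a.place == b.place and a.manner == b.manner):
                pairs.append((a.symbol, b.symbol))
    return pairs


print(voicing_pairs(TARGET_FINALS))  # [('t', 'd'), ('s', 'z')]
```

The sketch shows that the four target sounds form two pairs, /t/-/d/ and /s/-/z/, whose members differ only in voicing.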
In the next part, the researcher introduced the trend of using technology to teach pronunciation, from general to specific: the technique, the technology behind the technique, and a specific application of the technology, ELSA Speak.
2.2 The application of technology in language teaching and learning
In the past few decades, there has been increasing interest in the application of technology to teaching and learning languages in general and English in particular. This is due to the development of technology, which makes teaching and learning more convenient. More and more software, programmes, websites, applications, etc. have been designed for this purpose. This trend has attracted researchers’ interest, leading to more and more studies on the use of technology in many aspects of language teaching and learning. As a result, several domains of study have emerged, like Technology Assisted Language Learning (TALL), Computer Assisted Language Learning (CALL), and a more recent one named Mobile Assisted Language Learning (MALL).
In the next part of the chapter, the domain of Mobile Assisted Language Learning (MALL) was discussed.
2.2.1 Mobile Assisted Language Learning (MALL)
Recently, a trend named Mobile Assisted Language Learning has emerged and become increasingly popular. This is because of the increasing popularity of mobile devices like smartphones. According to a report on the global mobile market by Newzoo in 2020, Vietnam ranked 10th with more than 61 million smartphone users (as cited in VNA, 2021). These statistics imply that nearly all families have at least one smartphone. Later, from the increasing interest in MALL, another area of learning was born: mobile learning (m-learning). Some distinctive features of mobile learning can be drawn from those of mobile devices: portability, social interactivity, context sensitivity, and connectivity (Klopfer et al., 2002). In m-learning, the element of mobility is the key point. The mobility of mobile learning is not limited to movement in space but also covers the time and the place to learn (Kukulska-Hulme et al., 2009). Similarly, El-Hussein and Cronje (2010) divide mobility into three categories: mobility of technology, mobility of learning, and mobility of learners. Also, Kim (2012) categorizes application services for MALL into four groups: mobile social networking or mobile social software, mobile podcasting or mobile cast, course management service, and ASR. Among these, ASR was of special interest to the researcher as it showed the potential to stimulate learners to practice pronunciation by themselves.
In the next part, the definition of ASR and its applications were discussed.
2.2.2 Automatic speech recognition (ASR)
The notion of ASR (or speech recognition) may have originated from the invention of the telephone by Alexander Graham Bell in 1876, which turned sound waves into electrical signals. It was the earliest recorded time that people could invent a machine to process human speech. Since then, more and more experiments and improvements have been made to improve the recognition of human speech. In 1959, Bell Laboratories produced a system that could recognize vowel sounds of English with 93% accuracy. It might be a milestone in using speech recognition to recognize speech in English. Nowadays, due to the development of technology, ASR has been developed further and applied to many pieces of software and applications like Windows Speech Recognition, Google Search by Voice, Siri, Rosetta Stone, ELSA Speak, etc. Most of them employ the Hidden Markov Model to analyze and process the data they receive. At the same time, the applications and definitions of ASR have varied considerably. Levis and Suvorov (2014, p.1) define ASR as “an independent, machine-based process of decoding and transcribing oral speech. A typical ASR system receives acoustic input from the speaker through a microphone, analyzes it using some pattern, model or algorithm, and produces an output, usually in the form of a text”. As far as the researcher could find, this is the most up-to-date definition of ASR. Hence, this definition was used as the way to understand the term ASR in the study.
In the following part, a specific application of ASR, ELSA Speak, was discussed.
2.2.3 ELSA Speak
There have been some applications or software which employ ASR for pronunciation teaching. Among them, ELSA Speak was selected according to the following criteria suggested by Yoshida (2018) for selecting a tool for teaching pronunciation from the viewpoint of a teacher:
Appropriateness to learning objectives: ELSA Speak met the researcher’s objective, which was to help students pronounce four final alveolar sounds (/t/, /d/, /s/, and /z/) in isolated words only.
Quality and accuracy: The application provides students with natural-speed pronunciation by native speakers for each individual sound of each isolated word. Moreover, it provides students with colour-coded feedback: green for accurate, yellow for nearly accurate, and red for inaccurate pronunciation. It also gives a percentage score for how accurate the pronunciation of the whole word is in comparison to the pronunciation of native speakers. In addition, the application has been developed by a group of experts on pronunciation and AI. What’s more, the application can detect people’s pronunciation mistakes with 95% accuracy. The number of users is further evidence of its quality: more than 13 million users around the world (ELSA, n.d.). Therefore, the quality of the application could be ensured.
Practicality of use: The researcher considered the application easy to use, requiring just a smartphone and an internet connection. Furthermore, it can be used by users of different levels (elementary to advanced) and ages (young learners to older ones).
Cost: The cost of the application was considered reasonable from the viewpoint of the researcher: 329,000 VND (14 USD) for a three-month package (nearly five dollars per month) with seven days of free trial.
Despite the good points mentioned above, there were other applications apart from ELSA Speak that could satisfy the above-mentioned criteria. The most important reason for selecting ELSA Speak was the fact that it has been developed by a Vietnamese founder with great support for Vietnamese users.
Apart from checking whether or not students made any progress when using ASR (ELSA Speak) for their new word pronunciation, the researcher also wanted to know more about their attitudes towards the use of ASR (ELSA Speak). Therefore, the term “attitude” was defined in the next part.
2.3 Attitude
The definition of attitude remains under debate, as there are many points of view on it. Fishbein and Ajzen (1975, as cited in Kroenung & Bernius, 2012) developed the Theory of Reasoned Action (TRA), in which attitude is described as being formed from the strength of behavioral beliefs and the evaluation of the potential outcomes. According to TRA, attitudes towards a certain behavior can be positive, negative, or neutral. The theory also holds that there is a link between attitude and outcome, which can be explained as follows: if someone has a positive attitude, that person tends to have positive behavior, and vice versa. Later, the Theory of Planned Behavior (TPB) was developed from TRA by Ajzen (1985, as cited in Kroenung & Bernius, 2012); it states that attitude, subjective norm, and perceived behavioral control combine with one another to form a person’s behavior and intention.
Another view states that there are two kinds of attitude: Attitude towards an Object, A(O), and Attitude towards a Behavior, A(B) (Yang & Yoo, 2004; Zhang et al., 2008; Zhang & Sun, 2009). A(B) is described as “an individual’s positive or negative feelings (evaluative affect) about performing the target behavior” (Fishbein & Ajzen, 1975, p. 216, as cited in Kroenung & Bernius, 2012), while A(O) refers to “a psychological tendency that is expressed by evaluating a particular entity with some degree of favor and disfavor” (Eagly & Chaiken, 1993, p. 1). What’s more, Eagly and Chaiken (2007) believe that attitude includes three components: affect (emotion), cognition, and behaviour. The affect component is about how a person feels, the cognition component is about the information or knowledge that a person receives, and the behaviour component reflects how a person acts overtly towards an attitude object and his or her intentions to act.
Given the scope of the research, the three-component definition by Eagly and Chaiken was chosen, as it was considered the most appropriate when referring to how a person feels about something. Among the three components of attitude, the two components of emotion and behaviour (behavioural intention) were selected, as they could reflect how students felt about using ASR (ELSA Speak) in their pronunciation learning and their behavioural intention to use ASR (ELSA Speak) in class, outside class, and after the research. Therefore, the term attitude, in this research, could be understood as students’ affect (emotion) and their behaviour (behavioural intention) to use ASR (ELSA Speak) for their pronunciation learning. The term affect (emotion) was defined in the following part.
2.3.1 Affect
The term emotion has been studied for a long time. James (1884, as cited in Pritzker, Fenigsen, & Wilce, 2020, p. 157) states that “My theory, on the contrary, is that the bodily changes follow directly the PERCEPTION of the exciting fact, and that our feeling of the same changes as they occur IS the emotion”. Meanwhile, emotion is defined by Anne, Kuchibhotla, and Vankayalapati (2015, p. 5) as “a physiological experience of a person’s state when interacting with the environment and valence arousal space captures a wide range of significant issues in emotion”. From another viewpoint, emotion is understood as “a strong feeling such as love, fear or anger; the part of a person’s character that consists of feelings” (“emotion,” n.d.). Among these points of view, the one by Oxford University Press is the most up-to-date definition. It is also the most specific and the most closely related to the aim of the research in terms of feelings (anger, fear, etc.). Therefore, this definition was what the researcher meant by the term emotion in this study.
On the other hand, the researcher did not want to stop at how participants felt about ELSA Speak, but also wanted to know whether they had the behavioural intention to use the application in class, outside class, and after the research. Therefore, the term behavioural intention was defined next.
Final Version of Technology Acceptance Model
Note. Adapted from “A model of the antecedents of perceived ease of use: Development and test,” by Venkatesh and Davis, 1996.
From the model, it can be seen that the behavioural intention to use a kind of technology is formed from two elements: perception of usefulness and perception of ease of use. To be specific, if a person perceives a kind of technology as useful and easy to use, that person might have the behavioural intention to use it. Of the two definitions, the one by Venkatesh and Davis (1996) was adopted, as it reflects a person’s behavioural intention to use a kind of technology, which was ASR in this research. Therefore, the term behavioural intention, in this research, referred to participants’ perception of the usefulness and ease of use of ASR (ELSA Speak), which led to their behavioural intention to use it, or not, in class, outside class, and after the study.
What’s more, when comparing the definition of the behaviour component by Eagly and Chaiken (2007) with the definition of behavioural intention by Venkatesh and Davis (1996), the researcher found that both refer to the intention to act. To be specific, the behaviour component by Eagly and Chaiken (2007) concerns a person’s intention to act towards an attitude object, while behavioural intention by Venkatesh and Davis (1996) refers to a person’s intention to use a kind of technology. In this research, both referred to the intention to use ASR technology (ELSA Speak). Therefore, in this study, the behaviour component by Eagly and Chaiken (2007) and behavioural intention by Venkatesh and Davis (1996) had the same meaning.
2.4 Young learners
The term young learners must be defined carefully, as it may affect the way readers understand what the researcher meant in this research.
There are two points of view on defining young learners. Proponents of the first prefer to use an exact term for each age period. Richards and Schmidt claim that young learners in language teaching are children of pre-primary and primary school age. They also distinguish other groups as adolescent learners and adult learners. More specifically, young learners are understood as those from five to 12 years old (Rixon, 1999, 2014; McKay, 2006). Meanwhile, teenagers are defined as those who are from 13 to 19 years old (“teenager,” n.d.).
Under the second point of view, people prefer to use the umbrella term young learners for those who are under 18. According to the definition in the United Nations Convention on the Rights of the Child in 1990 (as cited in Ellis, 2013), a child is a person under 18 years of age. However, this definition is still too general, as the scope of this research concerns those who learn English. Therefore, the definition of this term must be more specific. In addition, Ellis (2013) defines young learners in English language teaching as those who are under 18. She also adds that the term covers a wide range of learners who share the same needs and rights as children but differ greatly as learners in aspects such as their physical, psychological, social, emotional, conceptual, and cognitive development, as well as their development of literacy.
Of the two points of view, the researcher used the first one, in which the term young learners refers to those who are from five to 12, while teenagers are those from 13 to 19. This was because of the nature of the research, in which the researcher employed two classes (a main class and a point-of-reference class). Participants in the main class were 12 or under, while those in the point-of-reference class were 13 or older. Each age group might have different features in learning English. Therefore, it might be more appropriate to use the exact term for each age group rather than an umbrella term for both of them.
The sections above covered the theoretical part of the chapter. In the next part, studies on the use of ASR were introduced.
2.5 Review of previous studies
In this section, previous studies related to the research were listed and analyzed chronologically. In addition, these studies were arranged by region: in other countries and then in Vietnam. Later, the research gap was identified from these studies.
There has been increasing interest in using technology to teach and learn second languages. Some researchers, however, criticize the use of ASR in second language teaching and learning, arguing that it yields low rates of accurate recognition for those who are not native speakers (Coniam, 1999; Derwing, Munro, & Carbonaro, 2000).
On the opposite side, Neri et al. (2002) argue that the accuracy of ASR in evaluating pronunciation is steadily improving, with better adaptability even to non-native speakers. Sharing the same interest, Kim (2006) conducted research on 36 Korean EFL university students (freshmen) to find the correlation coefficient between the scores given by the ASR software Fluspeak and those given by human raters. He found that the correlation coefficient was not high at the word level and was near zero at the intonation level. This meant that ASR technology at that time fell far short of what had been expected.
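To make the idea of a correlation coefficient between machine scores and human ratings concrete, the following is a minimal sketch. It assumes Pearson's r, which the text above does not confirm as the statistic used by Kim (2006); the function name `pearson_r` and all scores are hypothetical and are not data from that study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from each mean (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five learners (illustration only, not study data).
asr_scores = [62, 75, 58, 90, 70]   # assumed percentage scores from an ASR tool
human_scores = [3, 4, 2, 5, 4]      # assumed ratings from a human rater
print(round(pearson_r(asr_scores, human_scores), 2))
```

A value close to 1 would indicate that the software ranks learners much as human raters do, while a value near zero, as reported at the intonation level, would indicate almost no agreement.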
Banafa (2008, as cited in McCrocklin, 2014) has a more positive view towards using technology in pronunciation. He states that working on computers is effective for pronunciation because it provides a good environment for practicing oral language. Sharing this positive viewpoint, Neri et al. (2008) did quantitative research on computer-assisted pronunciation training (CAPT) based on ASR technology. The research was conducted on 28 students who were 11 years old, at the same school and at the same level. The students were divided into two groups: the control group (15 students) with teacher-led instruction and the experimental group (13 students) with the help of CAPT. It showed that children in the experimental group, who worked with ASR, also improved in new word pronunciation, as did the group taught in the traditional way. However, the research was conducted on isolated words without explaining which aspects of word pronunciation it covered, such as vowels, consonants, or stress.
Furui (2010), on the other hand, points out that although ASR systems are widely used and have already been embedded in commercial applications, room reverberation in real-life cases significantly decreases ASR performance, making the use of such systems less effective.
With the same interest, Al-Qudah (2012) did research to find out whether there was any improvement in pronunciation after using ASR. She also wanted to know whether there was any difference between male and female students in pronunciation performance after using ASR. To answer these questions, she designed an experimental study with 149 third-year students (73 males and 76 females) at a university in Jordan. They were divided into two groups: the control group (74 students) used printed materials for improving pronunciation, whereas the experimental group (75 students) used ASR. A pretest was conducted for both groups to make sure that the participants were at the same level, and a posttest was done after eight weeks of training for both groups. Means, standard deviations, and two-way analysis of variance (ANOVA) were used. The study found that the participants in the experimental group outperformed those in the control group. However, no significant difference was found between male and female participants in their pronunciation performance. There were still some vague points in the study: it did not state which aspect of pronunciation (segmental or supra-segmental) it focused on or how the performance of the two groups was assessed. These vague points may affect the reliability of the research.
At the same time, Tran (2012) conducted experimental research on the use of computer software (Pronunciation Power and Praat) to help students improve their pronunciation of four English fricative sounds: /f/, /v/, /θ/, /ð/. Ninety-two non-English-major freshmen at Hung Vuong University joined the research. They were divided into two groups: the control group (45 participants), who studied without computer software, and the experimental group (47 participants), who worked with computer software. The participants’ pronunciation was checked in the pretest, which showed that the control group scored 0.10 points higher than the experimental one. However, after the training procedure (13 weeks), the experimental group scored 1.93 points higher than the control group in the posttest. In addition, students showed positive attitudes towards the use of computer software in their study of the four English fricatives.
Another study was published by McCrocklin in 2014 on the influence of ASR on students’ autonomy in pronunciation. The participants were divided into three groups: a control group (15 participants) with traditional face-to-face instruction; an experimental group (17 participants) with traditional face-to-face instruction and minimal strategy training with ASR; and a second experimental group (16 participants) with half face-to-face minimal strategy training and the other half working with ASR. She found that ASR was a useful tool for practicing new word pronunciation because the technology fosters autonomy and interest in its users.
Simultaneously, Elimat and AbuSeileek (2014) added that individual work is more effective than pair or group work when learners practice their pronunciation with ASR.
Fathi Sidig Sidgi and Jelani Shaari (2017) conducted research on ASR, asking whether it was effective in helping users overcome the difficulties of pronouncing sounds that are not in their native language, Arabic. Participants of the research were 20 randomly selected Iraqi EFL students from among first-year students at Al-Turath University College in Baghdad. The participants attended a two-month pronunciation course employing the Eyespeak software. At the end of the course, they completed a questionnaire on how useful it was (if at all) in improving their pronunciation. It showed that participants considered the Eyespeak software a very useful tool for improving their pronunciation as long as it let them know where their pronunciation might be inaccurate. They also showed positive attitudes towards the software as an enjoyable way to learn pronunciation.
In the same year, Li et al. (2017) researched the application of ASR to English pronunciation correction and participants’ attitudes towards ASR in their learning of English. Twenty-nine first-year students from the School of Foreign Studies of South China Normal University participated in this study. They used the iFlytek Voice Input (IVI) application during a four-week course to practice their pronunciation. During the course, participants read aloud selected materials, and the application recorded, analyzed, and transcribed their speech into text. Each practice session lasted 20 minutes, including reading and self-checking with IVI and then rereading and rechecking repeatedly. They found that IVI was helpful in improving participants’ pronunciation.
Later, Kholis (2021) conducted action research on the application of ELSA Speak with 18 English-education students from Nahdlatul Ulama University of Yogyakarta, Indonesia. They practiced their pronunciation with attention to accuracy in grammar and vocabulary, fluency, appropriacy, and comprehensibility during three cycles. He found that ASR (ELSA Speak) could improve participants’ pronunciation skills from two to four and motivate them to learn to pronounce.
Among these studies, those by Tran (2012), Li et al. (2017), and Kholis (2021) were considered the most closely related to the scope of the research. However, they still differed from it in some ways. The research by Tran (2012) clearly states its focus: four English fricatives only (/f/, /v/, /θ/, /ð/). There are still other aspects that can be researched, such as vowels, other consonants (stops, nasals, approximants, initial consonants, etc.), and stress. The study by Li et al. (2017), on the other hand, addresses pronunciation in general, which seems too broad for the scope of this research. The study by Kholis (2021) seems the closest to this research in terms of the application employed, ASR (ELSA Speak), and research design (action research with both quantitative and qualitative approaches). However, it differs in terms of participants’ level (C1, in comparison with Pre-A1 and A1 in this research) and the subject studied (pronunciation in general, in comparison with four final alveolar sounds: /t/, /d/, /s/, /z/). These missing aspects of pronunciation could be considered the research gaps among the previous studies.
However, covering all these gaps would be too much given the limited time and the requirements of the syllabus. Therefore, it would be more appropriate to select one of these gaps for the scope of the research.
As a result, there were many choices for conducting the research. Nguyen (2012) did research on problematic consonants with 24 out of 104 participants at a college in Ho Chi Minh City, which is in the same region (the South-East of Vietnam) as the research site (Bien Hoa City). She identified the following final consonant sounds as problematic for students: /k/, /g/, /t/, /d/, /s/, /z/, /l/, /r/, /n/, /f/, /v/, /θ/, /ð/, /dʒ/, /tʃ/, /ʃ/, /p/, /b/, which attracted special interest from the researcher. Among those, the researcher paid special attention to alveolar sounds and found them feasible to study in his research. Moreover, during the process of teaching in the classroom, the researcher realized that final alveolar sounds are quite problematic for the students. Therefore, final alveolar sounds were selected. According to Roach (2009), the list of alveolar sounds consists of /t/, /d/, /s/, /z/, /n/, and /l/.
In addition, Avery and Ehrlich (1992) state that Vietnamese people have difficulties in pronouncing final consonant sounds including /s/, /z/, /t/, /d/, /p/, /k/, /b/, /g/, /ʃ/, /θ/, /ð/, /ʒ/, /f/, /v/.
Therefore, when combining the list of problematic final consonant sounds by Nguyen (2012) with the one by Avery and Ehrlich (1992), the researcher came up with Figure 2.2 as follows:
Figure 2.2
Common Problematic Final Sounds
From Figure 2.2, it was seen that the common problematic final sounds were plosives or fricatives in terms of manner of articulation but ranged widely in terms of place of articulation (from bilabials to velars). Among these sounds, the four fricatives /f/, /v/, /θ/, and /ð/ had been studied by Tran (2012), so the researcher wanted to conduct research on the rest of them (/p/, /b/, /t/, /d/, /k/, /g/, /s/, /z/, /ʃ/). However, covering all of them was still too broad. During the researcher’s teaching of the class (oriented towards the Pre-A1 Starters certificate by Cambridge Assessment English), some final sounds were noticed more often than others, which meant that their frequencies of use in teaching materials were higher. Table 2.1 showed the list of common problematic final sounds and their frequency of being seen in the Starters Word List by Cambridge Assessment English (2018):
[Figure 2.2: lists of problematic final sounds]
Nguyen (2012): /k/, /g/, /t/, /d/, /s/, /z/, /l/, /r/, /n/, /f/, /v/, /θ/, /ð/, /dʒ/, /tʃ/, /ʃ/, /p/, /b/
Avery & Ehrlich (1992): /s/, /z/, /t/, /d/, /p/, /b/, /k/, /g/, /ʃ/, /ʒ/, /θ/, /ð/, /f/, /v/
Common problematic final sounds: /p/, /b/, /t/, /d/, /k/, /g/, /f/, /v/, /θ/, /ð/, /s/, /z/, /ʃ/
Table 2.1
Frequency of Common Problematic Final Sounds
Problematic final sounds | /p/ | /b/ | /t/ | /d/ | /k/ | /g/ | /s/ | /z/ | /ʃ/
Frequency of being seen | 9 | 0 | 51 | 31 | 24 | 8 | 27 | 17 | 3
From Table 2.1, the four final sounds /t/, /d/, /s/, and /k/ have the highest frequencies of being seen in the word list. However, they belong to different groups (plosives and a fricative in terms of manner of articulation; alveolars and a velar in terms of place of articulation), which might be confusing to study. Meanwhile, the final sound /z/ holds the next highest frequency (17) in the word list and is the fourth most frequent of the alveolar sounds. Hence, it might be more appropriate to employ sounds of the same group rather than different ones.
In addition, although Roach (2009) lists six alveolar sounds in total (/t/, /d/, /s/, /z/, /n/, /l/), only four final alveolar sounds (/t/, /d/, /s/, /z/) were common to the lists of problematic sounds by Nguyen (2012) and Avery and Ehrlich (1992). Therefore, only these four final alveolar sounds were selected in this research.
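The selection steps described in this section can be retraced as simple set operations. The sketch below is only an illustration of that reasoning: the sound lists and frequencies are copied from the text and Table 2.1, while the variable names are chosen here for convenience and do not come from the cited studies.

```python
# Final consonants reported as problematic in the two sources cited above.
nguyen_2012 = {"k", "g", "t", "d", "s", "z", "l", "r", "n", "f", "v",
               "θ", "ð", "dʒ", "tʃ", "ʃ", "p", "b"}
avery_ehrlich_1992 = {"s", "z", "t", "d", "p", "k", "b", "g", "ʃ", "θ",
                      "ð", "ʒ", "f", "v"}

# Alveolar sounds according to Roach (2009) and fricatives studied by Tran (2012).
roach_alveolars = {"t", "d", "s", "z", "n", "l"}
tran_2012 = {"f", "v", "θ", "ð"}

# Frequencies of being seen in the Starters Word List (Table 2.1).
table_2_1 = {"p": 9, "b": 0, "t": 51, "d": 31, "k": 24,
             "g": 8, "s": 27, "z": 17, "ʃ": 3}

# Step 1: sounds that both sources report as problematic (Figure 2.2).
common = nguyen_2012 & avery_ehrlich_1992

# Step 2: set aside the four fricatives already studied by Tran (2012).
remaining = common - tran_2012

# Step 3: keep only the alveolar sounds, ordered by frequency in Table 2.1.
selected = sorted(remaining & roach_alveolars, key=lambda s: -table_2_1[s])

print(selected)  # prints ['t', 'd', 's', 'z']
```

Restricting the candidates to one place of articulation leaves the same four final sounds that the research focuses on, with /z/ included even though /k/ is more frequent in the word list.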
The researcher reviewed some previous studies related to the research. For readers’ convenience, Table 2.2 provided a summary of the previous studies mentioned above.
Table 2.2
Summary of Previous Studies
No. | Name | Year | Researcher(s) | Findings
3 | [...] | 2002 | Neri et al. | The accuracy of ASR in evaluating pronunciation is getting higher and higher even for non-native speakers
4 | Reliability and Pedagogical implications of ASR on pronunciation teaching | 2006 | Kim | The correlation coefficient at word level is not high and near zero at the intonation level
5 | Effects of IT on pronunciation | 2008 | Banafa | Working on computers was effective in pronunciation because it provides a good environment for practicing oral language
6 | Effectiveness of CAPT for children’s L2 learning | 2008 | Neri et al. | Participants who use ASR have the same progress as those who employ the traditional way
7 | History and development of speech recognition | 2010 | Furui | Room reverberation significantly decreases the effectiveness of ASR in [...]
8 | [...] | 2012 | Al-Qudah | The group who worked with ASR outperformed the one who didn’t. There was no difference in terms of gender
9 | Effectiveness of computer software on English fricative pronunciation | 2012 | Tran | The experimental group got 1.93 points higher than the control group. Students showed positive attitudes towards the use of ASR on four fricative sounds /f/, /v/, /θ/, /ð/
10 | The potential of ASR on fostering learners’ autonomy in pronunciation | 2014 | McCrocklin | ASR is useful for practicing new word pronunciation as it encourages users in terms of autonomy and interest
[...] | [...] | [...] | [...] | [...] an enjoyable way to learn pronunciation
13 | Improving English Pronunciation via automatic speech [...] | 2017 | Li et al. | IVI was found to be helpful in improving participants’ pronunciation
Table 2.2 was arranged chronologically, from 1999 to 2021. The table showed only the key points of the previous studies mentioned in this chapter, including the name of the study, the year of publication, the name of the researcher(s), and the finding(s). The summary was intended to help readers follow the study better by recapping what had been written in this part of the chapter.
In the next part, the conceptual framework was introduced to give readers a general view of what was done in the study as well as the relations among the operational terms stated in the previous parts of the study.