The application of automatic speech recognition to students new word pronunciation m a



VIETNAM NATIONAL UNIVERSITY – HOCHIMINH CITY UNIVERSITY OF SOCIAL SCIENCES AND HUMANITIES

FACULTY OF ENGLISH LINGUISTICS AND LITERATURE

THE APPLICATION OF AUTOMATIC SPEECH RECOGNITION TO STUDENTS’ NEW WORD

NGUYỄN THÚY NGA, Ph.D

HO CHI MINH CITY, OCTOBER 2021


ACKNOWLEDGEMENTS

Firstly, I would like to express my deepest gratitude to my thesis supervisor, Dr. Nguyễn Thúy Nga, for her precious guidance, strong support, helpful criticism, and instructive comments on my work. Without her help and encouragement, this M.A. thesis would not have been successfully accomplished.

Next, I would like to thank Assoc. Prof. Dr. Tô Minh Thanh for her teaching of pronunciation knowledge, which directly related to my thesis. I am also grateful to the other lecturers of the master’s programme in TESOL at the Faculty of English Linguistics and Literature of VNUHCM - University of Social Sciences and Humanities for their devoted guidance on my academic journey.

Then, I would like to thank Ms. Ngô Thị Lan Hương, Director of English for New Generation Language Centre in Bien Hoa City, and all of the participants who facilitated the process of accomplishing my thesis.

Last but not least, I would love to send my thanks to my beloved family, especially my wife, who wholeheartedly believed in me and encouraged me to complete the programme.


STATEMENT OF ORIGINALITY

I hereby certify that this thesis entitled

“THE APPLICATION OF AUTOMATIC SPEECH RECOGNITION TO

submitted in terms of the Statements of Requirements for Theses in Master’s Programmes issued by the Higher Degree Committee, is my own work.

This thesis has not been submitted for the award of any degree or diploma in any other institution.

Ho Chi Minh City, 2021

PHẠM HƯNG THỊNH


RETENTION OF USE

I hereby state that I, Phạm Hưng Thịnh, being a candidate for the degree of Master of Arts in TESOL, accept the requirements of the university relating to the retention and use of Master’s Theses deposited in the University Library.

I agree that the original of my Master’s Thesis deposited in the University Library should be accessible for the purposes of study and research, in accordance with the normal conditions established by the library for the care, loan and reproduction of theses.

Ho Chi Minh City, 2021

PHẠM HƯNG THỊNH


TABLE OF CONTENTS

ACKNOWLEDGEMENTS

STATEMENT OF ORIGINALITY

RETENTION OF USE

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER 1: INTRODUCTION

1.1 Background to the study

1.2 Statement of the problem

1.3 Aims of the study

1.4 Research questions

1.5 Rationale of the study

1.6 Scope of the research

1.7 Structure of the thesis

1.8 Summary of the chapter

CHAPTER 2: LITERATURE REVIEW

2.1 Definition of terms

2.1.1 Pronunciation

2.1.2 Word

2.1.3 Phoneme (sound)

2.1.4 Consonant

2.2 The application of technology in language teaching and learning

2.2.1 Mobile Assisted Language Learning (MALL)

2.2.2 Automatic speech recognition (ASR)

2.2.3 ELSA Speak

2.3 Attitude

2.3.1 Affect

2.3.2 Behavioural intention

2.4 Young learners

2.5 Review of previous studies

2.6 Conceptual framework

CHAPTER 3: METHODOLOGY

3.1 Research questions

3.2 Research design

3.3 Research site

3.4 Participants

3.4.1 Starters class (main class)

3.4.2 Movers class (point of reference class)

3.5 Research tools

3.5.1 Test

3.5.2 Questionnaire

3.5.3 Interview

3.7 Procedure of data analysis

3.8 Validity

3.9 Reliability

CHAPTER 4: RESULTS AND DISCUSSION

4.1 Research question 1: the effects of ASR (ELSA Speak) on students’ new word pronunciation

4.1.1 Perception

4.1.2 Production

4.2 Research question 2: students’ attitudes on the application of ASR (ELSA Speak) in their pronunciation learning

4.2.1 Emotion

4.2.2 Behavioural intention

4.3 Discussion of results

4.4 Findings

CHAPTER 5: CONCLUSION

5.1 Conclusions

5.2 Suggestions

5.3 Limitations of the study

5.4 Recommendations for further study

REFERENCES

APPENDICES

APPENDIX A

APPENDIX B

APPENDIX C

APPENDIX D

APPENDIX E

APPENDIX F

APPENDIX G

APPENDIX H

APPENDIX I

APPENDIX J


LIST OF ABBREVIATIONS

A(B) Attitude towards a Behavior

AI Artificial Intelligence

A(O) Attitude towards an Object

ASR Automatic Speech Recognition

CALL Computer Assisted Language Learning

CAPT Computer Assisted Pronunciation Training

CEFR Common European Framework of Reference for Languages

EFL English as a Foreign Language

ESL English as a Second Language

IT Information Technology

IVI iFlytek Voice Input

MALL Mobile Assisted Language Learning

TALL Technology Assisted Language Learning

TAM Technology Acceptance Model

TPB Theory of Planned Behavior

TRA Theory of Reasoned Action


LIST OF TABLES

Table 2.1 Frequency of Common Problematic Final Sounds

Table 2.2 Summary of Previous Studies

Table 3.1 General View on Pretest and Posttest

Table 3.2 Format of a Pretest/Posttest

Table 3.3 Comparison between the two Questionnaires

Table 3.4 Interpretation Scale of Cohen’s Kappa

Table 4.1 Result of Perception of Starters Class

Table 4.2 Result of Perception of Movers Class

Table 4.3 Result of Inter-Rater Reliability in Cohen’s Kappa

Table 4.4 Result of Production of Starters Class (main class)

Table 4.5 Result of Production of Movers Class (point of reference class)

Table 4.6 Summary of the Results of Research Question 1

Table 4.7 Result of Question 2 in the Questionnaire for two Classes


LIST OF FIGURES

Figure 2.1 Final Version of Technology Acceptance Model

Figure 2.2 Common Problematic Final Sounds

Figure 2.3 Conceptual Framework of the Research

Figure 3.1 Model of Research Design

Figure 3.2 Cyclical AR model based on Kemmis and McTaggart


ABSTRACT

While teaching English at a foreign language centre in Bien Hoa City, the researcher realized that many students have difficulty pronouncing new words by themselves despite having been instructed face-to-face. This observation is shared by Celce-Murcia, Brinton, and Goodwin (1996), who note that students depend on their teacher to learn new sounds. Therefore, the researcher decided to conduct research on the issue. It was an action research study that employed a mixed-methods design with 15 participants from two classes and covered two main points: how ASR (ELSA Speak) affects students’ new word pronunciation and what attitudes students show towards the use of ASR (ELSA Speak) in their new word pronunciation. The main focus of the research was on four final alveolar sounds (/t/, /d/, /s/, /z/). It was found that the technology helped participants improve their pronunciation of the four target final sounds, although the improvement was generally not statistically significant. Another finding was that participants showed positive attitudes towards the use of the technology in their pronunciation learning. In addition, suggestions, limitations of the research, and recommendations for further studies are presented.


CHAPTER 1: INTRODUCTION

This chapter introduced the background to the study, the statement of the problem, the aims of the study, the research questions, the rationale, and the scope of the study to provide readers with general information about the research.

1.1 Background to the study

The English language has become an important part of everyday life not only around the world but also in Vietnam, especially in today’s increasingly integrated world. It has become more and more popular at school, at work, on the internet, etc. As a result, the demand to learn English has been growing steadily, and more and more people are trying to learn English at home, at school, at language centres, etc. However, many Vietnamese learners find the language difficult to learn, especially its pronunciation. Avery and Ehrlich (1995) explain that the difficulties in pronunciation may come from the differences between the sound systems of the two languages: the number of syllables in a word, the existence of consonant clusters, and the number of consonants. They also point out some errors that Vietnamese learners usually make when pronouncing English, including errors with final sounds.

1.2 Statement of the problem

The issue of students’ pronunciation could also be seen in the researcher’s classroom. While teaching a young learner class of English in a language centre in Bien Hoa City, the researcher noticed that students depend too much on the teacher’s pronunciation when they learn new words in the lessons. The teacher pronounces new words several times for them to repeat. However, after a while, their pronunciation drifts away from the teacher’s model, especially in the final consonant sounds. Although some hard-working students do try to pronounce the words, it seems hard for them to do it well. Because of the limited time of the lessons, the researcher cannot always spend time helping them with their new word pronunciation. Moreover, students have little chance to get feedback from others on their pronunciation, which makes them worry about whether their pronunciation is accurate or not. This issue may lead to language anxiety. MacIntyre (2007, p.565) describes language anxiety as “the worry and usually negative emotional reaction aroused when learning or using an L2”. High language anxiety is likely to have a negative impact on learners’ performance in learning an L2 as well as lead to avoidance of interaction. This may make it even more difficult for students to pronounce new words in English.

In this case, teachers need to find ways to teach pronunciation to their students, adapt methods to fit students and their needs, and help them practice effectively to overcome any problems they might have (Celce-Murcia et al., 1996). Therefore, the researcher believed that something should be done to help students practice pronouncing new words alongside the support from the researcher. The target was not to train them into experts in pronunciation but to help them develop their own way of practicing and getting feedback when pronouncing new words. Moreover, the researcher hoped that, through the findings, some contribution could be made to the theory of teaching and learning English pronunciation. As a result, the researcher decided to do research on the issue.

1.3 Aims of the study

The research was conducted to discover the effects of automatic speech recognition (ASR) technology on students’ new word pronunciation, especially final alveolar sounds, not only in the classroom but also in other places and at different times. The application ELSA Speak, which is built on ASR technology, was suggested as it is now very popular, with more than 13 million users across the globe. Moreover, feedback (students’ emotion and behavioural intention to use the technology) was expected to make it better for future use. Finally, from what was found in the study, the researcher hoped to make a contribution to the theory and practice of teaching and learning English pronunciation.

1.4 Research questions

Research question 2: What are students’ attitudes towards the use of automatic speech recognition (ELSA Speak) in their pronunciation learning?

Sub-questions:

- What is students’ emotion towards automatic speech recognition (ELSA Speak) in their pronunciation learning?

- What is students’ behavioural intention to use automatic speech recognition (ELSA Speak) in their pronunciation learning?

1.5 Rationale of the study

There might be a lot of ways to help learners develop their new word pronunciation, including technology. Celce-Murcia et al. (1996) suggest 10 techniques and materials for teaching pronunciation, including recording learners’ production. This technique might be one of the best ways to apply technology to learning pronunciation. Nowadays students are getting more and more familiar with using technology to learn, owing to the economic growth and the rise in living standards of the country in recent years. Electronic devices such as smartphones, computers, laptops, and tablets can be found in the majority of families. Therefore, it seems appropriate for teachers to apply technology to help their students. However, which kind of technology should be used and how to use it were difficult questions to answer.

With the development of technology, Artificial Intelligence (AI) is being used more and more. Many AI technologies are considered applicable to English language teaching and learning, such as big data, image recognition, and ASR. In terms of pronunciation, ASR is thought to be one of the most relevant technologies. It can be applied to many tasks such as searching for information on the Internet by voice, practicing word pronunciation, and giving feedback on one’s pronunciation. The function of giving automatic feedback appears to be useful, as early automatic feedback may help learners avoid fossilizing wrong pronunciation habits (Eskenazi, 1999). Consequently, applications and software have been built on ASR, such as Google Search by Voice, Siri, Pronunciation Power, Rosetta Stone, and ELSA Speak.

Also, there have been a number of studies on the use of ASR in teaching and learning languages in general and English in particular. Most of them have been conducted in countries around the world. However, as far as the researcher could find, studies on the same topic in the context of Vietnam were scarce. Many studies were conducted in Western countries on languages like English, Italian, and Dutch, which are members of the Indo-European language family (“Indo-European Language Family,” n.d.). It means that, to a certain extent, these languages have something in common. Some other studies were on languages of other areas like the Middle East and East Asia, which may have different features from Vietnamese. Vietnamese, however, belongs to the Austro-Asiatic language family (“Austro-Asiatic Language Family,” n.d.), which may lead to more differences (or difficulties) when Vietnamese students try to learn English. Therefore, a study should be done to contribute to the use of ASR in teaching and learning English pronunciation in the context of Vietnam.

The contribution of this study could be in two different aspects: theory and practice. In theory, the researcher hoped, through the findings of this study in the context of Vietnam, to provide some evidence on the effects of applying ASR to teaching and learning English pronunciation. Also, students’ feedback on the use of ASR may help other teachers or researchers anticipate feedback from their own learners.

On the practical side, the findings of the study were hoped to offer a solution to the issue of students’ pronunciation. They may suggest an effective way to teach and learn English pronunciation for other teachers and students. In addition, developers of applications or software can look at users’ feedback to make appropriate adjustments to their products.

1.6 Scope of the research

The scope of the study was limited to the pronunciation of new English words (isolated words only). The research focused on only one aspect of word pronunciation: the segmental aspect (consonants and vowels). Within the segmental aspect, the researcher paid special attention to four final alveolar sounds (/t/, /d/, /s/, /z/). The limited time available to conduct the research and the need to keep up with the time requirements of the syllabus prevented the researcher from going further. Also, the age and workload of the students might limit how far they could go in such a fixed schedule.

1.7 Structure of the thesis

This thesis was divided into five chapters in which each one has its own role:


Chapter one provided a general view of the issue and the background which led to the conduct of the study. It also presented the research questions and the scope of the study.

Chapter two presented the theoretical background to support the study. Operational terms were also defined to make clear what the researcher meant in the study. Moreover, previous studies related to the study were reviewed to find the gap and support the research. Finally, a conceptual framework was given as a guide for conducting the research.

Chapter three described the specific way the research was conducted, including the method, the participants, the site, the tools, the data collection procedure, the data analysis procedure, validity, and reliability.

Chapter four analyzed the data to produce results, which were then compared and contrasted with those of previous studies in the literature review to arrive at the findings of the study.

Chapter five presented the conclusions of the study. It also presented suggestions, limitations of the study, and recommendations for future studies.

1.8 Summary of the chapter

The researcher presented the general background of the study, including the research site, the statement of the problem, the aim and the rationale for doing the study, and the research questions. Finally, the scope of the study was limited to four final alveolar sounds, /t/, /d/, /s/, and /z/, within isolated new words. In the next chapter, literature related to the study was introduced.


CHAPTER 2: LITERATURE REVIEW

In this chapter, the researcher reviewed the related literature. First, definitions of some operational terms were introduced on the basis of theoretical literature. Then, some related studies were reviewed. Finally, the research gap and the conceptual framework of the study were presented.

2.1 Definition of terms

Before turning to studies related to the issue, the researcher had to define some terms, as the definitions may influence the way readers understand what the researcher meant in the study.

2.1.1 Pronunciation

The term “pronunciation” is difficult to define as there are different points of view on it. Pronunciation is defined as “the way in which a particular person pronounces the words of a language” (“pronunciation,” n.d.). Unfortunately, it is not that simple. There are many levels of pronunciation: from the smallest level (sound or phoneme) through syllable, word, and phrase to sentence level or above. However, due to the scope of the study (isolated words only), the term “pronunciation” could be understood as the pronunciation of isolated words.

However, how can readers define words, and how should their meaning be understood within the scope of the research? The answers to these questions were given in the next definition.

2.1.2 Word

A word is defined as “a single unit of language that means something and can be spoken or written” (“word,” n.d.). In terms of pronunciation and the scope of the study, the features “meaningful” and “spoken” of this definition were employed. As a result, the term “word” or “isolated word”, in this study, could be understood as “a single unit of language which means something and can be spoken”.


In addition, according to Fromkin, Rodman, and Hyams (2011), words are combinations of one or more syllables and each syllable includes one or more phonemes. Hence, when considering pronunciation at word level, we would take phonemes into consideration.

2.1.3 Phoneme (sound)

Richards and Schmidt (2010, p.432) define a phoneme as “the smallest unit of sound in a language which can distinguish two words”. This view is also shared by Roach (2009). In agreement, Ladefoged and Johnson (2011, p.34) add that “when two sounds can be used to differentiate words, they are said to belong to different phonemes.” Therefore, given the agreement among these authors on the characteristics of sound and distinctness or differentiation, a “phoneme” in this study referred to the smallest unit of sound that can be used to distinguish or differentiate two words in a language.
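The idea that a phoneme is what distinguishes two words can be illustrated with a minimal pair. The sketch below is only an illustration and not part of the study's instruments; the function names and the example words "bet" and "bed" are assumptions chosen because they differ only in a final alveolar sound.

```python
def differing_positions(a, b):
    """Return the indices at which two phoneme sequences differ.

    Returns None when the sequences have different lengths, because the
    simple position-by-position comparison used here does not apply.
    """
    if len(a) != len(b):
        return None
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def is_minimal_pair(a, b):
    """Two words form a minimal pair if exactly one phoneme differs."""
    diffs = differing_positions(a, b)
    return diffs is not None and len(diffs) == 1


# Hypothetical example words, transcribed as lists of phoneme symbols.
bet = ["b", "e", "t"]
bed = ["b", "e", "d"]

print(is_minimal_pair(bet, bed))      # the words differ only in the final sound
print(differing_positions(bet, bed))  # index of the differing phoneme
```

Because /t/ and /d/ alone separate the two words, they count as different phonemes under the definition adopted above.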

Also, it is stated that “the sounds of all languages fall into two classes: consonants and vowels” (Fromkin et al., 2011, p.195). However, due to the nature of this research (focusing on consonants only), further explanation was given on consonants only.

2.1.4 Consonant

There are a lot of definitions of consonants. A consonant is defined as “a speech sound made by completely or partly stopping the flow of air through the mouth or nose” (“consonant,” n.d.). Also, Finegan (1994, p.34) claims that “consonants are sounds produced by partially or completely blocking air in its passage from the lung to the vocal tract”. From the same point of view, Fromkin et al. (2011, p.195) claim that “consonants are sounds produced with some restriction or closure in the vocal tract that impedes the flow of air from the lungs”. What these points of view have in common is the element of partly or completely blocking the air on its way from the lungs through the vocal tract. Consequently, the term “consonant”, in the scope of this study, could be understood as a sound that is produced by partly or completely blocking or stopping the air on its way from the lungs through the vocal tract.

So far, the definition of consonants has been stated. However, there are still other aspects of consonants that should be taken into account: place of articulation, manner of articulation, and voicing.

2.1.4.1 Place of articulation

Fromkin et al. (2011) classify consonants according to two aspects: the position in the vocal tract and the airflow restriction. The aspect of position in the vocal tract can be, in other words, understood as place of articulation. It refers to the movement of the tongue and lips that leads to the production of different consonants, including bilabials, labiodentals, interdentals, alveolars, palatals, velars, and glottals.

Among these places of articulation, the research only focused on alveolar sounds, which were defined below.

Alveolar sound

There have been a number of definitions of alveolar sounds. An alveolar sound is understood as “a speech sound made with the tongue touching the part of the mouth behind the upper front teeth” (“alveolar,” n.d.). Also, Fromkin et al. (2011, p.570) define an alveolar sound as “A sound produced by raising the tongue to the alveolar ridge, e.g., [s], [t], [n]”. In a more specific view, Roach (2009, p.9) states that:

The alveolar ridge is between the top front teeth and hard palate. You can feel its shape with your tongue. Its surface is really much rougher than it feels, and is covered with little ridges. You can only see these if you have a mirror small enough to go inside your mouth, such as those used by dentists. Sounds made with tongue touching here (such as /t/, /d/, /n/) are called alveolar (Roach, 2009, p.9).

Among these views of alveolar sounds, the definition by Roach (2009) was adopted because it is the most specific in terms of the physical shape of the mouth, which makes it easier for readers to picture. There are six alveolar sounds in total: /t/, /d/, /s/, /z/, /n/, and /l/ (Roach, 2009).

2.1.4.2 Manner of articulation

Fromkin et al. (2011) also classify consonants according to whether the airstream from the lungs out through the mouth and the nose is partially or completely blocked. This is called the manner of articulation, which is divided into stops, nasals, fricatives, affricates, glides, and liquids. From another point of view, Richards and Schmidt (2010, p.351) define manner of articulation as “the way in which a speech sound is produced by the speech organs”. Of the two definitions, the one by Fromkin et al. was adopted as it is more specific in terms of the physical movement of the airstream and the way it is changed (partly blocked or completely blocked).

2.1.4.3 Voicing

Apart from the classification of consonants by place and manner of articulation introduced by Fromkin et al., Roach (2009) identifies consonants by whether the vocal folds vibrate or not. The vibration of the vocal folds for one consonant differs from that for another in three ways: intensity (high or low), frequency (rapid or less rapid), and quality (harsh, breathy, murmured, or creaky). This way of identifying consonants is called voice or voicing. In addition, Richards and Schmidt (2010, p.630) classify voice into two categories: voiced and voiceless. They explain that speech sounds produced with the vocal cords vibrating are called “voiced”, which can be felt by touching the neck in the region of the larynx, while those produced without the vocal cords vibrating are called “voiceless”. Among these ways of defining voice or voicing, the definition by Richards and Schmidt was employed as it is easier to understand physically by identifying the vibration of the vocal cords inside the neck.
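The three dimensions defined above (place of articulation, manner of articulation, and voicing) can be combined to describe the four target sounds of the study. The following sketch is an illustration only; the class and function names are assumptions, while the feature values follow the standard phonetic description of /t/, /d/, /s/, and /z/.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    symbol: str
    place: str    # place of articulation
    manner: str   # manner of articulation
    voiced: bool  # True if the vocal cords vibrate


# The four final alveolar sounds the study focuses on.
TARGET_SOUNDS = [
    Consonant("t", "alveolar", "stop", False),
    Consonant("d", "alveolar", "stop", True),
    Consonant("s", "alveolar", "fricative", False),
    Consonant("z", "alveolar", "fricative", True),
]


def voicing_pairs(sounds):
    """Pair sounds that share place and manner but differ in voicing."""
    pairs = []
    for i, a in enumerate(sounds):
        for b in sounds[i + 1:]:
            same_articulation = (a.place, a.manner) == (b.place, b.manner)
            if same_articulation and a.voiced != b.voiced:
                pairs.append((a.symbol, b.symbol))
    return pairs


print(voicing_pairs(TARGET_SOUNDS))
```

Read this way, the four target sounds form two voiceless/voiced pairs, one of stops and one of fricatives, all sharing the alveolar place of articulation.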

In the next part, the researcher introduced the trend of using technology to teach pronunciation, from general to specific: the technique, the technology behind the technique, and a specific application of the technology, ELSA Speak.

2.2 The application of technology in language teaching and learning

In the past few decades, there has been increasing interest in the application of technology to teaching and learning languages in general and English in particular. This is due to the development of technology, which makes teaching and learning more convenient. More and more software, programmes, websites, and applications have been designed for this purpose. This trend has attracted researchers’ interest, leading to more and more studies on the use of technology in many aspects of language teaching and learning. As a result, several domains of study have emerged, such as Technology Assisted Language Learning (TALL), Computer Assisted Language Learning (CALL), and, more recently, Mobile Assisted Language Learning (MALL).

In the next part of the chapter, the domain of Mobile Assisted Language Learning (MALL) was discussed.

2.2.1 Mobile Assisted Language Learning (MALL)

Recently, a trend named Mobile Assisted Language Learning has emerged and become increasingly popular. This is because of the increasing popularity of mobile devices like smartphones. According to a report on the global mobile market by Newzoo in 2020, Vietnam was ranked 10th with more than 61 million smartphone users (as cited in VNA, 2021). These statistics suggest that nearly all families have at least one smartphone. Later, from the increasing interest in MALL, another area of learning, mobile learning (m-learning), emerged. Some distinctive features of mobile learning can be derived from those of mobile devices: portability, social interactivity, context sensitivity, and connectivity (Klopfer et al., 2002). In m-learning, the element of mobility is the key point. The mobility of mobile learning is not limited to movement in space but also covers the time and the place to learn (Kukulska-Hulme et al., 2009). In other words, El-Hussein and Cronje (2010) describe mobility in three categories: mobility of technology, mobility of learning, and mobility of learners. Also, Kim (2012) categorizes application services for MALL into four groups: mobile social networking or mobile social software, mobile podcasting or mobile cast, course management service, and ASR. Among these, ASR was of special interest to the researcher as it showed the potential to stimulate learners to practice pronunciation by themselves.

In the next part, the definition of ASR and its applications were discussed.

2.2.2 Automatic speech recognition (ASR)

The notion of ASR (or speech recognition) may have originated from the invention of the telephone by Alexander Graham Bell in 1876, which turned sound waves into electrical signals. It was the earliest recorded step towards machines that can recognize human speech. Since then, more and more experiments and improvements have been made to recognize human speech better. In 1959, Bell Laboratories produced a system that could recognize the vowel sounds of English with 93% accuracy. It might be a milestone in using speech recognition to recognize speech in English. Nowadays, due to the development of technology, ASR has been developed further and applied to a lot of software and applications like Windows Speech Recognition, Google Search by Voice, Siri, Rosetta Stone, and ELSA Speak. Most of them employ the Hidden Markov Model to analyze and process the data they receive. At the same time, the applications and definitions of ASR have varied considerably. Levis and Suvorov (2014, p.1) define ASR as “an independent, machine-based process of decoding and transcribing oral speech. A typical ASR system receives acoustic input from the speaker through a microphone, analyzes it using some pattern, model or algorithm, and produces an output, usually in the form of a text”. As far as the researcher could find, this is the most up-to-date definition of ASR. Hence, this definition was used as the way to understand the term ASR in the study.

In the following part, a specific application of ASR, ELSA Speak, was discussed.

2.2.3 ELSA Speak

Several applications and software packages employ ASR for pronunciation teaching. Among them, ELSA Speak was selected on the basis of the following criteria suggested by Yoshida (2018) for selecting a tool for teaching pronunciation from a teacher’s viewpoint:

Appropriateness to learning objectives: ELSA Speak met the researcher’s objective, which was to help students pronounce four final alveolar sounds, /t/, /d/, /s/, and /z/, in isolated words only.

Quality and accuracy: The application provides students with natural-speed pronunciation by native speakers for each individual sound of each isolated word. Moreover, it provides students with colour-coded feedback: green for accurate, yellow for nearly accurate, and red for inaccurate pronunciation. It also gives a percentage score for how accurate the pronunciation of the whole word is in comparison to the pronunciation of native speakers. In addition, the application has been developed by a group of experts on pronunciation and AI. What’s more, the application can detect people’s pronunciation mistakes with 95% accuracy. The number of users is further evidence of its quality: more than 13 million users around the world (ELSA, n.d.). Therefore, the quality of the application could be considered assured.


Practicality of use: The researcher considered the application easy to use, as it requires just a smartphone and an internet connection. Furthermore, it can be used by a range of users of different levels (elementary to advanced) and ages (young learners to older ones).

Cost: The researcher considered the cost of the application reasonable: 329,000 VND (14 USD) for a three-month package (nearly five dollars per month) with a seven-day free trial.
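The colour-coded feedback described under quality and accuracy can be pictured as a mapping from a percentage score to a colour. The sketch below is hypothetical: the thesis does not state the score boundaries ELSA Speak uses, so the function name and the cut-off values of 80 and 50 are placeholder assumptions, not the application's actual behaviour.

```python
def feedback_colour(score, green_min=80.0, yellow_min=50.0):
    """Map a 0-100 pronunciation score to a feedback colour.

    green_min and yellow_min are hypothetical cut-offs chosen only for
    illustration; the real thresholds are not given in the source.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= green_min:
        return "green"   # accurate
    if score >= yellow_min:
        return "yellow"  # nearly accurate
    return "red"         # inaccurate


for s in (92, 65, 30):
    print(s, feedback_colour(s))
```

Keeping the cut-offs as parameters makes it explicit that they are assumptions that would need to be replaced with the application's real values.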

Beyond the good points mentioned above, there were still other applications apart from ELSA Speak that could satisfy these criteria. The deciding factor in the selection of ELSA Speak was the fact that it has been developed by a Vietnamese founder with strong support for Vietnamese users.

Apart from checking whether or not students make any progress when using ASR (ELSA Speak) for their new word pronunciation, the researcher also wanted to know more about their attitudes towards the use of ASR (ELSA Speak). Therefore, the term “attitude” was defined in the next part.

2.3 Attitude

The definition of attitude is still under debate, and the debate may never be settled, as there are many points of view on it. Fishbein and Ajzen (1975, as cited in Kroenung & Bernius, 2012) developed the Theory of Reasoned Action (TRA), in which attitude is described as formed from the strength of behavioral beliefs and the evaluation of the potential outcomes. According to TRA, attitudes towards a certain behavior can be positive, negative, or neutral. The theory also holds that there is a link between attitude and outcome, which can be explained as follows: if someone has a positive attitude, that person tends to have positive behavior, and vice versa. Later, the Theory of Planned Behavior (TPB) was developed from TRA by Ajzen (1985, as cited in Kroenung & Bernius, 2012); it states that attitude, subjective norm, and perceived behavioral control combine with one another to form a person’s behavior and intention. Another view of attitude states that there are two kinds of attitude:


Attitude towards an Object, A(O), and Attitude towards a Behavior, A(B) (Yang & Yoo, 2004; Zhang et al., 2008; Zhang & Sun, 2009). A(B) is described as “an individual’s positive or negative feelings (evaluative affect) about performing the target behavior” (Fishbein & Ajzen, 1975, p. 216, as cited in Kroenung & Bernius, 2012), while A(O) refers to “a psychological tendency that is expressed by evaluating a particular entity with some degree of favor and disfavor” (Eagly & Chaiken, 1993, p. 1). What’s more, Eagly and Chaiken (2007) believe that attitude includes three components: affect (emotion), cognition, and behaviour. The affect component is about how a person feels, the cognition component is about the information or knowledge that a person receives, and the behaviour component reflects how a person acts overtly towards an attitude object and his or her intentions to act. Given the scope of the research, the three-component definition by Eagly and Chaiken was chosen, as it was considered the most appropriate for describing how a person feels about something. Among the three components of attitude, two, emotion and behaviour (behavioural intention), were selected, as they could reflect how students felt about using ASR (ELSA Speak) in their pronunciation learning and their behavioural intention to use ASR (ELSA Speak) in class, outside class, and after the research. Therefore, the term attitude, in this research, could be understood as students’ affect (emotion) and their behaviour (behavioural intention) to use ASR (ELSA Speak) for their pronunciation learning. The term affect (emotion) is defined in the following part.

2.3.1 Affect

The term emotion has been studied for a long time. James (1884, as cited in Pritzker, Fenigsen, & Wilce, 2020, p. 157) states: “My theory, on the contrary, is that the bodily changes follow directly the PERCEPTION of the exciting fact, and that our feeling of the same changes as they occur IS the emotion”. Meanwhile, emotion is defined by Anne, Kuchibhotla, and Vankayalapati (2015, p. 5) as “a physiological experience of a person’s state when interacting with the environment and valence arousal space captures a wide range of significant issues in emotion”. From another viewpoint, emotion is understood as “a strong feeling such as love, fear or anger; the part of a person’s character that consists of feelings” (“emotion,” n.d.). Among these points of view on emotion, the one by Oxford University Press is the most recent definition. It is also the most specific and the most closely related to the aim of the research in terms of feelings (anger, fear, etc.). Therefore, this definition was adopted as the meaning of the term emotion in this study.

On the other hand, the researcher wanted to know not only how participants felt about ELSA Speak but also whether or not they had the behavioural intention to use the application in class, outside class, and after the research. Therefore, the term behavioural intention is defined next.

Final Version of Technology Acceptance Model

Note. Adapted from A model of the antecedents of perceived ease of use: Development and test, by Venkatesh and Davis, 1996.


From the model, it can easily be seen that the behavioural intention to use a technology is formed from two elements: perception of usefulness and perception of ease of use. To be specific, if a person perceives a technology as useful and easy to use, that person might have the behavioural intention to use it. Of the two definitions, the one by Venkatesh and Davis (1996) was adopted, as it reflects a person’s behavioural intention to use a technology, which was ASR in this research. Therefore, the term behavioural intention, in this research, referred to participants’ perception of the usefulness and ease of use of ASR (ELSA Speak), which led to their behavioural intention to use it or not in class, outside class, and after the study.

What’s more, when comparing the definition of the behaviour component by Eagly and Chaiken (2007) with the definition of behavioural intention by Venkatesh and Davis (1996), the researcher found that both refer to the intention to act. To be specific, the behaviour component by Eagly and Chaiken (2007) concerns a person’s intention to act towards an attitude object, while the behavioural intention by Venkatesh and Davis (1996) refers to a person’s intention to use a technology. In this research, both referred to the intention to use the technology of ASR (ELSA Speak). Therefore, in this study, the behaviour component by Eagly and Chaiken (2007) and the behavioural intention by Venkatesh and Davis (1996) had the same meaning.

2.4 Young learners

The term young learners must be defined carefully, as it may affect the way readers understand what the researcher meant in this research.

There are two points of view on defining young learners. Proponents of the first prefer to use an exact term for each age range. Richards and Schmidt claim that young learners in language teaching are children of pre-primary and primary school age. They also distinguish other groups as adolescent learners and adult learners. More specifically, young learners are understood as those from five to 12 years old (Rixon, 1999, 2014; McKay, 2006). Meanwhile, teenagers are defined as those who are from 13 to 19 years old (“teenager,” n.d.).

Proponents of the second point of view prefer to use the umbrella term young learners for those who are under 18. According to the definition in the United Nations Convention on the Rights of the Child in 1990 (as cited in Ellis, 2013), a child is a person under 18 years of age. However, this definition is still too general, while the scope of this research concerns those who learn English. Therefore, the definition of this term must be more specific. In addition, Ellis (2013) defines young learners in English language teaching as those who are under 18. She also adds that the term young learners covers a wide range of learners who share the same needs and rights as children but differ greatly as learners in, for example, their physical, psychological, social, emotional, conceptual, and cognitive development, as well as their development of literacy.

Of the two points of view, the researcher, in this study, adopted the first one, in which the term young learners refers to those who are from five to 12, while teenagers are those from 13 to 19. This was because of the nature of the research, in which the researcher employed two classes (main class and point of reference class). Participants of the main class were 12 or under, while those of the point of reference class were 13 or older. Each age group might have different features in learning English. Therefore, it might be more appropriate to use the exact term for each age group rather than an umbrella term for both of them.

The theoretical section of the chapter has been presented above. In the next part, studies on the use of ASR are introduced.

2.5 Review of previous studies

In this section, previous studies related to the research are listed and analyzed chronologically. In addition, these studies are arranged by region: in other countries and then in Vietnam. Later, the research gap is identified from these studies.

There has been increasing interest in using technology to teach and learn second languages. However, some researchers criticize the use of ASR in second language teaching and learning, arguing that it yields low rates of accurate recognition for non-native speakers (Coniam, 1999; Derwing, Munro, & Carbonaro, 2000).

On the opposite side, Neri et al. (2002) argue that ASR is becoming more and more accurate in evaluating pronunciation and more adaptable, even to non-native speakers. Sharing the same interest, Kim (2006) conducted research on 36 Korean EFL university students (freshmen) to find the correlation coefficient between the scores of the ASR software Fluspeak and those of human raters. He found that the correlation coefficient was not high at the word level and near zero at the intonation level. This meant that the development of ASR technology at that time was far from what had been expected.

Banafa (2008, as cited in McCrocklin, 2014) has a more positive view of using technology in pronunciation. He states that working on computers is effective for pronunciation because it provides a good environment for practicing oral language. Sharing this positive viewpoint, Neri et al. (2008) did quantitative research on computer-assisted pronunciation training (CAPT) based on ASR technology. The research was conducted on 28 eleven-year-old students at the same school and at the same level. The students were divided into two groups: the control group (15 students) with teacher-led instruction and the experimental group (13 students) with the help of CAPT. It showed that children in the experimental group, who worked with ASR, improved their new word pronunciation just as the group taught in the traditional way did. However, the research was conducted on isolated words without explaining which aspects of word pronunciation were covered, such as vowels, consonants, or stress.

Furui (2010), on the other hand, points out that although ASR systems are widely used and have already been embedded in commercial applications, in real-life cases room reverberation significantly decreases ASR performance, making the use of such systems less effective.

With the same interest, Al-Qudah (2012) did research to find out whether there was any improvement in pronunciation after using ASR. She also wanted to know whether there was any difference between male and female students in pronunciation performance after using ASR. To answer these questions, she designed an experimental study with 149 third-year students (73 males and 76 females) at a university in Jordan. They were divided into two groups: the control group (74 students) used printed materials to improve pronunciation, whereas the experimental group (75 students) used ASR. A pretest was conducted for both groups to make sure that the participants had the same level, and a posttest was done after eight weeks of training for both groups. Means, standard deviations, and a two-way analysis of variance (ANOVA) were used. The study found that the participants in the experimental group outperformed those in the control group. However, no significant difference was found between male and female participants in their pronunciation performance. There were still some vague points in the study: it did not state which aspect of pronunciation (segmental or supra-segmental) was in scope or how the performance of the two groups was assessed. These vague points may affect the reliability of the research.

At the same time, Tran (2012) conducted experimental research on the use of computer software (Pronunciation Power and Praat) to help students improve their pronunciation of four English fricative sounds: /f/, /v/, /θ/, /ð/. Ninety-two non-English-major freshmen at Hung Vuong University joined the research. They were divided into two groups: the control group (45 participants), who studied without computer software, and the experimental group (47 participants), who worked with computer software. The participants’ pronunciation was checked in the pretest, which showed that the control group scored 0.10 points higher than the experimental one. However, after the training procedure (13 weeks), the experimental group scored 1.93 points higher than the control group in the posttest. In addition, students showed positive attitudes towards the use of computer software in their study of the four English fricatives.

Another study was published by McCrocklin in 2014 on the influence of ASR on students’ autonomy in pronunciation. The participants were divided into three groups: a control group (15 participants) with traditional face-to-face instruction; an experimental group (17 participants) with traditional face-to-face instruction and minimal strategy training with ASR; and a second experimental group (16 participants) with half face-to-face minimal strategy training and the other half working with ASR. She found that ASR was a useful tool for practicing new word pronunciation because the technology fosters autonomy and interest in its users.

In the same year, Elimat and AbuSeileek (2014) added that individual work is more effective than pair or group work when learners practice their pronunciation with ASR. Fathi Sidig Sidgi and Jelani Shaari (2017) conducted research on whether ASR was effective in helping users overcome the difficulties of pronouncing sounds that are not in their native language, Arabic. Participants of the research were 20 randomly selected Iraqi EFL students from among first-year students at Al-Turath University College in Baghdad. The participants attended a two-month pronunciation course using the Eyespeak software. At the end of the course, they completed a questionnaire on how useful it was (if at all) for improving their pronunciation. It showed that participants considered the Eyespeak software a very useful tool for improving their pronunciation, as it let them know where their pronunciation might be inaccurate. They also showed positive attitudes towards the software as an enjoyable way to learn pronunciation.

In the same year, Li et al. (2017) researched the application of ASR to English pronunciation correction and participants’ attitudes towards ASR in their learning of English. Twenty-nine first-year students from the School of Foreign Studies of South China Normal University participated in this study. They used the application iFlytek Voice Input (IVI) during a four-week course to practice their pronunciation. During the course, participants read aloud selected materials, and the application recorded, analyzed, and transformed their speech into text. Each practice session lasted 20 minutes and included reading and self-checking with IVI, followed by repeated rereading and rechecking. They found that IVI was helpful in improving participants’ pronunciation.

Later, Kholis (2021) conducted action research on the application of ELSA Speak with 18 English-education students from Nahdlatul Ulama University of Yogyakarta, Indonesia. They practiced their pronunciation with respect to accuracy in grammar and vocabulary, fluency, appropriacy, and comprehensibility during three cycles. He found that ASR (ELSA Speak) could improve participants’ pronunciation skills from two to four and motivate them to learn to pronounce.

Among these studies, the ones by Tran (2012), Li et al. (2017), and Kholis (2021) were considered the most closely related to the scope of the research. However, they still differed from it in some ways. The research by Tran (2012) clearly states its focus: four English fricatives only (/f/, /v/, /θ/, /ð/). There are still other aspects that can be researched, such as vowels, other consonants (stops, nasals, approximants, initial consonants, etc.), and stress. The study by Li et al. (2017), on the other hand, addresses pronunciation in general, which is too broad for the scope of this research. The one by Kholis (2021) is the closest to this research in terms of the application employed, ASR (ELSA Speak), and the research design (action research with both quantitative and qualitative data). However, it differs in terms of participants’ level (C1, in comparison with Pre-A1 and A1 in this research) and the subject studied (pronunciation in general, in comparison with the four final alveolar sounds /t/, /d/, /s/, /z/). These missing aspects of pronunciation could be considered the research gaps among the previous studies.

However, covering all these gaps would be too much for the researcher, given the limited time and the requirements of the syllabus. Therefore, it would be more appropriate to select one of these gaps for the scope of the research.

As a result, there were many choices for conducting the research. Nguyen (2012) did research on problematic consonants with 24 out of 104 participants at a college in Ho Chi Minh City, which is in the same region (the South-East of Vietnam) as the research site (Bien Hoa City). She identified a number of final consonant sounds that were problematic for students: /k/, /g/, /t/, /d/, /s/, /z/, /l/, /r/, /n/, /f/, /v/, /θ/, /ð/, /dʒ/, /tʃ/, /ʃ/, /p/, /b/, which attracted special interest from the researcher. Among those, the researcher paid special attention to alveolar sounds and found them feasible to study in his research. Moreover, during classroom teaching, the researcher realized that final alveolar sounds are quite problematic for the students. Therefore, final alveolar sounds were selected. According to Roach (2009), the list of alveolar sounds consists of /t/, /d/, /s/, /z/, /n/, and /l/.

In addition, Avery and Ehrlich (1992) state that Vietnamese people have difficulties in pronouncing final consonant sounds including /s/, /z/, /t/, /d/, /p/, /k/, /b/, /g/, /ʃ/, /θ/, /ð/, /ʒ/, /f/, /v/.

Therefore, when combining the list of problematic final consonant sounds by Nguyen (2012) with the one by Avery and Ehrlich (1992), the researcher came up with Figure 2.2 as follows:


Figure 2.2

Common Problematic Final Sounds

From Figure 2.2, it can be seen that the common problematic final sounds are plosives or fricatives in terms of manner of articulation but vary in terms of place of articulation (from bilabials to velars). Among these sounds, the four fricatives /f/, /v/, /θ/, and /ð/ had been studied by Tran (2012), so the researcher wanted to conduct research on the rest of them (/p/, /b/, /t/, /d/, /k/, /g/, /s/, /z/, /ʃ/). However, covering all of them was still too broad. During the researcher’s teaching of the class (oriented towards the Pre-A1 Starters certificate by Cambridge Assessment English), some final sounds were noticed more often than others, which meant that they occurred more frequently in the teaching materials. Table 2.1 shows the list of common problematic final sounds and how often each is seen in the Starters Word List by Cambridge Assessment English (2018):

Nguyen (2012): /k/, /g/, /t/, /d/, /s/, /z/, /l/, /r/, /n/, /f/, /v/, /θ/, /ð/, /dʒ/, /tʃ/, /ʃ/, /p/, /b/

Avery & Ehrlich (1992): /s/, /z/, /t/, /d/, /p/, /b/, /k/, /g/, /ʃ/, /ʒ/, /θ/, /ð/, /f/, /v/

Common to both: /p/, /b/, /t/, /d/, /k/, /g/, /f/, /v/, /θ/, /ð/, /s/, /z/, /ʃ/
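
The overlap shown in Figure 2.2 can be checked mechanically by intersecting the two lists. The following is only an illustrative sketch, not part of the original study: the variable names are hypothetical, and the sound inventories are copied from the lists attributed above to Nguyen (2012) and Avery and Ehrlich (1992).

```python
# Final consonants reported as problematic by each source (as listed in the text).
nguyen_2012 = {"k", "g", "t", "d", "s", "z", "l", "r", "n", "f", "v",
               "θ", "ð", "dʒ", "tʃ", "ʃ", "p", "b"}
avery_ehrlich_1992 = {"s", "z", "t", "d", "p", "b", "k", "g", "ʃ", "ʒ",
                      "θ", "ð", "f", "v"}

# Sounds reported by both sources (the overlap in Figure 2.2).
common = nguyen_2012 & avery_ehrlich_1992

# The four fricatives already studied by Tran (2012) are set aside.
already_studied = {"f", "v", "θ", "ð"}
remaining = common - already_studied

print(sorted(common))
print(sorted(remaining))
```

Under these assumptions, the intersection contains 13 sounds, and nine remain once the four fricatives studied by Tran (2012) are excluded.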


Table 2.1

Frequency of Common Problematic Final Sounds

Problematic final sound | /p/ | /b/ | /t/ | /d/ | /k/ | /g/ | /s/ | /z/ | /ʃ/
Frequency of being seen | 9 | 0 | 51 | 31 | 24 | 8 | 27 | 17 | 3

From Table 2.1, the four final sounds /t/, /d/, /s/, and /k/ have the highest frequencies in the word list. However, they belong to different groups (plosives and a fricative in terms of manner of articulation; alveolars and a velar in terms of place of articulation), which might make them confusing to study together. The final sound /z/, by contrast, is alveolar like /t/, /d/, and /s/ and holds the fifth highest frequency (17) in the word list. Hence, it might be more appropriate to employ sounds of the same group rather than different ones.

In addition, although Roach (2009) lists six alveolar sounds in total (/t/, /d/, /s/, /z/, /n/, /l/), only four final alveolar sounds (/t/, /d/, /s/, /z/) were common to the studies by Nguyen (2012) and Avery and Ehrlich (1992). Therefore, only these four final alveolar sounds were selected in this research.
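
The selection reasoning above (rank the candidate sounds by frequency, then keep only those in a single place-of-articulation group) can be illustrated with a short sketch. It is offered only as an illustration under stated assumptions: the frequencies are those reported in Table 2.1, the alveolar set follows Roach (2009) as cited above, and the variable names are hypothetical.

```python
# Frequencies from Table 2.1 (how often each sound is seen as a final sound
# in the Starters Word List).
frequency = {"p": 9, "b": 0, "t": 51, "d": 31, "k": 24,
             "g": 8, "s": 27, "z": 17, "ʃ": 3}

# Alveolar sounds according to Roach (2009), as cited in the text.
alveolar = {"t", "d", "s", "z", "n", "l"}

# Rank candidate sounds from most to least frequent.
ranked = sorted(frequency, key=frequency.get, reverse=True)

# Keep only the alveolar candidates, preserving the frequency order.
selected = [sound for sound in ranked if sound in alveolar]

print(ranked[:5])
print(selected)
```

With these figures, /k/ ranks fourth but is velar, so restricting the candidates to alveolars leaves /t/, /d/, /s/, and /z/.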

Some previous studies related to the research have been reviewed above. For readers’ convenience, Table 2.2 provides a summary of them.


Table 2.2

Summary of Previous Studies

No | Name | Year | Researcher(s) | Findings
… | … | 2002 | Neri et al. | The accuracy of ASR in evaluating pronunciation is getting higher and higher even for non-native speakers
4 | Reliability and pedagogical implications of ASR on pronunciation teaching | 2006 | Kim | The correlation coefficient at word level is not high and near zero at the intonation level
5 | Effects of IT on pronunciation | 2008 | Banafa | Working on computers was effective in pronunciation because it provides a good environment for practicing oral language
6 | Effectiveness of CAPT for children’s L2 learning | 2008 | Neri et al. | Participants who use ASR have the same progress as those who employ the traditional way
7 | History and development of speech recognition | 2010 | Furui | Room reverberation significantly decreases the effectiveness of ASR in …
… | … | 2012 | Al-Qudah | The group who worked with ASR outperformed the one who didn’t; there was no difference in terms of gender
9 | Effectiveness of computer software on English fricative pronunciation | 2012 | Tran | The experimental group got 1.93 points higher than the control group; students showed positive attitudes towards the use of ASR on four fricative sounds /f/, /v/, /θ/, /ð/
10 | The potential of ASR on fostering learners’ autonomy in pronunciation | 2014 | McCrocklin | ASR is useful for practicing new word pronunciation as it encourages users in terms of autonomy and interest
… | … | … | … | … an enjoyable way to learn pronunciation
13 | Improving English Pronunciation via automatic speech … | 2017 | Li et al. | IVI was found to be helpful in improving participants’ pronunciation


Table 2.2 is arranged chronologically, from 1999 to 2021. The table shows only the key points of the previous studies mentioned in this chapter: the name of the study, the year of publication, the name of the researcher(s), and the finding(s). The summary is intended to help readers follow the study and to recap what has been written in this part of the chapter.

In the next part, the conceptual framework is introduced to give readers a general view of what was done in the study as well as the relations among the operational terms stated in the previous parts of the study.

Title	The application of automatic speech recognition to students’ new word pronunciation
Author	Phạm Hưng Thịnh
Supervisor	Nguyễn Thúy Nga, Ph.D.
Institution	Vietnam National University - Ho Chi Minh City University of Social Sciences and Humanities
Major	English Linguistics and Literature
Type	Thesis
Year of publication	2021
City	Ho Chi Minh City
