Students' Learning Experiences
Tran Thi Oanh
International School, Vietnam National University, Hanoi,
144 Xuan Thuy, Cau Giay, Hanoi, Vietnam
oanhtt@isvnu.vn
Abstract. Understanding students' learning experiences on social media is an important task in educational data mining, since it provides more complete and in-depth insights that help educational managers obtain necessary information in a timely fashion and make more informed decisions. Current systems still rely on traditional machine learning methods with hand-crafted features. A further challenge is that important information can appear at any position in the posts/sentences. In this paper, we propose an attentive biLSTM method to deal with these problems. This model utilizes a neural attention mechanism with biLSTMs to automatically extract and capture the most critical semantic features in students' posts with regard to the current learning experience. We perform experiments on a Vietnamese benchmark dataset and the results indicate that our model achieves state-of-the-art performance on this task. We achieved 63.5% in the micro-average F1 score and 59.7% in the macro-average F1 score for this multi-label prediction task.
Keywords: Attention mechanism · biLSTMs · Students' learning experience · Social media
1 Introduction
Students' learning experience refers to the feelings/thoughts of students in the process of acquiring knowledge or skills while studying in academic environments. It is considered one of the most relevant indicators of education quality in schools/universities [17]. Understanding it is therefore an effective and important way to improve educational quality in schools/universities.
Learning experiences can vary dramatically across students. To determine students' learning experiences, the most widely used methods are surveys, direct interviews, or observations, which provide important opportunities for educators to obtain student feedback and identify key areas for action. Unfortunately, these traditional methods are usually time-consuming and thus cannot be repeated frequently. Moreover, they also raise questions about the accuracy and validity of the data collected, because responses do not accurately reflect what students were thinking or doing at the time the problems/issues happened.
Another drawback is that the selection of the standards of educational practice and student behavior implied in the survey questions has also been criticized [5]. Nowadays, social sites such as Facebook, forums, blogs, etc. provide great venues for students to express their opinions, concerns, and emotions about the learning process. When students post on these sites, they usually write about their feelings/thoughts at that moment. Therefore, the textual data collected from online conversations may be more authentic and unfiltered than responses to formal research surveys. These public datasets provide a vast amount of insights for educators to understand students' experiences beyond the traditional methods above. For mining such datasets, several works exist for English using traditional machine learning classifiers with hand-crafted features. Some typical classifiers used in mining various problems in students' learning process are Decision Tree [13], Naive Bayes [6], SVM [8], Memetic [2], etc. For Vietnamese, not much effort has been spent on mining such data so far. Tran and Nguyen [14] presented the first work towards mining social media to get insights from Vietnamese students' posts. They developed a framework using Naive Bayes and Decision Tree to automatically detect students' issues and problems in their study at universities.
Recently, deep neural network approaches have provided an effective way of reducing the number of hand-crafted features. Specifically, neural networks have been proven to improve the performance of many tasks, ranging from question generation [18] and machine translation [7] to relation classification [19]. Hence, in this paper, we propose a novel architecture exploiting a neural network called attention-based biLSTMs for mining students' learning experiences. This model does not use any features derived from knowledge resources or Natural Language Processing (NLP) systems. We perform experiments on a benchmark dataset and achieve 63.5% in the micro-average F1 score and 59.7% in the macro-average F1 score, higher than the existing methods in the literature for this critical task.
The rest of this paper is organized as follows: Sect. 2 presents related work. In Sect. 3, we present the proposed method, which uses an attention-based biLSTM to deal with the task. Section 4 describes the experimental setup, evaluation metrics, experimental results, and some findings of this work on a Vietnamese benchmark dataset. Finally, we summarize the paper in Sect. 5 and discuss some ongoing work for the future.
2 Related Work
Social media has risen to be not only a communication medium for personal purposes, but also a medium for sharing opinions about products and services, or even political issues, among its users. Many studies from diverse fields have developed tools to formally represent, measure, model, and mine meaningful patterns (knowledge) from large-scale social data for the concerned domains. In healthcare, several studies, e.g. Sue et al. [12], have shown that social media can be used to reveal a lot of health information about its users, or to provide online social support for anyone with health problems [16]. In the marketing field, researchers mine social data to recommend friends or items (e.g. online courses, videos, beauty products, research papers, search keywords, social tags, and other products in general) on social media sites, etc.
Recently, research on mining informal web-based conversations on social media (e.g., Facebook, forums, etc.) has started to emerge. These sites generate huge amounts of textual data which contain important information about students. Many studies have proposed different techniques to process such data to better understand students and their learning environments. This information is valuable to institutions/universities for making informed decisions related to students' learning. For example, Chen et al. [3] first provided a framework for analyzing this kind of data using Twitter posts for educational goals. Takle et al. [13] carried out a detailed study comparing different classification techniques such as Iterative Dichotomiser 3 (ID3), a Naive Bayes multi-label classifier, and a Memetic classifier on a common dataset to analyze and extract information related to students in order to enhance the higher education system. Blessy et al. [2] developed a framework that uses both qualitative analysis and big data mining techniques, based on a Naive Bayes multi-label classifier and a Memetic classifier, to categorize tweets presenting students' problems. Pande et al. [8] exploited the SVM method to determine many issues such as stress, suicide, sleep problems, and anxiety in students' posts. Patil et al. [9] showed how students indicate their feelings via social media sites and which posts fall into which category using the Memetic algorithm. Jessiepriscilla et al. [6] built a sentiment analyzer tool for analyzing tweets, based on a Naive Bayes multi-label classifier, which can be used to determine students' learning experiences. All of these studies were conducted using traditional machine learning methods.
While most work has focused on English, only a few attempts have been made for Vietnamese so far. Specifically, Tran and Nguyen [14] presented the first work towards mining social media to get insights from engineering students' posts. They developed a framework to automatically detect students' issues and problems in their study at universities. Similar to other work in English, the authors also exploited traditional machine-learning methods, namely Naive Bayes and Decision Tree, to build the prediction models. This work also contributed the first benchmark dataset in this field for Vietnamese. The experimental results were just a preliminary step and more effort is needed to enhance the performance of the methods.
As can be seen, previous work mostly exploited traditional machine learning methods which require hand-crafted features. Designing these features is commonly time-consuming and requires experts' knowledge. Another challenge is that in a post, some words play more important roles in deciding its main meaning, especially when one student's post may carry more than one meaning. In recent years, deep neural network methods have given us an effective way to reduce the number of hand-crafted features. They also do not require extra knowledge or NLP systems. Therefore, this research proposes a novel architecture exploiting an attentive biLSTM for the task of mining students' learning experiences on social media. Specifically, we convert the multi-label classification into binary classification problems and then exploit the attentive biLSTM to build the corresponding models for these problems. The effectiveness of the proposed method is verified on a Vietnamese benchmark dataset through extensive experiments.
3 An Attention-Based biLSTM for Understanding Students' Learning Experiences
Formally, the multi-label learning problem can be seen as the problem of finding a method that maps inputs x to binary vectors y; these binary vectors are not scalar outputs as in the single-label classification problem. Multi-label classification can be solved with transformation techniques, which turn the problem into several single-label classification problems. This work uses the technique called binary relevance. Specifically, assuming that we have p labels, this method creates p new datasets, one dataset for each label. The binary relevance method then trains a single-label classifier on each of these new datasets. Each single-label classifier only decides whether or not the current sample belongs to its label i. The multi-label prediction for a new sample is determined by combining the classification results from all of these independent single-label classifiers, as sketched below.
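A minimal sketch of this binary relevance transformation; the dataset format, the label set, and the train_binary_classifier helper are hypothetical placeholders, not part of the original system.

```python
# Binary relevance: turn one multi-label problem into p independent binary problems.
# `posts` is a list of texts; `label_sets` is a parallel list of sets of gold labels.
# `train_binary_classifier` stands in for any single-label learner (here: the attentive biLSTM).

def binary_relevance_train(posts, label_sets, all_labels, train_binary_classifier):
    classifiers = {}
    for label in all_labels:
        # Build one binary dataset per label: 1 if the post carries this label, else 0.
        y = [1 if label in gold else 0 for gold in label_sets]
        classifiers[label] = train_binary_classifier(posts, y)
    return classifiers

def binary_relevance_predict(post, classifiers):
    # Combine the independent binary decisions into one predicted label set.
    return {label for label, clf in classifiers.items() if clf.predict(post) == 1}
```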
Each of these classifiers is built using the attentive biLSTM architecture illustrated in Fig. 1. This deep neural network is usually very effective at encoding sequences of words and is powerful for learning on data with long-range dependencies. By default, it considers each word in a post with equal importance. The attention mechanism is introduced to allow the model to pay attention to the more important parts of the students' posts. The model can therefore automatically concentrate on the important words that have a greater impact on the final classification, capturing the most important semantic information in each post. This model does not use any extra knowledge or outputs from NLP systems. The overall framework consists of four main layers, described as follows.
Each student's post consists of n words, s = {w_1, w_2, ..., w_n}, where w_i is the i-th word of the post. Each word in the post is converted into a vector x_i using word embeddings. Word embedding is one of the most effective representations of post vocabulary nowadays: it has the capability of encoding the context of a word in a post, semantic as well as syntactic similarity, and its relations with other words. In this paper, we use GloVe [10], an unsupervised learning algorithm for learning vector representations of words.
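A minimal sketch of how pretrained GloVe vectors can be loaded and used to map words to embeddings; the file name and 50-dimensional size follow the setup described in Sect. 4, but the loader itself is an assumption, not the authors' code.

```python
import numpy as np

def load_glove(path="glove.vi.50d.txt", dim=50):
    """Read pretrained GloVe vectors (word followed by `dim` floats per line)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_post(words, vectors, dim=50):
    """Map a tokenized post to an (n, dim) matrix; unknown words get a zero vector."""
    return np.stack([vectors.get(w, np.zeros(dim, dtype=np.float32)) for w in words])
```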
Fig. 1. An attention-based biLSTM for understanding students' learning experiences on social media.

Let X = (x_1, x_2, ..., x_n) be a student's post consisting of the vector representations of its n words. At each position t, the RNN produces an output based on a hidden state h_t:

y_t = σ(W_y h_t + b_y), (1)

where W_y and b_y denote a parameter matrix and vector, respectively, determined during training, and σ denotes the element-wise softmax function. The hidden state h_t is updated by an activation function of the previous hidden state h_{t-1} and the current input x_t:

h_t = f(x_t, h_{t-1}). (2)
LSTM cells use several gates to update the hidden state h_t: an input gate i_t, a forget gate f_t, an output gate o_t, and a memory cell c_t. The update formulas are given below:

i_t = σ(W_i x_t + V_i h_{t-1} + b_i), (3)
f_t = σ(W_f x_t + V_f h_{t-1} + b_f), (4)
o_t = σ(W_o x_t + V_o h_{t-1} + b_o), (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c x_t + V_c h_{t-1} + b_c), (6)

where ⊙ denotes element-wise multiplication, the W and V are weight matrices, and the b are bias vectors to be learned.
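As an illustration only, a direct transcription of Eqs. (3)-(6) into a single LSTM cell step; the parameter layout is an assumption, and the final line uses the standard LSTM output h_t = o_t ⊙ tanh(c_t), which the text above does not spell out explicitly.

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, V, b):
    """One LSTM step following Eqs. (3)-(6); W, V, b are dicts keyed by gate name."""
    i_t = torch.sigmoid(W['i'] @ x_t + V['i'] @ h_prev + b['i'])   # Eq. (3): input gate
    f_t = torch.sigmoid(W['f'] @ x_t + V['f'] @ h_prev + b['f'])   # Eq. (4): forget gate
    o_t = torch.sigmoid(W['o'] @ x_t + V['o'] @ h_prev + b['o'])   # Eq. (5): output gate
    c_t = f_t * c_prev + i_t * torch.tanh(W['c'] @ x_t + V['c'] @ h_prev + b['c'])  # Eq. (6)
    h_t = o_t * torch.tanh(c_t)  # standard LSTM hidden-state output (not numbered in the text)
    return h_t, c_t
```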
To improve the model performance, two LSTMs are run over each user utterance: the first processes the utterance from left to right (producing l_i) and the second processes a reversed copy of the utterance (producing r_i). The forward and backward outputs l_i and r_i are then combined into c_i, by concatenation by default, before being passed on to the next layer.
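A minimal sketch of this bidirectional layer using PyTorch's built-in nn.LSTM (PyTorch is one of the libraries the paper mentions); the hidden size and post length are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: PyTorch runs the forward and backward passes and
# concatenates their outputs along the last dimension (2 * hidden_size).
embedding_dim, hidden_size = 50, 100              # illustrative sizes
bilstm = nn.LSTM(embedding_dim, hidden_size,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 12, embedding_dim)             # one post of 12 word vectors
H, _ = bilstm(x)                                  # H: (1, 12, 2 * hidden_size)
```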
Let H denote the matrix of output vectors [h_1, h_2, ..., h_n] produced by the biLSTM layer, where n is the post length. One could simply average these vectors and feed the result to the classifier, but not all of this information is equally important. Attention is therefore used to indicate which words are more important and which are less so: a small neural network is trained over H to score the importance of each word. Let r be the representation of the post; r is created as a weighted sum of the output vectors:

α = softmax(w^T H),
r = H α^T,

where w is a trained parameter vector and w^T is its transpose. The attention weights α indicate how important each position is; the weighted sum r is then fed into the classifier.
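A minimal sketch of this attention step, assuming the simple formulation just described (a single trained vector w scoring each biLSTM output); the exact parameterization in the paper may differ.

```python
import torch
import torch.nn.functional as F

def attention_pool(H, w):
    """Weighted sum of biLSTM outputs.

    H: (n, d) matrix of hidden states, w: (d,) trained parameter vector.
    Returns the post representation r of shape (d,).
    """
    scores = H @ w                    # one importance score per word
    alpha = F.softmax(scores, dim=0)  # attention weights over the n positions
    r = alpha @ H                     # weighted sum of the hidden states
    return r
```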
We obtain from r the final post representation h*, which is used for classification. This work exploits a softmax classifier¹ to predict the label y* from a pre-defined set of classes Y for a student's post s. The classifier takes the hidden state h* as input:

p(y|s) = softmax(W h* + b),
y* = argmax_y p(y|s). (13)
¹ Instead of the softmax function, the sigmoid function can be used as an alternative. In fact, in binary classification the sigmoid and softmax functions are equivalent, whereas in multi-class classification the softmax function is preferred.
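A short numerical check of the footnote's claim that sigmoid and two-class softmax coincide: a two-class softmax over logits [0, z] gives the same positive-class probability as sigmoid(z).

```python
import torch
import torch.nn.functional as F

z = torch.tensor([1.7])                                      # an arbitrary logit
p_sigmoid = torch.sigmoid(z)                                 # binary probability via sigmoid
p_softmax = F.softmax(torch.tensor([0.0, 1.7]), dim=0)[1]    # two-class softmax, positive class

print(p_sigmoid.item(), p_softmax.item())                    # both ≈ 0.85: the two views agree
```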
4 Experiments
This section first describes the dataset used in the experiments. Typical evaluation metrics are then presented to estimate the effectiveness of the proposed method, followed by the detailed configuration used to set up the experiments. Finally, the experimental results on this dataset are reported.
Data were collected from a forum of a well-known university in Vietnam. The dataset contains 1834 posts relating to the learning experiences of students at an information technology university. In this dataset, one post can fall into one or multiple categories. There are seven categories, which are also the main problems/issues that students often meet during their studies at the university. Figure 2 shows the number of instances per label in our dataset.
Fig. 2. Number of posts in each category of the dataset analyzed.
The evaluation metrics for multi-label classification are slightly different from the metrics for the single-label task. In multi-label classification, a misclassification is no longer strictly wrong or right: a predicted set of labels which includes a subset of the gold classes should be considered better than a predicted set that does not contain any gold class. In this paper, we report both settings to evaluate the performance of the method.
In this situation, researchers [4] proposed two types of metrics: example-based measures and label-based measures.
Example-Based Measures. These measures are calculated per example (here, each post is considered an example) and then averaged over all posts in the dataset. Suppose that we are classifying a certain post p, the gold (true) set of labels of p is G, and the set of labels predicted by the classifier is P; the example-based evaluation metrics are calculated as follows:
Acc = (1/M) Σ_{i=1}^{M} |G_i ∩ P_i| / |G_i ∪ P_i|,

Prec = (1/M) Σ_{i=1}^{M} |G_i ∩ P_i| / |P_i|,

Rec = (1/M) Σ_{i=1}^{M} |G_i ∩ P_i| / |G_i|,

F1 = (1/M) Σ_{i=1}^{M} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i),

where Precision_i and Recall_i are the precision and recall of post i, and M is the number of posts in the corpus.
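A minimal sketch of these example-based metrics computed over gold and predicted label sets; pure Python, written only to mirror the formulas above.

```python
def example_based_metrics(gold_sets, pred_sets):
    """gold_sets, pred_sets: parallel lists of Python sets of labels, one pair per post."""
    M = len(gold_sets)
    acc = prec = rec = f1 = 0.0
    for G, P in zip(gold_sets, pred_sets):
        inter = len(G & P)
        acc += inter / len(G | P) if (G | P) else 1.0
        p_i = inter / len(P) if P else 0.0
        r_i = inter / len(G) if G else 0.0
        prec += p_i
        rec += r_i
        f1 += 2 * p_i * r_i / (p_i + r_i) if (p_i + r_i) else 0.0
    return {name: value / M for name, value in
            {"Acc": acc, "Prec": prec, "Rec": rec, "F1": f1}.items()}
```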
Two further commonly used measures for multi-label classification are the micro-average F1 and the macro-average F1. The former gives the same weight to each classification decision per post, while the latter gives the same weight to each label. They are variants of F1 used in different situations.
Label-Based Measures. These measures are computed on each label and then averaged over all labels in the dataset. Specifically, precision, recall, and F1 for each label l are calculated as follows:
P = TP / (TP + FP),
R = TP / (TP + FN),
F1 = (2 · P · R) / (P + R),
where TP is the number of posts correctly detected as the currently-considered label l, FP is the number of posts that do not belong to l but are identified as l, and FN is the number of posts of l that are not recognized as l by the models.
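A minimal sketch of the label-based and micro/macro-averaged scores using scikit-learn (one of the libraries mentioned in the implementation), assuming the multi-label annotations are represented as binary indicator matrices.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, f1_score

# Rows = posts, columns = labels; 1 means the label applies (toy example with 3 labels).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Per-label precision, recall, F1 (label-based measures).
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None, zero_division=0)

# Micro-average weights every decision equally; macro-average weights every label equally.
f1_micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
```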
Table 1. Experimental results of detecting students' learning experiences using example-based metrics.

Methods          Accuracy  Precision  Recall  F1-micro  F1-macro
Decision Tree    0.565     0.548      0.571   0.583     0.558
Attentive LSTM   0.612     0.587      0.629   0.635     0.597
The model was implemented in the Python programming language with several typical libraries such as PyTorch, numpy, sklearn, utils, etc. These libraries provide rich tools and options to support development in NLP and many other research fields.
To create pre-trained word embeddings, we gathered raw data from Vietnamese newspapers (≈ 7 GB of text) to train the word vector model using GloVe². The dimension of the word embeddings was fixed at 50.
For each label, we created a corresponding dataset that only focuses on the currently-considered label. On each such dataset, we performed 5-fold cross-validation tests to evaluate the performance of the proposed attentive biLSTM-based model. The hyperparameters were chosen using a development set: we randomly selected 10% of the training data as the development set. To detect students' learning experiences, we set the number of epochs to 100, the batch size to 20, early stopping to True with a patience of 4 epochs, and the dropout rate to 0.5. A sketch of this evaluation loop is given below.
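A minimal sketch of the per-label evaluation loop under the configuration just described; the train_model and evaluate helpers are hypothetical placeholders standing in for the attentive biLSTM training and scoring code.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def cross_validate_label(posts, labels, train_model, evaluate, seed=42):
    """5-fold CV for one binary (per-label) dataset, with a 10% dev split per fold."""
    posts, labels = np.asarray(posts), np.asarray(labels)
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(posts):
        # Hold out 10% of the training fold as the development set for early stopping.
        tr_idx, dev_idx = train_test_split(train_idx, test_size=0.1, random_state=seed)
        model = train_model(posts[tr_idx], labels[tr_idx],
                            dev=(posts[dev_idx], labels[dev_idx]),
                            epochs=100, batch_size=20, patience=4, dropout=0.5)
        scores.append(evaluate(model, posts[test_idx], labels[test_idx]))
    return float(np.mean(scores))
```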
In this paper, we compare the performance of the proposed model with the best results of previous work on the same dataset. The best previous performance is obtained with the Decision Tree method [14] in the same binary relevance setting. In that work, Tran et al. exploited C4.5 (J48), a decision tree algorithm proposed by Ross Quinlan [11]. C4.5 begins with large sets of cases of known classes. These cases are represented by any mixture of properties in both nominal and numeric forms, and are carefully examined for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models that can later be used for classifying new, unseen cases; the emphasis is on models that are understandable as well as accurate. C4.5 was ranked first among the top 10 data mining algorithms published by Springer in 2008 [15]. Using this method, the baseline model achieved 58.3% in the micro-average F1 score and 55.8% in the macro-average F1 score.
Table 1 shows the experimental results of the baseline and the proposed method using example-based metrics. It should be noted that the higher the evaluation metrics, the better the performance of the models.
² https://github.com/stanfordnlp/GloVe.
As can be seen, the attentive biLSTM model significantly boosted the performance on this task. It achieved better results by around 4% on all metrics: accuracy, recall, precision, macro-average F1, and micro-average F1. Specifically, the F1-micro score increased by 5.2% and the F1-macro score by 3.9%. This result suggests that the attention mechanism has a significant effect on mining students' learning experiences in social media: in practice, it is quite effective in helping the model focus on the words that are most useful for classifying students' learning experiences.
Table 2. Experimental results of the attention-based biLSTM for detecting students' learning experiences using label-based metrics.

            Study load  Negative emotion  Career targets  English barriers  Others  Material resources  Diversity issues
Precision   0.832       0.900             0.928           0.948             0.788   0.905               0.919
Recall      0.775       0.923             0.933           0.949             0.792   0.892               0.922
F1          0.788       0.910             0.921           0.944             0.776   0.895               0.914
Table 2 shows the performance of the attention-based biLSTM method on each label using label-based metrics. We can see that the attentive biLSTM model yielded quite high scores. Most labels, such as Negative Emotion, English Barriers, Career Targets, and Diversity Issues, reached more than 90% in the F1 score. The Material Resources label reached 89.5% in the F1 score. For the remaining two labels, Heavy Study Load and Others, the proposed method achieved around 78% in the F1 score. This result is quite promising given the ambiguity involved in predicting these labels: observing their samples in the dataset, we saw that they overlap heavily with the remaining labels, so the model is prone to mistakes when predicting them.
5 Conclusion
This paper presented a new approach to the task of determining students' learning experiences on social media. Previous systems still relied on traditional methods with manually-designed features; building these features takes time and experts' knowledge. A further challenge is that not all words in a post carry the same weight in the model's final prediction. Therefore, this paper proposed an attention-based biLSTM to address these problems. The model utilizes a neural attention mechanism with biLSTMs to automatically extract and capture the most critical semantic features in students' posts.
We performed experiments on a Vietnamese benchmark dataset, and the results show that the model achieves state-of-the-art performance on this task for Vietnamese. The proposed method improves the performance by a large margin of around 4% across the evaluation metrics, achieving 63.5% in the micro-average F1 score and 59.7% in the macro-average F1 score.