
Predicting students' performance of Pre-English course by using neural network – a case study in International School – Vietnam National University, Hanoi


DOCUMENT INFORMATION

Basic information

Title: Predicting students' performance of pre-English course by using neural network – a case study in International School – Vietnam National University, Hanoi
Author: Le Quynh Hoa
Supervisor: Dr. Nguyen Quang Thuan
School: Vietnam National University, Hanoi – International School
Major: Business Data Analytics
Document type: Graduation project
Year: 2023
City: Hanoi
Format
Pages: 52
Size: 2.35 MB


Structure

  • CHAPTER 1: INTRODUCTION (10)
    • 1.1. Background and Significance (10)
    • 1.2. Context (11)
    • 1.3. Previous Approaches in Student Performance Assessment (11)
      • 1.3.1. Traditional Approaches (11)
      • 1.3.2. Data-Driven Approaches (12)
      • 1.3.3. Supervised Learning in Student Performance Assessment (12)
      • 1.3.4. Application of Deep Learning in Student Performance Assessment (13)
      • 1.3.5. Advantage of Using Neural Network Approach for Multiclass Classification (13)
    • 1.4. Research questions (15)
  • CHAPTER 2: FUNDAMENTALS OF ARTIFICIAL NEURAL NETWORKS (16)
    • 2.1. Fundamentals of artificial neural networks (16)
      • 2.1.1. Feedforward Neural Network Architecture (16)
      • 2.1.2. Neural networks’ parameters (17)
      • 2.1.3. Activation Function (19)
      • 2.1.4. Propagation in Neural Network (21)
      • 2.1.5. Cost Functions (21)
      • 2.1.6. Backpropagation & Optimization Algorithm in Neural Network (22)
      • 2.1.7. Hyperparameter (24)
    • 2.2. Techniques and Methodology (25)
      • 2.2.1. Cross-Validation Techniques (25)
      • 2.2.2. Regularization Techniques (26)
      • 2.2.3. Models and Parameters (26)
    • 2.3. Evaluation and Performance Metrics (27)
  • CHAPTER 3: APPLYING NEURAL NETWORKS FOR PREDICTING TIME TO (29)
    • 3.1. Dataset description (29)
    • 3.2. Preprocessing Steps Applied to the Dataset (31)
      • 3.2.1. Data selection & Labeling (31)
      • 3.2.2. Handling missing data (33)
      • 3.2.3. Checking the correlation (34)
      • 3.2.4. Perform Encoding and Data Scaling for Training Variables (35)
    • 3.3. Experimental Setup (37)
      • 3.3.1. Library Introduction (37)
      • 3.3.2. Experimental (37)
    • 3.4. Results and Analysis (46)
      • 3.4.1. Confusion Matrix (46)
      • 3.4.2. Classification Report (46)
      • 3.4.3. ROC Analysis (47)
  • Equation 1. Softmax Function (0)
  • Equation 2. Function of employed metrics in classification (0)

Content


INTRODUCTION

Background and Significance

In the education sector, accurately predicting student performance is essential for policymakers to evaluate learning outcomes, refine instructional strategies, and formulate educational policies. Effective classification of student performance across various classes facilitates personalized learning, early intervention, and informed decision-making. However, this process encounters challenges due to the subjective nature of performance assessment, the necessity for consistent and equitable classification systems, and the influence of class imbalance on the effectiveness of classification efforts.

Education data mining (EDM) leverages data analytics to derive valuable insights from extensive datasets, enhancing educational outcomes. Recent research highlights the effectiveness of data mining techniques in predicting student performance, with studies examining clickstream interactions in Massive Open Online Courses to forecast student attrition, as illustrated by Sinha et al. Moreover, automated student models have been utilized to anticipate performance, showcasing the potential of data-driven approaches in education.

The implications of these findings are crucial for educators, policymakers, and students, as early identification of at-risk students allows for targeted interventions that enhance their success rates. Policymakers can utilize performance predictions to effectively allocate resources and support beneficial programs and schools. Additionally, students can leverage their predicted performance as motivation to improve and make informed educational choices. However, categorizing student performance into various levels poses challenges. This thesis focuses on applying educational data mining (EDM) to predict student performance within the context of the pre-English (PE) course placement process at International School - Vietnam National University, examining the influencing factors involved.

Context

The Faculty of Applied Linguistics at the International School - Vietnam National University (VNUIS) aims to enhance the accuracy and efficiency of its student placement process for pre-English courses. To achieve this, they are exploring the implementation of the EDM method, a data-driven approach designed to predict students' academic performance effectively.

The placement test assesses students' language abilities through writing, reading, and listening evaluations. However, classifying students based on their scores has been difficult. By employing EDM techniques, the department can analyze factors that affect student performance, including academic achievements, language test results, demographic data, and future English proficiency. This approach aims to create a more effective and fair classification system for students.

Improving the placement process for pre-English courses is crucial for aligning students with classes that suit their language abilities, ultimately enhancing their learning experience. The findings of this study will guide decisions on the necessity of the PE course and its influence on students' English proficiency. Utilizing EDM techniques for predicting student performance shows potential for increasing the effectiveness of the classification process, thereby enriching the overall educational experience for students.

Previous Approaches in Student Performance Assessment

Traditional methods of assessing student performance, such as teacher evaluations, exams, and assignments, have been commonly used due to their familiarity and ease of implementation. However, these approaches have significant limitations, particularly the subjectivity involved in the evaluation process. Different teachers may interpret performance indicators in various ways, resulting in inconsistent assessment outcomes and potential biases.

Traditional educational methods face challenges in delivering timely feedback and personalized learning experiences, as assessments are typically conducted at fixed intervals, like semester exams. This approach often fails to reflect the ongoing progress of students and places greater emphasis on outcomes rather than the learning process itself.

The increasing adoption of data-driven methods in evaluating student performance is largely due to advancements in educational data mining, machine learning, and predictive analytics. By analyzing extensive educational datasets, these methods uncover valuable insights and patterns related to student demographics, learning activities, and performance records. Ultimately, the aim is to deliver objective and evidence-based assessments of student achievement.

Data-driven approaches in education offer significant scalability, enabling the analysis of extensive student data to reveal hidden patterns that inform effective instructional strategies. Additionally, these methods facilitate personalized learning by identifying individual strengths, weaknesses, and learning styles, allowing for tailored interventions that enhance student outcomes.

Data-driven approaches in education face significant challenges, including privacy concerns and ethical considerations related to the management of sensitive student information. Additionally, accurately interpreting data insights necessitates domain expertise to prevent misinterpretations and erroneous assumptions.

1.3.3 Supervised Learning in Student Performance Assessment

Supervised learning algorithms are extensively utilized in assessing student performance by creating predictive models from labeled training data. These algorithms identify patterns and relationships between input features, such as demographics and past performance, and the target variable, like final grades. Commonly used supervised learning methods include decision trees, logistic regression, and support vector machines.

Supervised learning offers the significant advantage of automating assessment processes while delivering precise predictions. A notable example is a Stanford University study that utilized Artificial Neural Networks (ANN) to predict student dropout rates, achieving an impressive true positive rate of 0.994. These algorithms not only identify key predictors but also offer valuable insights into the factors that affect student performance.

Supervised learning methods face several challenges, primarily their dependence on high-quality labeled training data, which can be difficult and time-consuming to acquire. Furthermore, these models often struggle to accurately capture complex nonlinear relationships within educational data, which can hinder their overall predictive effectiveness.

1.3.4 Application of Deep Learning in Student Performance Assessment

Deep learning, a subset of machine learning, holds immense potential in predicting student performance by analyzing intricate educational data. Leveraging models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), deep learning can effectively identify complex patterns and relationships within diverse data sources, making it a valuable tool for educational forecasting.

These models have proven effective in analyzing various educational data, including student demographics, academic records, and online learning engagement. By automatically learning hierarchical representations from raw data, they can identify spatial, temporal, and sequential dependencies.

Researchers have created advanced deep learning models to forecast academic performance, dropout risk, and learning challenges. For instance, Poudyal et al. implemented a CNN-based method that surpasses traditional models in predicting student performance in online courses. Meanwhile, Yanbai et al. utilized an RNN architecture to predict student dropout, resulting in notable enhancements in accuracy.

Deep learning provides important insights into student outcomes, but it faces significant challenges, including the need for large amounts of labeled data and high computational costs. To fully leverage deep learning for assessing student performance, it is essential to address issues related to data availability, model interpretability, and scalability.

1.3.5 Advantage of Using Neural Network Approach for Multiclass Classification in Predicting Student Performance

Feedforward neural networks (FFNN) have emerged as a prominent method for assessing student performance, showcasing their effectiveness in various prediction tasks. These artificial neural networks excel in supervised learning by utilizing labeled data to forecast outcomes for new, unseen instances. The primary advantage of FFNN in predicting student performance is their capability to manage complex variable relationships and accurately identify non-linear patterns in the data. Their architecture facilitates the flow of information through multiple interconnected layers, enabling the detection of hidden patterns and enhancing prediction accuracy.

Feedforward neural networks (FFNN) outperform traditional statistical methods in data analysis, particularly in identifying complex patterns, managing missing data, and addressing noisy inputs. This capability is crucial for accurately predicting student performance by considering factors such as demographic information, academic history, and past exam scores. By utilizing FFNN, researchers and educators can enhance the precision and comprehensiveness of their predictions. FFNN can be effectively trained through various algorithms, including backpropagation, which optimizes network weights to reduce discrepancies between predicted and actual outcomes.

FFNN can be applied in unsupervised learning, enabling the network to identify patterns and relationships within data without the need for labeled examples. Nevertheless, supervised learning is typically favored for predicting student performance, as it allows the model to train on labeled data, effectively learning the connections between input features and output classes.

Research questions

This thesis evaluates the effectiveness of Pre-English (PE) courses for students, while also examining the accuracy and reliability of the placement test used for PE. The study aims to determine whether the placement test offers valuable insights for categorizing students effectively.

A promising method for analyzing this data is through the use of a Feedforward Neural Network (FFNN). Identifying the key features that significantly influence the model's performance is crucial, whether these are numerical scores or personal demographic information.

- Finally, insights and recommendations will be provided based on the analysis to help improve the school's approach to PE courses and students’ placement test.

FUNDAMENTALS OF ARTIFICIAL NEURAL NETWORKS

Fundamentals of artificial neural networks

This section explores the feedforward architecture (FFNN), a type of artificial neural network (ANN) modeled after the human brain's structure and functionality. ANN, often referred to as "networks of neurons," are advanced machine learning algorithms that empower computers to think, perform tasks, and tackle complex problems similarly to humans. Comprising fully-connected neurons, known as units, ANN are structured in layers, allowing the outputs of certain neurons to serve as inputs for others.

The feedforward architecture is a fundamental type of artificial neural network.

In a fully connected neural network, each neuron in one layer is linked to every neuron in the next layer, creating an interconnected structure. The strength of these connections is defined by weights, which influence how one neuron's output affects another neuron's input. The fundamental architecture of a Feedforward Neural Network (FFNN) comprises three key components: the input layer, hidden layers, and the output layer.

Figure 1 General Structure of Artificial Neural Network with Two Hidden Layers

The input layer is the initial layer of a feedforward neural network, responsible for receiving input data and transmitting it to the first hidden layer.

Hidden layers play a crucial role in extracting essential features from input data, with each layer containing multiple neurons that transform this data. By adjusting the number of hidden layers and neurons within each layer, the model's performance can be enhanced. Additionally, the activation functions utilized in these hidden layers allow the model to learn intricate relationships between input and output variables by incorporating nonlinearity.

Figure 2 Data Processing in a Neuron (Source: Medium)

● Output Layer: The output layer of an FFNN produces the final prediction or classification based on the input data and the learned attributes.

Figure 3 Weight and Bias in a neuron (Source: InfoWorld)

In neural networks, each connection between units is assigned a weight that indicates its strength and significance. During training, these weights are continually adjusted to enhance the model's performance, striving to align predicted values with actual labels. The weights influence how much input from one neuron affects the activation of the subsequent neuron, enabling the network to learn and identify underlying patterns in the input data.

Weight initialization is crucial for neural network training, as it involves assigning initial values to the weights before the learning process starts. Common techniques include random initialization, where weights are assigned within a specific range, and initialization based on the number of incoming and outgoing connections to each neuron. Proper weight initialization is essential for avoiding problems like vanishing gradients and ensuring stable, efficient learning in the model.
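The two families of initialization just described can be sketched in NumPy. The layer sizes below are arbitrary placeholders, not the thesis's actual architecture: Xavier/Glorot scales by incoming and outgoing connections, while He initialization scales by incoming connections only.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def xavier_init(fan_in, fan_out):
    """Glorot/Xavier uniform: limit depends on both fan_in and fan_out."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He normal: std depends on fan_in; a common choice for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = xavier_init(11, 32)   # e.g. 11 input features -> 32 hidden units
W2 = he_init(32, 5)        # 32 hidden units -> 5 output classes
print(W1.shape, W2.shape)  # (11, 32) (32, 5)
```

Keeping the initial weights small but non-zero in this way is what lets gradients flow without immediately vanishing or exploding.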

Biases in Feedforward Neural Networks (FFNN) are crucial parameters that enhance the flexibility and control of neuron activation. Each neuron, except those in the input layer, has an associated bias value that acts as a threshold, influencing when a neuron activates. This mechanism enables the network to adapt to variations in input data, facilitating more accurate decision-making.

During the training phase of a neural network, the model's performance is optimized by adjusting biases and weights. The network iteratively updates these biases in response to the discrepancies between predicted and actual outputs. This process allows the network to better align with the training data, ultimately enhancing its ability to generalize to new, unseen data.

Biases in a neural network are represented as distinct parameters for each node, initially set to random values, often zero. During training, these biases are updated through optimization algorithms like gradient descent, which seeks to reduce the discrepancy between predicted and actual labels.

Activation functions play a crucial role in neural networks by enabling the model to learn and predict through the introduction of non-linearity. While linear calculations fall short in tackling complex real-world problems, non-linear methods effectively capture the intricate relationships between inputs and outputs.

An activation function plays a crucial role in a neural network by determining a neuron's output from the weighted sum of its inputs. Without this function, the network would merely operate as a linear regression model, failing to recognize complex non-linear patterns within the data.

Activation functions are essential for neural networks as they allow the model to learn complex relationships and represent intricate patterns in data. By facilitating information representation, defining output ranges, and enabling efficient gradient propagation, these functions enhance the network's flexibility in solving diverse problems. Without them, neural networks would be restricted to linear transformations, significantly limiting their effectiveness. Thus, activation functions play a crucial role in the success of neural networks.

Activation functions play a crucial role in the hidden layers of Feedforward Neural Networks (FFNN) by introducing nonlinearity, enabling the model to capture intricate relationships between input and output variables. Among the most popular activation functions are Rectified Linear Unit (ReLU) and Softmax, along with others like sigmoid, tanh, and Leaky ReLU.

Figure 4 Commonly used activation functions (Source: AI Wiki)

The ReLU activation function is widely used in feedforward neural network (FFNN) architectures due to its efficiency in returning 0 for negative inputs and the original input value for positive inputs. This characteristic helps mitigate the vanishing gradient problem associated with the sigmoid function, allowing models to continue learning effectively. ReLU's speed and efficiency contribute to faster training times, enhanced performance, and improved accuracy, making it a preferred choice in machine learning and neural networks.

Figure 5 Visualization of ReLU Function

The softmax activation function is commonly employed in the output layer of neural networks for classification tasks, as it transforms output values into a probability distribution across classes, ensuring that the predicted probabilities sum to 1. This function is ideal for scenarios where classes are mutually exclusive, indicating that each input instance is assigned to only one class.
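As a minimal NumPy sketch (with made-up weights, not the trained model), this is how a ReLU hidden layer and a softmax output layer combine in a single forward pass:

```python
import numpy as np

def relu(z):
    """Return 0 for negative inputs, the input itself otherwise."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turn raw scores into a probability distribution that sums to 1."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# A toy forward pass: 3 input features -> 4 hidden units -> 5 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 5)), np.zeros(5)

x = np.array([0.5, -1.2, 2.0])     # one (scaled) input record
hidden = relu(x @ W1 + b1)         # hidden-layer activations
probs = softmax(hidden @ W2 + b2)  # one probability per class
print(probs.sum())                 # 1.0 up to floating point
```

Because the five output probabilities sum to 1, the predicted class is simply the index of the largest one, which matches the mutually exclusive intake classes described above.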

Techniques and Methodology

Cross-validation is an essential technique for assessing the performance of Feedforward Neural Network (FFNN) models by dividing the dataset into training and validation sets, which enables an unbiased evaluation. A popular approach is k-fold cross-validation, where the data is split into k subsets of similar size, allowing the model to be trained k times with k-1 subsets for training and one for validation. This method provides a thorough evaluation of the model's effectiveness across various data segments. For imbalanced datasets, employing stratified cross-validation ensures that each class is adequately represented in both training and validation sets, enhancing the reliability of performance metrics (Berrar et al., 2012).
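A simplified, plain-Python sketch of stratified k-fold splitting: distributing the indices of each class round-robin over the folds keeps the class ratios roughly constant in every fold (in practice one would typically reach for scikit-learn's StratifiedKFold instead).

```python
from collections import defaultdict

def stratified_kfold_indices(labels, k):
    """Yield (train_idx, val_idx) pairs with each class spread evenly over k folds."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for i, idx in enumerate(idxs):        # round-robin preserves class ratios
            folds[i % k].append(idx)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val

labels = [0] * 6 + [1] * 3                    # imbalanced toy labels
for train, val in stratified_kfold_indices(labels, 3):
    print(len(train), len(val))               # 6 3 on every fold
```

Each of the three validation folds here receives two samples of the majority class and one of the minority class, which is exactly the property that makes stratification useful on imbalanced intake labels.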

Regularization techniques are essential in Feedforward Neural Network (FFNN) models to prevent overfitting and enhance generalization. By adding a penalty term to the loss function, these techniques discourage excessive complexity in the model's weights and biases. A widely used method is dropout, which randomly sets a portion of neuron activations to zero during training. This stochastic regularization reduces interdependencies between neurons and prevents reliance on specific features, ultimately leading to improved generalization performance and reduced overfitting in FFNN models.
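Inverted dropout, the variant most deep learning libraries implement, can be sketched in NumPy; the activations and rate below are toy values chosen for illustration:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction of units and rescale the survivors so
    the expected activation is unchanged (applied only during training)."""
    keep = 1.0 - rate
    mask = rng.random(activations.shape) < keep
    return activations * mask / keep

rng = np.random.default_rng(7)
h = np.ones((4, 10))                  # toy hidden-layer activations
h_train = dropout(h, rate=0.5, rng=rng)
print((h_train == 0).mean())          # roughly half the units are silenced
```

Rescaling by `1 / keep` at training time means no change is needed at inference time, when dropout is simply switched off.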

When developing Feedforward Neural Network (FFNN) models to predict student academic performance or dropout rates in universities, it is crucial to focus on model architecture and hyperparameters. Key hyperparameters include cross-validation and epochs, with cross-validation being essential for assessing model performance by dividing the dataset into training and testing sets. This process offers insights into the model's ability to generalize effectively to new, unseen data.

In the context of training a Feedforward Neural Network (FFNN), an epoch represents a complete pass through the entire training dataset. While increasing the number of epochs can enhance the model's ability to learn intricate patterns, it simultaneously heightens the risk of overfitting.

Evaluation and Performance Metrics

Figure 11 Example of Train-Test Loss Learning Curve (Source: Machine Learning Mastery)

The learning curve demonstrates a reduction in training loss alongside a stable or declining validation loss, signifying that the model successfully learns from the data while avoiding overfitting or underfitting. This equilibrium between training and validation loss highlights an effective balance between model complexity and generalization.

Evaluating the learning curve is essential for optimizing model performance, as it informs decisions on model architecture adjustments, the implementation of regularization techniques, and the necessity for additional training data. This analysis helps prevent issues like overfitting and underfitting, ultimately leading to a more effective machine learning model.

Accuracy, precision, recall, and F1-score are commonly employed metrics in multiclass classification tasks.

Equation 2 Function of employed metrics in classification

Accuracy measures the proportion of correct predictions made by the model, providing an overall assessment of its performance.

Precision represents the ratio of true positives to the total predicted positives, while recall measures the ratio of true positives to the total actual positives.

The F1-score is a harmonic mean of precision and recall, offering a balanced evaluation metric.

Additionally, the confusion matrix provides a detailed breakdown of the model's predictions, allowing for the examination of false positives, false negatives, true positives, and true negatives.
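All four metrics above can be derived directly from the confusion matrix. A small NumPy sketch with invented labels (not the thesis's results):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)

accuracy = np.trace(cm) / cm.sum()                  # correct / total
precision = np.diag(cm) / cm.sum(axis=0)            # per class: TP / predicted positives
recall = np.diag(cm) / cm.sum(axis=1)               # per class: TP / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(accuracy)                                     # 4 correct out of 6
```

The diagonal of `cm` holds the true positives per class; everything off the diagonal is either a false positive (read down a column) or a false negative (read across a row).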

Figure 12 Visualization of multiclass ROC, retrieved from the Keras library

ROC-AUC, or Receiver Operating Characteristic - Area Under the Curve, is a key metric for evaluating model performance, as it quantifies the balance between the true positive rate and the false positive rate.
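For the multiclass case, a common extension is macro-averaged one-vs-rest AUC. A sketch from the ranking definition of AUC (the probability matrix below is invented for illustration):

```python
import numpy as np

def binary_auc(y_true, scores):
    """AUC = probability a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

def macro_ovr_auc(y_true, prob_matrix):
    """Macro one-vs-rest AUC: average the binary AUC of each class vs the rest."""
    n_classes = prob_matrix.shape[1]
    aucs = [binary_auc([1 if y == c else 0 for y in y_true], prob_matrix[:, c])
            for c in range(n_classes)]
    return float(np.mean(aucs))

probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.6, 0.3, 0.1]])
y = [0, 1, 2, 0]
print(macro_ovr_auc(y, probs))   # 1.0: every class is ranked perfectly here
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of each class from the rest, which is what makes the metric robust to class imbalance.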

In conclusion, employing regularization techniques, cross-validation, meticulous hyperparameter selection, and suitable evaluation metrics is essential for effectively developing Feedforward Neural Network (FFNN) models aimed at predicting student academic performance and dropout rates. These methodologies enhance model performance, mitigate overfitting, and yield valuable insights into the models' predictive abilities.

APPLYING NEURAL NETWORKS FOR PREDICTING TIME TO

Dataset description

In this thesis, the research focuses on a problem within the International School - Vietnam National University, Hanoi (VNUIS). VNUIS mandates that students obtain a B2 English certificate as part of their curriculum requirements. Those who fail to secure the certificate must participate in a pre-English course offered by the university to improve their skills in preparation for the certification exam.

This research focuses on analyzing data regarding students' English proficiency and pre-course information to assess the impact of the English course and implement necessary improvements. The dataset utilized consists of 1,429 records, providing a comprehensive basis for evaluation.

The dataset consists of 44 columns that encompass a range of student demographic and academic performance indicators, including gender, age, prior academic achievements, and university entrance records. This extensive collection of student information provides valuable insights into various aspects of student performance and characteristics, serving as a solid foundation for data analysis and mining tasks in the research.

No.  Data feature                  Feature description                                                Data type
1    Date of birth                 (dd/mm/yyyy)
7    Certificate exam date         (dd/mm/yyyy)
8    Certificate date              (dd/mm/yyyy)
11   Score of enroll certificate                                                                      Numerical
     Category: intake level, class arrangement and final score of intakes
32   Group subject code            Code of subject group for university enrollment exam
33   VNU-IS order                  International School order in the student's university wish list
34   Enroll score                  Combination of the 3 subject scores                                Numerical
35   Subject 1                     First subject in subject group                                     Categorical
37   Subject 2                     Second subject in subject group                                    Categorical
39   Subject 3                     Third subject in subject group                                     Categorical
     Student admission method: direct entry (IELTS, special regulation), university exam

This study addresses a multiclass classification problem with five distinct classes, specifically the arrangement of students into classes based on their academic performance. The classification relies on three marks, starting from the foundation level and progressing through levels 1, 2, and 3. The university's structured framework is divided into five study periods, referred to as "intakes", with 'Số intake' serving as the target variable.

Preprocessing Steps Applied to the Dataset

The dataset utilized in the experiment was meticulously preprocessed using the Pandas library, a widely recognized open-source tool for data analysis. These preprocessing steps were designed to improve data quality, address missing values, and convert the dataset into an appropriate format for training the Feedforward Neural Network (FFNN) model.

In order to prepare the dataset for the subsequent classification task, several preprocessing steps were applied using the open-source Pandas library in Python. The following steps were undertaken:

Not all students participate in the pre-English course, leading to incomplete data that may hinder accurate predictions. Therefore, we will eliminate this irrelevant data early in the process to minimize the need for extensive data cleaning later on.

- Students no longer at the school: 73

- Students who already held a certificate at enrollment: 113

- Students still without a certificate at the time of the study: 92

The 'Số Intake' column, representing the target variable for the intake period, was labeled according to the time students obtained their English certification. This labeling process enhanced the classification task by offering significant class labels for accurate predictions.

- This labeling process involves the data columns from the pre-English course information category (columns 17 to 31) and the certificate information category

The dataset indicates that this student participated solely in the first intake of Level 1-14 and submitted her certificate in December 2020, coinciding with the conclusion of intake 1. Consequently, she only required one intake to obtain her certificate, resulting in her label being "1." Notably, prior to October 2020, she was not required to attend any additional PE course classes. The intake timeline used for lookup is shown in the intake timeline table.

After labeling, a new column titled "Intake" is introduced, enhancing the analysis process by providing additional data for evaluation. This column facilitates the formulation of a classification problem, where the goal is to predict the time required for students to obtain their certificates based on their initial demographic information, high school performance, enrollment, and placement test results. This prediction is crucial as students face pressure to achieve their certificates in order to enroll in the curriculum course.

After labeling, we identified five distinct class labels. Notably, the intake values for classes 2 and 3 are significantly lower than those for the other classes, indicating the need for balancing techniques to address this disparity.

Table 2 Number of intake classes after labeling

After that, the row-deletion process continues:

- Students who achieved the B2 certificate before November 2020: 30

- Total number of rows after data selection: 1,104

Figure 14 Visualization of missing rows in dataset

The dataset was examined for missing values in the columns. Missing data can adversely affect the performance of machine learning models, and it is crucial to handle them appropriately.

The primary issue stems from insufficient enrollment data, which consists of 236 rows. To address this problem, there are two potential strategies: either remove the rows containing NaN values or apply mean/mode imputation to fill in the missing data.

The first approach: since this is important data that mixes categorical and numerical fields and cannot be filled in a meaningful way, these 236 rows will be removed.

The second approach involves imputing missing values in the dataset using appropriate strategies, such as mean imputation for numerical data and mode imputation for categorical data. However, this method may lead to skewed data, favoring variables that are more prevalent in the columns.
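The two strategies can be sketched with Pandas on a toy frame; the column names and values below are hypothetical stand-ins, not the real enrollment data:

```python
import pandas as pd

# Toy stand-in for the enrollment columns (NOT the real dataset)
df = pd.DataFrame({
    "enroll_score": [24.5, None, 21.0, 26.0],
    "subject_2":    ["Physics", "Chemistry", None, "Physics"],
})

# Approach 1: drop every row that has any missing value
dropped = df.dropna()

# Approach 2: mean imputation for numeric, mode imputation for categorical
imputed = df.copy()
imputed["enroll_score"] = imputed["enroll_score"].fillna(df["enroll_score"].mean())
imputed["subject_2"] = imputed["subject_2"].fillna(df["subject_2"].mode()[0])

print(len(dropped), imputed.isna().sum().sum())   # 2 0
```

The trade-off visible even in this toy case is the one discussed above: dropping loses rows, while mode imputation pushes the categorical column further toward its already most frequent value.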

Figure 15 Frequency of the 2nd subject after mode imputation

To prevent bias in the dataset, imputation should be applied with caution; for model training and preprocessing, we will initially adopt the first approach and later evaluate the impact of the second approach on the model.

Figure 16 Correlation between numerical features and target outcome

The entrance test score for the PE course shows a weak correlation with the intake 1 score, while the third exam score, which evaluates English proficiency, demonstrates a stronger correlation with the first exam score in Mathematics and the overall GPA in grade 12. This indicates a potential link between the TADB entrance test score and other academic performance metrics, warranting further analysis to uncover the factors influencing these correlations.
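Such a correlation check is a one-liner in Pandas; the columns below are hypothetical stand-ins for the numeric score features, not the thesis data:

```python
import pandas as pd

# Hypothetical numeric columns: two entrance scores and the intake target
df = pd.DataFrame({
    "listening": [12, 18, 22, 25, 28],
    "reading":   [10, 17, 20, 26, 30],
    "intake":    [ 4,  3,  3,  2,  1],
})

corr = df.corr(method="pearson")   # pairwise Pearson correlation matrix
print(corr["intake"].round(2))     # correlation of each feature with the target
```

In this invented example the two test scores correlate strongly with each other and negatively with the number of intakes needed, which is the kind of pattern the feature-selection step in 3.2.4 looks for.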

3.2.4 Perform Encoding and Data Scaling for Training Variables

Based on the observed correlations between the features and the target variable, 11 columns were selected for preprocessing and model training:

The dataset includes eight numerical columns that are essential for evaluating student performance. These consist of scores from various assessments, including the "Entrance Exam Scores for Listening (30)", "Entrance Exam Scores for Reading (35)", and "Entrance Exam Scores for Writing (30)", along with the "Final Grade for Intake 1". It also contains the individual subject scores "Subject 1 Score", "Subject 2 Score", and "Subject 3 Score", as well as the "Average Score from Grade 12". This structured data is crucial for understanding academic achievements and trends among students.

- 4 categorical columns: cat_fs = ['Giới tính', 'Level Intake 1', 'Level Intake 2', 'Vùng miền']

The selection draws on the PE test scores and only two items of personal data: gender and birthplace.

One-hot encoding (OHE) is a technique used for categorical variables that lack an ordinal relationship, allowing each category to be treated as an individual entity. The method generates one binary column per class, indicating the presence or absence of that class. For instance, if the original categories are "cat," "dog," and "elephant," one-hot encoding produces three distinct columns with binary values, such as [1, 0, 0] for "cat," [0, 1, 0] for "dog," and [0, 0, 1] for "elephant."

OHE is commonly used when working with models that cannot directly handle categorical variables, such as linear models, support vector machines, and neural networks

The dataset's categorical variables, including 'Gender', 'Level Intake 1', and 'Region', were transformed into numerical data through one-hot encoding. This process expanded each encoded variable into separate columns for analysis.
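A minimal sketch of this encoding with pandas, using two of the categorical columns (the row values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "Giới tính": ["Nam", "Nữ", "Nam"],
    "Vùng miền": ["Hà Nội", "Đông Bắc", "Hà Nội"],
})

# One binary column per category value, e.g. "Giới tính_Nam"
encoded = pd.get_dummies(df, columns=["Giới tính", "Vùng miền"])
```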

Gender comprises Nam and Nữ. The intake levels are split into groups such as Level Intake 1_Foundation, Level Intake 1_Level 1, Level Intake 1_Level 2, and Level Intake 1_Level 3, while Level Intake 2 yields the categories F-Redo, Không xếp, Level 1, and Level 2. For region, the resulting columns are Hà Nội, Nam Bộ, Nước ngoài, Trung Bộ, Tây Bắc Bộ, ĐB sông Hồng, and Đông Bắc.

After encoding, the feature set expands to 21 columns, which are used for training.

Scaling input features is essential for enhancing the convergence of optimization algorithms during model training. When features vary significantly in scale, those with larger values can disproportionately influence gradient updates, potentially resulting in slower convergence or causing the model to become trapped in local minima. Scaling addresses this issue by bringing features to a similar range, which also helps prevent numerical instability in activation functions like sigmoid or softmax. In our dataset, numerical features such as 'Điểm thi xếp lớp TADB Nghe_30', 'Điểm tổng kết Intake 1', and 'Điểm trúng tuyển' were standardized with StandardScaler so that each feature has zero mean and unit variance. This ensures that all features share a consistent scale, eliminating any dominance based on magnitude during training.
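As Figure 17 shows, the scaling step uses scikit-learn's StandardScaler, which standardizes each column to zero mean and unit variance; a sketch on two hypothetical score columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical score columns on very different scales
X = np.array([[20.0, 7.5],
              [25.0, 8.0],
              [30.0, 6.5]])

scaler = StandardScaler()           # zero mean, unit variance per column
X_scaled = scaler.fit_transform(X)
```

If a strict [0, 1] range were required instead, MinMaxScaler would be the appropriate transformer.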

Figure 17 Numerical training set after implementing StandardScaler

Experimental Setup

The project utilized popular open-source libraries, specifically Keras and TensorFlow, for developing and assessing predictive models. Keras, a high-level neural network API in Python, offers an intuitive interface for constructing and training deep learning models. TensorFlow is an open-source machine learning framework that delivers a robust suite of tools for the efficient implementation and deployment of deep learning models.

The experimental design consisted of two key stages: first, executing the base model, and second, performing hyperparameter tuning to enhance the performance of the Feedforward Neural Network (FFNN) in predicting student academic outcomes.

The preprocessed dataset was split into training and testing sets using a standard train-test split. The training set was used to fit the FFNN model, while the testing set was employed to assess the model's performance on new, unseen data. Following the split, the final sizes of the training and testing sets were established for model training.
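A sketch of such a stratified train/validation/test split with scikit-learn; the array sizes and split ratios below are assumptions for illustration, not the study's actual numbers:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix and 5-class target ("Số Intake")
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 21))
y = rng.integers(1, 6, size=500)

# First split off the test set, then carve a validation set out of the
# remainder; stratify keeps class proportions similar in every split.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```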

Training set Validation set Test set Total amount of data

Table 3 Numbers of train - validation - test set after splitting

After the train-test split, the final sizes of the sets used for model training are:

Figure 18 Number of stratified samples after the train-test split


Class imbalance refers to a situation where the distribution of classes in the target variable is skewed, with one or more classes being significantly underrepresented

The target variable "Số Intake" consists of five classes that indicate different intake levels, and class imbalance can negatively affect model performance, particularly for minority classes. To mitigate this issue, the Synthetic Minority Over-sampling Technique (SMOTE) was utilized, which creates synthetic samples for minority classes through interpolation of existing data. By balancing the class distribution, SMOTE enhances the model's predictive accuracy across all classes.
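The interpolation idea behind SMOTE can be sketched in a few lines of numpy. The helper below, `smote_like_oversample`, is a simplified illustration of how synthetic minority samples are formed between a point and one of its nearest neighbours; it is not the API of the SMOTE library used in the study:

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a
    point and one of its k nearest neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        d = np.linalg.norm(X_minority - x, axis=1)   # distances to all points
        neighbours = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                           # interpolation factor
        synthetic.append(x + gap * (X_minority[j] - x))
    return np.array(synthetic)

# Three minority-class points; generate four synthetic ones between them
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
new_points = smote_like_oversample(minority, n_new=4)
```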

Figure 19 Class Distribution before and after SMOTE

The preprocessing steps effectively addressed missing data and alleviated class imbalance through the use of SMOTE, ensuring the dataset was well-prepared for model training and evaluation This preparation facilitated more robust and accurate predictions of student academic performance.

3.3.2.3 Implementing the Base Model & Feature Selection

The base model of the Feedforward Neural Network (FFNN) was built using default hyperparameters, establishing a benchmark for performance evaluation and a foundation for future enhancements. The architecture of the baseline model is a Sequential network with two hidden layers, each containing 64 neurons and using the ReLU activation function, followed by an output layer that applies softmax activation for multi-class classification. The model is compiled with categorical crossentropy as the loss function, the Adam optimizer, and accuracy as a performance metric.

This model features two hidden layers, each containing 64 neurons and utilizing the ReLU activation function. The output layer employs the softmax activation function, while the Adam optimizer is used for training. The results of the model's execution are detailed in the accompanying table.
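The forward computation of this baseline architecture can be sketched directly in numpy (random weights and a random mini-batch for illustration; the actual model is built and trained in Keras):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=1, keepdims=True)

# 21 input features -> 64 -> 64 -> 5 classes, matching the base model
W1, b1 = rng.normal(scale=0.1, size=(21, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 64)), np.zeros(64)
W3, b3 = rng.normal(scale=0.1, size=(64, 5)), np.zeros(5)

X = rng.normal(size=(8, 21))                      # a hypothetical mini-batch
probs = softmax(relu(relu(X @ W1 + b1) @ W2 + b2) @ W3 + b3)
```

Each row of `probs` is a probability distribution over the five intake classes, which is what the softmax output layer produces.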

Eleven different models were examined with two types of outcome: the original 5 classes, and 3 classes describing the student's learning speed: slow (4 and 5 intakes), medium (2 and 3 intakes), and fast (1 intake).

5 learning speeds   3 learning speeds   Validation accuracy

Table 4 Examining different datasets into the FFNN model

Testing whether the birthplace feature has any impact on the model:

Figure 20 Learning curve of the base model without birthplace

The baseline model results indicate that excluding the variable 'Vung_mien' leads to a higher loss in the validation set, highlighting the significant impact of birthplace on the final outcome.

The final result obtained from the best features is as below:

Figure 21 Learning curve of the best base model

By the 10th epoch, the validation set demonstrates satisfactory accuracy; however, the loss shows a concerning increase of approximately 0.5, indicating the onset of overfitting. To address this issue, the model requires retraining with a cleaner dataset, the incorporation of regularization techniques to curb the rising loss, and fine-tuning with various parameters.

The FFNN model was trained on labeled data from the training set, utilizing student demographic and academic performance indicators as input features, while the target variable was the "Số Intake" column indicating the student's intake level. Backpropagation was employed to optimize the network by adjusting weights to reduce the discrepancy between predicted and actual intake levels. Following the training phase, the model's performance was assessed on the testing set using several evaluation metrics, including accuracy, precision, recall, and F1 score, to determine its effectiveness in predicting student intake levels.

To improve the performance of the Feedforward Neural Network (FFNN) model, hyperparameter tuning was performed by systematically adjusting key parameters, including the number of hidden layers, the number of neurons per layer, and the learning rate. This process aimed to identify the configuration that best enhances the model's predictive capabilities.

The initial dataset consisted of 89 input variables, which introduced potential noise; we therefore streamlined the data by grouping the Birthplace variable, reducing it to 29 columns. Despite this reduction, overfitting persisted, even when employing a smaller k value of 3 for cross-validation and limiting training to 20 to 30 epochs.

Figure 22 Trying with different epochs

Changing the number of epochs did not help, so the next step was to adjust the architecture of the model.

Figure 23 Differences between RMSprop and Adam optimizer

Figure 24 Learning curve of applying RMSProp optimizer

Before modifying the architecture, we changed the optimizer to evaluate its impact on gradient descent performance. Although the final results were comparable, the learning curve indicated improved convergence and stability between epochs 40 and 50. Consequently, we opted to use RMSprop for this problem.
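RMSprop's update rule, which divides each gradient by a running average of its recent magnitudes so that every weight gets its own adaptive step size, can be sketched as follows (the learning rate and decay values are common Keras defaults, used here for illustration):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update: scale the step by a running average of
    squared gradients, giving each weight an adaptive learning rate."""
    cache = rho * cache + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)
grad = np.array([0.5, -0.5])
w, cache = rmsprop_step(w, grad, cache)
```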

A widely accepted rule of thumb for choosing the number of hidden neurons in a neural network is to select a value between the number of input neurons and the number of output neurons. In this scenario, with 29 input nodes and 5 output nodes, a choice of roughly 10 to 20 hidden neurons would be advisable.

We examined seven configurations:

hidden_layer_configs = [
    [5],        # single hidden layer with 5 neurons (starting point)
    [10],       # single hidden layer with 10 neurons
    [15],       # single hidden layer with 15 neurons
    [5, 5],     # two hidden layers, each with 5 neurons
    [5, 10],    # two hidden layers: 5 neurons in the first, 10 in the second
    [10, 10],   # two hidden layers, each with 10 neurons
    [5, 5, 5],  # three hidden layers, each with 5 neurons
]
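One way to see why these small configurations may underperform is to count trainable parameters. The helper below (illustrative, not part of the study's code) compares a small configuration against the 2x64 baseline for 29 inputs and 5 outputs:

```python
hidden_layer_configs = [
    [5], [10], [15], [5, 5], [5, 10], [10, 10], [5, 5, 5],
]

def n_parameters(config, n_in=29, n_out=5):
    """Total weights + biases of a dense network with the given hidden sizes."""
    sizes = [n_in] + list(config) + [n_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

small = n_parameters([5])        # smallest candidate
baseline = n_parameters([64, 64])  # the 2x64 base model
```

The baseline has over thirty times as many parameters as the smallest candidate, which plausibly explains why the reduced architectures struggled with this classification task.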

However, the results were not as expected; performance worsened:

Figure 25 Examining different hidden layers

The configuration featuring two hidden layers, each with 64 neurons, demonstrates superior accuracy and lower loss compared to models with fewer neurons. This may be attributed to the complexity of the classification task: the larger hidden layers enable the network to capture more intricate patterns in the data.

A useful finding here is that the model performs better with three hidden layers than with two; this will be examined later.

Results and Analysis

The evaluation of the classification model through the confusion matrix, classification report, and ROC analysis yields important insights into its performance and effectiveness in accurately classifying target classes.

The performance of the classification model was evaluated using a confusion matrix, which summarizes the classification results for each class. The confusion matrix is presented in Table 5.

1 Intake 2 Intake 3 Intake 4 Intake 5 Intake

Table 5 Confusion matrix showing the classification results for each class

Each row represents the true class, while each column represents the predicted class

The confusion matrix analysis revealed that the model demonstrated differing accuracy levels across classes. Notably, Class 0 and Class 4 showed high precision and recall, reflecting the model's effectiveness in correctly identifying instances within these categories. Conversely, the model struggled with Class 1, as evidenced by its lower precision, recall, and F1-score, highlighting the need for improvement in accurately classifying this class.

The classification report emphasizes the performance metrics of the model for each class, showcasing precision as the ratio of accurately classified instances and recall as the model's capability to detect all instances within a class.

The classification report provides detailed metrics for evaluating the model's performance. The report, shown in Table 6, includes precision, recall, F1-score, and support for each class.

          Precision  Recall  F1-score  Support
Class 0   0.70       0.88    0.78      24
Class 1   0.25       0.17    0.20      6
Class 2   0.42       0.56    0.48      9
Class 3   0.57       0.36    0.44      22
Class 4   0.72       0.75    0.74      28

Table 6 Classification report displaying the precision, recall, F1-score, and support for each class

Precision measures the percentage of accurately classified instances within a specific class, whereas recall assesses the model's capability to detect all instances of that class. The F1-score integrates precision and recall into one comprehensive metric, and support is the total number of instances of each class.
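These metrics follow directly from the counts of true positives, false positives, and false negatives. As a check, the illustrative counts below reproduce the Class 0 row of Table 6 (precision 0.70, recall 0.88, F1 0.78, support 24):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts consistent with Class 0: support = tp + fn = 24
p, r, f1 = prf(tp=21, fp=9, fn=3)
```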

It is evident that the model achieved higher precision and recall in Class 0 and Class 4, while struggling to achieve satisfactory results in Class 1

The ROC analysis was conducted to evaluate the model's discrimination performance across classes. The area under the ROC curve (AUC) was calculated for each class, as shown below:

Figure 26 The ROC curve illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for each class

A higher AUC indicates better discrimination performance, with an AUC of 1 representing a perfect classifier
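For a multiclass model, the AUC is computed per class in a one-vs-rest fashion; a minimal sketch with scikit-learn on hypothetical labels and scores for a single class:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical one-vs-rest labels and predicted probabilities for one class
y_true = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1])

# AUC equals the probability that a random positive outscores
# a random negative; here 8 of the 9 positive/negative pairs are ordered
# correctly, giving 8/9.
auc = roc_auc_score(y_true, scores)
```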

The ROC analysis provided a thorough evaluation of the model's discrimination ability among the classes, revealing that Class 0 attained the highest area under the ROC curve (AUC) at 0.87, while Class 4 followed with an AUC of 0.81. The micro-average ROC curve area, indicative of the model's overall performance, was 0.77.

These results indicate the model's ability to distinguish between positive and negative instances across multiple classes, with higher AUC values indicating better discrimination performance

The study underscores the need for continued refinement of the model to boost its classification accuracy for Class 1 instances Insights gained from the classification report and ROC analysis will guide future enhancements, specifically targeting the challenges encountered in various classes.

The findings enhance our understanding of the model's strengths and weaknesses, offering crucial insights for informed decision-making and practical use Future research should investigate alternative modeling methods, advanced feature engineering, and data augmentation strategies to overcome the identified limitations and boost classification performance.

This study involved extensive research and analysis aimed at predicting student performance and classifying the time required for students to achieve B2 certification at the International School – Vietnam National University, Hanoi. Key stages of the work included examining the university's training model and dataset, selecting an appropriate machine learning model, specifically the Feedforward Neural Network (FFNN), and applying it to the identified problem.

We effectively utilized the FFNN model to predict student performance, achieving notable success in multiclass classification of academic achievements Despite facing challenges such as a limited dataset and overfitting, we implemented thorough data preprocessing, focused on feature selection, and applied regularization techniques to improve the model's accuracy and overall performance.

Our research reveals key factors affecting student success and academic performance, particularly a strong correlation between placement test scores and the time taken to achieve a B2 certificate. Higher initial test scores are linked to improved outcomes. Additionally, students from Hanoi and the Red River Delta provinces consistently demonstrate positive results, underscoring the value of targeted support for these regions to harness their potential for academic excellence.

Based on our project's findings and the implications they carry, we propose the following recommendations for educational interventions and practices:

Educational institutions should prioritize placement test scores as a crucial metric when creating support programs By focusing on targeted initiatives for students with lower initial test scores, these institutions can effectively help enhance academic performance and foster student success.

To enhance student recruitment at the university level, it is essential to focus on students from Hanoi and the Red River Delta provinces, given their consistently positive academic performance. Implementing targeted strategies such as mentorship programs, academic scholarships, and outreach activities will help identify and support these students, ensuring equitable educational opportunities in the region.

Our study has notable limitations, particularly regarding the size of the dataset, which poses challenges related to overfitting To improve the accuracy and generalizability of predictive models, future research should focus on gathering a larger and more diverse dataset that includes a wider variety of student demographics and academic backgrounds.

In summary, our research enhances the understanding of how to predict student performance and the duration required to attain B2 certification. By examining the relationship between placement test scores and student outcomes, as well as the strong performance of students from Hanoi and the Red River Delta provinces, educators and policymakers can tailor their strategies to better support students and promote equitable educational opportunities.
