
NAME: TRAN NGUYEN ANH THOAI


First, I would like to thank my loved ones, especially my parents, because they raised me and have given me the best things in this life, both material and moral.

Second, I want to thank FPT University and the University of Greenwich. Thanks to the schools and the teachers who created the classes and subjects for the students, and thank you for the knowledge the teachers have passed on.

Thank you to Mr. Le Minh Nhat Trieu, who accompanied me during the past school year in the Top-Up semester. He devoted his time and energy to supporting me as much as possible.

Finally, I would like to thank the teachers and friends who have accompanied me during the past four school years. Thank you for your valuable knowledge and your enthusiasm to help.


Nowadays, the development of IT has changed our lives greatly, especially data mining and machine learning, which have been applied in every field of our lives, from face and voice recognition to natural language processing. Natural language processing in particular is used in many areas of today's life; for example, some places use robots capable of communicating in place of humans, and a typical example is the explosion of anti-epidemic robots in the COVID-19 age. In the field of mass media, newspapers and news production are getting more and more attention from the public. Accompanying that is a great deal of work: it is a waste of time to sit and read the title of every article in order to classify it. To address this weakness of the media industry, this essay was written to solve the problem of sorting articles by topic; because time is limited, the topics only revolve around world news, sports, life, law, and health.

Usually, articles are stored as natural language, i.e. unstructured data. The easiest way to classify these articles is in a vector space. However, in order to vectorize the information, we need to process the data first. The specific tasks to be done are word segmentation, removing accents from sentences, and eliminating stop words. In this topic, to separate words I use a segmentation tool, then construct vectors based on the BoW and Word2Vec methods. I then use a Jupyter notebook to show the results obtained in news classification using ML methods.


1 Introduction

In digital life, all the information and knowledge that we know or do not know is on the internet. This solves many human problems, such as storing documents without paper or pen, long storage times, and convenient searching. However, in the face of this huge amount of information, proper categorization is an important concern. In fact, this job is usually done manually and takes a lot of time and effort, so automatic classification is very necessary.

Seeing this need, I decided to explore the steps for conducting information classification using ML. The news classification method uses a data set of news taken from online Vietnamese news sites. From there we proceed to build and apply the classification methods. This is a research project and also the subject of my graduation thesis.

The purpose of this essay is to find out how I use machine learning to categorize Vietnamese news and how I do it.

1.1 Thesis layout


CHAPTER 2 MACHINE LEARNING AND NATURAL LANGUAGE PROCESSING IN GENERAL

2.1 Overview of the background of machine learning

In today's industrialized life, the amount of data is increasing in both quality and quantity, but only a small part of that huge chunk of data has value. The desire to find and exploit information and value from that block of data has opened a new wing for the information technology industry: Knowledge Discovery from Data.

Steps for data mining include:

+ Identify the request and the associated data space (problem understanding and data understanding)

+ Data preparation, including data cleaning, data integration, data selection, and data transformation

+ Data mining, including identifying the target of the data to be exploited and the exploitation technique; the result will be an image or text source

+ Evaluation, based on the chosen criteria, to filter the source of the obtained data

+ Deployment

The data mining process is repeated many times. Data extraction is the process of extracting data from a data set; it requires knowledge from many fields such as IT, AI, databases, and mathematics.

Mining methods include:

+ Classification: a technique that allows the classification of an object into one or more certain classes

+ Regression: maps a data sample to a predictive variable with a real value

+ Clustering: a "cluster" is a group of data objects, and similar objects are located in the same cluster; the result is that similar objects end up in the same group


2.2 What is machine learning?

With the explosion of big data, and with classical algorithms no longer performing well, the emergence of machine learning was inevitable, and it has opened a new chapter for the IT industry.

Machine learning is a field of artificial intelligence involved in the research and construction of techniques that allow systems to "learn" automatically from data in order to solve specific problems.

Machine learning is strongly related to statistics, as both fields study data analysis; but unlike statistics, machine learning focuses on the complexity of algorithms when performing computation. Many inference problems are classified as NP-hard, so part of machine learning is the study of approximate inference algorithms that can be handled tractably.

Currently, thanks to the development of hardware, many new algorithms such as Deep Learning and Reinforcement Learning have been produced and improved. But all of them rest on ML; it is the core of today's advanced algorithms.

The outstanding feature of document classification is the variety of topics: the number of topics and texts is unlimited. For example, take some popular topics in Vietnamese news such as law, health, life, education, and economics.

Machine learning is widely used today, including in search engines, medical diagnostics, detection of credit card fraud, stock market analysis, DNA sequence classification, speech and handwriting recognition, automatic translation, game playing, and robot locomotion [1].


2.3 General Machine Learning Structure


There are two phases that need to be addressed in machine learning: the training phase and the testing phase.

Training phase.

+ The components that need to be explained in the training phase are the features extractor and the main algorithms

+ Raw input data is all the information we know about the data: for example, the value of each pixel for an image, every word and sentence for a text, or the signal for an audio file. This data is called raw data and is in an unstructured form. In order for the machine to understand and learn from this raw data, we need to convert it into vectors

+ Prior knowledge about the data: knowledge about the data type also helps. For example, for a text classification problem we need knowledge related to that problem; since the topic of this essay is Vietnamese text analysis, we need stop-word files or vocabulary sets related to Vietnamese text

+ Main algorithms: after extracting features from the data set, these extracted features are fed to training algorithms such as classification, clustering, and so on (see the sketch after this list)
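To make the two phases concrete, here is a minimal sketch with scikit-learn (an assumption of this illustration, not a tool mandated by the thesis); the tiny texts and labels are made-up placeholders. A feature extractor turns raw text into vectors, then a main algorithm is trained on them and used to predict:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Training phase: features extractor + main algorithm.
train_texts = ["trận đấu bóng đá", "bản tin thể thao", "phiên tòa xét xử", "luật mới ban hành"]
train_labels = ["the_thao", "the_thao", "phap_luat", "phap_luat"]

extractor = CountVectorizer()                           # features extractor
X_train = extractor.fit_transform(train_texts)          # raw text -> vectors
clf = LogisticRegression().fit(X_train, train_labels)   # main algorithm

# Testing phase: transform unseen text with the same extractor, then predict.
X_test = extractor.transform(["bóng đá hôm nay"])
print(clf.predict(X_test))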

2.4 Natural Language Processing With Machine Learning

Textual data mining deals with the large amount of human knowledge used in communication or stored in documents. The text data set is therefore extremely large, and the amount of accumulated human knowledge constantly increases over time. However, with such a rapidly growing number of documents, people can no longer control, evaluate, and classify them manually as usual, not to mention that the amount of uncontrolled data causes serious consequences for security in general and human life in particular. Because of this urgency, the use of ML in NLP helps people save energy when selecting, classifying, and filtering the clean information that needs attention. The text classification problem plays a very important role in handling big data today, so in this essay we will dig deeply into the methods that ML applies to NLP to see the power of ML in the current 4.0 era.

Text data mining is the process of extracting data from articles or documents in the form of text. This is a multi-disciplinary problem involving information retrieval, text analysis, information extraction, clustering, categorization, and more. In the next part we will present in depth the categorization problem that is the topic of this thesis.

Natural language processing is also an area of this field, but its input data is text stored in an unstructured form. The raw input data is text scraped from articles or from data files saved in an unstructured format. I mentioned this in the overview of machine learning section: we need knowledge related to the problem we are trying to solve. Some methods for extracting features from text in natural language processing are Bag of Words, TF-IDF, and Word Embedding.

News classification is one form of the text classification problem, a classic problem in text processing. According to Yang & Liu (1999) [2], "automatic text classification is the assignment of classification labels to a new document based on the degree of similarity of that text to the labeled texts in the training set."

Around the world there are many research projects that have achieved positive results, for example Support Vector Machines (Joachims, 1998), k-Nearest Neighbors (Yang, 1994), Linear Least Squares Fit (Yang and Chute, 1994), Neural Networks (Wiener et al., 1995), Naive Bayes (Baker and McCallum, 2000), and Centroid-based classification (Shankar and Karypis, 1998). These methods are based on statistical probability or on word-weight information in the text. But Vietnamese is harder: a lot of the time a token is interpreted as a word, although this is not entirely correct. In English, for example, words are usually separated by spaces, but "New York" is still considered a single word even though it has a space in the middle, so splitting on spaces does not give the right words. Another example is "I'm", which contains the two words "I" and "am" even though there is no space between them. There are still many limitations due to the difficulty of separating words and sentences.
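As a small illustration of Vietnamese word segmentation, here is a hedged sketch assuming the pyvi package (one of several available segmentation tools; the thesis does not name a specific one here). Multi-syllable words come back joined with underscores:

from pyvi import ViTokenizer

sentence = "Thủ tướng đức nhận lời tham dự lễ kỷ niệm"
# Multi-syllable Vietnamese words are joined with underscores.
print(ViTokenizer.tokenize(sentence))
# e.g. "Thủ_tướng đức nhận_lời tham_dự lễ kỷ_niệm"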

2.5 Machine Learning and Natural Language Processing in Practice

Because of the increasing needs of life, society is becoming more and more developed, and language processing methods are increasingly being applied in daily life.

For example, in the business sector, Facebook uses NLP to keep track of trending topics and popular hashtags. News can be managed with fake news detection and spam detection; spam detection technologies use NLP's text classification capabilities to scan emails and identify them as spam or phishing. NLP can also create chatbots for replying to customers, generate text to create new documents, and translate text into other languages. Many more tasks in life can be solved with natural language processing and machine learning.

CHAPTER 3 TEXT CLASSIFICATION IN NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

3.1 Background

Because computers and the Internet are widely used, an extremely huge amount of information is produced every day. Nowadays, information already fills our lives, and most of it is stored as text: a large number of unstructured texts are posted and sorted in web pages, digital libraries, and communities. Therefore, automatic methods are necessary to help people manage and filter this information instead of manual work. Predicting the class labels of online texts is required by a variety of applications. For example, in spam filtering, classification methods are used to determine junk information automatically. In news organization, because most news is provided on the Internet and the amount is huge, it is impractical to finish this task manually.

The motivation for exploiting background knowledge in text classification is attributed to two reasons. First, more information from texts can make classification more reasonable. Second, people have basic concepts and general knowledge in their minds; however, common corpora/datasets are special cases that may lack some of these basic concepts and general knowledge. These basic concepts and general knowledge are the background knowledge of our lives.

So classification is one of the most important tasks: filtering out the important texts we need to use and skipping the useless ones.

3.2 Application

There are a lot of applications nowadays that need the classification task. Depending on the classification task, there are different kinds of class sets. The most common application we all use nowadays is keyword search on Google: the server uses classification to filter the theme of the information we need to find and give back exactly the results we want. Google, Yahoo, and Facebook also use the classification task to remove spam emails, helping people avoid accessing dangerous links.


3.3 Overview of Methods and Contributions

For example: "Thủ tướng hôm nay vừa có một chuyến thăm ở Trung Quốc" ("The Prime Minister has just paid a visit to China today")

This sentence belongs to the politics theme.

3.3.2 Background Knowledge

To understand the categories of sentences, the model needs knowledge of the language's vocabulary and of the structure of the language.

For example, to understand a sentence in the English language, the corpus/dataset is the dictionary the model uses to build its knowledge base and then retrieve the meaning of the sentence. Then, based on its structural knowledge, the model can understand the context of the sentence.

3.3.3 Feature engineering

Feature engineering is the process of converting the original raw data set into sets of attributes. This makes the original raw data easier to work with, so the problem can be solved more easily; it also makes the data more compatible with prediction models and improves predictive accuracy.


Some of the text conversion models that are widely used in NLP today are Bag of Words, TF-IDF, and Word2Vec.

3.3.3.1 Bag of Words

In 1954, Zellig Harris, a linguist and mathematician, talked about the bag of words in a linguistic context in his article on distributional structure. According to Harris [4] in "Distributional Structure": "...for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use." This was the first premise for data scientists to research and develop the bag-of-words technique and apply it to data preprocessing for NLP.

The bag-of-words model is a popular method for converting unstructured text data into a simple vector-space representation. It creates a dictionary that contains the non-repeating words in a text. Each sentence is then represented as a vector whose length equals the length of the dictionary, and each cell in the vector represents the number of occurrences of the corresponding word. As its name suggests, it is a bag containing the words of the text, arranged without order, regardless of word sequence or grammar in the sentence.


In practice, the bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common type of feature calculated from the bag-of-words model is term frequency, namely, the number of times a term appears in the text. For the examples below, we can construct lists that record the term frequencies of all the distinct words.

Suppose we have two sentences in a Vietnamese text:

1. Con chó đang nằm canh nhà (The dog is lying down, guarding the house)

2. Con mèo đang nằm ngủ (The cat is lying down, sleeping)

Based on the above two sentences, a list of 8 distinct words will appear.

As another example, we have two sentences:

● Con chó nằm cạnh con mèo (The dog lies next to the cat)

● Con mèo nằm trước nhà (The cat lies in front of the house)

Based on the above two sentences, a list of 7 distinct words will appear:

["mèo", "chó", "cạnh", "nằm", "trước", "nhà", "con"]

Based on the created dictionary, we proceed to count occurrences as follows:

● [1,1,1,1,0,0,2]

● [1,0,0,1,1,1,1]
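The following minimal sketch reproduces this example with scikit-learn's CountVectorizer (an assumption of this illustration; the vocabulary is fixed to match the dictionary order above):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Con chó nằm cạnh con mèo",
    "Con mèo nằm trước nhà",
]

# Fix the vocabulary to the dictionary order used in the example above.
vectorizer = CountVectorizer(vocabulary=["mèo", "chó", "cạnh", "nằm", "trước", "nhà", "con"])
bow = vectorizer.fit_transform(sentences)

print(bow.toarray())
# [[1 1 1 1 0 0 2]
#  [1 0 0 1 1 1 1]]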

3.3.3.2 TF-IDF

This method was introduced by Karen Spärck Jones in a 1972 article on "term specificity". Although it works as a heuristic, it has been controversial on theoretical grounds for at least 30 years. It is nonetheless a premise for the data preprocessing of NLP models.

Like the second example in the Bag of Words section, a problem arises when a paragraph has too many duplicate words or words that appear too often, because they drown out the other words in the dictionary. If some words occur too frequently in a text, then considering only the raw frequency of each word will give wrong classification results and lead to low accuracy.

TF-IDF (Term Frequency – Inverse Document Frequency) is a popular method of representing text as a vector. TF-IDF weights the important words in the text: a high value shows the importance of a word in a document, offset by how often that word appears across the whole document set.

TF (term frequency) measures how often a word appears in a text. Each text has a different length, so some words may appear many times in a large document. Therefore TF is normalized by the document length:

tf(t, d) = f(t, d) / (total number of words in document d)

where f(t, d) is the number of occurrences of word t in document d.

IDF (inverse document frequency) inverts the frequency of a word across the document set, helping to better assess the importance of a word. When calculating TF alone, the importance of each occurrence is the same within a text. However, there are many words whose frequency is high but whose importance is low, for example "như", "thì", "là", "bị" in Vietnamese, or "is", "the", "and", "a", "an" in English. Using IDF helps us reduce the importance of such words:

idf(t, D) = log(N / (1 + z))

where N is the number of documents in the set D, and z is the number of documents in D containing the word t. In cases where the word t is not in the dictionary, z = 0, so we add 1 to avoid division by zero; in addition, using the logarithm does not change the idf property. The final weight is the product:

tf-idf(t, d, D) = tf(t, d) × idf(t, D)
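A minimal sketch with scikit-learn's TfidfVectorizer follows (an assumption of this illustration; note that scikit-learn uses a smoothed idf variant, log((1 + N) / (1 + z)) + 1, rather than the textbook formula above, but the idea is the same):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Con chó nằm cạnh con mèo",
    "Con mèo nằm trước nhà",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Words shared by both documents (con, mèo, nằm) get a lower idf weight
# than words that are unique to one document (chó, cạnh, trước, nhà).
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(word, round(float(vectorizer.idf_[idx]), 3))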

3.3.3.3 N-gram language model

Predicting is difficult, especially about the future, as the old quip goes. But how about predicting something that seems much easier, like the next few words someone is going to say? Predicting upcoming words, or assigning probabilities to sentences, is important because probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition. For a speech recognizer to realize that you said "I will be back soonish" and not "I will be bassoon dish", it helps to know that "back soonish" is a much more probable sequence than "bassoon dish". For writing tools like spelling correction or grammatical error correction, we need to find and correct errors in writing like "Their are two midterms", in which "There" was mistyped as "Their", or "Everything has improve", in which "improve" should have been "improved". The phrase "There are" will be much more probable than "Their are", and "has improved" than "has improve", allowing us to help users by detecting and correcting these errors [https://web.stanford.edu/~jurafsky/slp3/3.pdf]. So the n-gram is a method that can solve this problem with the task of computing P(w|h), the probability of a word w given some history h.


Suppose the history h is “its water is so transparent that” and we want to know the probability that the next word is the:

P(the|its water is so transparent that)

One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see "its water is so transparent that", and count the number of times this is followed by "the". This answers the question "Out of the times we saw the history h, how many times was it followed by the word w?", as follows:

P(the | its water is so transparent that) = Count(its water is so transparent that the) / Count(its water is so transparent that)

Instead of conditioning on the entire history, the n-gram model looks only n−1 words into the past. Thus, the general equation for the n-gram approximation to the conditional probability of the next word in a sequence is:

P(wn | w1:n−1) ≈ P(wn | wn−N+1:n−1)

For a Vietnamese sentence, we have the example:

"thủ_tướng đức nhận_lời tham_dự lễ kỷ_niệm"

→ thủ_tướng đức, đức nhận_lời, nhận_lời tham_dự, tham_dự lễ, lễ kỷ_niệm


Some examples of n-grams are bigrams, trigrams, and so on. The n-gram method is very useful for understanding sentences, but its memory cost is very large, so we cannot use this method for large values of n or very long texts.
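A minimal sketch of extracting the bigrams above follows, assuming the words have already been joined with underscores by a segmentation tool:

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "thủ_tướng đức nhận_lời tham_dự lễ kỷ_niệm"
tokens = sentence.split()

for bigram in ngrams(tokens, 2):
    print(" ".join(bigram))
# thủ_tướng đức
# đức nhận_lời
# nhận_lời tham_dự
# tham_dự lễ
# lễ kỷ_niệm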

3.3.3.4 Truncated Singular Value Decomposition

Matrix factorization is one of the most popular methods for reducing a dataset so that the hardware can save memory while calculating, and Singular Value Decomposition (SVD) is one of the matrix factorization methods. Suppose we have a matrix A of size m×n; we can factorize the matrix like this:

A = U Σ V^T

where U is an m×m orthogonal matrix, V is an n×n orthogonal matrix, and Σ is an m×n diagonal matrix whose diagonal entries σ1 ≥ σ2 ≥ … ≥ σr ≥ 0 are the singular values of A.

Truncated Singular Value Decomposition (low-rank approximation) keeps only the k largest singular values:

A ≈ Ak = Uk Σk Vk^T

where A is the matrix, k is the retained rank, and the σ values are the entries on the diagonal of Σ. If we want to keep about 90% of the information in the matrix, we calculate the retained fraction and choose k as the minimum number such that

(σ1² + … + σk²) / (σ1² + … + σr²) ≥ 0.9
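A minimal sketch with scikit-learn's TruncatedSVD follows (an assumption of this illustration; applied to a TF-IDF matrix this is also known as latent semantic analysis). The explained-variance ratio plays the role of the retained-information fraction above:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((100, 500))           # e.g. 100 documents x 500 terms

svd = TruncatedSVD(n_components=50)  # keep rank k = 50
A_reduced = svd.fit_transform(A)     # shape (100, 50)

# Fraction of the matrix's variance kept by the k components; raise k
# until this reaches the desired level (e.g. 0.9 for ~90% information).
print(A_reduced.shape, svd.explained_variance_ratio_.sum())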

3.3.4 Machine Learning Model

3.3.4.1 Naive Bayes

Naive Bayes is the most classical method. The Naive Bayes classification method is based on Bayes' theorem, named after Thomas Bayes. Naive Bayes is a prime example of the simplest solutions also being among the most powerful. Even with the remarkable advancement of machine learning in recent years, the naive method not only remains simple but is also fast, accurate, and reliable, especially in the field of natural language processing. This method is used a lot in classification problems.


Based on the Bayes formula, we will find the probability of a label given the probabilities of the given words. This means the prediction of a label for a certain piece of text depends on the frequency of occurrence of words and sentences and on the conditional probabilities of the words. Applying the algorithm:

P(label | text) = P(text | label) × P(label) / P(text)

For example, consider a classification problem with c labels whose input vector x represents the words; we call p(c | x) the probability of falling into class c when we know the vector x. From there, data can be classified by determining the class with the highest probability.

Applying the Bayes formula to p(c | x) yields:

p(c | x) = p(x | c) p(c) / p(x)

We can omit p(x) because p(x) does not depend on c. For calculation convenience, we can assume that the components x1, …, xd of the variable x are independent of each other, if c is known:

p(x | c) = p(x1 | c) × p(x2 | c) × … × p(xd | c)

In the algorithm execution, the values p(c) and p(xi | c) are taken from the training data set.

Training

+ For each label c, we compute its probability p(c)

+ For each element xi belonging to x, we calculate its probability p(xi | c) for class c

Testing

For a new element x, we predict its label as:

c* = argmax over c of p(c) × p(x1 | c) × … × p(xd | c)
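A minimal hedged sketch of this training/testing procedure with scikit-learn's multinomial Naive Bayes follows (an assumption of this illustration; the two tiny training sentences and labels are made-up placeholders, not the thesis dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "thủ_tướng tham_dự hội_nghị",      # politics
    "đội_tuyển thắng trận chung_kết",  # sports
]
train_labels = ["chinh_tri", "the_thao"]

# Training: estimate p(c) and p(xi | c) from word counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Testing: predict the label with the highest posterior probability.
print(model.predict(["trận chung_kết của đội_tuyển"]))  # ['the_thao']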

Advantages


The algorithm is simple, effective, fast, and time-saving. It is widely used in classification problems and can provide higher predictive power than other models even though it requires less input data.

Disadvantages

The output probability will be wrong in some cases, so we should not rely too heavily on it. The Naive Bayes algorithm is only correct in some cases; applied to the real world, its capabilities are limited. The independence assumption does not work well in situations where the data are interdependent. The model parameters are independent probability values, so the interaction between them cannot be estimated.

3.3.4.2 Neural Network

The first neural network was proposed in 1944 by Warren McCulloch and Walter Pitts, two researchers at the University of Chicago, who in 1952 moved to MIT as founding members of what is sometimes called the first cognitive science department.

3.3.4.3 Recurrent Neural Network

Recurrent neural networks were studied by David Rumelhart, an American psychologist, in 1986. In 1982 the Hopfield network, a special type of RNN, was discovered by the American scientist John Joseph Hopfield. The year 1993 was a turning point for the RNN model, when it solved a deep learning task requiring more than 1,000 layers.

As you know, neural networks are developed and function like the human nervous system. A network consists of three main parts: the input layer (x), the hidden layers (the neural network itself), and the output layer (y). The inputs and outputs of an ordinary NN are independent of each other, so it cannot be used for problems where the output depends on a sequence, such as predicting the next word or completing a sentence.

For example, when you are reading this sentence, each word contributes a piece of information that makes up the meaning of the whole sentence. Based on the sentence you have just read, your brain stores the information and continues processing the semantics of the next sentence. This is a complicated process that an ordinary neural network cannot perform, so the RNN was born to solve this problem: an RNN is able to recall information that has been computed previously in order to give the most accurate prediction at the current step.
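A minimal sketch of a single recurrent step follows, assuming numpy; the point is that the hidden state h carries information from previous words, so the current prediction can use past context:

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One RNN time step: combine the current input with the previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                        # toy input and hidden sizes
W_xh = rng.normal(size=(d_in, d_h))
W_hh = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                       # initial state
for x_t in rng.normal(size=(5, d_in)):  # five "words" as toy vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)  # (8,)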
