VIETNAM NATIONAL UNIVERSITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
NGUYEN MINH TRI
APPLYING MACHINE LEARNING TECHNIQUES IN EXTRACTING INFORMATION FROM THE LOG FILE
Major: Computer Science    Major ID: 60480101
MASTER THESIS
Ho Chi Minh City, 03 July 2019
THIS WORK WAS COMPLETED AT HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY – VNU-HCM
Scientific supervisor: Assoc. Prof. Dr. Nam Thoai
Reviewer 1: Assoc. Prof. Dr. Huynh Trung Hieu
Reviewer 2: Dr. Le Thanh Sach
This master thesis was defended at Ho Chi Minh City University of Technology – VNU-HCM, Ho Chi Minh City, on 03 July 2019.
The master thesis assessment committee includes:
1. Assoc. Prof. Dr. Dang Tran Khanh
2. Dr. Le Hong Trang
3. Assoc. Prof. Dr. Huynh Trung Hieu
4. Dr. Le Thanh Sach
5. Assoc. Prof. Dr. Tran Cong Hung
Confirmation of the Chairman of the assessment committee and the Head of the specialized management department after the thesis has been corrected (if any).
CHAIRMAN OF THE ASSESSMENT COMMITTEE                HEAD OF FACULTY OF CSE
VNU – HO CHI MINH CITY
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
Student name: NGUYEN MINH TRI Student ID: 1770026
Date of birth: 10-12-1994 Place of birth: Lam Dong
Major: Computer Science Major ID: 60480101
I. THESIS TITLE: Applying machine learning techniques in extracting information from the log file
II. TASKS AND CONTENTS: Study machine learning techniques in processing and analyzing log files. Propose several analytical methods to extract the characteristics of online contents as well as predict their popularity in the near future based on the historical accesses recorded in the log file.
(Sign and full name)
HEAD OF COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
(Sign and full name)
ACKNOWLEDGEMENTS
Finally, I must express my very profound gratitude to my parents, my brothers and my friends for providing me with unfailing support and continuous encouragement throughout my two years of study as well as the process of researching and doing this thesis.
Thank you!
Nguyen Minh Tri 03/07/2019
ABSTRACT
With the rapid growth of Internet technology and infrastructure, we have entered the era of data explosion. Recently, a massive amount of data, including log files which contain a lot of valuable information, has been generated. Since the amount of log files is growing exponentially, processing and analyzing them become real challenges. Among these challenges, time series prediction can be considered one of the fundamental problems in various domains, related to wide-ranging and high-impact applications. Despite the fast-paced progress in the field of time series analysis over the past decades, the quality of prediction has not yet met the increasing user demands. At the beginning, some simple machine learning techniques were used to make predictions based on many kinds of linear relationships. However, as the amount of data is increasing and the relationships among data sequences are becoming more complicated, those earlier techniques no longer meet the requirements. With the development of machine learning and deep learning, many novel mechanisms and techniques have been proposed to capture the non-linear relations. They have significantly improved the performance of time series prediction.

This work mainly focuses on the time series prediction task and also provides some analysis of log files. Toward this goal, I not only investigate some characteristics of several real datasets but also introduce the Derivative-based Multivariate Linear Regression model and the Attention-based Non-Recursive Neural Network model to predict the popularity of online contents within these datasets. My contributions include providing some useful metrics to extract valuable knowledge from log files, exploiting some superior techniques from the field of natural language processing to address short-term prediction in time series, and proposing two novel models to predict the popularity of online contents. The experimental results show that the proposed models not only outperform some baselines on the real datasets but also significantly improve the inference time. Finally, the limitations of the proposed models are discussed as a part of my study, which makes room for further experiments in the future.
PLEDGE OF HONOR
In this thesis, any formulation, idea, research, reasoning or analysis borrowed from a third party is accurately indicated in such a way that the original source is immediately recognizable, with respect to citation techniques and the author's rights.
I affirm that, in addition to these cited references, the entire content of my thesis is my own study, not copied from any other document.
I declare that I have not plagiarized, nor committed any other kind of fraud.
Nguyen Minh Tri
CONTENTS
ACKNOWLEDGEMENTS iv
ABSTRACT v
PLEDGE OF HONOR vi
CONTENTS vii
LIST OF FIGURES x
LIST OF TABLES xii
LIST OF ACRONYMS xiii
CHAPTER 1: INTRODUCTION 14
1.1 Overview 14
1.2 Contribution 16
1.3 Research scope 17
1.4 Thesis Outline 17
CHAPTER 2: BACKGROUND 18
2.1 Weblog Mining 18
2.1.1 Data pre-processing 19
2.1.2 Data Mining 20
2.2 Time Series Prediction 20
2.2.1 Linear regression 20
2.2.2 Non-linear regression 21
2.3 Language modeling 22
2.3.1 Sequence-to-sequence model 22
2.3.2 Attention mechanism 23
2.3.3 The Transformer 24
CHAPTER 3: RELATED WORK 26
3.1 Weblog Mining 26
3.1.1 Web content mining 26
3.1.2 Web structure mining 27
3.1.3 Web usage mining 28
3.2 Time Series Prediction 29
3.3 Predicting content popularity 30
CHAPTER 4: LOGFILE PROCESSING & ANALYSIS 32
4.1 Weblog of HCMUT Website 32
4.1.1 Data Pre-processing 32
4.1.2 Pattern Analysis 39
4.2 MovieLens Dataset 44
4.2.1 The Popularity Distribution 45
4.2.2 The Movies Lifetime 47
4.2.3 Access Evolution Pattern 48
4.3 Youtube Dataset 50
4.3.1 Youtube Dataset 2008 50
4.3.2 Youtube Dataset 2019 53
CHAPTER 5: PREDICTING MODEL 58
5.1 Derivative-based Multivariate Linear Regression 58
5.2 Attention-based Non-Recursive Neural Network 59
CHAPTER 6: EXPERIMENT 63
6.1 Parameter Sensitivity 64
6.1.1 Derivative-based Multivariate Linear Regression 64
6.1.2 Attention-based Non-Recursive Neural Network 66
6.2 Time series prediction 68
6.3 Predicting online content popularity 70
6.3.1 Experiments in MovieLens dataset 70
6.3.2 Experiments in Youtube dataset 73
6.3.3 Inference time efficiency 76
CHAPTER 7: CONCLUSION & FUTURE WORK 78
REFERENCE 80
LIST OF PUBLISHED ARTICLES 86
AUTOBIOGRAPHY 87
LIST OF FIGURES
Fig 1 URI Structure
Fig 2 Client IP Lookup Model
Fig 3 Data Pre-processing process
Fig 4 The global distribution of user access to the HCMUT website
Fig 5 The global proportion coupled with the number of access to the HCMUT website
Fig 6 The domestic distribution of user access to the HCMUT website
Fig 7 The domestic proportion coupled with the number of access to the HCMUT website
Fig 8 Distribution of user access within domestic ISPs
Fig 9 The market share of internet service in some major cities
Fig 10 The monthly trend of user's preferences in HCMUT website
Fig 11 The proportion of contents popularity in MovieLens dataset
Fig 12 Movies popularity distribution at a specific timestamp
Fig 13 The CDF of movies popularity in MovieLens dataset
Fig 14 Movies' lifetime in MovieLens dataset
Fig 15 The proportion of videos popularity in Youtube dataset
Fig 16 The CDF of views in Youtube dataset
Fig 17 The Alpha parameter of gamma distribution over the observation
Fig 18 The Beta parameter of gamma distribution over the observation
Fig 19 Hourly views count of top 50 popular videos in several countries
Fig 20 Percentage of views in a given hour of a day in Japan
Fig 21 Percentage of views in a given hour of a day in the US
Fig 22 Percentage of views in a given hour of a day in Vietnam and Singapore
Fig 23 Percentage of views in a given day of a week in the US and Japan
Fig 24 The structure of the attention-based non-recursive neural network
Fig 25 RMSE and MAE in predicting with MovieLens dataset
Fig 26 RMSE and MAE in predicting with Youtube dataset
Fig 27 RMSE comparison among the different numbers of stacked layers
Fig 28 RMSE vs hidden size of attention layers
Fig 29 DA-RNN’s train and test prediction on NASDAQ 100 dataset
Fig 30 ANRNN’s train and test prediction on NASDAQ 100 dataset
Fig 31 DA-RNN's train and test prediction on MovieLens dataset
Fig 32 FC-ANN's train and test prediction on MovieLens dataset
Fig 33 DMLR's train and test prediction on MovieLens dataset
Fig 34 ANRNN's train test prediction on MovieLens dataset
Fig 35 DA-RNN's train and test prediction on Youtube dataset
Fig 36 FC-ANN's train and test prediction on Youtube dataset
Fig 37 DMLR's train and test prediction on Youtube dataset
Fig 38 ANRNN's train and test prediction on Youtube dataset
Fig 39 The comparison of inference time among DA-RNN, ANRNN, FC-ANN, and DMRL in two datasets
LIST OF TABLES
Table 4.1 Structure of HCMUT weblog (ECLF)
Table 4.2 Example of data cleaning
Table 4.3 Example of Access Identification
Table 4.4 Country Normalization
Table 4.5 City Normalization
Table 4.6 ISP Normalization
Table 4.7 The proportion of different evolution patterns
Table 4.8 The format of the Youtube dataset (2008)
Table 4.9 The format of the Youtube dataset (2019)
Table 6.1 Average RMSE comparison between DA-RNN and ANRNN when practicing with the NASDAQ 100 dataset
Table 6.2 RMSE and MAE comparisons between different methods when practicing with the MovieLens dataset
Table 6.3 Mean correlation among series in the 2 datasets
Table 6.4 RMSE and MAE comparisons between different methods when practicing with the Youtube dataset
LIST OF ACRONYMS
DA-RNN    Dual-stage Attention-based Recurrent Neural Network
FC-ANN    Fully Connected Artificial Neural Network
DMLR      Derivative-based Multivariate Linear Regression
ANRNN     Attention-based Non-Recursive Neural Network
NARMAX    Nonlinear Autoregressive Moving Average with Exogenous Inputs
ARIMA Autoregressive Integrated Moving Average
CHAPTER 1: INTRODUCTION
1.1 Overview
Specifically, the emergence of social networks has also brought an enormous and ever-growing amount of online content into our digital world. In this context, video contents are determined to be a dominant cause of network congestion, as they would account for more than 80% of total Internet traffic by 2020 [1]. It has been revealed that user attention is distributed in a skewed fashion: a few contents receive massive views and downloads, whereas most contents get little user attention [2]. By accurately predicting the popularity of online content in the future, network operators can proactively manage the distribution as well as cache replacement policies for online contents across their infrastructures. Service providers can exceedingly benefit from designing appropriate advertising strategies and recommendation schemes which encourage their users to reach the most relevant and popular contents. Thus, predicting the popularity of online contents, especially videos, is of great importance as it supports and drives the design and management of various services. Several efforts have been made to predict the long-term popularity of online videos by analyzing their historical accesses [2], [3], [4]. However, it has been hard to accurately predict the popularity of a given content in the near future, or to make a short-term prediction. In this thesis, I mainly focus on exploiting several mechanisms and techniques in the field of time series prediction to address this problem.
Fundamentally, predicting in time series involves fitting models on historical data and applying them to determine future values. Since time series prediction is a basic problem in various domains, it has usually been applied in wide-ranging and high-impact applications such as financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7], complex dynamical system analysis [8] and so on. There are many different methods for performing time series prediction, such as the exponential moving average [9], [10], the autoregressive model [11], polynomial regression [12], the autoregressive integrated moving average [13], [14] and so on. Notably, the Recurrent Neural Network (RNN) and its variants have recently outperformed the traditional methods and become the standard for some time series analysis problems. Besides, several studies on Natural Language Processing (NLP) addressing problems such as sentiment analysis and sequence-to-sequence translation have recently resulted in many superior mechanisms and techniques that can be widely applied in other areas, especially in time series prediction.
Although all of the aforementioned algorithms and techniques can achieve substantial results in time series prediction, applying them to address the problem of predicting online content popularity is a fascinating and challenging task.
Hence, the purposes of this thesis are:
• Analyzing the characteristics of some real datasets, including the popularity of online contents on HCMUT's website [15], Youtube [16], MovieLens [17] and so on.
• Applying mechanisms and techniques in time series prediction to address the short-term prediction problem, as well as proposing appropriate models to predict online content popularity based on the historical accesses.
• Evaluating the proposed models on real datasets in comparison with several baseline methods.
1.2 Contribution
a. Practical contribution:
§ Providing several useful metrics to discover the characteristics of online contents. After pre-processing the raw logs, selecting appropriate metrics is of great importance to understand the characteristics of the dataset as well as to extract valuable knowledge from it. Thereby, we can design suitable predicting models.
§ Predicting online content popularity with reasonable accuracy. Knowing the popularity of online contents, the service provider can select appropriate marketing strategies, and advertisers can maximize their revenues through better advertising placement [2]. For network management, the network operators can proactively manage the bandwidth requirements and effectively deploy their caching systems [18].
§ Providing effective models in time series prediction. As predicting in time series is a basic problem in many domains, the proposed models in this thesis can also be widely applied in many other areas such as financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7] and complex dynamical system analysis [8].
b. Scientific contribution:
§ Applying several mechanisms from the field of Natural Language Processing (NLP) to time series prediction. Basically, the attention mechanism is a ubiquitous method used in modern deep learning models to tackle a variety of tasks such as language modeling [19], sentiment analysis [19], natural language inference [20], abstractive summarization [21], learning task-independent sentence representation [22] and so on. Here, one of the proposed models is the combination of two attention mechanisms to address time series prediction.
§ Proposing effective models to address the short-term prediction problem in predicting online content popularity. Although the field of predicting online content popularity has been extensively researched in the recent decade, only a few studies focus on accurately predicting the popularity of a given content in the near future [4], [3]. In this thesis, I apply several state-of-the-art techniques in time series prediction to tackle this problem.
§ Improving inference performance significantly compared to baseline methods. Without using any recurrent or convolutional neural network, the proposed models are highly parallelizable, which dramatically reduces the training and testing time. In addition, the empirical results also show that my models can outperform the baselines when practicing on some real datasets.
1.3 Research scope
§ Generally, this thesis focuses on extracting useful knowledge from log files; it provides several metrics to help us understand the characteristics of some real datasets that I used in the experiments.
§ In particular, I consider some mechanisms and techniques from the fields of time series prediction and natural language processing to address short-term prediction problems, especially predicting online content popularity.
1.4 Thesis Outline
My thesis includes seven chapters in total; the rest of it is organized as follows. Chapter 2 provides the background knowledge related to the techniques and mechanisms being used in this thesis. Chapter 3 outlines some related research that has been presented in the fields of log analysis, predicting online content popularity, and time series prediction. Chapter 4 presents some real datasets, explores their characteristics and explains some analysis results. That chapter also draws out the challenges in time series prediction when applying it to these datasets. In Chapter 5, I propose two novel methods and describe their implementation. Chapter 6 discusses the empirical results of my models in comparison to some baselines. Finally, Chapter 7 draws the conclusions and proposes future work.
CHAPTER 2: BACKGROUND
Although each log file has its own characteristics, which makes mining useful knowledge from them a challenging task, there are some mining techniques that can be widely applied in the mining process as essential steps to extract valuable information. For a better understanding, this Chapter not only provides some basic techniques on weblog mining but also explains some advanced techniques and algorithms related to the field of time series prediction.
2.1 Weblog Mining
Weblog mining can be considered a sub-domain of data mining, the process of discovering useful patterns or knowledge from data sources such as databases, texts, the web, and so on. Therefore, weblog mining is also carried out in three main steps [23]:
• Pre-processing: a process that removes noise and unnecessary information from the raw data. Moreover, as the raw data is not always suitable for the mining algorithms due to many reasons, it needs to be transformed before being mined. In weblog mining, the process may contain several sub-processes such as data cleaning, pageview identification, user identification, sessionization, path completion, data integration, and so on.
• Data Mining: the process that applies some data mining algorithms to produce patterns or knowledge.
• Post-processing: in fact, the discovered patterns are not always useful. This process identifies which ones can be applied in real-world applications.
Fundamentally, this thesis will provide some techniques to pre-process weblogs, and some machine learning algorithms to predict the popularity of online contents. The results can be exceedingly applied in optimizing caching policies as well as recommendation systems.
2.1.1 Data Pre-processing
• Pageview identification:
Similar to data cleaning, identification of pageviews, or so-called access identification, is always site-specific since it heavily depends not only on the intra-page architecture but also on the page contents and some of the underlying site domain knowledge. Basically, a specific user event may provoke several requests to web objects or resources. In other words, pageview identification is to determine the collection of web objects and resources that represents the user event while accessing contents on the website. For a static website, each HTML file usually corresponds to a pageview. However, most websites are dynamic and their pageviews may be constructed from some static templates coupled with contents generated by the server applications based on a set of parameters. That also explains why pageview identification requires some site domain knowledge.
• User identification:
In fact, WUM applications do not always require knowledge about the user's identity, but it is necessary to distinguish among different users [23]. However, not all sites require user authentication for accessing contents or querying information, which also makes it hard to identify the actual user event. Although the IP address alone may not be sufficient for mapping log entries onto the set of unique users, it is still possible to accurately identify unique users by using the combination of the IP address and some other information like user agents and referrers [24], and by applying some other techniques in pageview identification or sessionization.
• Sessionization:
Sessionization is the process of grouping the activities of each user into sessions in order to identify the user access pattern. In the absence of authentication mechanisms, this process must rely on some heuristic methods to sessionize user activities. Here, the goal of a sessionization heuristic is to reconstruct the actual sequence of actions performed by a given user during one visit based on the clickstream data [23].
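To make the idea concrete, the sketch below groups requests from the same IP into sessions whenever consecutive requests are separated by less than a fixed inactivity threshold. It is illustrative only: the 30-minute threshold and the (ip, timestamp) record layout are assumptions, not the actual heuristic or schema used for the HCMUT weblog.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Time-oriented sessionization heuristic: requests from the same IP belong to
# one session as long as consecutive requests are closer than the threshold.
THRESHOLD = timedelta(minutes=30)   # illustrative value only

def sessionize(entries):
    """entries: iterable of (ip, timestamp) tuples, timestamps as datetime."""
    sessions = []
    ordered = sorted(entries, key=lambda e: (e[0], e[1]))
    for ip, group in groupby(ordered, key=lambda e: e[0]):
        current, last_ts = [], None
        for _, ts in group:
            if last_ts is not None and ts - last_ts > THRESHOLD:
                sessions.append((ip, current))   # close the previous session
                current = []
            current.append(ts)
            last_ts = ts
        if current:
            sessions.append((ip, current))
    return sessions

if __name__ == "__main__":
    fmt = "%Y-%m-%d %H:%M:%S"
    raw = [
        ("10.0.0.1", datetime.strptime("2017-01-01 08:00:00", fmt)),
        ("10.0.0.1", datetime.strptime("2017-01-01 08:10:00", fmt)),
        ("10.0.0.1", datetime.strptime("2017-01-01 10:00:00", fmt)),  # new session
        ("10.0.0.2", datetime.strptime("2017-01-01 08:05:00", fmt)),
    ]
    for ip, times in sessionize(raw):
        print(ip, len(times), "request(s)")
```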
2.1.2 Data Mining
After pre-processing log files, some basic statistics can be applied to extract valuable information, for example, grouping user accesses based on different geographical regions (country or province), plotting the trend of user preferences over time, and classifying and analyzing user access patterns. Moreover, some analyses based on the gamma distribution and the cumulative distribution also provide insights into the distribution of user attention over different kinds of online contents. All those techniques will be discussed in a later Chapter.
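As an illustration of the kind of basic statistics listed above, the following pandas sketch groups accesses by country and derives a monthly trend per content item. The column names ("time", "country", "content") and the toy records are assumptions, not the real schema of the processed log.

```python
import pandas as pd

# Toy pre-processed log; in practice this would be loaded from the cleaned dataset.
log = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-03", "2017-01-15", "2017-02-02", "2017-02-20"]),
    "country": ["Vietnam", "Vietnam", "US", "Vietnam"],
    "content": ["/vi/newsletter", "/vi/newsletter", "/en/admission", "/vi/newsletter"],
})

# Accesses per country and their share of the total.
per_country = log.groupby("country").size()
print(per_country / per_country.sum())

# Monthly trend of accesses per content item.
monthly = log.groupby([log["time"].dt.to_period("M"), "content"]).size()
print(monthly)
```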
Since other advanced machine learning techniques come from other fields of study, such as time series prediction and language modeling, the background knowledge related to them will be provided in the next sections.
2.2 Time Series Prediction
2.2.1 Linear regression
Fundamentally, linear regression is one of the most basic and popular techniques applied in the field of time series prediction. Nevertheless, it has achieved significant results in many real-world applications such as financial time series forecasting [25], [26] and predicting content popularity [2], [3]. These models were proposed based on observations of strong linear relations in some real datasets. For instance, in the study of Szabo et al. [2], the predicting model was described by the following formula [2]:
ln N_s(t_2) = ln r(t_1, t_2) + ln N_s(t_1) + ξ_s(t_1, t_2)    (1)

where N_s(t) denotes the popularity of content s at time t, t_1 is the early observation (reference) time, t_2 is the target time, and r(t_1, t_2) accounts for the linear relationship between the log-transformed popularities at the two times. Finally, ξ_s is a noise term drawn from a given distribution with mean 0 that describes the randomness observed in the data. Since the model is quite simple, it can only reasonably predict long-term popularity.
The later study of Pinto et al. [3] provided an enhanced model for predicting video popularity by redefining the popularity N as follows:
N(v, t_r, t_t) = Θ_{t_r, t_t} · X_{t_r}(v)    (2)

where v is a given video, Θ_{t_r, t_t} = (θ_1, θ_2, ..., θ_{t_r}) is the set of parameters that the model has to learn, which depends only on t_r and t_t, X_{t_r}(v) = (x_1(v), x_2(v), ..., x_{t_r}(v))^T is the feature vector, and x_i(v) is the number of views received by video v on the i-th day since it was uploaded. The experimental results proved that Pinto's model brought substantial improvement, but it was not robust enough to accurately predict the popularity of online contents in the short term.
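The following scikit-learn sketch illustrates a multivariate linear model in the spirit of this formulation: the daily view counts of the first t_r days form the feature vector X_{t_r}(v), and a linear model is fitted to predict the popularity at the target time t_t. The data is synthetic and the dimensions are arbitrary; this is not the configuration evaluated later in this thesis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
t_r, n_videos = 7, 500

# x_1(v)..x_{t_r}(v): synthetic daily view counts during the first t_r days.
early_views = rng.poisson(lam=50, size=(n_videos, t_r)).astype(float)
true_theta = np.linspace(1.5, 0.5, t_r)                # synthetic ground-truth weights
target_popularity = early_views @ true_theta + rng.normal(0, 5, n_videos)

model = LinearRegression()
model.fit(early_views, target_popularity)              # learns Theta_{t_r, t_t}

new_video = rng.poisson(lam=50, size=(1, t_r)).astype(float)
print("predicted popularity at t_t:", model.predict(new_video)[0])
```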
2.2.2 Non-linear regression
Recently, non-linear models have been widely applied in a variety of applications; one of their uses is forecasting, especially forecasting for time series. The rapid development of deep learning has resulted in many superior predicting models. The idea is that, given the previous values of the target series (y_1, y_2, ..., y_{t-1}) with y_i ∈ ℝ, and the current and past values of n other input series (X_1, X_2, ..., X_t) with X_i ∈ ℝ^n, a non-linear model aims to learn the non-linear function F that maps those values to the current target value y_t; here, ŷ_t represents the predicted value at time step t:

ŷ_t = F(y_1, y_2, ..., y_{t-1}, X_1, X_2, ..., X_t)    (3)
To learn the non-linear function F, many models have been proposed; for example, Hushchyn et al. [27] used a simple ANN, Brockwell et al. [28] used ARMA, Gao et al. [29] used NARX, and so on. Although these models can give reasonable results, there are grounds for optimism: as computational capabilities increase, more complex models become able to handle the complicated relationships within datasets.
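As a minimal illustration of learning the non-linear function F in Eq. (3), the sketch below feeds a window of previous target values and exogenous series into a small fully connected network in PyTorch. The layer sizes, window length and random training data are assumptions chosen only to show the shapes involved, not a model proposed in this thesis.

```python
import torch
import torch.nn as nn

T, n_exo = 10, 3                       # window length and number of input series
in_dim = (T - 1) + T * n_exo           # y_1..y_{t-1} plus flattened X_1..X_t

# A small fully connected network standing in for the non-linear function F.
model = nn.Sequential(
    nn.Linear(in_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy training loop on random data, just to show how F is fitted.
for step in range(100):
    y_hist = torch.randn(32, T - 1)        # previous target values
    x_hist = torch.randn(32, T * n_exo)    # flattened exogenous series
    y_true = torch.randn(32, 1)            # the target y_t
    y_pred = model(torch.cat([y_hist, x_hist], dim=1))
    loss = loss_fn(y_pred, y_true)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```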
2.3 Language modeling
2.3.1 Sequence-to-sequence model
Although some Deep Neural Networks (DNNs) are notably powerful machine learning models that achieve excellent performance on various problems in Natural Language Processing such as speech recognition [30], [31], DNNs can only be effectively applied to problems whose inputs and outputs can be encoded as vectors of fixed dimensionality [32]. In fact, many critical problems are expressed with sequences whose lengths are unknown or arbitrary, for example, question answering, machine translation, and so on, which gives birth to sequence-to-sequence models. The first sequence-to-sequence model was introduced in the field of neural machine translation in the study of Sutskever et al. [32], which addresses the problem of sequence learning.
Commonly, a sequence-to-sequence model contains two sub-modules: an encoder and a decoder. The encoder encodes the input sequence into a context vector, which will be decoded by the decoder to generate the output. In some later studies, the sequence-to-sequence model is also widely applied in other applications such as speech recognition [33] and text summarization [34], [35], [36]. In addition, some variants of the encoder-decoder architecture have also been proposed, which differ in the conditional input and the type of core networks, for example, Bahdanau et al. [37] and Luong et al. [38] using RNNs.
Currently, long short-term memory networks [39] and gated recurrent units [40] are commonly used as the recurrent networks in encoder-decoder architectures since both of them allow the model to memorize information from the previous steps and capture
the long-term dependencies. Besides, some models also use convolutional neural networks to cope with large datasets. For instance, Gehring et al. [41] introduced the first fully convolutional model for sequence-to-sequence learning, which can outperform recurrent models on large benchmark datasets.
2.3.2 Attention mechanism
In an encoder-decoder model using RNNs, the encoder sequentially encodes the input sequence x = (x_1, x_2, ..., x_T) into fixed-size vectors h = (h_1, h_2, ..., h_T), also known as hidden states, with h_t = f(h_{t-1}, x_t). The last hidden state is used as the context vector, which is decoded in the same manner to generate the output. Here, the attention mechanism allows the model to create shortcuts between the entire input sequence as well as the output and the context vector. The weights of the shortcut connections are customizable for each output element. As it can achieve significant results in the field of language modeling, several variants of the attention mechanism have been proposed to address some specific problems. These mechanisms include content-based attention [42], additive (also called "concat") attention [37], location attention [38], scaled dot-product attention [43], and so on.
Below is a brief summary of several popular attention mechanisms and their corresponding alignment score functions:
• Scaled dot-product attention:

score(s_t, h_t) = (s_t^T h_t) / √n

where s_t and h_t are the cell state and the hidden state of the RNN model at time step t, and n is the dimension of the two vectors s_t and h_t. Then, v_a and W_a are the parameters that the model has to learn.
Recently, Cheng et al. [19] proposed self-attention (also known as intra-attention), a mechanism that performs shallow reasoning with memory and attention. In contrast to inter-attention, self-attention requires the model to compute the attention scores of different positions within a single sequence. In fact, it has been successfully applied in a variety of tasks including language modeling [19], sentiment analysis [19], natural language inference [19], [20], abstractive summarization [21], learning task-independent sentence representation [22], and so on.
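For concreteness, the following PyTorch sketch implements the scaled dot-product form of attention from [43] and applies it as self-attention, where queries, keys and values all come from the same sequence. The tensor sizes are arbitrary and purely illustrative.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention; q, k, v have shape (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)     # attention distribution
    return torch.matmul(weights, v), weights

# Self-attention: queries, keys and values are all derived from the same sequence.
x = torch.randn(2, 5, 16)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)    # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```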
2.3.3 The Transformer
One of the limitations of models belonging to the RNN family is that they compute each time step sequentially, which leads to long training and inference times. Much effort has been made to address this issue, such as replacing the RNNs with very deep CNNs to capture the long-term dependencies. For instance, Gehring et al. [41] investigated convolutional layers for sequence-to-sequence tasks, Zeng et al. [44] exploited a convolutional deep neural network to extract lexical and sentence level features, and Conneau et al. [45] applied very deep convolutional nets to text processing.
Unlike those approaches, in a recent study, Vaswani et al. [43] proposed a novel model called the Transformer. Like common sequence-to-sequence models, the Transformer's architecture is built on the encoder-decoder structure. However, its encoder is composed of N stacked identical layers. Each layer has two sub-layers, which are the self-attention layer and the fully connected feed-forward layer. In the same manner, the decoder consists of M stacked identical layers, but its elemental layer has three sub-layers. In addition to the two sub-layers in the encoder, the decoder
has an encoder-decoder attention layer to perform attention over the source sequence representation. By entirely eliminating recurrent and convolutional connections and applying the self-attention mechanism to capture the long-term dependencies, the Transformer has been proved to reach a new state of the art in translation quality.
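A minimal sketch of the encoder side of such an architecture, using PyTorch's built-in Transformer modules purely for illustration, is shown below. The number of layers, the model dimension and the number of heads are arbitrary choices, not the configuration used later in this thesis.

```python
import torch
import torch.nn as nn

# The encoder side of the Transformer: N identical layers, each combining a
# self-attention sub-layer and a position-wise feed-forward sub-layer.
d_model, n_heads, n_layers = 64, 4, 3

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=256, batch_first=True
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

src = torch.randn(8, 20, d_model)   # (batch, sequence length, model dimension)
memory = encoder(src)               # same shape, computed without any recurrence
print(memory.shape)
```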
In short, this Chapter has briefly summarized some background knowledge related to not only weblog mining but also time series prediction and several advanced techniques in the field of language modeling. To have a better understanding of how they can be applied in this work, the next Chapter will provide some related works.
CHAPTER 3: RELATED WORK
It is a fact that the Internet plays a vital role in our daily lives. With the advent of the Internet, people can easily reach information by means of surfing websites. As the number of Internet users is increasing exponentially, we are generating an enormous amount of data every day. Therefore, the studies of discovering knowledge, modeling and predicting user access patterns based on analyzing log files have become particularly important [46]. Since these fields of study have been extensively researched in the last decade, in this Chapter, I will provide some of the relevant studies related to weblog mining, time series prediction, as well as predicting content popularity.
3.1 Weblog Mining
In recent decades, weblog mining has been an endless research area which has attracted extensive community attention. As mentioned in Liu's study [23], weblog mining can be categorized into three sub-domains, which are web content mining (WCM), web structure mining (WSM) and web usage mining (WUM). Since different websites have different weblog structures, and most of the studies mainly focus on analyzing particular log files and proposing appropriate approaches, there are only a few considerable works. Here, I only provide some typical research related to my thesis.
3.1.1 Web content mining
In terms of web content mining, many efforts have been made to address analyzing user access patterns from weblogs. For instance, in Spiliopoulou's work [47], [48], a mining system called Web Utilization Miner was used to extract interesting navigation patterns from weblogs. At the time, it was proven to satisfy the experts' criteria by exploiting an innovative aggregated storage representation for the information in the logs of a real web server. The authors also proposed their own mining language (called MINT), which supports specifications of a statistical, structural and textual nature, to build the system. However, as web contents and weblog structures change gradually, those techniques are no longer suitable for analyzing weblogs nowadays. That is also the reason why the authors emphasized the importance of data preparation, or so-called data pre-processing, in mining weblogs. Recently, Alfaro et al. [49] combined supervised machine learning algorithms and unsupervised learning techniques for sentiment analysis and opinion mining. They proposed a multi-stage method for the automatic detection of different opinion trends based on analyzing weblogs.
3.1.2 Web structure mining
In the scope of web structure mining, many algorithms and techniques have been proposed, such as frequent pattern growth (FP-growth) [50] and association rule mining (ARM) [51], to extract valuable information about the user from weblogs. In a study of Iváncsy et al. [52], some FP mining techniques were proposed to explore different types of patterns in the weblogs. By giving information on problems arising to the users, mining frequent patterns from weblogs is of great importance to optimize the web structure of a website and improve the performance of the whole system. For example, Perkowitz et al. [53] investigated the problem of index page synthesis. In order to create an adaptive website, the authors proposed a mining cluster to find collections of cohesive pages and then gather them into the same group by applying their PageGather algorithm. Based on that, the system was able to automatically generate pages that facilitate the visitors' navigation within the website.
In a study of Wang et al. [54], an enhanced algorithm called weighted association rules (WAR) was proposed. It assigns a numerical attribute to each item and judges those weights in a particular domain. A later study [55], inspired by the same idea as WAR, addressed the problem of discovering binary relationships in transaction datasets in weighted settings, which was proven to be more scalable and efficient.
However, it has been shown that weblogs, in general, are sparse and also have arbitrary-length patterns, which makes it difficult for a conventional algorithm to mine the user access patterns. Therefore, Sun et al. [56] presented an algorithm named Combined Frequent Pattern Mining (CPFM) to address the problem. As the algorithm
is the combination of the candidate-generation-and-test approach [57] and the pattern-growth approach [58], it can adaptively mine both long and short frequent patterns.
3.1.3 Web usage mining
In the context of web usage mining, there are many studies focusing not only on user access but also on the usage patterns of web pages. Basically, they can be categorized into two sub-domains, which are predictive and descriptive [59], [60].
In the descriptive domain, data is classified or characterized to extract useful knowledge. For example, Zhang et al. [61] proposed the Self Organizing Map (SOM) model, or so-called Kohonen neural network, to discover user groups in real time based on the users' sessions extracted from weblogs. As a result, they can effectively recommend, for each group, suitable web links or products that the users may be interested in. In another study of Das et al. [62], the user access pattern was extracted by using a model called the Path Analysis Model. Specifically, the model provides a count of the number of times that a link appears in the log and then applies some association rules to understand the user navigation. Based on that, they could improve the impressiveness of the website.
In the predictive domain, many efforts have been made to address the problem of predicting user behavior. Recently, Neelima et al. [63] attempted to analyze user behavior based on the amount of time that users spend on a particular page. The user sessions extracted from weblogs are also used for predicting purposes. In a study of Wang et al. [64], an Unsupervised Clickstream Clustering model was proposed to capture dominant user behavior from clickstream data. Unexpectedly, the authors found that the model was also able to predict future user dormancy. In addition, some studies related to predicting the popularity of online contents will be discussed in a later section.
3.2 Time Series Prediction
In the last decade, time series prediction algorithms have also been extensively researched and applied to solve many critical problems across various areas, for instance, financial market prediction [5], weather forecasting [6], predicting the next frame of a given video [7], complex dynamical system analysis [8], and so on.
Basically, time series prediction is the use of a model to predict future values based on previously observed values. In other words, predicting involves fitting models on historical data and then applying them to forecast the future values of the input sequences. Time series prediction can be considered a basic problem in many domains, with wide-ranging and high-impact applications. For a long time since the well-known autoregressive moving average (ARMA) model was first proposed, the model and its variants [65], [28] have been proven to be effective in various real-world applications. However, these models are not able to model non-linear relationships or differentiate among input sequences. Then, the nonlinear autoregressive exogenous (NARX) [66] approach was proposed to address this problem. Over time, to make the approach more flexible, many improvements have been made; for example, Gao et al. [29] proposed a nonlinear autoregressive moving average with exogenous inputs (NARMAX) model to improve predictive performance using a fuzzy neural network, and Diaconescu et al. [67] exploited NARX to make predictions on chaotic time series. Here, the basic idea is still to utilize a neural network to learn the non-linear relationships mapping the previous values of the input sequences to the target sequences.
Although many other efforts have been made to address the time series prediction problem, such as kernel methods [68], ensemble methods [69], Gaussian processes [70], and so on, most of these approaches employ a predefined non-linear form. Thus, they may not appropriately capture the actual non-linear relationships among the input series [71]. The development of deep learning has resulted in many superior neural network models, including Recurrent Neural Networks (RNNs), a type of deep neural network that is successfully applied in sequence modeling. RNNs have, however, been shown to struggle with long-term dependencies because of the vanishing gradient problem.
To partially overcome this drawback, the Long Short-Term Memory unit, also known
as LSTM, was proposed in Hochreiter's study [39], and the gated recurrent unit (GRU) was proposed in Cho's study [40]. They achieved substantial success in various applications in the field of neural machine translation (NMT). Recently, Qin et al. [71] successfully applied the advances of LSTM as well as the attention mechanism to address the time series prediction problem. Although achieving a new state of the art in time series prediction, the Dual-stage Attention-based Recurrent Neural Network (DA-RNN) relies heavily on the LSTM, which contains enormous recursive computations.
3.3 Predicting content popularity
The field of predicting online content popularity was pioneered by the initiative of Szabo et al. [2]. The study showed evidence of a strong linear correlation between the long-term popularity and the early popularity on the logarithmic scale. Based on that, the authors proposed a simple log-linear model to predict the overall popularity of a given online content via the early observations. The proposed model was evaluated on various datasets, including Youtube videos [16], Digg stories [2], and so on. Inspired by this idea, Pinto et al. [3] provided two enhanced models called the Multivariate Linear model and the MRBF model. Using daily samples of content popularity measured up to a given reference date, these models are able to make predictions with reasonable accuracy on the Youtube dataset [16].
In recent research, Li et al. [4] introduced a novel model that is able to capture the popularity dynamics based on the early popularity evolution pattern as well as popularity burst detection. In this work, the authors consider not only some basic early popularity measurements but also the characteristics of individual videos and the popularity evolution patterns as the input of their model. In addition to the regression-
based method, some other techniques such as reservoir computing [73] and time series analysis [74] are also applied to improve the performance.
Despite achieving initial results, the aforementioned studies mainly focused on predicting the long-term popularity of the given content. To address the problem in both the long term and the short term, the recent works of Hushchyn [27] and Meoni [75] proposed some simple artificial neural networks (ANNs) to predict the popularity of scientific datasets at the Large Hadron Collider at CERN. Since these models are not robust enough to accurately predict the popularity of a particular item, they are mainly used for classification purposes. Therefore, accurately predicting the popularity of online contents in the near future is not a trivial task.
In summary, this Chapter provides an overview of some studies that have been done on extracting knowledge from log files as well as predicting the popularity of contents. As those studies provide an outline as well as insights into the research domains, they can be exceedingly applied to address the problems proposed in this thesis. Based on that, the next Chapter will discuss the processing and analysis of some real datasets, namely the weblog of the HCMUT website, MovieLens, and Youtube.
CHAPTER 4: LOGFILE PROCESSING & ANALYSIS
4.1 Weblog of HCMUT Website
Basically, weblogs are categorized into three groups, which are client log files, proxy log files, and server log files. The HCMUT log file is a server log recording all the sites users accessed while interacting with the HCMUT website from November 2016 to the end of October 2017. To extract useful knowledge from this data, I apply a web usage mining process which contains data pre-processing and some pattern analysis. In this dataset, I consider the number of accesses as the popularity of a particular content.
4.1.1 Data Pre-processing
Data pre-processing is an essential task in every data mining application as it takes responsibility for generating data in a suitable format, where statistics and mining algorithms can be applied. In fact, log data always contains a lot of meaningless information and noise, so data pre-processing may occupy about 80% of the mining process, according to Z. Pabarskaite [77]. Moreover, as weblogs are collected from multiple sources through multiple channels, they usually have inconsistent formats.
Table 4.1 Structure of HCMUT weblog (ECLF):
- Identity ("-"): the hyphen indicates that the requested piece of information is not available; it would be determined by identity on the client's machine.
- Request ("GET /apache pb.gif HTTP/1.0"): the request line from the client is given in double quotes and contains a great deal of useful information. First, the method used by the client is GET; second, the client requested the resource /apache pb.gif; and third, the client used the protocol HTTP/1.0.
- Size: the entry indicates the size of the object returned to the client, not including the response headers.
- Referrer ("http://www.example.com/start.html"): the "Referrer" (sic) HTTP request header. This gives the site that the client reports having been referred from.
- User-Agent ("Mozilla/4.08 [en] (Win98; I ;Nav)"): the User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.
Since the HCMUT weblogs are collected from a single server, all the records are stored following the Extended Common Log Format (ECLF), a semi-structured format. The structure of each record is described in Table 4.1 [78].
§ Data cleaning:
As the first step of data pre-processing, it reduces the size of the dataset significantly. Fundamentally, in this step, all the meaningless records are removed from the data. Table 4.2 [78] provides some examples of eliminations; all the red records are removed from the dataset. Specifically, requests referring to files such as style files and images may not provide useful knowledge in my case study, hence they are all eliminated. Moreover, the failed requests whose response status indicates an error are also removed. The experimental measurement has shown that the cleaning process reduced the size of the data by about 46%, which significantly lowers the computation cost for the next steps.
Table 4.2 Example of data cleaning:

No.  Request                                              Referrer
5    /includes/css/hcmut/images/tintuc/top menu bg.png    www.hcmut.edu.vn/includes/css/hcmut/welcome.css
6    /includes/css/hcmut/nivo slider.css                  www.hcmut.edu.vn/vi/
7    /vi/newsletter/view/su-kien                          www.hcmut.edu.vn/vi/newsletter/
…
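A small sketch of cleaning rules in the spirit of this step is given below: requests for static resources and failed requests are filtered out. The regular expression and the status-code rule are illustrative assumptions, not the exact rules applied to the HCMUT weblog.

```python
import re

# Drop requests for static resources (style sheets, images, scripts) and
# requests whose response status indicates a failure.
STATIC_RESOURCE = re.compile(r"\.(css|js|png|jpe?g|gif|ico)(\?.*)?$", re.IGNORECASE)

def keep_record(request_path, status_code):
    if STATIC_RESOURCE.search(request_path):
        return False                 # style files, images, scripts, ...
    if status_code >= 400:           # failed requests
        return False
    return True

records = [
    ("/includes/css/hcmut/nivo slider.css", 200),
    ("/vi/newsletter/view/su-kien", 200),
    ("/vi/missing-page", 404),
]
cleaned = [r for r in records if keep_record(*r)]
print(cleaned)   # only the newsletter access survives
```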
§ Data reduction:
Similar to data cleaning, this step can substantially reduce the size of the dataset. Since I mainly consider the popularity of a given content, in other words, its number of accesses, I only preserve the several attributes which may be interesting in the mining process, for example, IP address, Date Time and Referrer. As most of the contents on the HCMUT website are public, most of the requests referring to these contents do not contain any authentication information. Relying only on the IP address is not enough to determine the actual user access, since each access may produce many lines in the weblog. Hence, attributes like Date Time and Referrer are of great importance to further investigate the access patterns.
§ Access Identification:
Basically, each pageview on the website is always a collection of web objects and resources. In fact, a pageview usually represents a specific user event, such as clicking on a link, opening an options panel, accessing content, and so on. That is why each access often corresponds to several lines in the log files. Thus, the access identification process is applied to the weblog to determine the actual user accesses. In addition, each request to an object or resource is represented in the Uniform Resource Identifier (URI) format, which is described in Fig 1 [78].
Fig 1 URI Structure (example: http://www.hcmut.edu.vn/vi/newsletter/view/su-kien/?tag=department&offices=newest#top)
By eliminating the query and fragment, I obtain a dataset as shown in Table 4.3 [78]. Again, I remove all red records as they represent the listed accesses. In particular, all requests are grouped by IP address and sorted by Date Time. Within each group, a request r is considered a new access A_i if it satisfies one of the two following conditions: (1) the Referrer of r is different from the Referrer of the access A_{i-1} (the previous access), or (2) t_1 - t_2 > θ, where t_1 is the timestamp of r, t_2 is the timestamp of
A_{i-1}, and θ is a threshold determined by experiment. Otherwise, r would be removed from the dataset.
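These two conditions can be sketched as follows. The 10-minute threshold θ and the dictionary-based record layout are assumptions chosen for illustration only, since the actual θ was determined by experiment.

```python
from datetime import timedelta

THETA = timedelta(minutes=10)   # illustrative threshold, not the experimental value

def identify_accesses(requests):
    """requests: list of dicts with keys 'ip', 'time' (datetime), 'referrer'."""
    accesses = []
    by_ip = {}
    # Group by IP address and sort by timestamp.
    for r in sorted(requests, key=lambda r: (r["ip"], r["time"])):
        by_ip.setdefault(r["ip"], []).append(r)
    for ip, reqs in by_ip.items():
        last = None
        for r in reqs:
            new_access = (
                last is None
                or r["referrer"] != last["referrer"]      # condition (1)
                or r["time"] - last["time"] > THETA       # condition (2)
            )
            if new_access:
                accesses.append(r)    # r starts a new access A_i
                last = r
            # otherwise r belongs to the current access and is dropped
    return accesses
```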
§ Data Transformation:
Fundamentally, data transformation is the process that converts data from one structure into other structures where the mining algorithms can be applied. Since the weblog of the HCMUT website includes the IP addresses of the users, it is simple to get more information about the geographic locations of the users accessing the website. Toward this goal, I apply the Client IP Location Lookup model [78] to produce three more attributes, which are the country, the city and the Internet Service Provider (ISP). The details of the model are described in Fig 2.
Table 4.3 Example of Access Identification
Initially, all client IPs are considered unknown IPs. They are looked up one by one by means of several public APIs such as https://extreme-ip-lookup.com/json/ and http://ip-api.com/json/. The IP addresses found are then updated into an offline database for later use. Based on that, I not only save time when looking
up massive numbers of duplicated IP addresses but also attain an enormous database of IPs and locations. Besides, there are some private IPs that cannot be looked up; they are stored in a separate database to be updated manually.
Fig 2 Client IP Lookup Model
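A minimal sketch of this lookup model is shown below: a flat JSON file stands in for the offline database, and http://ip-api.com/json/ (one of the public APIs mentioned above) is queried only for IPs that are not cached yet. The response fields used here (country, city, isp) follow that API's documented JSON format; the cache file name is an assumption.

```python
import json
import os
import requests

CACHE_FILE = "ip_locations.json"    # stands in for the offline database

def load_cache():
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}

def lookup_ip(ip, cache):
    """Return (country, city, isp) for an IP, consulting the cache first."""
    if ip in cache:
        return cache[ip]
    resp = requests.get(f"http://ip-api.com/json/{ip}", timeout=5).json()
    if resp.get("status") == "success":
        cache[ip] = (resp["country"], resp["city"], resp["isp"])
    else:
        cache[ip] = (None, None, None)   # private or unresolvable IPs
    return cache[ip]

cache = load_cache()
print(lookup_ip("8.8.8.8", cache))
with open(CACHE_FILE, "w") as f:
    json.dump(cache, f)                  # persist the offline database
```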
§ Data Normalization:
Since the data obtained from the data transformation process is derived from many different sources, it is often unreliable and has inconsistent formats. To address this problem, data normalization is commonly applied as a supplementary process in the data pre-processing task. Table 4.4, Table 4.5 and Table 4.6 show several examples of normalizing data [78].
Table 4.4 Country Normalization
Table 4.5 City Normalization
Table 4.6 ISP Normalization:
1  Vietnam Post and Telecom Corporation           VNPT-VN
2  Vietnam Post and Telecom Corporation
3  Vietnam Posts and Telecommunications Group     VNPT-VN
5  Vietnamobile Telecommunications Joint Stock
6  Vietnamobile Telecommunications Joint Stock
…
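A simple way to implement such normalization is a lookup table of known aliases, as sketched below. The VNPT entries mirror Table 4.6; the remaining mappings are made-up examples rather than the actual normalization rules used on the HCMUT data.

```python
# Map the many spellings returned by the lookup APIs onto one canonical label.
ISP_ALIASES = {
    "vietnam post and telecom corporation": "VNPT-VN",
    "vietnam posts and telecommunications group": "VNPT-VN",
}

COUNTRY_ALIASES = {
    "viet nam": "Vietnam",       # illustrative alias
    "vietnam": "Vietnam",
    "united states": "US",       # illustrative alias
}

def normalize(value, aliases):
    key = value.strip().lower()
    # Fall back to the original value when no canonical form is known.
    return aliases.get(key, value.strip())

print(normalize("Vietnam Posts and Telecommunications Group", ISP_ALIASES))  # VNPT-VN
print(normalize("Viet Nam", COUNTRY_ALIASES))                                # Vietnam
```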
In short, the whole data pre-processing procedure is summarized in Fig 3. As a result, the processed dataset contains more than 74 million records, which represent about 10 million accesses to the HCMUT website over a year. Moreover, I also attain more than 780 thousand different IP addresses and their locations stored in the offline database.
Fig 3 Data Pre-processing process
4.1.2 Pattern Analysis
In order to discover user access patterns, several basic statistics are applied to the processed data. For instance, Fig 4 shows the distribution of user accesses all over the world. Though receiving attention from more than a hundred countries, the numbers of user accesses among those countries have vast differences. For further investigation, the proportions coupled with the numbers of accesses from some nations relative to the globe are also described in Fig 5.
Fig 4 The global distribution of user access to the HCMUT website
It is unsurprising that Vietnam is the country visiting the HCMUT website the most; it accounts for about 76.092% of the total number of accesses. With 15.934% of the total number of accesses, the US is the second country with the highest number of accesses to the HCMUT website, whereas Taiwan takes 3.138% of the total number of accesses and becomes the third largest source of visits to the website. The explanation for this is that HCMUT currently has a lot of cooperation programs with these two countries. Moreover, these metrics can be used to demonstrate the effectiveness of HCMUT's cooperation with partners in these countries as well.
Similarly, Fig 6 shows the distribution of accesses within all provinces of Vietnam, while Fig 7 provides the exact number of accesses coupled with the access proportions of Ha Noi and Ho Chi Minh City in comparison to other places. It can be seen that most of the accesses come from Ho Chi Minh City, with 73.497% of the total number of accesses. Meanwhile, Ha Noi accounts for 13.943%, a bit higher than the total number of visits from all other provinces combined, about 12.560%.
Considering the ISPs, the number of ISPs whose users access the HCMUT website is about 33. As observed from Fig 8, the markets for