Computer systems now generate enormous volumes of data, much of it in the form of log files. Because log files are an informative and continuously growing resource, log mining has become an active research area, and many algorithms and techniques have been developed to address specific problems in handling, processing, and analyzing them. This thesis focuses on weblogs: it aims to extract valuable information and knowledge from them and to predict the popularity of online content from its historical accesses.
Weblogs come in various formats and may be generated by many sources, which makes processing them a challenging task. Since each dataset has its own characteristics, pre-processing the data and analyzing those characteristics are essential for proposing appropriate popularity prediction models. Within the scope of this thesis, data processing and analysis have been performed on the HCMUT weblogs, the MovieLens dataset, and the YouTube dataset. Specifically, the thesis describes the processing pipeline in detail and presents several investigations that reveal users’ preferences over time as well as the geographical distribution of the HCMUT website’s visitors. For the MovieLens dataset, additional statistics are provided to give a better understanding of how the popularity of movies within the dataset evolves.
For the YouTube data, which were collected more frequently, several measurements reveal characteristics of user access patterns in different countries, such as the periodicity of content popularity.
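As a rough illustration of how such periodicity can be detected, the sketch below looks for the first local peak in the sample autocorrelation of a view-count series. It is not the measurement procedure used in the thesis: the function names (`autocorrelation`, `dominant_period`), the hourly sampling, and the synthetic sine-shaped series are all assumptions made for this example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


def dominant_period(series, max_lag):
    """Return the first lag at which the autocorrelation has a local peak, or None."""
    r = [autocorrelation(series, lag) for lag in range(max_lag + 1)]
    for lag in range(1, max_lag):
        if r[lag] > r[lag - 1] and r[lag] > r[lag + 1]:
            return lag
    return None


# Synthetic hourly view counts with a daily (24-hour) cycle over ten days.
# Real access logs would be noisier and may need smoothing first.
hourly_views = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(240)]
period = dominant_period(hourly_views, max_lag=48)
```

On this synthetic series the detected period is 24 samples, matching the daily cycle that was built in. A series with no repeating pattern, such as a steadily increasing one, yields `None`.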
As part of the thesis, two machine learning models are proposed to address the problem of predicting online content popularity. The first model, Derivative-based Multivariate Linear Regression, is built on Taylor’s expansion and linear regression, while the second model, Attention-based Non-Recursive Neural Network, combines two state-of-the-art attention mechanisms that are widely applied in the field of natural language processing. The experimental results show that the two new models outperform several baseline methods overall and that inference time is also significantly improved.
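To make the idea of combining a Taylor expansion with linear regression more concrete, the sketch below fits a least-squares model on finite-difference features. It is a simplified, univariate illustration and not the exact formulation of the thesis model: the feature set (current value, first difference, second difference), the function names, and the synthetic quadratic series are assumptions made for this example. The motivation is that a second-order Taylor expansion writes x(t + h) as a weighted sum of x(t), x'(t), and x''(t); here the weights are learned from data instead of being fixed.

```python
def derivative_features(series, t):
    """Return [value, first difference, second difference] at index t (t >= 2)."""
    first = series[t] - series[t - 1]
    second = series[t] - 2 * series[t - 1] + series[t - 2]
    return [series[t], first, second]


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit(series, horizon):
    """Least-squares weights mapping features at t to series[t + horizon]."""
    indices = range(2, len(series) - horizon)
    rows = [derivative_features(series, t) for t in indices]
    targets = [series[t + horizon] for t in indices]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, series, t):
    """Forecast series[t + horizon] from the features at index t."""
    return sum(w * f for w, f in zip(weights, derivative_features(series, t)))


# Synthetic cumulative view counts with quadratic growth (hypothetical data).
views = [float(t * t) for t in range(20)]
weights = fit(views, horizon=3)
forecast = predict(weights, views, t=19)  # estimate of the value at t = 22
```

For this quadratic series the relation is exactly linear in the three features, so the fitted weights come out close to 1, 3, and 6 and the forecast is close to 22² = 484. On real view counts the fit would only be approximate, and a multivariate variant would add features from related series.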
Because real data always contain noise, there are cases in which the proposed models cannot deliver reasonable performance. Moreover, some newly uploaded videos attract the attention of a large number of users within a short period after upload; since their historical records are inadequate, the proposed models may not be able to make reliable predictions for them. However, there are grounds for optimism: computational capabilities keep increasing, and the rapid development of machine learning and deep learning has produced many superior mechanisms and techniques.
Therefore, improving the existing models to overcome these limitations is a worthwhile direction for future work. As each dataset has its own characteristics, it is hard to propose a single general model that makes effective predictions on all datasets. This work is intended as a first step toward a framework that contains several superior models and predicts the popularity of online content across different datasets based on an analysis of their characteristics.
In summary, since the amount of data, especially video, uploaded to the Internet is growing exponentially, knowing the future popularity of content would benefit a wide range of applications. The results of this thesis can therefore be applied to real-world problems such as improving content distribution and cache placement policies. These applications leave room for further improvement and integration of this study.
LIST OF PUBLISHED ARTICLES
This chapter lists the related papers published in, or submitted to, international conferences over the past two years.
[6] Minh-Tri Nguyen, Duong H. Le, Takuma Nakajima, Masato Yoshimi, and Nam Thoai, “Attention-based Neural Network: A Novel Approach for Predicting the Popularity of Online Content” in The IEEE 21st International Conferences on High Performance Computing and Communications (HPCC), Zhangjiajie, China, 2019.
[5] Anh-Tu Ngoc Tran, Minh-Tri Nguyen, Thanh-Dang Diep, Takuma Nakajima, and Nam Thoai, “Optimizing Color-Based Cooperative Caching in Telco-CDNs by Using Real Datasets” in The 13th International Conference on Ubiquitous Information Management and Communication (IMCOM), Phuket, Thailand, 2019.
[4] Anh-Tu Ngoc Tran, Minh-Tri Nguyen, Thanh-Dang Diep, Takuma Nakajima, and Nam Thoai, “A Performance Study of Color-Based Caching in Telco-CDNs by Using Real Datasets” in The 9th International Symposium on Information and Communication Technology (SoICT), Da Nang, Vietnam, 2018.
[3] Minh-Tri Nguyen, Thanh-Dang Diep, Tran Hoang Vinh, Takuma Nakajima, and Nam Thoai, “Analyzing and Visualizing Web Server Access Log File” in The 5th International Conference on Future Data and Security Engineering (FDSE), Ho Chi Minh City, Vietnam, 2018.
[2] Anh-Tu Ngoc Tran, Huu-Phu Nguyen, Minh-Tri Nguyen, Thanh-Dang Diep, Nguyen Quang-Hung and Nam Thoai, “pyMIC-DL: A Library for Deep Learning Frameworks Run on the Intel Xeon Phi Coprocessor” in The IEEE 20th International Conferences on High Performance Computing and Communications (HPCC), Exeter, United Kingdom, 2018.
[1] Thanh-Dang Diep, Minh-Tri Nguyen, Nhu-Y Nguyen-Huynh, Minh Thanh Chung, Manh-Thin Nguyen, Nguyen Quang-Hung, and Nam Thoai, “Chainer-XP: A Flexible Framework for Artificial Neural Networks Run on the Intel Xeon Phi Coprocessor” in The 7th International Conference on High Performance Scientific Computing Simulation, Modeling and Optimization of Complex Processes (HPSC), Hanoi, Vietnam, 2018. (Currently under review)