UNIVERSITY OF SCIENCE AND TECHNOLOGY
PHAN PHUONG LAN
RECOMMENDATION SYSTEMS BASED ON STATISTICAL IMPLICATIVE MEASURES
Specialization: Computer Science
Code: 9480101
DOCTORAL THESIS SUMMARY
Danang – 2019
UNIVERSITY OF SCIENCE AND TECHNOLOGY - UNIVERSITY OF DANANG
Academic Instructors:
1 Huynh Xuan Hiep, Assoc Prof., PhD
2 Huynh Huu Hung, PhD
Opponent 1:……… ……… Opponent 2:……… ……… ……… Opponent 3:……… …… ………
The dissertation will be defended before the Board of thesis review Meeting at: ………
At hour day month year
The dissertation is available at:
- National Library
- Information and Learning Center, University of Da Nang
PREFACE
1 The urgency of the thesis
The recommendation system (RS) is considered one of the effective solutions to the information explosion problem because it can automatically analyze data to predict the ratings of a user for products, services, etc., and thereby recommend to that user the list of items with the highest predicted ratings. The main techniques used to build an RS are content-based, collaborative filtering, knowledge-based, and hybrid methods. Among them, collaborative filtering is the most important and most commonly used technique. Proposing and improving recommendation models to adapt to the diversity of application areas, the differences in user requirements and the development of technology has always been the main research direction on RSs. Applying the statistical implicative analysis (SIA) method to other research fields is one of the most interesting topics. Not much research links that method to RSs, and the existing research still has some unresolved issues: it focuses only on building models on binary data and pays no attention to non-binary data; it focuses only on the accuracy of the recommended good items when evaluating RSs; it uses association rules to make the recommendation, so the recommendation time may be long and the computer may be overloaded; and it does not consider combining the characteristics of statistical implicative measures to improve the recommendation accuracy. Therefore, the PhD thesis "Recommendation systems based on statistical implicative measures" is conducted to contribute a small part to the research field on RSs and SIA.
2 Objectives, objects and scope of research of the thesis
2.1 Research objectives
The objective of the thesis is to understand and apply the statistical implicative measures and the collaborative filtering technique to propose recommendation models as well as to improve the accuracy of the proposed models. Thereby, the thesis contributes to linking the SIA method to the research on RSs.
2.2 Research objects
The two main objects of the study are statistical implicative measures, and recommendation models based on statistical implicative measures and the collaborative filtering technique.
2.3 Research scopes
The scope of the study is: to gain an understanding of the statistical implicative measures, the collaborative filtering technique, and the existing studies on RSs using the SIA method; and to propose new recommendation models that can be applied to both binary and non-binary data and that improve the accuracy of recommendation (the list of good items and the predicted ratings).
3 Research methodology
Literature review and experiment are the two main research methods used in this thesis.
4 Contribution of the thesis
- Firstly, two new measures developed on statistical implicative measures: (1) the k nearest neighbors/users based implicative rating - KnnUIR; and (2) the k nearest neighbors/items based implicative rating - KnnIIR. These measures are used to predict the ratings given to items by a user.
- Secondly, three new recommendation models: (1) based on the statistical implicative measures and association rules; (2) based on KnnUIR; and (3) based on KnnIIR. The proposed models can be applied to both binary and non-binary data.
- Thirdly, the Interestingness software tool, which includes the utility functions and the proposed recommendation models. This tool is developed in the R language and is used for the experiments.
- Fourthly, the DKHP binary dataset, which stores course registration data. DKHP is collected and used for evaluating the accuracy of recommendation.
CHAPTER 1 AN OVERVIEW
1.1 Statistical implicative measures
1.1.1 Definition
Statistical implicative measures (SIMs) are measures proposed by the statistical implicative analysis method. SIMs are used to detect trends in a binary or non-binary attribute set. SIMs are asymmetric, probability-based and non-linear measures.
1.1.2 Statistical implicative measures for binary data
1.1.3 Statistical implicative measures for non-binary data
1.2 Statistical implicative ratings
Statistical implicative rating measures are proposed by the thesis using some existing SIMs. We can consider these measures as SIMs. Statistical implicative rating measures are used to predict the rating of a user for an item, thereby contributing to solving recommendation problems.
1.3 Recommendation based on statistical implicative analysis
1.3.1 Recommendation systems and research directions
1.3.2 Collaborative filtering technique
1.3.2.1 Memory based methods
1.3.2.2 Model based methods
1.3.3 Evaluating recommendation systems
1.3.3.1 K-fold cross validation method
1.3.3.2 Classification accuracy metrics
1.3.3.3 Predictive accuracy metrics
1.3.3.4 Rank accuracy metrics
1.3.4 Statistical implicative analysis based recommendation
1.3.4.1 Existing recommendation methods
1.3.4.2 Recommendation based on statistical implicative measures
1.4 Conclusion
Chapter 1 focuses on gaining an understanding of SIMs, RSs and the accuracy metrics used for evaluating RSs. The thesis summarizes SIMs (such as implicative intensity, the entropic version of implicative intensity, cohesion, contribution) and identifies which measures should be used by RSs to improve the accuracy of the recommendation results. Chapter 1 also covers the collaborative filtering technique and the accuracy metrics used for building and evaluating recommendation models. Moreover, Chapter 1 presents the research directions on RSs as well as the existing research related to RSs based on statistical implicative analysis, then identifies the scope of the study and sketches the proposal.
CHAPTER 2 RECOMMENDATION BASED ON STATISTICAL IMPLICATIVE MEASURES AND ASSOCIATION RULES
Unlike the existing recommendation models based on statistical implicative analysis (SIA) and association rules, the model proposed in this chapter: can be applied to both binary and non-binary data; provides more SIMs (such as implicative intensity, the entropic version of implicative intensity, cohesion) to make the recommendation; and makes it possible to combine one of the above measures with the contribution measure to improve the accuracy of RSs.
2.1 Statistical implicative rules based model - SIR
The statistical implicative rules based model SIR is developed on SIMs and association rules. The proposed SIR model is shown in Figure 2.1. This model consists of:
- A finite set of users 𝑈 = {𝑢1, 𝑢2, … , 𝑢𝑛}.
- A finite set of items (e.g. products, movies, etc.) 𝐼 = {𝑖1, 𝑖2, … , 𝑖𝑚}.
- A rating matrix 𝑅 = (𝑟𝑗𝑘) of size 𝑛 × 𝑚, where 𝑗 = 1, … , 𝑛 and 𝑘 = 1, … , 𝑚, used for storing the feedback (ratings) of users on items. In binary form, 𝑟𝑗𝑘 = 1 if user 𝑢𝑗 likes the item 𝑖𝑘, and 𝑟𝑗𝑘 = 0 (or 𝑁𝐴) if 𝑢𝑗 does not like/know 𝑖𝑘. In non-binary form, 𝑟𝑗𝑘 ∈ [0,1] if 𝑢𝑗 rates 𝑖𝑘, and 𝑟𝑗𝑘 = 𝑁𝐴 if 𝑢𝑗 does not rate/know 𝑖𝑘.
- A vector 𝑅𝑢𝑎 storing the known ratings of the user 𝑢𝑎 who needs the recommendation: 𝑅𝑢𝑎 = {𝑟𝑢𝑎𝑘} where 𝑘 = 1, … , 𝑚, in which 𝑟𝑢𝑎𝑘 = 𝑁𝐴 if 𝑢𝑎 does not rate 𝑖𝑘.
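As a minimal illustration of the components listed above, the rating matrix and the vector of known ratings might be represented as follows. This is a Python sketch rather than the R implementation used by the thesis; the toy data and the names `binary_R`, `nonbinary_R` and `known_items` are hypothetical.

```python
# Hypothetical toy data: 3 users (rows) x 4 items (columns).
NA = None  # stands in for an unknown rating (NA in the text)

# Binary form: 1 = the user likes the item, 0 or NA = does not like/know it.
binary_R = [
    [1, 0, 1, NA],
    [NA, 1, 1, 0],
    [1, 1, NA, NA],
]

# Non-binary form: ratings in [0, 1], NA = the user did not rate the item.
nonbinary_R = [
    [0.8, NA, 0.4, NA],
    [NA, 1.0, 0.6, 0.2],
    [0.5, 0.9, NA, NA],
]


def known_items(ratings_row):
    """Return the item indices k whose rating is known (not NA)."""
    return [k for k, r in enumerate(ratings_row) if r is not NA]


# Vector of known ratings for the user who needs a recommendation (user 0).
R_ua = nonbinary_R[0]
print(known_items(R_ua))
```

Keeping NA distinct from 0 matters in the non-binary case, where 0 is a valid rating.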
Figure 2.1: The statistical implicative rules based model
To reduce the recommendation time, the SIR model in Figure 2.1 is improved by performing the following steps simultaneously (directly): generating association rules, representing those rules by the set of four values {𝑛, 𝑛𝑎, 𝑛𝑏, 𝑛𝑎𝑏̅}, and calculating the implicative value of those rules according to a specific SIM. We can do this by using and modifying the rchic package.
[Figure 2.1 elements: inputs (𝑢𝑎, 𝐼, 𝑅𝑢𝑎) and (𝑈, 𝐼, 𝑅); support threshold s; confidence threshold c; maximum length of a rule l; measures: implicative intensity, entropic version of implicative intensity, cohesion; ruleset {𝑎 → 𝑏 | 𝑎 ∈ 𝐼^𝑘, 𝑏 ∈ 𝐼, 𝑘 = 1, … , 𝑙 − 1} presented by the statistical implicative analysis method]
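To make the four-value representation of a rule concrete, the sketch below counts {n, n_a, n_b, n_ab̄} for a rule a → b on binary data and derives an implicative intensity from them. It is written in Python, not with the rchic package, and it assumes the normal approximation commonly used in the SIA literature: the implication index q = (n_ab̄ − n_a(n − n_b)/n) / sqrt(n_a(n − n_b)/n) and the intensity 1 − Φ(q). The summary does not state which probabilistic model the thesis uses, so this choice, the function names and the toy data are all assumptions.

```python
import math


def rule_counts(transactions, a, b):
    """Return (n, n_a, n_b, n_ab_bar) for the rule a -> b.

    transactions: list of sets of items (binary data).
    n_ab_bar counts the counter-examples: rows containing a but not b.
    """
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_b = sum(1 for t in transactions if b in t)
    n_ab_bar = sum(1 for t in transactions if a in t and b not in t)
    return n, n_a, n_b, n_ab_bar


def implicative_intensity(n, n_a, n_b, n_ab_bar):
    """Implicative intensity of a -> b under the normal approximation."""
    expected = n_a * (n - n_b) / n  # expected number of counter-examples
    if expected == 0:
        # Degenerate case (n_a = 0 or n_b = n); returning 0 is a convention
        # assumed for this sketch.
        return 0.0
    q = (n_ab_bar - expected) / math.sqrt(expected)  # implication index
    # 1 - Phi(q), with Phi the standard normal CDF.
    return 0.5 * math.erfc(q / math.sqrt(2))


transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}, {"c"}]
counts = rule_counts(transactions, "a", "b")
print(counts, implicative_intensity(*counts))
```

Fewer counter-examples than expected under independence give a negative index and an intensity above 0.5.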
2.2 Operation of the statistical implicative rules based model
The operation of the SIR model includes two stages: building the filtered ruleset presented according to the SIA method; and performing the recommendation, as shown in Figure 2.2. To reduce the recommendation time, we can pre-build the learning model (offline).
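The online stage can be sketched as follows, assuming the offline stage has already produced a filtered ruleset with one implicative value per rule. The summary does not specify how the values of several rules pointing to the same item are aggregated; taking the maximum is an assumption of this Python sketch, and `recommend_top_n` and the toy ruleset are hypothetical.

```python
def recommend_top_n(ruleset, known, n_items):
    """Recommend up to n_items items for a user from a precomputed ruleset.

    ruleset: list of (antecedent frozenset, consequent item, implicative value).
    known: set of items the user has already rated or liked.
    """
    scores = {}
    for antecedent, consequent, value in ruleset:
        # A rule applies when the user knows every antecedent item and the
        # consequent is a new item for that user.
        if antecedent <= known and consequent not in known:
            # Assumed aggregation: keep the highest implicative value.
            scores[consequent] = max(scores.get(consequent, 0.0), value)
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:n_items]


# Offline stage (assumed precomputed): filtered rules with implicative values.
ruleset = [
    (frozenset({"i1"}), "i3", 0.91),
    (frozenset({"i1", "i2"}), "i4", 0.97),
    (frozenset({"i2"}), "i3", 0.85),
    (frozenset({"i5"}), "i6", 0.99),
]

# Online stage: a user who already knows i1 and i2 asks for 2 items.
print(recommend_top_n(ruleset, {"i1", "i2"}, 2))
```

Because the ruleset is built once offline, the online step only filters and sorts rules, which is what keeps the recommendation time short.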
Figure 2.2: The operational diagram of the SIR model
[Figure 2.2 elements: presenting rules according to SIA; making recommendation (online); recommending items with the highest implicative values; the list of top N items for 𝑢𝑎, e.g. {𝑖1, 𝑖13, … , 𝑖𝑚−2}]
2.3 Experiment
2.3.1 Data and tool
Three datasets are used for the experiment: MSWeb, MovieLens and DKHP (course registration). MSWeb and DKHP are binary datasets, and MovieLens is a non-binary dataset.
We developed the Interestingnesslab tool to conduct the experimental scenarios. In addition, some recommendation models of the recommenderlab package are used for comparison with the SIR model. These models are: the association rule based model (AR); the item based collaborative filtering model (IBCF) using the Jaccard measure; and the popular model (POPULAR). The experimental scenarios are run on computers with the following configurations: (1) Windows 8 OS, 16 GB RAM, and an Intel Pentium G630 2.7GHz processor; and (2) Windows 10 OS, 8 GB RAM, and an Intel Core i5-6200U 2.5GHz processor.
2.3.2 Evaluating the SIR model on binary data
The accuracy of the SIR model is compared with that of some existing models using the 5-fold cross validation method and the classification accuracy metrics (the Precision - Recall curve, the ROC curve and the F1 measure, which combines precision and recall). The experimental results show that:
- The simultaneous combination of steps at the learning stage (in the improved SIR model) reduces the recommendation time.
- The accuracy of the SIR model is the highest when the entropic version of implicative intensity and the contribution measure are combined to make the recommendation.
- The accuracy of the SIR model combining the entropic version of implicative intensity and the contribution measure is higher than that of the compared recommendation models (AR, POPULAR, IBCF), especially when the user requiring the recommendation is not a new user (i.e. the number of items rated by that user, the number of known ratings, is not too low).
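For reference, the classification accuracy metrics named above can be computed for a single top-N list as in the following Python sketch. The function names, the round-robin fold assignment and the toy data are assumptions for illustration; the experiments in the thesis rely on the R tooling described earlier.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 of one recommendation list.

    recommended: items proposed to the user.
    relevant: items the user actually likes (held-out ground truth).
    """
    recommended = list(recommended)
    relevant = set(relevant)
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def k_fold_indices(n_users, k=5):
    """Split user indices 0..n_users-1 into k folds (round-robin assignment)."""
    return [list(range(fold, n_users, k)) for fold in range(k)]


p, r, f1 = precision_recall_f1(["i1", "i2", "i3", "i4"], {"i2", "i4", "i9"})
print(p, r, f1)
print(k_fold_indices(7, k=5))
```

In k-fold cross validation each fold serves once as the test set while the remaining folds are used for learning, and the metrics are averaged over the k runs.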
2.3.3 Evaluating the SIR model on non-binary data
- The accuracy of the SIR model is the highest when the entropic version of implicative intensity and the contribution measure are combined and the user does require many recommended items. In reality, the user will be confused by a lot of recommended items.
- The accuracy of the SIR model is higher than that of POPULAR, a recommendation model based on the most popular items.
2.4 Conclusion
Chapter 2 proposes the statistical implicative rules based model SIR, which can be applied to both binary and non-binary data, and improves the proposed model to reduce the recommendation time. The ruleset represented by a set of four values can be pre-built offline and used online when someone needs a recommendation. The SIR model provides many SIMs and can be extended with other objective interestingness measures. The SIR model is coded and integrated in the Interestingnesslab tool. The accuracy of the SIR model is evaluated: by classification accuracy metrics such as the ROC curve, the Precision - Recall curve and the F1 measure; on two types of data: binary (MSWeb, DKHP) and non-binary (MovieLens); and according to two groups of scenarios: internal comparison (using the same SIR model but different SIMs) and external comparison (the SIR model against some existing recommendation models: AR, POPULAR and IBCF). The experimental results show that the SIR model should: (1) combine the entropic version of implicative intensity with the contribution measure to make the recommendation; and (2) be used to build RSs because its accuracy is higher than that of the compared models.
CHAPTER 3 RECOMMENDATION BASED ON USERS IMPLICATIVE RATING MEASURE
The SIR model of Chapter 2 uses association rules and SIMs to recommend the list of good items to users. When the number of rules is too large, the SIR model and the existing models that are also based on SIA and association rules face some disadvantages: the recommendation time may be long if the learning stage is performed online, and the computer may be overloaded. Therefore, the thesis pays attention to the rules of length 2 to overcome those disadvantages. Besides, the rating of a user 𝑢𝑎 for an item 𝑖 may be similar to the ratings given to 𝑖 by the nearest users of 𝑢𝑎, and the item 𝑖 may contribute to the relationship of 𝑢𝑎 and his/her nearest user 𝑢𝑗. As a result, the thesis combines the above characteristics to improve the accuracy of recommendation.
3.1 KnnUIR Definition
The k nearest neighbors (i.e. users) based implicative rating measure 𝐾𝑛𝑛𝑈𝐼𝑅 is proposed to predict the rating given by a user 𝑢𝑎 for an item 𝑖 ∈ 𝐼. The purpose of this proposal is to increase the recommendation accuracy. 𝐾𝑛𝑛𝑈𝐼𝑅 - defined by (3.1) - is based on: (1) the number of nearest users of 𝑢𝑎 - 𝑘𝑛𝑛 (the nearest neighbors 𝑢𝑗 are identified by the implicative intensities of 𝑢𝑎 and 𝑢𝑗); (2) the ratings of item 𝑖 given by those neighbors - 𝑟𝑢𝑗𝑖; and (3) the typicality of 𝑖 contributing to the relationship of 𝑢𝑎 and 𝑢𝑗 - 𝛾(𝑖, 𝑢𝑎 → 𝑢𝑗). The value of 𝐾𝑛𝑛𝑈𝐼𝑅(𝑢𝑎, 𝑖) has to be transformed to the range [0, 1], the same scale as the elements of the rating matrix.
$$KnnUIR(u_a, i) = \sum_{j=1}^{knn} r_{u_j i} \cdot \gamma(i, u_a \rightarrow u_j) \qquad (3.1)$$
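One way to read the three ingredients of the measure is sketched below in Python. It assumes that each neighbor's rating is weighted by the typicality value and summed over the knn neighbors, and that the transformation to [0, 1] is a min-max scaling across the candidate items. Both choices are assumptions of this sketch, not confirmed details of the thesis; the typicality values are taken as given inputs, and all names and numbers are hypothetical.

```python
def knn_uir_raw(neighbor_ratings, typicalities):
    """Aggregate neighbor ratings weighted by typicality (assumed form).

    neighbor_ratings: ratings of item i by the knn nearest users (None = NA).
    typicalities: typicality of i for the relationship with each neighbor.
    """
    total = 0.0
    for rating, gamma in zip(neighbor_ratings, typicalities):
        if rating is not None:  # neighbors who did not rate i are skipped
            total += rating * gamma
    return total


def min_max_scale(values):
    """Transform raw scores to [0, 1] (assumed transformation)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Raw scores of three candidate items for one user, with knn = 3.
raw = [
    knn_uir_raw([1.0, 0.5, None], [0.9, 0.8, 0.7]),
    knn_uir_raw([0.2, None, 0.4], [0.5, 0.5, 0.5]),
    knn_uir_raw([0.6, 0.6, 0.6], [1.0, 0.5, 0.0]),
]
print(min_max_scale(raw))
```

After scaling, the predicted values are on the same [0, 1] scale as the rating matrix, so they can be ranked or compared with known ratings.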
3.2 Users implicative rating based model - UIR
The users implicative rating based model UIR is developed by using the proposed KnnUIR measure and the user based collaborative filtering method. The UIR model shown in Figure 3.1 has the same components as the SIR model. However, the UIR model not only predicts the rating given by a user to an item but also recommends the list of top items to a user.
Figure 3.1: The users implicative rating based model
3.3 Operation of the users implicative rating based model
The operational diagram of the UIR model is presented in Figure 3.2.
(𝑢𝑎, I, 𝑅𝑢𝑎) (U, I, R)