DSpace at VNU: Dealing with the new user cold-start problem in recommender systems: A comparative review



Dealing with the new user cold-start problem in recommender systems: A comparative review


VNU University of Science, Vietnam National University, Vietnam

Article info

Article history:

Received 15 September 2014

Received in revised form 6 October 2014

Accepted 7 October 2014

Recommended by D. Shasha

Keywords:

Collaborative filtering

NHSM

New user cold start

Recommender systems

Abstract

The Recommender System (RS) is an efficient tool for decision makers that assists in the selection of appropriate items according to their preferences and interests. This system has been applied to various domains to personalize applications by recommending items such as books, movies, songs, restaurants, news articles and jokes, among others. An important issue for the RS that has greatly captured the attention of researchers is the new user cold-start problem, which occurs when a new user has registered with the system and no prior rating of this user is found in the rating table. In this paper, we first present a classification that divides the relevant studies addressing the new user cold-start problem into three major groups and summarize their advantages and disadvantages in a tabular format. Next, some typical algorithms of these groups, such as MIPFGWC-CS, NHSM, FARAMS and HU–FCF, are described. Finally, these algorithms are implemented and validated on some benchmark RS datasets under various settings of the new user cold start. The experimental results indicate that NHSM achieves better accuracy and computational time than the relevant methods.

Contents

1 Introduction

2 Literature review

3 The analysis of existing methods

3.1 MIPFGWC-CS

3.2 NHSM

3.3 FARAMS

3.4 HU–FCF

4 Experiments

4.1 Environment setup

4.2 Results and discussion

5 Conclusions

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/infosys

Information Systems

http://dx.doi.org/10.1016/j.is.2014.10.001

* Tel.: +84 904171284; fax: +84 438623938.

E-mail addresses: sonlh@vnu.edu.vn , chinhson2002@gmail.com

1 Official address: 334 Nguyen Trai, Thanh Xuan, Hanoi, Vietnam.


Acknowledgments

Appendix A Supporting information

References

1 Introduction

The growing development of content-based systems that provide a large amount of data, such as videos, images, blogs, multimedia, and wikis, brings great challenges for analysts attempting to extract useful knowledge and capture meaningful events from the massive data. Machine learning tools should indeed be oriented to what users intend to do and how they want the results to be returned in a given format. An efficient and currently widely used tool that assists decision makers in choosing appropriate items according to their preferences and interests is the Recommender System (RS). Ricci et al. [31] defined the RS as a special type of information system that (i) helps to make choices without sufficient personal experience of the alternatives, (ii) suggests products to customers, and (iii) provides consumers with information to help them decide which products to purchase. The RS is based on a number of technologies, such as information filtering, classification learning, user modeling and adaptive hypermedia, and it is applied to various domains to personalize applications by recommending items such as books, movies, songs, restaurants, news articles and jokes, among others. It has been applied to e-commerce to learn from a customer and recommend products that he or she will find most valuable from among the available products, thus helping the customer find suitable products to purchase. Some e-commerce RSs are described in [37,22]. For example, Amazon.com is the most famous e-commerce RS, structured with an information page for each book that provides details of the text and purchase information. Two recommendations are found there: books frequently purchased by customers who purchased the selected book, and authors whose books are frequently purchased. eBay.com is another example that provides the Feedback Profile feature, which allows both buyers and sellers to contribute to the feedback profiles of other customers with whom they have done business. The feedback consists of a satisfaction rating and a specific comment about the other customer. On Moviefinder.com, customers can locate movies with a similar "mood, theme, genre or cast" through Match Maker or by their previously indicated interests through WePredict. Clearly, RSs are becoming increasingly important and influential in various practical applications.

An important issue for RSs that has greatly captured the attention of researchers is the cold-start problem. This problem has two variants: the new user cold-start problem and the new item cold-start problem. The new item cold-start problem occurs when a new item is added to the system. Because it is a new product, it has no user ratings (or the number of ratings is less than a threshold, as defined in some equivalent papers) and is therefore ranked at the bottom of the recommended items list. Moreover, this problem can be partially handled by having staff members of the system provide prior ratings for the new item. Thus, research on the cold-start problem concentrates on the new user cold-start problem, in which no prior rating can be made owing to the privacy and security of the system. It is difficult to predict the rating of a specific item for a new user because the basic filtering methods in RSs, such as collaborative filtering and content-based filtering, require the historic ratings of this user to calculate the similarities that determine the neighborhood. For this reason, the new user cold-start problem can negatively affect the recommender performance due to the inability of the system to produce meaningful recommendations [33]. Addressing this problem has been the primary focus of various studies in recent years.

The aim of this paper is to provide a comparative review of those studies that could answer our research question: "which (group of) algorithm is the most effective among all?" For this purpose, we first provide a classification that divides the relevant studies into three groups according to whether a method (i) makes use of additional data sources; (ii) selects the most prominent groups of analogous users; or (iii) enhances the prediction using hybrid methods. A table that summarizes the advantages and disadvantages of all groups of methods is presented. Second, some typical algorithms of the groups of methods, such as MIPFGWC-CS [46] (the first group), NHSM [20] (the second group), and FARAMS [17] and HU–FCF [42] (the third group), are described in detail. Finally, these algorithms are implemented and validated on some benchmark RS datasets, such as MovieLens [23] and Jester [12], under various settings of the new user cold start. The experimental results could reveal the answer to our research question stated above.

The remainder of the paper is organized as follows. In Section 2, we present a literature review of the relevant studies according to the three aforementioned groups. Section 3 elaborates on the four typical methods, namely, MIPFGWC-CS [46], NHSM [20], FARAMS [17] and HU–FCF [42]. Section 4 presents the comparative experiments of these algorithms involving benchmark RS datasets. Finally, Section 5 gives the conclusions and outlines further research directions.

2 Literature review

This section begins with an example that demonstrates the new user cold-start problem.

Example 1. We have an RS that includes three tables: the users' demographic data (Table 1), the movies' information (Table 2) and the ratings (Table 3). This type of system is able to predict the user rating of a movie, which is expressed in Table 3. Nonetheless, the new user cold-start problem occurs with a new user, e.g., Kim (User ID: 6) in Table 1, who has no prior rating, such that it is difficult to provide a prediction even for the first movie, e.g., Titanic (ID: 1).
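To make Example 1 concrete, the following minimal Python sketch uses a small hypothetical rating table (the values are invented for illustration and are not those of Table 3) together with a plain Pearson similarity over co-rated items. For a user with no ratings, such as user 6, there are no co-rated items, so no similarity and hence no neighborhood can be computed.

```python
from math import sqrt

# Hypothetical ratings: user ID -> {movie ID: rating}. User 6 is the new user.
ratings = {
    1: {1: 5, 2: 3, 3: 4},
    2: {1: 4, 2: 2, 3: 5},
    6: {},  # new user: no prior ratings
}


def pearson(u, v):
    """Pearson correlation over co-rated items, or None if it is undefined."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return None  # too few co-rated items to correlate
    ru = [ratings[u][i] for i in common]
    rv = [ratings[v][i] for i in common]
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    num = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    den = sqrt(sum((a - mu) ** 2 for a in ru)) * sqrt(sum((b - mv) ** 2 for b in rv))
    return num / den if den else None


print(pearson(1, 2))  # a usable similarity between two existing users
print(pearson(6, 1))  # None: the new user cannot be compared with anyone
```

This is why the methods reviewed below either bring in other data about the new user or change how the neighborhood is determined.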


In the following, we briefly summarize the relevant works regarding the new user cold-start problem. As of 2014, various works have aimed to handle this problem. Those studies can be divided into three categories according to whether a method (i) makes use of additional data sources, (ii) chooses the most prominent groups of analogous users, or (iii) enhances the prediction using hybrid methods.

a) The principal idea of the first group is the use of some additional sources, such as the demographic data (a.k.a. the users' profile), the users' opinions, and social tags, for a better selection of the neighbors of the new user. Vozalis and Margaritis [49] demonstrated a modified version of k-nearest neighborhood by adding a user demographic vector to the user profile and embedding it in the collaborative filtering algorithm for the calculation of similarity. Poirier et al. [27] proposed a method that exploits blog textual data to reduce the cold-start problem by labeling subjective texts according to their expressed opinions to construct a user–item-rating matrix and establishing recommendations through collaborative filtering. Zhang et al. [53] presented a recommendation algorithm that makes use of social tags, particularly user-tag-object tripartite graphs, to provide more personalized recommendations when the assigned tags belong to diverse topics. Almazro et al. [3] introduced a hybrid demographic-based and collaborative filtering approach on the movie domain using demographic data to enhance the recommendation suggestion process. Their method classified the genres of movies based on demographic attributes, e.g., user age (child, teenager or adult), student (yes or no), have children (yes or no) and gender (female or male). Preisach et al. [28] argued that many user profiles contain untagged resources that could provide valuable information, especially for the cold-start problem, and proposed a purely graph-based semi-supervised relational approach that uses untagged posts. Said et al. [34,36] modified the user similarity calculation method to employ the hybridization of demographic and collaborative approaches. A modification to the k-nearest neighborhood that calculates the similarity scores between the target user and other users was introduced. Wang et al. [50] introduced Credible and co-clustering filterBot for cold-stArt recommendations (COBA), which uses the rating confidence level to reduce the dimensionality of the item–user matrix. The items and users were co-clustered, and the ratings within every user cluster were smoothed to overcome data sparsity. The recommendations were fused from item and user clusters to predict user preference. Zhang et al. [52] proposed the Cold-start Recommendations Using Collaborative Filtering (CRUC) scheme, which involves formulation, filtering and prediction steps. They assumed that users are tracked by sensors such that each user has their own location, which is regarded as the item. The item–user matrix was normalized and clustered to identify users who have a significant influence on the recommendation. The prediction steps were performed by taking the hybrid between the item-based and user-based filtering methods. Chen et al. [8] employed additional information, such as the social sub-community and an ontology decision model, to assist the recommendation in the cold-start problem. The social sub-community was divided according to the existing users' history data and the mined relationships among them. An ontology decision model was then constructed on the basis of the sub-community and users' static information, which makes recommendations for the new user based on his static ontology information. Guo [11] proposed three different approaches from the perspective of preference modeling. First, the ratings of trusted neighbors were merged to form a new rating profile for the active users, based on which better recommendations can be generated. Second, a novel Bayesian similarity measure was introduced by taking both the direction and length of rating vectors into account. Third, a new information source called prior ratings, based on virtual product experience in virtual reality environments, was proposed to inherently resolve the concerned problems. Chen et al. [7] proposed a cold start recommendation method for the new user that integrates a user model with trust and distrust networks to identify trustworthy users, which are then aggregated to provide useful recommendations for new users. Demographic

Table 3

Rating data.

Table 1

Users' demographic data.

Table 2

Movies' information.


data or users' profiles are the most common additional source for solving the cold-start problem. Safoury and Salah [33] presented a framework for evaluating the influence of demographic attributes on the user ratings. This framework was examined using a movie dataset to evaluate the accuracy and precision of the generated recommendations. Nazir and Yadav [24] introduced a profile-based approach that consisted of three main phases: Fetch, Process and Truncate. The Fetch phase is concerned with obtaining the required parameters, such as Global Student Profile or Shared Profile (Profile Token), for the Process phase. Process is the main engine where the actual recommendations are generated based on inputs from the Fetch phase. Truncate is involved with the discarding of the profile-tokens used during the Fetch phase. Formoso et al. [9] proposed a novel profile-expansion approach that includes three types of techniques, namely, item-global, item-local and user-local, based on the query expansion techniques in information retrieval. The experimental evaluation showed that both item-global and user-local offer outstanding improvements in precision. Son et al. [46] presented a novel filtering method based on fuzzy geographically clustering [45,38–44], the so-called MIPFGWC-CS, that can handle the issues of selected demographic attributes, the similarities between items and missing ratings that existed in relevant demographic-based algorithms. Rosli et al. [32] designed a new measure by combining similarity values obtained from a movie "Facebook Page". First, the users' similarities were computed according to the ratings cast on the Movie Rating System. Then, the similarity values obtained from a user's genre interest in "Like" information extracted from "Facebook Pages" were combined. Finally, all of the similarity values were integrated to produce a new user's similarity value. Lika et al. [18] proposed a model that incorporates classification methods with demographic data for the identification of other users with similar behaviors.

Limitations of the first group: although the additional data sources are necessary, these types of data are sometimes unavailable, e.g., in some e-shopping systems where users do not record their profiles or associated Facebook/Twitter accounts.

b) The idea of the second group is to improve the methods that determine the analogous users without the aid of additional data sources. Ahn [2] addressed the limitations of the existing methods for the new user cold-start problem by primarily focusing on the similarity measures, such as the Pearson coefficient and the cosine measure, and proposed a heuristic similarity measure, the so-called PIP (Proximity–Impact–Popularity) measure. The Proximity factor is based on the arithmetic difference between two ratings, the Impact factor considers how strongly an item is preferred or disliked by buyers, and the Popularity factor provides greater value to a similarity for ratings that are further from the average rating of a co-rated item. Lam et al. [15] discussed a hybrid model based on the analysis of two probabilistic aspect models using pure collaborative filtering to combine with users' information. Sun et al. [47] clustered users based on the user–item rating matrix and then utilized the clustering results and users' demographic information to construct a decision tree to achieve the associations between the existing users and the new users. The predictions for new users were made by combining the decision tree with the collaborative filtering algorithm. Zhou et al. [54] presented functional matrix factorization (fMF), a novel cold-start recommendation method that constructs a decision tree with each node being a question. fMF enables the recommender to query a user adaptively according to her prior responses and associates latent profiles for each node of the tree to gradually refine the profiles. It also consists of an iterative optimization scheme that alternates between decision tree construction and latent profile extraction. Qiu et al. [29] introduced an item-oriented function and incorporated it with a hybrid algorithm between heat conduction and the probability spreading process so that the proposed algorithm does not require any additional information, such as tags. Liu et al. [21] noted that the existing recommendation methods lacked a principled model for guiding how to select the most useful ratings and that ratings on the selected representatives are considerably more useful for making recommendations; thus, they proposed a principled approach to identify representative users and items using representative-based matrix factorization. Bobadilla et al. [5] presented a new similarity measure using optimization based on neural learning, which exceeds the best results obtained with current metrics, and described the mathematical formalization that shows how to obtain the main quality measures of a recommender system using leave-one-out cross validation. Said et al. [35] performed a set of tests to identify whether the weighting schemes on three common similarity measures using two different movie datasets can be beneficial for the purpose of overcoming problems related to cold-start, as well as profiling users to generate more accurate profiles not based on the most popular items. They claimed that the weighting schemes appear to have little effect on datasets with a wide rating scale and high concentration of ratings on popular items. Moreover, the cosine measure is only very slightly affected by any weighting measure and produces identical results regardless of whether weighting is applied. Sun et al. [48] proposed a novel algorithm that learns to conduct the interview process guided by a decision tree with multiple questions at each split. The splits, represented as sparse weight vectors, are learned through an L1-constrained optimization framework. The users are directed to child nodes according to the inner product of their responses and the corresponding weight vector. A linear regressor is learned within each node, using all previously obtained answers as inputs to predict item ratings. Liu et al. [20] presented a new user similarity model, NHSM, that takes into account the global preference of user behaviors in addition to the local context information of user ratings to improve the recommendation performance in the cold-start situation.

Limitations of the second group: how to choose the optimal number of groups and the splitting criteria is worth considering.


c) After determining the most analogous users to the new one, some authors used hybrid methods for the calculation of similarity and/or the prediction of ratings. This is the basic idea of the third group. Leung et al. [16,17] introduced a collaborative filtering framework based on Fuzzy Association Rules and Multiple-level Similarity (FARAMS), which extends existing techniques by using fuzzy association rule mining and takes advantage of product similarities in taxonomies to address data sparseness and non-transitive associations. Basiri et al. [4] proposed a hybrid recommender system using the optimistic exponential type of an ordered weighted averaging operator to fuse the output of recommender system strategies for the new user cold-start problem. Kim et al. [13] presented a method for the cold-start problem that includes the prediction of actual ratings, the identification of prediction errors for each user, and the construction of an error-reflected model for the prediction of new users or items. Ge and Ge [10] claimed that a lower-rank approximation could remove data noise resulting from unstable user behaviors and thus lead to better recommendation quality; based upon this idea, they proposed Singular Value Decomposition-based Collaborative Filtering. Kim et al. [14] proposed three hybrid recommenders based on user similarity and two content-boosted recommenders used in conjunction with interaction-based collaborative filtering; they experimentally showed that the best hybrid and content-boosted recommenders improve on the interaction-based collaborative filtering method. Quijano-Sánchez et al. [30] extended a group recommender system with a case based on previous group recommendation events. Carrer-Neto et al. [6] presented a hybrid recommender system based on knowledge and social networks. Negre et al. [25] introduced a process for solving the cold-start problem in data warehouses that is composed of four steps: patternizing OLAP queries, predicting candidate operations, computing candidate recommendations and ranking these recommendations. Xie et al. [51] proposed Elver, which employs an iterative matrix completion technology and a nonnegative factorization procedure to work with meager content inklings to recommend and optimize page-interest targeting on Facebook. Aharon et al. [1] introduced a recommendation algorithm based on Latent Factor analysis called One-pass Factorization of Feature Sets (OFF-Set), which is able to model non-linear interactions between pairs of features and updates its model per each recommendation-reward observation in a pure online fashion. Lin et al. [19] described a method that considers the nascent information culled from Twitter to provide relevant recommendations in cold-start situations. Nilashi et al. [26] proposed new recommendation methods using ANFIS and SOM clustering. A hybrid user-based fuzzy collaborative filtering method has been proposed by Son [42].

Limitations of the third group: irrelevant users are still included in the computation of similarities.

Table 4 summarizes the relevant works by group along with their advantages and disadvantages.

3 The analysis of existing methods

In this section, we provide more details about the typical algorithms of the groups of methods in Table 4 that address the new user cold-start problem. These algorithms are MIPFGWC-CS [46], NHSM [20], FARAMS [17] and HU–FCF [42], and they are presented in the subsequent sections.

3.1 MIPFGWC-CS

The basic concept of the MIPFGWC-CS algorithm [46] is to use fuzzy geographically clustering [45,38–44], particularly MIPFGWC, for the determination of similar users with respect to all attributes in the demographic data. Because

Table 4

Groups of methods for the new user cold-start problem.

Group: Makes use of additional data sources
Description: Use some data (users' profile, opinions, social tags) to support the selection of the neighbors of the new user
Typical algorithms: MIPFGWC-CS [46]
Advantages: Determination of analogous users is more accurate
Disadvantages: Additional data are sometimes not available

Group: Chooses the most prominent groups of analogous users
Description: Improve the methods determining the analogous users by clustering algorithms, decision trees
Typical algorithms: NHSM [20]
Advantages: The similarity degrees between users are enhanced; additional data are not required
Disadvantages: How to choose the optimal number of groups and the splitting criteria is worth considering

Group: Enhances the prediction using hybrid methods
Description: Use hybrid methods for the calculation of similarity and/or the prediction of ratings
Typical algorithms: FARAMS [17], HU–FCF [42]
Advantages: Utilizes the results of existing methods for prediction; controls the final results by parameters
Disadvantages: Specification of values of parameters is hard; irrelevant users are still included in the computation of similarities


the new user has no prior rating, the demographic data are the only medium to calculate the similarities between users. After finding users similar to the new one, MIPFGWC-CS checks whether they rated the considered item. If ratings are found, then they are considered to be the representative ratings of users. Otherwise, a similar item to the considered one is found by the Pearson coefficient, and the rating on the similar item is assumed to be the representative rating. Finally, the rating of the new user to the considered item is approximated by the weighted average operator of the representative ratings. Fig. 1 and Table 5 illustrate the idea in detail.
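The prediction step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the neighbor weights are assumed to come from the clustering step and are simply passed in, `item_similarity` is a hypothetical callable standing in for the Pearson-based item similarity, and the weighted average uses those neighbor weights.

```python
def predict_rating(target_item, neighbors, ratings, item_similarity):
    """Approximate a new user's rating on target_item.

    neighbors: {user: weight} for users found similar to the new user
               (assumed to be given by the clustering step).
    ratings:   {user: {item: rating}}.
    item_similarity: hypothetical callable (item_a, item_b) -> similarity score.
    """
    num = den = 0.0
    for user, weight in neighbors.items():
        user_ratings = ratings.get(user, {})
        if target_item in user_ratings:
            # The neighbor rated the item: use that rating directly.
            representative = user_ratings[target_item]
        elif user_ratings:
            # Otherwise fall back to the neighbor's rating on the most similar item.
            closest = max(user_ratings, key=lambda i: item_similarity(target_item, i))
            representative = user_ratings[closest]
        else:
            continue  # neighbor has no ratings at all
        num += weight * representative
        den += weight
    return num / den if den else None


# Hypothetical data: neighbor "a" rated item 1, neighbor "b" did not.
demo_ratings = {"a": {1: 4}, "b": {2: 2, 3: 5}}
demo_sim = lambda i, j: {(1, 2): 0.9, (1, 3): 0.1}.get((i, j), 0.0)
print(predict_rating(1, {"a": 0.75, "b": 0.25}, demo_ratings, demo_sim))  # 3.5
```

In the usage lines, neighbor "b" contributes its rating on item 2 because item 2 is the most similar to item 1 under the made-up similarity table.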

The MIPFGWC-CS algorithm described above has several disadvantages, as follows:

a) The optimal number of clusters for MIPFGWC must be determined before running the clustering algorithm. Although other parameters of MIPFGWC were suggested by Son et al. [39], how to determine the optimal number of clusters is still an on-going topic of research. The exact number of clusters would lead to more accurate results for finding the similar users to a new user and thus enhance the prediction accuracy.

b) In Fig. 1, finding a similar item to the considered one by the Pearson coefficient may not achieve good results because the Pearson metric has some limitations where there is a poor signal-to-noise ratio or there are negative spikes. In particular, if the relationship between two variables is non-linear, the Pearson coefficient cannot accurately measure the correlation.

c) MIPFGWC-CS relies solely on the demographic data (Fig. 1). If this type of data is not available, the algorithm cannot be performed.
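The point about non-linear relationships in item b) can be checked with a small, self-contained Python example. The data are made up for illustration: y is fully determined by x in both cases, yet the Pearson coefficient is 1 for the linear relation and 0 for the symmetric quadratic one.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den


xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]   # perfectly linear dependence
quadratic = [x * x for x in xs]    # perfectly dependent, but non-linear

print(pearson(xs, linear))     # close to 1
print(pearson(xs, quadratic))  # 0: the dependence is invisible to Pearson
```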

3.2 NHSM

Liu et al. [20] introduced a new similarity metric called NHSM to replace the traditional Pearson coefficient or the cosine similarity measure. This heuristic similarity measure is composed of three factors of similarity: Proximity, Significance and Singularity. Proximity considers the distance between two ratings. Significance reflects that the ratings are more significant if the two ratings are more distant from the median rating. Singularity represents how the two ratings differ from other ratings. Furthermore, NHSM integrates the modified Jaccard and the user rating preference in the design. The definition of NHSM is stated below.

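\[ \mathrm{sim}(u,v)^{NHSM} = \mathrm{sim}(u,v)^{JPSS} \cdot \mathrm{sim}(u,v)^{URP}, \tag{8} \]

\[ \mathrm{sim}(u,v)^{URP} = 1 - \frac{1}{1 + \exp\left(-|\mu_u - \mu_v| \cdot |\sigma_u - \sigma_v|\right)}, \tag{9} \]

\[ \mathrm{sim}(u,v)^{JPSS} = \mathrm{sim}(u,v)^{PSS} \cdot \mathrm{sim}(u,v)^{Jaccard'}, \tag{10} \]

\[ \mathrm{sim}(u,v)^{Jaccard'} = \frac{|I_u \cap I_v|}{|I_u| \times |I_v|}, \tag{11} \]

\[ \mathrm{sim}(u,v)^{PSS} = \sum_{p \in I} \mathrm{Proximity}(r_{u,p}, r_{v,p}) \cdot \mathrm{Significance}(r_{u,p}, r_{v,p}) \cdot \mathrm{Singularity}(r_{u,p}, r_{v,p}), \tag{12} \]

\[ \mathrm{Proximity}(r_{u,p}, r_{v,p}) = 1 - \frac{1}{1 + \exp\left(-|r_{u,p} - r_{v,p}|\right)}, \tag{13} \]

\[ \mathrm{Significance}(r_{u,p}, r_{v,p}) = \frac{1}{1 + \exp\left(-|r_{u,p} - r_{med}| \cdot |r_{v,p} - r_{med}|\right)}, \tag{14} \]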

Fig. 1. The MIPFGWC-CS algorithm.


\[ \mathrm{Singularity}(r_{u,p}, r_{v,p}) = 1 - \frac{1}{1 + \exp\left(-\left|\frac{r_{u,p} + r_{v,p}}{2} - \mu_p\right|\right)}, \tag{15} \]

where $\mu_u$ and $\sigma_u$ are the mean rating and the standard deviation of user $u$, respectively. $I_u$ represents the set of ratings of user $u$. The operator $\cap$ denotes the common ratings between two users. $r_{u,p}$ is the rating of user $u$ on item $p$, and $\mu_p$ is the mean rating of item $p$. $r_{med}$ is the median value in the rating scale. Fig. 2 describes the filtering method using the NHSM metric. The limitations of the NHSM-based filtering algorithm are as follows:

a) The algorithm is based solely on the rating data and makes no use of additional data, such as demographic data; thus, it can lead to inaccurate calculations of the similarity.

b) The algorithm must assume that the new user already has some prior ratings in the rating data.
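As a rough illustration of how the NHSM terms in Eqs. (8)–(15) combine, the following Python sketch computes the similarity for two users given as item-to-rating dictionaries. It rests on several assumptions that are not stated in this section: the modified Jaccard term is taken as $|I_u \cap I_v| / (|I_u| \cdot |I_v|)$, $\sigma$ is computed as the population standard deviation, the sum in the PSS term runs over co-rated items, and `item_means` (the mean rating of each item) is supplied by the caller.

```python
from math import exp
from statistics import mean, pstdev


def nhsm(ru, rv, item_means, r_med):
    """NHSM-style similarity of two users (a sketch, see assumptions above).

    ru, rv:     {item: rating} for users u and v.
    item_means: {item: mean rating of that item} (assumed precomputed).
    r_med:      median value of the rating scale, e.g. 3 on a 1-5 scale.
    """
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    pss = 0.0
    for p in common:
        a, b = ru[p], rv[p]
        proximity = 1 - 1 / (1 + exp(-abs(a - b)))
        significance = 1 / (1 + exp(-abs(a - r_med) * abs(b - r_med)))
        singularity = 1 - 1 / (1 + exp(-abs((a + b) / 2 - item_means[p])))
        pss += proximity * significance * singularity
    jaccard = len(common) / (len(ru) * len(rv))
    u_vals, v_vals = list(ru.values()), list(rv.values())
    urp = 1 - 1 / (1 + exp(-abs(mean(u_vals) - mean(v_vals))
                           * abs(pstdev(u_vals) - pstdev(v_vals))))
    return pss * jaccard * urp


# Hypothetical users on a 1-5 scale.
alice = {1: 5, 2: 3, 3: 1}
bob = {1: 4, 3: 2, 4: 5}
means = {1: 4.2, 3: 2.0}
print(nhsm(alice, bob, means, r_med=3))
```

Note that the function returns 0 when the two users share no rated item, which reflects limitation b): a user with no ratings at all has zero similarity to everyone.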

3.3 FARAMS

Leung et al. [17] integrated fuzzy set theory into association rule mining techniques and applied the proposed work to the collaborative filtering of recommender systems. First, the rating data are converted to the transactional database of association rule mining, fuzzified by fuzzy memberships of linguistic variables and transformed into the transaction ID (TID)–items format, where each TID is in the form of {Item, linguistic variable} and each item is a list of users, with their corresponding fuzzy memberships, that opted for the {Item, linguistic variable}. Then, an Apriori-like algorithm is used to define candidate item sets and possible rules with the support of MinSupp and MinConf thresholds. The difference between this algorithm and the original Apriori algorithm is the use of Fuzzy Support (FS) and Fuzzy Confidence (FC), computed between two items A and B equipped with their memberships X and Y, respectively (Eqs. (16)–(21)). After defining the fuzzy rules, the predicting score of a recommendable item is calculated and used to provide the final rating of the new user. Fig. 3 highlights the concept in detail.

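\[ FS_{\langle A,X \rangle} = \frac{\sum_{t_i \in T} \prod_{a_j \in A} \mu(t_i[a_j])}{|T|}, \tag{16} \]

\[ FC_{\langle \langle A,X \rangle, \langle B,Y \rangle \rangle} = \frac{FS_{\langle A \cup B, X \cup Y \rangle}}{FS_{\langle A,X \rangle}}, \tag{17} \]

\[ CORR_{\langle \langle A,X \rangle, \langle B,Y \rangle \rangle} = \frac{Cov_{\langle \langle A,X \rangle, \langle B,Y \rangle \rangle}}{\sqrt{Var_{\langle A,X \rangle} \times Var_{\langle B,Y \rangle}}}, \tag{18} \]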
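The fuzzy support and fuzzy confidence of Eqs. (16) and (17) can be sketched in Python as below. For simplicity, this sketch assumes a horizontal layout with one record per user, each mapping an (item, linguistic label) pair to a membership degree; this differs from the TID–items layout described above, and the item names and membership values are hypothetical.

```python
from math import prod

# Hypothetical fuzzified records: one dict per user,
# mapping (item, linguistic label) -> membership degree in [0, 1].
records = [
    {("m1", "Like"): 1.0, ("m2", "Like"): 0.5},
    {("m1", "Like"): 0.5, ("m2", "Like"): 0.5},
    {("m1", "Like"): 0.0, ("m2", "Like"): 1.0},
    {("m1", "Like"): 0.5, ("m2", "Like"): 1.0},
]


def fuzzy_support(transactions, itemset):
    """Mean over all records of the product of memberships in itemset."""
    total = sum(prod(t.get(a, 0.0) for a in itemset) for t in transactions)
    return total / len(transactions)


def fuzzy_confidence(transactions, antecedent, consequent):
    """Support of the joint itemset divided by support of the antecedent."""
    fs_a = fuzzy_support(transactions, antecedent)
    if fs_a == 0:
        return 0.0
    return fuzzy_support(transactions, antecedent | consequent) / fs_a


A = frozenset({("m1", "Like")})
B = frozenset({("m2", "Like")})
print(fuzzy_support(records, A))        # 0.5
print(fuzzy_confidence(records, A, B))  # 0.625
```

A rule such as "likes m1 implies likes m2" would then be kept only if these two values reach the MinSupp and MinConf thresholds.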

Table 5

The pseudo-code of the MIPFGWC procedure.

Input: Geo-demographic data $X$; the number of elements (clusters) $N$ ($C$); the dimension of the dataset $r$; threshold $\varepsilon$ and other parameters $m, \eta, \tau$, $a_i$ $(i = 1, \dots, 3)$, $\gamma_j$ $(j = 1, \dots, C)$; geographic parameters $\alpha, \beta, \gamma, a, b, c, d$.

Output: Final membership values $u'_k$ and centers $V^{(t+1)}$.

MIPFGWC:

1: Set the number of clusters $C$, threshold $\varepsilon > 0$ and other parameters such as $m, \eta, \tau > 1$, $a_i > 0$ $(i = 1, \dots, 3)$, $\gamma_j$ $(j = 1, \dots, C)$ as in [39].

2: Initialize the centers of clusters $V_j$, $j = 1, \dots, C$, at $t = 0$.

3: Set the geographic parameters $\alpha, \beta, \gamma, a, b, c, d$ satisfying condition (1).

4: Use Eqs. (2)–(4) to calculate the membership values, the hesitation level and the typicality values, respectively:

$$u_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{m-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C, \tag{2}$$

$$h_{kj} = \frac{1}{\sum_{i=1}^{C} \left( \frac{\|X_k - V_j\|}{\|X_k - V_i\|} \right)^{\frac{2}{\tau-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C, \tag{3}$$

$$t_{kj} = \frac{1}{1 + \left( \frac{a_2 \|X_k - V_j\|^2}{\gamma_j} \right)^{\frac{1}{\eta-1}}}, \quad k = 1, \dots, N;\ j = 1, \dots, C. \tag{4}$$

5: Perform geographic modifications through Eqs. (5)–(6):

$$u'_k = \alpha u_k + \beta \sum_{j=1}^{k-1} w_{kj} u'_j + \gamma \frac{1}{A} \sum_{j=k}^{C} w_{kj} u_j, \tag{5}$$

$$w_{kj} = \frac{(pop_k \times pop_j)^{b} \, p_{kj}^{c} \, IM_{kj}^{d}}{d_{kj}^{a}}, \quad k \neq j. \tag{6}$$

6: If $\{u'_k\}$ is a completely monotone increasing sequence or $u_k \geq u'_k$ for most $k = 1, \dots, C$, then conclude that there is no suitable solution for the given geographic parameters. Otherwise, go to Step 7.

7: Calculate the centers of clusters at $t + 1$ by Eq. (7):

$$V_j = \frac{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right) X_k}{\sum_{k=1}^{N} \left( a_1 u_{kj}^{m} + a_2 t_{kj}^{\eta} + a_3 h_{kj}^{\tau} \right)}, \quad j = 1, \dots, C. \tag{7}$$

8: If the difference $\|V^{(t+1)} - V^{(t)}\| \leq \varepsilon$, then stop the algorithm. Otherwise, assign $V^{(t)} = V^{(t+1)}$ and return to Step 4.
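As a rough illustration of Steps 4, 7 and 8 of Table 5, the following Python sketch computes the membership, hesitation and typicality values of Eqs. (2)–(4) and the center update of Eq. (7). It is not the implementation used in this paper: the geographic modifications of Steps 5 and 6 are omitted, and the function names, the default parameter values, the zero-distance guard and the per-cluster form of the stopping test are all assumptions made for the example.

```python
import math

def dist(x, v):
    """Euclidean distance between a data point and a center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v)))

def memberships(X, V, m=2.0, tau=2.0, eta=2.0, a2=1.0, gamma=None):
    """Return (u, h, t) as N x C nested lists, following Eqs. (2)-(4)."""
    C = len(V)
    gamma = gamma or [1.0] * C  # assumed default for gamma_j
    u, h, t = [], [], []
    for x in X:
        # Guard against a zero distance (assumption, not part of Table 5).
        d = [max(dist(x, v), 1e-12) for v in V]
        u.append([1.0 / sum((d[j] / d[i]) ** (2.0 / (m - 1.0)) for i in range(C))
                  for j in range(C)])
        h.append([1.0 / sum((d[j] / d[i]) ** (2.0 / (tau - 1.0)) for i in range(C))
                  for j in range(C)])
        t.append([1.0 / (1.0 + (a2 * d[j] ** 2 / gamma[j]) ** (1.0 / (eta - 1.0)))
                  for j in range(C)])
    return u, h, t

def update_centers(X, u, h, t, m=2.0, tau=2.0, eta=2.0, a=(1.0, 1.0, 1.0)):
    """Eq. (7): each center is a weighted mean of the data points."""
    N, C, r = len(X), len(u[0]), len(X[0])
    V = []
    for j in range(C):
        w = [a[0] * u[k][j] ** m + a[1] * t[k][j] ** eta + a[2] * h[k][j] ** tau
             for k in range(N)]
        total = sum(w)
        V.append([sum(w[k] * X[k][dim] for k in range(N)) / total for dim in range(r)])
    return V

def cluster(X, V, eps=1e-4, max_iter=100):
    """Iterate Steps 4 and 7 until no center moves by more than eps (Step 8)."""
    u = []
    for _ in range(max_iter):
        u, h, t = memberships(X, V)
        V_new = update_centers(X, u, h, t)
        if max(dist(vn, vo) for vn, vo in zip(V_new, V)) <= eps:
            return u, V_new
        V = V_new
    return u, V

X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
u, V = cluster(X, [[1.0, 1.0], [9.0, 9.0]])
```

On this toy input the two centers settle near the two groups of points, and each row of `u` sums to one, as Eq. (2) implies.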


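$$Cov\langle\langle A, X\rangle, \langle B, Y\rangle\rangle = FS\langle A \cup B, X \cup Y\rangle - FS\langle A, X\rangle \times FS\langle B, Y\rangle, \tag{19}$$

$$Var\langle A, X\rangle = FS\langle A, X\rangle^{2} - \left( FS\langle A, X\rangle \right)^{2}, \tag{20}$$

$$FS\langle A, X\rangle^{2} = \frac{\sum_{t_i \in T} \left\{ \prod_{a_j \in A} \mu(t_i[a_j]) \right\}^{2}}{|T|}. \tag{21}$$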

In Eq. (16), $\langle A, X\rangle$ represents an 〈Itemset, FuzzySet〉 pair, $t_i[a_j]$ is the value of $a_j$ in the $i$th record of the transactional database $T$, and $\mu(t_i[a_j])$ is the membership value of $t_i[a_j]$. Eqs. (16)–(18) provide the formulas of Fuzzy Support, Fuzzy Confidence and Correlation, respectively. The limitations of the FARAMS algorithm are as follows:

a) The fuzzification in FARAMS could lead to inaccurate prediction results. The FARAMS algorithm was designed for movie applications, e.g., MovieLens, Jester and EachMovie, where the linguistic variables are “Like”, “Dislike” and “Neutral” with pre-defined membership functions. When it is applied to other applications, knowing how to set up the membership functions is a matter of concern. Wrong membership values would adversely affect the operation of the algorithm. In fact, not all recommender system applications require fuzzy parameters; thus, for the sake of stability and processing time, the fuzzification step should be reduced.

b) The limitation regarding rating data that was noted for the NHSM-based filtering algorithm also applies here.

c) The FARAMS algorithm could be regarded as an efficient method for calculating the similarity between items.
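To make the measures of Eqs. (16)–(21) concrete, the sketch below evaluates them on a tiny hand-made database. It is only an illustration under stated assumptions: each record is modeled as a dictionary from a hypothetical (item, linguistic variable) key to a membership value, the union $A \cup B$ is formed by list concatenation (so A and B are assumed disjoint), and the function names and data are invented for the example rather than taken from FARAMS.

```python
import math

def fuzzy_support(T, items):
    """Eq. (16): mean over records of the product of memberships."""
    return sum(math.prod(rec.get(a, 0.0) for a in items) for rec in T) / len(T)

def fuzzy_support_sq(T, items):
    """Eq. (21): mean over records of the squared product of memberships."""
    return sum(math.prod(rec.get(a, 0.0) for a in items) ** 2 for rec in T) / len(T)

def fuzzy_confidence(T, A, B):
    """Eq. (17): support of the union divided by support of the antecedent."""
    return fuzzy_support(T, A + B) / fuzzy_support(T, A)

def variance(T, items):
    """Eq. (20)."""
    return fuzzy_support_sq(T, items) - fuzzy_support(T, items) ** 2

def correlation(T, A, B):
    """Eqs. (18)-(19); assumes both variances are non-zero."""
    cov = fuzzy_support(T, A + B) - fuzzy_support(T, A) * fuzzy_support(T, B)
    return cov / math.sqrt(variance(T, A) * variance(T, B))

# Hypothetical membership values for two items under the variable "Like".
T = [
    {("M1", "Like"): 0.8, ("M2", "Like"): 0.6},
    {("M1", "Like"): 0.4, ("M2", "Like"): 0.2},
    {("M1", "Like"): 1.0, ("M2", "Like"): 1.0},
    {("M1", "Like"): 0.0, ("M2", "Like"): 0.2},
]
A, B = [("M1", "Like")], [("M2", "Like")]
fs_a = fuzzy_support(T, A)
fc_ab = fuzzy_confidence(T, A, B)
corr_ab = correlation(T, A, B)
```

A rule such as 〈M1, Like〉 → 〈M2, Like〉 would then be kept only if `fs_a` and `fc_ab` reach the MinSupp and MinConf thresholds.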

Fig. 2. The NHSM-based filtering algorithm.

Fig. 3. The FARAMS algorithm.

3.4 HU–FCF

The basic concept of the HU–FCF method [42] is to integrate the fuzzy similarity degrees between users, which are based on the demographic data, with the hard user-based degrees calculated from the rating histories into the final similarity degrees. As such, those degrees reflect more exactly the correlation between users in terms of both internal information (attributes of users) and external information (interactions between users). Each similarity degree (fuzzy/hard) is accompanied by weights that are automatically calculated according to the numbers of analogous users. After the final similarity degrees are calculated, the final rating is constructed from the rating values of the neighbors of the considered user. Depending on the domain of the specific problem, the final rating is approximated to its nearest value in that domain, accompanied by an error threshold that is normally less than 5%. A list of nearest values with equivalent error thresholds is also given as the prediction ratings of a user for an item. Fig. 4 illustrates the concept in detail.
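The last steps described above can be sketched as follows. This is a hedged illustration, not the HU–FCF implementation: the fixed weights `w_fuzzy` and `w_hard` stand in for the automatically calculated weights, the similarity-weighted mean used for prediction is an assumed aggregation rule, and the error is expressed relative to the width of the rating domain, which is also an assumption.

```python
def combine_similarity(fuzzy_sim, hard_sim, w_fuzzy=0.5, w_hard=0.5):
    """Weighted combination of the fuzzy (demographic) and hard (rating) degrees.
    The fixed weights are placeholders for the automatically computed ones."""
    return w_fuzzy * fuzzy_sim + w_hard * hard_sim

def predict_rating(neighbors):
    """neighbors: list of (final similarity, rating) pairs.
    Assumed aggregation: similarity-weighted mean of the neighbors' ratings."""
    total = sum(sim for sim, _ in neighbors)
    return sum(sim * rating for sim, rating in neighbors) / total

def snap_to_domain(value, domain):
    """Approximate a real-valued prediction to the nearest admissible rating.
    Returns the nearest value and the error relative to the domain width."""
    nearest = min(domain, key=lambda v: abs(v - value))
    rel_error = abs(nearest - value) / (max(domain) - min(domain))
    return nearest, rel_error

neighbors = [
    (combine_similarity(0.9, 0.7), 4),  # close neighbor who rated the item 4
    (combine_similarity(0.3, 0.5), 2),  # weaker neighbor who rated the item 2
]
raw = predict_rating(neighbors)
rating, err = snap_to_domain(raw, [1, 2, 3, 4, 5])
```

With a discrete domain such as the 1 to 5 scale of MovieLens, the snapping step turns the real-valued estimate into an admissible rating and reports how far the estimate was from it.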

The limitations of the HU–FCF algorithm are as follows:

a) If the demographic data are not provided in the data list, the HU–FCF algorithm does not work because the rating data have no prior ratings of the new user. Thus, the similarities between the new user and others cannot be calculated, and the final rating cannot be found.

b) Similar to the deficiencies of the MIPFGWC-CS algorithm, the Pearson coefficient cannot accurately measure the correlation. Thus, a better similarity metric should be used instead of the Pearson coefficient.

c) In the branch of demographic data, the GFD matrix is calculated from all users in the system. Indeed, irrelevant users may be included in the computation of similarities, thus degrading the performance of the prediction.

4 Experiments

4.1 Environment setup

Experimental tools: we have implemented MIPFGWC-CS [46], NHSM [20], FARAMS [17] and HU–FCF [42] in the C programming language and executed them on a PC with an Intel Pentium 4 CPU at 2.66 GHz, 1 GB RAM and an 80 GB HDD.

Experimental datasets: we use the following benchmark RS datasets.

MovieLens 1M [23]: contains 1,000,209 anonymous ratings of approximately 3900 movies provided by 6040 MovieLens users. Ratings are discrete values from 1 to 5. Demographic data are provided in the following form: “Gender: Age: Occupation: Zip-code”.

Jester [12]: contains ratings of 100 jokes from 73,421 users. Ratings are real values ranging from −10 to 10. The value “99” corresponds to “null” = “not rated”. Demographic data are not supported for this dataset.

Generating cold-start users: we adopt the Hold-out and the k-fold cross validation methods, where the users in the testing set are the cold-start users. For each cold-start user, we use the algorithms to predict the ratings for the items that the user has rated, except for three rated items that are selected as the basis for the calculation of similarity. Each trial is measured by the evaluation indices. The final results are computed as the averages of these values over users and trials.
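The Hold-out variant of this protocol can be sketched in a few lines. The sketch is illustrative only: the nested-dictionary layout of the ratings, the function name, the 20% test fraction and the random choice of the three basis items are assumptions, since the text specifies only that three rated items are kept per cold-start user.

```python
import random

def make_cold_start_split(ratings, test_fraction=0.2, n_basis=3, seed=0):
    """ratings: dict user -> dict item -> rating.
    Returns (train, basis, targets): training users keep all their ratings,
    each cold-start user keeps n_basis ratings as the similarity basis, and
    the remaining ratings become the prediction targets."""
    rng = random.Random(seed)
    # Only users with more than n_basis ratings can serve as cold-start users.
    candidates = sorted(u for u, r in ratings.items() if len(r) > n_basis)
    rng.shuffle(candidates)
    n_test = max(1, int(len(candidates) * test_fraction))
    test_users = set(candidates[:n_test])
    train = {u: dict(r) for u, r in ratings.items() if u not in test_users}
    basis, targets = {}, {}
    for u in sorted(test_users):
        items = sorted(ratings[u])
        kept = set(rng.sample(items, n_basis))
        basis[u] = {i: ratings[u][i] for i in items if i in kept}
        targets[u] = {i: ratings[u][i] for i in items if i not in kept}
    return train, basis, targets

# Toy data: 10 hypothetical users, each with 5 rated items.
toy = {f"u{k}": {f"i{j}": 1 + (k + j) % 5 for j in range(5)} for k in range(10)}
train, basis, targets = make_cold_start_split(toy)
```

Repeating the split with different seeds, or rotating the test users across k folds, gives the trials over which the evaluation indices are averaged.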

Evaluation indices: we use the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) for the validation of accuracy:

$$MAE = \frac{1}{N} \sum_{u,i} \left| p_{u,i} - r_{u,i} \right|,$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{u,i} \left( p_{u,i} - r_{u,i} \right)^{2}},$$

where $p_{u,i}$ ($r_{u,i}$) is the predicted (real) rating of user $u$ for item $i$, and $N$ is the number of predicted ratings.
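Both indices are straightforward to compute. The following minimal sketch, with invented prediction values, shows the calculation on a list of (predicted, real) rating pairs; the function names are chosen for the example only.

```python
import math

def mae(pairs):
    """Mean Absolute Error over (predicted, real) rating pairs."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def rmse(pairs):
    """Root Mean Square Error over (predicted, real) rating pairs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

# Hypothetical predictions for three (user, item) pairs.
pairs = [(3.5, 4.0), (2.0, 2.0), (5.0, 3.0)]
mae_value = mae(pairs)    # (0.5 + 0 + 2) / 3
rmse_value = rmse(pairs)  # sqrt((0.25 + 0 + 4) / 3)
```

Because RMSE squares each error before averaging, the single large error in the third pair weighs more heavily on it than on MAE.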

Experimental objectives: we compare the accuracy and the computational time of the algorithms to determine the most effective algorithm.

4.2 Results and discussion

Fig. 4. The HU–FCF algorithm.

Table 6. The experimental results using the Hold-out cross validation method (MAE values, RMSE values and computational time in seconds). a: smallest value for a given dataset.

First, we present the comparative results of the algorithms using the Hold-out cross validation method, described in Table 6. The experimental results are evaluated by the evaluation indices. In this table, the results of MIPFGWC-CS on the Jester dataset are null because this dataset does not support the demographic data, which are essential for the calculation in MIPFGWC-CS. The experimental results clearly show that the MAE and RMSE values and the computational time of NHSM are mostly smaller than those of the other algorithms. To visualize the experimental results, the MAE and RMSE values of the algorithms on the MovieLens and Jester datasets are presented in Figs. 5 and 6, respectively. The MAE and RMSE values of NHSM are the smallest among all of the algorithms in nearly every case. For instance, the MAE value of NHSM on the Jester dataset is 0.821, which is approximately 91.3% and 91.7% of those of FARAMS and HU–FCF, respectively. Similarly, the RMSE value of NHSM on the Jester dataset is 1.04, which is approximately 95.3% and 94.7% of those of FARAMS and HU–FCF, respectively. The only case in which the MAE value of NHSM is larger than that of another algorithm occurs on the MovieLens dataset, where the MAE values of NHSM, MIPFGWC-CS, FARAMS and HU–FCF are 0.641, 0.701, 0.636 and 0.697, respectively. Even in this case, the MAE value of NHSM is larger only than that of FARAMS and is smaller than those of the other algorithms. Thus, the experimental results have

Fig. 5. The MAE and RMSE values of algorithms on the MovieLens dataset.

Fig. 6. The MAE and RMSE values of algorithms on the Jester dataset.
