Content-based Dimensionality Reduction for Recommender Systems

Panagiotis Symeonidis
In this paper, we construct a feature profile of a user to reveal the duality between users and features. For instance, in a movie recommender system, a user prefers a movie for various reasons, such as the actors, the director or the genre of the movie. All these features affect the choice of each user differently. Then, we apply the Latent Semantic Indexing (LSI) model to reveal the dominant features of a user. Finally, we provide recommendations according to this dimensionally-reduced feature profile. Our experiments with a real-life data set show the superiority of our approach over existing CF, CB and hybrid approaches.
The rest of this paper is organized as follows: Section 2 summarizes the related work. The proposed approach is described in Section 3. Experimental results are given in Section 4. Finally, Section 5 concludes this paper.
2 Related work

Schult and Spiliopoulou (2006) discover emerging topics in document collections. Moreover, in the IR area, Furnas et al. (1988) proposed LSI to detect the latent semantic relationship between terms and documents. Sarwar et al. (2000) applied dimensionality reduction for the user-based CF approach.
There have been several attempts to combine CB with CF. The Fab System (Balabanovic et al. 1997) measures similarity between users after first computing a content profile for each user. This process reverses the CinemaScreen System (Salter et al. 2006), which runs CB on the results of CF. Melville et al. (2002) used a content-based predictor to enhance existing user data and then provide personalized suggestions through collaborative filtering. Finally, Tso and Schmidt-Thieme (2005) proposed three attribute-aware CF methods, applying the CB and CF paradigms in two separate processes before combining them at the point of prediction.

All the aforementioned approaches are hybrid: they either run CF on the results of CB or vice versa. Our model discloses the duality between user ratings and item features to reveal the actual reasons behind users' rating behavior. Moreover, we apply LSI on the feature profiles of users to reveal the principal features. Then, we use a similarity measure which is based on features, revealing the real preferences in the user's rating behavior.
3 The proposed approach
Our approach constructs a feature profile of a user, based on both collaborative andcontent features Then, we apply LSI to reveal the dominant features trends Finally,
we provide recommendations according to this dimensionally-reduced feature profile
of the users
3.1 Defining rating, item and feature profiles
CF algorithms process the rating data of the users to provide accurate recommendations. An example of rating data is given in Figures 1a and 1b. As shown, the example data set (matrix R) is divided into a training and a test set, where I1-I12 are items and U1-U4 are users. The null cells (no rating) are presented with a dash and the rating scale is between [1-5], where 1 means strong dislike, while 5 means strong like.
Definition 1. The rating profile R(U_k) of user U_k is the k-th row of matrix R.

For instance, R(U1) is the rating profile of user U1, and consists of the rated items I1, I2, I3, I4, I8 and I10. The rating of a user u over an item i is given by the element R(u, i) of matrix R.
Definition 2. The item profile F(I_k) of item I_k is the k-th row of matrix F.

For instance, F(I1) is the profile of item I1, and consists of features F1 and F2. Notice that this matrix is not always boolean. Thus, if we processed documents, matrix F would count frequencies of terms.
To capture the interaction between users and their favorite features, we construct a feature profile composed of the rating profile and the item profile.
For the construction of the feature profile of a user, we use a positive rating threshold, PW, to select items from his rating profile whose rating is not less than this value. The reason is that the rating profile of a user consists of ratings that take values
from a scale (in our running example, the 1-5 scale). It is evident that the selected ratings should be "positive", as the user does not favor an item that is rated with 1 on a 1-5 scale.
Definition 3. The feature profile P(U_k) of user U_k is the k-th row of matrix P, whose elements P(u, f) are given by Equation 1.
For instance, P(U2) is the feature profile of user U2, and consists of features f1, f2 and f3. The correlation of a user U_k with a feature f is given by the element P(U_k, f) of matrix P. As shown, feature f2 describes him better than feature f1 does.
Fig. 2. User-feature matrix P divided into (a) training set (n × m), (b) test set
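As a concrete illustration of the feature-profile construction (Equation 1 is not reproduced above), the following sketch assumes that P(u, f) simply counts how many items rated at least PW by user u carry feature f; the toy matrices, the threshold value and all variable names are illustrative.

import numpy as np

# Toy rating matrix R (users x items); 0 denotes "no rating".
R = np.array([
    [5, 4, 0, 1],
    [0, 3, 4, 0],
    [2, 0, 5, 4],
])

# Boolean item-feature matrix F (items x features).
F = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])

PW = 3  # positive rating threshold

# Select, per user, the items whose rating is not less than PW ...
positive = (R >= PW).astype(int)   # users x items
# ... and accumulate the features of those items into the feature profile P.
P = positive @ F                   # users x features

print(P)  # P[u, f]: number of positively rated items of user u that carry feature f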
3.2 Applying SVD on training data
Initially, we apply Singular Value Decomposition (SVD) on the training data of matrix P, which produces three matrices, based on Equation 2, as shown in Figure 3:

P_{n×m} = U_{n×n} · S_{n×m} · V^T_{m×m}    (2)
3.3 Preserving the principal components
It is possible to reduce the n × m matrix S to keep only its c largest singular values. Then, the reconstructed matrix is the closest rank-c approximation of the initial matrix P, as shown in Equation 3 and Figure 4:

P*_{n×m} = U_{n×c} · S_{c×c} · V^T_{c×m}    (3)

Fig. 4. Example of: P*_{n×m} (approximation matrix of P), U_{n×c} (left singular vectors of P*), S_{c×c} (singular values of P*), V^T_{c×m} (right singular vectors of P*)
We tune the number, c, of principal components (i.e., dimensions) with the objective to reveal the major feature trends. The tuning of c is determined by the information percentage that is preserved compared to the original matrix.
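A minimal sketch of the SVD and truncation step, assuming the preserved information percentage is measured as the retained fraction of the singular-value mass (the exact criterion is not spelled out above); names and data are illustrative.

import numpy as np

def truncated_svd(P, info_fraction=0.7):
    """Return U_c, S_c, Vt_c keeping the smallest c that preserves the
    requested fraction of the singular-value mass of P."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    retained = np.cumsum(s) / s.sum()
    c = int(np.searchsorted(retained, info_fraction)) + 1
    return U[:, :c], np.diag(s[:c]), Vt[:c, :]

P = np.random.rand(4, 6)           # toy user-feature matrix
U_c, S_c, Vt_c = truncated_svd(P, 0.7)
P_star = U_c @ S_c @ Vt_c          # closest rank-c approximation of P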
3.4 Inserting a test user in the c-dimensional space
Given the current feature profile of the test user u, as illustrated in Figure 2b, we enter the pseudo-user vector into the c-dimensional space using Equation 4. In our example, we insert U4 into the 2-dimensional space, as shown in Figure 5:

u_new = u · V_{m×c} · S^{-1}_{c×c}    (4)

In Equation 4, u_new denotes the mapped ratings of the test user u, whereas V_{m×c} and S^{-1}_{c×c} are matrices derived from SVD. This u_new vector should be added at the end of the U_{n×c} matrix which is shown in Figure 4.
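A sketch of this folding-in step under the reconstruction of Equation 4 above; the helper name fold_in and the toy data are illustrative.

import numpy as np

def fold_in(u_test, Vt_c, S_c):
    """Map a test user's raw feature profile (length m) into the c-dimensional space."""
    return u_test @ Vt_c.T @ np.linalg.inv(S_c)

# Toy setup: U_c, S_c, Vt_c as obtained from the training user-feature matrix P.
P = np.random.rand(4, 6)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
c = 2
U_c, S_c, Vt_c = U[:, :c], np.diag(s[:c]), Vt[:c, :]

u_test = np.random.rand(6)            # feature profile of the new (pseudo-)user
u_new = fold_in(u_test, Vt_c, S_c)    # length-c vector, appended below U_c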
3.5 Generating the Neighborhood of users/items
In our model, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional space. The similarities between training and test users can be based on cosine similarity. First, we compute the matrix U_{n×c} · S_{c×c} and then we perform vector similarity on it. This n × c matrix is the c-dimensional representation of the n users.
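A sketch of the neighborhood formation: rows of U_{n×c} · S_{c×c} serve as the c-dimensional user representations and cosine similarity ranks them against the folded-in pseudo-user vector; the function name and the handling of ties are illustrative.

import numpy as np

def k_nearest_neighbors(U_c, S_c, u_new, k=10):
    """Return indices of the k training users most similar (cosine) to u_new."""
    train = U_c @ S_c                                  # c-dimensional user representations
    sims = (train @ u_new) / (
        np.linalg.norm(train, axis=1) * np.linalg.norm(u_new) + 1e-12)
    neighbors = np.argsort(sims)[::-1][:k]             # highest similarity first
    return neighbors, sims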
3.6 Generating the top-N recommendation list
The most often used technique for the generation of the top-N list is the one that counts the frequency of each positively rated item inside the found neighborhood and recommends the N most frequent ones. Our approach differentiates from this technique by exploiting the item features. In particular, for each feature f inside the found neighborhood, we add up its frequency. Then, based on the features that an item consists of, we compute its weight in the neighborhood. Our method takes into account the fact that each user has his own reasons for rating an item.
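One plausible reading of this feature-weighted ranking, sketched below: feature frequencies are accumulated over the neighbors' positively rated items and each candidate item is scored by the summed weight of its features. The exact weighting used in the paper may differ, and the data structures shown are assumptions.

from collections import Counter

def top_n(neighbors, user_pos_items, item_features, candidate_items, n=20):
    """neighbors: ids of the k nearest users; user_pos_items: user -> set of
    positively rated items; item_features: item -> set of features."""
    # 1. Count how often each feature occurs in the neighborhood.
    feature_weight = Counter()
    for u in neighbors:
        for item in user_pos_items[u]:
            feature_weight.update(item_features[item])
    # 2. Score each candidate item by the summed weight of its features.
    scores = {i: sum(feature_weight[f] for f in item_features[i])
              for i in candidate_items}
    # 3. Recommend the N highest-scoring items.
    return sorted(scores, key=scores.get, reverse=True)[:n]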
4 Performance study
In this section, we study the performance of our Feature-Reduced User Model (FRUM) against the well-known CF, CB and a hybrid algorithm. For the experiments, the collaborative filtering algorithm is denoted as CF and the content-based algorithm as CB. As representative of the hybrid algorithms, we used the CinemaScreen Recommender Agent (Salter et al. 2006), denoted as CFCB. Factors that are treated as parameters are the following: the neighborhood size (k, default value 10), the size of the recommendation list (N, default value 20) and the size of the training set (default value 75%). The PW threshold is set to 3. Moreover, we consider the division between training and test data. Thus, for each transaction of a test user we keep 75% as hidden data (the data we want to predict) and use the remaining 25% as not hidden data (the data for modeling new users).

The extraction of the content features has been done through the well-known Internet Movie Database (IMDb). We downloaded the plain IMDb database (ftp.fu-berlin.de - October 2006) and selected 4 different classes of features (genres, actors, directors, keywords). Then, we joined the IMDb and the MovieLens data sets. The joining process led to 23 different genres, 9847 keywords, 1050 directors and 2640 different actors and actresses (we selected only the 3 best paid actors or actresses for each movie).

Our evaluation metrics are from the information retrieval field. For a test user that receives a top-N recommendation list, let R denote the number of relevant recommended items (the items of the top-N list that are rated higher than PW by the test user). We define the following: Precision is the ratio of R to N. Recall is the ratio of R to the total number of relevant items for the test user (all items rated higher than PW by him). In the following, we also use F1 = 2 · recall · precision / (recall + precision). F1 is used because it combines both precision and recall.
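The three metrics, written out for a single test user; recommended and relevant are assumed to be the top-N list and the user's positively rated items, respectively.

def evaluate_top_n(recommended, relevant):
    """recommended: the top-N list; relevant: items rated above PW by the test user."""
    hits = len(set(recommended) & set(relevant))   # R
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1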
4.1 Comparative results for CF, CB, CFCB and FRUM algorithms
For the CF algorithms, we compare the two main cases, denoted as user-based (UB) and item-based (IB) algorithms. The former constructs a user-user similarity matrix, while the latter builds an item-item similarity matrix. Both of them exploit the user rating information (user-item matrix R). Figure 6a demonstrates that IB compares favorably against UB for small values of k. For large values of k, both algorithms converge, but never exceed the limit of 40% in terms of precision. The reason is that as the k values increase, both algorithms tend to recommend the most popular items. In the sequel, we will use the IB algorithm as a representative of the CF algorithms.
Fig. 6. Precision vs. k of: (a) UB and IB algorithms, (b) 4 different feature classes, (c) 3 different information percentages of our FRUM model
For the CB algorithms, we have extracted 4 different classes of features from the IMDb database. We test them using the pure content-based CB algorithm to reveal the most effective in terms of accuracy. We create an item-item similarity matrix based on cosine similarity applied solely on the features of items (item-feature matrix F). In Figure 6b, we see results in terms of precision for the four different classes of extracted features. As shown, the best performance is attained for the "keyword" class of content features, which will be the default feature class in the sequel.

Regarding the performance of our FRUM, we preserve, each time, a different fraction of principal components of our model. More specifically, we preserve 70%, 30% and 10% of the total information of the initial user-feature matrix P. The results for precision vs. k are displayed in Figure 6c. As shown, the best performance is attained with 70% of the information preserved. This percentage will be the default value for FRUM in the sequel.
In the following, we test the FRUM algorithm against the CF, CB and CFCB algorithms in terms of precision and recall, based on their best options. In Figure 7a, we plot a precision versus recall curve for all four algorithms. As shown, all algorithms' precision falls as N increases. In contrast, as N increases, recall for all four algorithms increases too. FRUM attains almost 70% precision and 30% recall when we recommend a top-20 list of items. In contrast, CFCB attains 42% precision and 20% recall. FRUM is more robust in finding relevant items for a user. The reason is two-fold: (i) the sparsity has been downsized through the features and (ii) the LSI application reveals the dominant feature trends.

Now we test the impact of the size of the training set. The results for the F1 metric are given in Figure 7b. As expected, when the training set is small, performance downgrades for all algorithms. The FRUM algorithm is better than CF, CB and CFCB in all cases. Moreover, low training set sizes do not have a negative impact on the F1 measure of the FRUM algorithm.
Fig. 7. Comparison of CF, CB, CFCB with FRUM in terms of (a) precision vs. recall, (b) training set size
5 Conclusions
We propose a feature-reduced user model for recommender systems. Our approach builds a feature profile for the users that reveals the real reasons behind their rating behavior. Based on LSI, we include the pseudo-feature user concept in order to reveal the users' real preferences. Our approach significantly outperforms existing CF, CB and hybrid algorithms. In our future work, we will consider the incremental update of our model.
References
BALABANOVIC, M. and SHOHAM, Y. (1997): Fab: Content-based, collaborative recommendation. ACM Communications, volume 40, number 3, 66-72.
FURNAS, G., DEERWESTER, S. et al. (1988): Information retrieval using a singular value decomposition model of latent semantic structure. SIGIR, 465-480.
MELVILLE, P., MOONEY, R. J. and NAGARAJAN, R. (2002): Content-Boosted Collaborative Filtering for Improved Recommendations. AAAI, 187-192.
SALTER, J. and ANTONOPOULOS, N. (2006): CinemaScreen Recommender Agent: Combining Collaborative and Content-Based Filtering. Intelligent Systems Magazine, volume 21, number 1, 35-41.
SARWAR, B., KARYPIS, G., KONSTAN, J. and RIEDL, J. (2000): Application of dimensionality reduction in recommender system - A case study. ACM WebKDD Workshop.
SCHULT, R. and SPILIOPOULOU, M. (2006): Discovering Emerging Topics in Unlabelled Text Collections. ADBIS 2006, 353-366.
TSO, K. and SCHMIDT-THIEME, L. (2005): Attribute-aware Collaborative Filtering. German Classification Society GfKl 2005.
New Issues in Near-duplicate Detection
Martin Potthast and Benno Stein
Bauhaus University Weimar
99421 Weimar, Germany
{martin.potthast, benno.stein}@medien.uni-weimar.de
Abstract. Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an integrative view as well as new aspects from the field of near-duplicate detection: (i) Principles and Taxonomy. Identification and discussion of the principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics. Presentation of a corpus that is specifically suited for the analysis and evaluation of near-duplicate detection algorithms. The corpus is public and may serve as a starting point for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of state-of-the-art algorithms for near-duplicate detection with respect to their retrieval properties. This analysis goes beyond existing surveys and includes recent developments from the field of hash-based search.

1 Introduction
In this paper two documents are considered as near-duplicates if they share a very large part of their vocabulary. Near-duplicates occur in many document collections, of which the most prominent one is the World Wide Web. Recent studies of Fetterly et al. (2003) and Broder et al. (2006) show that about 30% of all Web documents are duplicates of others. Zobel and Bernstein (2006) give examples which include mirror sites, revisions and versioned documents, or standard text building blocks such as disclaimers. The negative impact of near-duplicates on Web search engines is threefold: indexes waste storage space, search result listings can be cluttered with almost identical entries, and crawlers have a high probability of exploring pages whose content has already been acquired.

Content duplication also happens through text plagiarism, which is the attempt to present other people's text as one's own work. Note that in the plagiarism situation document content is duplicated at the level of short passages; plagiarized passages can also be modified to a smaller or larger extent in order to obscure the offense.
Aside from deliberate content duplication, copying also happens accidentally: in companies, universities, or public administrations documents are stored multiple times, simply because employees are not aware of already existing previous work (Forman et al. (2005)). A similar situation is given for social software such as customer review boards or comment boards, where many users publish their opinion about some topic of interest: users with the same opinion write essentially the same in diverse ways since they do not read all existing contributions.
A solution to the outlined problems requires a reliable recognition of near-duplicates – preferably at a high runtime performance. These objectives compete with each other; a compromise in recognition quality entails deficiencies with respect to retrieval precision and retrieval recall. A reliable approach to identify two documents d and d_q as near-duplicates is to represent them under the vector space model, referred to as d and d_q, and to measure their similarity under the l2-norm or the enclosed angle. d and d_q are considered as near-duplicates if the following condition holds:

M(d, d_q) ≥ 1 − ε,

where M denotes a similarity function that maps onto the interval [0,1] and ε is a small constant. To achieve a recall of 1 with this approach, each pair of documents must be analyzed. Likewise, given d_q and a document collection D, the computation of the set D_q, D_q ⊂ D, of all near-duplicates of d_q in D requires O(|D|), say, linear time in the collection size. The reason lies in the high dimensionality of the document representation d, where "high" means "more than 10": objects represented as high-dimensional vectors cannot be searched efficiently by means of space-partitioning methods such as kd-trees, quad-trees, or R-trees but are outperformed by a sequential scan (Weber et al. (1998)).
By relaxing the retrieval requirements in terms of precision and recall, the runtime performance can be significantly improved. The basic idea is to estimate the similarity between d and d_q by means of fingerprinting. A fingerprint, F_d, is a set of k numbers computed from d. If two fingerprints, F_d and F_{d_q}, share at least N numbers, N ≤ k, it is assumed that d and d_q are near-duplicates, i.e., their similarity is estimated using the Jaccard coefficient:

M(d, d_q) ≈ |F_d ∩ F_{d_q}| / |F_d ∪ F_{d_q}|
Let F_D = ⋃_{d∈D} F_d denote the union of the fingerprints of all documents in D, let 2^D be the power set of D, and let z : F_D → 2^D, x ↦ z(x), be an inverted file index that maps a number x ∈ F_D onto the set of documents whose fingerprints contain x; z(x) is also called the postlist of x. For document d_q with fingerprint F_{d_q}, consider now the set D̂_q ⊂ D of documents that occur in at least N of the postlists z(x), x ∈ F_{d_q}. Put another way, D̂_q consists of documents whose fingerprints share at least N numbers with F_{d_q}. We use D̂_q as a heuristic approximation of D_q, whereas the retrieval performance, which depends on the finesse of the fingerprint construction, computes as follows:

prec = |D̂_q ∩ D_q| / |D̂_q|,    rec = |D̂_q ∩ D_q| / |D_q|
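A sketch of this postlist-based retrieval, assuming fingerprints are already available as sets of integers; build_index, near_duplicates and the threshold handling are illustrative, not the authors' implementation.

from collections import defaultdict

def build_index(fingerprints):
    """fingerprints: doc_id -> set of k numbers. Returns the inverted index z."""
    z = defaultdict(set)
    for doc_id, f in fingerprints.items():
        for x in f:
            z[x].add(doc_id)
    return z

def near_duplicates(f_q, z, N):
    """Return D_hat_q: documents occurring in at least N postlists z(x), x in f_q."""
    counts = defaultdict(int)
    for x in f_q:
        for doc_id in z[x]:
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= N}

def jaccard(f_a, f_b):
    """Similarity estimate between two fingerprints."""
    return len(f_a & f_b) / len(f_a | f_b)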
Fig. 1. Taxonomy of fingerprint construction methods (left) and algorithms (right): projecting-based construction with collection-specific, synchronized, local and (pseudo-)random chunk selection (rare chunks, SPEX, I-Match, shingling, prefix anchors, hashed breakpoints, winnowing, random, sliding window), and embedding-based construction, either knowledge-based (fuzzy-fingerprinting) or randomized (locality-sensitive hashing)
The remainder of the paper is organized as follows. Section 2 gives an overview of fingerprint construction methods and classifies them in a taxonomy, including so far unconsidered hashing technologies. In particular, different aspects of fingerprint construction are contrasted and a comprehensive view on their retrieval properties is presented. Section 3 deals with evaluation methodologies for near-duplicate detection and proposes a new benchmark corpus of realistic size. The state-of-the-art fingerprint construction methods are subject to an experimental analysis using this corpus, providing new insights into precision and recall performance.

2 Fingerprint construction
A chunk or an n-gram of a document d is a sequence of n consecutive words found in d.¹ Let C_d be the set of all different chunks of d. Note that C_d is at most of size |d| − n and can be assessed in O(|d|). Let d be a vector space representation of d where each c ∈ C_d is used as the descriptor of a dimension with a non-zero weight.
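A minimal sketch of C_d, the set of all different word n-grams of a document (whitespace tokenization is an assumption):

def chunk_set(text, n=3):
    """All different chunks (word n-grams) of a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}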
According to Stein (2007), the construction of a fingerprint from d can be understood as a three-step procedure, consisting of dimensionality reduction, quantization, and encoding:

1. Dimensionality reduction is realized by projecting or by embedding. Algorithms of the former type select dimensions of d whose values occur unmodified in the reduced vector. Algorithms of the latter type reformulate d as a whole, maintaining as much information as possible.

2. Quantization is the mapping of the elements of the reduced vector onto small integer numbers.

3. Encoding computes the final fingerprint codes from the quantized vector.
¹ If the hashed breakpoint chunking strategy of Brin et al. (1995) is applied, n can be understood as the expected value of the chunk length.
Table 1. Summary of chunk selection heuristics. The rows contain the name of the construction algorithm along with typical constraints that must be fulfilled by the selection heuristic V.

Algorithm (Author)                                   Selection heuristic V(c)
rare chunks (Heintze (1996))                         c occurs once in D
SPEX (Bernstein and Zobel (2004))                    c occurs at least twice in D
I-Match (Chowdhury et al. (2002),                    c = d; excluding non-discriminant terms of d
  Conrad et al. (2003), Kołcz et al. (2004))
shingling (Broder (2000))                            c ∈ {c1, ..., ck}, {c1, ..., ck} ⊂_rand C_d
prefix anchor (Manber (1994),                        c starts with a particular prefix, or
  Heintze (1996))                                    c starts with a prefix which is infrequent in d
hashed breakpoints (Manber (1994),                   h(c)'s last byte is 0, or
  Brin et al. (1995))                                c's last word's hash value is 0
winnowing (Schleimer et al. (2003))                  c minimizes h(c) in a window sliding over d
random (misc.)                                       c is part of a local random choice from C_d
one of a sliding window (misc.)                      c starts at word i mod m in d; 1 ≤ m ≤ |d|
super-/megashingling (Broder (2000) /                c is a combination of hashed chunks
  Fetterly et al. (2003))                            which have been selected with shingling
2.1 Dimensionality reduction by projecting
If dimensionality reduction is done by projecting, a fingerprint F_d for document d can be formally defined as follows:

F_d = {h(c) | c ∈ C_d and V(c) = true},
where V denotes a selection heuristic for dimensionality reduction that becomes true if a chunk fulfills a certain property. h denotes a hash function, such as MD5 or Rabin's hash function, which maps chunks to natural numbers and serves as a means for quantization. Usually the identity mapping is applied as encoding rule; Broder (2000) describes a more intricate encoding rule called supershingling.

The objective of V is to select chunks to be part of a fingerprint which are best suited for a reliable near-duplicate identification. Table 1 presents in a consistent way the algorithms and the implemented selection heuristics found in the literature, whereas a heuristic is of one of the types denoted in Figure 1.
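A sketch of a projecting-based fingerprint in the spirit of Table 1, using MD5 as the hash function h and a simple "hash value divisible by a modulus" selection heuristic; this stands in for, but is not identical to, any particular published heuristic.

import hashlib

def h(chunk):
    """Hash a chunk (tuple of words) to a natural number (here via MD5)."""
    digest = hashlib.md5(" ".join(chunk).encode("utf-8")).hexdigest()
    return int(digest, 16)

def fingerprint(text, n=3, modulus=4):
    """F_d = {h(c) | c in C_d and V(c) = true}, where V(c) keeps a chunk
    whose hash value is divisible by `modulus` (one simple heuristic)."""
    words = text.split()
    chunks = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    selected = set()
    for c in chunks:
        code = h(c)
        if code % modulus == 0:   # the selection heuristic V(c)
            selected.add(code)
    return selected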
2.2 Dimensionality reduction by embedding
An embedding-based fingerprint F_d for a document d is typically constructed with a technique called "similarity hashing" (Indyk and Motwani (1998)). Unlike standard hash functions, which aim at a minimization of the number of hash collisions, a similarity hash function h_M : D → U, U ⊂ N, shall produce a collision with high probability for two objects d, d_q ∈ D, iff M(d, d_q) ≥ 1 − ε. In this way h_M downgrades a fine-grained similarity relation quantified within M to the concept "similar or not similar", reflected by the fact whether or not the hashcodes h_M(d) and h_M(d_q) are identical.
Table 2. Summary of complexities for the construction of a fingerprint, the retrieval, and the size of a tailored chunk index.

Algorithm                   Construction  Retrieval   Chunk length  Fingerprint size  Chunk index size
rare chunks                 O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
SPEX                        O(|d|)        O(r · |d|)  n             O(r · |d|)        O(r · |d| · |D|)
prefix anchor               O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
hashed breakpoints          O(|d|)        O(|d|)      E(|c|) = n    O(|d|)            O(|d| · |D|)
winnowing                   O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
one of sliding window       O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
super-/megashingling        O(|d|)        O(k)        n             O(k)              O(k · |D|)
fuzzy-fingerprinting        O(|d|)        O(k)        |d|           O(k)              O(k · |D|)
locality-sensitive hashing  O(|d|)        O(k)        |d|           O(k)              O(k · |D|)
To construct a fingerprint F_d for document d, a small number of k variants of h_M is used:

F_d = {h_M^(i)(d) | i ∈ {1, ..., k}}
Two kinds of similarity hash functions have been proposed, which either compute hashcodes based on knowledge about the domain or which are based on domain-independent randomization techniques (see again Figure 1). Both kinds compute hashcodes along the three steps outlined above. An example of the former is fuzzy-fingerprinting, developed by Stein (2005), where the embedding step relies on a tailored, low-dimensional document model and where fuzzification is applied as a means for quantization. An example of the latter is locality-sensitive hashing and the variants thereof by Charikar (2002) and Datar et al. (2004). Here the embedding relies on the computation of scalar products of d with random vectors, and the scalar products are mapped onto predefined intervals on the real number line as a means for quantization. In both approaches the encoding happens according to a summation rule.
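A sketch of randomized similarity hashing in the spirit of Charikar's random-projection approach: each hash variant is a set of random hyperplanes, the embedding consists of the scalar products of d with them, quantization keeps only their signs, and the sign bits are encoded into one hashcode. The number of planes, the seed handling and the bit encoding are illustrative choices, not a specific published configuration.

import numpy as np

def make_hash_variant(dim, bits, rng):
    """One similarity-hash variant: a set of `bits` random hyperplanes."""
    return rng.standard_normal((bits, dim))

def sim_hash(d, planes):
    """Embed d via scalar products with random vectors, quantize to sign bits,
    and encode the bits into a single hashcode."""
    bits = (planes @ d) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

def fingerprint(d, k=3, bits=16, seed=0):
    """F_d = {h_M^(i)(d) | i = 1..k}: one hashcode per variant."""
    rng = np.random.default_rng(seed)
    return {sim_hash(d, make_hash_variant(len(d), bits, rng)) for _ in range(k)}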
2.3 Discussion
We have analyzed the aforementioned fingerprint construction methods with respect to construction time, retrieval time, and the resulting size of a complete chunk index. Table 2 compiles the results.
The construction of a fingerprint for a document d depends on its length since
d has to be parsed at least once, which explains why all methods have the same complexity in this respect. The retrieval of near-duplicates requires a chunk index
z as described at the outset: z is queried with each number of a query document’s