Content-based Dimensionality Reduction for Recommender Systems

Panagiotis Symeonidis
In this paper, we construct a feature profile of a user to reveal the duality between users and features. For instance, in a movie recommender system, a user prefers a movie for various reasons, such as the actors, the director or the genre of the movie. All these features affect the choice of each user differently. Then, we apply the Latent Semantic Indexing (LSI) model to reveal the dominant features of a user. Finally, we provide recommendations according to this dimensionally-reduced feature profile. Our experiments with a real-life data set show the superiority of our approach over existing CF, CB and hybrid approaches.
The rest of this paper is organized as follows: Section 2 summarizes the related work. The proposed approach is described in Section 3. Experimental results are given in Section 4. Finally, Section 5 concludes this paper.
2 Related work

Schult and Spiliopoulou (2006) discover emerging topics in document collections. Moreover, in the IR area, Furnas et al. (1988) proposed LSI to detect the latent semantic relationship between terms and documents. Sarwar et al. (2000) applied dimensionality reduction for the user-based CF approach.
There have been several attempts to combine CB with CF. The Fab System (Balabanovic et al. 1997) measures similarity between users after first computing a content profile for each user. This process reverses the CinemaScreen System (Salter et al. 2006), which runs CB on the results of CF. Melville et al. (2002) used a content-based predictor to enhance existing user data and then provide personalized suggestions through collaborative filtering. Finally, Tso and Schmidt-Thieme (2005) proposed three attribute-aware CF methods, applying the CB and CF paradigms in two separate processes before combining them at the point of prediction.

All the aforementioned approaches are hybrid: they either run CF on the results of CB or vice versa. Our model discloses the duality between user ratings and item features to reveal the actual reasons behind users' rating behavior. Moreover, we apply LSI on the feature profiles of users to reveal the principal features. Then, we use a similarity measure which is based on features, revealing the real preferences in the user's rating behavior.
3 The proposed approach
Our approach constructs a feature profile of a user, based on both collaborative andcontent features Then, we apply LSI to reveal the dominant features trends Finally,
we provide recommendations according to this dimensionally-reduced feature profile
of the users
3.1 Defining rating, item and feature profiles
CF algorithms process the rating data of the users to provide accurate recommendations. An example of rating data is given in Figures 1a and 1b. As shown, the example data set (matrix R) is divided into a training and a test set, where I1-I12 are items and U1-U4 are users. The null cells (no rating) are presented with a dash and the rating scale is between [1-5], where 1 means strong dislike, while 5 means strong like.
Definition 1. The rating profile R(U_k) of user U_k is the k-th row of matrix R.

For instance, R(U1) is the rating profile of user U1, and consists of the rated items I1, I2, I3, I4, I8 and I10. The rating of a user u over an item i is given by the element R(u, i) of matrix R.
Definition 2. The item profile F(I_k) of item I_k is the k-th row of matrix F.

For instance, F(I1) is the profile of item I1, and consists of features F1 and F2. Notice that this matrix is not always boolean. Thus, if we processed documents, matrix F would count frequencies of terms.
To capture the interaction between users and their favorite features, we construct a feature profile composed of the rating profile and the item profile.
For the construction of the feature profile of a user, we use a positive rating threshold, PW, to select items from his rating profile whose rating is not less than this value. The reason is that the rating profile of a user consists of ratings that take values
from a scale (in our running example, the 1-5 scale). It is evident that the selected ratings should be "positive", as the user does not favor an item that is rated with 1 on a 1-5 scale.
Definition 3. The feature profile P(U_k) of user U_k is the k-th row of matrix P, whose elements P(u, f) are given by Equation 1.
For instance, P(U2) is the feature profile of user U2, and consists of features f1, f2 and f3. The correlation of a user U_k with a feature f is given by the element P(U_k, f) of matrix P. As shown, feature f2 describes him better than feature f1 does.
Fig. 2. User-feature matrix P divided into (a) training set (n × m), (b) test set
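As a concrete illustration of the feature-profile construction (Equation 1 is not reproduced above), the following sketch assumes that P(u, f) simply counts how many items rated at least PW by user u carry feature f; the toy matrices, the threshold value and all variable names are illustrative.

import numpy as np

# Toy rating matrix R (users x items); 0 denotes "no rating".
R = np.array([
    [5, 4, 0, 1],
    [0, 3, 4, 0],
    [2, 0, 5, 4],
])

# Boolean item-feature matrix F (items x features).
F = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])

PW = 3  # positive rating threshold

# Select, per user, the items whose rating is not less than PW ...
positive = (R >= PW).astype(int)   # users x items
# ... and accumulate the features of those items into the feature profile P.
P = positive @ F                   # users x features

print(P)  # P[u, f]: number of positively rated items of user u that carry feature f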
3.2 Applying SVD on training data
Initially, we apply Singular Value Decomposition (SVD) on the training data of matrix P, which produces three matrices, based on Equation 2, as shown in Figure 3:

P_{n×m} = U_{n×n} · S_{n×m} · V^T_{m×m}    (2)
3.3 Preserving the principal components
It is possible to reduce the n × m matrix S to keep only its c largest singular values. Then, the reconstructed matrix is the closest rank-c approximation of the initial matrix P, as shown in Equation 3 and Figure 4:

P*_{n×m} = U_{n×c} · S_{c×c} · V^T_{c×m}    (3)

Fig. 4. Example of: P*_{n×m} (approximation matrix of P), U_{n×c} (left singular vectors of P*), S_{c×c} (singular values of P*), V^T_{c×m} (right singular vectors of P*)
We tune the number, c, of principal components (i.e., dimensions) with the objective to reveal the major feature trends. The tuning of c is determined by the information percentage that is preserved compared to the original matrix.
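A minimal sketch of the SVD and truncation step, assuming the preserved information percentage is measured as the retained fraction of the singular-value mass (the exact criterion is not spelled out above); names and data are illustrative.

import numpy as np

def truncated_svd(P, info_fraction=0.7):
    """Return U_c, S_c, Vt_c keeping the smallest c that preserves the
    requested fraction of the singular-value mass of P."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    retained = np.cumsum(s) / s.sum()
    c = int(np.searchsorted(retained, info_fraction)) + 1
    return U[:, :c], np.diag(s[:c]), Vt[:c, :]

P = np.random.rand(4, 6)           # toy user-feature matrix
U_c, S_c, Vt_c = truncated_svd(P, 0.7)
P_star = U_c @ S_c @ Vt_c          # closest rank-c approximation of P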
3.4 Inserting a test user in the c-dimensional space
Given the current feature profile of the test user u, as illustrated in Figure 2b, we enter the pseudo-user vector into the c-dimensional space using Equation 4. In our example, we insert U4 into the 2-dimensional space, as shown in Figure 5:

u_new = u · V_{m×c} · S^{-1}_{c×c}    (4)

In Equation 4, u_new denotes the mapped ratings of the test user u, whereas V_{m×c} and S^{-1}_{c×c} are matrices derived from SVD. This u_new vector should be added at the end of the U_{n×c} matrix which is shown in Figure 4.
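A sketch of this folding-in step under the reconstruction of Equation 4 above; the helper name fold_in and the toy data are illustrative.

import numpy as np

def fold_in(u_test, Vt_c, S_c):
    """Map a test user's raw feature profile (length m) into the c-dimensional space."""
    return u_test @ Vt_c.T @ np.linalg.inv(S_c)

# Toy setup: U_c, S_c, Vt_c as obtained from the training user-feature matrix P.
P = np.random.rand(4, 6)
U, s, Vt = np.linalg.svd(P, full_matrices=False)
c = 2
U_c, S_c, Vt_c = U[:, :c], np.diag(s[:c]), Vt[:c, :]

u_test = np.random.rand(6)            # feature profile of the new (pseudo-)user
u_new = fold_in(u_test, Vt_c, S_c)    # length-c vector, appended below U_c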
3.5 Generating the Neighborhood of users/items
In our model, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional space. The similarities between training and test users can be based on cosine similarity. First, we compute the matrix U_{n×c} · S_{c×c} and then we perform vector similarity on it. This n × c matrix is the c-dimensional representation of the n users.
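A sketch of the neighborhood formation: rows of U_{n×c} · S_{c×c} serve as the c-dimensional user representations and cosine similarity ranks them against the folded-in pseudo-user vector; the function name and the handling of ties are illustrative.

import numpy as np

def k_nearest_neighbors(U_c, S_c, u_new, k=10):
    """Return indices of the k training users most similar (cosine) to u_new."""
    train = U_c @ S_c                                  # c-dimensional user representations
    sims = (train @ u_new) / (
        np.linalg.norm(train, axis=1) * np.linalg.norm(u_new) + 1e-12)
    neighbors = np.argsort(sims)[::-1][:k]             # highest similarity first
    return neighbors, sims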
3.6 Generating the top-N recommendation list
The most often used technique for the generation of the top-N list is the one that counts the frequency of each positively rated item inside the found neighborhood and recommends the N most frequent ones. Our approach differentiates from this technique by exploiting the item features. In particular, for each feature f inside the found neighborhood, we add up its frequency. Then, based on the features that an item consists of, we compute its weight in the neighborhood. Our method takes into account the fact that each user has his own reasons for rating an item.
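One plausible reading of this feature-weighted ranking, sketched below: feature frequencies are accumulated over the neighbors' positively rated items and each candidate item is scored by the summed weight of its features. The exact weighting used in the paper may differ, and the data structures shown are assumptions.

from collections import Counter

def top_n(neighbors, user_pos_items, item_features, candidate_items, n=20):
    """neighbors: ids of the k nearest users; user_pos_items: user -> set of
    positively rated items; item_features: item -> set of features."""
    # 1. Count how often each feature occurs in the neighborhood.
    feature_weight = Counter()
    for u in neighbors:
        for item in user_pos_items[u]:
            feature_weight.update(item_features[item])
    # 2. Score each candidate item by the summed weight of its features.
    scores = {i: sum(feature_weight[f] for f in item_features[i])
              for i in candidate_items}
    # 3. Recommend the N highest-scoring items.
    return sorted(scores, key=scores.get, reverse=True)[:n]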
4 Performance study
In this section, we study the performance of our Feature-Reduced User Model (FRUM) against the well-known CF, CB and a hybrid algorithm. For the experiments, the collaborative filtering algorithm is denoted as CF and the content-based algorithm as CB. As representative of the hybrid algorithms, we used the CinemaScreen Recommender Agent (Salter et al. 2006), denoted as CFCB. Factors that are treated as parameters are the following: the neighborhood size (k, default value 10), the size of the recommendation list (N, default value 20) and the size of the training set (default value 75%). The PW threshold is set to 3. Moreover, we consider the division between training and test data. Thus, for each transaction of a test user we keep 75% as hidden data (the data we want to predict) and use the remaining 25% as not hidden data (the data for modeling new users).

The extraction of the content features has been done through the well-known Internet Movie Database (IMDb). We downloaded the plain IMDb database (ftp.fu-berlin.de - October 2006) and selected 4 different classes of features (genres, actors, directors, keywords). Then, we joined the IMDb and the MovieLens data sets. The joining process led to 23 different genres, 9847 keywords, 1050 directors and 2640 different actors and actresses (we selected only the 3 best paid actors or actresses for each movie).

Our evaluation metrics are from the information retrieval field. For a test user that receives a top-N recommendation list, let R denote the number of relevant recommended items (the items of the top-N list that are rated higher than PW by the test user). We define the following: Precision is the ratio of R to N. Recall is the ratio of R to the total number of relevant items for the test user (all items rated higher than PW by him). In the following, we also use F1 = 2 · recall · precision / (recall + precision). F1 is used because it combines both precision and recall.
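The three metrics, written out for a single test user; recommended and relevant are assumed to be the top-N list and the user's positively rated items, respectively.

def evaluate_top_n(recommended, relevant):
    """recommended: the top-N list; relevant: items rated above PW by the test user."""
    hits = len(set(recommended) & set(relevant))   # R
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1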
4.1 Comparative results for CF, CB, CFCB and FRUM algorithms
For the CF algorithms, we compare the two main cases, denoted as user-based (UB) and item-based (IB) algorithms. The former constructs a user-user similarity matrix, while the latter builds an item-item similarity matrix. Both of them exploit the user rating information (user-item matrix R). Figure 6a demonstrates that IB compares favorably against UB for small values of k. For large values of k, both algorithms converge, but never exceed the limit of 40% in terms of precision. The reason is that as the k values increase, both algorithms tend to recommend the most popular items. In the sequel, we will use the IB algorithm as a representative of the CF algorithms.
Fig. 6. Precision vs. k of: (a) UB and IB algorithms, (b) 4 different feature classes, (c) 3 different information percentages of our FRUM model
For the CB algorithms, we have extracted 4 different classes of features from the IMDb database. We test them using the pure content-based CB algorithm to reveal the most effective in terms of accuracy. We create an item-item similarity matrix based on cosine similarity applied solely on the features of items (item-feature matrix F). In Figure 6b, we see results in terms of precision for the four different classes of extracted features. As shown, the best performance is attained for the "keyword" class of content features, which will be the default feature class in the sequel.

Regarding the performance of our FRUM, we preserve, each time, a different fraction of principal components of our model. More specifically, we preserve 70%, 30% and 10% of the total information of the initial user-feature matrix P. The results for precision vs. k are displayed in Figure 6c. As shown, the best performance is attained with 70% of the information preserved. This percentage will be the default value for FRUM in the sequel.
In the following, we test the FRUM algorithm against the CF, CB and CFCB algorithms in terms of precision and recall, based on their best options. In Figure 7a, we plot a precision versus recall curve for all four algorithms. As shown, all algorithms' precision falls as N increases. In contrast, as N increases, recall for all four algorithms increases too. FRUM attains almost 70% precision and 30% recall when we recommend a top-20 list of items. In contrast, CFCB attains 42% precision and 20% recall. FRUM is more robust in finding relevant items for a user. The reason is two-fold: (i) the sparsity has been downsized through the features and (ii) the LSI application reveals the dominant feature trends.

Now we test the impact of the size of the training set. The results for the F1 metric are given in Figure 7b. As expected, when the training set is small, performance downgrades for all algorithms. The FRUM algorithm is better than CF, CB and CFCB in all cases. Moreover, low training set sizes do not have a negative impact on the F1 measure of the FRUM algorithm.
Fig. 7. Comparison of CF, CB, CFCB with FRUM in terms of (a) precision vs. recall, (b) training set size
5 Conclusions
We propose a feature-reduced user model for recommender systems. Our approach builds a feature profile for the users that reveals the real reasons behind their rating behavior. Based on LSI, we include the pseudo-feature user concept in order to reveal the users' real preferences. Our approach significantly outperforms existing CF, CB and hybrid algorithms. In our future work, we will consider the incremental update of our model.
References
BALABANOVIC, M. and SHOHAM, Y. (1997): Fab: Content-based, collaborative recommendation. ACM Communications, volume 40, number 3, 66-72.
FURNAS, G., DEERWESTER, S. et al. (1988): Information retrieval using a singular value decomposition model of latent semantic structure. SIGIR, 465-480.
MELVILLE, P., MOONEY, R. J. and NAGARAJAN, R. (2002): Content-Boosted Collaborative Filtering for Improved Recommendations. AAAI, 187-192.
SALTER, J. and ANTONOPOULOS, N. (2006): CinemaScreen Recommender Agent: Combining Collaborative and Content-Based Filtering. Intelligent Systems Magazine, volume 21, number 1, 35-41.
SARWAR, B., KARYPIS, G., KONSTAN, J. and RIEDL, J. (2000): Application of dimensionality reduction in recommender system - A case study. ACM WebKDD Workshop.
SCHULT, R. and SPILIOPOULOU, M. (2006): Discovering Emerging Topics in Unlabelled Text Collections. ADBIS 2006, 353-366.
TSO, K. and SCHMIDT-THIEME, L. (2005): Attribute-aware Collaborative Filtering. German Classification Society GfKl 2005.
New Issues in Near-duplicate Detection
Martin Potthast and Benno Stein
Bauhaus University Weimar
99421 Weimar, Germany
{martin.potthast, benno.stein}@medien.uni-weimar.de
Abstract. Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an integrative view as well as new aspects from the field of near-duplicate detection: (i) Principles and Taxonomy. Identification and discussion of the principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics. Presentation of a corpus that is specifically suited for the analysis and evaluation of near-duplicate detection algorithms. The corpus is public and may serve as a starting point for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of state-of-the-art algorithms for near-duplicate detection with respect to their retrieval properties. This analysis goes beyond existing surveys and includes recent developments from the field of hash-based search.

1 Introduction
In this paper two documents are considered as near-duplicates if they share a very large part of their vocabulary. Near-duplicates occur in many document collections, of which the most prominent one is the World Wide Web. Recent studies of Fetterly et al. (2003) and Broder et al. (2006) show that about 30% of all Web documents are duplicates of others. Zobel and Bernstein (2006) give examples which include mirror sites, revisions and versioned documents, or standard text building blocks such as disclaimers. The negative impact of near-duplicates on Web search engines is threefold: indexes waste storage space, search result listings can be cluttered with almost identical entries, and crawlers have a high probability of exploring pages whose content has already been acquired.

Content duplication also happens through text plagiarism, which is the attempt to present other people's text as one's own work. Note that in the plagiarism situation document content is duplicated at the level of short passages; plagiarized passages can also be modified to a smaller or larger extent in order to obscure the offense.
Aside from deliberate content duplication, copying also happens accidentally: in companies, universities, or public administrations documents are stored multiple times, simply because employees are not aware of already existing previous work (Forman et al. (2005)). A similar situation is given for social software such as customer review boards or comment boards, where many users publish their opinion about some topic of interest: users with the same opinion write essentially the same in diverse ways since they do not read all existing contributions.
A solution to the outlined problems requires a reliable recognition of near-duplicates – preferably at a high runtime performance. These objectives compete with each other; a compromise in recognition quality entails deficiencies with respect to retrieval precision and retrieval recall. A reliable approach to identify two documents d and d_q as near-duplicates is to represent them under the vector space model, referred to as d and d_q, and to measure their similarity under the l2-norm or the enclosed angle. d and d_q are considered as near-duplicates if the following condition holds:

M(d, d_q) ≥ 1 − ε,

where M denotes a similarity function that maps onto the interval [0,1] and ε is a small constant. To achieve a recall of 1 with this approach, each pair of documents must be analyzed. Likewise, given d_q and a document collection D, the computation of the set D_q, D_q ⊂ D, of all near-duplicates of d_q in D requires O(|D|), say, linear time in the collection size. The reason lies in the high dimensionality of the document representation d, where "high" means "more than 10": objects represented as high-dimensional vectors cannot be searched efficiently by means of space-partitioning methods such as kd-trees, quad-trees, or R-trees but are outperformed by a sequential scan (Weber et al. (1998)).
By relaxing the retrieval requirements in terms of precision and recall, the runtime performance can be significantly improved. The basic idea is to estimate the similarity between d and d_q by means of fingerprinting. A fingerprint, F_d, is a set of k numbers computed from d. If two fingerprints, F_d and F_{d_q}, share at least N numbers, N ≤ k, it is assumed that d and d_q are near-duplicates, i.e., their similarity is estimated using the Jaccard coefficient:

M(d, d_q) ≈ |F_d ∩ F_{d_q}| / |F_d ∪ F_{d_q}|
Let F_D = ⋃_{d∈D} F_d denote the union of the fingerprints of all documents in D, let 2^D be the power set of D, and let z : F_D → 2^D, x ↦ z(x), be an inverted file index that maps a number x ∈ F_D onto the set of documents whose fingerprints contain x; z(x) is also called the postlist of x. For document d_q with fingerprint F_{d_q}, consider now the set D̂_q ⊂ D of documents that occur in at least N of the postlists z(x), x ∈ F_{d_q}. Put another way, D̂_q consists of documents whose fingerprints share at least N numbers with F_{d_q}. We use D̂_q as a heuristic approximation of D_q, whereas the retrieval performance, which depends on the finesse of the fingerprint construction, computes as follows:

prec = |D̂_q ∩ D_q| / |D̂_q|,    rec = |D̂_q ∩ D_q| / |D_q|
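A sketch of this postlist-based retrieval, assuming fingerprints are already available as sets of integers; build_index, near_duplicates and the threshold handling are illustrative, not the authors' implementation.

from collections import defaultdict

def build_index(fingerprints):
    """fingerprints: doc_id -> set of k numbers. Returns the inverted index z."""
    z = defaultdict(set)
    for doc_id, f in fingerprints.items():
        for x in f:
            z[x].add(doc_id)
    return z

def near_duplicates(f_q, z, N):
    """Return D_hat_q: documents occurring in at least N postlists z(x), x in f_q."""
    counts = defaultdict(int)
    for x in f_q:
        for doc_id in z[x]:
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= N}

def jaccard(f_a, f_b):
    """Similarity estimate between two fingerprints."""
    return len(f_a & f_b) / len(f_a | f_b)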
Fig. 1. Taxonomy of fingerprint construction methods (left) and algorithms (right): projecting-based construction with collection-specific, synchronized, local and (pseudo-)random chunk selection (rare chunks, SPEX, I-Match, shingling, prefix anchors, hashed breakpoints, winnowing, random, sliding window), and embedding-based construction, either knowledge-based (fuzzy-fingerprinting) or randomized (locality-sensitive hashing)
The remainder of the paper is organized as follows. Section 2 gives an overview of fingerprint construction methods and classifies them in a taxonomy, including so far unconsidered hashing technologies. In particular, different aspects of fingerprint construction are contrasted and a comprehensive view on their retrieval properties is presented. Section 3 deals with evaluation methodologies for near-duplicate detection and proposes a new benchmark corpus of realistic size. The state-of-the-art fingerprint construction methods are subject to an experimental analysis using this corpus, providing new insights into precision and recall performance.

2 Fingerprint construction
A chunk or an n-gram of a document d is a sequence of n consecutive words found in d.¹ Let C_d be the set of all different chunks of d. Note that C_d is at most of size |d| − n and can be assessed in O(|d|). Let d be a vector space representation of d where each c ∈ C_d is used as the descriptor of a dimension with a non-zero weight.
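A minimal sketch of C_d, the set of all different word n-grams of a document (whitespace tokenization is an assumption):

def chunk_set(text, n=3):
    """All different chunks (word n-grams) of a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}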
According to Stein (2007), the construction of a fingerprint from d can be understood as a three-step procedure, consisting of dimensionality reduction, quantization, and encoding:

1. Dimensionality reduction is realized by projecting or by embedding. Algorithms of the former type select dimensions of d whose values occur unmodified in the reduced vector. Algorithms of the latter type reformulate d as a whole, maintaining as much information as possible.

2. Quantization is the mapping of the elements of the reduced vector onto small integer numbers.

3. Encoding computes the final fingerprint codes from the quantized vector.
¹ If the hashed breakpoint chunking strategy of Brin et al. (1995) is applied, n can be understood as the expected value of the chunk length.
Table 1. Summary of chunk selection heuristics. The rows contain the name of the construction algorithm along with typical constraints that must be fulfilled by the selection heuristic V.

Algorithm (Author)                                   Selection heuristic V(c)
rare chunks (Heintze (1996))                         c occurs once in D
SPEX (Bernstein and Zobel (2004))                    c occurs at least twice in D
I-Match (Chowdhury et al. (2002),                    c = d; excluding non-discriminant terms of d
  Conrad et al. (2003), Kołcz et al. (2004))
shingling (Broder (2000))                            c ∈ {c1, ..., ck}, {c1, ..., ck} ⊂_rand C_d
prefix anchor (Manber (1994),                        c starts with a particular prefix, or
  Heintze (1996))                                    c starts with a prefix which is infrequent in d
hashed breakpoints (Manber (1994),                   h(c)'s last byte is 0, or
  Brin et al. (1995))                                c's last word's hash value is 0
winnowing (Schleimer et al. (2003))                  c minimizes h(c) in a window sliding over d
random (misc.)                                       c is part of a local random choice from C_d
one of a sliding window (misc.)                      c starts at word i mod m in d; 1 ≤ m ≤ |d|
super-/megashingling (Broder (2000) /                c is a combination of hashed chunks
  Fetterly et al. (2003))                            which have been selected with shingling
2.1 Dimensionality reduction by projecting
If dimensionality reduction is done by projecting, a fingerprint F_d for document d can be formally defined as follows:

F_d = {h(c) | c ∈ C_d and V(c) = true},
where V denotes a selection heuristic for dimensionality reduction that becomes true if a chunk fulfills a certain property. h denotes a hash function, such as MD5 or Rabin's hash function, which maps chunks to natural numbers and serves as a means for quantization. Usually the identity mapping is applied as encoding rule; Broder (2000) describes a more intricate encoding rule called supershingling.

The objective of V is to select chunks to be part of a fingerprint which are best suited for a reliable near-duplicate identification. Table 1 presents in a consistent way the algorithms and the implemented selection heuristics found in the literature, whereas a heuristic is of one of the types denoted in Figure 1.
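A sketch of a projecting-based fingerprint in the spirit of Table 1, using MD5 as the hash function h and a simple "hash value divisible by a modulus" selection heuristic; this stands in for, but is not identical to, any particular published heuristic.

import hashlib

def h(chunk):
    """Hash a chunk (tuple of words) to a natural number (here via MD5)."""
    digest = hashlib.md5(" ".join(chunk).encode("utf-8")).hexdigest()
    return int(digest, 16)

def fingerprint(text, n=3, modulus=4):
    """F_d = {h(c) | c in C_d and V(c) = true}, where V(c) keeps a chunk
    whose hash value is divisible by `modulus` (one simple heuristic)."""
    words = text.split()
    chunks = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    selected = set()
    for c in chunks:
        code = h(c)
        if code % modulus == 0:   # the selection heuristic V(c)
            selected.add(code)
    return selected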
2.2 Dimensionality reduction by embedding
An embedding-based fingerprint F_d for a document d is typically constructed with a technique called "similarity hashing" (Indyk and Motwani (1998)). Unlike standard hash functions, which aim at a minimization of the number of hash collisions, a similarity hash function h_M : D → U, U ⊂ N, shall produce a collision with high probability for two objects d, d_q ∈ D, iff M(d, d_q) ≥ 1 − ε. In this way h_M downgrades a fine-grained similarity relation quantified within M to the concept "similar or not similar", reflected by the fact whether or not the hashcodes h_M(d) and h_M(d_q) are identical.
Table 2. Summary of complexities for the construction of a fingerprint, the retrieval, and the size of a tailored chunk index.

Algorithm                   Construction  Retrieval   Chunk length  Fingerprint size  Chunk index size
rare chunks                 O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
SPEX                        O(|d|)        O(r · |d|)  n             O(r · |d|)        O(r · |d| · |D|)
prefix anchor               O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
hashed breakpoints          O(|d|)        O(|d|)      E(|c|) = n    O(|d|)            O(|d| · |D|)
winnowing                   O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
one of sliding window       O(|d|)        O(|d|)      n             O(|d|)            O(|d| · |D|)
super-/megashingling        O(|d|)        O(k)        n             O(k)              O(k · |D|)
fuzzy-fingerprinting        O(|d|)        O(k)        |d|           O(k)              O(k · |D|)
locality-sensitive hashing  O(|d|)        O(k)        |d|           O(k)              O(k · |D|)
To construct a fingerprint F_d for document d, a small number of k variants of h_M is used:

F_d = {h_M^(i)(d) | i ∈ {1, ..., k}}
Two kinds of similarity hash functions have been proposed, which either compute hashcodes based on knowledge about the domain or which are based on domain-independent randomization techniques (see again Figure 1). Both kinds compute hashcodes along the three steps outlined above. An example of the former is fuzzy-fingerprinting, developed by Stein (2005), where the embedding step relies on a tailored, low-dimensional document model and where fuzzification is applied as a means for quantization. An example of the latter is locality-sensitive hashing and the variants thereof by Charikar (2002) and Datar et al. (2004). Here the embedding relies on the computation of scalar products of d with random vectors, and the scalar products are mapped onto predefined intervals on the real number line as a means for quantization. In both approaches the encoding happens according to a summation rule.
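A sketch of randomized similarity hashing in the spirit of Charikar's random-projection approach: each hash variant is a set of random hyperplanes, the embedding consists of the scalar products of d with them, quantization keeps only their signs, and the sign bits are encoded into one hashcode. The number of planes, the seed handling and the bit encoding are illustrative choices, not a specific published configuration.

import numpy as np

def make_hash_variant(dim, bits, rng):
    """One similarity-hash variant: a set of `bits` random hyperplanes."""
    return rng.standard_normal((bits, dim))

def sim_hash(d, planes):
    """Embed d via scalar products with random vectors, quantize to sign bits,
    and encode the bits into a single hashcode."""
    bits = (planes @ d) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

def fingerprint(d, k=3, bits=16, seed=0):
    """F_d = {h_M^(i)(d) | i = 1..k}: one hashcode per variant."""
    rng = np.random.default_rng(seed)
    return {sim_hash(d, make_hash_variant(len(d), bits, rng)) for _ in range(k)}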
2.3 Discussion
We have analyzed the aforementioned fingerprint construction methods with respect to construction time, retrieval time, and the resulting size of a complete chunk index. Table 2 compiles the results.
The construction of a fingerprint for a document d depends on its length since
d has to be parsed at least once, which explains why all methods have the same complexity in this respect. The retrieval of near-duplicates requires a chunk index
z as described at the outset: z is queried with each number of a query document’s