Predicting the Popularity of Social Curation
Binh Thanh Kieu, Ryutaro Ichise, and Son Bao Pham
Abstract. The amount and variety of social media content such as status updates, images, movies, and music are increasing rapidly. Accordingly, the social curation service is emerging as a new way to connect, select, and organize information on a massive scale. One noticeable feature of social curation services is that they are loosely supervised: the content that users create in the service is manually collected, selected, and maintained. A large proportion of this content is created arbitrarily by inexperienced users. In this paper, we look into social curation, particularly the Storify website¹. This is the most popular social curation service for creating stories that draw on various domains such as Twitter, Flickr, and YouTube. We propose a machine learning method with feature extraction to filter this content and to predict the popularity of social curation data.
Keywords: curation, social curation, social network service, prediction, popularity
1 Introduction
Along with the rapid growth of the Internet, social networks are attracting more and more users, young people in particular. The study of social networks is therefore receiving increasing attention. Social network services such as Facebook, Myspace, and Twitter have become viable sources of information for many online users. These websites are increasingly used for communicating breaking news, sharing eyewitness accounts, and organizing groups of people. At the most basic level, a curation service offers the ability to manually collect, select, and maintain this social media information. This is very different from other social information sources, and we can utilize this characteristic for efficient content mining.
Binh Thanh Kieu · Son Bao Pham
Faculty of Information Technology, University of Engineering and Technology,
Vietnam National University, Hanoi, Vietnam
e-mail: binhkt.vnu@gmail.com, sonpb@vnu.edu.vn
Ryutaro Ichise
National Institute of Informatics, Tokyo, Japan
e-mail: ichise@nii.ac.jp
¹ http://www.storify.com/
V.-H. Nguyen et al. (eds.), Knowledge and Systems Engineering,
Advances in Intelligent Systems and Computing 326, DOI: 10.1007/978-3-319-11680-8_33
The emergence of Web 2.0 and online social networking services, such as Digg, YouTube, Facebook, and Twitter, has changed how users generate and consume online content. For example, YouTube, well known for its fast-growing user-generated content, reports 100 hours' worth of video uploaded every minute [22]. Online social networking services, augmented with multimedia content support, sharing, and commenting on other users' content, constitute a significant part of the web experience of Internet users. The question is how users find interesting content, or how certain content rises in popularity. If we can answer these questions, we can predict which content is most likely to become popular and filter out the rest. Moreover, once we can filter out unpopular content that gets little attention, the good content can be used to build an automatic system for curating social content.
However, predicting the popularity of content is a difficult task for many reasons. Among these, the effects of external phenomena (e.g., media, natural, and geopolitical) are difficult to incorporate into models [16], and cascades of information are difficult to forecast [3]. Finally, the underlying contexts, such as locality, relevance to users, resonance, and impact, are not easy to decipher [2].
The rest of the paper is organized as follows. In the second section, we explain the social curation service, our target data source, and details of the dataset specifications. In the third section, we review related work. The fourth section is devoted to the formulation of predicting view counts of a curation list. The fifth section describes experiments and the evaluation of our results. The last section concludes this paper with a discussion about future work.
2 Social Curation
The word "curate" is defined as selecting, organizing, and looking after the items in a collection or exhibition. The word is derived from the Latin root "curare", or "to cure", which means "to care". Curation involves assembling, managing, and presenting some type of collection. For example, curators of art galleries and museums research, select, and acquire pieces for their institutions' collections and oversee interpretation, displays, and exhibits. Social curation is the collaborative oversight of collections of web content organized around types of content, as on Pinterest (a site for sharing and organizing images) and Storify (a site for collecting and publishing stories).
2.1 Social Curation Service
Social networks are spaces for dialog and conversation that have grown into ubiquitous information exchanges. Youth today refer to social networks, aggregators, and mobile apps for most of their information instead of singling out specific media for news, politics, personal communication, and leisure. In turn, social networks have provided new functions that help users curate information in meaningful and productive ways. Social curation involves aggregating, organizing, and sharing the content created by others to add context, narrative, and meaning. Artists, changemakers, and organizations use social curation to showcase the full range of conversations around a topic, add more nuance to their own original content, and crowdsource content from their community members. The rise of social curation can be attributed to three broad trends.
Firstly, people are creating a constant stream of social media content, including updates, location check-ins, blog posts, photos, and videos.
Secondly, people are using their social networks to filter relevant content by following others who share similar interests.
Thirdly, social media platforms are also curating content by giving curation tools to users (YouTube playlists, Flickr galleries, Amazon lists, Foodspotting guides), using editors and volunteers (YouTube Politics, Tumblr Tags), or using algorithms (YouTube Trends, Autogenerated YouTube channels, LinkedIn Today). As a result, a number of niche social curation platforms have emerged to enable people to curate different types of content, including links, photos, sounds, and videos.
We should emphasize that each curation list is a kind of loosely supervised but organized social dataset. This means that social media items in the same curation list are expected to share the same context to a certain degree: a curation list is manually generated to fully convey one idea to the consumer. This is a very distinct characteristic compared to other social media, which are unorganized in many cases.
2.2 Storify
Storify is the best-known website for telling stories by curating social media. Storify was launched in September 2010, and accounts were invitation-only until April 2011. The site is now open to everyone, and users only need a Twitter account. Storify provides a function to filter out poor content and unreliable sources. If social media changes or misinterprets context, Storify can help curators put it back together again [13]. Storify allows curators to embed dynamic images, text, tweets, and even Facebook status updates, and then knit these all together with background and context provided by the storyteller. It is an engaging way for us to learn how to work out what is true and what is speculation. We have also found that using Twitter has taught us how to look for sources and news, and Storify has helped teach us how to think about and write context and narrative. Each story is a curation list that shares some characteristics: it is manually collected (bundling a collection of content from diverse sources), manually selected (re-organizing the content to give one's own perspective), and manually maintained (publishing the resulting story for consumers).
The Storify data is in the form of lists of Twitter messages. An example of a list is shown in Figure 1. A list of tweets corresponds to what we call a story, which represents a manually filtered and organized bundle. Lists in Storify draw on Twitter as their source. The lists may be created individually in private or collaboratively in public, as determined by the initial curator. In the Storify curation interface, the curator begins the list curation process by looking through his or her Twitter timeline (tweets from users that he or she follows), or by directly searching tweets via relevant words or hashtags. The curator can drag and drop these tweets into a list, reorder them freely, and also add annotations such as a list header and in-place comments.
Fig. 1 Example of a Storify list
Table 1 Statistics of curated domains
Domain  Number of Elements  Proportion
Table 2 Element types
Types  Number  Proportion
Table 3 Storify action statistics
Element comments  21,133  0.002 per element
We first provide some data statistics to get a feel for the curation data. We collected all the data from 2010 to April 2013, which amounted to 63,419 users and 352,540 stories. This corresponds to a total of 11,283,815 elements from various domains. Table 1 describes the various domains used in the stories. Twitter is the largest domain source with more than 75% of elements, and Flickr is the smallest specific source with 1.2% of elements. The statistics of the element types are shown in Table 2. The five types of elements in stories are quote, text, image, video, and link. Because Storify users use a huge number of tweets, quote content accounts for a large share of nearly 70%. Media content such as images and movies makes up approximately 15%. Text content is written by the curator to add more information, explain, or link elements. The Storify API provides the four main actions shown in Table 3. The Storify website allows users to comment on each element or on a story as a whole. However, the average numbers of comments, element comments, and similar actions are quite small. Therefore, approaches utilizing user comments and actions are not suitable for this dataset.
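The dataset totals quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative, and the per-story and per-user averages are derived here rather than reported by the authors.

```python
# Counts as stated in the text (data collected from 2010 to April 2013).
USERS = 63_419
STORIES = 352_540
ELEMENTS = 11_283_815
ELEMENT_COMMENTS = 21_133  # the surviving row of Table 3

# Derived ratios; only comments_per_element is reported in the paper.
comments_per_element = ELEMENT_COMMENTS / ELEMENTS
elements_per_story = ELEMENTS / STORIES
stories_per_user = STORIES / USERS

print(f"element comments per element: {comments_per_element:.4f}")
print(f"elements per story:           {elements_per_story:.1f}")
print(f"stories per user:             {stories_per_user:.2f}")
```

The first ratio rounds to the 0.002 comments per element listed in Table 3, which illustrates why comment-based signals are too sparse to use as features here.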
3 Related Work
Several studies have investigated social curation as a new source for data mining. Pinterest² is the most popular website for sharing images and video, and the third most popular social network in the US behind Facebook and Twitter. The website is built around the activity of collecting digital images and videos and pinning them to a pin board. Each pin is essentially a visual bookmark, and the pin boards are thematic collections of the bookmarks, where context is added to the collected information. Hall and Zarro described some of the user actions on Pinterest and created a dataset to find the pin content of Pinterest users across a wide variety of subject areas [8] [23]. Besides sites that curate only images or video, other sites curate status updates, comments, and news sources to write blogs and stories. Storyful³ is a social media news agency established in 2010 with the aim of filtering news, or newsworthy content, from the vast quantities of noisy data on social networks such as Twitter and YouTube. Storyful invests considerable time in the manual curation of content on these networks. This sounds more or less like Storify's goal, but there is one important difference: Storyful aims to deliver content for news organizations, whereas Storify is more of a tool for journalists. It allows journalists to use its template to write stories that include relevant tweets and Facebook posts without losing the original formatting or links. Journalists can create interactive stories with clear links to original pictures or tweets. Greene et al. proposed a variety of criteria for generating user list recommendations based on content analysis, network analysis, and the "crowdsourcing" of existing user lists [7]. In addition, the Togetter website⁴ is a rapidly growing social curation website in Japan. Togetter averaged more than 4 million user views per month in 2011. The Togetter curation data mainly exist in the form of lists of Twitter messages. Ishiguro et al. used Togetter data for the automatic understanding and mining of images [11] and created a system [6] that suggests new tweets to increase the curator's productivity and breadth of perspective. Our research examines another social curation website, Storify. The structure of a Storify list is quite similar to that of a Togetter list. The only difference is the language: the common language of Togetter is Japanese, and that of Storify is English. However, we are interested in a different aspect, namely the quality of the curation lists made by users.
² http://www.pinterest.com/
The problem of predicting the popularity of online content concerns how much attention the content will ultimately receive. Research shows that user attention is allocated in a rather asymmetric way, with most content getting only a few views and downloads, whereas a few items receive a significant amount of user attention; thus, filtering this content will save viewers much time. There are different ways to formulate the amount of attention that content receives. Many researchers take the number of views as the popularity of online content, as for YouTube (the number of views [20]), Vimeo (the number of views [1]), and Flickr (the number of views [24]). Alternatively, popularity is represented by user actions, as for Digg (the number of user votes [12]) and Twitter (the number of retweets [10]). Others formulate the problem as the change in the number of views that content receives over time.
Predicting the popularity of news articles is a complex and difficult task, and different prediction methods and strategies have been proposed in several recent studies [20] [21]. The common point of all these methods is that they focus on predicting the exact attention that an article will generate in the near future. First, some researchers have studied features that describe the underlying social network of the users and contents that can be leveraged to predict popularity [9] [12] [18] [21]. The authors in [14] [16] [17] [21] studied features that take into account the comments found in blogs to predict popularity. However, few other works forecast a value for the actual popularity of individual content. Lee et al. used survival analysis to evaluate the probability that a given content receives more than some x number of hits [16] [17]. Hong et al. developed a coarse multi-class classifier-based approach to determine whether given Twitter hashtags are retweeted x ≤ (0; 100; 10000; ∞) times [10]. Similarly, Lakkaraju and Ajmera used support vector machines (SVMs) to predict whether a given content falls into a group that attracts x ≤ (10%; 25%; 50%; 75%; 100%) of the attention in a system [15], while Jamali and Rangwala predicted the popularity of content by using an entropy measure [12]. Finally, Szabo and Huberman presented a linear regression model based on the number of views [20]; this method has been applied to build popularity predictors by applying regression to different feature spaces [2] [9] [18] [21].
³ http://www.storyful.com/
⁴ http://www.togetter.com/
In this work, the popularity of Social Curation is measured by the number of views that the content will receive in the near future. We propose three groups for categorizing the popularity level of Social Curation. We build a predictor based on a machine learning method, SVM, with feature selection to classify stories into these groups.
4 Predicting the Popularity of Social Curation
4.1 Problem Formulation
Similar to normal content, the popularity of social curation is defined by the number of user views. We predict how many views stories will receive in the near future. However, it is difficult to predict the exact amount of attention, and people are mostly interested in whether content is popular; thus, instead of predicting the exact number, we cast the task as a multi-class classification problem that predicts the popularity level a curation list will reach after three months, based on the number of views. Although our system cannot predict the exact amount of attention, it helps users distinguish popular content from unpopular content. We divide the number of views into three classes: class 1 (not popular), with fewer than 10 views; class 2 (less popular), with between 10 and 1000 views; and class 3 (very popular), with more than 1000 views.
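The three-class labeling can be written as a small function. This is a minimal sketch: the function name is hypothetical, and treating the boundaries 10 and 1000 as belonging to class 2 is an assumption, since the text only says "between 10 and 1000".

```python
def popularity_class(views: int) -> int:
    """Map a story's view count to the popularity class of Section 4.1.

    Assumption: the boundary values 10 and 1000 are both placed in class 2.
    """
    if views < 0:
        raise ValueError("view count cannot be negative")
    if views < 10:
        return 1  # not popular
    if views <= 1000:
        return 2  # less popular
    return 3      # very popular


# Example: label a handful of view counts.
labels = [popularity_class(v) for v in (3, 10, 250, 1000, 48_000)]
print(labels)
```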
We used an SVM to classify stories into these classes. LibSVM [4] with a radial basis function (RBF) kernel and default parameters, together with the feature selection tool [5], was used to optimize the result. We extracted two types of features, namely curation features and curator features. Curator features are features of the users who collect and organize elements from some domains and create curation lists. Curation features are features related to the content of the curation lists.
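For readers unfamiliar with the kernel, the RBF similarity between two feature vectors is K(x, y) = exp(−γ‖x − y‖²). The sketch below is a plain-Python illustration of that formula, not LibSVM code; it assumes LibSVM's documented default of γ = 1 / (number of features), and the function name is illustrative.

```python
import math
from typing import Optional, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float],
               gamma: Optional[float] = None) -> float:
    """RBF kernel value exp(-gamma * ||x - y||^2) for two feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        # Assumed default, matching LibSVM: 1 / number of features.
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1.0; the value decays toward 0 as vectors move apart.
print(rbf_kernel([0.2, 0.9], [0.2, 0.9]))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))
```

Because the kernel depends on Euclidean distance, features on very different scales would dominate it, which is why the features are rescaled before training.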
4.2 Features
Social curation lists contain many kinds of information that are useful for classification. For example, if the curation list includes many Twitter contents, the view count of the contents is expected to increase; or, if elements match the context of the curation list, the content will attract much more attention.
In this study, as the social curation list included a large number of Twitter messages, we used features that are applicable to predicting the number of retweets and micro-blogging popularity. We divided the features into the two distinct sets mentioned above: curator features (which are related to the author of the story) and curation features (which encompass various statistics of the content in the story).
4.2.1 Curator Features
The following are the five curator features:
The number of users who follow the curator of the content
The number of users who the curator of the content follows
The number of stories written by the curator
The user’s language (English or not)
When the curator of the content started using Storify
These features were selected from the content creator features proposed by Ishiguro et al. [11]. We implemented these features as our baseline system. The number of followers and friends has been consistently shown to be a good indicator of retweetability, whereas the number of stories has not been found to have a significant impact [19]. Our prior analysis also showed that stories written in English are more likely to be viewed, so we used a binary feature indicating whether the user's language is English. The date when a curator started using Storify reflects their experience. Longtime users normally have more experience and produce more popular stories than new users do. We are not aware of any prior work that analyzes the effect of language or date on content popularity.
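The five curator features can be assembled into a numeric vector along the following lines. This is a hedged sketch: the `Curator` record, its field names, and the choice to encode the start date as days elapsed before a reference date (here an assumed crawl date of June 1, 2013) are illustrative assumptions, not the Storify API or the authors' exact encoding.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Curator:
    """Hypothetical record holding the raw curator attributes."""
    followers: int     # users who follow the curator
    following: int     # users whom the curator follows
    story_count: int   # stories written by the curator
    language: str      # language code of the user profile, e.g. "en"
    joined: date       # when the curator started using Storify


def curator_features(c: Curator, reference: date = date(2013, 6, 1)) -> List[float]:
    """Return the five curator features as floats."""
    is_english = 1.0 if c.language.lower().startswith("en") else 0.0
    days_on_storify = float((reference - c.joined).days)  # assumed encoding
    return [float(c.followers), float(c.following), float(c.story_count),
            is_english, days_on_storify]


example = Curator(followers=120, following=80, story_count=5,
                  language="en", joined=date(2011, 6, 1))
print(curator_features(example))
```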
4.2.2 Curation Features
The following are the seven curation features:
The number of hashtags
The number of versions
The number of embeds
The story’s language (English or not)
The ratio of popular tweet elements to total elements (a tweet is popular if it has more than 100 retweets)
The ratio of popular image and video elements to total elements (an image or video is popular if it has more than 1000 views)
The total number of elements
As a large proportion of the elements in a curation list come from the Twitter domain, hashtags are an important feature for predicting popularity. One paper showed that hashtags, URLs, and mentions have a high correlation with popular Twitter messages [19]. Although the Storify API provides the hashtags, URLs, and mentions of each story, URLs and mentions have an insignificant impact on the result. The version feature reflects that users who modify their story can improve its quality and get more attention. The embed feature reflects that stories shared more widely are more popular. English is the most widely known language in the world, so stories written in English are read by more people than those in any other language. Although this feature is quite similar to the curator's language feature, not all curators use their main language to write stories. According to our experiments, tests using this feature achieved higher results. Finally, a higher proportion of popular Twitter elements and media elements also increases the accuracy. Moreover, stories with many elements draw more attention than stories with fewer elements.
Table 4 Prediction accuracies for two feature types
Type of feature  No. of features  Classification (10-fold)
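The seven curation features, including the two popularity ratios, could be computed as in the sketch below. The `Element` record and its field names are hypothetical, and treating "quote" elements as tweets is an assumption based on the element-type description in Section 2.2; the thresholds (more than 100 retweets, more than 1000 views) are those given in the feature list.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Element:
    """Hypothetical story element; kind is quote, text, image, video, or link."""
    kind: str
    retweets: int = 0  # meaningful for tweet ("quote") elements
    views: int = 0     # meaningful for image and video elements


def curation_features(elements: Sequence[Element], hashtags: int, versions: int,
                      embeds: int, language: str) -> List[float]:
    """Return the seven curation features in the order listed in Section 4.2.2."""
    total = len(elements)
    popular_tweets = sum(1 for e in elements
                         if e.kind == "quote" and e.retweets > 100)
    popular_media = sum(1 for e in elements
                        if e.kind in ("image", "video") and e.views > 1000)
    # Guard against empty stories when forming the two ratios.
    tweet_ratio = popular_tweets / total if total else 0.0
    media_ratio = popular_media / total if total else 0.0
    is_english = 1.0 if language.lower().startswith("en") else 0.0
    return [float(hashtags), float(versions), float(embeds), is_english,
            tweet_ratio, media_ratio, float(total)]


story = [Element("quote", retweets=150), Element("quote", retweets=3),
         Element("image", views=5000), Element("text")]
print(curation_features(story, hashtags=2, versions=1, embeds=0, language="en"))
```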
To the best of our knowledge, no prior work has analyzed the effect of these features on content popularity. Therefore, the features we propose are based on experiments and on the feature selection tool, chosen to achieve the highest result. The feature selection tool, combined with libSVM, uses the F-score for selecting features [5]. The F-score is a simple technique that measures the discrimination between two sets of real numbers. The larger the F-score, the more discriminative the feature is likely to be. Therefore, this score is used as a feature selection criterion. LibSVM also provides a feature scaling function to absorb the scale differences among feature values; we used it to rescale the values to [0, 1]. Finally, the twelve features above gave the highest result for predicting the popularity of Storify data.
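To make the two preprocessing steps concrete, the sketch below computes the F-score of a single feature for two sets of values and rescales a feature column to [0, 1]. It follows the standard two-set F-score definition (squared deviations of the set means from the overall mean, divided by the sum of the sample variances); how the tool applies the score across the three popularity classes is not stated in the text, so this is an illustration of the criterion rather than the exact pipeline. Function names are illustrative.

```python
from statistics import mean, variance
from typing import List, Sequence


def f_score(positive: Sequence[float], negative: Sequence[float]) -> float:
    """F-score of one feature, given its values in two sets of instances."""
    if len(positive) < 2 or len(negative) < 2:
        raise ValueError("each set needs at least two values")
    overall_mean = mean(list(positive) + list(negative))
    between = ((mean(positive) - overall_mean) ** 2
               + (mean(negative) - overall_mean) ** 2)
    within = variance(positive) + variance(negative)  # sample variances
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale one feature column linearly to the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - low) / (high - low) for v in values]


# A well-separated feature scores higher than an overlapping one.
print(f_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(f_score([1.0, 5.0, 3.0], [2.0, 4.0, 6.0]))
print(min_max_scale([2.0, 4.0, 6.0]))
```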
5 Experimental Results
5.1 The Experimental Dataset
We used Storify's streaming API to collect a random sample of public stories created from March 1, 2013 to March 31, 2013, which yielded 34,810 curation lists. We assume that these stories have the same publication time. We crawled them in June 2013, so we predict how much attention this content receives about three months after publication. Finally, we divided this dataset into 10 groups and ran 10-fold cross-validation.
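The 10-fold protocol can be sketched as follows: the story indices are shuffled once, dealt into 10 groups, and each group serves once as the test set while the other nine form the training set. The function names and the fixed random seed are illustrative assumptions; the text does not say how the groups were drawn.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n: int, k: int = 10,
                      seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


folds = k_fold_indices(34_810, k=10)
print([len(fold) for fold in folds])
```

With 34,810 curation lists, each fold holds 3,481 stories, so every test reported in Section 5.2 is evaluated on roughly 10% of the data.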
5.2 Results
The popularity levels follow the three classes defined in Section 4.1. Statistically, nearly half of the stories are in class 1, nearly 20% are in class 2, and the remainder are in class 3. The prediction accuracies for the two types of features are shown in Table 4. Curation features (7 features) give the lowest accuracy at 75.08%, curator features (5 features) reach 80.02%, and the best result, 82.62%, comes from the combined features (curator and curation features together, for a total of 12 features). Therefore, both types of features are necessary for high prediction accuracy.
Table 5 Accuracy of 10 tests
Test  Curator features  Combined features
Table 5 shows more detailed results for the 10 tests. The curator features (as baseline features) and the combined features show different results. Most tests using combined features are more accurate than tests using only curator features, except for test 6. We analyzed test 6 and found that the percentage of class 2 is approximately 40%, which is double the normal percentage. This suggests that the combined features do not perform well for class 2. In addition, most tests using combined features attain roughly 83% accuracy, except for tests 6 and 9 (lower accuracy, barely over 70%) and tests 8 and 10 (high accuracy, over 90%). Although the distribution ratio of classes in these tests is quite different from the others, the difference is irregular and not significant. This is an open problem in our research; answering this question would improve the result.
5.3 T-Test Evaluation
The (Student's) t-test is a statistical examination of two population means. In simple terms, the t-test assesses whether the means of two groups are statistically different from each other. It is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. In our case, we used the t-test to compare the two groups of results from the above 10 tests (a small sample size). The 95% confidence interval of the difference ranges from −3.8929 to −1.1271, and the computed statistic is t = −4.1059. Because the interval excludes zero, we conclude that this difference is very significant. It indicates that our proposal to use both feature types is effective for predicting the popularity of social curation data.
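The reported interval and statistic can be checked against each other. For a t-based interval, the midpoint is the mean difference and the half-width equals the critical value times the standard error, so t can be recovered from the interval alone. The sketch below assumes a paired test over the 10 folds (9 degrees of freedom, two-sided 95% critical value of about 2.2622); the text does not state this explicitly, but the reported numbers are consistent with it. The function name is illustrative.

```python
def t_from_confidence_interval(lower: float, upper: float,
                               t_critical: float) -> float:
    """Recover the t statistic from a t-based confidence interval.

    mean difference = interval midpoint
    standard error  = interval half-width / critical t value
    """
    mean_difference = (lower + upper) / 2.0
    half_width = (upper - lower) / 2.0
    standard_error = half_width / t_critical
    return mean_difference / standard_error


# Assumption: paired t-test on 10 folds, so 9 degrees of freedom.
T_CRITICAL_DF9 = 2.2622  # two-sided 95% critical value of Student's t, df = 9

t_value = t_from_confidence_interval(-3.8929, -1.1271, T_CRITICAL_DF9)
print(f"mean difference: {(-3.8929 + -1.1271) / 2:.2f}")
print(f"recovered t:     {t_value:.3f}")
```

The recovered value is close to the reported t = −4.1059, and since its magnitude exceeds the critical value, the accuracy gap between curator-only and combined features is significant at the 5% level under these assumptions.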