See Chapter 10 for an in-depth case study on a more comprehensive project to not only keep bad content on Answers subdued, but actually clean it up and remove it altogether, with much greater accuracy and speed.
Tuning for Behavior
There are many useful sources for reputation input, but one source stands out among all others: the user. The vast majority of content on the Web is user-generated, and user feedback generates the reputation that powers the Web. Even search engines are built on evaluations in the form of links, provided not by algorithms but by people.
In an effort to optimize all of this people-powered value, reputation systems have come to play a large part in creating incentives for user behavior: participation points, top contributor awards, etc. Users then respond to these incentives, changing their behavior, which then requires the reputation systems to be tuned to optimize newer and more sophisticated behavior (including adjustments for undesirable side effects, aka abuse). The cycle then repeats, if you’re lucky.
Emergent effects and emergent defects
It’s quite possible that, even during the beta period of your deployment, you’re noticing some strange effects starting to take hold. Perhaps content items are rising in the ranks that don’t entirely seem…deserving somehow. Or maybe you’re noticing a predominance of a certain kind of content at the expense of other types. What you’re seeing is the character of your community shaking itself out, finding its edges, and defining itself. Tread carefully before deciding how (and if) to intervene.
Check out Delicious’s Popular Bookmarks ranking for any given week; we bet you’ll see a whole lot of “Top N” blog articles (see Figure 9-2). Why might this be? Technology essayist Paul Graham posits that it may be the users of the service, and their motivational mindset, that explain it: “Delicious users are collectors, and a list of N things seems particularly collectible because it’s a collection itself.” (Graham explores the “List of N Things” phenomenon in some depth at http://www.paulgraham.com/nthings.html.) The preponderance of lists on Delicious is a natural offshoot of its context of use (an emergent effect) and is probably not one that you would worry about, nor try to control in any way.
Figure 9-1 By giving users a simple, private “Watchlist,” the Answers designers responded to the needs of Abuse Reporters who wanted to check back in on bad content.
236 | Chapter 9: Application Integration, Testing, and Tuning
But you may also be seeing the effects of some design decisions that you’ve made, and you may want to tweak those designs now, before wider deployment. Blogger and social media maven Muhammad Saleem noticed one such problem with voting on socially driven news sites such as Digg:
We are beginning to see a trend where people make assumptions about the contents of an article based on the meta-data associated with the submission rather than reading the article itself. Based on these (oft-flawed) assumptions, people then vote for or against the stories, and even comment on the stories without having read the stories themselves.
—http://web.archive.org/web/20061127130645/http://themulife.com/?p=256
We’ve noticed a similar tendency on some community-voting sites we’ve worked on at Yahoo! and have come to consider behavior like this to be a type of emergent defect: behavior that is homegrown within the community and may even become a de facto standard for interacting, but is not necessarily valued. In fact, it’s basically a bug and a failing of your system or, more likely, your user interface design.
In instances like these, you should consider tweaking your design to encourage the proper and appropriate use of the controls you’re providing. In some ways, it’s not surprising that Digg users are voting on articles based on only surface appraisals; the application’s very design in fact encourages this (see Figure 9-3).
Figure 9-2 What are people saving on Delicious? Lists, lists and more lists…(and there’s nothing wrong with that).
Tuning Your System | 237
Figure 9-3 The design of Digg enables (one might argue, encourages) voting for articles at a high level of the site. This excerpted screen is the front page of Digg: users can vote for (Digg) an article, or against (bury) it, with no need to read further.
Of course, one should not presuppose that the Digg folks think of this behavior (if it’s even as widespread as Saleem indicates) as a defect. Again, it’s a careful balance between the actual observed behavior of users and your own predetermined goals and aspirations for the application.
It’s quite possible that Digg feels that high voting levels, even if some percentage of those votes are from uninformed users, are important enough to promote voting at higher and higher levels of the site. From a brand perspective alone, it certainly would be odd to visit Digg.com and not see a single place to Digg something up, right?
Defending against emergent defects
It’s hard to anticipate all emergent defects until they…well…emerge. But there are certainly some good principles of design that you can follow that may defend your system against some of the most common ones:
Encourage consumption
If your system’s reputations are intended to capture the quality of a piece of content, you should make a good-faith attempt to ensure that users are qualified to make that assessment. Some examples:
• Early on in its lifetime, Apple’s iPhone App Store allowed any visitor to rate an application, whether they’d purchased it or not! You can probably see the potential for bad data to arise from this situation. A subsequent release addressed this problem, ensuring that only users who’d installed the program would have a voice. It doesn’t guarantee perfection, but a gating mechanism for rating does help dampen noise.
• Digg and other social voting sites provide a toolbar that follows logged-in users out to external sites, encouraging them to actually read linked articles before clicking the toolbar-provided voting mechanism. Your application could even require an interaction like this for a vote to be counted. (More likely, you’ll simply want to weight votes more heavily when they’re cast in a guaranteed-better fashion like this.)
• Think of ways to check for consumption in a media-specific way. With videos, for example, perhaps you should give more weight to opinions cast about a video only once the user has passed a certain time threshold of viewing (or, perhaps, disable voting mechanisms altogether until that time).
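As a rough illustration of the video example above, the following Python sketch gives full weight to votes cast after a viewing threshold and a reduced weight to votes cast earlier. The `Vote` class, the 30-second threshold, and the 0.25 weight for early votes are all assumptions made for illustration; they do not describe any real site's implementation.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    value: int              # +1 for a thumbs-up, -1 for a thumbs-down
    seconds_watched: float  # how much of the video the voter actually played


def vote_weight(vote: Vote, min_seconds: float = 30.0,
                casual_weight: float = 0.25) -> float:
    """Full weight once the viewing threshold is passed, reduced weight before it.

    Both default values are hypothetical placeholders to be tuned.
    """
    return 1.0 if vote.seconds_watched >= min_seconds else casual_weight


def weighted_score(votes: list[Vote]) -> float:
    """Weighted average of vote values, in the range -1.0 to +1.0."""
    total_weight = sum(vote_weight(v) for v in votes)
    if total_weight == 0:
        return 0.0  # no votes (or all weights zero): treat as neutral
    return sum(v.value * vote_weight(v) for v in votes) / total_weight
```

Setting `casual_weight` to 0 would correspond to the stricter option of not counting a vote at all until the threshold is reached.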
Avoid ambiguous controls
Try not to lard too much input overhead onto reputable entities, and try to keep the purpose and primary value of each one clear, concise, and nonconflicting. If your design already calls for a Bookmarking or Favorites feature, carefully consider whether you also need a Thumbs Up or “I Like It.”
In any event, provide some cues to users about the utility of those controls. Are they strictly for expressing an opinion? Sharing with a friend? Saving for later? The downstream effects may, in fact, be that one control does all three of these things, but sometimes it’s better to suggest clear and consistent uses for controls than let the community muddle along, inventing its own utilities and rationales for things.
If a secondary or tertiary use for a control emerges, consider formalizing that function as a new feature.
Keep great reputations scarce
Many of the benefits that we’ve discussed for tracking reputation (the ability to highlight good contributions and contributors, the ability to “tag” user profiles with awards or recognition, even the simple ability to motivate contributors to excel) can be undermined if you make one simple mistake with your reputation system: being too generous with positive reputations. In particular, if you hand out reputations at the higher end of the spectrum too widely, they will no longer be seen as valuable and rare achievements. You’ll also lose the ability to call out great content in long listings; if everything is marked as special, nothing will stand out.
It’s probably OK to wait until the tuning phase to address the question of distribution thresholds. You’ll need to make some calculations, based on available data for current use of the application, to determine how heavily or lightly to weight certain inputs into the system. A good example is the Gold/Silver/Bronze medal system that we developed at Yahoo! to reward active, quality contributors to UK Sports Message Boards.
We knew that we wanted certain inputs to factor into users’ badge-holder reputations: the number of posts posted, how well the community received the posts (i.e., how highly the posts were rated), and so on. But, at first, our guesses at the appropriate thresholds for these activities were just that: guesses.
Take, for instance, one input that was included to indicate dedication to the community: the number of posts that a user had rated. (In general, we caution against simple activity-level indicators for karma, but remember that this is but one input into the model, weighted appropriately against other quality indicators like community response to your own postings.) We arbitrarily settled on the following minimum thresholds for badge-earners:
• Bronze Badge—5 posts rated
• Silver Badge—20 posts rated
• Gold Badge—100 posts rated
These were simply stabs in the dark (placeholders, really) that we fully expected to tune as we got closer to deployment.
And, in fact, once we’d done an in-depth calculation of projected badge numbers in the community (based on Message Board activity levels that were already evident before the addition of badges), we realized that these estimates were way too low. We would be giving out millions of Bronze badges, and, heck, still thousands of Golds. This felt way too liberal, given the goals of the project: to identify and reward only the most active and valued contributors to boards.
By the time the feature went into production, these minimum thresholds for rating others’ postings were made much higher (orders of magnitude higher) and, in fact, it was several months before the first message board Gold badge actually surfaced in the wild! We considered that a good thing, and perfectly in line with the business and community metrics we’d laid out at the project’s outset.
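The kind of projection described above can be sketched in a few lines of Python: given existing per-user activity counts, count how many users would qualify for each badge under a candidate set of thresholds. This is only a minimal sketch. The sample activity list is invented, the `stricter` thresholds are hypothetical (the production values are not given here), and `project_badge_counts` is an illustrative name rather than anything from the actual Yahoo! system.

```python
from bisect import bisect_left


def project_badge_counts(posts_rated: list[int],
                         thresholds: dict[str, int]) -> dict[str, int]:
    """Count how many users would meet each badge's minimum threshold."""
    ordered = sorted(posts_rated)
    return {
        # Users at or above `minimum` sit at index bisect_left(...) and beyond.
        badge: len(ordered) - bisect_left(ordered, minimum)
        for badge, minimum in thresholds.items()
    }


# Invented sample of "posts rated" per user; not real Yahoo! data.
sample_activity = [0, 2, 5, 7, 20, 25, 110, 400, 1500, 12000]

first_guess = {"bronze": 5, "silver": 20, "gold": 100}     # the initial guesses
stricter = {"bronze": 500, "silver": 2000, "gold": 10000}  # hypothetical values

print(project_badge_counts(sample_activity, first_guess))
print(project_badge_counts(sample_activity, stricter))
```

Note that the counts are cumulative: a user who clears the gold threshold is also counted as clearing silver and bronze. Running a projection like this against real activity data before launch is what reveals whether a threshold would hand out badges too freely.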
So…How Much Is Enough?
When you’re trying to plan out these distribution thresholds for reputations, your calculations will (of course!) vary with the context of use.
Is this karma (people reputation) or content reputation?
Be more mindful of the distribution of karma. It’s probably OK to have an overabundance of “Trophy-winning videos” floating around your site, but too many top-flight experts risks devaluing the reward altogether.
Honor the presentation pattern
Some distribution thresholds will be super easy to calibrate; if you’re honoring the Top 100 Reviewers on your site, for example, the number of users awarded should be fairly self-evident. It’s only with more ambiguous patterns that thresholds will need to be actively tuned and massaged to get the desired distributions.
Power-law is your friend
When in doubt, try to award reputations along a power-law distribution. (See http://en.wikipedia.org/wiki/Power_law.) Great reputations should be rare, good ones scarce, and mediocre ones should be the norm. This mimics the natural properties of most networks, so your reputations should reflect those values also.
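One crude way to approximate this shape is to rank users by score and make each tier a constant factor larger than the tier above it, so the top tier stays rare and the bottom tier is the norm. The Python sketch below does that; it is a stand-in for a power-law-shaped distribution, not a fitted power law. The function name, the tier names, and the default ratio of 10 are assumptions chosen for illustration.

```python
def assign_tiers(scores: dict[str, float], tiers: list[str],
                 ratio: float = 10.0) -> dict[str, str]:
    """Assign tiers by rank.

    `tiers` is listed rarest first; each tier holds roughly `ratio` times
    as many users as the tier before it.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    # Relative tier sizes: 1, ratio, ratio**2, ... from rarest to most common.
    weights = [ratio ** i for i in range(len(tiers))]
    total = sum(weights)

    result: dict[str, str] = {}
    start = 0
    cumulative = 0.0
    for i, (tier, weight) in enumerate(zip(tiers, weights)):
        cumulative += weight
        # The last tier absorbs everyone who remains.
        end = n if i == len(tiers) - 1 else round(n * cumulative / total)
        for user in ranked[start:end]:
            result[user] = tier
        start = end
    return result
```

With three tiers and a ratio of 10, a population of 111 users would be split 1, 10, and 100. On small populations the top tier can round down to zero members, which is consistent with keeping great reputations scarce.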
Tuning for the Future
There are sometimes pleasant surprises when implementing reputation systems for the first time. When users begin to interact with reputation-powered applications, the very nature of the application can change significantly; it often becomes communal, as control of the reputable entities shifts from the company to the people.
This shift from a content-centric to a community-centric application often inspires new application designs built on the lessons drawn from the existing reputation system. Simply put, if reputation works well for one application, all of the other related applications will want to integrate it, yesterday!
Though new reputation models can be added only as fast as they can be developed, tested, integrated, and deployed, the application team can release new uses for existing reputations without coordination and almost instantaneously, because it already has access to the reputation API calls. This suggests that the reputation team should continuously optimize for performance against its internal metrics. Expect significant growth, especially in the number of reputation queries. Even if the primary application, as originally implemented, doesn’t grow daily users at an unexpected rate, expect the application team to add new types of uses, such as more reputation-weighted searches, or to add more pages that display a reputation score.
Tuning reputation systems for ROI, behavior, and future improvements is a never-ending process. If you stop this required maintenance, the entire system will lose value as it becomes abused, slow, noncompetitive, broken, and eventually irrelevant.
Learning by Example
It’s one thing to describe and critique currently deployed reputation systems after they’ve already been deployed. It’s another to prescribe a detailed set of steps that are recommended for new practitioners, as we have done in this book.
Talk is easy; action is difficult. But, action is easy; true understanding is difficult!
—Warrior Proverb
The lessons we presented here are the direct result of many attempts (some succeeded, some failed) at reputation system development and deployment. The book is the result of successive refinement of those lessons, especially as we refined them at Yahoo! Chapter 10 is our proof-in-the-pudding that this methodology works in practice; it covers each step as we applied it during the development of a community moderation reputation model for Yahoo! Answers.
CHAPTER 10
Case Study: Yahoo! Answers Community Content Moderation
This chapter is a real-life case study applying many of the theories and practical advice presented in this book. The lessons learned on this project had a significant impact on our thinking about reputation systems, the power of social media moderation, and the need to publish these results in order to share our findings with the greater web application development community.
In the summer of 2007, Yahoo! tried to address some moderation challenges with one of its flagship community products: Yahoo! Answers. The service had fallen victim to its own success and drawn the attention of trolls and spammers in a big way. The Yahoo! Answers team was struggling to keep up with harmful, abusive content that flooded the service, most of which originated with a small number of bad actors on the site. Ultimately, a clever (but simple) system that was rich in reputation provided the answer to these woes: it was designed to identify bad actors, indemnify honest contributors, and take the overwhelming load off of the customer care team. Here’s how that system came about.
What Is Yahoo! Answers?
Yahoo! Answers debuted in December of 2005 and almost immediately enjoyed massive popularity as a community-driven website and a source of shared knowledge. Yahoo! Answers provides a very simple interface to do, chiefly, two things: pose questions to a large community (potentially, any active, registered Yahoo! user; that’s roughly a half-billion people worldwide), or answer questions that others have asked. Yahoo! Answers was modeled, in part, on similar question-and-answer sites like Korea’s Naver.com Knowledge Search.
The appeal of this format was undeniable. By June of 2006, according to Business 2.0, Yahoo! Answers had already become “the second most popular Internet reference site after Wikipedia and had more than 90% of the domestic question-and-answer market share, as measured by comScore.” Its popularity continues and, owing partly to excellent search engine optimization (SEO), Yahoo! Answers pages frequently appear very near the top of search results pages on Google and Yahoo! for a wide variety of topics. Yahoo! Answers is by far the most active community site on the Yahoo! network. It logs more than 1.2 million user contributions (questions and answers combined) each day.
A Marketplace for Questions and Yahoo! Answers
Yahoo! Answers is a unique kind of marketplace, one not based on the transfer of goods for monetary reward. No, Yahoo! Answers is a knowledge marketplace, where the currency of exchange is ideas. Furthermore, Yahoo! Answers focuses on a specific kind of knowledge.
Micah Alpern was the user experience lead for early releases of Yahoo! Answers. He refers to the unique focus of Yahoo! Answers as “experiential knowledge”: the exchange of opinions and sharing of common experiences and advice (see Figure 10-1). While verifiable, factual information is indeed exchanged on Yahoo! Answers, a lot of the conversations that take place there are intended to be social in nature.
Micah has published a detailed presentation that covers this project in some depth. You can find it at http://www.slideshare.net/malpern/wikimania-2009-yahoo-answers-community-moderation.
Yahoo! Answers is not a reference site in the sense that Wikipedia is; it is not based on the ambition to provide objective, verifiable information. Rather, its goal is to encourage participation from a wide variety of contributors. That goal is important to keep in mind as we delve further into the problems that Yahoo! Answers was undergoing and the steps needed to solve them. Specifically, keep the following in mind:
• The answers on Yahoo! Answers are subjective. It is the community that determines what responses are ultimately “right.” It should not be a goal of any metamoderation system to distinguish right answers from wrong or otherwise place any importance on the objective truth of answers.
• In a marketplace for opinions such as Yahoo! Answers, it’s in the best interest of everyone (askers, answerers, and the site operator) to encourage more opinions, not fewer. So the designer of a moderation system intended to weed out abusive content should make every attempt to avoid punishing legitimate questions and answers. False positives can’t be tolerated, and the system must include an appeals process.
Attack of the Trolls
So, exactly what problems was Yahoo! Answers suffering from? Two factors (the timeliness with which Yahoo! Answers displayed new content and the overwhelming number of contributions it received) had combined to create an unfortunate environment that was almost irresistible to trolls. Dealing with offensive and antagonistic user content had become the number one feature request from the Yahoo! Answers community.
The Yahoo! Answers team first attempted a machine-learning approach, developing a black-box abuse classifier (lovingly named the “Junk Detector”) to prefilter abuse reports coming in. It was intended to classify the worst of the worst content and put it into a prioritized queue for the attention of customer care agents.
The Junk Detector was mostly a bust. It was moderately successful at detecting obvious spam, but it failed altogether to identify the subtler, more insidious contributions of trolls.
Do Trolls Eat Spam?
Figure 10-1 The questions asked and answers shared on Yahoo! Answers are often based on experiential knowledge rather than authoritative, fact-based information.
What’s the difference between trolling behavior and plain old spam? The distinction is subtle, but understanding it is critical when you’re combating either one. We classify