Building Web Reputation Systems


See Chapter 10 for an in-depth case study on a more comprehensive project to not only keep bad content on Answers subdued, but actually clean it up and remove it altogether, with much greater accuracy and speed.

Tuning for Behavior

There are many useful sources for reputation input, but one source stands out among all others: the user. The vast majority of content on the Web is user-generated, and user feedback generates the reputation that powers the Web. Nearly every search engine is built on evaluations, in the form of links, provided not by algorithms but by people.

In an effort to optimize all of this people-powered value, reputation systems have come to play a large part in creating incentives for user behavior: participation points, top contributor awards, etc. Users then respond to these incentives, changing their behavior, which then requires the reputation systems to be tuned to optimize newer and more sophisticated behavior (including adjustments for undesirable side effects, a.k.a. abuse). The cycle then repeats, if you’re lucky.

Emergent effects and emergent defects

It’s quite possible that, even during the beta period of your deployment, you’re noticing some strange effects starting to take hold. Perhaps content items are rising in the ranks that don’t entirely seem… deserving somehow. Or maybe you’re noticing a predominance of a certain kind of content at the expense of other types. What you’re seeing is the character of your community shaking itself out, finding its edges, and defining itself. Tread carefully before deciding how (and if) to intervene.

Check out Delicious’s Popular Bookmarks ranking for any given week; we bet you’ll see a whole lot of “Top N” blog articles (see Figure 9-2). Why might this be? Technology essayist Paul Graham posits that it may be the users of the service, and their motivational mindset, that explain it: “Delicious users are collectors, and a list of N things seems particularly collectible because it’s a collection itself.” (Graham explores the “List of N Things” phenomenon in some depth at http://www.paulgraham.com/nthings.html.) The preponderance of lists on Delicious is a natural offshoot of its context of use (an emergent effect) and is probably not one that you would worry about, nor try to control in any way.

Figure 9-1 By giving users a simple, private “Watchlist,” the Answers designers responded to the needs of Abuse Reporters who wanted to check back in on bad content.

236 | Chapter 9: Application Integration, Testing, and Tuning

But you may also be seeing the effects of some design decisions that you’ve made, and you may want to tweak those designs now, before wider deployment. Blogger and social media maven Muhammad Saleem noticed one such problem with voting on socially driven news sites such as Digg:

We are beginning to see a trend where people make assumptions about the contents of an article based on the meta-data associated with the submission rather than reading the article itself. Based on these (oft-flawed) assumptions, people then vote for or against the stories, and even comment on the stories without having read the stories themselves.

—http://web.archive.org/web/20061127130645/http://themulife.com/?p=256

We’ve noticed a similar tendency on some community-voting sites we’ve worked on at Yahoo! and have come to consider behavior like this to be a type of emergent defect: behavior that is homegrown within the community and may even become a de facto standard for interacting, but is not necessarily valued. In fact, it’s basically a bug and a failing of your system or, more likely, of your user interface design.

In instances like these, you should consider tweaking your design to encourage the proper and appropriate use of the controls you’re providing. In some ways, it’s not surprising that Digg users are voting on articles based on only surface appraisals; the application’s very design in fact encourages this (see Figure 9-3).

Figure 9-2 What are people saving on Delicious? Lists, lists and more lists… (and there’s nothing wrong with that).

Figure 9-3 The design of Digg enables (one might argue, encourages) voting for articles at a high level of the site. This excerpted screen is the front page of Digg: users can vote for (Digg) an article, or against (bury) it, with no need to read further.

Of course, one should not presuppose that the Digg folks think of this behavior (if it’s even as widespread as Saleem indicates) as a defect. Again, it’s a careful balance between the actual observed behavior of users and your own predetermined goals and aspirations for the application.

It’s quite possible that Digg feels that high voting levels, even if some percentage of those votes are from uninformed users, are important enough to promote voting at higher and higher levels of the site. From a brand perspective alone, it certainly would be odd to visit Digg.com and not see a single place to Digg something up, right?

Defending against emergent defects

It’s hard to anticipate all emergent defects until they… well… emerge. But there are certainly some good principles of design that you can follow that may defend your system against some of the most common ones:

Encourage consumption

If your system’s reputations are intended to capture the quality of a piece of content, you should make a good-faith attempt to ensure that users are qualified to make that assessment. Some examples:

• Early on in its lifetime, Apple’s iPhone App Store allowed any visitor to rate an application, whether they’d purchased it or not! You can probably see the potential for bad data to arise from this situation. A subsequent release addressed this problem, ensuring that only users who’d installed the program would have a voice. It doesn’t guarantee perfection, but a gating mechanism for rating does help dampen noise.

• Digg and other social voting sites provide a toolbar that follows logged-in users out to external sites, encouraging them to actually read linked articles before clicking the toolbar-provided voting mechanism. Your application could even require an interaction like this for a vote to be counted. (More likely, you’ll simply want to weight votes more heavily when they’re cast in a guaranteed-better fashion like this.)

• Think of ways to check for consumption in a media-specific way. With videos, for example, perhaps you should give more weight to opinions cast about a video only once the user has passed a certain time threshold of viewing (or, perhaps, disable voting mechanisms altogether until that time).
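The idea of weighting votes by evidence of consumption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights, the 50% viewing threshold, and the `Vote` fields are hypothetical placeholders, not values used by Digg, Apple, or Yahoo!, and real values would come out of tuning.

```python
from dataclasses import dataclass

# Hypothetical tuning constants (assumptions, not values from any real system).
VERIFIED_WEIGHT = 1.0      # vote cast after enough of the video was watched
UNVERIFIED_WEIGHT = 0.25   # vote cast without evidence of consumption
MIN_WATCH_FRACTION = 0.5   # fraction of the video that counts as "consumed"


@dataclass
class Vote:
    value: int              # +1 for a thumbs-up, -1 for a thumbs-down
    watched_seconds: float  # how long this user actually watched
    video_seconds: float    # total length of the video


def vote_weight(vote: Vote) -> float:
    """Return the weight of a vote based on how much of the video was watched."""
    if vote.video_seconds <= 0:
        return UNVERIFIED_WEIGHT
    watched_fraction = vote.watched_seconds / vote.video_seconds
    if watched_fraction >= MIN_WATCH_FRACTION:
        return VERIFIED_WEIGHT
    return UNVERIFIED_WEIGHT


def weighted_score(votes: list[Vote]) -> float:
    """Sum the votes, counting informed votes more heavily than drive-by ones."""
    return sum(v.value * vote_weight(v) for v in votes)
```

Setting `UNVERIFIED_WEIGHT` to zero turns the same sketch into the stricter gating variant, in which a vote is not counted at all until the threshold is passed.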

Avoid ambiguous controls

Try not to lard too much input overhead onto reputable entities, and try to keep the purpose and primary value of each control clear, concise, and nonconflicting. If your design already calls for a Bookmarking or Favorites feature, carefully consider whether you also need a Thumbs Up or “I Like It.”

In any event, provide some cues to users about the utility of those controls. Are they strictly for expressing an opinion? Sharing with a friend? Saving for later? The downstream effects may, in fact, be that one control does all three of these things, but sometimes it’s better to suggest clear and consistent uses for controls than let the community muddle along, inventing its own utilities and rationales for things.

If a secondary or tertiary use for a control emerges, consider formalizing that function as a new feature.

Keep great reputations scarce

Many of the benefits that we’ve discussed for tracking reputation (the ability to highlight good contributions and contributors, the ability to “tag” user profiles with awards or recognition, even the simple ability to motivate contributors to excel) can be undermined if you make one simple mistake with your reputation system: being too generous with positive reputations. In particular, if you hand out reputations at the higher end of the spectrum too widely, they will no longer be seen as valuable and rare achievements. You’ll also lose the ability to call out great content in long listings; if everything is marked as special, nothing will stand out.

It’s probably OK to wait until the tuning phase to address the question of distribution thresholds. You’ll need to make some calculations, based on available data for current use of the application, to determine how heavily or lightly to weight certain inputs into the system. A good example is the Gold/Silver/Bronze medal system that we developed at Yahoo! to reward active, quality contributors to UK Sports Message Boards.

We knew that we wanted certain inputs to factor into users’ badge-holder reputations: the number of posts posted, how well the community received the posts (i.e., how highly the posts were rated), and so on. But, at first, our guesses at the appropriate thresholds for these activities were just that: guesses.

Take, for instance, one input that was included to indicate dedication to the community: the number of posts that a user had rated. (In general, we caution against simple activity-level indicators for karma, but remember that this is but one input into the model, weighted appropriately against other quality indicators like community response to your own postings.) We arbitrarily settled on the following minimum thresholds for badge earners:

• Bronze Badge—5 posts rated

• Silver Badge—20 posts rated

• Gold Badge—100 posts rated

These were simply stabs in the dark (placeholders, really) that we fully expected to tune as we got closer to deployment.

And, in fact, once we’d done an in-depth calculation of projected badge numbers in the community (based on Message Board activity levels that were already evident before the addition of badges), we realized that these estimates were way too low. We would be giving out millions of Bronze badges, and, heck, still thousands of Golds. This felt way too liberal, given the goals of the project: to identify and reward only the most active and valued contributors to boards.

By the time the feature went into production, these minimum thresholds for rating others’ postings were made much higher (orders of magnitude higher) and, in fact, it was several months before the first message board Gold badge actually surfaced in the wild! We considered that a good thing, and perfectly in line with the business and community metrics we’d laid out at the project’s outset.
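The kind of projection described above, counting how many badges a set of candidate thresholds would hand out given existing activity data, might be sketched as follows. The function name and the sample activity list are hypothetical; the first threshold set is the 5/20/100 placeholder from the text, while the raised set is an invented illustration, since the production thresholds are not given here.

```python
def projected_badges(posts_rated_per_user, thresholds):
    """Count how many users would earn each badge as their highest tier.

    posts_rated_per_user: one number per user (posts that user has rated).
    thresholds: dict mapping badge name -> minimum posts rated.
    """
    # Sort tiers from the lowest minimum to the highest.
    tiers = sorted(thresholds.items(), key=lambda item: item[1])
    counts = {name: 0 for name, _ in tiers}
    for posts_rated in posts_rated_per_user:
        earned = None
        for name, minimum in tiers:
            if posts_rated >= minimum:
                earned = name  # keep climbing while the user still qualifies
        if earned is not None:
            counts[earned] += 1
    return counts


# Made-up activity data for nine users (an assumption for illustration only).
activity = [0, 3, 5, 7, 20, 25, 100, 150, 1200]

initial = {"Bronze": 5, "Silver": 20, "Gold": 100}    # placeholders from the text
raised = {"Bronze": 50, "Silver": 200, "Gold": 1000}  # hypothetical stricter set

print(projected_badges(activity, initial))  # {'Bronze': 2, 'Silver': 2, 'Gold': 3}
print(projected_badges(activity, raised))   # {'Bronze': 2, 'Silver': 0, 'Gold': 1}
```

Running candidate thresholds against real activity logs before launch makes an overly generous distribution visible while it is still cheap to change.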

So…How Much Is Enough?

When you’re trying to plan out these distribution thresholds for reputations, your calculations will (of course!) vary with the context of use.

Is this karma (people reputation) or content reputation?

Be more mindful of the distribution of karma. It’s probably OK to have an overabundance of “Trophy-winning videos” floating around your site, but having too many top-flight experts risks devaluing the reward altogether.

Honor the presentation pattern

Some distribution thresholds will be super easy to calibrate; if you’re honoring the Top 100 Reviewers on your site, for example, the number of users awarded should be fairly self-evident. It’s only with more ambiguous patterns that thresholds will need to be actively tuned and massaged to get the desired distributions.

Power-law is your friend

When in doubt, try to award reputations along a power-law distribution. (See http://en.wikipedia.org/wiki/Power_law.) Great reputations should be rare, good ones scarce, and mediocre ones should be the norm. This mimics the natural properties of most networks, so your reputations should reflect those values also.
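One way to read the power-law advice is to size each reputation level so that its share of users falls off as a power of the level number. The sketch below is an assumption-laden illustration, not a method prescribed by the text: the function names, the exponent `alpha=2.0`, and the three-level split are all hypothetical choices that would need tuning.

```python
def power_law_shares(levels, alpha=2.0):
    """Return the share of users for each level, proportional to level ** -alpha.

    Level 1 is the most common; the highest level is the rarest.
    """
    weights = [k ** -alpha for k in range(1, levels + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def assign_levels(scores, levels=3, alpha=2.0):
    """Map each user to a level by score rank, with power-law-sized levels.

    scores: dict mapping user id -> numeric reputation score.
    """
    shares = power_law_shares(levels, alpha)
    ranked = sorted(scores, key=scores.get)  # lowest score first
    assigned = {}
    start = 0
    cumulative = 0.0
    for level, share in enumerate(shares, start=1):
        cumulative += share
        # The last level takes everyone who is left, avoiding rounding gaps.
        end = len(ranked) if level == levels else round(cumulative * len(ranked))
        for user in ranked[start:end]:
            assigned[user] = level
        start = end
    return assigned
```

With 100 users, three levels, and `alpha=2.0`, this yields 73 users at level 1, 19 at level 2, and 8 at level 3. A larger `alpha` makes the top level rarer still.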

Tuning for the Future

There are sometimes pleasant surprises when implementing reputation systems for the first time. When users begin to interact with reputation-powered applications, the very nature of the application can change significantly; it often becomes communal: control of the reputable entities shifts from the company to the people.

This shift from a content-centric to a community-centric application often inspires new application designs built on the lessons drawn from the existing reputation system. Simply put, if reputation works well for one application, all of the other related applications will want to integrate it, yesterday!

Though new reputation models can be added only as fast as they can be developed, tested, integrated, and deployed, the application team can release new uses for existing reputations without coordination and almost instantaneously, because it already has access to the reputation API calls. This suggests that the reputation team should continuously optimize for performance against its internal metrics. Expect significant growth, especially in the number of reputation queries. Even if the primary application, as originally implemented, doesn’t grow daily users at an unexpected rate, expect the application team to add new types of uses, such as more reputation-weighted searches, or to add more pages that display a reputation score.

Tuning reputation systems for ROI, behavior, and future improvements is a never-ending process. If you stop this required maintenance, the entire system will lose value as it becomes abused, slow, noncompetitive, broken, and eventually irrelevant.

Learning by Example

It’s one thing to describe and critique currently deployed reputation systems after they’ve already been deployed. It’s another to prescribe a detailed set of steps that are recommended for new practitioners, as we have done in this book.

Talk is easy; action is difficult. But, action is easy; true understanding is difficult!

-- Warrior Proverb

The lessons we presented here are the direct result of many attempts (some succeeded, some failed) at reputation system development and deployment. The book is the result of successive refinement of those lessons, especially as we refined them at Yahoo! Chapter 10 is our proof-in-the-pudding that this methodology works in practice; it covers each step as we applied it during the development of a community moderation reputation model for Yahoo! Answers.


CHAPTER 10 Case Study: Yahoo! Answers Community Content Moderation

This chapter is a real-life case study applying many of the theories and practical advice presented in this book. The lessons learned on this project had a significant impact on our thinking about reputation systems, the power of social media moderation, and the need to publish these results in order to share our findings with the greater web application development community.

In the summer of 2007, Yahoo! tried to address some moderation challenges with one of its flagship community products: Yahoo! Answers. The service had fallen victim to its own success and drawn the attention of trolls and spammers in a big way. The Yahoo! Answers team was struggling to keep up with harmful, abusive content that flooded the service, most of which originated with a small number of bad actors on the site. Ultimately, a clever (but simple) system that was rich in reputation provided the answer to these woes: it was designed to identify bad actors, indemnify honest contributors, and take the overwhelming load off of the customer care team. Here’s how that system came about.

What Is Yahoo! Answers?

Yahoo! Answers debuted in December of 2005 and almost immediately enjoyed massive popularity as a community-driven website and a source of shared knowledge. Yahoo! Answers provides a very simple interface to do, chiefly, two things: pose questions to a large community (potentially, any active, registered Yahoo! user, which is roughly a half-billion people worldwide), or answer questions that others have asked. Yahoo! Answers was modeled, in part, on similar question-and-answer sites like Korea’s Naver.com Knowledge Search.

The appeal of this format was undeniable. By June of 2006, according to Business 2.0, Yahoo! Answers had already become “the second most popular Internet reference site after Wikipedia and had more than 90% of the domestic question-and-answer market share, as measured by comScore.” Its popularity continues and, owing partly to excellent search engine optimization (SEO), Yahoo! Answers pages frequently appear very near the top of search results pages on Google and Yahoo! for a wide variety of topics.

Yahoo! Answers is by far the most active community site on the Yahoo! network. It logs more than 1.2 million user contributions (questions and answers combined) each day.

A Marketplace for Questions and Yahoo! Answers

Yahoo! Answers is a unique kind of marketplace, one not based on the transfer of goods for monetary reward. No, Yahoo! Answers is a knowledge marketplace, where the currency of exchange is ideas. Furthermore, Yahoo! Answers focuses on a specific kind of knowledge.

Micah Alpern was the user experience lead for early releases of Yahoo! Answers. He refers to the unique focus of Yahoo! Answers as “experiential knowledge”: the exchange of opinions and sharing of common experiences and advice (see Figure 10-1). While verifiable, factual information is indeed exchanged on Yahoo! Answers, a lot of the conversations that take place there are intended to be social in nature.

Micah has published a detailed presentation that covers this project in some depth. You can find it at http://www.slideshare.net/malpern/wikimania-2009-yahoo-answers-community-moderation.

Yahoo! Answers is not a reference site in the sense that Wikipedia is; it is not based on the ambition to provide objective, verifiable information. Rather, its goal is to encourage participation from a wide variety of contributors. That goal is important to keep in mind as we delve further into the problems that Yahoo! Answers was undergoing and the steps needed to solve them. Specifically, keep the following in mind:

• The answers on Yahoo! Answers are subjective. It is the community that determines what responses are ultimately “right.” It should not be a goal of any metamoderation system to distinguish right answers from wrong or otherwise place any importance on the objective truth of answers.

• In a marketplace for opinions such as Yahoo! Answers, it’s in the best interest of everyone (askers, answerers, and the site operator) to encourage more opinions, not fewer. So the designer of a moderation system intended to weed out abusive content should make every attempt to avoid punishing legitimate questions and answers. False positives can’t be tolerated, and the system must include an appeals process.


Attack of the Trolls

So, exactly what problems was Yahoo! Answers suffering from? Two factors, the timeliness with which Yahoo! Answers displayed new content and the overwhelming number of contributions it received, had combined to create an unfortunate environment that was almost irresistible to trolls. Dealing with offensive and antagonistic user content had become the number one feature request from the Yahoo! Answers community.

The Yahoo! Answers team first attempted a machine-learning approach, developing a black-box abuse classifier (lovingly named the “Junk Detector”) to prefilter abuse reports coming in. It was intended to classify the worst of the worst content and put it into a prioritized queue for the attention of customer care agents.

The Junk Detector was mostly a bust. It was moderately successful at detecting obvious spam, but it failed altogether to identify the subtler, more insidious contributions of trolls.

Do Trolls Eat Spam?

Figure 10-1 The questions asked and answers shared on Yahoo! Answers are often based on experiential knowledge rather than authoritative, fact-based information.

What’s the difference between trolling behavior and plain old spam? The distinction is subtle, but understanding it is critical when you’re combating either one. We classify
