
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

──────── * ───────

UNDERGRADUATE DISSERTATION

MAJORED IN INFORMATION TECHNOLOGY

ASSESSMENT OF USER EXPERTISE IN QUESTION-AND-ANSWER SYSTEMS: DESIGN AND IMPLEMENTATION

Author: Phạm Tuấn Long

Software Engineering C51 SoICT, HUST

Mentors: Prof Dr Huỳnh Quyết Thắng

MS Lê Quốc

HANOI, 5-2011

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

──────── * ───────

UNDERGRADUATE

GRADUATION THESIS

MAJOR: INFORMATION TECHNOLOGY

ASSESSMENT OF USER EXPERTISE IN QUESTION-AND-ANSWER SYSTEMS:

DISSERTATION TASK SHEET

1 Student information

Full-name: Phạm Tuấn Long

Phone number: (+84) 972-889-760

Email: longpt214@gmail.com

2 Objectives

Design and implement an algorithm to assess the expertise of users in question-and-answer systems (Q&A systems), applied in BkProfile, a knowledge-sharing community.

I, Pham Tuan Long, hereby declare that this dissertation is my own work, carried out under the instruction of Prof Dr Huỳnh Quyết Thắng and Ms Lê Quốc, and that it incorporates no plagiarized passage.

All results and findings of this dissertation are honest.

GRADUATION THESIS ASSIGNMENT SHEET

1 Student information

Full name: Phạm Tuấn Long

Phone number: 0972889760

Email: longpt214@gmail.com

Class: Software Engineering K51

Training program: Full-time

The graduation thesis was carried out at: BkProfile team, based at Cazoodle@Việt Nam

Thesis period: from 15/02/2011 to 25/05/2011

2 Objectives of the thesis

Design and implement an algorithm to evaluate the quality of answers and the expertise of users in question-and-answer systems, applied specifically to the BkProfile knowledge-sharing community network.

3 Specific tasks of the thesis

− Study Q&A systems around the world and how they score user expertise and answer quality

− Design an evaluation algorithm that suits the conditions of the BkProfile system and can scale as the system grows

− Implement the algorithm and run trials at bkprofile.com

4 Student's declaration:

I, Phạm Tuấn Long, declare that this thesis is my own research work, carried out under the supervision of Assoc. Prof. Dr. Huỳnh Quyết Thắng and MSc. Lê Quốc.

The results presented in this thesis are honest and are not copied in full from any other work.

Hanoi, 15 February 2011

Thesis author

Phạm Tuấn Long

5 Supervisor's confirmation of the degree of completion of the thesis and permission to defend:

Hanoi, 17 February 2011

Supervisor

Assoc. Prof. Dr. Huỳnh Quyết Thắng

PREFACE

To me, writing this dissertation is a valuable chance to review my academic knowledge, present my biggest product at school, and test my ability to deal with the real-life problems that I will have to solve in the future.

This dissertation is an important module of BkProfile, a knowledge-sharing web application developed by the BkProfile team, a student team of the School of Information and Communication Technology based in the Vietnam branch of Cazoodle Inc., under the instruction of Prof Dr Thang Quyet Huynh and Ms Quoc Le. The first version of the module was integrated into the beta version of the web application, available at http://www.bkprofile.com. It connects with other important modules of the system, i.e. the SOLR search engine and the JOO-framework-based presentation.

The objectives of this dissertation are as follows:

• Research Q&A systems around the world and their methods to evaluate the expertise of users

As a result, the dissertation is organized as follows:

• Chapter 1 describes my motivation to do the research, the statement of the problem and my approach to tackling the problem.

• Chapter 2 includes a quick review of Google AardVark and Google Confucius, as the most closely related research; Markov chains & PageRank, as the foundation of my research; and Hadoop MapReduce, as the platform I use to implement the algorithm.

• Chapter 3, the most important chapter of this dissertation, elaborates my method to rank users in a Q&A system.

• Chapter 4 discusses the details of my implementation of the algorithm in Hadoop MapReduce. It includes a technique to automate MapReduce processes.

• Chapter 5 summarizes the experimental results of testing ExpertRank in a real system, named BkProfile. I also describe the population & sampling I use to test the algorithm.

• Chapter 6 discusses the current applications and current problems of ExpertRank, and recommends a few ways to make it better.

• Chapter 7 is the conclusion.

The objectives of this thesis are as follows:

• Study Q&A systems around the world and their methods for evaluating the expertise of users

• Design an evaluation algorithm that can scale to large systems and that suits the specific environment of BkProfile

• Implement and deploy the program on the live BkProfile system, at http://www.bkprofile.com

Accordingly, the thesis is organized as follows:

• Chapter 1 describes the motivation for the research, states the problem to be studied and presents the approach to solving it.

• Chapter 2 comprises a short analysis of the approaches of other Q&A systems, namely Google AardVark and Google Confucius; a study of Markov chains and the PageRank algorithm as the main foundations of this research; and finally Hadoop MapReduce, the distributed platform I used to implement the algorithm. In this chapter, I also point out the main contribution of this research.

• Chapter 3, which is also the most important chapter of the thesis, presents in detail my method for ranking users in Q&A systems.

• Chapter 4 discusses the details of how I implemented the proposed algorithm on the Hadoop MapReduce platform, including a technique to automate the execution of Hadoop MapReduce.

• Chapter 5 summarizes the experimental results of testing ExpertRank in a real system named BkProfile. I also describe the data I used for testing.

• Chapter 6 discusses the current applications and the remaining problems of ExpertRank, and presents a few directions that could improve the quality of the algorithm's results.

• Chapter 7 summarizes what I have achieved, compared with what was set out at the beginning.

Thus, Chapters 1-2 correspond to the problem statement and solution orientation part, Chapters 3-6 correspond to the achieved results part, and Chapter 7 corresponds to the conclusion part in the School's thesis-writing guidelines.

This dissertation presents ExpertRank, an iterative algorithm to evaluate the expertise of users in a specific domain of knowledge in question & answer systems, e.g. BkProfile, a knowledge-sharing community. The evaluation helps users prove their ability in their professional profiles in the system, which is a critical motivation for a Q&A system to run smoothly. I base my method on the Markov chain model, with reference to Google's PageRank algorithm, to convert the above problem into a probabilistic model and then use an iterative method to solve it. My algorithm is designed on the MapReduce programming model so that it can be applied to large-scale systems. We have tested it on an open-source platform named Hadoop MapReduce and deployed it in the BkProfile web application, http://www.bkprofile.com. The results of the algorithm can be encapsulated as a reliable parameter for the evaluation systems of other large-scale sharing web applications.

Keywords: Iterative method, Markov chain, MapReduce, PageRank

SUMMARY OF THE GRADUATION THESIS

This research introduces ExpertRank, an iterative algorithm that evaluates the quality of answers and the expertise of users in a given field in large-scale question-and-answer systems, specifically the BkProfile knowledge-sharing community network. Evaluating user expertise helps users demonstrate their professional knowledge in their professional profiles in the system, which is an important incentive driving the activity of Q&A systems. I relied on PageRank, the web page ranking algorithm of the Google search engine, and on the Markov chain model to convert the problem into probabilistic models, and from there built an iterative algorithm that evaluates both quantities at the same time. The algorithm is designed on the MapReduce model, so it can be applied to large-scale distributed systems. I tested it on an open-source system named Hadoop MapReduce and deployed it to run stably on the BkProfile web application at http://www.bkprofile.com. The results of the algorithm can also be packaged as a reliable parameter for the evaluation systems of other large-scale knowledge-sharing applications.

Keywords: Iterative method, Markov chain, MapReduce, PageRank

Finally, I want to send my heartfelt thanks to the School of Information & Communication Technology, Hanoi University of Science & Technology, for the knowledge, skills and spirit I have gained there.

Pham Tuan Long


Table of Contents

INTRODUCTION

1.1 Motivations

1.1.1 The explosion of Q&A systems

1.1.2 The need of a good expertise evaluation formula

1.2 Statement of the problem

1.3 My approach

1.4 Dissertation structure

1.5 Summary

LITERATURE REVIEW

1.6 Analysis of Google Confucius's approach

1.7 Analysis of Google AardVark's approach

1.8 Online knowledge market

1.9 Recommendation-based evaluation system

1.10 Markov chain and its convergence condition

1.11 Google PageRank algorithm

1.12 Cloud computing

1.13 MapReduce programming structure

1.14 Hadoop MapReduce

1.15 Summary

METHODOLOGY

1.16 Recommendation channel aggregation

1.17 Random expert seeker model

1.18 ExpertRank basic formula

1.19 Examples of running ExpertRank basic formula

1.20 ExpertRank's convergence

1.20.1 Mapping from Random expert seeker model to Markov chain

1.20.2 Dangling links

1.20.3 ExpertRank traps

1.20.4 Solution

1.21 ExpertRank full formula

1.22 Limitation of the algorithm

1.23 Summary

IMPLEMENTATION

1.24 Flow of MapReduce processes

1.25 MapReduce Implementation on Apache Hadoop

1.26 Data normalization

1.27 ExpertRank transition calculation

1.28 Random visiting probability distribution

1.29 Halt condition

1.30 Final result extraction

1.31 Summary

RESULTS & FINDINGS

1.32 Population and Sampling

1.32.1 Overview of BkProfile

1.32.2 Sampling methods

1.32.3 Expected Results

1.33 Performance & convergence

1.34 Examples

1.35 Summary

DISCUSSION

1.36 Applications

1.36.1 Search engine ranking function

1.36.2 Answer quality evaluation

1.36.3 Potential answerer suggestion

1.36.4 User reliability evaluation

1.37 Limitation of current ExpertRank formula

1.38 ExpertRank's incentives

1.39 Prevention of spammers

1.40 ExpertRank's personalization

1.41 Summary

CONCLUSION

INDEX OF TABLES

Table 1: Advantages of Q&A systems in comparison with other knowledge sharing & searching tools

Table 2: Incentives of current famous Q&A sites

INDEX OF FIGURES

Figure 1: An example of application of Markov chain

Figure 2: PageRank transfer among web pages

Figure 3: Applications of cloud computing

Figure 4: Basic model of MapReduce

Figure 5: Random expert seeker model

Figure 6: ExpertRank transfer among expert network

Figure 7: Step 1 of the example of using ExpertRank basic formula

Figure 8: Step 2 of the example of using ExpertRank basic formula

Figure 9: Mapping from Random expert seeker model to ExpertRank

Figure 10: Dangling links and outer node

Figure 11: ExpertRank trap

Figure 12: A solution for the convergence of ExpertRank

Figure 13: Flow of MapReduce processes

Figure 14: Rough data

Figure 15: Aggregation

Figure 16: Transition table

Figure 17: The convergence of ExpertRank

Figure 18: The search engine of BkProfile

Figure 19: Presentation of answers for a question in BkProfile

Figure 20: Stats in BkProfile

Figure 21: A profile in BkProfile

Figure 22: ExpertRank listing on each topic's page

| No | Abbreviation | Full form |
|---|---|---|
| 1 | Q&A | Question & Answer |
| 2 | NLP | Natural Language Processing |
| 3 | SoICT | School of Information and Communication Technology |
| 4 | HUST | Hanoi University of Science and Technology |

| No | Term | Definition |
|---|---|---|
| 1 | Outlink | A link appearing in a website to direct users to another website |
| 2 | Outer node | A node that does not have any outlink |
| 3 | Dangling link | A link that directs to an outer node |
| 4 | Search engine | A computer program which finds information on the Internet by looking for words which you have typed in |
| 5 | Apache Lucene | A high-performance, full-featured text search engine library |
| 6 | SOLR | A popular open-source text search engine based on Lucene technology and written in Java |
| 7 | Hadoop | A project that develops open-source software for reliable, scalable, distributed computing |
| 8 | MapReduce | A programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes |
| 9 | Hadoop MapReduce | An open source volunteer project under the Apache Software Foundation |
| 10 | Cloud computing | A style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers |
| 11 | Markov chain | A mathematical system that undergoes transitions from one state to another as a chain, endowed with the Markov property: the next state depends only on the current state and not on the past |
| 12 | PageRank | Google's algorithm to rank the importance of web pages in the Internet |
| 13 | ExpertRank | Name of my proposed algorithm to rank users' expertise |

INTRODUCTION

In this chapter, I first present the explosion of Q&A systems and the reasons why a Q&A system needs an algorithm to evaluate both the expertise of users and the quality of answers, in order to establish the need for this research. Then, I clearly state the problem I would like to solve and discuss my approach to solving it. At the end of this chapter, I describe the other chapters of my dissertation and how they connect to one another.

1.1 Motivations

1.1.1 The explosion of Q&A systems

In recent years, Question and Answer systems, or Q&A systems for short, have developed at rocket speed. In 2010, Google published research stating that 25% of Google's top search results include at least one link to a Q&A site [1]. This is easy to understand, since most big names in IT have entered the so-called online knowledge market, namely Yahoo with Yahoo Answers; Google with Google Answers and Google Confucius; and Facebook with Facebook Q&A. There are also very good Q&A systems from smaller companies such as Quora, StackOverFlow, OSDir, and so on. People have even built a platform, named Question2Answer, to quickly create new Q&A sites. So far, it has helped build up to 1665 sites in 32 languages [2]. Most recently, in 2010, AardVark, a 2-year-old Q&A system developed by 20 engineers, was acquired by Google for a price of up to $50,000,000.

Now, Q&A sites are still burgeoning thanks to their advantages over other online knowledge sharing & searching tools such as search engines, forums and blogs:

| No | Advantage of Q&A | Search Engine | Forum | Blog |
|---|---|---|---|---|
| 1 | […] | Users need to synthesize articles to get the final answer | Too many junk and not-to-the-point answers, which make up very long threads; answers and discussions are mixed | Answers are often indirect, since articles are written to solve the authors' problems rather than the searchers' |
| 2 | Providing answers for completely new questions | Can't | Can | Can't |
| 3 | Providing answers for heavily local questions | Usually can't | Can | Can |
| 4 | Providing search engine with well […] | N/A | Forum is one of hardest kinds of sites […] | Blog is also hard to index due to its […] |

Table 1: Advantages of Q&A systems in comparison with other knowledge sharing & searching tools

1.1.2 The need of a good expertise evaluation formula

Through using various Q&A systems, e.g. OSDir, Google Confucius, Quora, Google AardVark, and so on, as well as through reading analytical papers on successful Q&A systems, I realized that there are three main requirements for a Q&A site to succeed:

• New questions quickly get answers

• Answers are provided with high quality

• Questions & answers are organized so that they are easy to find

Each website has its own way of meeting these three criteria. Some choose automatic methods such as automatic answering, automatic classifying, automatic organizing and so on, but most of them use incentives to encourage their users to fulfil the three requirements themselves. According to [1], there are two main incentives for people to contribute to a Q&A site: financial and virtual values.

| No | Site | Question routing | Incentives | Establishment year |
|---|---|---|---|---|
| 1 | Internet Oracle | to experts | virtual | 1989 |
| 2 | Ask.com | N/A | N/A | 1996 |
| 3 | WikiAnswers | to public | virtual | 2002 |
| 4 | Yahoo! Answers | to public | virtual | 2005 |
| 5 | Baidu Zhidao | to public & experts | virtual & $ | 2005 |
| 6 | Google Confucius | to public & experts | virtual & $ | 2007 |
| 7 | Aardvark | to friends | virtual | 2009 |
| 8 | Powerset | N/A | N/A | 2005 |
| 9 | Quora | to friends | virtual | 2009 |

Table 2: Incentives of current famous Q&A sites

Question routing brings a question from the questioner to the people who have the most incentive to answer it. Question routing is critical to ensure that questions are answered quickly and with high quality. However, it depends on the incentives that the system provides. In turn, incentives depend on the evaluation system of the Q&A site. If the incentive is financial, the site needs a fair evaluation formula to determine how much it should pay for an answer from a specific expert. If the incentive is virtual value, a fair evaluation system is also critical to make virtual values more reliable and more beneficial. It can be said that evaluation, i.e. evaluation of answers and evaluation of experts, is the main driving force that keeps a Q&A site running smoothly. Together with the method of organizing questions & answers, it decides the success of a Q&A site.

However, working out a good evaluation is not an easy task, due to the complicated relationships between the elements of a Q&A site such as answers, questions, answer providers and so on.

Together with my team, I am creating a Vietnamese Q&A site named BkProfile [3]. The 1000-user system is currently running quite smoothly without a proper evaluation formula. However, we all understand that a good and fair evaluation formula is essential to ensure the stable development of the site.

1.2 Statement of the problem

In this dissertation, I present a method to evaluate both the expertise of users and the quality of answers. The evaluation should have the following characteristics:

(1) Fair & objective. Since evaluation is usually subjective, there is no way to create a perfect evaluation formula that works in all cases. However, a fair & objective formula is a good basis for people's own judgment.

(2) Incentive-providing. Not every evaluation formula, even one that is rational in some cases, provides incentives for the development of a Q&A site. However, providing incentives is critical here, since it is the reason for researching the evaluation formula in the first place.

(3) Scalable. We plan to grow our system to the scale of the whole Vietnamese knowledge market. Thus, a scalable algorithm is compulsory.

1.3 My approach

There are two natural ways to evaluate an answer or a user:

(1) Evaluate the content of answers and the contribution of users themselves, such as: how relevant the answer is to its question, the level of its writing style, how different the answer is from previous answers, and so on; and how much time the user has contributed to the system, how many answers he has provided, how good his answers were, and so on.

(2) Evaluate the relationships among answers, among users, and between answers & users, such as how many votes a user gave to another, how many answers a user provided to another, how good an answer was compared to other ones, and so on.

Between the two approaches, I chose the latter because current methods of evaluating the content of a text are not good enough and, in particular, are quite subjective, which violates the first requirement in the statement of the problem. Moreover, to simplify the algorithm, I only evaluate the expertise of users and then use it to evaluate the quality of answers through the interaction between user and answer, i.e. voter/voted answer and provider/answer.

However, the problem is still complicated by the number of channels connecting users. I therefore proceed in two steps:

• Aggregate all channels into one single channel, which I call recommendation.

• Analyze the links among all users through the aggregated channel only. The analysis technique I use is similar to Google's PageRank algorithm, which uses the Markov chain model, the basis of my method.

The method is quite fair and objective since it considers the whole network rather than only a piece of information like a specific answer. It provides incentives for the system, since highly ranked users in recommendation-based systems have many privileges such as prestige, weight of influence and so forth. Finally, it is scalable, as the algorithm can be implemented on a large-scale platform named Hadoop MapReduce. I implemented a simple version of ExpertRank on this platform, which gave me some encouraging results.
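The two steps of the approach can be illustrated with a minimal Python sketch. It assumes interaction records of the form (recommender, recommended, channel, count); the channel names, the weights in `CHANNEL_WEIGHTS`, the direction given to each channel and the user names are all hypothetical and are not taken from BkProfile. The sketch merges the channels into one weighted recommendation edge per pair of users and then normalizes each user's outgoing weights, which is the form a PageRank-like analysis could start from.

```python
from collections import defaultdict

# Hypothetical channel weights, chosen only for illustration.
CHANNEL_WEIGHTS = {"vote": 1.0, "answer": 0.5}


def aggregate_channels(interactions, weights=CHANNEL_WEIGHTS):
    """Merge (recommender, recommended, channel, count) records into one weighted edge per pair."""
    edges = defaultdict(float)
    for source, target, channel, count in interactions:
        edges[(source, target)] += weights[channel] * count
    return dict(edges)


def normalize_outgoing(edges):
    """Scale each user's outgoing weights so that they sum to 1."""
    totals = defaultdict(float)
    for (source, _), weight in edges.items():
        totals[source] += weight
    return {(s, t): w / totals[s] for (s, t), w in edges.items()}


# Hypothetical sample data.
interactions = [
    ("an", "binh", "vote", 3),
    ("an", "binh", "answer", 2),
    ("an", "chi", "vote", 1),
]
recommendation = aggregate_channels(interactions)
transition = normalize_outgoing(recommendation)
```

After normalization, each user's outgoing recommendation weights can be read as the probabilities of moving from that user to another, which is how the aggregated channel connects to the Markov chain view used later.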

1.4 Dissertation structure

The dissertation is structured as follows:

Chapter 1 describes my motivation to do the research, the statement of the problem and my approach to tackling the problem.

Chapter 2 includes a quick review of Google AardVark and Google Confucius, as the most closely related research; Markov chains & PageRank, as the foundation of my research; and Hadoop MapReduce, as the platform I use to implement the algorithm.

Chapter 3, the most important chapter of this dissertation, elaborates my method to rank users in a Q&A system.

Chapter 4 discusses the details of my implementation of the algorithm in Hadoop MapReduce. It includes a few techniques to automate MapReduce processes.

Chapter 5 summarizes the experimental results of testing ExpertRank in a real system, named BkProfile. I also describe the population & sampling I use to test the algorithm.

Chapter 6 discusses the current applications and current problems of ExpertRank, and recommends a few ways to make it better.

Chapter 7 is the conclusion.

1.5 Summary

In this chapter, I have described my motivation to do the research, i.e. the explosion of Q&A systems and their need for an algorithm to evaluate the quality of answers and the expertise of users. I also clearly stated the problem and gave my approach to tackling it, i.e. aggregating the relationship channels into one channel called recommendation and analyzing the network of recommendations to rank experts. The next chapter is a review of related work, which explains the reasons why I chose this approach.


LITERATURE REVIEW

Despite the rapid development of Q&A systems, careful and direct research on them is quite limited. Among the available work, I chose the research papers on Google Confucius and Google AardVark for analysis because of their high quality and their similarity to the system my team and I are building. Then I generalize a Q&A system to an online knowledge market tool in order to take advantage of the research on other online markets, particularly those which are proven to work well in Vietnam. An analysis of recommendation-based evaluation systems follows, since this is the main idea of my approach. Then, I discuss Markov chains, which act as the basis of my research, as well as Google's PageRank as a famous example of using a Markov chain. My method derives many ideas from Google's PageRank. Finally, I give a brief overview of cloud computing, the MapReduce programming model and Hadoop MapReduce, because Hadoop MapReduce is the platform I use to implement my algorithm.

1.6 Analysis of Google Confucius's approach

Google Confucius [1] is a very sophisticated system with six main components: search integration, question labeling, question recommendation, answer quality assessment, user ranking and NLP-based answer generation. With these six components, it provides a smooth flow in which users' work is minimized. It also provides financial incentives to motivate experts to work more in the system. The user ranking system is quite important in helping to determine the money paid to each user. The authors used the HITS algorithm [4], applied to the questioner/answerer relationship, to rank users. Their rationale is that users in Q&A systems are not active enough to provide adequate social interaction such as votes, improvement suggestions, and so on. However, we still believe that if a Q&A site is constructed as a knowledge-sharing community, people will provide more social interaction, and that this interaction is valuable information to mine.

1.7 Analysis of Google AardVark's approach

Google AardVark calls itself a social search engine. In fact, its way of implementing a Q&A site is quite different from all others. It does not keep the answers as a knowledge store, but considers all queries as new questions and tries to route them to the right people. The right people are those who have good activity records in AardVark and those who are somehow connected to the questioner. The latter information is extracted by intensively analyzing users' social networks, such as Facebook, Twitter and Yahoo 360. AardVark also has a module that evaluates the quality of answers and the expertise of users through probabilistic models, and that is what I learned from when building the formula for ExpertRank.

1.8 Online knowledge market

Online knowledge market is the common term for all systems that aim to route knowledge from “providers” to “seekers”. Considering a Q&A system as an online knowledge market helps me look at its essential characteristic: it is an online market. From researching online markets, e.g. Amazon and eBay, I realized that three things are very important in driving a market to success:

• The availability of goods, i.e. knowledge. In Q&A systems, availability is created through the size and activeness of the community and through supporting systems like the information browser, the request router and the search engine.

• The evaluation of goods through reviews and information adequacy. In Q&A systems, these are the comments, the votes for answers, and the structure of an answer.

• The evaluation of service providers. In Q&A systems, of course, this is mainly the expertise of users. Expertise here is defined as the ability to provide good answers. It has nothing to do with the certificates or positions people may have in real life.

Finally, besides the three criteria above, through researching a few Vietnamese online markets like vatgia.com and chodientu.vn, I see that how popular a product is among other buyers is a very important criterion for a Vietnamese person deciding to buy it. This is the foundation for my belief that an evaluation formula based on the reviews of the community will succeed in Vietnamese online knowledge markets.

1.9 Recommendation-based evaluation system

Recommendation-based evaluation is very common in Western culture. People base their assessment of a person on trustworthy recommendations. For instance, admissions officers base their admissions decisions on letters of recommendation from their colleagues; researchers read through the titles of all articles referring to a specific article to roughly evaluate its quality. In a world of knowledgeable people who care much about their prestige, a recommendation system works smoothly, since people will be very careful with their recommendations in order to promote or protect their image. The Q&A site which we are building focuses on a knowledgeable community. Therefore, using a recommendation system to evaluate people is reasonable.

As for the technique used, link analysis [5], or in other words collaborative filtering [6], a technique to predict characteristics of objects, is not a new one. It is discussed intensively in an active branch of computer science named Knowledge Discovery in Databases [7]. My contribution is to apply and develop these techniques for Q&A systems, a new context with unique characteristics.

1.10 Markov chain and its convergence condition

A Markov chain is a mathematical system that undergoes transitions from one state to another as a chain [8]. It is a random process endowed with the Markov property: the next state depends only on the current state and not on the past.

In practice, Markov chains have many applications. In Figure 1, the transitions between market conditions are modeled as a Markov chain. Based on this information, the probability distribution of the conditions in a specific period of time is calculated by iteratively computing the new distribution after each transition until the process converges. The question is whether a Markov chain converges and how many convergent states it has. This is one of the main research problems concerning Markov chains.

According to the theory of Markov chains [8], an irreducible Markov chain converges to one state, or in other words has exactly one stationary distribution, if and only if all of its states are positive recurrent. Here:

• A Markov chain is irreducible if it is possible to get to any state from any state.

• A state i is transient if, starting from i, there is a non-zero probability that we will never return to i. A state is recurrent, or in other words persistent, if it is not transient. A recurrent state is positive recurrent if its expected return time is finite.

In this dissertation, I map my research problem to a Markov chain in order to design a formula that, after a finite number of iterations, leads to one and only one convergent state, i.e. a formula that satisfies both of the above conditions.
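The iterative calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-state transition matrix `P` (bull, bear and stagnant market) and its probabilities are hypothetical example values, not data from the dissertation, and the function name `stationary_distribution` is introduced here for the sketch.

```python
def stationary_distribution(P, tol=1e-10, max_iter=10_000):
    """Repeat pi <- pi * P from the uniform distribution until pi stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Stop when the L1 distance between two successive distributions is tiny.
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical transition matrix: row i holds the probabilities of moving
# from state i to each state (0 = bull, 1 = bear, 2 = stagnant).
P = [
    [0.90, 0.075, 0.025],
    [0.15, 0.80, 0.05],
    [0.25, 0.25, 0.50],
]
pi = stationary_distribution(P)
```

Because every state of this example chain can reach every other state, the iteration settles on a single distribution regardless of the starting point, which is the behavior the convergence condition is meant to guarantee.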

1.11 Google PageRank algorithm

Figure 1: An example of application of Markov chain

PageRank [9] is a very famous algorithm of Google that determines the importance of a web page from the outlinks of other pages that point to it: the more pages link to a web page, and the higher the quality of those pages, the more important it is. For instance, because tamhonvietnam.net is linked from soict.hut.edu.vn, a trustworthy website, the importance score of tamhonvietnam.net is increased by some value, and hence the website is easier to find in the search engine.

If the incorporation of an outlink to a web page is considered a recommendation of that page, PageRank evaluation is exactly a recommendation-based one. Moreover, PageRank targets a large-scale problem, since it attempts to assess the whole World Wide Web. Finally, the PageRank algorithm can be considered a Markov chain under the random web surfer model. Therefore, I base my research on PageRank to ensure the acceptability of some of my hypotheses.
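As a rough illustration of the random web surfer model, the following Python sketch computes PageRank by repeated redistribution of rank along outlinks. It is a simplified textbook-style version, not Google's production algorithm: the damping value 0.85, the handling of pages without outlinks (their rank is spread evenly over all pages) and the four-page graph are assumptions made for the example.

```python
def pagerank(outlinks, damping=0.85, tol=1e-10, max_iter=1000):
    """outlinks maps each page to the list of pages it links to."""
    pages = sorted(set(outlinks) | {q for targets in outlinks.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not outlinks.get(p))
        nxt = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes its rank in equal shares to the pages it links to.
        for p, targets in outlinks.items():
            for q in targets:
                nxt[q] += damping * rank[p] / len(targets)
        delta = sum(abs(nxt[p] - rank[p]) for p in pages)
        rank = nxt
        if delta < tol:
            break
    return rank


# Hypothetical link graph: page "c" is linked from three pages, page "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
```

In this example the page that collects the most links ends up with the highest score and the page nobody links to keeps only the baseline share, which mirrors the recommendation reading of an outlink.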

1.12 Cloud computing

Cloud computing is a style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers. With the appearance of cloud computing, small companies have more chances to approach large-scale batch processing, which is becoming more and more popular with the development of data mining techniques and the appearance of huge databases like those of Facebook, Amazon, Zing Me and so on. Currently, there are many good cloud computing service providers, e.g. Amazon EC2, NephoScale, Google, Microsoft, RightScale, etc. Using their services, people do not need to buy expensive computers to scale their systems up; instead, they rent services with their desired configurations and use the providers' APIs to access them. Since BkProfile is a small team, using cloud computing services is probably a wise approach to large-scale computation.

Figure 2: PageRank transfer among web pages


1.13 MapReduce programming structure

MapReduce [10] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. The core idea of the model is the map and reduce functions commonly used in functional programming. By understanding the programming model and having a framework to run MapReduce tasks, e.g. Apache Hadoop MapReduce or Google MapReduce, engineers can program a distributed computation quite conveniently, because the hardest tasks, such as job distribution and storage distribution, are hidden in the framework.

Figure 4 presents the basic model of a MapReduce task:

Figure 3: Applications of cloud computing

Figure 4: Basic model of MapReduce

• The input of a MapReduce task is a set of (key,value) pairs, which are distributed to mappers. Mappers are objects that can be configured to run on the same machine or on different machines.

• The output of a map task is also (key,value) pairs, but they usually have a different format from the input ones.

• The pairs are delivered to shuffle-and-sort modules, which sort and distribute them to the desired reducers. (key,value) pairs with the same key are delivered to the same reducer.

• In each reducer, (key,value) pairs with the same key are reduced into aggregated data, also in the format of (key,value). They are the output of the MapReduce task.

Note that because both the input and the output of a MapReduce task are in the format of (key,value), engineers can design complicated algorithms that involve several MapReduce tasks. In the implementation of my proposed algorithm, I used three different types of MapReduce tasks and ran them iteratively until a predefined condition was satisfied.

Finally, the MapReduce programming model can be more complicated, but that is not the focus of this dissertation.
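The map, shuffle-and-sort and reduce stages listed above can be imitated in memory with a short Python sketch. It runs on a single machine and does not use Hadoop; the helper `run_mapreduce`, the vote log and the user names are hypothetical and serve only to show how (key,value) pairs flow through one task.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map -> shuffle-and-sort -> reduce pass over (key, value) pairs in memory."""
    # Map phase: each input pair may emit any number of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    # Shuffle and sort: group values by key so that one reducer call sees all values of a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: aggregate each group into output pairs, again in (key, value) form.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical input: (voter, voted_user) pairs taken from a vote log.
votes = [("an", "binh"), ("chi", "binh"), ("an", "chi"), ("binh", "chi"), ("dung", "binh")]


def count_mapper(voter, voted_user):
    # Emit one unit of recommendation for the user who received the vote.
    yield voted_user, 1


def count_reducer(user, counts):
    # Sum all units received by the same user.
    yield user, sum(counts)


received = run_mapreduce(votes, count_mapper, count_reducer)
```

Since the result is again a list of (key,value) pairs, it could be passed to another call of `run_mapreduce`, which is the property that allows several tasks to be chained and repeated.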

1.14 Hadoop MapReduce

Hadoop MapReduce [11] is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It is an open source volunteer project under the Apache Software Foundation. Hadoop MapReduce goes along with HDFS, HBase, etc., providing a complete solution for distributed computation. A few of its well-known users are Facebook, Amazon, Cloudera, etc. Moreover, Hadoop MapReduce is well documented and many cloud computing services support it. That is the reason why I chose Hadoop MapReduce to implement my MapReduce-based algorithm.

1.15 Summary

In this chapter, I have described the work related to my dissertation, including Q&A sites and online knowledge markets, from which I get ideas; recommendation-based assessment, which is my approach; Markov chains and PageRank, which are the foundation of my research; and cloud computing, MapReduce and Apache Hadoop, which are the framework of my implementation.

As for my contribution, I took two ideas for my research from the PageRank algorithm, i.e. the detection of problems in the ExpertRank basic formula and their solutions, because PageRank is also based on Markov chains and has many similarities with ExpertRank. Moreover, I use a few results of PageRank, particularly the convergence speed, to make my hypotheses more solid. However, I simplified the problem by aggregating complicated relationship channels into a single one, and
