HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
──────── * ───────
UNDERGRADUATE DISSERTATION
MAJORED IN INFORMATION TECHNOLOGY
ASSESSMENT OF USER EXPERTISE IN QUESTION-AND-ANSWER SYSTEMS: DESIGN AND IMPLEMENTATION
Author: Phạm Tuấn Long
Software Engineering C51 SoICT, HUST
Mentors: Prof Dr Huỳnh Quyết Thắng
MS Lê Quốc
HANOI, 5-2011
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
──────── * ───────
UNDERGRADUATE GRADUATION PROJECT
MAJOR: INFORMATION TECHNOLOGY
ASSESSMENT OF USER EXPERTISE IN QUESTION-AND-ANSWER SYSTEMS:
DISSERTATION TASK SHEET
1 Student information
Full-name: Phạm Tuấn Long
Phone number: (+84) 972-889-760    Email: longpt214@gmail.com
2 Objectives
Design and implement an algorithm to assess the expertise of users in question-and-answer systems (Q&A systems), applied in BkProfile, a knowledge-sharing community.
I, Pham Tuan Long, hereby declare that this dissertation is my own work, carried out under the instruction of Prof Dr Huỳnh Quyết Thắng and Ms Lê Quốc, and that it incorporates no plagiarized passage.
All results and findings of this dissertation are honest.
GRADUATION PROJECT TASK SHEET
1 Student information
Full name: Phạm Tuấn Long
Phone number: 0972889760    Email: longpt214@gmail.com
Class: Software Engineering K51    Program: Full-time
The graduation project was carried out at: BkProfile team, based at Cazoodle@Việt Nam
Duration of the graduation project: from 15/02/2011 to 25/05/2011
2 Objectives of the graduation project
Design and implement an algorithm to assess the quality of answers and the expertise of users in question-and-answer systems, applied specifically to the BkProfile knowledge-sharing community
3 Specific tasks of the graduation project
− Study Q&A systems around the world and how they score user expertise and answer quality
− Design an assessment algorithm that fits the conditions of the BkProfile system and can scale as the system grows
− Implement the algorithm and run it experimentally at bkprofile.com
4 Student's declaration:
I, Phạm Tuấn Long, declare that this graduation project is my own research work, carried out under the supervision of Assoc. Prof. Dr. Huỳnh Quyết Thắng and MSc. Lê Quốc
The results presented in this graduation project are truthful and are not copied in full from any other work
Hanoi, 15 February 2011
Author of the graduation project
Phạm Tuấn Long
5 Supervisor's confirmation of the degree of completion of the graduation project and permission to defend:
Hanoi, 17 February 2011
Supervisor
Assoc. Prof. Dr. Huỳnh Quyết Thắng
PREFACE
For me, writing this dissertation is a valuable chance to review my academic knowledge, present my biggest product at school, and test my ability to deal with the real-life problems which I will have to solve in the future.
The work presented in this dissertation is an important module of BkProfile, a knowledge-sharing web application developed by the BkProfile team, a student team of the School of Information and Communication Technology based at the Vietnam branch of Cazoodle Inc., under the instruction of Prof Dr Thang Quyet Huynh and Ms Quoc Le. The first version of the module was integrated into the beta version of the web application, available at http://www.bkprofile.com. It connects with other important modules of the system, namely the SOLR search engine and the JOO-framework-based presentation layer.
The objectives of this dissertation are as follows:
− Research Q&A systems around the world and their methods of evaluating user expertise
− Design an evaluation algorithm that scales to large systems and fits the specific environment of BkProfile
− Implement and deploy the program on the live BkProfile system at http://www.bkprofile.com
As a result, the dissertation is organized as follows:
Chapter 1 describes my motivation for the research, the statement of the problem and my approach to tackling the problem.
Chapter 2 includes a quick review of Google AardVark and Google Confucius, as the most closely related research; Markov chains & PageRank, as the foundation of my research; and Hadoop MapReduce, as the platform I use to implement the algorithm.
Chapter 3, the most important chapter of this dissertation, elaborates my method of ranking users in a Q&A system.
Chapter 4 discusses the details of my implementation of the algorithm in Hadoop MapReduce. It includes a technique to automate MapReduce processes.
Chapter 5 summarizes the experimental results of testing ExpertRank in a real system named BkProfile. I also describe the population & sampling I used to test the algorithm.
Chapter 6 discusses the current applications and current problems of ExpertRank, and recommends a few ways to improve it.
Chapter 7 is the conclusion.
The objectives of this dissertation are as follows:
− Study Q&A systems around the world and their methods of evaluating user expertise
− Design an evaluation algorithm that scales to large systems and fits the specific environment of BkProfile
− Implement and deploy the program on the live BkProfile system at http://www.bkprofile.com
Accordingly, the dissertation is organized as follows:
Chapter 1 describes the motivation for the research, states the problem to be studied and presents the approach taken to solve it.
Chapter 2 includes a short analysis of the approaches of other Q&A systems, namely Google AardVark and Google Confucius; a study of Markov chains and the PageRank algorithm as the main foundations of this research; and finally Hadoop MapReduce, the distributed platform I used to implement the algorithm. In this chapter, I also point out the main contribution of this research.
Chapter 3, which is also the most important chapter of the dissertation, presents in detail my method of ranking users in Q&A systems.
Chapter 4 discusses the details of how I implemented the proposed algorithm on the Hadoop MapReduce platform, including a technique to automate the running of Hadoop MapReduce processes.
Chapter 5 summarizes the experimental results of testing ExpertRank in a real system named BkProfile. I also describe the data I used for testing.
Chapter 6 discusses the current applications and the remaining problems of ExpertRank, and presents a few directions that could improve the quality of the algorithm's results.
Chapter 7 summarizes what I have achieved, compared with what was set out at the beginning.
Thus, Chapters 1 and 2 correspond to the problem statement and solution orientation part, Chapters 3 to 6 correspond to the results part, and Chapter 7 corresponds to the conclusion part in the School's dissertation writing guidelines.
This dissertation presents ExpertRank, an iterative algorithm to evaluate the expertise of users in a specific domain of knowledge in question & answer systems, e.g. BkProfile, a knowledge-sharing community. The evaluation helps users prove their ability in their professional profiles in the system, which is a critical motivation for a Q&A system to run smoothly. I base my method on the Markov chain model, with reference to Google's PageRank algorithm, to convert the above problem into a probabilistic model, and then use an iterative method to solve it. The algorithm is designed on the MapReduce programming model so that it can be applied to large-scale systems. I have tested it on an open-source platform named Hadoop MapReduce and deployed it in the BkProfile web application, http://www.bkprofile.com. The results of the algorithm can be encapsulated as a reliable parameter for the evaluation systems of other large-scale sharing web applications.
Keywords: Iterative method, Markov chain, MapReduce, PageRank
SUMMARY OF THE GRADUATION PROJECT
This research introduces ExpertRank, an iterative algorithm that evaluates the quality of answers and the expertise of users in a given domain in large-scale question-and-answer systems, specifically the BkProfile knowledge-sharing community. Evaluating user expertise helps users demonstrate their professional knowledge in their professional profiles in the system, which is an important incentive driving the activity of Q&A systems. I based the work on PageRank, the web page ranking algorithm of the Google search engine, and on the Markov chain model, to convert the above problem into probabilistic models, and from there built an iterative algorithm that evaluates both quantities at the same time. The algorithm is designed on the MapReduce model, so it can be applied to large-scale distributed systems. I tested it on the open-source system named Hadoop MapReduce and deployed it to run stably in the BkProfile web application at http://www.bkprofile.com. The results of the algorithm can also be packaged as a reliable parameter for the evaluation systems of other large-scale knowledge-sharing applications.
Keywords: Iterative method, Markov chain, MapReduce, PageRank
Finally, I want to send my heartfelt thanks to the School of Information & Communication Technology, Hanoi University of Science & Technology, for the knowledge, skills and spirit I have gained there.
Pham Tuan Long
Table of Contents
INTRODUCTION
1.1 Motivations
1.1.1 The explosion of Q&A systems
1.1.2 The need of a good expertise evaluation formula
1.2 Statement of the problem
1.3 My approach
1.4 Dissertation structure
1.5 Summary
LITERATURE REVIEW
1.6 Analysis of Google Confucius's approach
1.7 Analysis of Google AardVark's approach
1.8 Online knowledge market
1.9 Recommendation-based evaluation system
1.10 Markov chain and its convergence condition
1.11 Google PageRank algorithm
1.12 Cloud computing
1.13 MapReduce programming structure
1.14 Hadoop MapReduce
1.15 Summary
METHODOLOGY
1.16 Recommendation channel aggregation
1.17 Random expert seeker model
1.18 ExpertRank basic formula
1.19 Examples of running ExpertRank basic formula
1.20 ExpertRank's convergence
1.20.1 Mapping from Random expert seeker model to Markov chain
1.20.2 Dangling links
1.20.3 ExpertRank traps
1.20.4 Solution
1.21 ExpertRank full formula
1.22 Limitation of the algorithm
1.23 Summary
IMPLEMENTATION
1.24 Flow of MapReduce processes
1.25 MapReduce Implementation on Apache Hadoop
1.26 Data normalization
1.27 ExpertRank transition calculation
1.28 Random visiting probability distribution
1.29 Halt condition
1.30 Final result extraction
1.31 Summary
RESULTS & FINDINGS
1.32 Population and Sampling
1.32.1 Overview of BkProfile
1.32.2 Sampling methods
1.32.3 Expected Results
1.33 Performance & convergence
1.34 Examples
1.35 Summary
DISCUSSION
1.36 Applications
1.36.1 Search engine ranking function
1.36.2 Answer quality evaluation
1.36.3 Potential answerer suggestion
1.36.4 User reliability evaluation
1.37 Limitation of current ExpertRank formula
1.38 ExpertRank's incentives
1.39 Prevention of spammers
1.40 ExpertRank's personalization
1.41 Summary
CONCLUSION
INDEX OF TABLES
Table 1: Advantages of Q&A systems in comparison with other knowledge sharing & searching tools
Table 2: Incentives of current famous Q&A sites
INDEX OF FIGURES
Figure 1: An example of application of Markov chain
Figure 2: PageRank transfer among web pages
Figure 3: Applications of cloud computing
Figure 4: Basic model of MapReduce
Figure 5: Random expert seeker model
Figure 6: ExpertRank transfer among expert network
Figure 7: Step 1 of the example of using ExpertRank basic formula
Figure 8: Step 2 of the example of using ExpertRank basic formula
Figure 9: Mapping from Random expert seeker model to ExpertRank
Figure 10: Dangling links and outer node
Figure 11: ExpertRank trap
Figure 12: A solution for the convergence of ExpertRank
Figure 13: Flow of MapReduce processes
Figure 14: Rough data
Figure 15: Aggregation
Figure 16: Transition table
Figure 17: The convergence of ExpertRank
Figure 18: The search engine of BkProfile
Figure 19: Presentation of answers for a question in BkProfile
Figure 20: Stats in BkProfile
Figure 21: A profile in BkProfile
Figure 22: ExpertRank listed on each topic's page
No  Abbreviation  Meaning
1   Q&A           Question & Answer
2   NLP           Natural Language Processing
3   SoICT         School of Information and Communication Technology
4   HUST          Hanoi University of Science and Technology
1 Outlink: A link appearing in a website that directs users to another website
2 Outer node: A node that does not have any outlink
3 Dangling link: A link that directs to an outer node
4 Search engine: A computer program which finds information on the Internet by looking for words which you have typed in
5 Apache Lucene: A high-performance, full-featured text search engine library
6 SOLR: A popular open-source text search engine based on Lucene technology and written in Java
7 Hadoop: A project that develops open-source software for reliable, scalable, distributed computing
8 MapReduce: A programming model for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes
9 Hadoop MapReduce: An open-source volunteer project under the Apache Software Foundation
10 Cloud computing: A style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers
11 Markov chain: A mathematical system that undergoes transitions from one state to another as a chain, endowed with the Markov property: the next state depends only on the current state and not on the past
12 PageRank: Google's algorithm to rank the importance of web pages on the Internet
13 ExpertRank: Name of my proposed algorithm to rank users' expertise
INTRODUCTION
In this chapter, I first present the explosion of Q&A systems and the reasons why a Q&A system needs an algorithm to evaluate both the expertise of users and the quality of answers, in order to establish the need for this research. Then, I clearly state the problem I would like to solve and discuss my approach to solving it. At the end of this chapter, I describe the other chapters of my dissertation and how they connect to one another.
1.1 Motivations
1.1.1 The explosion of Q&A systems
In recent years, Question and Answer systems, or Q&A systems for short, have developed at rocket speed. In 2010, Google published research stating that 25% of Google's top search results include at least one link to some Q&A site [1]. This is easy to understand, as most big names in IT have entered the so-called online knowledge market, namely Yahoo with Yahoo Answers; Google with Google Answers and Google Confucius; and Facebook with Facebook Q&A. There are also very good Q&A systems from smaller companies, such as Quora, StackOverFlow, OSDir, and so on. People have even built a platform, named Question2Answer, to quickly create new Q&A sites. So far, it has helped build up to 1665 sites in 32 languages [2]. Most recently, in 2010, AardVark, a 2-year-old Q&A system developed by 20 engineers, was acquired by Google for a price of up to $50,000,000.
Now, Q&A sites are still burgeoning thanks to their advantages in comparison with other online knowledge sharing & searching tools, such as search engines, forums and blogs:
No | Advantage of Q&A | Search Engine | Forum | Blog
1 | … | Users need to synthesize articles to get the final answer | Too many junk and off-topic answers, which make up very long threads; answers and discussions are mixed | Answers are often indirect, since articles solve the authors' problems rather than the searchers'
2 | Providing answers for completely new questions | Can't | Can | Can't
3 | Providing answers for heavily local questions | Usually can't | Can | Can
4 | Providing search engine with well … | N/A | Forum is one of hardest kinds of sites … | Blog is also hard to index due to its …
Table 1: Advantages of Q&A systems in comparison with other knowledge sharing & searching tools
1.1.2 The need of a good expertise evaluation formula
Through using various Q&A systems, e.g. OSDir, Google Confucius, Quora, Google AardVark, and so on, as well as through reading analytical papers on successful Q&A systems, I realized that there are three main requirements for a Q&A site to succeed:
− New questions quickly get answers
− Answers are of high quality
− Questions & answers are organized so that they are easy to find
Each website has its own way to meet these three criteria. Some choose automatic methods, such as automatic answering, automatic classifying, automatic organizing and so on, but most of them use incentives to encourage their users to fulfil the three requirements themselves. According to [1], there are two main incentives for people to contribute to a Q&A site: financial rewards and virtual values.
No | Site | Question routing | Incentives | Establishment year
1 | Internet Oracle | to experts | virtual | 1989
2 | Ask.com | N/A | N/A | 1996
3 | WikiAnswers | to public | virtual | 2002
4 | Yahoo! Answers | to public | virtual | 2005
5 | Baidu Zhidao | to public & experts | virtual & $ | 2005
6 | Google Confucius | to public & experts | virtual & $ | 2007
7 | Aardvark | to friends | virtual | 2009
8 | Powerset | N/A | N/A | 2005
9 | Quora | to friends | virtual | 2009
Table 2: Incentives of current famous Q&A sites
Question routing brings a question from the questioner to the people who have the most incentive to answer it. Question routing is critical to ensure that questions are answered quickly and that the answers are of high quality. However, it depends on the incentives that the system provides. In turn, incentives depend on the evaluation system of the Q&A site. If the incentive is financial, the site needs a fair evaluation formula to determine how much it should pay for an answer from a specific expert. If the incentive is virtual value, a fair evaluation system is also critical to make virtual values more reliable and more beneficial. It can be said that evaluation, i.e. evaluation of answers and evaluation of experts, is the main driving force that keeps a Q&A site running smoothly. Together with the method of organizing questions & answers, it decides the success of a Q&A site.
However, working out a good evaluation formula is not an easy task, due to the complicated relationships between the elements of a Q&A site, such as answers, questions, answer providers and so on.
Together with my team, I am creating a Vietnamese Q&A site named BkProfile [3]. The 1000-user system is currently running quite smoothly without a proper evaluation formula. However, we all understand that a good and fair evaluation formula is essential to ensure the stable development of the site.
1.2 Statement of the problem
In this dissertation, I present a method to evaluate both the expertise of users and the quality of answers. The evaluation should have the following characteristics:
(1) Fair & objective. Since evaluation is usually subjective, there is no way to create a perfect evaluation formula that works in all cases. However, a fair & objective formula is a good basis for people's own judgment.
(2) Incentive providing. Not every evaluation formula, even one that is rational in some cases, provides incentives for the development of a Q&A site. However, providing incentives is critical here, since it is the reason for researching the evaluation formula.
(3) Scalable. We plan to push our system to the scale of the whole Vietnamese knowledge market. Thus, a scalable algorithm is compulsory.
1.3 My approach
There are two natural ways to evaluate an answer or a user:
(1) Evaluate the content of answers and the contribution of the users themselves, such as: how relevant the answer is to its question, the level of its writing style, how different the answer is from previous answers, and so on; and how much time the user has contributed to the system, how many answers he has provided, how good his answers were, and so on.
(2) Evaluate the relationships among answers, among users, and between answers & users, such as how many votes a user gave to another, how many answers a user provided to another, how good an answer was compared to other ones, and so on.
Between the two approaches, I chose the latter, because current methods for evaluating the content of a text are not good enough and, in particular, are quite subjective, which violates the first requirement in the statement of the problem. Moreover, to simplify the algorithm, I only evaluate the expertise of users and then use it to evaluate the quality of answers through the interaction between user and answer, i.e. voter/voted and provider/answer.
However, the problem is still complicated, since a number of different channels connect users. I therefore proceed in two steps:
− Aggregate all channels into one single channel, which I call recommendation
− Analyze the links among all users through the aggregated channel only. The analysis technique I use is similar to Google's PageRank algorithm, which uses the Markov chain model, the basis of my method
The method is quite fair and objective, since it considers the whole network rather than only a piece of information such as a specific answer. It provides incentives for the system, since highly ranked users in recommendation-based systems have many privileges, such as prestige, weight of influence and so forth. Finally, it is scalable, as the algorithm can be implemented on a large-scale platform, named Hadoop MapReduce. I implemented a simple version of ExpertRank on this platform, which gave me some encouraging results.
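The aggregation step can be illustrated with a minimal Python sketch. It is not the dissertation's actual formula, which is developed in Chapter 3: the channel names, the values in `CHANNEL_WEIGHTS`, the sample users and the function names are all hypothetical and chosen only for illustration. The sketch merges several interaction channels into one weighted recommendation edge per pair of users, then normalizes each user's outgoing edges so that they can be read as transition probabilities of a Markov chain.

```python
from collections import defaultdict

# Hypothetical channel weights (an assumption, not values from the dissertation).
CHANNEL_WEIGHTS = {"vote": 1.0, "accepted_answer": 2.0}


def aggregate_channels(interactions):
    """Merge (channel, from_user, to_user) records into one weighted
    'recommendation' edge per (from_user, to_user) pair."""
    edges = defaultdict(float)
    for channel, from_user, to_user in interactions:
        edges[(from_user, to_user)] += CHANNEL_WEIGHTS[channel]
    return dict(edges)


def to_transition_probabilities(edges):
    """Normalize each user's outgoing weights so that they sum to 1."""
    totals = defaultdict(float)
    for (from_user, _), weight in edges.items():
        totals[from_user] += weight
    return {pair: weight / totals[pair[0]] for pair, weight in edges.items()}


# Hypothetical sample data: who recommended whom, and through which channel.
interactions = [
    ("vote", "an", "binh"),
    ("accepted_answer", "an", "binh"),
    ("vote", "an", "chi"),
    ("vote", "binh", "chi"),
]
edges = aggregate_channels(interactions)
transitions = to_transition_probabilities(edges)
```

In this sample, the two interactions from `an` to `binh` collapse into one edge of weight 3.0, so `an` passes 75% of its recommendation to `binh` and 25% to `chi`.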
1.4 Dissertation structure
The dissertation is organized as follows:
Chapter 1 describes my motivation for the research, the statement of the problem and my approach to tackling the problem.
Chapter 2 includes a quick review of Google AardVark and Google Confucius, as the most closely related research; Markov chains & PageRank, as the foundation of my research; and Hadoop MapReduce, as the platform I use to implement the algorithm.
Chapter 3, the most important chapter of this dissertation, elaborates my method of ranking users in a Q&A system.
Chapter 4 discusses the details of my implementation of the algorithm in Hadoop MapReduce. It includes a few techniques to automate MapReduce processes.
Chapter 5 summarizes the experimental results of testing ExpertRank in a real system named BkProfile. I also describe the population & sampling I used to test the algorithm.
Chapter 6 discusses the current applications and current problems of ExpertRank, and recommends a few ways to improve it.
Chapter 7 is the conclusion.
1.5 Summary
In this chapter, I have described my motivation for the research, i.e. the explosion of Q&A systems and their need for an algorithm to evaluate the quality of answers and the expertise of users. I also clearly stated the problem and set out my approach to tackling it, i.e. aggregating the relationship channels into a single channel called recommendation and analyzing the recommendation network to rank experts. The next chapter is a review of related work, which explains why I chose this approach.
LITERATURE REVIEW
Despite the rapid development of Q&A systems, careful and direct research on them is quite limited. Among the available work, I chose the research papers on Google Confucius and Google AardVark to analyze, because of their high quality and their similarity to the system my team and I are building. Then I generalize the Q&A system to an online knowledge market tool, to take advantage of the research on other online markets, particularly those which are proven to work well in Vietnam. An analysis of recommendation-based evaluation systems follows, since it is the main idea of my approach. Then, I discuss the Markov chain, which acts as the basis of my research, as well as Google's PageRank as a famous example of using a Markov chain. My method derives a lot of ideas from Google's PageRank. Finally, I give a brief overview of cloud computing, the MapReduce programming model and Hadoop MapReduce, because Hadoop MapReduce is the platform I use to implement my algorithm.
1.6 Analysis of Google Confucius's approach
Google Confucius [1] is a very sophisticated system with 6 main components: search integration, question labeling, question recommendation, answer quality assessment, user ranking and NLP-based answer generation. With these 6 components, it provides a smooth flow in which the users' work is minimized. It also provides financial incentives to motivate experts to work more in the system. The user-ranking system is quite important in helping to determine the money to be paid to each user. Confucius uses the HITS algorithm [4], applied to the questioner/answerer relationship, to rank users. The rationale is that users in Q&A systems are not active enough to provide adequate social interaction, such as votes, improvement suggestions, and so on. However, we still believe that if a Q&A site is constructed as a knowledge-sharing community, people will provide more social interaction, and that this interaction is valuable information to mine.
1.7 Analysis of Google AardVark's approach
Google AardVark calls itself a social search engine. In fact, its way of implementing a Q&A site is quite different from all the others. It does not keep answers as a knowledge store, but considers all queries as new questions and tries to route them to the right people. The right people are those who have good activity records in AardVark and those who are somehow connected to the questioner. It extracts the latter information by intensively analyzing the users' social networks, such as Facebook, Twitter and Yahoo 360. It also has a module that evaluates the quality of answers and the expertise of users through probabilistic models, and that is what I learned from when building the formula for ExpertRank.
1.8 Online knowledge market
Online knowledge market is the common term for all systems that aim at routing knowledge from “providers” to “seekers”. Considering a Q&A system as an online knowledge market helps me look at its essential nature: an online market. From studying online markets, e.g. Amazon and eBay, I realized that three things are very important in driving a market to success:
− The availability of goods, i.e. knowledge. In Q&A systems, availability is created through the size and activeness of the community and through supporting systems such as an information browser, a request router and a search engine
− The evaluation of goods through reviews and information adequacy. In Q&A systems, these are the comments on and votes for answers, and the structure of an answer
− The evaluation of service providers. In Q&A systems, this is mainly the expertise of users. Expertise here is defined as the ability to provide good answers. It is unrelated to the certificates or positions people may have in real life
Finally, besides the three criteria above, from studying a few Vietnamese online markets such as vatgia.com and chodientu.vn, I see that the popularity of a product is a very important criterion for a Vietnamese buyer. This is the foundation for my belief that an evaluation formula based on community reviews will succeed in the Vietnamese online knowledge market.
1.9 Recommendation-based evaluation system
Recommendation-based evaluation is very common in Western culture. People base their assessment of a person on trustworthy recommendations. For instance, admissions officers base their admissions decisions on letters of recommendation from their colleagues; researchers read through the titles of all articles referring to a specific article to roughly evaluate its quality. In a world of knowledgeable people who care much about their prestige, a recommendation system works smoothly, since people are very careful with their recommendations in order to promote and protect their image. The Q&A site which we are building focuses on a knowledgeable community. Therefore, using a recommendation system to evaluate people is reasonable.
The technique used, link analysis [5], or in other words collaborative filtering [6], a technique to predict the characteristics of objects, is not new. It is discussed intensively in an active branch of computer science named Knowledge Discovery in Databases [7]. My contribution is to apply and develop these techniques in Q&A systems, a new context with unique characteristics.
1.10 Markov chain and its convergence condition
A Markov chain is a mathematical system that undergoes transitions from one state to another as a chain [8]. It is a random process endowed with the Markov property: the next state depends only on the current state and not on the past.
In practice, Markov chains have many applications. In Figure 1, the transitions between market conditions are modeled as a Markov chain. Based on this information, one can calculate the probability distribution of the conditions over a specific period of time by iteratively calculating the new distribution after each transition, until the process converges. The questions are whether a Markov chain converges and how many convergent states it has. These are among the main research problems on Markov chains.
According to the theory of Markov chains [8], an irreducible Markov chain has exactly one stationary distribution if and only if all of its states are positive recurrent; if the chain is also aperiodic, the iteration converges to that distribution from any starting distribution. Of these:
− A Markov chain is irreducible if it is possible to get to any state from any state
− A state i is transient if, starting from i, there is a non-zero probability that we will never return to i. A state is recurrent, or in other words persistent, if it is not transient. A recurrent state is positive recurrent if its expected return time is finite
In this dissertation, I map my research problem to a Markov chain in order to design a formula that, through repeated iteration, converges to one and only one state, i.e. a formula that satisfies both of the above conditions.
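The iterative calculation described in this section can be sketched in a few lines of Python. This is an illustration only: the three market states and the transition probabilities in `MARKET` are assumed values for a small irreducible, aperiodic chain and are not read from Figure 1, and the function names are hypothetical.

```python
def step(dist, P):
    """Apply one transition: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, tol=1e-12, max_iter=10_000):
    """Iterate from the uniform distribution until it stops changing."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(dist, P)
        if sum(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    raise RuntimeError("the chain did not converge within max_iter steps")


# Assumed transition matrix; rows are the current state, columns the next
# state, in the order (bull, bear, stagnant). Each row sums to 1.
MARKET = [
    [0.90, 0.075, 0.025],
    [0.15, 0.80, 0.05],
    [0.25, 0.25, 0.50],
]
pi = stationary(MARKET)
```

For this assumed matrix, the iteration settles at approximately (0.625, 0.3125, 0.0625), and applying one more transition to that distribution leaves it unchanged, which is what makes it stationary.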
Figure 1: An example of application of Markov chain
1.11 Google PageRank algorithm
PageRank [9] is a very famous algorithm of Google that determines the importance of a web page from the links it receives from other pages: the more pages link to a web page, and the higher the quality of those pages, the more important the page is. For instance, because tamhonvietnam.net is linked from soict.hut.edu.vn, a trustworthy website, the importance score of tamhonvietnam.net is increased by some value, and hence the website becomes easier to find in the search engine.
If the inclusion of an outlink to a web page is considered a recommendation of that page, PageRank evaluation is exactly a recommendation-based one. Moreover, PageRank targets a large-scale problem, as it attempts to assess the whole World Wide Web. Finally, the PageRank algorithm can be considered a Markov chain under the random web surfer model. Therefore, I base my research on PageRank to support the acceptability of some of my hypotheses.
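A minimal Python sketch of the commonly published form of PageRank may make the random web surfer model concrete. It is a sketch under stated assumptions rather than Google's production algorithm or the ExpertRank formula: the damping factor of 0.85, the even redistribution of rank from pages without outlinks, the four-page graph and the function name `pagerank` are all choices made here for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Iteratively compute PageRank for a {page: [outlinked pages]} graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank along each outlink.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical graph: page "c" is linked from three pages, "d" from none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this graph, `c` receives the highest score because three pages link to it, while `d`, which nobody links to, keeps only the baseline share that the random surfer reaches by jumping to a random page.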
Figure 2: PageRank transfer among web pages
1.12 Cloud computing
Cloud computing is a style of computing where massively scalable IT-related capabilities are provided as a service using Internet technologies to multiple external customers. With the appearance of cloud computing, small companies have more chances to approach large-scale batch processing, which is becoming more and more popular with the development of data mining techniques and the appearance of huge databases like those of Facebook, Amazon, Zing Me and so on. Currently, there are many good cloud computing service providers, e.g. Amazon EC2, NephoScale, Google, Microsoft, RightScale, etc. Using their services, people do not need to buy expensive computers to scale their systems up; instead, they rent services with their desired configurations and use the providers' APIs to access them. Since BkProfile is a small team, using cloud computing services is probably a wise approach to large-scale computation.
Figure 3: Applications of cloud computing
1.13 MapReduce programming structure
MapReduce [10] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. The core idea of the model is the map and reduce functions commonly used in functional programming. With an understanding of the programming model and a framework to run the MapReduce tasks, e.g. Apache Hadoop MapReduce or Google MapReduce, engineers can program a distributed computation quite conveniently, because the hardest tasks, such as job distribution and storage distribution, are hidden in the framework.
Figure 4: Basic model of MapReduce
Figure 4 presents the basic model of a MapReduce task:
− The input of a MapReduce task is a set of (key,value) pairs, which are distributed to mappers. Mappers are objects which can be configured to run on the same machine or on different machines
− The output of a map task is also a set of (key,value) pairs, but they usually have a different format from the input ones
− The pairs are delivered to shuffle-and-sort modules, which sort and distribute them to the desired reducers. (key,value) pairs with the same key are delivered to the same reducer
− In each reducer, (key,value) pairs with the same key are reduced into aggregated data, which are also in (key,value) format. They are the output of the MapReduce task
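The flow listed above can be imitated in memory with a short Python sketch. It runs on a single machine and is not Hadoop code: the vote records, the user names and all function names are hypothetical, and the sketch only mirrors the map, shuffle-and-sort and reduce stages, here counting how many votes each user received.

```python
from collections import defaultdict


def map_vote(voter, voted_user):
    """Map: turn one (voter, voted_user) record into a (voted_user, 1) pair."""
    yield voted_user, 1


def shuffle_and_sort(pairs):
    """Group values by key so that each key reaches exactly one reducer call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_votes(user, counts):
    """Reduce: aggregate all values of one key into a single (key, value) pair."""
    yield user, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run the three stages in order on a list of (key, value) records."""
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    grouped = shuffle_and_sort(mapped)
    return [pair for key, values in grouped for pair in reducer(key, values)]


# Hypothetical input: each record says that the first user voted for the second.
votes = [("an", "binh"), ("chi", "binh"), ("binh", "chi")]
result = run_mapreduce(votes, map_vote, reduce_votes)
```

Here `result` is `[("binh", 2), ("chi", 1)]`: both votes for `binh` carry the same key, so they reach the same reducer call and are summed there.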
Note that because both the input and the output of a MapReduce task are in (key,value) format, engineers can design complicated algorithms which involve several MapReduce tasks. In the implementation of my proposed algorithm, I used three different types of MapReduce tasks and ran them iteratively until a predefined condition was satisfied.
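Because each task's output has the same (key,value) shape as its input, tasks can be chained in a loop. The Python sketch below illustrates that idea on one machine; it is not the three task types used in this dissertation. The record layout `(node, (score, outlinks))`, the halt threshold `tol`, the three-node graph and all function names are assumptions made for the example, and it omits the handling of dangling links and traps discussed in Chapter 3.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """One in-memory MapReduce pass: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [pair for key in sorted(groups) for pair in reducer(key, groups[key])]


def propagate_map(node, state):
    """Emit the node's link list plus an equal share of its score per outlink."""
    score, outlinks = state
    yield node, ("links", outlinks)
    for target in outlinks:
        yield target, ("share", score / len(outlinks))


def propagate_reduce(node, values):
    """Rebuild the (score, outlinks) record from the received shares."""
    outlinks, score = [], 0.0
    for kind, payload in values:
        if kind == "links":
            outlinks = payload
        else:
            score += payload
    yield node, (score, outlinks)


def iterate_until_stable(records, tol=1e-10, max_rounds=1000):
    """Feed each job's output into the next job until the scores stop changing."""
    for _ in range(max_rounds):
        new_records = run_job(records, propagate_map, propagate_reduce)
        old_scores = {node: score for node, (score, _) in records}
        change = sum(abs(score - old_scores.get(node, 0.0))
                     for node, (score, _) in new_records)
        records = new_records
        if change < tol:  # the predefined halt condition (assumed form)
            break
    return records


# Hypothetical three-node graph with uniform starting scores.
graph = [("a", (1 / 3, ["b", "c"])), ("b", (1 / 3, ["c"])), ("c", (1 / 3, ["a"]))]
final = iterate_until_stable(graph)
scores = {node: score for node, (score, _) in final}
```

For this graph the scores settle near 0.4 for `a`, 0.2 for `b` and 0.4 for `c`. In this sketch the halt check runs in the driver loop; on a real cluster it would itself need a distributed step or job counters.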
Finally, the MapReduce programming model can be more complicated, but that is not the focus of this dissertation.
1.14 Hadoop MapReduce
Hadoop MapReduce [11] is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It is an open-source volunteer project under the Apache Software Foundation. Hadoop MapReduce goes along with HDFS, HBase, etc., providing an entire solution for distributed computation. A few of its famous users are Facebook, Amazon, Cloudera, etc. Moreover, Hadoop MapReduce is well documented, and many cloud computing services support it. That is the reason why I chose Hadoop MapReduce to implement my MapReduce-based algorithm.
1.15 Summary
In this chapter, I have described the work related to my dissertation, including Q&A sites and the online knowledge market, from which I get ideas; recommendation-based assessment, which is my approach; Markov chains and PageRank, which are the foundation of my research; and cloud computing, MapReduce and Apache Hadoop, which are the framework of my implementation.
Regarding my contribution, I took two ideas for my research from the PageRank algorithm, i.e. the detection of the problems of the ExpertRank basic formula and their solutions, because PageRank is also based on Markov chains and has a lot in common with ExpertRank. Moreover, I use a few results of PageRank, particularly the convergence speed, to make my hypotheses more solid. However, I simplified the problem by aggregating complicated relationship channels into a single one, and