Multimedia Question Answering
Liqiang Nie
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2013
© 2013 Liqiang Nie
All Rights Reserved
Publications
1. Liqiang Nie, Yi-Liang Zhao, Xiangyu Wang, Jialie Shen, Tat-Seng Chua. Learning to Recommend Descriptive Tags for Questions in Social Forums. ACM Transactions on Information Systems, 2013. Full journal paper.

2. Yi-Liang Zhao, Liqiang Nie, Xiangyu Wang, Tat-Seng Chua. Personalized Recommendations of Locally Interesting Venues to Tourists via Cross-Region Community Matching. ACM Transactions on Intelligent Systems and Technology, 2013. Full journal paper.

3. Liqiang Nie, Meng Wang, Gao Yue, Zheng-Jun Zha, Tat-Seng Chua. Beyond Text QA: Multimedia Answer Generation by Harvesting Web Information. IEEE Transactions on Multimedia, 2013. Full journal paper.

4. Liqiang Nie, Shuicheng Yan, Meng Wang, Richang Hong, Tat-Seng Chua. Harvesting Visual Concepts for Image Search with Complex Queries. In Proceedings of the ACM International Conference on Multimedia, 2012. Full conference paper, oral.

5. Liqiang Nie, Meng Wang, Zheng-Jun Zha, Tat-Seng Chua. Oracle in Image Search: A Content-Based Approach to Performance Prediction. ACM Transactions on Information Systems, 2012. Full journal paper.

6. Richang Hong, Meng Wang, Guangda Li, Liqiang Nie, Tat-Seng Chua. Multimedia Question Answering. IEEE Multimedia, 2012. Full magazine paper.

7. Weinan Zhang, Zhaoyan Ming, Yu Zhang, Liqiang Nie, Ting Liu and Tat-Seng Chua. The Use of Dependency Relation Graph to Enhance the Term Weighting in Question Retrieval. In Proceedings of the International Conference on Computational Linguistics, 2012. Full conference paper, oral.

8. Yan Chen, Zhoujun Li, Liqiang Nie, Xia Hu, Xiangyu Wang and Tat-Seng Chua. A Semi-Supervised Bayesian Network Model for Microblog Topic Classification. In Proceedings of the International Conference on Computational Linguistics, 2012. Full conference paper, oral.

9. Liqiang Nie, Meng Wang, Zheng-Jun Zha, Guangda Li, Tat-Seng Chua. Multimedia Answering: Enriching Text QA with Media Information. In Proceedings of the International ACM SIGIR Conference, 2011. Full conference paper, oral.

10. Xiangyu Chen, Jin Yuan, Liqiang Nie, Zhengjun Zha, Tat-Seng Chua. NUS-LMS Known-item Search. In TRECVID, 2010. Full research paper.

11. Richang Hong, Guangda Li, Liqiang Nie, Jinhui Tang, Tat-Seng Chua. Exploring Large-Scale Data for Multimedia QA: An Initial Study. In Proceedings of the International ACM Conference on Image and Video Retrieval, 2010. Full conference paper, oral.
Acknowledgements
This dissertation would not have been completed, or at least would not be what it looks like now, without the support, direction and help of many people. I am honored to take this opportunity to thank them.

My first and foremost thanks undoubtedly go to my supervisor, Prof. Tat-Seng Chua, a respectable, responsible and resourceful professor, who took me into his research group in mid-2009. Since then, whenever I have had questions, his door has always been open for discussions. His creative ideas and unique angle of research observation have consistently inspired me to devote my efforts to the area of media search. Prof. Chua always sets high standards for our research, insists on impactful work targeting premier forums, and advocates the value of building commercializable systems. Besides research, he leads us on long-distance jogging almost once per week, which greatly strengthens our bodies.

Second, I would like to express my heartfelt gratitude to Prof. Meng Wang and Richang Hong, who have influenced me in many ways and deserve my special appreciation. During the first two years of my Ph.D. pursuit, they have always provided insightful suggestions and discerning comments on my research work and paper drafts. Their heuristic guidance in our discussions has made me think and work very independently.
I sincerely extend my thanks to my doctoral committee (Prof. Mohan S. Kankanhalli, Chew Lim Tan and Anthony K. H. Tung). Their constructive feedback and comments at various stages have been significantly helpful in shaping the thesis up to completion. I would also like to thank the external examiner, Prof. Winston H. Hsu, for his critical readings and constructive criticisms, which have made the thesis more complete.
Contents

Chapter 1 Introduction 1
1.1 Background 1
1.2 Motivation 3
1.3 Challenges 5
1.4 Strategies 8
1.4.1 Question Analysis 10
1.4.2 Answer Medium Determination 12
1.4.3 Web Media Answer Selection 13
1.5 Contributions 14
1.6 Outline of the Thesis 15
Chapter 2 Literature Review 17
2.1 Automatic Textual Question Answering 17
2.2 Community-based Question Answering 22
2.3 Multimedia Question Answering 28
2.4 Summary 34
Chapter 3 Question Understanding 37
3.1 Introduction 37
3.2 Related Work 41
3.2.1 Annotation of Media Entities 41
3.2.2 Annotation of Textual Entities 42
3.3 Question Annotation Scheme 43
3.4 Question Space Inference 45
3.4.1 Probabilistic Hypergraph Construction 45
3.4.2 Adaptive Probabilistic Hypergraph Learning 47
3.4.3 Discussions 50
3.5 Relevant Tag Selection 52
3.5.1 Tag Relevance Estimation 52
3.5.2 Complexity Analysis 54
3.6 Query Generation for Multimedia Search 54
3.7 Experiments 56
3.7.1 Experimental Settings 56
3.7.2 First-order Analytics on Our Dataset 57
3.7.3 On Learning Performance Comparison 60
3.7.4 On Relevant Tag Selection 63
3.7.5 On the Sensitivity of Parameters 64
3.7.6 Ontology Generation 65
3.7.6.1 Application Scenario 65
3.7.6.2 Experiments 67
3.7.7 Evaluation of Query Generation 69
3.8 Summary 70
Chapter 4 Answer Medium Determination 71
4.1 Introduction 71
4.2 Related Work 74
4.3 Answer Medium Selection 76
4.3.1 Question-Based Classification 77
4.3.2 Answer-Based Classification 78
4.4 Answer Availability Prediction 80
4.4.1 Probabilistic Analysis of AP and NDCG 80
4.4.2 Query-Adaptive Graph-Based Learning 82
4.4.2.1 Ranking-Based Relevance Analysis 82
4.4.2.2 Query-Adaptive Graph-Based Learning 83
4.4.2.3 Discussion 87
4.5 Experiments 88
4.5.1 Experimental Settings 88
4.5.2 On Answer Medium Selection 90
4.5.3 On Query Classification 92
4.5.4 On Media Search Performance Prediction 94
4.5.5 Discussion 97
4.6 Applications 98
4.6.1 Image Metasearch 98
4.6.1.1 Application Scenario 98
4.6.1.2 Experiments 99
4.6.2 Multilingual Image Search 103
4.6.2.1 Application Scenario 103
4.6.2.2 Experiments 104
4.6.3 Boolean Image Search 105
4.6.3.1 Application Scenario 105
4.6.3.2 Experiments 106
4.7 Summary 110
Chapter 5 Multimedia Answer Selection 113
5.1 Introduction 113
5.2 Related Work 117
5.2.1 Complex Queries in Text Search 117
5.2.2 Complex Queries in Media Search 118
5.3 Relevant Media Answer Selection Scheme 119
5.4 Visual Concept Detection 120
5.5 Heterogeneous Network 121
5.5.1 Semantic Relatedness Estimation 122
5.5.2 Visual Relatedness Estimation 124
5.5.3 Cross-Modality Relatedness Estimation 125
5.5.3.1 KDE Approach 126
5.5.3.2 NRCC Approach 126
5.5.4 Discussions 127
5.6 Experiments 128
5.6.1 Experimental Settings 128
5.6.2 On Visual Concept Detection 130
5.6.3 On Query Performance Comparison 131
5.6.4 On Media Answer Selection 132
5.6.5 On the Sensitivity of Parameters 134
5.7 Applications 136
5.7.1 Photo-based Question Answering 136
5.7.2 Textual News Visualization 138
5.8 System Evaluation 140
5.8.1 Data Presentation 140
5.8.2 On Informativeness of Enriched Media Data 140
5.8.3 Subjective Test of Multimedia Answering 143
5.8.4 On the Absence of Textual Answer 144
5.9 Summary 147
Chapter 6 Conclusions and Future Research 149
6.1 Conclusions 149
6.2 Future Directions 151
Abstract
Along with the proliferation and improvement of the underlying search and communication technologies, we have seen a flourishing of automated and community-based question answering services, which have emerged as an effective paradigm for oceanic information seeking, diverse knowledge dissemination, and outstanding expert routing. However, existing QA services provide only textual answers, which are not sufficiently intuitive or informative for many questions. On the other hand, the available multimedia data on the Internet has increased exponentially and is likely to keep increasing. Naturally, it is time to shift from traditional textual QA to multimedia QA. It is worth mentioning that multimedia QA has been attracting increasing attention. Most of the current literature, however, mainly focuses on specific domains and utilizes a question-independent monolithic media type, such as pure video or image, to answer all questions. Therefore, the main aim of this thesis is to design and develop a systematic question-aware multimedia question answering scheme that is able to answer general questions in broad domains with appropriate medium types. Specifically, it covers studies on question analytics, answer medium determination and multimedia answer selection.

This thesis first performs higher-order question analysis to enhance question understanding by exploring the intelligence of grassroots Internet users. The proposed adaptive hypergraph learning approach recommends socially contributed tags from semantically similar neighbours to the given question. These tags well summarize the given question from multifaceted aspects, and are utilized to facilitate query expansion and knowledge ontology generation. Based on the generated ontology, the specific domain of the question can be determined. Hence the textual answer for the given question can be vertically and precisely located by incorporating the domain cues. In this way, multimedia question answering is naturally converted into the problem of enriching textual QA with media information, which is more feasible and able to deal with complex questions.

The thesis then learns a four-category classifier to preliminarily determine the desired answer medium types. The selected medium candidates are able to vividly convey the answer content, which greatly enhances the users' multimedia experience. However, there exists a gap between the predicted answer medium type and the availability of actual answers due to the limited web resources. To bridge this gap, a query-dependent graph-based model that predicts the query performance is proposed to measure the web answer availability. This guarantees that relevant answer data is sufficiently rich in the finally selected answer medium type.

Finally, this thesis develops a heterogeneous probabilistic network to select relevant multimedia answers based on the complex queries generated from questions. The network seamlessly integrates three layers of relationships, i.e., the semantic level between concept and complex query, the cross-modality level between visual content and concept, as well as the visual level between visual pairs. The three layers mutually reinforce each other to facilitate the estimation of relevance scores for new reranking list generation. It is found that this approach is more robust to an unreliable initial ranking list and able to characterize the complex queries well. After duplicate removal, media entities in top-ranked positions are selected as answers to enrich the original textual QA.

Through extensive experiments conducted on large-scale real-world datasets, the results have demonstrated that our study could yield significant gains in empowering users with concise and intuitive multimedia experiences. As byproducts of this research, the proposed approaches have addressed question annotation, media search performance prediction, as well as web-scale media data reranking with complex queries.
List of Figures
1.1 Examples of textual QA pairs from several dominant cQA forums 3
1.2 Architecture of our proposed question-aware MMQA system 9
1.3 Comparison illustration between the conventional MMQA and our proposed scheme 11
2.1 Pipeline of traditional automatic textual QA 21
2.2 A conceptual framework for answer generation in traditional cQA 23
2.3 Differences among information retrieval, textual QA, and MMQA 28
2.4 System architecture of VideoQA 29
2.5 The framework for QA over community-contributed Web videos 30
2.6 Three-layer system architecture for photo-based QA 32
2.7 The framework of answering multimodal question by naming visual instance 33
2.8 The retrieval pipelines comparison between the aQA and MMQA 34
3.1 Illustration of question tagging selected from Stack Overflow 38
3.2 The proposed automatic question annotation scheme for social QA services 44
3.3 Process of hyperedges construction 47
3.4 The tag frequency distribution with respect to the number of distinct tags over our large-scale dataset 50
3.5 The illustrative instance of semantically similar questions sharing same tags 51
3.6 The distribution of the number of users with respect to the number of social connections 59
3.7 The distribution of the number of users with respect to the number of posts 60
3.8 Performance comparison of different reranking-based question space inference approaches 62
3.9 The performance of question space inference with various regularization parameters 64
3.10 The performance of question space inference with various weighting parameters 66
3.11 The question distribution over our selected topic taxonomy 67
4.1 The schematic illustration of the proposed image search performance prediction approach 74
4.2 Initial relevance probability estimation 84
4.3 The query-adaptive graph-based learning illustration 86
4.4 Distribution of the predicted results and the real values 95
4.5 Image metasearch performance comparison of different methods 100
4.6 The comparison of image metasearch with varied metric for image search performance prediction 101
4.7 Comparison of the top search results obtained by different metasearch methods 102
4.8 Multilingual image search performance comparison of different methods 104
4.9 The comparison of multilingual image search with varied metric for image search performance prediction 106
4.10 Comparison of the top results obtained by different multilingual image search approaches 107
4.11 Boolean image search performance comparison of different methods 108
4.12 The comparison of Boolean image search with varied metric for image search performance prediction 109
4.13 Comparison of the top results obtained by different Boolean search methods 110
5.1 Image retrieval results comparison between simple and complex queries 115
5.2 Illustration of the proposed web image reranking scheme for complex queries 116
5.3 An illustration of visual concepts detection from a given complex query 122
5.4 Retrieval performance comparison between complex queries and their belonging primitive visual concepts 131
5.5 Performance comparison of different reranking approaches in terms of NDCGs 132
5.6 Illustrative results for complex image query search based on different approaches 134
5.7 Parameter learning for t 135
5.8 Parameter learning for K 136
5.9 Results of multimedia answering for 3 example queries 141
5.10 The distribution of informativeness 142
5.11 Comparison of overall average informativeness scores between with and without textual answers 145
5.12 The average performance comparison of Photo-based QA System 146
5.13 Performance comparison among different methods for textual news visualization 147
List of Tables
1.1 The distribution of the expected answer medium types 4
2.1 A list of the representative research work in automatic textual QA 20
2.2 A list of the representative work in community-based QA 27
2.3 The summarization of the previous MMQA work 35
3.1 The distribution of the number of tags annotated for questions 39
3.2 Meta information of our data collection 54
3.3 The evaluation results of relevant tag selection 64
3.4 The ground truth distribution for query selection 68
3.5 The classification accuracies for query selection with different features 69
4.1 Illustration of the representative interrogative words 78
4.2 Illustration of the representative class-specific related words 79
4.3 The inter-rater reliability analysis for answer medium selection based on the whole testing dataset 89
4.4 The distribution of the expected answer medium types 91
4.5 The accuracy comparison of question-based classification with different features 91
4.6 The accuracy comparison of answer-based classification with different features 91
4.7 The classification accuracy via linear fusion 92
4.8 The representative questions for each answer medium class 93
4.9 The distribution of the person-related and non-person-related classes 94
4.10 The confusion matrix of classification results 94
4.11 The linear correlation comparison among different methods with different performance measures 97
4.12 The better-worse prediction accuracy comparison among three different methods with different performance measures 97
5.1 Representative complex queries illustration 129
5.2 The distribution of visual words over five predefined high-level categories 130
5.3 The confusion matrix of visual concept detection results 130
5.4 Visual concepts distribution in question generated queries 137
5.5 Distribution of the number of pictures involved in news 138
5.6 Statistic analytics for textual QA enrichment with media information 142
5.7 Statistics of the comparison of our multimedia answer and the original textual answer 142
5.8 The classification accuracy of answer medium selection comparison between with and without textual answers 144
5.9 Statistics of the comparison of our generated multimedia answers without the assistance of textual answers and with the original textual answers 145
Chapter 1 Introduction
With the rapid growth of the Internet, information searching has become an indispensable activity in people's daily life. Document retrieval is currently the most widespread form of web searching. Users type in queries in the form of unstructured sets of keywords, and the search engines retrieve ordered lists of pointers to web pages based on the estimated relevance. However, users are often bewildered by the vast quantity of information returned by the search engines, and they often have to painstakingly browse through large ranked lists of results to locate the correct answers. Hence another retrieval paradigm, the so-called question answering (QA), has evolved in an attempt to tackle this information-overload problem.
QA is a smooth shift away from classical document search towards information retrieval. It aims to find a concise and accurate answer to a natural language question instead of returning a ranked list of documents, utilizing advanced linguistic analysis, domain knowledge, data mining and natural language processing techniques. Compared to keyword-based search systems, it greatly facilitates the communication between humans and computers by naturally stating users' intention in plain sentences. It also avoids the painstaking browsing of the vast quantity of information content returned by search engines for the correct answers. Based on the answer characteristics, QA can roughly be split into three key topics: automatic textual QA, community QA and multimedia QA.
Automatic textual QA (aQA) research has been carried out for the past 20 years with good success, especially for answering factoid questions. Specifically, researchers have usually put their attention on Open-Domain questions [112, 34, 132] and Restricted-Domain questions [91]. Despite its great progress and the encouraging results reported, traditional aQA still has challenges that are not easy to overcome, such as the deep understanding of complex questions and the need to perform sophisticated syntactic, semantic and contextual processing to generate meaningful answers. It is found that, in most cases, automatic approaches cannot obtain results that are as good as those generated by manual processing [2, 160].
Along with the proliferation and improvement of the underlying communication technologies, community question answering (cQA) services are becoming immensely popular for knowledge dissemination and information locating, owing to the following reasons. First, information seekers are able to post their specific questions on any topic and pick up answers provided by other participants. By leveraging community efforts, they are able to get better information than simply using search engines to find answers. Second, in comparison to aQA systems, cQA usually receives answers of better quality, as they are generated based on human intelligence. Third, over time, a tremendous number of QA pairs have been accumulated in cQA repositories, and they facilitate the preservation and retrieval of answered questions. For example, WikiAnswers1, one of the most well-known cQA systems, hosts more than 13 million answered questions distributed in 7,000 categories (as of August 2011).

1 http://wiki.answers.com/
Although a considerable number of questions have been well answered in text via leveraging the intelligence of grassroots Internet users [160, 91, 112, 34, 132, 2], existing cQA forums only provide users with textual answers, as shown in Figure 1.1. Unfortunately, textual answers may not provide sufficiently natural and easy-to-grasp information. Figure 1.1 (a) and (b) illustrate two examples. For the questions "What are the steps to make a weather vane" and "What does $1 Trillion Look Like", the answers are described by long sentences. Clearly, it would be much better if there were some accompanying videos and images that visually demonstrate the processes or the objects. In fact, users often post URLs that link to supplementary images or videos in their textual answers. For example, for the questions in Figure 1.1 (c) and (d), the best answers on Yahoo! Answers (Y!A)2 both contain video URLs. This further confirms that multimedia contents are useful in answering questions.

2 http://answers.yahoo.com/
To investigate this problem more deeply, we randomly selected 5,000 questions from Y!A and invited five volunteers to label their preferred answer medium. To be specific, for each question, we asked each labeler to categorize it into one of the following four classes (text, text+image, text+video, text+image+video), and voting was performed to obtain the final ground truth3. For the cases in which two classes received the same number of votes, a discussion was carried out among the labelers to decide the final ground truth. Table 1.1 illustrates the distribution of the four classes. We can see that more than 50% of the questions can best be answered by adding multimedia contents instead of just using pure text. This also demonstrates that our multimedia answering approach is highly desired.
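For illustration only, the vote aggregation described above can be summarized by the following minimal sketch. The class names come from the labeling scheme, while the tie handling is merely flagged here, since in practice ties were resolved by discussion among the labelers rather than by code.

```python
from collections import Counter

def aggregate_votes(labels):
    """Aggregate the five volunteers' labels for one question.

    Returns the majority class, or None when the top classes tie,
    in which case the labelers discuss to reach a final decision.
    """
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: resolved by discussion among the labelers
    return ranked[0][0]

# Example: four votes for "text+image", one for "text"
print(aggregate_votes(["text+image", "text", "text+image",
                       "text+image", "text+image"]))  # -> text+image
```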
On the other hand, with the blooming of Web 2.0, an increasing growth of multimedia contents on the Web has been witnessed in recent years. For example, YouTube4 serves 100 million distinct videos and 65,000 uploads daily, and the traffic of this site accounts for more than 20% of all Web traffic and 10% of the whole Internet, comprising 60% of the videos watched online. The photo-sharing site Flickr5 contained more than 4 billion images as of October 2009, while over 3 billion photos are being uploaded every month onto Facebook6. Meanwhile, a corresponding rapid development of media search has been observed. In fact, these provide rich media sources to transform textual QA into multimedia QA (MMQA), where MMQA is defined as providing answers with intuitive and accurate media contents, rather than pure text.

3 Here we only considered the informational questions and overlooked 1,667 conversational questions among the 5,000 questions [69, 44].
4 http://www.youtube.com
5 http://www.flickr.com/
6 http://www.facebook.com
It is worth mentioning that several efforts have been dedicated to research on automatically answering questions with multimedia data, i.e., the so-called MMQA. For example, Yang et al. [146] proposed to extend text-based QA technology to support factoid QA in news video. A photo-based QA system for finding information about physical objects was presented in [148]. Li et al. [70] explored how to leverage YouTube video collections as a source to automatically find videos that describe cooking techniques. But these approaches usually work on certain narrow domains and can hardly be generalized to handle general questions in broad domains. This is due to the following reasons:
• Question understanding gap between question and query. Conventional systems aim to seek multimedia answers from online corpuses directly, and the query used to search is formulated simply depending on the given question. However, the intentions expressed behind the original question and the generated query may not be identical because of the deep question comprehension gap. This gap is hard to bridge owing to the following facts. First, short questions convey limited information [28]. Second, it is very hard to generate key phrases that capture the question topics well while not being subjective, ambiguous or generic. Third, some key phrases are exactly what the questions ask for, and do not explicitly appear in the question bodies. One typical example is "Who is the current CEO of Facebook".
• Prediction gap between the selected answer medium and the availability of actual answers. Almost all conventional MMQA systems utilize a question-independent monolithic media type, such as pure video [70] or pure image [148], to answer all questions, and overlook the intrinsic characteristics of question contents. In fact, the QA-aware answer medium type has an immediate effect on users' multimedia experience. For some questions, such as "what day is President Obama's birthday", a pure textual answer is sufficient. But we need to add image or video information for some other questions. For example, for the question "who is Obama", it is better to add images to complement the textual answer, whereas we should add videos to answer the question "how to cook beef". Even after selecting an appropriate answer medium candidate, the relevant resources in the selected medium type may still be limited on the web or hard to collect. Under these circumstances, we may need to turn to other medium types. However, few studies have been conducted to measure the media answer availability. This thesis views the media answer availability problem as a media search performance prediction task. Actually, many query performance prediction methods have been developed for text-domain search, and they are typically split into two categories: one directly analyzes the queries and the data collection, while the other analyzes the search results. They are usually called pre-search and post-search approaches, respectively. Several pre-search methods in the text domain can be applied to the multimedia search problem; however, in comparison with post-search approaches, pre-search methods usually cannot achieve satisfactory performance, as there is no information about the search results. Clearly, the post-search methods cannot be directly applied to multimedia search, since the text query and the media search results belong to different media modalities. One way is to replace each multimedia entity, such as an image or a video, with its textual description, which may contain title, attribute text and surrounding text information. This is actually a commonly-adopted approach for indexing multimedia entities in commercial search engines. However, there is a large gap between the text description and the content of a multimedia entity, and thus the conventional post-search methods also do not work well on multimedia search. Consequently, there is a lack of research on performance prediction for multimedia search based on textual queries.
• Widened semantic gap between complex queries and multimedia data. The queries generated from the textual QA pairs are usually very verbose and complex. They are not well supported by the current commercial media search engines, which is due to the following reasons. First, as compared to simple queries, long ones frequently consist of more concepts, which further widens the semantic gap between the textual queries and the visual contents. Second, a complex query usually depicts the intrinsic semantic relationships among its constituent visual concepts7. This kind of media entity has loosely coupled relationships with the surrounding textual descriptions, resulting in poor text-based search performance. Third, while there are abundant positive samples and query logs for simple queries, positive samples are rare for complex queries. This makes learning-based models less effective. Typically, media entity reranking techniques are the natural way to improve the search results and select the most relevant information. The existing approaches generally fall into two categories. One is pseudo relevance feedback (PRF) based [99, 143, 86]; a minimal sketch of this idea is given after this list. Such methods treat a significant fraction of the top media entities as pseudo-positive examples and collect some bottom entities as pseudo-negative examples. They then either learn a classifier or cluster the entities to perform reranking. However, for complex queries, relevant samples are usually rare or not ranked at the top of the result list. This severely limits the ability to select pseudo-positive and pseudo-negative training samples. The other category is graph-based [131, 53, 128], which propagates the initial ranking information over the whole graph until convergence. However, for complex queries, many irrelevant entities are frequently distributed in high-ranked positions initially. These irrelevant samples can hardly be pushed down by the graph-based methods, since they often have low similarities with other irrelevant entities in the lower-ranked positions [95]. Consequently, new approaches towards media entity reranking for complex queries are highly desired.

7 A visual concept is defined as a noun phrase depicting a concrete entity with a visual form.
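To make the PRF family concrete, the sketch below illustrates the general idea discussed above: take the top-ranked entities as pseudo-positives, some bottom-ranked ones as pseudo-negatives, and rescore the list with a classifier. The feature matrix, parameter values and choice of logistic regression are illustrative assumptions, not the method developed in this thesis; the point is that, for complex queries, the pseudo-positive set is exactly where the assumption breaks down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prf_rerank(features, top_k=20, bottom_k=100):
    """Pseudo-relevance-feedback reranking over an initial ranked list.

    features: (n, d) array of visual features ordered by the initial
    text-based ranking; assumes n >= top_k + bottom_k. Top entries are
    taken as pseudo-positive and bottom entries as pseudo-negative
    training samples.
    """
    n = len(features)
    pos = features[:top_k]            # assumed relevant
    neg = features[n - bottom_k:]     # assumed irrelevant
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(top_k), np.zeros(bottom_k)])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = clf.predict_proba(features)[:, 1]
    return np.argsort(-scores)        # indices of the new ranking

# Toy demonstration on random features; for complex queries the
# pseudo-positive set is noisy, so the learned model degrades.
demo = np.random.default_rng(0).normal(size=(500, 32))
print(prf_rerank(demo)[:5])
```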
To bridge the aforementioned research gaps, the main aim of this study is to design and develop a systematic question-aware MMQA system. As illustrated in Figure 1.2, this system comprises three components, spanning question analytics, answer medium determination and media answer selection. Specifically, given a question in natural language, our scheme first vertically seeks its textual answers from large QA repositories and generates an informative keyword-based query by leveraging the question annotation. It then performs a QA classification to preliminarily select an answer medium candidate, and next re-considers the answer medium with a media answer availability measurement. Finally, with the generated queries and the determined answer medium, it identifies relevant media answers with the help of the visual concepts embedded in the query body.
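The end-to-end flow just described can be summarized, at a purely schematic level, by the sketch below. Every function here is a trivial stand-in invented for exposition so that the sketch runs; the real components are the subject of Chapters 3, 4 and 5.

```python
from typing import List, Tuple

# Trivial placeholders standing in for the real components.
def annotate_question(question: str) -> List[str]:
    return question.lower().rstrip("?").split()[-2:]       # placeholder "tags"

def vertical_search_cqa(question: str) -> str:
    return "best textual answer retrieved from the cQA archive"

def classify_answer_medium(question: str, text_answer: str) -> str:
    return "text+image"                                     # one of the four classes

def select_media_answers(query: str, medium: str) -> List[str]:
    return ["image_1.jpg", "image_2.jpg"] if "image" in medium else []

def multimedia_answer(question: str) -> Tuple[str, str, List[str]]:
    """Schematic wiring of the question-aware MMQA pipeline (cf. Figure 1.2)."""
    tags = annotate_question(question)                      # question analysis
    text_answer = vertical_search_cqa(question)             # vertical search in cQA
    query = " ".join([question.rstrip("?")] + tags)         # tag-based query expansion
    medium = classify_answer_medium(question, text_answer)  # answer medium determination
    media = select_media_answers(query, medium)             # media answer selection
    return text_answer, medium, media

print(multimedia_answer("Who is Obama?"))
```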
Our proposed scheme thoroughly breaks the traditional MMQA dilemma of only being able to answer certain narrow-domain questions, since it does not aim to seek multimedia answers directly from online corpuses.
Trang 31Question
Question annotation
Question with tags
Tag-based taxonomy
Question category
Vertical search in cQA
QA pair
Informative Query
Query Expansion
Query formulation
QA classification
Answer medium candidate
Media answer availability measurement
Answer medium determination
Media data collection
Media data reranking
Relevant answer
Visual concept
With the first gap bridged in this way, we can focus on solving the second gap. Therefore, our scheme can also be viewed as an approach that accomplishes the MMQA task by jointly exploring the knowledge of human and computer. Besides, our approach facilitates question understanding by leveraging socially labeled tags and enhances answer selection by employing visual concepts. These make our scheme able to deal with more general questions and achieve better performance.
1.4.1 Question Analysis
Effective query formulation is the prerequisite for multimedia answer generation, since most of the current dominant media search engines do not support queries in natural language well. However, formulating informative queries from questions is non-trivial due to the question understanding problem inherent in some question characteristics, such as short question bodies that convey little information and the replacement of key words with interrogative words. In addition, the extracted or generated key phrases should depict at least the key facet or subtopic of the given natural language question, and should be neither over-general nor over-specific. On the other hand, question tags, the socially contributed subtopics of each question, can be leveraged to better formulate queries via query expansion. Figure 3.1 illustrates a representative question with tags. Also, the concept hierarchy generated based on these tags can facilitate the determination of the key concept/domain that the question likely falls in, which greatly improves the performance of vertical textual answer retrieval within the QA knowledge ontology.
Figure 1.3: The differences between the conventional MMQA approaches and our scheme: (a) conventional MMQA aims to seek multimedia answers directly from online corpuses; and (b) our proposed scheme first retrieves the textual answers from large cQA corpuses for the given question, and then enriches these textual answers with image and video information.
In this way, we seamlessly integrate the textual answers to compensate for the limited question information, and hence MMQA is converted into enriching textual answers with media information.

Our research presents a novel method to automatically annotate question tags, with two components. Given a question, our method first roughly identifies a collection of probably relevant tags by finding a similar question space. This subtask is accomplished based on our proposed adaptive probabilistic hypergraph learning, where comprehensive information cues from users, questions and tags are seamlessly integrated. Next, it performs a heuristic approach to further filter the tag candidates by simultaneously taking informativeness, stability and question closeness into consideration. This effectively keeps off subjective, ambiguous and generic tags. Based on the proposed method, we introduce one potential application scenario: a knowledge organizer. The newly generated knowledge hierarchy is user-navigable and reconfigurable.
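As a toy illustration of the filtering step, suppose each candidate tag comes with three precomputed scores for informativeness, stability and question closeness; a simple filter might combine them as below. The score values, the uniform averaging and the threshold are hypothetical placeholders, while the actual criteria and their estimation are detailed in Chapter 3.

```python
from typing import Dict, List

def filter_tags(candidates: Dict[str, Dict[str, float]],
                threshold: float = 0.5) -> List[str]:
    """Keep candidate tags whose combined score exceeds a threshold.

    candidates maps a tag to three hypothetical scores in [0, 1]:
    informativeness, stability and question closeness.
    """
    kept = []
    for tag, s in candidates.items():
        combined = (s["informativeness"] + s["stability"] + s["closeness"]) / 3.0
        if combined >= threshold:
            kept.append((combined, tag))
    return [tag for _, tag in sorted(kept, reverse=True)]

candidates = {
    "weather-vane": {"informativeness": 0.9, "stability": 0.8, "closeness": 0.9},
    "stuff":        {"informativeness": 0.1, "stability": 0.4, "closeness": 0.3},
}
print(filter_tags(candidates))   # the generic tag "stuff" is dropped
```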
1.4.2 Answer Medium Determination

This part first learns a four-category classifier to preliminarily select the answer medium candidate8, which is able to intuitively express the desired answer. However, in the QA domain, precision is usually much more essential than visual intuitiveness, since information seekers intend to locate accurate information, while visualization just provides an auxiliary function to assist users' access. For example, for the question "How do I export Internet Explorer browser history", it is intuitive that it should be answered using video content, but in fact video resources related to this topic are hard to find through the current dominant video search engines. However, the textual content related to this topic is very rich. In such a situation, users may prefer the accurate textual answers rather than the visually rich but irrelevant ones.

8 The four categories include Text, Text+Image, Text+Video and Text+Image+Video.
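Purely for illustration, a keyword-cue baseline for this four-way decision might look like the sketch below. The cue lists are invented here; the classifier actually learned in Chapter 4 uses richer question- and answer-based features rather than hand-written rules.

```python
def guess_answer_medium(question: str) -> str:
    """Toy four-way heuristic: text, text+image, text+video, text+image+video."""
    q = question.lower()
    wants_video = any(cue in q for cue in ("how to", "how do i", "steps to"))
    wants_image = any(cue in q for cue in ("what does", "look like", "who is", "where is"))
    if wants_video and wants_image:
        return "text+image+video"
    if wants_video:
        return "text+video"
    if wants_image:
        return "text+image"
    return "text"

for q in ("What day is President Obama's birthday?",
          "Who is Obama?",
          "How to cook beef?"):
    print(q, "->", guess_answer_medium(q))
```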