Image-Based Question Analysis Using BERT Combined with a Knowledge Base to Answer Detailed Questions


INTRODUCTION

RESEARCH PROBLEM

However, most existing VQA models focus only on answering questions about objects or actions visible in an image. One of the main challenges is the lack of knowledge representation in VQA systems, which limits their ability to handle complex questions that require a deeper understanding of the image content. This motivated the idea of using a visual question answering system to answer broader questions about the entities appearing in an image and, to that end, of retrieving information from a knowledge base, an approach that has already been applied to VQA.

OVERVIEW OF VQA AND KBVQA PROBLEM

    This master thesis investigates the challenges and opportunities of KBVQA, explores the existing approaches and techniques in the field, and proposes solutions to enhance the accuracy and efficiency of KBVQA systems. The proposed systems use a multi-modal fusion method that combines BERT (Bidirectional Encoder Representations from Transformers) as the language model with CNNs (Convolutional Neural Networks) or CLIP (Contrastive Language-Image Pre-Training) as the image processing models, together with Wikipedia as the knowledge base.

TARGET AND SCOPE OF THE THESIS

LIMITS OF THE THESIS

CONTRIBUTION OF THE THESIS

RESEARCH SUBJECTS

Chapter 4 The Proposed Models: presents the proposed models for the Knowledge-based Visual Question Answering problem and the experimental results. Chapter 5 Conclusion: summarizes the contributions of the thesis, the open issues of the KBVQA problem, and directions for future research.

    BACKGROUND KNOWLEDGE

• CONVOLUTIONAL NEURAL NETWORKS (CNNs)
• LONG SHORT-TERM MEMORY (LSTM)

Following the first CNN-based architecture (AlexNet), which won the ImageNet 2012 competition, each subsequent winning architecture has employed more layers in a deep neural network to reduce error rates. A recurrent neural network (RNN) is a type of artificial neural network in which connections between nodes can form a cycle, allowing the output of a node to affect its own subsequent input. An RNN with LSTM units can be trained in a supervised manner on a set of training sequences, using an optimization algorithm such as gradient descent combined with backpropagation through time to compute the gradients needed during optimization, changing each weight of the LSTM network in proportion to the derivative of the error (at the network's output layer) with respect to that weight.
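As a minimal illustration of the supervised training procedure described above, the sketch below (assuming PyTorch and a toy random dataset, neither of which is specified in the thesis) unrolls an LSTM over a sequence and updates its weights with gradient descent, where the backward pass performs backpropagation through time:

```python
import torch
import torch.nn as nn

# Toy setup: classify each length-10 sequence of 8-dim vectors into one of 2 classes.
# The data is random and purely illustrative.
x = torch.randn(32, 10, 8)          # (batch, time, features)
y = torch.randint(0, 2, (32,))      # class labels

class SeqClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
        self.head = nn.Linear(16, 2)

    def forward(self, seq):
        _, (h_n, _) = self.lstm(seq)   # final hidden state, shape (1, batch, 16)
        return self.head(h_n.squeeze(0))

model = SeqClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # plain gradient descent
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                   # backpropagation through time over the unrolled LSTM
    optimizer.step()                  # each weight moves in proportion to its error derivative
```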

In the BERT training process, the model receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the sentence that follows the first in the original document. Finally, the start and end scores, along with the extracted answer span, are concatenated with the image features from CLIP, and the resulting tensor is used to produce the final answer to the question. Most inspirational for CLIP is the work of Ang Li and his co-authors at FAIR, who in 2016 demonstrated using natural language supervision to enable zero-shot transfer to several existing computer vision classification datasets, such as the canonical ImageNet dataset.
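A rough sketch of the span-extraction and fusion step described above is given below, assuming Hugging Face transformers with a SQuAD-fine-tuned BERT checkpoint and the public CLIP ViT-B/32 checkpoint; the checkpoint names, the placeholder passage and image, and the way the fused tensor would be decoded into a final answer are my assumptions, not the thesis's implementation:

```python
import torch
from PIL import Image
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          CLIPModel, CLIPProcessor)

# Checkpoint names are assumptions (a SQuAD-fine-tuned BERT large and the public CLIP ViT-B/32).
qa_tok = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
qa_model = AutoModelForQuestionAnswering.from_pretrained(
    "bert-large-uncased-whole-word-masking-finetuned-squad")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

question = "Where was this person born?"
passage = "Marie Curie was born in Warsaw, in what was then the Kingdom of Poland."  # e.g. a retrieved Wikipedia paragraph
image = Image.new("RGB", (224, 224))  # stand-in for the real input photo

# 1. BERT produces start/end scores over the passage; the best span is the textual answer.
enc = qa_tok(question, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    qa_out = qa_model(**enc)
start = qa_out.start_logits.argmax(-1).item()
end = qa_out.end_logits.argmax(-1).item()
answer_span = qa_tok.decode(enc["input_ids"][0][start:end + 1])

# 2. CLIP encodes the image into a feature vector.
with torch.no_grad():
    img_feat = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))

# 3. Late fusion: concatenate the span scores with the image features.
#    How the fused tensor is decoded into the final answer is model-specific.
span_scores = torch.cat([qa_out.start_logits.max(-1).values,
                         qa_out.end_logits.max(-1).values]).unsqueeze(0)   # (1, 2)
fused = torch.cat([span_scores, img_feat], dim=-1)                         # (1, 2 + 512)
print(answer_span, fused.shape)
```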

This data is used to create the following proxy training task for CLIP: given an image, predict which out of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset. For instance, if the task of a dataset is classifying photos of dogs vs. cats, we check for each image whether a CLIP model predicts that the text description "a photo of a dog" or "a photo of a cat" is more likely to be paired with it. With CLIP, they tested whether task-agnostic pre-training on internet-scale natural language, which has powered a recent breakthrough in NLP, can also be leveraged to improve the performance of deep learning for other fields.
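The dog-vs-cat example above can be reproduced with a short zero-shot classification sketch, assuming the open-source CLIP checkpoint distributed through Hugging Face transformers; the blank test image is just a placeholder:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))            # stand-in for a real pet photo
prompts = ["a photo of a dog", "a photo of a cat"]

# Score each candidate caption against the image and pick the more likely pairing.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image    # image-text similarity scores, shape (1, 2)
probs = logits.softmax(dim=-1)[0]
print(dict(zip(prompts, probs.tolist())))
```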

The WIT dataset is considered one of the largest and most diverse datasets for training vision-and-language models, and has been used to pretrain models such as UniVL and CLIP. The idea behind using Wikipedia as the knowledge base is that it contains a large amount of information on a wide range of topics, making it a useful resource for answering factual questions.
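As a minimal sketch of using Wikipedia as the knowledge base, the snippet below fetches an article summary through the public MediaWiki REST API with `requests`; the entity name and the use of the summary endpoint are illustrative assumptions, since the thesis does not specify its exact retrieval mechanism:

```python
import requests

def wikipedia_summary(entity: str, lang: str = "en") -> str:
    """Fetch the lead summary of a Wikipedia article via the public REST API."""
    title = entity.replace(" ", "_")
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers={"User-Agent": "kbvqa-demo/0.1"}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

# The retrieved text can then serve as the reading passage for BERT's extractive QA.
print(wikipedia_summary("Marie Curie")[:200])
```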

        RELATED WORKS

Question answering has largely been approached as a two-stage problem, with an Information Retrieval (IR) stage followed by a Reading Comprehension (RC) stage, and a global emphasis on factoid questions. This interest is not limited to VQA applications, but also extends to traditional Computer Vision tasks such as image classification (Marino, Salakhutdinov, and Gupta 2017) and object detection (Fang et al.). Most of these works, however, continue to focus on bridging images and commonsense Knowledge Graphs, such as WordNet (Miller 1995) and ConceptNet (Liu and Singh 2004), as opposed to ours, which requires bridging images and world knowledge.

Face identification and context: the visual entity linking task, i.e., linking named entities appearing in images to a Knowledge Graph, necessitates face recognition at web scale. These approaches leverage graph embedding models, graph neural networks, and reasoning algorithms to capture the semantic relationships and hierarchies within the knowledge graph. Late fusion is simpler because the Natural Language Processing and Computer Vision components can be used independently, but it ignores intermodal interaction.

These methods employ information retrieval techniques, such as entity linking, entity disambiguation, and semantic matching, to extract useful knowledge and provide accurate answers. Overall, this thesis focuses on a Knowledge-based Visual Question Answering (KBVQA) system that combines techniques from natural language processing, computer vision, and knowledge base information retrieval using a multimodal fusion approach, in particular late fusion. While it demonstrates the effectiveness of the combined techniques, there is still room for improvement in terms of knowledge representation, model adaptation, and scalability.
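For concreteness, a late-fusion head can be sketched as follows: the text and image encoders run independently, and the only cross-modal interaction is a small classifier over the concatenated features. All dimensions below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Late fusion: each modality is encoded independently, and cross-modal
    interaction happens only in a small classifier over the concatenated features."""
    def __init__(self, text_dim=1024, image_dim=512, num_answers=1000):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, text_feat, image_feat):
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

# text_feat would come from BERT and image_feat from CLIP or a CNN; here they are random.
head = LateFusionHead()
scores = head(torch.randn(4, 1024), torch.randn(4, 512))   # (batch, num_answers)
```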

          PROPOSED SOLUTION

• EVALUATION METHOD
• METHOD
• SETTING UP THE EXPERIMENT – TRAINING BERT
• FIRST BASELINE: BERT BASE WITH CNN
• SECOND BASELINE: BERT LARGE AND CLIP
• EXPERIMENT WITH GPT-2 FOR THE 2ND BASELINE

Another reference is the model in the paper "KVQA: Knowledge-Aware Visual Question Answering". In that work, given an input image, a corresponding Wikipedia caption (which is optional in our setting), and a question, the authors' first goal is to identify the entities present in the image and the question. Through researching datasets for evaluating the KBVQA task, I found a dataset suited to my goal of answering questions about famous people based on an input photo loaded into the system, by querying the knowledge base. The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.
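For reference, the snippet below shows the SQuAD format (a crowdworker question, a Wikipedia passage, and the answer span with its character offset), assuming the Hugging Face `datasets` library, which the thesis does not mandate:

```python
from datasets import load_dataset

# Each SQuAD item pairs a question with a Wikipedia passage and an answer span inside it.
squad = load_dataset("squad", split="train[:3]")
for item in squad:
    print(item["question"])
    print(item["context"][:80], "...")
    print(item["answers"]["text"], item["answers"]["answer_start"])   # span text + char offset
```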

In the thesis proposal, I presented how traditional VQA systems typically rely on models such as LSTM and CNN to generate answers to questions about images, based on the model in the paper "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering". Additionally, exploring advanced image processing methods, such as specialized pre-trained CNN models for visual question answering tasks, could potentially improve the system's performance in handling a wider range of images. Moreover, the BERT large model I use has been trained on a more extensive corpus, the Stanford Question Answering Dataset (SQuAD), which is a benchmark dataset for evaluating machine reading comprehension models.
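A stripped-down version of such a CNN + LSTM baseline, without the attention mechanism of the cited paper and with illustrative dimensions and vocabulary sizes, could look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQABaseline(nn.Module):
    """CNN image encoder + LSTM question encoder, fused by concatenation and
    classified over a fixed answer vocabulary (sizes are illustrative)."""
    def __init__(self, vocab_size=10000, num_answers=1000):
        super().__init__()
        resnet = models.resnet18(weights=None)                      # use pretrained weights in practice
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])     # global 512-d image feature
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, 512, batch_first=True)
        self.classifier = nn.Linear(512 + 512, num_answers)

    def forward(self, image, question_tokens):
        img = self.cnn(image).flatten(1)                             # (batch, 512)
        _, (h, _) = self.lstm(self.embed(question_tokens))           # final hidden state
        return self.classifier(torch.cat([img, h.squeeze(0)], dim=-1))

model = SimpleVQABaseline()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)   # (2, num_answers)
```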

This enhanced representation capability allows BERT large to capture more subtle contextual cues and dependencies, which can be crucial for accurately answering complex questions in a knowledge-based visual question answering setting. While BERT base performs well on many natural language processing tasks, BERT large's larger model size, more extensive training data, and enhanced representation capacity make it a superior choice for KBVQA tasks. With its ability to understand images in the context of language, CLIP offers a modern and powerful alternative to traditional models such as CNNs (Convolutional Neural Networks) in the domain of Visual Question Answering (VQA).

                      We aimed to enhance the question answering capabilities by incorporating the GPT-2 module, which helps expand the content of the answers and provide more relevant information about the entity in question. The continuous development of better language models can enhance the overall performance and accuracy of the KBVQA system, ensuring that the generated answers are consistently relevant, informative, and aligned with the question and input context.
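A minimal sketch of such a GPT-2 answer-expansion step, assuming Hugging Face transformers and an illustrative prompt format of my own choosing, is shown below:

```python
from transformers import pipeline

# GPT-2 continues a prompt built from the question and the short extracted answer,
# turning it into a fuller natural-language response.
generator = pipeline("text-generation", model="gpt2")

question = "Where was Marie Curie born?"
short_answer = "Warsaw"
prompt = f"Question: {question}\nAnswer: {short_answer}. More details:"
expanded = generator(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
print(expanded)
```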

                      CONCLUSION

In addition, the dependence on information queried from Wikipedia needs to be addressed: most of the content of the answers still comes from Wikipedia, so in the future the system should incorporate knowledge bases from additional sources to increase the accuracy of its answers. To improve the system, future work should focus on training models with larger datasets, exploring advanced language models, and completing additional modules such as object detection and face recognition. With further refinement and development, KBVQA systems hold great potential for real-world applications in various domains that require accurate and comprehensive question answering based on textual and visual inputs.
