
Application of visual question answering using bert integrated with knowledge base to answer extensive question


DOCUMENT INFORMATION

Basic information

Title: Application of Visual Question Answering Using BERT Integrated with Knowledge Base to Answer Extensive Question
Author: Phạm Điền Khoa
Supervisors: Assoc. Prof. Dr. Quản Thành Thơ, Dr. Bùi Hoài Thắng
University: Ho Chi Minh City University of Technology
Major: Computer Science
Document type: Master's thesis
Year: 2023
City: Ho Chi Minh City

Format

Number of pages: 87
File size: 1.52 MB


Structure

  • CHAPTER 1: INTRODUCTION
    • 1.1 Research Problem
    • 1.2 Overview of VQA and KBVQA Problem
      • 1.2.1 Visual Question Answering (VQA)
      • 1.2.2 Knowledge-based Visual Question Answering (KBVQA)
    • 1.3 Target and Scope of the Thesis
    • 1.4 Limits of the Thesis
    • 1.5 Contribution of the Thesis
    • 1.6 Research Subjects
  • CHAPTER 2: BACKGROUND KNOWLEDGE
    • 2.1 Convolutional Neural Networks (CNNs)
      • 2.1.1 Architecture of CNN
      • 2.1.2 Residual Network (ResNet)
    • 2.2 Long Short-Term Memory (LSTM)
      • 2.2.1 Overview of Recurrent Neural Networks (RNNs)
      • 2.2.2 Long Short-Term Memory
    • 2.3 BERT
    • 2.4 Contrastive Language-Image Pre-Training (CLIP)
    • 2.5 Wikipedia Knowledge Base
  • CHAPTER 3: RELATED WORKS
    • 3.1 Approaches of Knowledge-based Visual Question Answering
    • 3.2 Multimodal Fusion
    • 3.3 Information Retrieval (IR)
    • 3.4 Discussion
  • CHAPTER 4: PROPOSED SOLUTION
    • 4.1 Reference Models
    • 4.2 Evaluation Method
      • 4.2.1 Metric Evaluation: Precision, Recall and F1 Score
      • 4.2.2 Dataset for Evaluation
    • 4.3 Method
      • 4.3.1 Late Fusion Technique
      • 4.3.2 Information Retrieval
    • 4.4 Setting-up Experiment – Training BERT
      • 4.4.1 SQuAD2.0 Dataset
      • 4.4.2 Training BERT
    • 4.5 First Baseline: BERT Base with CNN
      • 4.5.1 Motivation and Idea
      • 4.5.2 System Description
      • 4.5.3 Result and Discussion
    • 4.6 Second Baseline: BERT Large and CLIP
      • 4.6.1 Motivation and Idea
      • 4.6.2 System Description
      • 4.6.3 Result and Discussion
    • 4.7 Experiment with GPT-2 for 2nd Baseline
      • 4.7.1 Motivation and Idea
      • 4.7.2 System Description
      • 4.7.3 Result and Discussion
  • CHAPTER 5: CONCLUSION

Contents

INTRODUCTION

Research Problem

Fusing multiple modalities, such as image and text, to retrieve relevant information is a long-standing problem in the field of artificial intelligence. In recent years, significant progress has been made in many subdomains of machine learning. Neural networks are now capable of solving Computer Vision and Natural Language Processing tasks in a much different way, with greater speed and accuracy. One of the most challenging and promising tasks in this area is Visual Question Answering (VQA), which aims to automatically generate an accurate answer to a natural language question about an image. It was once thought that developing a computer vision system capable of answering arbitrary natural language questions about images was an ambitious but intractable goal. However, since 2016, there has been tremendous progress in developing systems with these capabilities. VQA systems aim to correctly answer natural language questions about an image input and comprehend the contents of an image in the same way that humans do, while also communicating effectively about that image in natural language.

With the increasing availability of large-scale annotated datasets, deep learning-based approaches have recently achieved remarkable progress in VQA. However, most existing VQA models only focus on answering questions about objects or actions in images. One of the main challenges is the lack of knowledge representation in VQA systems, which limits their ability to handle complex questions that require a deeper understanding of the image content. A natural idea is to use the visual question answering system to answer more extensive questions about the information of the entities in the image, and to do that, to retrieve the information contained in a knowledge base and apply it to VQA. Combining all of this, a new task called Knowledge-based Visual Question Answering (KBVQA) has been proposed, which requires a model to retrieve relevant information about an entity in the image from a knowledge base and use it to answer questions.

To address the limitations of traditional Visual Question Answering mentioned above, Knowledge-based Visual Question Answering (KBVQA) has emerged as a more specialized direction within the VQA field. KBVQA requires external knowledge (knowledge bases, KBs) beyond the image to answer the question. By utilizing structured knowledge sources, KBVQA systems can reason about the relationships between visual concepts and provide more accurate and detailed answers to complex questions.

The integration of knowledge representation techniques in KBVQA has the potential to significantly improve the performance of VQA systems and enable them to tackle more complex and sophisticated questions. This master thesis aims to investigate the challenges and opportunities of KBVQA, explore the existing approaches and techniques in the field, and propose solutions to enhance the accuracy and efficiency of KBVQA systems.

Overview of VQA and KBVQA Problem

1.2.1 Visual Question Answering (VQA):

Visual question answering systems attempt to correctly answer natural language questions about an image input. The overarching goal of this problem is to create systems that can comprehend the contents of an image in the same way that humans do and communicate effectively about that image in natural language. This is a difficult task because image-based models and natural language models must interact and complement each other.

Figure 1.1: Illustration of VQA's tasks

The problem is widely regarded as AI-complete, i.e., one that addresses the problem of Artificial General Intelligence, namely making computers as intelligent as humans.

Table 1.1 gives an idea of the subproblems involved in the task of visual question answering:

Table 1.1: Computer vision sub-tasks required to be solved by VQA

Computer Vision Task | Representative VQA Question
Object recognition | What is in the image?
Object detection | Are there any dogs in the picture?
Attribute classification | What color is the umbrella?
Scene classification | Is it raining?
Counting | How many people are there in the image?
Activity recognition | Is the child crying?
Spatial relationships among objects | What is between the cat and the sofa?
Commonsense reasoning | Does this have 20/20 vision?
Knowledge-base reasoning | Is this a vegetarian pizza?

People can now create models that can recognize objects in images with high accuracy. However, we are still a long way from human-level image comprehension. When we look at images, we see objects but also understand how they interact and can tell their state and properties. Visual question answering (VQA) is particularly intriguing because it allows us to learn about what our models actually see. We present the model with an image and a question in the form of natural language, and the model generates an answer, again in the form of natural language.

1.2.2 Knowledge-based Visual Question Answering (KBVQA):

Knowledge-Based Visual Question Answering (KBVQA) is a new task that combines the capabilities of computer vision, natural language processing, and knowledge representation. KBVQA systems aim to answer questions about entities in images by utilizing external knowledge sources, such as the Wikipedia knowledge base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. To address the task, one must thus retrieve relevant information from a KB (Knowledge Base). The main challenge in KBVQA is to enable machines to reason about the relationships between visual concepts and knowledge sources to provide accurate and detailed answers to complex questions. This contrasts with standard Visual Question Answering, where questions target the content of the image (e.g., the color of an object or the number of objects).

KBVQA is an extension of the Visual Question Answering (VQA) task, which aims to generate accurate answers to natural language questions about an image. However, the limitation of traditional VQA models is their inability to handle complex questions that require a deeper understanding of the image content. By utilizing structured knowledge sources, KBVQA systems can address this limitation and enhance the performance of VQA systems. In KBVQA, both text and image modalities bring useful information that must be combined, together with knowledge-base querying techniques. Therefore, the task is more broadly related to Multimodal Information Retrieval (IR) and Multimodal Fusion.

Figure 1.2: The question and relevant item in the Knowledge Base

Recent advances in deep learning, natural language processing, and knowledge representation have led to significant progress in KBVQA. Several approaches have been proposed to address the challenges of KBVQA, such as knowledge graph-based methods, multi-modal fusion methods, and attention-based methods. These approaches aim to improve the accuracy and efficiency of KBVQA systems and enable them to tackle more complex and sophisticated questions.

In this master thesis, I will investigate the challenges and opportunities of KBVQA, explore the existing approaches and techniques in the field, and propose solutions to enhance the accuracy and efficiency of KBVQA systems using a multi-modal fusion method, consisting of BERT (Bidirectional Encoder Representations from Transformers) as the language model working with CNNs (Convolutional Neural Networks) or CLIP (Contrastive Language-Image Pre-Training) as the image processing models, and the Wikipedia Knowledge Base. I will also evaluate the performance of different KBVQA models on different baselines and datasets and analyze the results to gain insights into the strengths and weaknesses of each approach.

Target and Scope of the Thesis

The target of the thesis is to research and build a Knowledge-based Visual Question Answering system using deep learning methods and natural language processing techniques. Specifically:

• Understand and use deep learning models and processing techniques in the natural language and vision domains

• Understand and use the combination of models with different functions, which is called multi-modal fusion, especially the late fusion technique

• Understand and apply forms of information retrieval, and apply information retrieval to the knowledge base

• Understand problem solving methods, especially recent ones using modern deep learning models such as BERT and CLIP, and show the advantages and disadvantages of each method

• Make suggestions that can improve the performance of the system

• After the thesis, the student has a more accurate view of natural language processing in particular and of deep learning and machine learning in general, and better understands the problems, challenges and feasibility of applying deep learning and machine learning to solve a real-world problem

From the above objectives, the student set out the tasks to be performed in the process of completing the thesis:

• Study the problem of visual question answering combined with a knowledge base, related works, problem solving methods, and the advantages and disadvantages of these methods

• Propose models to improve the accuracy of the knowledge-based visual question answering task

• Experiment with and evaluate the results of the proposed systems

• Summarize the outstanding issues and propose future research.

Limits of the Thesis

Knowledge-based Visual Question Answering is a complex problem and has many tasks and methods, so the content of the thesis will be limited as follows:

• Focus on Knowledge-based visual question answering in the direction of multimodal fusion, especially the late fusion technique

• The language of the datasets is English

• Deep learning models: CNN, LSTM, CLIP, BERT, GPT-2

• The model is evaluated based on precision, recall and F1-score for the knowledge-base query and for answering the question (the standard definitions are recalled below).
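For reference, precision, recall and the F1-score are computed with their standard definitions, where TP, FP and FN denote true positives, false positives and false negatives:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```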

Contribution of the Thesis

In this thesis, the student proposes a method to improve the answers from a VQA system by retrieving information with a knowledge-base retrieval method and combining the BERT and CLIP models into the system. GPT-2 is integrated into the answering module to expand on information related to the entity in question and provide an appropriate answer about that entity.

Research Subjects

This thesis consists of 5 chapters:

• Chapter 1. Introduction: Introduces the development of the current multimodal fusion trend, and describes the knowledge-based visual question answering problem, commonly used datasets, and evaluation methods

• Chapter 2. Background: Discusses the basic knowledge background in deep learning, from Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) to Bidirectional Encoder Representations from Transformers (BERT), Contrastive Language-Image Pre-Training (CLIP) and the Generative Pre-trained Transformer (GPT-2)

• Chapter 3. Related works: Discusses related studies, deep learning models and multimodal fusion methods, thereby giving direction for the topic

• Chapter 4. The Proposed Models: Describes the student's proposed models for the Knowledge-based Visual Question Answering problem and the experimental results

• Chapter 5. Conclusion: Summarizes the contributions of the thesis and the outstanding issues of the KBVQA problem, and discusses future research.

BACKGROUND KNOWLEDGE

Knowledge-based Visual Question Answering (KBVQA) is a recent and challenging research area in computer vision and natural language processing that requires the integration of knowledge representation techniques to enable machines to answer complex and sophisticated questions about visual content. In this thesis, we focus on the main tasks involved in KBVQA, including natural language question processing by Bidirectional Encoder Representations from Transformers (BERT) or Long Short-Term Memory (LSTM) models, image understanding using Convolutional Neural Networks (CNNs), and the use of advanced pre-trained models such as Contrastive Language-Image Pre-Training (CLIP) together with OpenAI's GPT-2 to generate relevant and accurate answers to questions.

Convolutional Neural Networks (CNNs)

Convolutional neural networks are a specialized type of artificial neural network that use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers. They are specifically designed to process pixel data and are used in image recognition and processing.

A convolutional neural network (CNN, or ConvNet) is a type of artificial neural network (ANN) that is most commonly used in deep learning to analyze visual imagery. CNNs are also referred to as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), due to the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.

Due to the downsampling operation used on the input, most convolutional neural networks are not translationally invariant. Image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series are just a few of their applications.

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons are typically fully connected networks, in which each neuron in one layer is connected to all neurons in the next layer. These networks' "full connectivity" makes them prone to data overfitting. Typical methods of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity. CNNs take a different approach to regularization: they use the hierarchical pattern in data to assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. As a result, CNNs are at the lower end of the connectivity and complexity scale.

Convolutional neural networks were inspired by biological processes in that the pattern of connectivity between neurons resembles the organization of the animal visual cortex. Individual cortical neurons only respond to stimuli in a small region of the visual field known as the receptive field. Different neurons' receptive fields partially overlap to cover the entire visual field.

In comparison to other image classification algorithms, CNNs require little pre-processing. This means that, unlike traditional algorithms, the network learns to optimize the filters (or kernels) through automated learning. This lack of reliance on prior knowledge and human intervention in feature extraction is a significant benefit.

A convolutional neural network consists of an input layer, hidden layers and an output layer. In any feed-forward neural network, any middle layers are called hidden because their inputs and outputs are masked by the activation function and final convolution. In a convolutional neural network, the hidden layers include layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.
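To make this layer structure concrete, here is a minimal sketch of such a convolution / ReLU / pooling / fully connected stack in PyTorch. It is an illustrative toy network (the channel counts, input size and number of classes are assumptions), not the model used in the thesis:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolution -> ReLU -> pooling blocks, then a fully connected classifier."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution produces a feature map
            nn.ReLU(inplace=True),                         # activation
            nn.MaxPool2d(2),                               # pooling downsamples the feature map
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)          # [B, 32, 56, 56] for a 224x224 RGB input
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Example: a batch of two 224x224 RGB images
logits = TinyCNN()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```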

Following the first CNN-based architecture (AlexNet), which won the ImageNet 2012 competition, each subsequent winning architecture employed more layers in a deep neural network to reduce error rates. This works for fewer layers, but as the number of layers increases, we encounter a common problem in deep learning known as the vanishing/exploding gradient: the gradient becomes 0 or too large. As the number of layers increases, so does the training and testing error rate.

Figure 2.2 Comparison of 20-layer vs 56-layer architecture

The above plot shows that a 56-layer CNN architecture produces more error on both the training and testing datasets than a 20-layer CNN architecture. After further investigation into the error rate, the authors concluded that it is caused by a vanishing/exploding gradient.

ResNet, which was proposed in 2015 by Microsoft Research researchers, introduced a new architecture called Residual Network

Residual Network: This architecture introduced the concept of Residual Blocks to solve the problem of the vanishing/exploding gradient. In this network, we employ a technique known as skip connections. The skip connection connects layer activations to subsequent layers by skipping some layers in between. This results in the formation of a residual block. These residual blocks are stacked together to form ResNets.

Instead of having the layers learn the underlying mapping directly, this network allows them to fit a residual mapping. So, rather than using, say, H(x) as the initial mapping, the network fits F(x) = H(x) - x, and the original mapping is recovered as H(x) = F(x) + x.

Figure 2.3 Skip (Shortcut) connection of ResNet

The benefit of including this type of skip connection is that if any layer degrades architecture performance, it will be skipped by regularization. As a result, training a very deep neural network is possible without the issues caused by vanishing/exploding gradients. The researchers conducted experiments with 100 to 1000 layers on the CIFAR-10 dataset.
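As an illustration of the skip connection described above, below is a minimal sketch of a residual block in PyTorch, computing ReLU(F(x) + x) with F built from two 3x3 convolutions. The channel sizes are illustrative assumptions, not the exact ResNet implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where F is two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                           # skip connection carries the input forward
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))        # this is the residual mapping F(x)
        return F.relu(out + identity)          # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```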

A similar approach is known as "highway networks", and these networks, too, use skip connections. These skip connections, like those in LSTM, make use of parametric gates. These gates control how much data passes through the skip connection. However, this architecture has not provided greater accuracy than the ResNet architecture.

Network Architecture: This network employs a 34-layer plain network architecture inspired by VGG-19, to which the shortcut connections are added. These shortcut connections then transform the architecture into a residual network.

Figure 2.4: ResNet-34 architecture

Image Embeddings by ResNet:

To compute a high-level representation of the input image I, I use a pretrained convolutional neural network (CNN) model based on the residual network (ResNet) architecture.

This representation is a 14 x 14 x 2048-dimensional three-dimensional tensor taken from the last layer of the residual network (ResNet) before the final pooling layer. I also apply l2 normalization to the depth (last) dimension of the image features, which improves learning dynamics.
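A minimal sketch of how such 14 x 14 x 2048 features could be extracted and l2-normalized with a pretrained torchvision ResNet follows; the 448 x 448 input size and the choice of ResNet-152 are assumptions made so that the last convolutional stage yields a 14 x 14 grid:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Keep everything up to (but excluding) the final average pooling and classifier layers.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # 448x448 input -> 14x14 spatial grid at the last stage
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feats = feature_extractor(image)          # [1, 2048, 14, 14]
    feats = F.normalize(feats, p=2, dim=1)    # l2-normalize the 2048-d depth dimension
print(feats.shape)
```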

2.2 Long Short-Term Memory (LSTM):

2.2.1 Overview of Recurrent Neural Networks (RNNs):

RELATED WORKS

In this section, we review the existing works and research contributions in the field of Knowledge-based Visual Question Answering (KBVQA). The literature on KBVQA encompasses a wide range of approaches, architectures, and evaluation methodologies. We categorize the related works into several key areas to provide a comprehensive understanding of the advancements in the field.

Knowledge-based Visual Question Answering (KBVQA) is a challenging task. It consists in answering questions about entities grounded in a visual context using a Knowledge Base (KB). To address the KBVQA task, one must thus retrieve relevant information from a KB. This contrasts with standard Visual Question Answering, where questions target the content of the image (e.g., the color of an object or the number of objects). In KBVQA, both text and image modalities bring useful information that must be combined. Therefore, the task is more broadly related to Multimodal Information Retrieval (IR) and Multimodal Fusion.

Several studies have focused on integrating external knowledge bases with VQA systems to enhance their reasoning capabilities. Approaches have been proposed to retrieve relevant information from knowledge bases and utilize it to answer questions accurately. These works leverage techniques such as knowledge graph embeddings, graph neural networks, and ontology-based reasoning to establish connections between visual concepts and knowledge representations. In addition, different architectural choices have been explored in KBVQA, ranging from traditional neural network architectures to more advanced models. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based models such as BERT and GPT-2 have been widely used to process visual and textual information. Additionally, hybrid architectures that combine multiple modalities, such as vision-language models (e.g., ViLBERT, LXMERT), have shown promising results in capturing the interactions between images and questions.

Figure 3.1: Knowledge-based Visual Question Answering’s Taxonomy

Visual Question Answering (VQA): VQA has become increasingly popular in recent years. Recent VQA research studies can be broadly classified as follows: improved visual features [1, 14, 30, 42], more powerful model architectures [12, 15, 18, 38, 40], and more effective learning paradigms [7, 8, 9, 22, 25, 33, 39]. The Transformer architecture is used by the majority of today's cutting-edge VQA methods [34]. They have approached or even surpassed human-level performance on several representative benchmarks by incorporating vision-language pretraining on large-scale datasets [3, 12, 24, 37]. Aside from general-purpose VQA studies, there is a growing trend toward investigating more granular VQA tasks with specific reasoning skills, such as neural-symbolic reasoning [13, ...].

Question Answering (QA): Because the approach to KBVQA is based on a text-based KB, it is closely related to text Question Answering (QA). TREC QA evaluations boosted the popularity of text QA [35]. It has largely been approached as a two-stage problem, with an Information Retrieval (IR) stage followed by a Reading Comprehension (RC) stage, and a global emphasis on factoid questions (e.g., [5]).

Computer Vision and Knowledge Graph: Combining Computer Vision and background knowledge, typically a Knowledge Graph, is gaining popularity. This interest is not limited to VQA applications, but is also applied to traditional Computer Vision tasks such as image classification (Marino, Salakhutdinov, and Gupta 2017), object detection (Fang et al. 2017), and zero-shot image tagging (Lee et al. 2018). Most of these works, however, continue to focus on bridging images and commonsense Knowledge Graphs, such as WordNet (Miller 1995) and ConceptNet (Liu and Singh 2004), as opposed to ours, which requires bridging images and world knowledge.

Face identification and context: The visual entity linking task, i.e., linking named entities appearing in images to a Knowledge Graph, necessitates face recognition at web scale. Face recognition has been a well-studied and highly successful problem in Computer Vision, with large-scale training sets introduced (Guo et al. 2016; Cao et al. 2018). However, in the literature, testing is still done on a much smaller scale than what KVQA requires. There have also been studies that show the value of context in improving face identification, particularly in limited settings (Bharadwaj, Vatsa, and Singh 2014; Lin et al. 2010; O'Hare and Smeaton 2009).

3.1 Approaches of Knowledge-based Visual Question Answering

Approaches in Knowledge-based Visual Question Answering (KBVQA) aim to leverage external knowledge sources to enhance the reasoning capabilities of VQA systems These approaches utilize various techniques and methodologies to retrieve relevant information from knowledge bases and utilize it to generate accurate and detailed answers Here are some common approaches employed in KBVQA:

1. Knowledge Graph-based Approaches: Knowledge graph-based approaches represent entities and their relationships in a structured knowledge graph. They utilize graph-based reasoning techniques to traverse the knowledge graph and infer answers to visual questions. These approaches leverage graph embedding models, graph neural networks, and reasoning algorithms to capture the semantic relationships and hierarchies within the knowledge graph.

2. Ontology-based Approaches: Ontology-based approaches utilize ontologies, which define concepts, properties, and relationships, to represent knowledge. They exploit ontological reasoning techniques to infer answers based on the logical connections between entities and attributes. These approaches leverage ontology alignment, ontology reasoning, and semantic matching to bridge the gap between visual content and knowledge representation.

3. Information Retrieval-based Approaches: Information retrieval-based approaches focus on retrieving relevant information from knowledge bases to answer questions. They employ techniques such as entity linking, entity disambiguation, and semantic matching to extract knowledge that is most relevant to the visual question. These approaches leverage text-based indexing and retrieval algorithms to retrieve relevant textual information from the knowledge base.

4. Generative Reasoning Approaches: Generative reasoning approaches aim to reason over the available knowledge base to generate answers. They employ logical reasoning, rule-based inference, and probabilistic graphical models to derive answers based on the existing knowledge. These approaches capture complex relationships and dependencies within the knowledge base to generate coherent and contextually appropriate answers.

5. Hybrid Models: Hybrid models combine multiple modalities, such as visual features and textual information, to improve the understanding and reasoning capabilities of KBVQA systems. They leverage deep learning architectures, such as vision-language models, to jointly process visual and textual inputs. These models capture the interactions between images and questions, enabling better integration of visual content and external knowledge.

These approaches in KBVQA highlight the diverse strategies employed to incorporate external knowledge into the VQA process. Each approach has its own strengths and limitations, and the choice of approach depends on the specific requirements and characteristics of the KBVQA task at hand. By exploring and analyzing these approaches, researchers can advance the field of KBVQA and develop more effective and intelligent question answering systems.

The success of BERT in NLP [10], which relies on the easily-parallelizable Transformer architecture [34], an unsupervised training objective, and a task-agnostic architecture, has concurrently inspired many works in the VQA and cross-modal retrieval fields [33, 28, 23, 32, 21, 6]. These models are unified under a single framework in [4] and partly reviewed in [18]. All of these models rely on the Transformer architecture, often initialized with a pre-trained BERT, in order to fuse image and text. The training is weakly supervised, based upon image caption datasets such as COCO [41] or Conceptual Captions [29], and pre-trained object detectors like Faster R-CNN [26]. [11] show that these models learn nontrivial interactions between the modalities for VQA. Multimodal BERTs can be broadly categorized into single-stream and multi-stream. Single-stream models feed both text tokens' embeddings and image regions' embeddings to the same Transformer model, relying on the self-attention mechanism to fuse them. Instead, in the multi-stream architecture, text and image are first processed by two independent Transformers before using cross-attention to fuse the modalities. Both architectures have been shown to perform equally well in [10]. In this work, I use a single-stream model to take advantage of pre-training on text only (on QA datasets).
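To illustrate the single-stream idea only (this is not any specific published model), the sketch below projects image-region features into the token embedding space, concatenates them with the text tokens, and lets a single Transformer encoder fuse both modalities through self-attention; all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """One Transformer encoder over the concatenation of text tokens and image regions."""
    def __init__(self, vocab_size=30522, d_model=768, region_dim=2048, nhead=12, num_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)  # map visual features into text space
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, region_feats):
        text = self.token_emb(token_ids)            # [B, T, d_model]
        regions = self.region_proj(region_feats)    # [B, R, d_model]
        fused = torch.cat([text, regions], dim=1)   # single stream: one joint sequence
        return self.encoder(fused)                  # self-attention fuses the modalities

tokens = torch.randint(0, 30522, (1, 12))
regions = torch.randn(1, 36, 2048)
print(SingleStreamFusion()(tokens, regions).shape)  # torch.Size([1, 48, 768])
```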

PROPOSED SOLUTION

The main goal of this thesis is to improve the question answering feature of the Visual Question Answering problem in order to answer extensive questions about the entity in the picture. The traditional VQA problem only asks about basic properties of the entity present in the image; for example, if the input image fed into the system is a portrait of Albert Einstein, the system can only answer questions like "Is this a person?", "How many people are there in this photo?", or "What color is this person's shirt?". In short, the nature of a traditional VQA system is that it can only answer questions about properties contained in the content of the image. But what if we ask deeper questions about the entity in the image, for example "What is this person's most famous research achievement?"; the traditional VQA system cannot answer that question. Because of that, integrating a knowledge base about the entity, so that the system can query that KB and answer deeper and more extensive questions about the information of the entity in the image, is the goal of this thesis.

Figure 4.1: Types of questions that can be asked of traditional VQA versus Knowledge-based VQA

To do that, I propose a multimodal fusion method, more precisely the late fusion technique, to merge deep learning models with a KB to perform the above task. Since asking questions about all the different types of entities would require a dataset large enough to cover all domains, which is extremely difficult given the size and time limits of a thesis, the main domain of the KBVQA system in this thesis is answering questions about great and famous people in the world, such as Barack Obama, Einstein, and Newton. Multimodal entity representation is a critical problem that will enable more natural human-machine interactions; for example, when viewing a movie, one can think, "Where did I already see this actress?" or "Did she ever win an Oscar?"

Basically, the way the KBVQA system works, from the architectural point of view, can be shown as below:

The architecture of the KBVQA system can be described as a combination of several components joined by the late fusion technique, including language models, information retrieval, and visual feature extraction. Here is a breakdown of the architecture (a condensed code sketch of this pipeline follows the list):

• Question: The system takes user input in the form of a natural language question.

• Image: The user provides the path to an image file. In practice, producing the name of the entity to feed into the system is the job of two modules, namely "Object detection" and "Face recognition"; these two modules detect and recognize the face of the entity, and their output is a "named entity" used by the system for querying the KB. However, because coordinating these modules would exceed the volume of this thesis, "object detection" and "face recognition" will be processed and implemented in the future work section.

• Wikipedia knowledge base: The system retrieves the Wikipedia page for the named entity to gather relevant information. Because of the time limit, this thesis only uses a text-based KB; the implementation of image retrieval for a KB that includes images will be done in the future work section.

• LSTM Model: The LSTM model serves as the language modeling component in the system. It takes the input question and the summary text from Wikipedia as its input. The LSTM processes the sequential information in the text and learns to capture the semantic meaning and contextual understanding of words and phrases. It generates an encoded representation of the input text that incorporates the question and the summary information.

• BERT for Question Answering: The system utilizes a BERT model (BertForQuestionAnswering) for answering the user's question. It employs the transformers library to load the model and tokenizer. The question and the extracted summary from the Wikipedia page are passed to the BERT model, which generates start and end scores representing the answer span in the text.

• CNN Model: The CNN model acts as the visual feature extraction component. It takes the input image and processes it to extract visual features. The CNN model is responsible for analyzing the visual content of the image and encoding it into a compact feature representation. The extracted visual features capture high-level visual information relevant to answering the question.

• CLIP Model: To incorporate visual information, the system employs the CLIP model (model_clip) for image feature extraction. It uses the clip library to load the model. The user-provided image is preprocessed using transforms from the torchvision library and passed through the CLIP model to obtain image features.

• Answer Generation: The concatenated representation is used to generate the final answer. The system selects the answer span by identifying the indices with the highest scores. The selected span is then converted back into a string representation using the BERT tokenizer.

• Answer Presentation: The system outputs the generated answer to the user, which represents the response to their question about the named entity.
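The following is a condensed sketch of how these components could be wired together. The checkpoint names, the use of the wikipedia, transformers and clip packages, and the helper function answer() are illustrative assumptions rather than the exact thesis code:

```python
import torch
import wikipedia
import clip
from PIL import Image
from transformers import BertForQuestionAnswering, BertTokenizer

# Language side: BERT fine-tuned for extractive QA (SQuAD-style start/end span prediction).
qa_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(qa_name)
qa_model = BertForQuestionAnswering.from_pretrained(qa_name).eval()

# Vision side: CLIP image encoder for visual features.
clip_model, clip_preprocess = clip.load("ViT-B/32")

def answer(question: str, named_entity: str, image_path: str) -> str:
    # 1. Knowledge base: retrieve the Wikipedia summary for the named entity.
    context = wikipedia.summary(named_entity, sentences=10)

    # 2. Visual features: encode the image with CLIP
    #    (these features would feed the fusion/verification step, omitted here).
    image = clip_preprocess(Image.open(image_path)).unsqueeze(0)
    with torch.no_grad():
        image_features = clip_model.encode_image(image)

    # 3. Extractive QA: BERT scores the start/end of the answer span in the context.
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = qa_model(**inputs)
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits) + 1

    # 4. Convert the selected token span back into a string answer.
    span_ids = inputs["input_ids"][0][start:end]
    return tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(span_ids))

print(answer("What is this person's most famous research achievement?",
             "Albert Einstein", "einstein.jpg"))
```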

I propose two baselines to build the KBVQA system:

• Upgrade the traditional VQA model (including BERT base and CNN) by combining it with the Knowledge Base, where the LSTM can be replaced by BERT base

• Build the KBVQA system with the dual encoders BERT large and CLIP combined with the Knowledge Base. Another approach for the 2nd baseline is building the KBVQA system with BERT, CLIP and additional context generated by GPT-2 (a sketch of this GPT-2 context expansion follows below)
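For the GPT-2 extension of the 2nd baseline, a minimal sketch of how extra context about the entity could be generated with Hugging Face's GPT-2 is given below; the prompt format and decoding parameters are illustrative assumptions:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def expand_answer(entity: str, short_answer: str) -> str:
    """Generate a few extra sentences of context around the extracted answer."""
    prompt = f"{entity}: {short_answer}."
    input_ids = gpt2_tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = gpt2_model.generate(
            input_ids,
            max_length=80,
            do_sample=True,      # sampling gives more varied elaborations
            top_p=0.9,
            pad_token_id=gpt2_tokenizer.eos_token_id,
        )
    return gpt2_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(expand_answer("Albert Einstein", "the theory of relativity"))
```

As discussed in the conclusion, free generation of this kind can drift into unrelated content, which is why it is treated as an optional extension of the answering module.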

Here I refer to VQA systems from many sources, but mainly from two papers: "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering" by Vahid Kazemi and Ali Elqursh, and "KVQA: Knowledge-Aware Visual Question Answering" by Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar.

Figure 4.2 is a high-level overview of the model in the paper "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering". To embed the image, the authors use a convolutional neural network based on ResNet. Tokenized and embedded input questions are fed into a multi-layer LSTM. The LSTM's final state and the concatenated image features are used to compute multiple attention distributions over image features. The concatenated image feature glimpses and LSTM state are fed into two fully connected layers, which generate probabilities over answer classes.

Figure 4.2: Overview of the "Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering" VQA system

Figure 4.2 depicts a high-level overview of the model. To summarize, the proposed model encodes the question using long short-term memory units (LSTM) and computes image features using a deep residual network. A soft attention mechanism is used to compute multiple glimpses of image features based on the state of the LSTM. A classifier then uses the image feature glimpses and the final state of the LSTM as input to generate probabilities over a fixed set of the most frequent answers. On the VQA 1.0 open-ended challenge, the model achieves 64.6% accuracy on the test-standard set without using additional data, an improvement of 0.4% over the state of the art, and on the newly released VQA 2.0, the model scores 59.7% on the validation set, outperforming the best results reported in the paper by 0.5%.
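A simplified sketch of this attention-and-classify step is given below, based on the paper's description; the number of glimpses (two), the hidden sizes and the answer vocabulary size are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQAHead(nn.Module):
    """Soft attention over ResNet features conditioned on the LSTM question state."""
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, glimpses=2, num_answers=3000):
        super().__init__()
        self.att_conv1 = nn.Conv2d(img_dim + q_dim, hidden, kernel_size=1)
        self.att_conv2 = nn.Conv2d(hidden, glimpses, kernel_size=1)
        self.fc1 = nn.Linear(glimpses * img_dim + q_dim, 1024)
        self.fc2 = nn.Linear(1024, num_answers)

    def forward(self, img_feats, q_state):
        # img_feats: [B, 2048, 14, 14], q_state: [B, 1024] (final LSTM state)
        B, C, H, W = img_feats.shape
        q_tiled = q_state[:, :, None, None].expand(-1, -1, H, W)
        att = self.att_conv2(F.relu(self.att_conv1(torch.cat([img_feats, q_tiled], dim=1))))
        att = F.softmax(att.view(B, -1, H * W), dim=-1)                   # attention maps
        v = img_feats.view(B, C, H * W)
        glimpses = torch.einsum("bgs,bcs->bgc", att, v).reshape(B, -1)    # weighted sums
        return self.fc2(F.relu(self.fc1(torch.cat([glimpses, q_state], dim=1))))

img = torch.randn(2, 2048, 14, 14)
q = torch.randn(2, 1024)
print(AttentionVQAHead()(img, q).shape)  # torch.Size([2, 3000]) scores over frequent answers
```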

Another reference is the model in the paper "KVQA: Knowledge-Aware Visual Question Answering".

CONCLUSION

In this master thesis on Knowledge-based Visual Question Answering (KBVQA), we have explored and compared two different baselines: KBVQA with BERT base and CNN, and KBVQA with BERT large and CLIP. The goal was to develop an effective system that can accurately answer questions based on both textual and visual inputs.

The baseline version of KBVQA using BERT base and CNN has shown limitations in terms of answer quality and image processing errors. The answers generated by this version were often suboptimal, and the image processing module faced challenges in correctly analyzing and extracting features from the input images. On the other hand, the KBVQA system utilizing BERT large and CLIP has demonstrated remarkable results. The integration of BERT large, a more powerful language model, and CLIP, a superior image processing module, has significantly improved the system's ability to understand and generate accurate answers; CLIP solves the problems ResNet had with extracting image features. The precision, recall, and F1-score achieved by this version were notably high, indicating the effectiveness of the combined textual and visual understanding. Furthermore, the inclusion of the GPT-2 module in the KBVQA system allowed for the expansion of answer content. GPT-2 helped provide more detailed and comprehensive information about the queried entity. While this extension module exhibited functional capabilities, it occasionally generated non-standard or unrelated information, indicating the need for further refinement and context-awareness.

To improve the KBVQA system in the future, several steps can be taken. Firstly, training the BERT model with larger and more diverse datasets can enhance its understanding and answer generation capabilities. Additionally, exploring alternative language models or text generation techniques that surpass the limitations of GPT-2 can result in more reliable and contextually appropriate answers. Moreover, completing the module for object detection and face recognition would be beneficial to incorporate a comprehensive understanding of visual content; due to time constraints within the scope of this thesis, this module could not be fully implemented, but it remains a potential avenue for future development to further enhance the system's visual understanding and accuracy. Besides, the dependence on information queried from Wikipedia also needs to be reduced; most of the content of the answers still depends on Wikipedia, so in the future the system should incorporate KBs built from many different sources, increasing the accuracy of the system's answers. Combining data mining techniques to link the entity to its source page is also considered a way to upgrade the system.

In conclusion, the KBVQA system with BERT base and CNN exhibited limitations, while the versions utilizing BERT large and CLIP showcased excellent performance. The inclusion of the GPT-2 module expanded the context of the answers, albeit with some non-standard and unrelated information. To improve the system, future work should focus on training models with larger datasets, exploring advanced language models, and completing additional modules like object detection and face recognition. With further refinement and development, KBVQA systems hold great potential for real-world applications in various domains that require accurate and comprehensive question answering based on textual and visual inputs.

REFERENCES

[1] J.-B. Alayrac et al., “Flamingo: a Visual Language Model for Few-Shot Learning,” in NeurIPS, New Orleans, Louisiana, 2022.

[2] P. Anderson et al., “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,” in CVPR, Salt Lake City, Utah, Mar. 2018, pp. 6077–6086.

[3] H. Bao et al., “VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts,” in NeurIPS, 2021. doi: 10.48550/arXiv.2111.02358.

[4] E. Bugliarello, R. Cotterell, N. Okazaki, and D. Elliott, “Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 978–994, Sep. 2021.

[5] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading Wikipedia to Answer Open-Domain Questions,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, 2017, pp. 1870–1879.

[6] Y.-C. Chen et al., “UNITER: UNiversal Image-Text Representation Learning,” presented at the European Conference on Computer Vision, Glasgow, Jul. 2020, p. 104.

[7] Y.-C. Chen et al., “UNITER: UNiversal Image-Text Representation Learning,” in ECCV, Glasgow, 2020, pp. 104–120.

[8] P. Clough, M. Sanderson, and H. Müller, “The CLEF Cross Language Image Retrieval Track (ImageCLEF) 2004,” in Image and Video Retrieval, 1st ed., vol. 3115, Springer International Publishing, Jan. 2004, pp. 243–251.

[9] Y. Cui et al., “ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration,” in Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, Oct. 2021, pp. 797–806. doi: 10.1145/3474085.3475251.

[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186.

[11] J. Hessel and L. Lee, “Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2020, pp. 861–877.

[12] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to Reason: End-to-End Module Networks for Visual Question Answering,” in ICCV, Venice, 2017. doi: 10.48550/arxiv.1704.05526.

[13] D. A. Hudson and C. D. Manning, “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering,” in CVPR, Long Beach, 2019, pp. 6700–6709.

[14] H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen, “In Defense of Grid Features for Visual Question Answering,” in CVPR, 2020, pp. 10267–10276. doi: 10.1016/j.artint.2021.103635.

[15] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning,” in CVPR, Honolulu, 2017, pp. 2901–2910.

[16] S. Khan et al., “Transformers in Vision: A Survey,” 2022. Available: https://arxiv.org/pdf/2101.01169.pdf

[17] J.-H. Kim, J. Jun, B.-T. Zhang, and S. Brain, “Bilinear Attention Networks,” in NeurIPS, Montreal, 2018.

[18] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou, “Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training,” in Proceedings of the AAAI Conference on Artificial Intelligence 34(07), 2019, pp. 11336–11344. doi: 10.48550/arxiv.1908.06066.

[19] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-Aware Graph Attention Network for Visual Question Answering,” in ICCV, Seoul, 2019, pp. 10313–10322. doi: 10.48550/arxiv.1903.12314.

[20] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” in ICML, Baltimore, Maryland, 2022.

[21] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “VisualBERT: A Simple and Performant Baseline for Vision and Language,” 2019. Available: https://arxiv.org/abs/1908.03557

[22] X. Li et al., “Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks,” in ECCV, 2020, pp. 121–137.

[23] T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,” in D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., 1st ed., vol. 1, Springer International Publishing.

[24] J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, 2019.

[25] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge,” in CVPR, Long Beach, 2019, pp. 3195–3204. doi: 10.48550/arXiv.1906.00067.

[26] M. Narasimhan, S. Lazebnik, and A. Schwing, “Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering,” in NeurIPS, 2018.

[27] R. G. Reddy et al., “MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding,” arXiv, May 2022. doi: 10.48550/arXiv.2112.10728.

[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in NeurIPS, Montreal, 2015. Available: https://proceedings.neurips.cc/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html

[29] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, 2018.

[30] S. Shen et al., “How Much Can CLIP Benefit Vision-and-Language Tasks?,” in ICLR, Feb. 2022. doi: 10.48550/arXiv.2107.06383.

[31] R. K. Srihari, Z. Zhang, and A. Rao, “Intelligent Indexing and Semantic Retrieval of Multimodal Documents,” Information Retrieval, vol. 2, pp. 245–.

[32] W. Su et al., “VL-BERT: Pre-training of Generic Visual-Linguistic Representations,” in Proceedings of ICLR 2020, Feb. 2020. Available: https://iclr.cc/virtual_2020/index.html

[33] H. Tan and M. Bansal, “LXMERT: Learning Cross-Modality Encoder Representations from Transformers,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 5100–5111. Available: https://aclanthology.org/D19-1514.pdf

[34] A. Vaswani et al., “Attention Is All You Need,” in NeurIPS, Long Beach, 2017.

[35] E. M. Voorhees and D. M. Tice, “Building a question answering test collection,” in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’00), Athens, Greece: ACM Press, 2000.

[36] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, “FVQA: Fact-based Visual Question Answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413–2427, Aug. 2017.

[37] P. Wang et al., “Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework,” in ICML, Baltimore, Maryland, 2022, pp. 21218–23340.

[38] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian, “Deep Multimodal Neural Architecture Search,” in ACM MM, Seattle, Oct. 2020, pp. 3743–3752. doi: 10.48550/arXiv.2004.12070.

[39] J. Yu et al., “CoCa: Contrastive Captioners are Image-Text Foundation Models,” Transactions on Machine Learning Research, vol. 1, no. 3, 2022.

[40] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering,” in ICCV, Venice, Aug. 2017, pp. 1821–1830. doi: 10.48550/arxiv.1708.01471.

[41] L. Yuan et al., “Florence: A New Foundation Model for Computer Vision,” arXiv, 2021.

[42] P. Zhang et al., “VinVL: Revisiting Visual Representations in Vision-Language Models,” in CVPR, Nashville, Tennessee, 2021, pp. 6077–6086. doi: 10.48550/arXiv.2101.00529.

[43] M. Zhou, L. Yu, A. Singh, M. Wang, Z. Yu, and N. Zhang, “Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment,” in CVPR, New Orleans, 2022.
