The objectives of this project are to research, design, and develop software compatible with various types of scanning devices to capture the scanned image of a document, and to apply artificial intelligence techniques to extract information from the captured images.
INTRODUCTION
Motivation
Vietnam is experiencing rapid digital transformation, highlighted by the national data sharing and integration platform that facilitates over 1.6 million daily transactions. The national insurance database has successfully verified information for 91 million citizens, while the civil servant and public employee database encompasses data on nearly 2.1 million individuals, achieving an impressive 95% connectivity rate across all ministries, branches, and localities.
Digitization is a crucial element of digital transformation, involving the conversion of information from analog to digital formats, such as transforming handwritten text and analog audio recordings into digital versions. This process goes beyond simple document scanning; it includes extracting data from documents, storing digital files in repositories or the cloud, and ensuring ongoing maintenance and management of these digital assets.
Digitization enables efficient information storage in databases, streamlining the registration and retrieval processes. This transition significantly improves document security by reducing the risks of data loss, theft, or damage linked to paper formats, ultimately preventing serious data breaches.
Another benefit of digitization is cost savings. Paper documents are expensive to produce, store, and access. Digitizing these files reduces costs and minimizes paper waste, making it environmentally friendly.
Digitization streamlines workflows by eliminating manual data entry, which not only saves time on repetitive tasks but also boosts overall productivity. By minimizing human errors, it enhances data accuracy, while automation of data entry, retrieval, classification, and sorting replaces labor-intensive processes with efficient and precise algorithms.
Managing vast amounts of paper documents poses significant challenges in the contemporary era. Effective solutions necessitate support from advanced computer systems.
Despite advancements in technology, computers still struggle to autonomously read, understand, and analyze paper documents for data extraction and digitization. This challenge demands significant human effort and time, prompting researchers and companies to seek automation solutions. As a result, innovative systems have been developed to efficiently digitize and extract valuable information from documents, aiming to reduce human involvement in the process.
The team has identified a lack of scientific publications on extracting important information from Vietnamese text images, despite the presence of foreign services like SAP addressing this issue. In Vietnam, while Computer Vision technologies exist, they are commercial and their underlying mechanisms remain inaccessible. To lower usage costs, enhance understanding of the technology, and aid in the digitalization of documents, the team is developing a text information extraction system specifically designed for scanned documents. This system aims to automate the extraction of desired information from images, thereby reducing manual labor and facilitating easier storage, retrieval, and management of digitized text.
In response to the challenges and opportunities in digitization, we initiated a research and development project to create an automated system for digitizing forms. This specialized software efficiently extracts essential information, including names and phone numbers, and is compatible with various scanner types. Furthermore, it offers a user-friendly Graphical User Interface (GUI) for seamless form scanning and data management.
1 https://store.sap.com/dcp/en/product/000000000008900237/document-information-extraction
2 https://www.computervision.com.vn
Scientific and Practical Significances
This thesis researches and builds an efficient end-to-end information extraction method tailored for Vietnamese scanned documents. The major contributions of the thesis are as follows:
• Application Programming Interfaces (APIs) to simplify and expedite connections to various scanner hardware on the market
• An auto-digitizing method is proposed, leveraging state-of-the-art techniques from related fields to extract information from predefined forms efficiently
• A dataset is developed comprising Vietnamese form-like documents tailored for use with our data extraction system
• A framework is built for generating synthetic handwritten data to finetune the optical character recognition (OCR) model, thereby enhancing the overall accuracy of our work
• A database system is designed and built for efficient and convenient information management
• A web app GUI for users to interact with the data extraction system.
Objectives
The team focuses on solving the problem of extracting information from images. To accomplish this task, the team has set specific objectives as follows:
• Develop an application programming interface that is compatible with most of the market's current scanners
• Develop a system to extract information from various forms
• Develop a server system to process and store extracted information with low cost and fast response
• Develop a UI to interact with the system.
Scope of The Thesis
This thesis emphasizes the importance of extracting specific information, such as names and phone numbers, from scanned documents. It also highlights the need for support for Vietnamese text, aiming to fill the existing gap in scientific publications related to this topic in Vietnam.
• Document Types
o Primarily handle forms and structured documents scanned using different types of scanners
o Potential to expand to other types of documents in future phases
To improve optical character recognition (OCR) for the Vietnamese language, it is essential to develop specialized algorithms that enhance text recognition accuracy through advanced machine learning and computer vision techniques. Additionally, creating a user-friendly Graphical User Interface (GUI) will facilitate easy form scanning and efficient data management, ensuring a seamless experience for users.
• Integration and Compatibility
o Ensure compatibility with various scanner models
o Develop a system that can be integrated with existing university databases and platforms to facilitate seamless data sharing and verification
• Automation and Efficiency
o Automate the data entry, retrieval, classification, and sorting processes to reduce manual labor
o Focus on reducing human efforts and improving overall productivity within organizations
• Cost Efficiency
o Develop an open and transparent system to reduce costs associated with commercial OCR services
o Provide a cost-effective solution for Vietnamese organizations to digitize their documents
Scientific Research Methods
• Consult scientific literature, papers, journals, and specialized articles on various information extraction methods
• Analyze information extraction methods regarding their working principles, advantages and disadvantages, and real-world applications
The synthesis approach systematically compiles and organizes various text information extraction methods, offering a structured overview of existing techniques. This facilitates comprehensive analysis and informed evaluation, thereby advancing research in text information extraction and its practical applications.
• Implement selected models of Document AI to compare and select the most suitable models that meet the thesis’s requirements
• Evaluate the effectiveness of each method using metrics such as accuracy, completeness, and processing time
• Design, analyze and build a suitable database system for efficient and convenient information retrieval.
Limitations
The initial emphasis on Vietnamese text in language and character recognition may restrict the system's broader applicability due to the scarcity of pretrained models for the Vietnamese language. Additionally, recognizing handwritten text presents potential challenges stemming from the diverse range of handwriting styles.
• Document Quality
o Accuracy of OCR may be affected by the quality of scanned documents, such as poor resolution, skewed images, or noisy backgrounds
o Limited capability to process heavily damaged or obscured documents
Technical constraints in algorithm and model development can significantly impact performance, as achieving high accuracy often necessitates substantial training data and computational resources. Additionally, there may be challenges in effectively managing complex layouts or unstructured documents without implementing extra processing steps.
• Resource Availability
o Development may be constrained by the team's limited knowledge and resources
o Need for continuous research and development to keep up with advancements in OCR technology.
Structure of The Report
Chapter 1 INTRODUCTION: In Chapter 1, the study begins with a concise introduction to the development of an Artificial Intelligence (AI) system focusing on Vietnamese printed documents. This chapter sets the stage by outlining the significance of automating document processing through AI technologies, particularly tailored for Vietnamese language documents. It highlights the challenges and motivations behind the project, providing a glimpse into the expected outcomes and benefits of implementing such a system.
Chapter 2 LITERATURE REVIEW: Chapter 2 delves into a comprehensive review of the theoretical foundations essential for the project's development. It explores key concepts such as Document AI, database management, solutions for Document AI, Document Layout Analysis techniques, and Optical Character Recognition (OCR) technologies. This chapter synthesizes existing knowledge from scholarly sources and industry practices, providing a robust theoretical framework that informs the subsequent stages of the AI system's design and implementation.
Chapter 3 DOCUMENT IMAGE ACQUISITION: The focus of Chapter 3 is on the research and design of an API tailored to seamlessly integrate with the hardware system. This chapter details the methodologies employed to bridge the gap between software and hardware components, ensuring efficient data exchange and system interoperability. It emphasizes the strategic decisions made in API design to optimize system performance and reliability in handling Vietnamese printed documents.
Chapter 4 DEVELOPMENT OF DATA GENERATION: This chapter focuses on creating a diverse dataset essential for training and evaluating OCR models. It details methods for generating synthetic text images that vary in fonts, styles, and contexts encountered in real-world scenarios. The chapter emphasizes preprocessing techniques to enhance image quality and discusses ethical considerations in dataset creation.
Chapter 5 DEVELOPMENT OF TEMPLATE CREATION: Chapter 5 discusses the need for high precision in extracting information from administrative documents, which requires a foundational knowledge base and structured templates. The proposed approach involves creating a pipeline for developing document templates, implementing advanced layout analysis models, and training text recognition models to enhance efficiency and accuracy in information extraction. This system aims to reduce prototyping costs and quickly adapt to new forms.
Chapter 6 DEVELOPMENT OF INFORMATION EXTRACTION: In Chapter 6, the focus shifts to the interaction design and database selection criteria essential for the Information Extraction System. This chapter details the iterative process of system design, emphasizing user-centered principles to optimize user experience and system usability. It discusses the strategic considerations in database selection to ensure scalability, efficiency, and data integrity in managing Vietnamese document information.
Chapter 7 EXPERIMENTAL RESULTS AND DISCUSSION: Chapter 7 focuses on the training and evaluation of the document layout analysis and text recognition models. Additionally, it discusses the performance of each component within the entire system and closes with the conclusion and future works.
LITERATURE REVIEW
Introduction to Document Imaging Method
Document imaging is essential for digitalization, serving as a foundation for AI analysis by converting physical documents through a hardware-software interface. This meticulous conversion utilizes established image capture techniques, producing digital data that supports the application of metadata for effective organization and electronic archiving. This scientific method enhances information management efficiency, especially for organizations overwhelmed by paper documentation. The two main methodologies for document imaging are image capture and document scanning.
Digital still photography presents a unique method for image capture in document imaging, utilizing a digital camera with a light-sensitive sensor and high-quality lens to capture entire documents in a single frame. The camera captures the light reflected off the document, converting it into a digital image file. This technique is particularly advantageous for capturing large, flat documents or those containing intricate details.
Figure 2.1: Sample for image capturing (Source: Internet)
Digital still photography provides notable advantages in convenience and compactness, making it highly portable and user-friendly. Unlike bulky scanners that necessitate dedicated workstations, digital cameras are lightweight and easily transportable. This portability enables efficient document capture in various locations and situations, even in environments with limited workspace.
Digital photography offers exceptional versatility for capturing various document types, including not only standard flat documents but also bound materials like books and historical manuscripts. This capability is essential for comprehensive documentation across scientific fields, such as archival preservation and historical research, where the ability to photograph documents that cannot be easily scanned is vital.
Capturing large, multi-page documents using digital photography can be time-consuming, as it requires manually photographing each page instead of utilizing a scanner that automatically captures documents page by page. This manual process can significantly extend the total time needed for document acquisition, particularly for projects with extensive documentation.
Digital image capture in uncontrolled environments can lead to significant inconsistencies in image quality, influenced by factors such as uneven lighting, varying camera angles, and manual focusing adjustments. These variables can result in noticeable differences in clarity, resolution, and color fidelity across the images of captured pages. Although post-processing techniques can help reduce these discrepancies, they often fall short of the uniformity achieved with dedicated document scanners.
A document scanner uses a specialized imaging technique that involves a moving light source scanning across the document's surface. This light illuminates narrow sections, while a sensor array captures the reflected light and converts it into a digital signal. The document is fed through the scanner, allowing for sequential capture of each section until the entire document is digitized into a single image file. This method is particularly efficient for processing large batches of consistently formatted documents, such as standard A4 or letter-sized papers.
Figure 2.2: Some types of scanners (Source: Internet)
Document scanners are engineered to deliver high-quality and consistent image captures by using a controlled light source and a calibrated sensor array. This technology ensures uniform illumination and consistent image acquisition across all pages, significantly reducing variations in clarity, resolution, and color fidelity. As a result, users receive high-quality digital representations of their original documents.
Modern document scanners with Automatic Document Feeders (ADFs) enhance processing efficiency by enabling automated batch processing with high throughput. These advanced scanners can sequentially feed and capture multiple documents, minimizing the need for manual intervention and significantly improving document capture rates. This automation is particularly beneficial for high-volume document imaging projects, where time efficiency is a critical factor.
Document scanners depend on accurate document feeding to ensure effective image capture. When documents are fed unevenly or skewed, paper jams can interrupt the scanning process. These jams require manual intervention to fix, which can negatively affect workflow efficiency and cause delays.
Document scanners, being larger and less portable than digital cameras, often face mobility challenges that can hinder their ease of use in various situations. This bulkier design makes it difficult to capture documents in confined spaces or remote locations, which is especially problematic for field research and on-site data acquisition tasks.
This thesis investigates a document scanning method that leverages the advantages of consistent image quality, automated batch processing, and the utilization of readily available office scanner hardware.
Introduction to Document AI
Document AI involves diverse data science tasks focused on extracting, analyzing, and interpreting information from various documents. Key tasks include image classification, image-to-text conversion, document question answering, table question answering, and visual question answering. This section outlines a taxonomy of Document AI use cases, showcases top open-source models for each application, and discusses essential aspects such as licensing, data preparation, and modeling techniques.
Document AI solutions have various general use cases that depend on the type of document input and output. These use cases often necessitate a combination of approaches to effectively address enterprise-level challenges.
Optical Character Recognition (OCR) is the process of transforming typed, handwritten, or printed text into a machine-readable format. This established technology is essential for various Document AI applications, enabling the digitization and extraction of text from documents. Numerous open-source and commercial OCR solutions are available, as illustrated in Figure 2.3.
Figure 2.3: OCR is the process of transforming images into text (Source: [1])
Document image classification is the process of organizing documents into specific categories, including forms, invoices, and letters. This classification can be based on visual elements, textual content, or a combination of both. Recent developments in multimodal models, which effectively integrate visual and textual information, have greatly enhanced the accuracy of this classification task.
Figure 2.4: Document image classification examples (Source: [1])
Document layout analysis involves recognizing the physical structure of a document, which includes elements such as text segments, headers, and tables. This process is often approached as an image segmentation or object detection challenge, where models generate segmentation masks or bounding boxes paired with class labels to distinguish various components of the document. An example of document layout analysis is illustrated in Figure 2.5.
Figure 2.5: Document layout analysis examples (Source: [1])
Table detection locates tables within documents, while table extraction forms a structured representation of the data. Additionally, table structure recognition dissects tables into their respective rows, columns, and cells, and table functional analysis identifies key-value pairs within the tables. These processes are akin to document layout analysis and frequently utilize object detection models.
Data preparation is a crucial and challenging component of Document AI, as it requires properly annotated data for training effective models. Key considerations in this process include ensuring data quality, relevance, and accuracy to enhance model performance.
Figure 2.6: Table detection examples (Source: [1])
The effectiveness of machine learning models is significantly influenced by the scale and quality of the training data. To achieve optimal performance, it is essential to utilize high-quality images and sufficiently large datasets. Inadequate image quality can severely impact the accuracy of Optical Character Recognition (OCR) and various document analysis tasks.
To achieve optimal results in OCR implementation, it is essential to explore various methodologies, including open-source tools like Tesseract, commercial options such as the Cloud Vision API, and advanced multimodal models like Donut. This flexibility enables adaptation to specific use cases, ultimately enhancing overall performance.
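As a small illustration of the open-source route, the following sketch (assuming the Tesseract engine, its Vietnamese language pack, and the pytesseract wrapper are installed; the file name is a placeholder) runs OCR on a scanned page:

import pytesseract
from PIL import Image

# "vie" selects Tesseract's Vietnamese language pack.
page = Image.open("scanned_form.png")
text = pytesseract.image_to_string(page, lang="vie")
print(text)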
To achieve optimal predictive accuracy, start with a small, annotated dataset of a few hundred documents and assess performance meticulously. Once an effective method is established, gradually scale up the dataset to enhance results. Utilizing annotation tools that support bounding boxes is crucial for tasks such as layout identification and document extraction.
• Annotation Tools and Techniques: Using the right annotation tools is crucial. For OCR tasks, it is important to ensure that text annotations are precise and cover various types of text, such as printed, handwritten, and cursive.
Introduction to Document Layout Analysis
Document layout analysis (DLA) is a vital preprocessing step in document understanding systems, crucial for detecting and annotating the physical structure of documents. This process underpins applications like document retrieval, content categorization, and text recognition. By identifying uniform blocks within documents and defining their spatial relationships, DLA enhances the effectiveness of subsequent analysis and recognition phases.
The DLA pipeline consists of various phases that differ based on the complexities of document layouts and specific analysis goals. At present, there is no one-size-fits-all DLA algorithm that can handle all document types or satisfy every analytical requirement. This variation highlights the need for customized strategies in DLA methodologies.
Document layouts vary significantly across different document types, with printed documents categorized into six main types: rectangular, Manhattan, non-Manhattan, multi-column Manhattan, horizontal overlapping, and diagonal overlapping. Examples of these diverse document layouts are illustrated in Figure 2.7.
Figure 2.7: Document Layouts: (a) Regular, (b) Manhattan-based, (c) Non-Manhattan, (d) Multi-column Manhattan, (e) Arbitrary Complex, (f) Overlapping horizontally and diagonally (Source: [2])
This classification framework can be expanded to encompass historical manuscripts, which frequently exhibit irregular layouts. Since these manuscripts are predominantly handwritten, this categorization method is also relevant for modern handwritten document designs.
The variations in document layouts and analysis goals result in different phases of Document Layout Analysis (DLA) processing, which differ among algorithms. A common workflow pattern has been recognized across various DLA studies, leading to the development of a generalized DLA framework, illustrated in Figure 2.8.
Figure 2.8: General document layout analysis framework (Source: [2])
This framework encompasses five principal phases: preprocessing, analysis parameter estimation, layout analysis, post-processing, and performance evaluation. Each phase is briefly elucidated below.
Preprocessing is essential for converting raw document images into formats that are suitable for specific document analysis methods. Document analysis techniques typically require input images to be clean, binary, or de-skewed. To meet these requirements, the preprocessing phase employs key procedures such as binarization, de-skewing, and image enhancement, ensuring that the input images are optimized for analysis.
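As a minimal sketch of these preprocessing procedures (using OpenCV and NumPy; the file name, threshold choice, and angle handling are illustrative and the angle convention depends on the OpenCV version), a page can be binarized and de-skewed as follows:

import cv2
import numpy as np

# Load a scanned page in grayscale (placeholder file name).
gray = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)

# Binarization: Otsu's method chooses a global threshold automatically.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# De-skewing: estimate the dominant skew angle from the minimum-area
# rectangle enclosing the foreground pixels.
coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
if angle > 45:
    angle -= 90

# Rotate the page around its center to remove the estimated skew.
h, w = gray.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_NEAREST)
cv2.imwrite("deskewed_page.png", deskewed)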
Analysis Parameters are essential predefined measurements for effective document analysis in DLA methods, categorized into model-driven and data-driven types. Model-driven parameters adjust DLA models to meet specific analysis objectives, including the configuration of Multi-layer Perceptron structures and weight initialization for training. In contrast, data-driven parameters are calculated from dataset metrics, such as average inter-line spacing and character dimensions, enhancing the analysis process.
Layout Analysis involves three main strategies: bottom-up, top-down, and hybrid. The bottom-up approach begins with small elements, such as pixels or connected components, and gradually merges them into larger regions until certain stopping conditions are reached. In contrast, the top-down strategy starts with large regions and recursively divides them into smaller zones based on homogeneity criteria. The hybrid approach integrates elements from both bottom-up and top-down methods to provide a thorough layout analysis.
Post-processing is an essential step in Document Layout Analysis (DLA) algorithms, even though it is often optional. It improves and refines analysis results to adapt to various document layouts, addressing any shortcomings in segmentation outcomes. This process is vital for achieving precise extraction of document structures.
Performance evaluation in Document Layout Analysis (DLA) encompasses both physical and logical assessments. The physical analysis focuses on identifying document structures and defining homogeneous regions, while the logical analysis categorizes these regions into specific document elements such as figures, headings, and paragraphs. This evaluation process typically involves comparing the segmented results to ground-truth entities at both pixel and region levels using established matching techniques. Notably, the PAGE framework developed by Pletschacher and Antonacopoulos offers a standardized method to manage various document representations and tackle different stages of DLA, enabling effective description of both ground-truth and segmented results in Extensible Markup Language (XML) files, covering aspects like image borders, layout structure, content, geometric corrections, and binarization processes.
This section highlights the essential tasks involved in document layout analysis (DLA), including parameter estimation, page segmentation, and post-processing. While parameter estimation and post-processing are optional and often discussed briefly, we delve deeper into the diverse methods of DLA, which are classified into classical strategies: bottom-up, top-down, and hybrid approaches. Figures 2.9 and 2.10 provide a visual taxonomy of these methods and illustrate the input and output of DLA, respectively.
Figure 2.9: Document layout analysis taxonomy (Source: [2])
The bottom-up strategy in document analysis generates insights from data at a granular level by estimating parameters through statistical analysis of pixel distributions and the characteristics of connected components, words, text lines, or regions. This method typically begins with detailed examination at finer levels within an image.
The bottom-up strategy in document analysis involves assembling smaller elements like pixels, components, or words into larger regions, stopping once specific analytical goals are met. This approach encompasses five key areas: connected component analysis, texture analysis, learning-based methods, Voronoi diagrams, and Delaunay triangulation.
Figure 2.10: A document is passed through a generic layout analysis model, resulting in a layout segmentation mask with the following classes: title (blue), text (red), table (green), and figure
Connected component analysis plays a crucial role in layout analysis by offering flexibility in characterizing diverse shape properties. One of the pioneering bottom-up algorithms, the Docstrum algorithm, effectively organizes connected components using a polar structure defined by distance and angle, facilitating final segmentation. Although initially designed for printed documents, Docstrum's methodology has also been successfully applied to tackle layout challenges in historical manuscripts by utilizing the local features of connected components.
Texture analysis in document-image processing enables the efficient identification of elements using techniques classified as bottom-up or top-down approaches. Bottom-up methods begin by extracting texture features from pixel data, followed by the grouping of pixels into homogeneous regions. An example of this is spatial autocorrelation, which plays a crucial role in detecting textures.
Introduction to Optical Character Recognition
OCR consists of two main elements: text detection and text recognition, which generally function separately with unique models for each task. This section explores the latest techniques for these components and illustrates the processing of documents across different OCR systems.
In an OCR system, a document can be processed through one of three paths: the left path utilizes an object detection model to create bounding boxes, followed by transcription; the middle path employs a generic text instance segmentation model to identify and transcribe text-containing pixels; and the right path uses a character-specific instance segmentation model to match characters to pixels. Despite the differing approaches, all methods yield the same structured output, with the illustrative document sourced from FUNSD.
Figure 2.15: General OCR process (Source: [26])
Text detection involves identifying the presence of text within an image. Typically, the input image is represented by a three-dimensional tensor of shape C × H × W, where C represents the number of channels (typically three, for red, green, and blue) and H and W denote the height and width of the image. Text detection poses considerable challenges due to the varying shapes, orientations, and potential distortions of text. This section examines two common methods for text detection: object detection and instance segmentation. The object detection method identifies text by outputting bounding box coordinates, whereas instance segmentation creates a mask that differentiates between text-containing and non-text-containing pixels.
Recent advancements in scene text detection have led to the development of two main methodologies: regression/object detection methods and segmentation/instance segmentation methods. Each approach possesses distinct advantages and disadvantages, which have been improved over time to address the challenges posed by diverse text orientations and shapes.
Text detection can be effectively approached as object detection by predicting bounding boxes around text instances. This technique frequently utilizes established object detection models, adapting them for the specific task of detecting text.
• TextBoxes [28]: This method modifies Single Shot MultiBox Detector (SSD) by adjusting the anchors and convolutional kernel scales for text detection
• EAST [29]: An anchor-free method that applies pixel-level regression for detecting multi-oriented text instances
FCENet utilizes Fourier signature vectors to predict text instances in the Fourier domain, subsequently reconstructing text contour point sequences in the image spatial domain through Inverse Fourier Transformation (IFT).
Despite their efficiency and simple post-processing algorithms like non-maximum suppression, these methods often struggle with accurately representing irregular shapes, such as curved text.
Text Detection as Instance Segmentation: This method focuses on pixel-level predictions to identify text regions, making them adept at handling irregularly shaped text:
• PSENet [31]: Proposes progressive scale expansion by segmenting text instances with different scale kernels
• SAST [32]: Utilizes a context-aware multi-task learning framework, grounded in a FCNN, to acquire multiple geometric properties essential for reconstructing polygonal representations of text regions
• DBNet [42]: Performs binarization within a segmentation network, adaptively setting the binarization threshold to enhance performance
• Centripetal Text (CT) [33]: Text instances are decomposed into text kernels and centripetal shifts, with the latter facilitating pixel aggregation by guiding external text pixels toward the internal text kernels
In conclusion, both object detection-based and instance segmentation-based methods have significantly advanced scene text detection and are applicable to text detection in document forms. The integration of deep learning in text detection for document forms has significantly improved the efficiency and accuracy of these systems, enabling them to effectively handle diverse text orientations and shapes.
Prior to the rise of deep learning, numerous reviews focused on traditional text detection and recognition methods. Traditional text recognition treats the character recognition of text lines as a multi-label learning task. Key techniques in traditional text detection include joint component analysis and the sliding window method. Additionally, traditional recognition encompasses character segmentation, text line segmentation, text binarization, single character recognition, word correction, and the mapping of strings to images based on learned label characteristics, often framing these tasks as classification challenges. Text recognition algorithms can be categorized into six main types based on their network structure: Connectionist Temporal Classification (CTC)-based algorithms, encoder-decoder with attention algorithms, Transformer-based algorithms, segmentation-based recognition algorithms, end-to-end recognition algorithms, and various other recognition methods.
Deep learning-based text recognition models outperform traditional methods by leveraging advanced algorithms. These algorithms encompass key processes such as image correction, feature extraction, and sequence prediction, leading to enhanced accuracy and efficiency in text recognition tasks.
Recognizing Vietnamese text in images presents a significant challenge due to its unique characters and diacritics. To tackle this issue, various methods have been developed to accurately interpret and transcribe Vietnamese text from visual inputs. Among these techniques, VietOCR (Vietnamese Optical Character Recognition) emerges as a leading solution.
3 https://pbcquoc.github.io/vietocr
Pham Ba Cuong Quoc proposed an optical character recognition method for scene text in the Cinnamon competition, focusing on recognizing Vietnamese characters in images. This approach utilizes TransformerOCR or AttentionOCR, featuring a basic CNN architecture for initial feature extraction. The extracted data is then processed through a Transformer or Attention model to analyze and learn from the information, ultimately producing recognized characters from the input image. Sample results of VietOCR are illustrated in Figure 2.16.
Figure 2.16: Vietnamese text recognition results of VietOCR (Source: [34])
The VietOCR model incorporates two distinct architectures, TransformerOCR and AttentionOCR, for optical character recognition from images.
The TransformerOCR model integrates a basic CNN architecture for extracting essential features from input images and a Transformer for text recognition. This dual-component approach allows the CNN to process images and extract vital features, which are then input into a Transformer encoder. This encoder simultaneously receives image features and character labels, encoding character-level information into vectors that facilitate the model's understanding of the textual content within the images.
The AttentionOCR model, as illustrated in Figure 2.18, consists of two primary components: a foundational CNN architecture for extracting features from the input image and an Attention Seq2Seq model for processing these features. The CNN network plays a crucial role in the initial feature extraction process.
To prepare the feature map for input into a Long Short-Term Memory (LSTM) model, which requires inputs of size hidden × timestep, the last two dimensions (height × width) of the feature map, sized channel × height × width, must be flattened. This process ensures compatibility with the LSTM's input requirements.
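A minimal PyTorch-style sketch of this reshaping step (the dimension sizes are illustrative and this is not VietOCR's actual code) is given below:

import torch

# Feature map produced by the CNN backbone for one image:
# shape (channel, height, width).
feature_map = torch.randn(192, 8, 32)
c, h, w = feature_map.shape

# Flatten height x width into a single timestep axis, so the decoder
# receives a sequence of h*w steps with c features per step.
sequence = feature_map.reshape(c, h * w).permute(1, 0)
print(sequence.shape)  # torch.Size([256, 192]) -> (timestep, hidden)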
Figure 2.17: TransformerOCR architecture in VietOCR (Source: [34])
Advanced deep learning techniques enhance optical character recognition by employing convolutional neural networks for feature extraction and utilizing Transformer or Attention Seq2Seq models for effective sequence processing and text recognition.
Scale Invariant Feature Transform
SIFT (Scale-Invariant Feature Transform) consists of four key stages: scale-space extrema detection, keypoint localization, orientation assignment, and keypoint descriptor computation. The scale space of a filled form image is represented as a function, L(x,y,σ), which is generated by convolving a variable-scale Gaussian, G(x,y,σ), with the input filled form image, I(x,y).
𝐿(𝑥, 𝑦, 𝜎) = 𝐺(𝑥, 𝑦, 𝜎) ∗ 𝐼(𝑥, 𝑦) (2.1)
Scale-space extrema are detected in the difference-of-Gaussian function convolved with the image, 𝐷(𝑥, 𝑦, 𝜎), which can be computed as:
The method outlined in [16] was employed to accurately localize the keypoint in the image by determining the interpolated location of the maximum. This technique utilizes the scale-space function D(x,y,σ), which is adjusted to place the origin at the sample point, along with its Taylor expansion.
The location of the extremum, 𝑥̂, is determined by:
To achieve image rotation invariance, each keypoint is assigned a specific orientation, as demonstrated in Equation (2.5). Additionally, gradients are computed within a neighborhood surrounding the keypoint, which is defined by the selected scale, as illustrated in Equation (2.6).
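Because the equation bodies did not survive formatting, the standard forms of these quantities in Lowe's SIFT formulation [16] are reproduced here for reference (in LaTeX notation, corresponding to the equations cited above):

% Difference-of-Gaussian function (cf. Eq. 2.2):
D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)

% Taylor expansion of D shifted to the sample point (cf. Eq. 2.3):
D(\mathbf{x}) = D + \frac{\partial D^{T}}{\partial \mathbf{x}}\,\mathbf{x} + \frac{1}{2}\,\mathbf{x}^{T}\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\,\mathbf{x}

% Interpolated location of the extremum (cf. Eq. 2.4):
\hat{\mathbf{x}} = -\left(\frac{\partial^{2} D}{\partial \mathbf{x}^{2}}\right)^{-1}\frac{\partial D}{\partial \mathbf{x}}

% Gradient orientation and magnitude around each keypoint (cf. Eqs. 2.5 and 2.6):
\theta(x, y) = \tan^{-1}\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}

m(x, y) = \sqrt{\bigl(L(x+1, y) - L(x-1, y)\bigr)^{2} + \bigl(L(x, y+1) - L(x, y-1)\bigr)^{2}}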
Metrics Evaluation in Document AI
Intersection over Union (IoU) is a metric used to evaluate the overlap between a predicted bounding box and the corresponding ground truth bounding box. This measure assesses the model's ability to accurately capture the content within designated layout areas. A higher IoU score signifies a stronger alignment between the predicted content and its actual placement.
Precision and Recall: Precision quantifies the proportion of correct layout areas among all positive predictions, assessing the model's capability to avoid false positives. On the other hand, Recall calculates the proportion of true positives among all actual positives, measuring the model's ability to detect all instances of a class:
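In standard notation (with TP, FP, and FN denoting true positives, false positives, and false negatives, and B_p, B_gt the predicted and ground-truth regions), these metrics are defined as:

\mathrm{IoU} = \frac{\lvert B_{p} \cap B_{gt} \rvert}{\lvert B_{p} \cup B_{gt} \rvert}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}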
Mean Average Precision (mAP) is a metric that evaluates a model's performance by calculating the Average Precision (AP) across various layout areas. By determining the area under the precision-recall curve, mAP provides a comprehensive single value that reflects both precision and recall. This metric is particularly useful for tasks involving multiple distinct layout regions, such as titles, text, and tables; the corresponding formula is given after the definition below.
• n is the number of IoU thresholds from 0.5 to 0.95 in increments of 0.05
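Under the COCO-style convention assumed above, the metric can be written as:

\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{AP}_{t_{i}}, \qquad t_{i} \in \{0.50, 0.55, \ldots, 0.95\}

where \mathrm{AP}_{t_{i}} is the area under the precision-recall curve computed at IoU threshold t_{i}, averaged over the layout classes.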
Character Error Rate (CER) and Word Error Rate (WER) are metrics originally used to assess Automatic Speech Recognition (ASR) systems. In our text recognition experiments, we utilized these metrics to measure the accuracy of the system. They represent the percentage of characters or words incorrectly identified compared to the actual text, with lower values indicating better accuracy and a score of 0 representing perfect recognition. CER and WER are calculated using equations (2.12) and (2.13), reproduced after the following definitions:
• 𝑆 is the number of substitutions
• 𝐷 is the number of deletions
• 𝐼 is the number of insertions
• 𝑁 𝐶𝐸𝑅 is the number of characters in the reference
• 𝑁 𝑊𝐸𝑅 is the number of words in the reference
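Using the quantities defined above, the standard forms of equations (2.12) and (2.13) are:

\mathrm{CER} = \frac{S + D + I}{N_{CER}} \qquad (2.12)

\mathrm{WER} = \frac{S + D + I}{N_{WER}} \qquad (2.13)

where the substitutions, deletions, and insertions are counted at the character level for CER and at the word level for WER.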
DOCUMENT IMAGE ACQUISITION
Objectives
• Developing an application programming interface that is compatible with most of the market's current scanners
• Developing a network to facilitate multi-user access to network-shared scanners.
Technical Requirement
This thesis focuses on creating a versatile data extraction system compatible with various commercially available scanners. By utilizing user-owned scanners, this approach reduces the necessity for additional hardware investments and enhances system scalability. The architecture requires the integration of two essential components: the scanner hardware and a comprehensive scanner software suite, which includes a driver application programming interface.
To guarantee successful operation and effective AI analysis, some baseline hardware requirements should be satisfied:
Document imaging method: Document scanning
Scanner type: Automatic Document Feeder (ADF)
Color depth: At least 8 bits per pixel
Connection: TCP/IP or USB
To ensure optimal performance, scanner software must maintain high system stability and facilitate seamless updates. It should offer broad compatibility with various scanner models, providing users with flexibility. Additionally, the software needs to support multi-user and multi-scanner functionality to enhance workflow efficiency, all while featuring an intuitive graphical user interface and user-customizable functions to optimize performance for specific scanning tasks. Figure 3.1 indicates the design of the scanner interface.
Figure 3.1: Design of scanner interface
Hardware System
The project requires a document capture method that emphasizes image quality, processing efficiency, and optimal resource utilization, as outlined in sections 2.1 and 3.2. The Brother MFC-795CW scanner, featuring an Automatic Document Feeder (ADF), meets these essential criteria effectively.
Document scanners like the MFC-795CW ensure superior image fidelity by using controlled light sources and high-resolution sensors, which consistently capture high-quality images of administrative documents. This capability is crucial for maintaining the detail and integrity of official records.
The MFC-795CW's Automatic Document Feeder (ADF) enhances batch processing by enabling near-continuous scanning of large volumes of documents with minimal human intervention. This efficient system significantly reduces processing time compared to traditional individual capture methods, making document management faster and more streamlined.
• Reduced Manual Workload: The ADF minimizes manual document handling and page turning, mitigating human error and improving workflow efficiency
• Reusing Existing Infrastructure: Since the MFC-795CW is a commonly used office scanner, utilizing it promotes resourcefulness by maximizing existing equipment and eliminating the need for additional capital expenditure on specialized capture devices
The Brother MFC-795CW scanner is designed for high-volume document capture, emphasizing image quality, processing speed, and efficient workflow. Its ADF functionality enhances resource utilization, making it an ideal choice for demanding scanning projects.
Scanner Interface
Developing a scanner interface API that integrates with various scanner models is challenging due to the differences in underlying hardware. Each scanner model features unique chipsets, components, and sensors, requiring specific drivers to utilize their functionalities. For example, different sensor types have varying dynamic ranges that affect detail capture at different brightness levels. A well-designed driver bridges these hardware differences, providing a consistent API interface that allows programmers to use the same commands across diverse scanners.
The API plays a crucial role in enabling software to interact seamlessly with scanners by serving as a bridge to scanner drivers. These drivers, which are regularly updated, translate software commands into specific instructions for the scanner hardware, ensuring compatibility across various scanner models. The architecture of the API is illustrated in Figure 3.2.
• Driver: Each scanner model requires a distinct driver stored in a separate .so file, which contains compiled code tailored to that specific model. This driver includes essential functions for initiating scans, adjusting settings, and processing images.
• Driver manager: The driver manager collaborates with the dynamic linker. It discovers available scanners and locates the corresponding .so file for the connected model.
• Dynamic linking: Dynamic linking occurs at runtime, where the program identifies unresolved symbols related to scanner functionalities. It then replaces these symbols with the corresponding functions from the driver (.so file) provided by the driver manager.
• Central control block: The central control block serves as the core of the program, managing interactions with the scanner. It processes user input for device settings and image parameters, translating these into commands for the relevant scanner driver. This functionality allows for the adjustment of settings such as source and scan mode, as well as contrast and brightness. By facilitating seamless communication and image acquisition, the central control block ensures compatibility with various scanner models.
Figure 3.2: Block diagram of Application Programming Interface architecture
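The following minimal Python sketch illustrates the dynamic-linking idea with ctypes; the driver path and the exported symbol names are hypothetical placeholders, not the actual driver API used in this thesis:

import ctypes

# The driver manager would map the detected scanner model to its driver file.
driver_path = "/usr/lib/scanner-drivers/model_x_driver.so"  # hypothetical path

# Load the shared object at runtime (the dynamic-linking step).
driver = ctypes.CDLL(driver_path)

# Resolve symbols exported by the driver; these names are hypothetical.
driver.open_device.restype = ctypes.c_int
handle = driver.open_device(b"usb:001:004")
driver.set_option(handle, b"resolution", 300)
driver.start_scan(handle)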
This section discusses two key connection methods in a document scanning software program. The first method enables a single user to connect to multiple scanners via a central access point router. The second method focuses on centralized management and user access control within the software, enhancing overall usability and security.
This research investigates the optimal configuration of a shared pool of scanners for multiple users by analyzing the unique components and advantages of different architectures. The goal is to identify the best setup for various scanning scenarios tailored to the program's functionalities and user needs.
Figure 3.3: Single-User to Multi-Scanner
A single-user to multi-scanner setup utilizes a centralized access point router to connect one user with multiple scanners. This network router serves as a bridge, enabling seamless communication between the user's PC and the selected scanners. This configuration is straightforward, requiring minimal setup on both the scanners and user devices, making it an efficient solution for streamlined scanning operations.
Figure 3.4: Multi-User to Multi-Scanner (Server-Based)
The Multi-User to Multi-Scanner server-based approach allows multiple users to simultaneously access a shared pool of scanners. Scanners are directly connected to a server via wired USB or TCP/IP connections, enabling efficient communication between user PCs and the scanning resources.
This architecture, built around a server on a dedicated TCP/IP network, ensures centralized management and enhanced security control. The setup also supports essential features such as job queuing and load balancing, which optimize scanner utilization for multiple users.
Scanning Process
There are four steps in the scanning process, as seen in Figure 3.5: creating a connection to the desired scanner, device setup, image acquisition, and image transmission.
Figure 3.5: Block diagram of scanning process
3.5.1 Create Connection to Desired Scanner
The application establishes a TCP/IP connection to the desired scanner, or alternatively to an available scanner if the first is not accessible. The driver serves as a translator, allowing the application to interact seamlessly with the scanner's hardware-specific protocols. We utilize TCP/IP in our project as it guarantees complete data transmission, ensuring that the receiver can accurately assemble the information in the correct order.
Based on the retrieved scanner information and the user's desired output, the application configures the scanning device through the API. This configuration specifies:
• Scan Mode: color channel of the image
• Source: Flatbed scanning for single documents or automatic document feeder for multi-page documents
Then, the API instructs the scanner to capture the image based on the settings. Depending on the requirements of the subsequent AI system, we adjust image parameters including the following (a condensed sketch of this setup follows the list):
• Resolution: Desired image resolution, which determines the level of detail captured
• Image Adjustments: Fine-tuning of image parameters like brightness and contrast may be offered by some scanners and can be adjusted here
• Scan Area: Definition of the specific document region to be captured (entire document or a selected portion)
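The thesis implements its own API on top of the scanner drivers; as a rough analogue, the sketch below shows the same connect-configure-scan flow using the open-source python-sane binding (device names, option names, and their accepted values depend on the installed driver):

import sane

sane.init()
devices = sane.get_devices()        # discover reachable scanners
device = sane.open(devices[0][0])   # connect to the first scanner found

# Device setup: the options mirror the parameters described above.
device.mode = 'Color'               # scan mode / color channels
device.source = 'ADF'               # automatic document feeder
device.resolution = 300             # dots per inch
# Brightness and contrast options are exposed only by some drivers.

image = device.scan()               # image acquisition (returns a PIL image)
image.save('page_001.png')          # image transmission / storage
device.close()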
To ensure efficient document processing, it is essential to load papers straight and evenly on the tray, preventing jams and damage. Once the input is set to the Automatic Document Feeder (ADF), the scanning function is initiated. The ADF scanner employs various sensors, including optical, ultrasonic, and pressure sensors, to detect the presence of documents in the input tray. These sensors are strategically placed on the paper tray or along the paper path to facilitate accurate scanning.
• Optical sensors: These sensors typically emit and detect light. When a document is present, it will block the light path, signaling the scanner to initiate feeding
• Ultrasonic sensors: These sensors emit high-frequency sound waves. The presence of a document will alter the reflection pattern of these sound waves, allowing the scanner to detect it
• Pressure sensors: These sensors measure the physical pressure exerted on the tray. When a document is placed on the tray, it creates pressure that the sensor can detect
The Pick Roller begins the document feeding process by rotating under moderate pressure to separate a single sheet from the stack. The Brake Roller ensures that only one sheet enters the mechanism at a time, preventing multiple sheets from being fed simultaneously. Document verification is conducted using sensors, including the Ultrasonic Sensor and Document Sensor, to ensure accurate sheet feeding. In the event of an error, such as double feeding caused by thin documents, the system retracts the document for a retry. Finally, the Feed Roller propels the verified document forward in the process.
The scanning platen, a glass surface, enables efficient image acquisition during the scanning process. After a document is scanned, the ejection rollers, which include the Eject Roller and Plastic Idler Roller, activate to transport the scanned document to the designated output tray.
When a document is placed on the scanning platen, the image acquisition process begins with an LED light source illuminating the document to improve feature contrast. Linear sensor arrays, similar to a line-scanning camera, capture the reflected light in a line-by-line manner as the document moves across the sensor. Each pixel in these arrays converts the incoming light into an electrical signal, where the intensity of the signal correlates directly with the document's reflected light, allowing brighter areas to produce stronger signals and darker areas to generate weaker signals.
The API processes the signal from the scanner and converts it into an image using a frame-based transmission approach. This method divides the image into a series of frames, each representing the full image area and capable of managing individual color channels separately. Frames can interleave all channels or transmit each color channel (red, green, and blue) as distinct frames. The transmission of pixel values is determined by bit depth, where the Most Significant Bit (MSB) corresponds to the leftmost pixel and the Least Significant Bit (LSB) to the rightmost. For higher bit depths (8 or more), each byte corresponds to a single sample value, while depths of 16 and above use multiple bytes per sample, following the machine's native byte order.
Figure 3.6: Bit and byte order of image data
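As a small illustration of this decoding step (assuming an 8-bit, channel-interleaved RGB frame; the dimensions and file names are placeholders), the received bytes can be reassembled into an image as follows:

import numpy as np
from PIL import Image

width, height, channels = 2480, 3508, 3       # e.g. an A4 page at 300 dpi
raw = open("frame.raw", "rb").read()          # bytes received from the scanner

# 8-bit depth: one byte per sample, channels interleaved pixel by pixel.
pixels = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, channels)
Image.fromarray(pixels, mode="RGB").save("scan.png")

# For 16-bit depth the same call would read two bytes per sample in the
# machine's native byte order, e.g. dtype=np.uint16.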
DEVELOPMENT OF DATA GENERATION
Objectives
The main goal of the data generation system is to produce a detailed dataset for training and assessing Document Layout Analysis and parsing models. This includes creating images of administrative documents filled with handwritten text, showcasing a range of text styles, fonts, and contexts that an OCR system may face in practical applications. The system is designed to deliver comprehensive and diverse data for optimal model performance:
• Diverse Text Styles: Ensuring the dataset includes various fonts, sizes, and styles to cover a wide range of text appearances
• Realistic Text Contexts: Simulating different text contexts such as printed documents, handwritten notes, and forms
• Label Accuracy: Ensuring the generated data has accurate and precise labels for effective model training and evaluation
• Volume and Variety: Generating sufficient data volume and variety to enable robust model generalization and performance
In addition, we leverage this dataset to create another dataset for training the OCR model by cropping text regions from the filled document images and by generating synthetic text images.
Font Generation
Font generation plays a vital role in synthetic data creation, focusing on generating filled document images with a variety of font styles. We began by designing a template, as illustrated in Figure 4.1, on the Calligraphr website; it consists of four pages featuring 228 Vietnamese characters, special characters, and numbers, all accompanied by background guides to assist users in maintaining the correct character size. To complete this template, we recruited 20 volunteers who were tasked with writing the characters within designated boxes, ensuring adherence to the specified sample sizes. Subsequently, we utilized a scanner to digitize the finished templates, aligning them with their QR code markers so that the scans are neat. Finally, we employed an algorithm to convert the scanned sheets into fonts.
Figure 4.1: Glyph template with characters background page 1 (Source: [36])
This template features four corner markers for aligning and identifying the character area, along with a unique Quick Response (QR) code in the top right corner that distinguishes each page.
4.2.2 Conversion of Template Images to Character Images
In this section, we divided the template image into numerous smaller images, each containing a single character.
The Calligraphr template includes four large square markers that play a crucial role in accurately positioning characters. These markers help define the character areas within the template, ensuring precise alignment and clarity in the final output.
We analyze a template image by detecting markers through contour identification and filtering based on area and shape. After computing their centroids, we remove duplicate or nearby coordinates using a distance threshold. Finally, we organize the corner coordinates by their angle in relation to the centroid of the corners.
After establishing a consistent list of corner coordinates, we identified the tiles arranged in an 8x8 matrix. This led us to segment the images into a grid of tiles, designating certain areas for the placement of the QR code, as illustrated in Figure 4.2.
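A simplified sketch of the marker detection and tiling steps (the file name, area filter, and shape tolerance are illustrative rather than the exact values used) is given below:

import cv2
import numpy as np

page = cv2.imread("template_page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Candidate markers: large, roughly square contours.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
centers = []
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)
    if cv2.contourArea(cnt) > 2000 and 0.8 < w / h < 1.2:
        centers.append((x + w // 2, y + h // 2))

# Sort the corner markers by their angle around the centroid of all markers.
cx, cy = np.mean(centers, axis=0)
corners = sorted(centers, key=lambda p: np.arctan2(p[1] - cy, p[0] - cx))

# Crop the character area spanned by the markers and cut it into 8 x 8 tiles.
xs, ys = [p[0] for p in corners], [p[1] for p in corners]
area = page[min(ys):max(ys), min(xs):max(xs)]
th, tw = area.shape[0] // 8, area.shape[1] // 8
tiles = [area[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
         for r in range(8) for c in range(8)]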
Figure 4.3: QR code image is cut from the image
We developed a configuration file that includes a character list for each template page and utilized a QR code, illustrated in Figure 4.3, to access this list. This approach ensures a precise connection between each image and its respective character.
Figure 4.4: Character image in PNG format (65.png)
We saved character images to a specified directory, organizing them based on their Unicode code points. A sample character image and its name are shown in Figure 4.4.
4.2.3 Conversion of Character Images to SVG Format Images
This section introduces a class designed to systematically convert character PNG images into bitmap (BMP) and scalable vector graphics (SVG) formats, utilizing Potrace for efficient vector graphics conversion.
Figure 4.5: Character image in BMP format (65.bmp)
We first converted the image to RGBA format and resized it to 100x100 pixels. To achieve a binary representation, we applied a threshold, transforming each pixel to black or white according to its intensity values.
Pixels with red, green, and blue values exceeding 200 are transformed into white, while all other pixels are turned black. The final image is then saved in BMP format, as illustrated in Figure 4.5.
Figure 4.6: Character image in SVG format (65.svg)
Finally, we used Potrace to transform raster images into scalable vector graphics suitable for a wide range of applications, as shown in Figure 4.6
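A compact sketch of this conversion (file names are placeholders; it assumes Pillow and the potrace command-line tool are installed) is shown below:

import subprocess
from PIL import Image

img = Image.open("65.png").convert("RGBA").resize((100, 100))

# Threshold: pixels with R, G, and B all above 200 become white, others black.
def binarize(px):
    r, g, b, _ = px
    return (255, 255, 255) if r > 200 and g > 200 and b > 200 else (0, 0, 0)

bw = Image.new("RGB", img.size)
bw.putdata([binarize(px) for px in img.getdata()])
bw.convert("1").save("65.bmp")

# Trace the bitmap into an SVG outline with potrace.
subprocess.run(["potrace", "65.bmp", "--svg", "-o", "65.svg"], check=True)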
4.2.4 Conversion of SVG Format Character Images to Handwritten Fonts
In this section, we utilized FontForge, an open-source font editing software, to perform all tasks through its commands and graphical user interface. FontForge enables users to create, modify, and convert digital fonts efficiently. For this experiment, we specifically used FontForge to generate and edit fonts sourced from a directory of SVG image files.
We start by creating a configuration file that lists all glyphs along with their bearing and kerning settings. Next, we process the directory of character SVG files, named according to Unicode conventions, to create glyph objects and import their outlines. Finally, we assign a default width to the space character.
Next, we configured the bearing and kerning [38] for each glyph according to our specifications:
We configure the left and right side bearings for each character glyph in the font, retrieving these values from the configuration dictionary. If specific bearings are not specified, default values are applied. These bearings are then mapped to their respective glyph names within the font.
We configure font kerning values by reading from a specified configuration of rows and columns. The process can either auto-generate kerning classes or use a manually defined kerning table. When auto-kern is activated, kerning classes are calculated automatically; if not, predefined offsets are applied for precise adjustments.
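The sketch below illustrates the FontForge scripting involved, assuming SVG files are named by Unicode code point as in section 4.2.2; the font name, space width, and default bearings are illustrative, and kerning configuration is omitted for brevity.

```python
# Minimal FontForge scripting sketch (run with FontForge's bundled Python,
# e.g. `fontforge -script build_font.py`); names and values are illustrative.
import os
import fontforge

def build_font(svg_dir, out_path, left_bearing=60, right_bearing=-50):
    font = fontforge.font()
    font.familyname = font.fontname = "MyHandwriting"

    for name in sorted(os.listdir(svg_dir)):
        if not name.endswith(".svg"):
            continue
        codepoint = int(os.path.splitext(name)[0])   # files named by code point, e.g. 65.svg
        glyph = font.createChar(codepoint)
        glyph.importOutlines(os.path.join(svg_dir, name))
        glyph.left_side_bearing = left_bearing
        glyph.right_side_bearing = right_bearing

    font.createChar(32).width = 250                  # default width for the space character
    font.generate(out_path)

build_font("svg_glyphs", "MyHandwriting.ttf")
```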
Figure 4.7: A glyph after automatically setting both side bearing values (left_value, right_value) = (60, -50)
Following the configuration of the parameters, the FontForge application proceeded to generate the digital font. This newly created font object was then permanently stored within the output directory.
Synthetic Data Generation
Synthetic data generation enables the rapid creation of diverse datasets by producing synthetic filled document images using handwritten text from font files. This method overcomes the challenges of manual data collection and form filling, facilitating efficient data generation for various applications.
• Form Collection: Forms are collected from applications that are diverse and widely used by students
• Form Template Creation: Various form templates are designed with different layouts, fields, and structures. These templates mimic real-world forms and documents, providing a realistic context for synthetic data
• Text Entry Simulation: Form fields are automatically filled with handwritten text, utilizing realistic names, addresses, dates, and other pertinent information. This guarantees that the text entries are coherent and contextually relevant.
Figure 4.8: A sample image of synthetic data
More specifically, we first collected 10 distinct administrative document forms from Ho Chi Minh City University of Technology and Education (HCMUTE), published on the FME website, and gathered student information from credible sources. To enhance this data, we also created synthetic information, including phone numbers, dates, and other numerical values, through random generation.
We developed form templates that define the structure of the form and the relationship between questions and answers. Using OpenCV, we enabled interactive selection of regions within an image with the mouse, highlighting these areas with colored rectangles while displaying their coordinates on the terminal. Additionally, we implemented an algorithm to incorporate student information into the form images, utilizing the fonts created in section 4.2.
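A minimal sketch of the form-filling step, assuming field coordinates were picked with the OpenCV selection tool and that text is rendered with Pillow's ImageFont using one of the generated handwriting fonts; paths, field names, and coordinates are hypothetical.

```python
# Illustrative sketch of filling a blank form image with synthetic handwriting.
from PIL import Image, ImageDraw, ImageFont

def fill_form(template_path, out_path, font_path, answers):
    """answers maps a field name to (x, y, text), where (x, y) came from region selection."""
    form = Image.open(template_path).convert("RGB")
    draw = ImageDraw.Draw(form)
    font = ImageFont.truetype(font_path, size=36)
    for _field, (x, y, text) in answers.items():
        draw.text((x, y), text, font=font, fill=(20, 20, 60))  # dark, ink-like colour
    form.save(out_path)

fill_form(
    "forms/mien_thi_blank.png", "synthetic/mien_thi_0001.png",
    "MyHandwriting.ttf",
    {"name": (420, 310, "Nguyễn Văn A"), "phone": (420, 370, "0909123456")},
)
```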
Advantages and Disadvantages of Synthetic Data
Synthetic data generation offers several advantages over manual data generation:
• Scalability: Synthetic data can be generated in large volumes quickly, overcoming the limitations of manual data collection
• Control and Flexibility: The generation process can be controlled to produce specific text variations and contexts, ensuring comprehensive coverage of different scenarios
• Cost-Effectiveness: Synthetic data generation reduces the need for manual annotation, lowering the overall cost of dataset creation
Despite its advantages, synthetic data generation also faces challenges:
• Realism: Ensuring that synthetic data accurately mimics real-world text appearances and contexts can be difficult
• Overfitting: Models trained on synthetic data may be overfit to the specific variations generated, potentially reducing their performance on real-world data
• Diversity: While synthetic data can be diverse, it may still lack the full range of variations found in real-world text.
Real Data Collection
Synthetic data generation encounters challenges such as realism, overfitting, and a lack of diversity. Accurately representing real-world text appearances and contexts proves to be a complex task. Models trained on synthetic data risk overfitting to the generated variations, lowering their performance on diverse real-world data. And although synthetic data can be varied, it often lacks the full range of variations seen in real-world text.
To overcome these challenges, the goal of real data collection is to compile a diverse and high-quality dataset that truly reflects real-world text. This process encompasses several essential steps.
• Data Collection: Text images are collected from various sources, including scanned documents, photographs of text in natural settings, and handwritten notes
• Annotation: Collected text images are annotated with ground truth text. This involves manually labeling text regions, which is a meticulous and time-consuming task
• Quality Control: Multiple rounds of validation by different annotators are conducted to verify the consistency and correctness of annotations
This approach not only addresses the limitations of synthetic data but also provides a robust foundation for training models that can generalize well to diverse real-world text scenarios.
DEVELOPMENT OF TEMPLATE CREATION
Objectives
Extracting information from administrative documents demands exceptional precision, which relies on prior knowledge as a foundational element for an effective information extraction system. Utilizing a structured template to define the location and relationships of information greatly boosts the efficiency and accuracy of the extraction process, ultimately enhancing data storage and retrieval performance. Our proposed approach aims to address these critical aspects.
We create a streamlined pipeline for generating document templates by implementing processes that automatically identify key areas within documents, including titles, questions, answers, tables, and creation dates. Our approach utilizes AI algorithms to establish connections between these elements. Furthermore, the system is engineered for quick and easy adaptation to new application forms.
The Template Creation system features a user-friendly GUI that highlights detected regions in different colors, enabling users to easily understand the document's layout. Additionally, this interactive interface allows for real-time modifications to the structure, enhancing user engagement and efficiency.
Template Creation Workflow
The Template Creation System, as shown in Figure 5.1, processes blank form images through a series of steps to generate a structured document template and the relationships between the fields in the form.
First, image preprocessing is performed to enhance image quality. This includes contrast enhancement, a noise filtering step, and conversion to grayscale.
Document parsing involves identifying and extracting essential information from documents, including titles, questions, answers, and dates from administrative forms. This process is treated as an image segmentation or object detection challenge. We use a YOLOv8 model to output a dictionary comprising segmentation masks or bounding boxes, along with confidence scores categorized by class names such as title, question, answer, and date.
Figure 5.1: The template creation system
The OCR module consists of two key components: a text detection model and a text recognition model. Initially, it identifies areas containing text, and subsequently, these regions are processed by the text recognition model to transform the images into editable text.
The human correction stage then ensures the accuracy of the OCR output, allowing human intervention to correct any errors in the extracted text
After human correction, document layout analysis is performed to discern the physical arrangement of the document. This involves identifying components such as text blocks, tables, lists, and figures. The YOLOv8 model segments images into these categories (title, text, table, list, and figure) and generates a dictionary containing segmentation masks or bounding boxes, each accompanied by a confidence score that indicates the reliability of the classification.
Template formatting promotes consistency by aligning elements on the same row to share identical y coordinates. This arrangement follows a logical sequence from left to right and top to bottom, based on their coordinates. After sorting, the information is organized into relevant categories, including name, student ID, class, and university name, according to their positional relationships.
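A small sketch of this row-wise sorting, with the row tolerance value as an assumption rather than a value reported here.

```python
# Sketch of template formatting: boxes whose y coordinates are close are grouped
# into the same row, then ordered left-to-right within each row.
def sort_reading_order(boxes, row_tol=15):
    """boxes: list of dicts with a 'bbox' key (x1, y1, x2, y2)."""
    if not boxes:
        return []
    boxes = sorted(boxes, key=lambda b: b["bbox"][1])
    rows, current = [], [boxes[0]]
    for box in boxes[1:]:
        if abs(box["bbox"][1] - current[-1]["bbox"][1]) <= row_tol:
            current.append(box)          # same row: y coordinates are close
        else:
            rows.append(current)
            current = [box]
    rows.append(current)
    return [b for row in rows for b in sorted(row, key=lambda b: b["bbox"][0])]
```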
The final output consists of a structured document template and a relational format, designed for storage in a data management system. This setup enhances the efficient access and management of document templates, specifically for use in an Information Extraction System.
Image preprocessing, as illustrated in Figure 5.2, enhances and prepares images for analysis through a series of transformations. Initially, the image is converted from the BGR color space to the LAB color space, isolating the lightness channel (L-channel). The Contrast Limited Adaptive Histogram Equalization (CLAHE) technique is then applied to the L-channel, redistributing lightness values to improve contrast. Finally, the enhanced L-channel is merged with the A and B color channels to create a new LAB image, which is subsequently converted back to the BGR color space.
Next, the enhanced image is converted to grayscale to simplify further processing.
A thresholding technique is utilized on the grayscale image to generate a binary image, effectively isolating key features. This process incorporates an inverse binary threshold alongside Otsu's method, which dynamically identifies the optimal threshold value.
Once the optimal threshold value is determined, morphological operations, particularly the morphological open operation, are applied to the binary image. This process effectively removes noise and small artifacts, yielding a cleaner binary image. Figure 5.3 illustrates the input and output of this image preprocessing step.
Figure 5.3: Raw image (left) and preprocessed image (right)
The binary image is resized to a width of 2000 pixels for standardization and easier handling in subsequent tasks. It is then converted back to the BGR color space to ensure compatibility with processes requiring a three-channel image format. The function returns this preprocessed image, featuring enhanced contrast, reduced noise, and a standardized size, making it ideal for further analysis or processing.
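The sketch below condenses the preprocessing pipeline of this section into a single OpenCV function; the CLAHE and kernel parameters are assumptions, while the target width of 2000 pixels follows the text.

```python
# Condensed sketch of the preprocessing pipeline (parameter values are assumptions).
import cv2
import numpy as np

def preprocess(image_bgr, target_width=2000):
    # Contrast enhancement: apply CLAHE to the lightness channel in LAB space.
    l, a, b = cv2.split(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB))
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # Grayscale, then inverse binary threshold with Otsu's automatic threshold selection.
    gray = cv2.cvtColor(enhanced, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Morphological opening removes small noise and artifacts.
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    # Standardize the width and return a three-channel image for downstream modules.
    scale = target_width / opened.shape[1]
    resized = cv2.resize(opened, (target_width, int(opened.shape[0] * scale)))
    return cv2.cvtColor(resized, cv2.COLOR_GRAY2BGR)
```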
Document parsing is the process of identifying and extracting essential components from documents, such as titles, questions, answers, and dates found in administrative forms. These components are associated with their respective bounding boxes and confidence scores, which are then organized into a structured dictionary for improved accessibility and manipulation.
The prediction method utilizes the YOLOv8 model to generate predictions from an image, optionally incorporating a confidence threshold. It organizes the detected elements by their vertical positions for logical coherence and modifies overlapping or aligned bounding boxes to ensure accurate structuring in the output.
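A hedged sketch of this prediction step using the ultralytics YOLOv8 API; the weight path, confidence value, and dictionary layout are assumptions rather than the exact implementation.

```python
# Sketch of running the fine-tuned YOLOv8 parser and packing its output by class name.
from ultralytics import YOLO

model = YOLO("weights/document_parsing_yolov8.pt")  # assumed weight path

def parse_document(image_path, conf=0.5):
    result = model(image_path, conf=conf)[0]
    parsed = {"title": [], "question": [], "answer": [], "date": []}
    for box in result.boxes:
        name = result.names[int(box.cls)]
        parsed.setdefault(name, []).append({
            "bbox": box.xyxy[0].tolist(),     # [x1, y1, x2, y2]
            "confidence": float(box.conf),
        })
    # Sort each class's boxes top-to-bottom so the output follows reading order.
    for name in parsed:
        parsed[name].sort(key=lambda d: d["bbox"][1])
    return parsed
```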
Figure 5.4: A sample of document parsing
An algorithm assesses the relative positions of bounding boxes by comparing two boxes to determine if they align in the same position, row, or column, or if they occupy different positions. By calculating the center of the second box and comparing it to the dimensions of the first box, the algorithm effectively establishes their spatial relationship.
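A minimal sketch of this spatial-relationship check; the return labels are chosen for illustration.

```python
# The centre of box_b is compared against the extent of box_a to decide whether
# the two boxes share a row, a column, the same position, or neither.
def relative_position(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns 'same', 'row', 'column' or 'different'."""
    cx = (box_b[0] + box_b[2]) / 2
    cy = (box_b[1] + box_b[3]) / 2
    in_x = box_a[0] <= cx <= box_a[2]   # centre falls within box_a horizontally
    in_y = box_a[1] <= cy <= box_a[3]   # centre falls within box_a vertically
    if in_x and in_y:
        return "same"
    if in_y:
        return "row"       # horizontally offset but vertically aligned
    if in_x:
        return "column"    # vertically offset but horizontally aligned
    return "different"
```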
The structured representation of parsed document elements includes the title, question, answer, and date, along with a list of bounding boxes and their corresponding confidence values An example of this document parsing output is illustrated in Figure 5.4.
The text detection process begins with an image extracted from document parsing, where Vietnamese, a Latin-based language with unique accents and diacritical marks, poses specific challenges. Instead of creating a new model for Vietnamese characters, we utilize existing models trained on English datasets for effective text detection. We chose the DBN detector for its impressive speed and accuracy, enabling efficient text localization in images. DBN employs a segmentation technique that incorporates a differentiable binarization (DB) module, which utilizes an adaptive functional threshold developed during training, thereby improving detection performance and streamlining post-processing compared to conventional methods.
After identifying text regions in an image using bounding boxes, these boxes are processed through the AttentionOCR text recognition model, as detailed in section 2.4.2. Our architecture comprises two main modules, initially trained on printed texts. To improve its performance on handwritten texts, we fine-tuned the model using a combination of manually collected handwritten samples and synthetically generated texts. This refinement greatly enhanced the model's proficiency in recognizing handwritten text.
Template Creation Sequence Diagram
Figure 5.8 illustrates the sequence diagram of the template creation system. This system allows users to upload an image from a computer to the website.
Figure 5.8: Sequence diagram of template creation system
The server processes the image to enable user interaction, allowing actions such as inserting, removing, and adjusting bounding boxes, as well as editing the text within each bounding box.
Finally, after the template for the image has been created, the template and its corresponding image are saved in the database for further retrieval.
Template Creation GUI
The interface in Figure 5.9 allows users to create a template that provides prior knowledge for the system. Here is a description of the functions of the buttons in the interface:
• Home Page: This button takes the user back to the homepage of the document extraction system
• Choose File (in the top left corner): This button allows the user to select an image file to upload to the system
• Submit: This button submits the uploaded file for processing
• Enter text for selected box: This section allows the user to enter text into a specific field on the form
• Title Mode, Question Mode, Answer Mode, Date Mode, and Table Mode:
The buttons enable users to choose a specific mode for capturing information from the file, with each mode represented by a distinct color Users can draw a box to define the text region and categorize it according to the five available modes.
• Choose File (in the top right corner): This button allows the user to select a template file in JSON format to upload to the system
• Save Template: This button allows the user to save the template they have created in JSON format
• Fields of Information: This section displays the information extracted from the file
Once the user has completed the correction of the OCR results, they can click the "Create Table" button. This action will store the template image and the JSON format of the uploaded form in the system database.
The interface enables users to add, remove, and adjust bounding boxes, including their color and the information they contain. In the "Fields of Information" section, users can verify the image's structure, which includes the title, question, answer, and date. Once the template is confirmed to be correct, users can click the "Create Table" button to update the image template, its structural elements, and the relationship between the question and answer in the database, as detailed in section 5.2.6, and the prior knowledge for the information extraction system in section 5.3.
The Template Creation system GUI allows users to easily customize each box by clicking and adjusting its size and position. Additionally, the text within each box can be modified using the input field located in the top left corner of the interface ("Enter text for selected box").
DEVELOPMENT OF INFORMATION EXTRACTION
Objectives
The aim of this system is to extract information from administrative documents using predefined templates established in CHAPTER 5. Our proposed approach must successfully perform the necessary tasks to achieve this goal.
• Building a pipeline to extract information: We build processes that leverage the predefined templates to optimize extraction speed. In addition, we implement AI algorithms to accurately recognize and extract the required information.
• Building a user-friendly GUI: We develop a GUI for the Information Extraction system, which not only showcases the extracted data but also enables users to correct any inaccuracies, thereby improving the overall precision of the information presented.
Information Extraction Workflow
The Information Extraction System depicted in Figure 6.1 processes filled form images through a series of steps: improving image quality, categorizing documents, extracting essential information, recognizing text, and verifying the accuracy of the extracted data.
Initially, image preprocessing is performed, where steps such as contrast enhancement and noise filtering are applied to produce higher-quality images, as mentioned in section 5.2.1.
Document categorization plays a crucial role in extracting relevant information based on the document's title. This process employs a fuzzy matching technique alongside a layout analysis model that incorporates the YOLOv8 architecture to accurately extract the document title. Subsequently, the extracted title is utilized to select an appropriate template from the data storage system.
The template matching process follows, aiming to extract key information fields from the input document using a template retrieved from the data storage.
Following template matching, OCR is employed, as mentioned in section 5.2.3, to convert the extracted key-value pairs into plain text.
Finally, the human correction stage involves human review of the information extracted by the OCR stage, ensuring the final output is accurate and reliable.
Figure 6.1: The Information extraction system
At this stage, the input consists of the recognized title of the form, which may sometimes be inaccurate. It is essential to find the most similar title stored in the database. For example, the recognized title "Đơn xin mien thi" contains recognition errors (missing diacritics). Document categorization techniques are utilized to determine the closest matching title in the database, such as "Đơn xin miễn thi."
To accomplish this, we employ the "thefuzz" library, which specializes in fuzzy string matching. A crucial element of this library is "python-Levenshtein," known for its implementation of the Levenshtein distance algorithm. This algorithm measures the dissimilarity between two strings by calculating the minimum number of operations required to transform one string into the other. These operations include insertion, deletion, and substitution of characters.
A smaller Levenshtein distance indicates a higher similarity between the strings.
Utilizing the Levenshtein distance from the "thefuzz" library enables us to detect strings that closely match a specified pattern, accommodating abbreviations and omissions. This approach enhances document categorization by facilitating the identification of approximate matches instead of relying solely on exact ones.
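A short sketch of this fuzzy title matching with thefuzz; the stored titles and the score threshold are illustrative.

```python
# Sketch of matching a noisy OCR title against stored template titles.
from thefuzz import process, fuzz

stored_titles = ["ĐƠN XIN MIỄN THI", "ĐƠN XIN HOÃN THI", "ĐƠN XIN CẤP BẢNG ĐIỂM"]

def categorize(recognized_title, threshold=70):
    # extractOne returns the closest stored title and its similarity score (0-100).
    best, score = process.extractOne(recognized_title, stored_titles, scorer=fuzz.ratio)
    return best if score >= threshold else None

print(categorize("Đơn xin mien thi"))   # expected to return "ĐƠN XIN MIỄN THI"
```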
Template matching is a vital process in the image alignment system that compares an input image with a template. The process begins by converting both images to grayscale to facilitate easier comparison. It employs the SIFT detector to identify keypoints and compute descriptors, which encapsulate critical information about the image's features. Subsequently, the randomized kd-tree algorithm is utilized to establish correspondences between the keypoints of the input image and the template, with matches being filtered based on their distances to ensure precision.
Figure 6.2: Template matching with single points (blue) and matched lines (green)
After the matching process, the algorithm organizes the coordinates of the matched keypoints to calculate a homography matrix, which outlines the geometric transformation required to align the input image with the template.
The function utilizes the homography matrix through a perspective warp to align the input image with the dimensions of the template, ultimately returning the aligned image and finalizing the template matching process. An example of SIFT matching is illustrated in Figure 6.2.
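The sketch below combines the SIFT detection, randomized kd-tree (FLANN) matching, homography estimation, and perspective warp described in this section; the ratio-test and RANSAC parameters are common OpenCV defaults, not values reported in this work.

```python
# Sketch of the SIFT/FLANN template alignment step.
import cv2
import numpy as np

def align_to_template(input_bgr, template_bgr, ratio=0.75):
    img_gray = cv2.cvtColor(input_bgr, cv2.COLOR_BGR2GRAY)
    tpl_gray = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_gray, None)
    kp2, des2 = sift.detectAndCompute(tpl_gray, None)

    # FLANN with randomized kd-trees; keep matches that pass Lowe's ratio test.
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < ratio * n.distance]

    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    homography, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    # Warp the input so it matches the template's dimensions.
    h, w = tpl_gray.shape
    return cv2.warpPerspective(input_bgr, homography, (w, h))
```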
The information extracted by the system is then written to the table created in section 5.2.6, where it can be used for visualization or retrieval.
Figure 6.3: The information after being extracted is stored in the database with its corresponding table
Figure 6.3 displays the extracted information of student Bùi Nam Khánh from the "ĐƠN XIN MIỄN THI" form, which is stored in our database. This information includes key details such as ID, name, phone number, and student ID.
Information Extraction Sequence Diagram
Figure 6.4 illustrates the sequence diagram of the information extraction system. Similar to the template creation system, this system allows users to upload an image from a computer or scanner to the website.
The server processes the uploaded image and provides the DLA and OCR results. Users have the ability to modify the bounding boxes and edit the text contained within each box.
Finally, after the human correction process is finished, the information from the image is updated in the database for further retrieval and visualization.
Figure 6.4: Sequence diagram of information extraction system
Information Extraction GUI
Here is a description of the function of each button and the content on the interface:
• Choose File: This button allows the user to upload a file, such as a scanned exemption request form, to the system
• Submit: This button submits the uploaded form for processing
• Update Table: This button updates the corrected information into the corresponding table in the database (see section 5.2.6)
Figure 6.5: Information extraction system GUI
Figure 6.5 illustrates the user interface designed for extracting information from documents. The left side of the webpage features the uploaded image with bounding boxes highlighting text regions, while the right side displays the recognized text for easy error checking. Users can correct any inaccuracies by rewriting the text in the input field located at the top right corner of the site.
EXPERIMENTAL RESULTS AND DISCUSSION
Objectives
To gain a deeper understanding of the methods and evaluate the accuracy of the presented models, this chapter fulfills the following tasks:
We conduct training and evaluation of cutting-edge document layout analysis models on our dataset, drawing insights from reputable research articles to select the optimal neural network architecture for our training. Following this, we assess the performance of our model against other fine-tuned models on the same dataset, ensuring a comprehensive comparison of effectiveness.
To train and evaluate a text recognition model, we first choose the most suitable neural network architecture, akin to the process used for document layout analysis models. We then assess our model's performance against other models that support the Vietnamese language.
The analysis will enable a comparison of methods at each stage, highlighting their strengths and limitations. This comparative study aims to reveal any existing issues within the methods employed in our system.
Document Layout Analysis Model
A study by Akanda et al. [40] evaluated the effectiveness of DiT, LayoutLMv3, and YOLOv8 on a Bengali dataset to enhance processing for low-resource languages. The findings revealed that YOLOv8 outperformed the other models, achieving an 8.95% higher Intersection over Union (IoU) score compared to DiT and a remarkable 38.48% higher IoU score than LayoutLMv3 in DLA tasks.
Vietnamese is classified as a low-resource language in the domains of DLA and OCR. Our analysis of the Vietnamese dataset follows the same methodology, yielding comparable outcomes. Notably, YOLOv8 emerges as the most effective approach for DLA within the Vietnamese datasets discussed in Chapter 4.
In this study, we utilized a 90-image dataset to refine our Document Layout Analysis model using the YOLOv8 neural network. To prevent overfitting and selection bias, we divided the dataset into three segments: 70% for training, 20% for validation, and 10% for testing. We applied various augmentation techniques, such as image rotation, shifting, noise addition, and blurring, to increase data variability. Following these augmentations, our training set for Document Layout Analysis expanded to a total of 189 images.
The training parameters for this model are listed in Table 7.1:
The training and inference procedures for this model were run on an NVIDIA T4 GPU on Google Colab. The training process took 0.645 hours.
Figure 7.1 illustrates the results of document layout analysis, showcasing bounding boxes that represent five classes: title, text, table, list, and figure, along with their accuracy. YOLOv8 achieves high accuracy in identifying these class regions, while LayoutLMv3 incorrectly categorizes the date region as a figure instead of text. Additionally, LayoutLMv3 fails to detect the crucial title region "ĐƠN XIN MIỄN THI," which is essential for the form's structure.
Figure 7.1: A sample result of YOLOv8 (left) and LayoutLMv3 (right)
This module utilized average precision, average recall, and mean Average Precision (mAP) at IoU thresholds from 0.50 to 0.95 for evaluation and comparison. The comparison results of three distinct document layout analysis models are presented in Table 7.2.
Table 7.2: Comparison of document layout analysis models
Text Detection Model
Vietnamese, a Latin-based language with unique accents and diacritical marks, poses distinct challenges for text detection. Instead of creating a new model for Vietnamese characters, we can utilize existing models trained on English datasets. We chose the DBN detector for its remarkable speed and accuracy, which enhances the efficiency of text localization in images. The DBN model employs a segmentation technique combined with a differentiable binarization (DB) module, offering advantages over traditional segmentation methods that depend on fixed thresholds and complicated post-processing.
DBN employs an adaptive functional threshold developed during training. This enhances text detection performance and simplifies post-processing. Figure 7.2 shows an example of the text detection model's output.
Figure 7.2: Text detection results in typed and handwritten forms
Text Recognition Model
The architecture of a text recognition system is crucial because it determines how effectively and efficiently the system can process and interpret text. Key aspects include:
• Feature Extraction: The ability to accurately capture and represent features from the text, such as characters, words, and contextual information
• Handling Variability: Robustness to variations in font, size, orientation, and noise in the text
• Scalability: Capability to handle large-scale datasets and complex documents with varying structures
AttentionOCR and TransformerOCR (as mentioned in section 2.4.2) represent two distinct approaches in OCR systems. AttentionOCR employs an attention mechanism to selectively focus on key areas of an input image during the decoding process. By integrating a VGG19 CNN for feature extraction with LSTM networks for sequence modeling, it achieves efficient training and inference. This architecture empowers the model to effectively prioritize significant visual features.
TransformerOCR utilizes a transformer architecture that employs self-attention mechanisms to process input sequences concurrently, improving its capacity to manage the longer text dependencies essential for OCR tasks. This capability allows for better context understanding across extensive text spans. Although TransformerOCR often delivers superior performance thanks to its parallel processing ability, which takes advantage of modern hardware, it usually requires more computational resources than LSTM-based models such as AttentionOCR.
When choosing between AttentionOCR and TransformerOCR, it is crucial to consider the balance between accuracy and speed. TransformerOCR excels in accuracy with its advanced self-attention mechanisms, while AttentionOCR significantly outperforms it in inference speed, being four times faster. Although AttentionOCR has a slight accuracy trade-off, its rapid processing is vital for real-time text recognition applications, making it the more practical and efficient option overall.
In Chapter 3, we gathered character samples from students at HCMC University of Technology and Education to create a dataset for training a text recognition model. The dataset statistics are detailed in Table 7.3. To prevent overfitting and model selection bias, we divided the dataset into three parts: 80% for training, 10% for validation, and 10% for testing. Additionally, Table 7.4 presents the optimal hyperparameters used in the training process.
Table 7.3: Dataset description for training text recognition model
Category | Description | Number of images
Student name | Names of students | 2000
Student Id | Unique id of students | 100
Subject | Names of academic subjects | 50
Subject Id | Unique id of subjects | 50
Major | Names of academic majors | 100
Table 7.4: Training parameter for text recognition model
Batch size | 4
Learning rate | 0.001
Optimizer | AdamW
Loss | Negative log likelihood loss
Iteration | 10000
Table 7.5 shows the comparison results of the two different text recognition models, AttentionOCR and TransformerOCR.
Table 7.5: Comparison of text recognition models
It is noted that the performance of the two models is relatively good. AttentionOCR has 18.07% CER and 48.20% WER. On the other hand, TransformerOCR has 12.24% CER and 36.04% WER, which are better than AttentionOCR. In terms of speed, AttentionOCR, at 40 FPS, is nearly four times faster than the Transformer.
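For reference, CER and WER can be computed from edit distances, as in the sketch below; the jiwer library is an assumption, since the evaluation tool is not named here.

```python
# Hedged sketch of the evaluation metrics using jiwer (an assumed choice of library).
import jiwer

label      = "Kỹ thuật nữ công"
prediction = "Kỹ thuật mữ công"

print(jiwer.cer(label, prediction))   # character error rate: edit distance over character count
print(jiwer.wer(label, prediction))   # word error rate: edit distance over word count
```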
Table 7.6 shows the results of some correctly recognized images and Table 7.7 shows some results of misidentified images after testing the TransformerOCR model.
Table 7.6: Illustration of correctly recognized images
Name | Label | Recognition
name_10.png | Nguyễn Hồng Sơn | Nguyễn Hồng Sơn
major_6.png | CNKT In | CNKT In
dept_25.png | Đào tạo QT | Đào tạo QT
Table 7.7: Illustration of misidentified images
The misidentified samples include longer labels such as the name "Võ Trần Hoài Ân", the major "Kỹ thuật nữ công", and the department "Cơ khí động lực".
The analysis of the two tables indicates that the model's accuracy in character recognition declines as the number of characters in an image increases. For example, the image major_6.png, labeled "CNKT In," features only 6 characters and is recognized correctly. In contrast, the image major_48.png, labeled "Kỹ thuật nữ công," contains 13 characters, highlighting the model's challenges in achieving accurate recognition with more complex labels.
System Evaluation
Document image acquisition: Our system has successfully connected and controlled the Brother MFC-795CW scanner via USB and TCP/IP connections.
Additionally, we developed a user interface that enables users to activate the scanner and manage scanned data efficiently
We generated 20 unique handwritten fonts and created two distinct datasets for training Document Layout Analysis (DLA) and Optical Character Recognition (OCR) models. The DLA dataset consists of 160 images, whereas the OCR dataset contains 2,330 images, facilitating enhanced model performance.
We have developed an efficient Template Creation system for administrative documents, featuring advanced capabilities to predict question and answer locations. This high level of customization allows users to swiftly create prototypes. Furthermore, we successfully designed templates for 10 forms from Ho Chi Minh City University of Technology and Education.
We have successfully developed an Information Extraction system that achieves up to 80% accuracy in extracting data from handwritten applications. This system not only allows users to edit the collected information but also significantly reduces the time needed for data entry.
In summary, we successfully met all our objectives by developing an API that connects to various document scanners. We also integrated deep learning models to identify document layouts and automate information extraction, significantly reducing repetitive human tasks. Our evaluation shows that the system boosts productivity in the digitization process, processing images in about 3 to 4 seconds. Currently, it analyzes and extracts data from 10 types of forms, with the capability to accommodate additional forms as users expand the Template Creation System.
Despite the advancements in our project, we face challenges with handwritten text recognition. The variability in features, size, and length of handwritten text compared to typed text poses difficulties. Consequently, our system often misinterprets handwritten input, particularly when the text is difficult for humans to read.
Finally, we plan to continue our thesis with the following future works:
• Improve OCR, especially for handwritten text, by leveraging the latest architectures
• Expand the diversity of the synthetic data by applying GANs (Generative Adversarial Networks)
• Optimize the flow in the server for faster processing
• Design a friendlier and more visually appealing user interface
REFERENCES
[1] Accelerating Document AI. (n.d.). Retrieved from https://huggingface.co/blog/document-ai
[2] Binmakhashen, G M., & Mahmoud, S A (2019) Document layout analysis: a comprehensive survey ACM Computing Surveys (CSUR), 52(6), 1-36
[3] O'Gorman, L (1993) The document spectrum for page layout analysis IEEE
Transactions on pattern analysis and machine intelligence, 15(11), 1162-1173
[4] Marinai, S., Gori, M., & Soda, G (2005) Artificial neural networks for document analysis and recognition IEEE Transactions on pattern analysis and machine intelligence, 27(1), 23-35
[5] Garz, A., Fischer, A., Sablatnig, R., & Bunke, H (2012, March) Binarization-free text line segmentation for historical documents based on interest point clustering
In 2012 10th IAPR International Workshop on Document Analysis Systems (pp 95-
[6] Capobianco, S., Scommegna, L., & Marinai, S. (2018). Historical handwritten document segmentation by using a weighted loss. In Artificial Neural Networks in Pattern Recognition: 8th IAPR TC3 Workshop (ANNPR 2018), Siena, Italy. Springer International Publishing.
[7] Chen, K., Liu, C L., Seuret, M., Liwicki, M., Hennebert, J., & Ingold, R (2016,
April) Page segmentation for historical document images based on superpixel classification with unsupervised feature learning In 2016 12th IAPR Workshop on Document Analysis Systems (DAS) (pp 299-304) IEEE
[8] Wick, C., & Puppe, F (2018, April) Fully convolutional neural networks for page segmentation of historical document images In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS) (pp 287-292) IEEE
[9] Grüning, T., Leifert, G., Strauß, T., Michael, J., & Labahn, R. (2019). A two-stage method for text line detection in historical documents. International Journal on Document Analysis and Recognition (IJDAR), 22(3), 285-302.
[10] Oliveira, S A., Seguin, B., & Kaplan, F (2018, August) dhSegment: A generic deep- learning approach for document segmentation In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR) (pp 7-12) IEEE
[11] Deng, J., Dong, W., Socher, R., Li, L J., Li, K., & Fei-Fei, L (2009, June) Imagenet:
A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp 248-255). IEEE
[12] Kise, K., Sato, A., & Iwata, M (1998) Segmentation of page images using the area
Voronoi diagram Computer Vision and Image Understanding, 70(3), 370-382
[13] Lu, Y., Wang, Z., & Tan, C L (2004) Word grouping in document images based on
Voronoi tessellation In Document Analysis Systems VI: 6th International Workshop, DAS 2004, Florence, Italy, September 8-10, 2004 Proceedings 6 (pp 147-157) Springer Berlin Heidelberg
[14] Agrawal, M., & Doermann, D (2009, July) Voronoi++: A dynamic page segmentation approach based on voronoi and docstrum features In 2009 10th International Conference on Document Analysis and Recognition (pp 1011-1015) IEEE
[15] Bukhari, S S., Shafait, F., & Breuel, T M (2009, July) Script-independent handwritten textlines segmentation using active contours In 2009 10th International Conference on Document Analysis and Recognition (pp 446-450) IEEE
[16] Jain, A K., & Zhong, Y (1996) Page segmentation using texture analysis Pattern recognition, 29(5), 743-770
[17] Saabni, R., & El-Sana, J (2011, September) Language-independent text lines extraction using seam carving In 2011 International Conference on Document Analysis and Recognition (pp 563-568) IEEE
[18] Asi, A., Cohen, R., Kedem, K., & El-Sana, J (2015, August) Simplifying the reading of historical manuscripts In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (pp 826-830) IEEE
[19] Asi, A., Cohen, R., Kedem, K., El-Sana, J., & Dinstein, I (2014, September) A coarse-to-fine approach for layout analysis of ancient manuscripts In 2014 14th International Conference on Frontiers in Handwriting Recognition (pp 140-145) IEEE
[20] Nagy, G., & Seth, S C (1984) Hierarchical representation of optically scanned documents
[21] Tran, T A., Na, I S., & Kim, S H (2016) Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology International Journal on Document Analysis and Recognition (IJDAR), 19, 191-209
[22] Jocher, G., Chaurasia, A., & Qiu, J (2023, January 1) YOLOv8 by Ultralytics
GitHub https://github.com/ultralytics/ultralytics
[23] Solawetz, J (2024) What is YOLOv8? The Ultimate Guide [2024] Retrieved from https://blog.roboflow.com/whats-new-in-yolov8/
[24] Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F (2022, October) Layoutlmv3: Pre- training for document ai with unified text and image masking In Proceedings of the 30th ACM International Conference on Multimedia (pp 4083-4091)
[25] Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., & Wei, F (2022, October) Dit: Self- supervised pre-training for document image transformer In Proceedings of the 30th ACM International Conference on Multimedia (pp 3530-3539)
[26] Subramani, N., Matton, A., Greaves, M., & Lam, A (2020) A survey of deep learning approaches for ocr and document understanding arXiv preprint arXiv:2011.13534
[27] Jaume, G., Ekenel, H K., & Thiran, J P (2019, September) Funsd: A dataset for form understanding in noisy scanned documents In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol 2, pp 1-6) IEEE
[28] Liao, M., Shi, B., Bai, X., Wang, X., & Liu, W (2017, February) Textboxes: A fast text detector with a single deep neural network In Proceedings of the AAAI conference on artificial intelligence (Vol 31, No 1)
[29] Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W., & Liang, J (2017) East: an efficient and accurate scene text detector In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp 5551-5560)
[30] Zhu, Y., Chen, J., Liang, L., Kuang, Z., Jin, L., & Zhang, W (2021) Fourier contour embedding for arbitrary-shaped text detection In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp 3123-3131)
[31] Wang, W., et al. (2019). Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[32] Wang, P., Zhang, C., Qi, F., Huang, Z., En, M., Han, J., & Shi, G (2019, October)
A single-shot arbitrarily-shaped text detector based on context attended multi-task learning In Proceedings of the 27th ACM international conference on multimedia (pp 1277-1285)
[33] Sheng, T., Chen, J., & Lian, Z (2021) Centripetaltext: An efficient text instance representation for scene text detection Advances in Neural Information Processing Systems, 34, 335-346
[34] Pham, Q (n.d.) VietOCR - Nhận Dạng Tiếng Việt Sử Dụng Mô Hình Transformer và AttentionOCR Retrieved from https://pbcquoc.github.io/vietocr/
[35] Lowe, D G (2004) Distinctive image features from scale-invariant keypoints International journal of computer vision, 60, 91-110
[36] GmbH, M (n.d.) Create your own fonts Retrieved from https://www.calligraphr.com/en/
[37] Contributors, T F P (n.d.) FontForge Open Source Font Editor Retrieved from https://fontforge.org/en-US/
[38] Design With FontForge (n.d.) Retrieved from http://designwithfontforge.com/en-
US/Spacing_Metrics_and_Kerning.html
[39] Shen, Z., Zhang, R., Dell, M., Lee, B C., Carlson, J., & Li, W (2021) LayoutParser:
A Unified Toolkit for Deep Learning Based Document Image Analysis. arXiv preprint arXiv:2103.15348
[40] Akanda, M M B A N., Ahmed, M., Rabby, A S A., & Rahman, F (2024, April)
Optimum Deep Learning Method for Document Layout Analysis in Low Resource Languages In Proceedings of the 2024 ACM Southeast Conference (pp 199-204)
[41] Muja, M., & Lowe, D G (2009) Fast approximate nearest neighbors with automatic algorithm configuration VISAPP (1), 2(331-340), 2
[42] Liao, M., Wan, Z., Yao, C., Chen, K., & Bai, X (2020, April) Real-time scene text detection with differentiable binarization In Proceedings of the AAAI conference on artificial intelligence (Vol 34, No 07, pp 11474-11481)
2024 7th International Conference on Green Technology and Sustainable Development (GTSD)
Development of an AI System for Data Extraction from Vietnamese Printed Documents
Phuc Huynh Vinh † Department of Mechatronics
HCMC University of Technology and Education
Ho Chi Minh City, Vietnam 20134005@student.hcmute.edu.vn
Quan Chu Nhat Minh † Department of Mechatronics
HCMC University of Technology and Education
Ho Chi Minh City, Vietnam 20134021@student.hcmute.edu.vn
Phi Nguyen Xuan † Department of Mechatronics HCMC University of Technology and Education
Ho Chi Minh City, Vietnam 20134004@student.hcmute.edu.vn
Duc Bui Ha * Department of Mechatronics HCMC University of Technology and Education
Ho Chi Minh City, Vietnam ducbh@hcmute.edu.vn
In Vietnam, the reliance on traditional paper-based official documents, such as legal contracts and application forms, poses accessibility challenges and high maintenance costs, creating a strong need for digitization. This study focuses on developing a system that utilizes artificial intelligence (AI) techniques to accurately and efficiently analyze and filter information from these documents. The system integrates deep learning models, specifically YOLO for document layout analysis and parsing, alongside TransformerOCR for recognizing handwritten text. Various document pre-processing pipelines were explored to improve the accuracy and efficiency of information extraction. Experimental results revealed that the YOLO model achieved a precision of 91% in layout analysis and 89% in parsing, while the TransformerOCR model demonstrated a competitive Character Error Rate (CER) of 12.24% and a Word Error Rate (WER) of 36.04% in the text recognition task.
Keywords—Optical character recognition, document layout analysis, document parsing, Vietnamese text detection and recognition
In Vietnam, paper-based documents continue to dominate information storage and communication, outpacing electronic options. However, these traditional documents pose significant challenges, such as limited accessibility, high maintenance expenses, data management difficulties, and the requirement for substantial physical storage space.
The digitization of documents is essential for organizations aiming to improve information management efficiency. With many organizations processing up to 1,000 unstructured documents daily, often in formats like PDFs or scanned images, the need for digitization is increasingly critical. However, prevalent practices in Vietnam frequently rely on manual data entry, which is both time-consuming and dependent on human labor.
Previous research has investigated different methods for extracting information from documents, primarily utilizing handcrafted features like regex and rule-based techniques. A notable study presented a quick rule-based approach for information extraction, although this method can struggle with highly variable data, especially in form extraction scenarios. Additionally, other strategies depend on text positioning for information extraction but require an understanding of the document's structure beforehand.
Information extraction from documents can be approached as a sequence tagging problem or by using Recurrent Neural Networks (RNNs) for contextual information extraction. However, these techniques typically treat text as a linear sequence, overlooking visual elements and document layouts. Consequently, applying these methods to complex layouts, like forms, poses considerable challenges.
Modern information extraction methods for form-like documents combine both textual and visual features. Research has demonstrated the use of semi-supervised learning techniques, particularly Graph Convolution Networks, for this purpose. However, these methods face challenges, such as the need to pre-define the number of form fields and issues with outlier values in box coordinates. Additionally, the amount of information extracted can vary significantly based on the individual completing the form, leading to inconsistencies in the field information present in each document.
The LayoutLM method has been proposed for pre-training text and layout understanding in document images, demonstrating strong performance with complex layouts. However, it struggles with scanned Vietnamese forms and necessitates large datasets and extended fine-tuning periods for optimal functionality. Currently, there is a significant shortage of adequately large datasets of scanned Vietnamese documents to support these methods.
This paper presents an efficient end-to-end information extraction method specifically designed for Vietnamese scanned documents, featuring two key systems: Template Creation and Information Extraction, both powered by AI models for Document Layout Analysis and Document Parsing. To combat the lack of available data resources, we also created a comprehensive dataset of both raw and synthetic Vietnamese documents. The primary contributions of this research are as follows:
• We proposed an auto-digitizing method leveraging state-of-the-art techniques from related fields to efficiently extract information from predefined forms
• We built a dataset comprising Vietnamese form-like documents tailored for use with our data extraction system
• We conducted thorough evaluations of our proposed method and highlighted the experimental results obtained
• We presented a framework for generating synthetic handwriting data to fine-tune the OCR model, thereby enhancing the overall accuracy of our work
Fig 1 Sample image of the dataset