
Digitalization of Administrative Documents

A Digital Transformation Step in Practice

Sinh Van Nguyen, Dung Anh Nguyen, Lam Son Quoc Pham

School of Computer Science and Engineering, International University, Vietnam National University of HCMC

Ho Chi Minh, Vietnam

ITITIU17073@student.hcmiu.edu.vn; pqslam@hcmiu.edu.vn

Corresponding author: nvsinh@hcmiu.edu.vn (ORCID: 0000-0003-0424-5542)

Abstract—Digital transformation has been one of the most popular keywords in recent years. It is not only a trend in scientific research driven by the development of information technology, but also a task that companies and organizations are now expected to carry out. Digitalization of administrative documents is therefore considered the first step in the digital transformation of a public organization. Through the digitizing process, information that exists in written form or as hard copies is converted into a digital format (e.g., document files) for storing, mining, processing and managing the documents. This paper presents a method to build a web application for digitizing the administrative documents used in most public organizations. The method is based on OCR (Optical Character Recognition) combined with image processing techniques. Our digitizing process consists of the following steps: (i) scanning the hard copies of the administrative documents; (ii) removing noise and filtering the necessary information in the content using image processing techniques; (iii) automatically classifying the acquired contents into the respective components of a template form that follows the structured format of the Vietnam Government; (iv) automatically generating a document file. The application can process a document with a single page or multiple pages. Compared with similar applications, our application processes each page within a few seconds, places no limit on the number of pages per document, and achieves the accuracy we expected.

Index Terms—Digital Transformation, Document Digitalization, OCR, Image Processing, Smart Web Application

I. INTRODUCTION

The development of Information Technology (IT) brings advantages to daily work, study, research and entertainment. IT is now a common tool in official activities and a means of administrative management. This has led to the first steps of digital transformation (DT) in organizations and countries all over the world. From the perspective of work, DT is the shift from traditional to digital activities built on IT and communication devices. From another perspective, DT is formed by the merger of personal and corporate IT environments, based on an intersection of digital technologies such as cloud computing, big data, IoT and AI, to serve all activities of an organization [1], [2]. Digitalization is the process of transforming data or information into a computer-based digital format. Document digitalization [3], [4], [16] refers to the technique of scanning the hard copy of a document and converting its content into an electronic version in a document file format such as doc, docx or pdf. In these digital formats, information is arranged into distinct data components stored in computer memory that can be processed individually, and it is therefore readable by a computer. Document digitalization proceeds as follows: a hard copy of the document is scanned and saved as a picture or a pdf file, page by page. The light and dark regions of the scanned image are analyzed by an optical character recognition engine, which turns each letter or number into an ASCII code; the system then analyzes and divides the ASCII characters into several smaller portions that may be saved for later use.

In practice, many documents need to be kept and preserved carefully for a long time because of their value to individuals or even to the history of a country. They can be certificates, degrees, legal or administrative documents, etc., and their number increases day by day. Normally, they are made of paper or of special wood and are difficult to keep for a long time.

The problem comes from the limited storage space and the effort required for document preservation. According to the State Records and Archives Department of Vietnam [5], there are six national archive centers, which store a huge amount of national documents. In public organizations and companies, document storage and management takes up half of the working time of an average employee, and even more in large firms where documents pile up quickly. Such a minor task should not consume so much space and time, which is why it should be automated.

As a solution to this problem, a computer-based application can be developed to convert traditional paper documents into their digital counterparts. The documents in this application are stored in a structured form within a robust architecture to support the administrative workload of employees. Such an application can meet the needs of large organizations for keeping information safe, up to date and accessible to all authorized parties.

The work in this paper proposes a method and an application for digitizing administrative documents. Our solution is a web application that supports the management, storage, search and mining needs of large firms and public organizations of the Vietnam Government. The method consists of the following steps: (i) the hard copies of the administrative documents are scanned and saved as picture or pdf files; (ii) image processing techniques are used to filter the scanned data by removing noise [6]; (iii) the acquired contents are classified automatically using the Tesseract OCR engine [7] and placed into the right positions of a template form that follows the structured format of the Vietnam Government; (iv) document files are created automatically for use and management. The final product provides tools that assist companies in administrative work such as organizing, managing, storing, mining, using and reproducing their documents. The whole process is considered a DT step in administrative management.

The remainder of the paper is structured as follows. Section II reviews several methods, solutions and applications in practice. Section III presents our method and the system architecture of the application in detail. The implementation and the obtained results are presented in Section IV. We compare and discuss the methods and the functions of the applications in Section V. The last section is our conclusion.

II. RELATED WORK

Digitization of data is an important step in almost all data mining and management activities in companies and public organizations. It is also a module in the DT process of governments [8]. Existing IT-based tools, techniques and methods are key factors in the whole process. The OCR technique [3] is used to read and identify characters and image information in scanned documents. Part of the conversion is to recognize the characters within the uploaded image of a document and export them to a digital copy. The benefit of digital copies of the source materials is that they are easily managed in large quantities and (in theory) can be used indefinitely. This technology [9], [10] aims to transform administrative workloads in large organizations and firms by replacing the hassle of paperwork with a digital solution that is both reliable and manageable.

Koichi Kise [11] presented a method for classifying a document image into homogeneous components such as text blocks, figures and tables. The method uses image processing techniques to distinguish the background, foreground, object components, color and intensity of pixels in the image. The obtained results showed a promising way to extract and recognize characters in document images. For image processing, OpenCV is one of the popular open-source libraries and offers fast processing. Chung B. W. [12] introduced, step by step, how to install and work with OpenCV; this guideline is useful to researchers and developers. Image processing techniques are widely applied in computer graphics and computer vision. Minh et al. [13] introduced a method for creating a virtual museum based on a virtual reality application. The method is considered a digital transformation step in the field of digital heritage. The application allows users to visit and interact with the relics in the museum as in practice. In the medical field, image processing techniques can be applied to build applications that support doctors in disease diagnosis. Sinh et al. [14] presented a method for building and visualizing medical data objects in a web application, which is useful to both medical staff and patients. Applying image processing techniques to develop a web-based system for digitalizing administrative documents is therefore a practical and widely used approach.

According to the format defined by the Vietnam Government [15], the structure of an administrative document is presented in Fig. 1. This is the required document template (on A4 paper) for creating an administrative document in all public organizations. It is also used in private companies that follow the administrative laws of Vietnam. The parts of the document are numbered and placed at different positions as follows: (1) national name; (2) name of the organization that issued the document; (3) document ID; (4) place and date of issue; (5a) type of document; (5b) abstract; (6) main content; (7a, 7b, 7c) title, full name and signature of the competent person, respectively; (8) seal (stamp) and signature of the organization; (9a, 9b) recipient; (10a, 10b) confidentiality indicator and urgency indicator; (11) scope of circulation; (12) writer notation and number of editions; (13) contact information of the organization; (14) digital signature of the organization for the copy of the document converted into electronic format. Among these components, the main content (6) can extend over more than one page.

Fig. 1. Format of an administrative document by the Vietnam Government.

Ruili Zhang et al. [16] presented a framework for digital document processing. The proposed method includes several important steps such as scanning, indexing, quality checking, archiving and backup of electronic documentary information. However, the method works on non-structured documents and does not compare its accuracy with existing methods in the same context. The next section presents in detail our proposed method and application for digitalizing and managing administrative documents.

III. PROPOSED METHOD

A. System architecture

The system is a web application built on the MVC model with a database management system for working with, storing and mining documents in practice. The steps of our proposed method and application are shown in Fig. 2.

Fig. 2. Our proposed workflow of the digitizing process.

B. Proposed algorithm

To start the digitalization process, the system first receives and loads a scanned image of the document. After scanning, a cropped replica of it is generated (which will be used later to locate the stamp). Then, we use image processing techniques to convert it into a binary (pure black-and-white) image with a threshold function in order to filter noisy elements. The threshold operation alters the value of each pixel: if the pixel value exceeds the threshold value, it is assigned the value 1 (white); otherwise, it is set to 0 (black). After that, the dilate method increases the thickness of the characters to a specific level, so that the spaces between characters are filled and each component of the document forms a uniform partition in the image. In the next step, to detect regions of image or text, the contour of each partition is found and the four corner points of the rectangular area surrounding it (called a bounding box) are identified.
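The thresholding, dilation and bounding-box steps described above can be illustrated with a minimal pure-Python sketch on a tiny grayscale grid. This is not the application's code, which relies on OpenCV; the function names, the threshold of 128, the square dilation kernel and the inversion step (so that dark ink becomes the foreground before dilation) are all assumptions made for illustration.

```python
from collections import deque

def threshold(gray, t=128):
    """Binarize: pixels above t become 1 (white), all others 0 (black)."""
    return [[1 if p > t else 0 for p in row] for row in gray]

def invert(binary):
    """Swap 0 and 1 so that dark ink becomes the foreground (assumed step)."""
    return [[1 - p for p in row] for row in binary]

def dilate(mask, radius=1):
    """Grow every foreground pixel by `radius` pixels (square kernel)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 1
    return out

def bounding_boxes(mask):
    """Return (x, y, width, height) of each 4-connected foreground region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes

# Hypothetical 5x9 "page": two dark strokes close together, one far away.
W, K = 255, 0
page = [
    [W, W, W, W, W, W, W, W, W],
    [W, K, W, K, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, W, W],
    [W, W, W, W, W, W, W, K, W],
]
ink = invert(threshold(page))
boxes_before = bounding_boxes(ink)          # three separate strokes
boxes_after = bounding_boxes(dilate(ink))   # the two nearby strokes merge
```

In this toy example the dilation merges the two neighbouring strokes into a single partition, which is the effect the paragraph describes for characters belonging to the same component.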

The structure of the administrative document (as in Fig. 1) is used to determine and extract the text information in the document image. The component corresponding to each partition can be predicted from the position of its bounding box (x, y coordinates) and its size (width, height). The partitions are matched with the numbered parts of the document structure, such as the name of the organization, ID, place and date, document type, abstract, content, recipient, position, stamp and signature. If the detected component is a stamp, the system detects the circle that represents the stamp's border, uses an RGB filter to keep only the red color, and saves the result as a PNG picture. Note that this image is not used to extract the text inside the stamp; it is kept only as a replica of the stamp. After the components are determined, the image of each component is cropped from the replica (which was made at the start) and stored as a series of pictures labeled with the component's name. This allows us to treat each component independently. Finally, an OCR engine transforms each component's picture into text and returns the results.
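As a rough illustration of classifying a partition from its bounding box, the sketch below maps the centre of a box to a component name using relative page zones. The zone boundaries and the component labels are hypothetical; the paper does not give the actual rules or coordinates, so the values here only show the idea of position-based matching against a fixed template.

```python
def classify_component(box, page_width, page_height):
    """Guess the template component of a bounding box from its position.

    `box` is (x, y, width, height) in pixels. All zone limits below are
    illustrative assumptions, not values taken from the paper.
    """
    x, y, w, h = box
    cx = (x + w / 2) / page_width    # relative horizontal centre, 0..1
    cy = (y + h / 2) / page_height   # relative vertical centre, 0..1

    if cy < 0.12:
        return "organization_name" if cx < 0.5 else "national_name"
    if cy < 0.20:
        return "document_id" if cx < 0.5 else "place_and_date"
    if cy < 0.30:
        return "type_and_abstract"
    if cy < 0.75:
        return "main_content"
    return "recipient" if cx < 0.5 else "signature_and_stamp"

# Hypothetical boxes on a 1000 x 1400 pixel page.
header_right = classify_component((600, 50, 300, 60), 1000, 1400)
body = classify_component((100, 500, 800, 400), 1000, 1400)
footer_right = classify_component((600, 1150, 300, 200), 1000, 1400)
```

Using relative rather than absolute coordinates keeps such a rule set independent of the scan resolution, which matters because the test documents are scanned at different DPI values.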

In general, our proposed algorithm (Algorithm 1) is described as follows.

Algorithm 1 DocumentDigitization()
Input: image files
Output: document files
Load an image file
Crop the image
Create a replica of the image
Convert the replica into pure black and white using the threshold function
Dilate the characters in the image
Find the contours of the image components and store them in array C
Initialize i = 0
while i < C.length do
    Create the bounding rectangle of C[i]
    Get x, y, width, height of the bounding rectangle
    Classify the component based on x, y, width, height
    if the component is a stamp then
        Find the mask of the bounding circle of the stamp
        Extract the stamp from the image (using BitwiseAND with the original image)
    else
        Crop the area at (x, y, x + width, y + height) of the image's clone
        Extract the text from the area and push it into array texts
    end if
    i = i + 1
end while
Return the image of the stamp and texts
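The stamp branch of Algorithm 1 (build a mask, then combine it with the original image by a bitwise AND) can be sketched in pure Python as follows. The colour limits in `red_mask` are assumed values, the image is a small list of RGB tuples rather than an OpenCV matrix, and the detection of the bounding circle is left out for brevity.

```python
def red_mask(image, min_red=150, max_other=100):
    """Return 1 where a pixel is predominantly red, else 0 (limits assumed)."""
    return [
        [1 if r >= min_red and g <= max_other and b <= max_other else 0
         for (r, g, b) in row]
        for row in image
    ]

def bitwise_and(image, mask):
    """AND each channel with 0xFF where the mask is set and with 0 elsewhere."""
    return [
        [tuple(c & (0xFF if m else 0x00) for c in px)
         for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

# Hypothetical 2x2 crop: red stamp ink, black text, white paper, darker red ink.
crop = [
    [(200, 30, 30), (10, 10, 10)],
    [(255, 255, 255), (180, 90, 60)],
]
mask = red_mask(crop)
stamp_only = bitwise_and(crop, mask)
```

Pixels outside the mask become black, so only the red stamp ink survives; text and paper that overlap the stamp area are dropped.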

IV. IMPLEMENTATION AND RESULTS

The application can be used by many users (actors) who interact with the system. Anyone can view, search and use the system. The office staff can process and manage all incoming and outgoing administrative documents of the organization. The administrator manages the users of the system. We use the JavaScript programming language with NodeJS [19] and ReactJS [17], [18], and the Visual Studio Code IDE [21]. Tesseract OCR [23] was selected among many OCR engines because of its support for various languages and its performance on different fonts. To combine it with image processing techniques, we use OpenCV to enhance the image quality and obtain the best results. The main page for document digitalization is shown in Fig. 3. The picture of a document is loaded on the left-hand side of the webpage. It is then digitalized by extracting all the texts and the picture of the stamp, which are recognized and transferred into the structured form on the right-hand side of the webpage. These data are then corrected in each component of the form (if necessary) and stored in a MongoDB database [22]; a new document can be reproduced from them using the document management function (see Fig. 4). In this form, the user can choose any document by its ID to display a new document file whose content is exactly the same as that of the input document image.
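The structured form that is corrected, stored and later reproduced can be pictured as a single record per document. The sketch below is only one possible shape for such a record: the class name, the field names and the JSON serialization are assumptions loosely based on the components of Fig. 1, not the schema used in the application.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class DigitizedDocument:
    """Hypothetical record for one digitized administrative document."""
    document_id: str
    organization: str = ""
    place_and_date: str = ""
    document_type: str = ""
    abstract: str = ""
    content_pages: List[str] = field(default_factory=list)  # one entry per page
    recipient: str = ""
    signer_title: str = ""
    signer_name: str = ""
    stamp_image_path: str = ""  # path of the saved stamp picture, if any

    def to_record(self) -> dict:
        """Return a plain dict that a document database could store."""
        return asdict(self)

# Example with placeholder text; only the ID is taken from the test documents.
doc = DigitizedDocument(
    document_id="110/TB-HQT",
    content_pages=["text of page one", "text of page two"],
)
record = doc.to_record()
restored = json.loads(json.dumps(record))
```

Keeping the main content as a list of pages reflects the fact that component (6) may span several pages, while the remaining fields appear once per document.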

Fig. 3. The main page to digitize an administrative document.

Fig. 4. The page of documents management.

TABLE I
COMPARISON OF PERFORMANCE (#E/#C: THE NUMBER OF ERRORS / THE NUMBER OF CHARACTERS), ACCURACY (ACC %) AND PROCESSING TIME (MS: MILLISECONDS)

DocID        Quality  Type      #E/#C    Acc (%)  # of comp  Time (ms)
                      original  83/1037  91.996   10         6000
                      clean     85/1037  91.803   10         5730
110/TB-HQT   Medium   original  45/1030  95.631   11         5550
                      clean     47/1030  95.437   12         4990
187/Q-UBND   High     original  41/1514  96.019   12         5990
                      clean     46/1514  95.533   12         5300

V. DISCUSSION AND COMPARISON

In this section, we test our application with different types of administrative documents and compare the obtained results. Several experiments were carried out to assess the application's efficiency. Each test was examined carefully and independently in the context of OCR and image processing. Table I shows the overall findings for a variety of document characteristics. Each entry in the table reflects a specific experiment carried out on the corresponding document. We tested three input document pictures (their IDs are shown in the first column) with different resolutions (the quality). Each of them was tested in two types: the original input and the clean version obtained after removing noise. A scanned image with a DPI (dots per inch) of 400 or greater is considered a high-resolution image; from 300 to below 400, a medium-resolution image; and less than 300, a low-resolution image. Depending on the quality of the scanned document and its resolution after the denoising process, the number of components (# of comp) detected in each document differs. The processing time and the accuracy [20] of the recognition are presented in Table I. The accuracy of the OCR engine is measured at the character level and is determined by equation (1):

$$\mathrm{Acc} = 1 - \frac{e}{c} \qquad (1)$$

where e is the number of erroneous characters and c is the number of all characters in the document. Table I reports Acc as a percentage.
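The character-level accuracy of equation (1) and the resolution classes described above are simple enough to check by hand. The following sketch uses illustrative function names (they are not taken from the application) and reproduces two of the accuracy values listed in Table I.

```python
def character_accuracy(errors, characters):
    """Character-level accuracy of equation (1), expressed as a percentage."""
    if characters <= 0:
        raise ValueError("characters must be positive")
    return (1 - errors / characters) * 100

def resolution_class(dpi):
    """Map a scan resolution to the quality classes used in the experiments."""
    if dpi >= 400:
        return "high"
    if dpi >= 300:
        return "medium"
    return "low"

# Two rows of Table I: 83 errors in 1037 characters, 45 errors in 1030.
acc_a = character_accuracy(83, 1037)   # about 91.996
acc_b = character_accuracy(45, 1030)   # about 95.631
```

For instance, 83 errors in 1037 characters give 1 - 83/1037 ≈ 0.91996, which matches the 91.996% reported for the original version of the first document.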

We count the errors and compute the ratio between the number of errors and the number of characters in each document (#E/#C) to obtain the accuracy of our proposed method. In Table I, the clean documents are processed faster than the originals in every case (for example, 5730 ms versus 6000 ms for the first document), and the clean version of 110/TB-HQT yields 12 components instead of 11, while the character-level accuracy of the clean and original versions differs by less than 0.5 percentage points. The obtained results show that the better the input data, the more efficient the outcome. The accuracy of the clean documents (for medium and high quality) is above 95%. The results indicate that the algorithms in the application can meet the demanding criteria. With a more capable system and higher-quality input obtained by other scanning methods, documents can be digitized more accurately, which will significantly improve the outcome. However, processing handwriting is still a challenge for researchers [10], especially the signature and the stamp (or seal) of the organization. Because of security issues, the seal is managed according to the rules and legal system of the Vietnam Government. In this application, we only extract the shape of the seal based on its boundary to demonstrate the technique, and we do not reuse it to reproduce new documents. The second point is the signature: following the rules, it always overlaps one third of the seal from the left. Therefore, we did not extract it, in order to preserve the legal validity of the original documents. In practice, an administrative document of any organization or company has legal value when it is sealed (stamped) and signed by the leader, although electronic signatures have been approved by Vietnamese law (but only in some cases).

Compared with the existing methods and applications in practice (see Table II and Table III), our application has several advantages. It is built as a web application, so it is convenient and easy to use. It can handle many documents at the same time on the network of an organization or a company. The storage capacity depends on the hardware of the data server. Therefore, the application is practical for managing, storing and mining documents effectively in the DT process of an organization. In Table II, VietOCR is a free tool that runs as both a web and a desktop application. It is the fastest of all the applications, and its accuracy is also higher than that of our application. Its online test version can accurately extract text data such as the information on an identification card, a medical card or a one-page document. It is very useful for developing the QR code or barcode scanners used in supermarkets and stores for checking items and payment. However, it is not used for digitizing administrative documents with multiple pages. ABBYY FineReader is close to our application, with similar support and accuracy, and a better processing time in some cases. However, it is proprietary software, which may cause compatibility and licensing issues. SodaPDF obtained very low accuracy and the longest processing time, and it sometimes generated overlapping words and noise. Omnipage Docudirect does not support Vietnamese, which makes it difficult to evaluate. Without counting the Vietnamese accents, it can generate decent characters; however, its accuracy is still low. Both SodaPDF and Omnipage Docudirect could not process the document 480/Q-BGDT well because of its low resolution, which resulted in much lower accuracy. In contrast, the other applications and ours suffered only a negligible loss of accuracy on that sample. Therefore, the higher the quality of the document images, the better the results we can obtain in both accuracy and processing time. The other applications used in our experiments were unable to extract the stamp separately, which can cause legal problems because the stamp often overlaps the signature. Our product achieves good accuracy without long processing times, supports Vietnamese, and extracts the stamp and signature separately.

We compare the supported functions, the ability to process stamps and text, the license and the platform of several applications in practice in Table III. The advantages of our application are that it is free to use, it is designed as a web application, and it can process a document with multiple pages. Omnipage Docudirect does not support Vietnamese, but it can still generate relatively accurate words. SodaPDF, ABBYY FineReader and Omnipage Docudirect cannot extract the image of the stamp and signature separately, which is another difference from our method and VietOCR. Besides, users have to pay a fee to use them. In general, our product achieves good accuracy, supports Vietnamese, extracts stamps and signatures separately, processes standard documents and exports the output into the structured format of the administrative document form. Most importantly, our application is available free of charge to support organizations in their digital transformation.

TABLE II
COMPARISON OF THE ACCURACY AND PROCESSING TIME

Doc ID       Application          Errors (chars)  Acc (%)  Time (ms)
480/Q-BGDT   Our Application      84              91.899   5724
             ABBYY FineReader     88              91.514   4280
             Omnipage Docudirect  436             57.955   14420
110/TB-HQT
187/Q-UBND

TABLE III
COMPARISON OF UTILITIES BETWEEN THE APPLICATIONS

Apps                 Support VN  Extract stamp  Structure form  License      Platform
Our
SodaPDF              Yes         Not clear      Yes             Trial vers.  Web-app, desktop
ABBYY FineReader     Yes         Not clear      Yes             Trial vers.  Desktop
Omnipage Docudirect  No          Not clear      Yes             Trial vers.  Desktop

VI. CONCLUSION

In this research, we proposed a method and built an application for digitizing administrative documents. The research relies on the fields of computer graphics, computer vision and image processing. We used ReactJS and NodeJS combined with the OpenCV and Tesseract OCR libraries to build a web application for digitizing and managing administrative documents. The obtained results reached more than 91% accuracy, and the processing time is just a few seconds per document page (including both scanning and character recognition). The application is very useful for administrative document management and can support staff in office activities. The current system has difficulty processing handwritten documents because they are inconsistent and do not follow any rules. However, this can be addressed with deep learning models in the future. Moreover, the system uses a predetermined structure of an administrative document, which does not cover all real-world use cases. This limitation is minor because the structure of the document form can easily be modified and improved.

Although the accuracy of the obtained results is not 100%, the application offers a fast response time with sufficient accuracy, performs well on printed documents, and provides an option for editing before export. With this accuracy and response time, the application can be used in large organizations and firms, where a huge number of administrative documents are processed each day. Moreover, integrating the OCR module into the web application did not degrade the performance; rather, it enhanced the user experience by providing an easy-to-use interface for conversion, editing and management. With the resulting web application, the daily administrative workload can be significantly reduced through a fast and secure solution. Accuracy can be improved and a wider range of use cases covered with a larger dataset containing different forms of documents, as well as with a wider variety of character dictionaries beyond those of Tesseract OCR. Further research and tests are necessary to optimize the system. In addition, we will study the processing of handwritten characters to improve the next version.

VII. ACKNOWLEDGMENT

The research work in this paper is funded by the student project of the International University, Vietnam National University of Ho Chi Minh City (HCMIU), under project ID SV20220-IT-05. We would like to thank the university for the funding.

REFERENCES

[1] Thomas M. Siebel, “Digital Transformation: Survive and Thrive in an Era of Mass Extinction,” first edition, RosettaBooks, 2019.

[2] Ziyadin S., Suieubayeva S. and Utegenova A., “Digital Transformation in Business,” International Scientific Conference “Digital Transformation of the Economy: Challenges, Trends, New Opportunities”, ISCDTE 2019, Lecture Notes in Networks and Systems, vol. 84, pp. 408-415, https://doi.org/10.1007/978-3-030-27015-5_49, Springer, 2020.

[3] Johan, M., Tan, R., Suteja, B. and Afiany, N., “Document Digitalization and Scoring System of Students Final Project,” Jurnal Teknik Informatika Dan Sistem Informasi, 6(3), https://doi.org/10.28932/jutisi.v6i3.3126, 2020.

[4] Johan, M., Tan, R., Suteja, B. and Afiany, N., “Document digitalization through use of cloud computing technology,” International Journal of Engineering Applied Sciences and Technology, vol. 4, issue 10, ISSN No. 2455-2143, pp. 260-262, 2020.

[5] The State Records and Archives Department of Vietnam, https://luutru.gov.vn/home.htm, accessed Nov. 2021.

[6] Fan, L., Zhang, F. and Fan, H., “Brief review of image denoising techniques,” Vis. Comput. Ind. Biomed., https://doi.org/10.1186/s42492-019-0016-7, 2019.

[7] Ray Smith, “An Overview of the Tesseract OCR engine,” International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, pp. 629-633, 2007.

[8] M. Borg, T. Olsson, U. Franke and S. Assar, “Digitalization of Swedish Government Agencies - A Perspective Through the Lens of a Software Development Census,” IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), pp. 37-46, 2018.

[9] Chirag Patel, Atul Patel, Dharmendra Patel, “Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study,” International Journal of Computer Applications, vol. 55, no. 10, 2012.

[10] J. Memon, M. Sami, R. A. Khan and M. Uddin, “Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR),” IEEE Access, vol. 8, pp. 142642-142668, 2020.

[11] Koichi Kise, “Page Segmentation Techniques in Document Analysis,” Handbook of Document Image Processing and Recognition, pp. 135-175, Springer, 2014.

[12] Chung B. W., “Getting Started with Processing and OpenCV,” Pro Processing for Images and Computer Vision with OpenCV, pp. 1-37, doi.org/10.1007/978-1-4842-2775-6_1, 2017.

[13] Minh Khai Tran, Sinh Van Nguyen, Nghia Tuan To, Marcin Maleszka, “Processing and Visualizing the 3D Models in Digital Heritage,” 13th International Conference on Computational Collective Intelligence (ICCCI 2021, Rank B), Lecture Notes in Computer Science, vol. 12876, Springer, pp. 613-625, 2021.

[14] NGUYEN Van Sinh, TRAN Manh Ha, LE Son Truong, “Visualization of Medical Images Data Based on Geometric Modeling,” Lecture Notes in Computer Science, vol. 11814, ISSN 0302-9743, pp. 560-576, Springer, 2019.

[15] Vietnam Government, “Format of the administrative document,” Number 30/2020/ND-CP, March 23, 2020.

[16] Ruili Zhang, Yanming Yang and Wenxiu Wang, “Research on document digitization processing technology,” MATEC Web of Conferences 309, 02014, CSCNS2019, pp. 1-6, doi.org/10.1051/matecconf/202030902014, 2020.

[17] ReactJS, A JavaScript library for building user interfaces, https://reactjs.org, accessed Nov. 2021.

[18] Sanchit Aggarwal, “Modern Web-Development using ReactJS,” International Journal of Recent Research Aspects, ISSN 2349-7688, vol. 5, issue 1, pp. 133-137, March 2018.

[19] Introduction to NodeJS, https://nodejs.dev/learn, accessed Nov. 2021.

[20] Christian Clausner, Stefan Pletschacher, Apostolos Antonacopoulos, “Flexible character accuracy measure for reading-order-independent evaluation,” Pattern Recognition Letters, vol. 131, pp. 390-397, doi.org/10.1016/j.patrec.2020.02.003, 2020.

[21] Visual Studio Code for the Web, https://code.visualstudio.com, accessed Nov. 2021.

[22] Christudas B., “Install, Configure, and Run MongoDB,” Practical Microservices Architectural Patterns, 2019.

[23] Ray W. Smith, “History of the Tesseract OCR engine: what worked and what didn’t,” Document Recognition and Retrieval XX, 865802, https://doi.org/10.1117/12.2010051, 2013.

Title: Digitalization of Administrative Documents: A Digital Transformation Step in Practice
Authors: Sinh Van Nguyen, Dung Anh Nguyen, Lam Son Quoc Pham
Institution: International University, Vietnam National University of HCMC
Field: Information and Computer Science
Type: Conference Paper
Year of publication: 2021
City: Ho Chi Minh
