Digitalization of Administrative Documents
A Digital Transformation Step in Practice
Sinh Van Nguyen, Dung Anh Nguyen, Lam Son Quoc Pham
School of Computer Science and Engineering International University, Vietnam National University of HCMC
Ho Chi Minh, Vietnam ITITIU17073@student.hcmiu.edu.vn; pqslam@hcmiu.edu.vn Corresponding author: nvsinh@hcmiu.edu.vn (ORCID: 0000-0003-0424-5542)
Abstract—Digital transformation has been one of the most popular keywords in recent years. It is not only a research trend driven by the development of information technology, but also a task that companies and organizations are now expected to carry out. Digitalization of administrative documents is therefore considered the first step in the digital transformation of a public organization. Through the digitizing process, information that exists in written form or as hard copies is converted into a digital format (e.g., document files) for storing, mining, processing and managing the documents. This paper presents a method to build a web application for digitizing the administrative documents used in most public organizations. The method is based on OCR (Optical Character Recognition) combined with image processing techniques. Our digitizing process consists of the following steps: (i) scanning the hard copies of the administrative documents; (ii) removing noise and filtering the necessary information in the content using image processing techniques; (iii) automatically classifying the acquired contents into the respective components of a template form that follows the structured format of the Vietnam Government; (iv) automatically generating a document file. The application can process a document with a single page or multiple pages. Compared with similar applications, our application processes a document page in a few seconds, places no limit on the number of pages per document, and achieves the accuracy we expected.
Index Terms—Digital Transformation, Document Digitalization, OCR, Image Processing, Smart Web Application
I. INTRODUCTION
The development of Information Technology (IT) brings advantages to daily work, study, research and entertainment. IT is now a common tool in official activities and a means of administrative management. This has led organizations and countries all over the world to take their first steps in digital transformation (DT). From the perspective of work, DT is the shift from traditional to digital activities built on IT and communication devices. From another perspective, DT is formed by the merger of personal and corporate IT environments, based on an intersection of digital technologies such as cloud computing, big data, IoT and AI, to serve all activities of an organization [1], [2]. Digitalization is the process of transforming data or information into a computer-based digital format. Document digitalization [3], [4], [16] refers to the technique of scanning the hard copy of a document and converting its content into an electronic file in a format such as doc, docx or pdf. In these digital formats, information is arranged into distinct data components stored in computer memory that can be processed individually and are therefore readable by a computer. Document digitalization proceeds as follows: a hard copy of the document is scanned and saved as a picture or a pdf file, page by page. The light and dark regions of the scanned image are analyzed by an optical character recognition engine, which turns each letter or number into an ASCII code; the system then divides the recognized characters into several small portions that can be saved for later use.
In practice, a great many documents need to be kept and preserved carefully for a long time because of their value to individuals or even to the history of a country. They can be certificates, degrees, legal or administrative documents, etc., and their number increases day by day. Normally they are made of paper, or sometimes of special wood, and are difficult to preserve for a long time.
The problem comes from the storage space required and the effort spent on document preservation. According to the State Records and Archives Department of Vietnam [5], there are six national archives centers, which store a huge amount of national documents. In public organizations and companies, document storage and management takes up half of the working time of an average employee, and even more in large firms where documents pile up quickly. Such a minor task should not consume so much space and time, which is why it should be automated.
As a solution to this problem, a computer-based application can be developed to convert traditional paper documents into their digital counterparts. The documents in this application are stored in a structured way in a robust architecture to support the administrative workloads of employees. Such an application can help large organizations keep information safe, up to date and accessible to all authorized parties.
The work in this paper proposes a method and creates an application for digitizing administrative documents. Our solution is a web application that supports the management, storage, searching and mining needs of large firms and public organizations of the Vietnam government. The method consists of the following steps: (i) the hard copies of the administrative documents are scanned and saved as picture or pdf files; (ii) image processing techniques are used to filter the scanned data by removing noise [6]; (iii) the acquired contents are recognized with the Tesseract OCR engine [7], classified automatically, and placed into the right positions of a template form that follows the structured format of the Vietnam government; (iv) document files are created automatically for use and management. The final product provides tools that assist companies in administrative work such as organizing, managing, storing, mining, using and reproducing their documents. The whole process is considered a DT step in administrative management.
The remainder of the paper is structured as follows. Section II reviews several methods, solutions and applications in practice. Section III presents our method and the system architecture of the application in detail. The implementation and obtained results are presented in Section IV. We compare and discuss the methods and the functions of the applications in Section V. The last section is our conclusion.
II. RELATED WORK
Digitization of data is an important step in almost all data mining and management activities in companies and public organizations. It is also a module in the DT process of governments [8]. Existing IT-based tools, techniques and methods are key factors in the whole process. The OCR technique [3] is used to read and identify characters and image information in scanned documents. Part of the conversion is to recognize characters within the uploaded image of a document and export these characters to a digital copy. The benefit of digital copies of the source materials is that they are easily managed in large quantities and (in theory) can be used indefinitely. This technology [9], [10] aims to revolutionize administrative workloads in large organizations and firms by eliminating the hassle of paperwork and opting instead for a digital solution that is both reliable and manageable.
Koichi Kise [11] presented a method for classifying a document image into homogeneous components such as text blocks, figures and tables. The method is based on image processing techniques that distinguish the background, foreground, object components, color and intensity of pixels in the image. The obtained results show a promising way to extract and recognize characters in document images. For processing the images, OpenCV is one of the popular open-source libraries and offers fast processing. Chung B. W. [12] introduced, step by step, how to install and work with OpenCV; this guideline is useful to researchers and developers. Image processing techniques are widely applied in computer graphics and computer vision. Minh et al. [13] introduced a method for creating a virtual museum based on a virtual reality application. The method is considered a digital transformation step in the field of digital heritage. The application allows users to visit and interact with the relics in the museum as they would in practice. In the medical field, image processing techniques can be applied to build applications that support doctors in disease diagnosis. Sinh et al. [14] presented a method for building and visualizing medical data objects based on a web application. This web app is useful for both medical staff and patients. Therefore, applying image processing techniques to develop a web-based system for digitalizing administrative documents is a popular and widely used approach in practice.
Fig. 1. Format of an administrative document by the Vietnam Government.
According to the format of the Vietnam Government [15], the structure of an administrative document is presented in Fig. 1. This is the required template (on A4 paper) for creating an administrative document in all public organizations. It is also used in private companies that follow the administrative laws of Vietnam. The components of the document are numbered and placed at different positions as follows: (1) National name; (2) Name of the organization that issued the document; (3) Document ID; (4) Place and date issued; (5a) Type of document; (5b) Abstract; (6) Main content; (7a, 7b, 7c) Title, Full name and Signature of the competent person, respectively; (8) Seal (stamp) and signature of the organization; (9a, 9b) Recipient; (10a, 10b) Confidentiality indicator and Urgency indicator; (11) Scope of circulation; (12) Writer notation and number of editions; (13) Contact information of the organization; (14) Digital signature of the organization for a copy of the document in electronic format. Among these components, the main content (6) can extend over more than one page.
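To make the template concrete, the numbered components can be pictured as fields of one record per digitized document. The Python sketch below is only an illustration under our own assumptions: the class name, the field names and the choice of fields are hypothetical and are not taken from the application described in this paper, which is written in JavaScript.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdministrativeDocument:
    """Hypothetical record for one digitized document (all names are assumptions)."""
    national_name: str = ""                 # component (1)
    issuing_organization: str = ""          # component (2)
    document_id: str = ""                   # component (3)
    place_and_date: str = ""                # component (4)
    document_type: str = ""                 # component (5a)
    abstract: str = ""                      # component (5b)
    content_pages: List[str] = field(default_factory=list)  # component (6), one entry per page
    signer_title: str = ""                  # component (7a)
    signer_full_name: str = ""              # component (7b)
    stamp_image_path: Optional[str] = None  # component (8), kept as an image, not as text
    recipients: List[str] = field(default_factory=list)     # components (9a, 9b)

    def full_content(self) -> str:
        """Join the main content, which may extend over several pages."""
        return "\n".join(self.content_pages)
```

Keeping the main content as a list of pages reflects the remark that component (6) may span more than one page, while the other components occur once per document.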
Ruili Zhang et al. [16] presented a framework for digital document processing. The proposed method includes several important steps such as scanning, indexing, quality checking, archiving and backup of electronic documentary information. However, the method works on non-structured documents and does not compare its accuracy with existing methods in the same context. The next section presents in detail our proposed method and application for digitalizing and managing administrative documents.
III. PROPOSED METHOD
A. System architecture
The system is a web application built on the MVC model together with a database management system for working with, storing and mining documents in practice. Our proposed method and application follow the steps shown in Fig. 2.
Fig. 2. Our proposed workflow of the digitizing process.
B. Proposed algorithm
To start the digitalization process, the system first receives and loads a scanned image of the document. After loading, a cropped replica of the image is generated (which will be used later to locate the stamp). Then, image processing techniques convert the image into a binary (pure black-and-white) image using a threshold function, which filters out noisy elements. The threshold operation alters the value of each pixel: if the pixel value exceeds the threshold, it is assigned the value 1 (white); otherwise, it is set to 0 (black). After that, the dilate method increases the thickness of the characters to a specific level so that the spaces between them are filled and each component of the document forms a uniform partition in the image. In the next step, to detect image or text regions, the four corner points of the rectangular area surrounding each partition are identified to determine its contour (called a bounding box).
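The three operations described above (thresholding, dilation and bounding-box detection) can be sketched in a few lines. The sketch is written in plain Python on nested lists so that it is self-contained; it is not the OpenCV-based code of the application, and the function names, the default threshold of 128 and the dilation radius are assumptions chosen for illustration. Note also that OpenCV's dilate grows bright regions, so implementations usually invert the image first, whereas this sketch thickens the dark ink pixels directly.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Pixels above the threshold become 1 (white); all others become 0 (black)."""
    return [[1 if p > threshold else 0 for p in row] for row in gray]


def dilate_ink(binary, radius=1):
    """Thicken ink: every pixel within `radius` of a black pixel becomes black."""
    h, w = len(binary), len(binary[0])
    out = [[1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x] == 0:
                for ny in range(max(0, y - radius), min(h, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[ny][nx] = 0
    return out


def bounding_boxes(binary):
    """Return (x, y, width, height) for each 4-connected region of black pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 0 or seen[y][x]:
                continue
            # Breadth-first search over the region, tracking its extent.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1 - x0 + 1, y1 - y0 + 1))
    return boxes
```

On a small test image with two ink pixels one column apart and a third farther away, the boxes found before dilation are three separate single-pixel boxes; after dilation the two close pixels merge into one partition, which is the effect the dilation step relies on.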
The structure of the administrative document (Fig. 1) is used to determine and extract the text information in the document image. The location of each partition, and hence the component it corresponds to, can be predicted from the four inferred points (x, y coordinates) and the partition size (width, height). Each partition is matched with a numbered component of the document structure (such as name of organization, ID, place and date, document type, abstract, content, recipient, position, stamp and signature). If the detected component is a stamp, the system proceeds to detect the circle that represents the stamp’s border, uses an RGB filter to keep only the red color, and saves the result as a PNG picture. Note that this image is not used to extract the text inside the stamp; it is only kept as a replica of the stamp. After the components are determined, the image of each component is cropped from the replica (which was made at the start) and stored as a series of pictures labeled with the component’s name. This process allows us to treat each component independently. Finally, an OCR engine transforms each component’s picture into text and returns the results.
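One simple way to match a partition with a component from its position and size is to compare the center of its bounding box with a set of layout zones. The following Python sketch illustrates the idea only: the zone names and all boundary values in ZONES are hypothetical placeholders, not measurements of the government template and not the rules used in our application.

```python
# Hypothetical layout zones as fractions of the page: (name, x_min, y_min, x_max, y_max).
# The boundaries are illustrative assumptions, not values from the official template.
ZONES = [
    ("organization", 0.0, 0.00, 0.5, 0.10),
    ("national_name", 0.5, 0.00, 1.0, 0.10),
    ("document_id", 0.0, 0.10, 0.5, 0.16),
    ("place_and_date", 0.5, 0.10, 1.0, 0.16),
    ("type_and_abstract", 0.0, 0.16, 1.0, 0.28),
    ("content", 0.0, 0.28, 1.0, 0.75),
    ("recipient", 0.0, 0.75, 0.5, 1.00),
    ("signature_block", 0.5, 0.75, 1.0, 1.00),
]


def classify(box, page_width, page_height):
    """Return the name of the zone containing the center of box = (x, y, width, height)."""
    x, y, w, h = box
    cx = (x + w / 2) / page_width    # normalized horizontal center
    cy = (y + h / 2) / page_height   # normalized vertical center
    for name, x0, y0, x1, y1 in ZONES:
        if x0 <= cx < x1 and y0 <= cy < y1:
            return name
    return "unknown"
```

Normalizing by the page size keeps the zones independent of the scan resolution, so the same table could be reused for low-, medium- and high-resolution scans.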
In general, our proposed algorithm (Algorithm 1) is described as follows.
Algorithm 1 DocumentDigitization()
Input: Image files
Output: Document files
Load an image file
Crop the image
Create replica of the image
Convert the replica into pure black and white using threshold function
Dilate characters in image
Find contours of image components and store in array C
Initialize i = 0
while i < C.length do
    Create bounding rectangle with C[i]
    Get x, y, width, height of the bounding rectangle
    Classify the component based on x, y, width, height
    if the component is stamp then
        Find mask of bounding circle of stamp
        Extract the stamp from the image (using BitwiseAND with the original image)
    else
        Crop the area at (x, y, x+width, y+height) of the image's clone
        Extract text from the area and push into array texts
    end if
    i = i + 1
end while
Return image of stamp and texts
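The stamp branch of Algorithm 1 combines a mask with the original image. The plain-Python sketch below illustrates that masking step under stated assumptions: it builds the mask from a red-color test instead of detecting the bounding circle, the thresholds min_red and max_other are arbitrary illustrative values, and the helper names are hypothetical rather than OpenCV calls.

```python
def is_red(pixel, min_red=150, max_other=100):
    """Rough red test on an (R, G, B) pixel; the thresholds are illustrative assumptions."""
    r, g, b = pixel
    return r >= min_red and g <= max_other and b <= max_other


def red_mask(image):
    """Mask with 255 where the pixel looks red and 0 elsewhere."""
    return [[255 if is_red(p) else 0 for p in row] for row in image]


def bitwise_and(image, mask):
    """AND every channel with the mask value: masked-out pixels become (0, 0, 0)."""
    return [
        [tuple(channel & m for channel in pixel) for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]
```

Because the mask holds only 0 or 255, the AND leaves red pixels unchanged and blanks everything else, so dark text or signature strokes that cross the stamp are dropped from the saved stamp image.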
IV. IMPLEMENTATION AND RESULTS
The application can be used by many users (actors) who interact with the system. Anyone can view, search and use the system. Official staff can process and manage all incoming and outgoing administrative documents of the organization. The administrator manages the users of the system. We use the JavaScript programming language with NodeJS [19] and ReactJS [17], [18], and the Visual Studio Code IDE [21]. Tesseract OCR [23] was selected among many OCR engines for its support of various languages and its performance on different fonts. In combination with image processing techniques, we use OpenCV to enhance the image quality and obtain the best results. The main page for document digitalization is shown in Fig. 3. The picture of a document is loaded on the left-hand side of the webpage. It is then digitalized by extracting all texts and the picture of the stamp. They are recognized and transformed into the structured data form on the right-hand side of the webpage. These data are then corrected in each component of the form (if necessary) and stored in a MongoDB database [22]; a new document can be reproduced from them with the document management function (see Fig. 4). In this form, the user can choose any document by its ID to display a new document file whose content is exactly that of the input document image.
Fig. 3. The main page to digitize an administrative document.
Fig. 4. The page of documents management.
TABLE I
COMPARISON OF PERFORMANCE (#E/#C: THE NUMBER OF ERRORS / THE NUMBER OF CHARACTERS), ACCURACY (ACC %) AND PROCESSING TIME (MS: MILLISECONDS)
DocID        Quality  Type      #E/#C    Acc (%)  # of comp  Time (ms)
480/Q-BGDT   Low      original  83/1037  91.996   10         6000
480/Q-BGDT   Low      clean     85/1037  91.803   10         5730
110/TB-HQT   Medium   original  45/1030  95.631   11         5550
110/TB-HQT   Medium   clean     47/1030  95.437   12         4990
187/Q-UBND   High     original  41/1514  96.019   12         5990
187/Q-UBND   High     clean     46/1514  95.533   12         5300
V. DISCUSSION AND COMPARISON
In this section, we test our application on different types of administrative documents and compare the obtained results. Several experiments were carried out to assess the application’s efficiency. In the context of OCR and image processing, each test was examined carefully and independently. Table I shows the overall findings for a variety of document characteristics. Each entry in the table reflects a specific experiment carried out on the corresponding document. We used three input document pictures (their IDs are shown in the first column) with different resolutions (the quality). Each of them was tested in two versions (the original input and the clean one obtained after removing noise). For example, a scanned image with a DPI (dots per inch) of 400 or more is considered a high-resolution image; from 300 to below 400, medium resolution; and less than 300, low resolution. Depending on the quality of the scanned documents and their resolution after the denoising process, the number of components (# of comp) detected in each document differs. The processing time and the recognition accuracy [20] are presented in Table I. The accuracy of the OCR engine is measured at the character level and is determined by equation (1).
$$A_c = 1 - \frac{e}{c} \tag{1}$$
where e is the number of error characters and c is the number of all characters in the document.
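As a quick check of equation (1), the character-level accuracy can be computed directly from an error count and a character count. The Python function below is a minimal sketch; its name and the input validation are our own additions for illustration.

```python
def character_accuracy(errors: int, characters: int) -> float:
    """Character-level accuracy A_c = 1 - e / c from equation (1)."""
    if characters <= 0:
        raise ValueError("the number of characters must be positive")
    return 1.0 - errors / characters


# Example with the 83/1037 entry of Table I, expressed as a percentage.
accuracy_percent = 100 * character_accuracy(83, 1037)
print(f"{accuracy_percent:.3f}")
```

For 83 errors in 1037 characters the function gives about 91.996%, which matches the accuracy reported for that entry in Table I.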
We count the errors and compute the ratio between the number of errors and the number of characters in each document (#E/#C) to obtain the accuracy of our proposed method. In general, the clean documents were processed slightly faster (by roughly 0.3 to 0.7 s per document in Table I) and yielded the exact number of components, while their character-level accuracy remained within 0.5 percentage points of that of the originals. The obtained results show that the better the input data, the more efficient the outcome. The accuracy on the clean documents (medium and high quality) is above 95%. The results indicate that the algorithms in the application can meet demanding criteria. With a more capable system and higher-quality input obtained by other scanning methods, documents can be digitized more accurately, which will significantly improve the outcome. However, processing handwriting is still a challenge for researchers [10], especially the signature and the stamp (or seal) of the organization. Because of security issues, the seal is managed according to the rules and legal system of the Vietnam Government. In this application, we only extract the shape of the seal based on its boundary to demonstrate the technique, and do not reuse it for reproducing new documents. The second point is the signature: by rule, about one third of the signature, from the left, is overlapped by the seal. Therefore, we did not extract it, in order to preserve the legal integrity of the original documents. In practice, the administrative document of any organization or company has legal value only when it is sealed (stamped) and signed by the leader, although electronic signatures have been approved by Vietnamese law (but only in some cases).
Compared with existing methods and applications in practice (see Table II and Table III), our application has several advantages. It is built as a web app, so it is convenient and easy to use. It can handle many documents at the same time over the network of an organization or a company. The storage capacity depends on the hardware of the data server. The application is therefore practical and useful for managing, storing and mining documents effectively in the DT process of an organization. In Table II, VietOCR is a free tool that runs as both a web and a desktop application. It is the fastest of all the applications, and its accuracy is also higher than that of our application. Its online test version can accurately extract text such as the information on an identification card, a medical card or a one-page document. It is very useful for developing QR or bar code scanners used in supermarkets or stores for checking items and payment. However, it is not used for digitizing administrative documents with multiple pages. ABBYY FineReader is close to our application, with similar support and accuracy, and better processing time in some cases. However, it is proprietary software, which may cause compatibility and licensing issues. SodaPDF obtained very low accuracy and the longest processing time, and it sometimes generated overlapping words and noise. Omnipage Docudirect does not support Vietnamese, which makes it difficult to evaluate. Without counting the Vietnamese accents, it can generate decent characters; however, the accuracy is still low. Both SodaPDF and Omnipage Docudirect could not properly process the document 480/Q-BGDT because of its low resolution, which resulted in much lower accuracy. In contrast, the other applications and ours suffered only a negligible loss in accuracy on that sample. Therefore, the higher the quality of the document images, the better the results, in both accuracy and processing time. The other applications used in our experiments were unable to extract the stamp separately, which can cause legal problems because the stamp often overlaps the signature. Our product achieves good accuracy without being time-consuming, while supporting Vietnamese and extracting the stamp and signature separately.
We compare the supported functions, the ability to process stamps and text, the license and the platform of several applications in practice in Table III. The advantages of our application are that it is free to use, it is designed as a web application, and it can process a document with multiple pages. Omnipage Docudirect does not support Vietnamese, but can still generate relatively accurate words. SodaPDF, ABBYY FineReader and Omnipage Docudirect cannot separately extract the images of the stamp and signature, which distinguishes them from our method and VietOCR. Besides, a fee must be paid to use them. In general, our product achieves good accuracy, supports Vietnamese, extracts stamps and signatures separately, processes standard documents and exports the output in the structured format of the administrative document form. Importantly, our application can be provided free of charge to organizations to support their digital transformation.
TABLE II
COMPARISON OF THE ACCURACY AND PROCESSING TIME
Doc ID       Application          Error of chars  Acc (%)  Time (ms)
480/Q-BGDT   Our Application      84              91.899   5724
480/Q-BGDT   ABBYY FineReader     88              91.514   4280
480/Q-BGDT   Omnipage Docudirect  436             57.955   14420
110/TB-HQT   Our Application      46              95.534   5466
110/TB-HQT   ABBYY FineReader     45              95.641   3750
110/TB-HQT   Omnipage Docudirect  175             83.009   6600
187/Q-UBND   Our Application      43              95.776   5558
187/Q-UBND   ABBYY FineReader     38              97.490   5890
187/Q-UBND   Omnipage Docudirect  262             82.694   8200
TABLE III
COMPARISON OF UTILITIES BETWEEN THE APPLICATIONS
Apps                 Support VN  Extract stamp  Structure form  License      Platform
Our application      Yes         Yes            Yes             Free         Web-app
SodaPDF              Yes         Not clear      Yes             Trial vers.  Web-app, desktop
ABBYY FineReader     Yes         Not clear      Yes             Trial vers.  Desktop
Omnipage Docudirect  No          Not clear      Yes             Trial vers.  Desktop
VI. CONCLUSION
In this research, we proposed a method and built an application for digitizing administrative documents. The research relies on the fields of computer graphics, computer vision and image processing. We used ReactJS and NodeJS combined with the OpenCV and Tesseract OCR libraries to build a web application for digitizing and managing administrative documents. The obtained results reached more than 91% accuracy, and the processing time is just a few seconds per document page (for both scanning and character recognition). The application is very useful for administrative document management and can support staff in office activities. The current system has difficulty processing handwritten documents, as they are inconsistent and do not follow any rules. However, this could be addressed with deep learning models in the future. Moreover, the system uses a predetermined structure of an administrative document, which does not cover all real-world use cases. This limitation is minor because the structure of the document form can easily be modified and improved.
While the accuracy of the obtained results is not 100%, the application combines a fast response time with sufficient accuracy, excels at printed documents, and provides an option for editing before export. With such accuracy and response time, the application can be used in large organizations and firms, where a huge number of administrative documents are processed each day. Moreover, integrating the OCR module into the web application did not degrade performance; rather, it enhanced the user experience by providing an easy-to-use interface for conversion, editing and management. With the resulting web application, the daily administrative workload can be significantly reduced by a fast and secure solution. Accuracy can be improved and a wider range of use cases covered with a larger dataset containing different forms of documents, as well as a wider variety of character dictionaries beyond those of Tesseract OCR. Further research and tests are necessary to optimize the system. In addition, we will study the processing of handwritten characters to improve the next version.
VII. ACKNOWLEDGMENT
The research work in this paper is funded by the student project of the International University, Vietnam National University of Ho Chi Minh City (HCMIU), under the ID SV20220-IT-05. We would like to express our thanks for the funding.
REFERENCES
[1] Thomas M. Siebel, “Digital Transformation: Survive and Thrive in an Era of Mass Extinction,” first edition, RosettaBooks, 2019.
[2] Ziyadin S., Suieubayeva S., and Utegenova A., “Digital Transformation in Business,” International Scientific Conference “Digital Transformation of the Economy: Challenges, Trends, New Opportunities” (ISCDTE 2019), Lecture Notes in Networks and Systems, vol. 84, pp. 408-415, https://doi.org/10.1007/978-3-030-27015-5_49, Springer, 2020.
[3] Johan, M., Tan, R., Suteja, B., and Afiany, N., “Document Digitalization and Scoring System of Students Final Project,” Jurnal Teknik Informatika Dan Sistem Informasi, 6(3), https://doi.org/10.28932/jutisi.v6i3.3126, 2020.
[4] Johan, M., Tan, R., Suteja, B., and Afiany, N., “Document digitalization through use of cloud computing technology,” International Journal of Engineering Applied Sciences and Technology, vol. 4, issue 10, ISSN 2455-2143, pp. 260-262, 2020.
[5] The State Records and Archives Department of Vietnam, https://luutru.gov.vn/home.htm, accessed Nov. 2021.
[6] Fan, L., Zhang, F., and Fan, H., “Brief review of image denoising techniques,” Vis. Comput. Ind. Biomed., https://doi.org/10.1186/s42492-019-0016-7, 2019.
[7] Ray Smith, “An Overview of the Tesseract OCR engine,” International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, pp. 629-633, 2007.
[8] M. Borg, T. Olsson, U. Franke, and S. Assar, “Digitalization of Swedish Government Agencies - A Perspective Through the Lens of a Software Development Census,” IEEE/ACM 40th International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), pp. 37-46, 2018.
[9] Chirag Patel, Atul Patel, and Dharmendra Patel, “Optical Character Recognition by Open Source OCR Tool Tesseract: A Case Study,” International Journal of Computer Applications, vol. 55, no. 10, 2012.
[10] J. Memon, M. Sami, R. A. Khan, and M. Uddin, “Handwritten Optical Character Recognition (OCR): A Comprehensive Systematic Literature Review (SLR),” IEEE Access, vol. 8, pp. 142642-142668, 2020.
[11] Koichi Kise, “Page Segmentation Techniques in Document Analysis,” Handbook of Document Image Processing and Recognition, pp. 135-175, Springer, 2014.
[12] Chung B. W., “Getting Started with Processing and OpenCV,” Pro Processing for Images and Computer Vision with OpenCV, pp. 1-37, doi.org/10.1007/978-1-4842-2775-6_1, 2017.
[13] Minh Khai Tran, Sinh Van Nguyen, Nghia Tuan To, and Marcin Maleszka, “Processing and Visualizing the 3D Models in Digital Heritage,” 13th International Conference on Computational Collective Intelligence (ICCCI 2021, Rank B), Lecture Notes in Computer Science, vol. 12876, Springer, pp. 613-625, 2021.
[14] Nguyen Van Sinh, Tran Manh Ha, and Le Son Truong, “Visualization of Medical Images Data Based on Geometric Modeling,” Lecture Notes in Computer Science, vol. 11814, ISSN 0302-9743, pp. 560-576, Springer, 2019.
[15] Vietnam Government, “Format of the administrative document,” Number 30/2020/ND-CP, March 23, 2020.
[16] Ruili Zhang, Yanming Yang, and Wenxiu Wang, “Research on document digitization processing technology,” MATEC Web of Conferences 309, 02014, CSCNS2019, pp. 1-6, doi.org/10.1051/matecconf/202030902014, 2020.
[17] ReactJS, A JavaScript library for building user interfaces, https://reactjs.org, accessed Nov. 2021.
[18] Sanchit Aggarwal, “Modern Web-Development using ReactJS,” International Journal of Recent Research Aspects, ISSN 2349-7688, vol. 5, issue 1, pp. 133-137, March 2018.
[19] Introduction to NodeJS, https://nodejs.dev/learn, accessed Nov. 2021.
[20] Christian Clausner, Stefan Pletschacher, and Apostolos Antonacopoulos, “Flexible character accuracy measure for reading-order-independent evaluation,” Pattern Recognition Letters, vol. 131, pp. 390-397, doi.org/10.1016/j.patrec.2020.02.003, 2020.
[21] Visual Studio Code for the Web, https://code.visualstudio.com, accessed Nov. 2021.
[22] Christudas B., “Install, Configure, and Run MongoDB,” Practical Microservices Architectural Patterns, 2019.
[23] Ray W. Smith, “History of the Tesseract OCR engine: what worked and what didn’t,” Document Recognition and Retrieval XX, 865802, https://doi.org/10.1117/12.2010051, 2013.