A Non-relational Approach (NoSQL and XML)
Priya Shah, Rajasi Adurkar, Shreya Desai, Swapnil Kadakia,
and Kiran Bhowmick
Abstract The advancing COVID-19 pandemic caused by the novel coronavirus has taken the world by storm due to its unprecedented nature. To increase the understanding of the disease and create countermeasures, collecting and storing data in a proper and efficient format is of utmost importance. However, this tremendous amount of data is obtained from various heterogeneous sources and is usually dynamic in nature. Traditional RDBMS might not be the most efficient choice for the sporadic and ever-changing clinical data associated with COVID patients due to its highly rigid nature. This paper uses a primary dataset acquired from COVID-19 patients as a premise to portray the inefficiencies of RDBMS and proposes two schemaless database models, NoSQL and XML databases, to offset this drawback. The intention is to propose the two most efficient technologies and delineate the findings through a sample implementation.
Keywords COVID-19 · Clinical data · Non-relational databases · NoSQL · XML
All authors have contributed equally to the paper.
P Shah · R Adurkar · S Desai · S Kadakia (B) · K Bhowmick
Department of Computer Engineering, D J Sanghvi College of Engineering, Mumbai, India
e-mail: swapnilkadakia@gmail.com
P Shah
e-mail: pshah3103@gmail.com
R Adurkar
e-mail: adurkar.rajasi562@gmail.com
S Desai
e-mail: shreyadesai1202@gmail.com
K Bhowmick
e-mail: kiran.bhowmick@djsce.ac.in
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd 2021
J Hemanth et al (eds.), Intelligent Data Communication Technologies and Internet
of Things, Lecture Notes on Data Engineering and Communications Technologies 57,
https://doi.org/10.1007/978-981-15-9509-7_40
1 Introduction
The COVID-19 pandemic is an ongoing pandemic of the novel coronavirus disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). First detected in Wuhan, China, in December 2019, the outbreak has become a matter of serious concern, with cases rising at an alarming rate in different parts of the world. The World Health Organization declared the outbreak a Public Health Emergency of International Concern on January 30, 2020, and a pandemic on March 11, 2020. Globally, as of June 14, 2020, there have been 7,690,708 confirmed cases of COVID-19, including 427,630 deaths, reported in more than 188 countries and territories [1]. The majority of people who test positive for COVID-19 experience mild to moderate respiratory illness and recover without requiring hospital treatment. However, older people and those with pre-existing medical conditions such as heart disease, diabetes, chronic respiratory disease, and cancer are more likely to become severely ill and need hospitalization.
In view of the ongoing pandemic, it has become extremely important to document all the medical records of COVID patients, thus providing deeper insight into the disease and its cure. The clinical data is procured from several heterogeneous sources. The data may be highly structured, like a patient’s demographic information or blood platelet count, or unstructured, like the descriptive text of the patient’s symptoms and the patient’s radiology images. Moreover, the data obtained is sporadic and dynamic [2]. The health of some COVID patients deteriorates within one week of illness onset [3]. The constantly changing values of test results and the heterogeneity of the data pose a challenge to the database management systems currently in use.
One of the most commonly adopted systems worldwide for clinical data storage is the Relational Database Management System (RDBMS). The data in an RDBMS is highly structured in the form of tables with relations among them, and SQL is used to communicate with the stored data. While it is possible to store some of the clinical data in this structured format, the sporadic nature of the data makes the relational model impractical when the number of required fields is high, because many fields remain empty and storage is used inefficiently [2]. The relational model also suffers from rigidity, since the data must always be in the form of tables. Since recent studies and experiments show that the n-COV data is heterogeneous in nature, a shift from traditional storage in a relational database to a non-relational database format becomes a necessity.
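The empty-field problem described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and values are hypothetical and are not taken from the paper's dataset; the sketch only shows that a fixed schema stores a NULL for every attribute a patient lacks.

```python
import sqlite3

# Hypothetical wide table: one column per possible clinical attribute.
COLUMNS = ["age", "platelets", "influenza_b", "crp", "ct_finding"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (patient_id TEXT PRIMARY KEY, "
    + ", ".join(f"{c} TEXT" for c in COLUMNS) + ")"
)

# Sporadic data: each patient has only a few of the attributes measured.
records = [
    {"patient_id": "p1", "age": "54", "platelets": "210"},
    {"patient_id": "p2", "age": "31", "crp": "12.5"},
    {"patient_id": "p3", "ct_finding": "ground-glass opacity"},
]
for rec in records:
    cols = ", ".join(rec)
    marks = ", ".join("?" for _ in rec)
    conn.execute(f"INSERT INTO patient ({cols}) VALUES ({marks})", list(rec.values()))

# Count how many attribute cells are NULL across the whole table.
null_expr = " + ".join(f"({c} IS NULL)" for c in COLUMNS)
null_cells = conn.execute(f"SELECT SUM({null_expr}) FROM patient").fetchone()[0]
total_cells = len(records) * len(COLUMNS)
print(f"{null_cells} of {total_cells} attribute cells are NULL")
```

In this toy table two thirds of the attribute cells are empty, and the share grows with every attribute that only a few patients have.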
The effective management of clinical data and the transformation of the data into a structured format for data analysis are extremely challenging issues in electronic health records development [4]; thus, a solution with non-relational databases, namely the NoSQL and XML database models, is proposed. Non-relational databases have been put forward as a flexible alternative to the traditional relational data models since they allow related data to be nested within a single data structure. In NoSQL database models, there is no need to store related data elements in tables. NoSQL allows a schemaless database design, and XML requires no transformation of natural text data and is also a subclass of document-oriented databases. NoSQL systems are well suited to workloads where query speed matters because they perform efficiently and are more scalable. Furthermore, XML technologies offer flexibility, extensibility, and a robust mark-up language to deal with the characteristics of clinical data and encode documents electronically such that the data can be delineated in a structured manner.
This paper proceeds as follows. Section 2 briefly reviews the technological research done on the COVID pandemic as well as the work done on database management systems. Section 3 presents the advantages of non-RDBMS over RDBMS in the COVID scenario. Section 4 deals with the clinical document standard set by the HL7 CDA. Sections 5 and 6 discuss the NoSQL and XML databases and their benefits over RDBMS. Finally, Sect. 7 provides the implementation and Sect. 8 states the conclusion.
2 Literature Review
2.1 COVID-19 Pandemic
Epidemiologists have found the novel coronavirus to be a tricky illness due to its emergence in various forms, ranging from mild symptoms to high risk of organ failure or death [5]. Since n-COV is caused by SARS-CoV-2, lung infection is one of the commonly observed symptoms, detected using imaging techniques such as computed tomography (CT), positron emission tomography-CT (PET/CT), lung ultrasound, and magnetic resonance imaging (MRI) [6]. Furthermore, to accelerate diagnosis and treatment, advanced artificial intelligence (AI) methods using deep learning and bioinformatics approaches have been introduced [1]. Prompt actions like identifying high-risk individuals by employing the Social Internet of Things (SIoT), which exploits the relationships among mobile devices, can be of great advantage in suppressing the contagion [7]. Also, to forecast the number of infected cases, deaths, and recoveries, prediction models such as exponential smoothing (ES), least absolute shrinkage and selection operator (LASSO), and linear regression (LR) have been implemented [8]. In addition, natural language processing (NLP) methods have been applied to social media to unveil various issues regarding public opinion surrounding COVID-19 [9]. Surveys adopting an ML approach indicate that this pandemic has affected mental health, learning styles, and activities in countries like India due to a complete shutdown from March 24, 2020 [10]. According to our research to date (June 21, 2020), not much exploration has been performed on the storage of the enormous data generated from n-COV.
2.2 Non-relational Databases
One primary problem that industries all across the world are facing due to the advancement and augmentation of technology is the storage and management of the millions of data items that are produced at extremely short intervals. Data is being generated and processed more actively than ever, and data generation will continue to rise in volume at an exponential rate in the near future [11]. NoSQL databases have been developed to deal with such extensive data, offering adaptivity, join-free queries, and flexibility through the absence of a fixed schema [12]. NoSQL database systems emerged alongside the major Internet companies, such as Google, Amazon, and Facebook. These companies faced a major challenge in managing copious amounts of data that traditional RDBMS could not deal with [13]. NoSQL databases assist a variety of activities, including exploratory and predictive analysis, ETL-style data transformation, and non-mission-critical OLTP (for instance, handling long-running or inter-organization transactions) [14]. In the current scenario, e-commerce builds its applications on platforms like Magento, Zen Cart, PrestaShop, and Spree to run thriving online stores. Customer-facing e-commerce apps use them as well, so an extensible database is required for storing the data. E-commerce applications therefore widely use NoSQL databases to handle extensive business applications, since these provide competent storage access and processing with horizontal scaling and greater portability than RDBMS [15]. Oracle NoSQL Database and MongoDB are two prominent NoSQL databases. Oracle offers a better distribution model and is preferred for implementing storage nodes intended to provide adaptability and accessibility in warehouse and distribution models. Conversely, MongoDB relies on sharding for scaling and assigns a dedicated server to hold portions of the data when the data grows continuously [16].
3 Advantages of Non-RDBMS Over RDBMS
See Table 1.
Table 1 Difference between RDBMS and non-RDBMS
RDBMS | Non-RDBMS
Table-oriented with fixed, predetermined, and restrictive schema | Document-oriented databases that are schemaless
Can be only scaled vertically which is limited by budget | Can be scaled horizontally to provide more resilience and lower costs
A very rigid schema and making regular changes is not feasible | It has no constraints and provides adaptability
Can handle data coming in low velocity | Can handle data coming in high velocity
3.1 Disadvantages of RDBMS
One of the main drawbacks of RDBMS is the cost of maintaining complex software. In addition to the burden of managing a high volume of data, its schema definition is rigid. Lastly, data processing slows it down, and recovering data is difficult.
4 Clinical Document Architecture
The collection of massive amounts of clinical data during COVID-19 is not enough; it has to be stored in an efficient format for future reference. The HL7 Version 3 Clinical Document Architecture (CDA) is a document mark-up standard that specifies the structure and semantics of “clinical documents” for exchange between healthcare providers and patients [17]. A CDA document is an XML document consisting of a header and a body.
The CDA is beneficial because of its re-usability, flexibility, and conciseness. A clinical document contains data such as pathology reports, imaging reports, symptom descriptions, and other multimedia content, all integral components of electronic health records (EHRs), and has the following six characteristics, set forth by HL7:
• Persistence: A clinical document remains unaltered for a period of time defined by local and regulatory requirements [17].
• Stewardship: A clinical document is maintained by a person or organization entrusted with its care [17].
• Potential for authentication: A clinical document is a collection of information that is intended to be legally attested [17].
• Context: A clinical document establishes the default context for its contents, including the creator of the document and the patient’s identity [17].
• Wholeness: Authentication of a clinical document applies to the whole document and not just to certain parts of it [17].
• Human readability: A clinical document is easily read by humans or can be browsed on devices [17].
5 NoSQL
NoSQL databases, where NoSQL stands for ‘Not Only SQL’, came into existence due to the limitations of traditional relational database systems. Though they have been in existence for many years, they have only recently gained popularity in the era of cloud and big data [12]. NoSQL enables agile storage and pre-processing along with swift utilization. Since these capabilities are essential in the management of COVID-19 data, it is one of the best systems to handle unstructured, semi-structured, or structured data.
Table 2 A table in NoSQL
Patient ID Key Value1 Subkey1 Subkey Value1 …
According to their data models, NoSQL databases are classified as “key-value stores,” “column-oriented stores,” “document-oriented stores,” and “graph databases” [18].
• Key-value store: It is the most fundamental data model, where data is stored as key-value pairs.
• Column-oriented store: In this data model, data is stored by column rather than in the traditional row-wise manner.
• Document store database: It provides an effective way to manage document-oriented information in a semi-structured data format. It has a layer that manages the association between these documents.
• Graph store database: This type of data model records data in a graph structure to depict the relationships between data by storing data in the form of nodes, edges, and properties.
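As a rough illustration, the four data models can be mimicked with plain Python structures. This is only a sketch of the shape of the data in each model, not of how any particular NoSQL product stores it; all identifiers and values are made up.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"patient:p1": '{"age": 54, "fever": true}'}

# Column-oriented store: values of one attribute are kept together.
column_store = {
    "patient_id": ["p1", "p2", "p3"],
    "age": [54, 31, 67],
}

# Document store: one self-describing, nested record per patient.
document_store = [
    {"_id": "p1", "age": 54, "symptoms": [{"name": "fever", "temp_c": 38.6}]},
    {"_id": "p2", "age": 31},  # no symptoms recorded, so the field is simply absent
]

# Graph store: nodes, edges, and properties on both.
nodes = {"p1": {"type": "patient"}, "fever": {"type": "symptom"}}
edges = [("p1", "HAS_SYMPTOM", "fever", {"since_days": 3})]

# Reading a whole column touches only that attribute's values.
mean_age = sum(column_store["age"]) / len(column_store["age"])

# Following edges answers relationship queries directly.
symptoms_of_p1 = [dst for src, rel, dst, _ in edges
                  if src == "p1" and rel == "HAS_SYMPTOM"]
print(mean_age, symptoms_of_p1)
```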
Table 2 depicts the general representation of the data of COVID patients, along with their commonly observed symptoms, in NoSQL.
Here, the dynamic nature of the sporadic symptoms is organized without being confined to a predefined structure. Thus, it helps in decreasing redundancy and, in turn, improving efficiency.
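The key, value, and subkey layout of Table 2 maps naturally onto a nested document. The sketch below uses Python dictionaries with invented patient IDs and symptom names; the helper `add_observation` is hypothetical and only illustrates that new keys and subkeys can be attached to one patient without touching the others.

```python
from collections import defaultdict

# patient_id -> key -> subkey -> value (mirrors the columns of Table 2)
patients = defaultdict(lambda: defaultdict(dict))

def add_observation(patient_id, key, subkey, value):
    """Attach a subkey/value pair under a key for one patient only."""
    patients[patient_id][key][subkey] = value

add_observation("p1", "fever", "temperature_c", 38.6)
add_observation("p1", "fever", "duration_days", 3)
add_observation("p1", "cough", "type", "dry")
add_observation("p2", "fatigue", "severity", "mild")

# Each patient carries only the keys that were actually observed.
for pid, doc in patients.items():
    print(pid, {k: dict(v) for k, v in doc.items()})
```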
5.1 Advantages of Using NoSQL for Storing COVID-19 Data
(1) Scalability: The burgeoning number of COVID-19 cases and the concurrent increase in healthcare data demand an expandable EHR system, as most current systems are based on relational databases that restrict scalability. The number of COVID-19 cases has increased exponentially and continues to rise. Thus, NoSQL database systems are critical, as they allow scaling up to large datasets without any changes to the overall structure of the data or architecture. Hardware requirements and expenses can grow linearly as storage demands grow. Thus, in a time of economic crisis, NoSQL makes cost-efficient scaling possible and lets already encumbered medical systems avoid large preliminary investments in hardware.
Conventional relational database systems expand their capacity by acquiring more expensive and powerful servers, whereas NoSQL database systems are based on a shared-nothing approach. In a shared-nothing architecture, each server has its own resources and does not share RAM, processors, or storage with the others. Thus, a large number of read/write operations can be made feasible through horizontal scaling, that is, by distributing data and operations over many servers [19]. Hence, the capacity to store accurate data about a large number of patients can be increased by dynamically adding more commodity servers without any reconfiguration or degradation in performance.
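Horizontal scaling over shared-nothing servers can be sketched as hash-based partitioning of patient records. The example below is a simplified illustration, not the mechanism of any specific NoSQL system; the shard count, the `shard_for` helper, and the record fields are assumptions made for the example.

```python
import hashlib

def shard_for(patient_id: str, num_shards: int) -> int:
    """Map a patient ID to a shard using a stable hash."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

class Cluster:
    """Each shard is an independent dict, standing in for one server."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, patient_id: str, record: dict) -> None:
        self.shards[shard_for(patient_id, len(self.shards))][patient_id] = record

    def get(self, patient_id: str) -> dict:
        return self.shards[shard_for(patient_id, len(self.shards))][patient_id]

cluster = Cluster(num_shards=4)
for i in range(1000):
    cluster.put(f"patient-{i}", {"id": i})

# Records spread across shards, and each read touches exactly one shard.
print([len(s) for s in cluster.shards])
print(cluster.get("patient-42"))
```

With plain modulo hashing, changing the number of shards relocates most keys, so real systems generally rely on schemes such as consistent hashing or range-based chunks when servers are added.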
(2) Flexibility: On studying the enormous amount of data on n-COV, it can be inferred that, due to its sporadicity and heterogeneity, a rigid data storage system like RDBMS should be supplanted by a much more flexible model. To avoid redundancy, the user is compelled to prioritize flexibility in the management of such disparate data. NoSQL databases provide flexibility in schema development by avoiding the traditional table-like format for data storage. Given that the COVID-19 data is voluminous and dynamic in nature, NoSQL proves to be efficient for quick iterations and frequent code pushes [12]. Due to the absence of a predefined structure, NoSQL makes it easy to add and change data without regard to the structure or schema of the database. Thus, it supports ad hoc schema changes that are often difficult and complex to carry out in an RDBMS [12].
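The difference in effort for an ad hoc schema change can be sketched as follows. The relational side uses Python's built-in sqlite3 module, and the document side uses plain dictionaries as a stand-in for a schemaless store; the attribute `loss_of_smell` and all values are hypothetical.

```python
import sqlite3

# Relational: the new attribute must be added to the schema first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patient_id TEXT PRIMARY KEY, fever INTEGER)")
conn.execute("INSERT INTO patient VALUES ('p1', 1), ('p2', 0)")
try:
    conn.execute("INSERT INTO patient (patient_id, loss_of_smell) VALUES ('p3', 1)")
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True  # unknown column: the schema has to change first
conn.execute("ALTER TABLE patient ADD COLUMN loss_of_smell INTEGER")
conn.execute("INSERT INTO patient (patient_id, loss_of_smell) VALUES ('p3', 1)")
# Every existing row now carries a NULL for the new column.
existing = conn.execute(
    "SELECT COUNT(*) FROM patient WHERE loss_of_smell IS NULL"
).fetchone()[0]

# Schemaless: a new field is just another key on the documents that need it.
documents = [{"_id": "p1", "fever": True}, {"_id": "p2", "fever": False}]
documents.append({"_id": "p3", "loss_of_smell": True})
docs_with_field = [d["_id"] for d in documents if "loss_of_smell" in d]

print(insert_failed, existing, docs_with_field)
```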
(3) High Functionality: COVID-19 data needs to be analyzed for the following reasons:
• To comprehend accurate responses: With the right analytics capabilities, healthcare professionals can answer queries such as where the next cluster is most likely to arise, which demographic is most vulnerable, and how the virus may mutate over time.
• To see the inconspicuous: Heterogeneous data from various sources has led to novel sharing of visualizations and messages to enlighten the public and to track the situation to understand its gravity.
NoSQL databases offer many highly functional APIs and data types that are specially designed for each of their corresponding data models [20]. New application archetypes can be more easily supported with the help of NoSQL. The extensibility of NoSQL databases enables a single database to serve both transactional and analytical workloads, as opposed to SQL databases, which require a separate data warehouse to support analytics. Since NoSQL databases were developed during the era of cloud computing, they have adapted quickly to the automation that is part of the cloud. NoSQL also makes it easier to deploy databases at scale in a way that supports microservices.
NoSQL databases support polyglot persistence, which means combining various types of NoSQL databases depending on the requirements of a specific healthcare system. For example, a hospital might store most of its data in a document database like MongoDB but supplement it with a graph database to capture the innate relationships between patients and symptoms.
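Polyglot persistence can be pictured as two stores that share patient identifiers. In the sketch below, a dictionary stands in for the document database and an adjacency map stands in for the graph database; the class name `PolyglotStore` and the data are hypothetical, and no real database client is involved.

```python
from collections import defaultdict

class PolyglotStore:
    """Toy facade over a document store and a graph store."""

    def __init__(self):
        self.documents = {}                     # stand-in for a document database
        self.symptom_graph = defaultdict(set)   # symptom -> patient IDs (graph edges)

    def save_patient(self, patient_id, document):
        self.documents[patient_id] = document
        # Mirror only the relationships into the graph side.
        for symptom in document.get("symptoms", []):
            self.symptom_graph[symptom].add(patient_id)

    def patients_sharing_symptoms(self, patient_id):
        """Relationship query answered from the graph side."""
        related = set()
        for symptom in self.documents[patient_id].get("symptoms", []):
            related |= self.symptom_graph[symptom]
        related.discard(patient_id)
        return related

store = PolyglotStore()
store.save_patient("p1", {"age": 54, "symptoms": ["fever", "cough"]})
store.save_patient("p2", {"age": 31, "symptoms": ["cough"]})
store.save_patient("p3", {"age": 67, "notes": "asymptomatic"})
print(store.patients_sharing_symptoms("p1"))
```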
(4) Security: Data breaches are a major concern that needs to be taken into consideration while selecting a database system. A database system needs to be extremely secure and provide the four features of security: authorization, authentication, encryption, and auditing. MongoDB, a very popular non-relational database, provides these security features through the MongoDB Enterprise Advanced service. Advanced security controls like LDAP integration and AWS PrivateLink can be integrated with MongoDB Enterprise Advanced [21].
• Authorization: MongoDB governs an entity’s access to the database using Role-Based Access Control (RBAC).
• Authentication: MongoDB supports authentication mechanisms like Kerberos and LDAP for validating an entity.
• Encryption: Administrators can encrypt data while it is in transit over the network or at rest in storage and backups.
• Auditing: MongoDB Enterprise Advanced provides an auditing framework that can log all the actions (DDL and DML) and accesses made to the database.
(5) Retrieval of archived data: The emergence of big data technologies that handle gigantic volumes of structured and unstructured data at low cost has made them well suited to serve as data archival solutions. MongoDB is designed to meet long-term storage needs, provide effective and prompt search of content by keyword or full text, and offer cost-efficient services. For instance, COVID data produces a huge amount of content each day: X-ray images, symptom details, doctors’ comments, and chat transcripts. Not only would an institution produce such varied content, but it would also need to archive the content for long-term retention and to serve as precedent. Leveraging MongoDB can improve the agility of the organization. The challenges associated with the velocity, volume, and variety of data can be addressed in a swift, elegant, and agile manner, thereby making MongoDB a scalable back-end data archival solution for such clinical data.
6 XML
XML databases are generally used to store information that comes in varied forms. These databases are document-centric and are a subset of NoSQL databases. XML documents are marked by heterogeneity of data records, extensibility (different data types are allowed in a single document), larger-grained data, and flexibility in size [22]. Similar characteristics are observed in the data obtained from COVID patients (Fig. 1).
Raw data obtained is highly irregular and contains a lot of mixed content. It consists of measures and values that are structured and narrative descriptions that are unstructured. It is crucial to gather all this information and convert it into a computer-processable structure for storage and processing. Data can be transformed into knowledge only if it is understandable by both machines and humans. Hence, the use of a tree data structure to represent this information is recommended.
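A tree representation along these lines can be sketched with a small recursive node type. The node labels and values below are invented for illustration and do not reproduce the structure in the cited figure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class Node:
    """One node of a clinical-note tree: a label, an optional value, children."""
    label: str
    value: Optional[Union[str, float]] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def leaves(self):
        """Yield (path, value) for every leaf, depth first."""
        if not self.children:
            yield self.label, self.value
        for child in self.children:
            for path, value in child.leaves():
                yield f"{self.label}/{path}", value

note = Node("clinical_note")
fever = note.add(Node("fever"))
fever.add(Node("description", "intermittent fever since admission"))  # narrative text
fever.add(Node("temperature_c", 38.6))                                # numeric value
note.add(Node("platelet_count", 210.0))

for path, value in note.leaves():
    print(path, "=", value)
```

Structured values and free-text descriptions sit side by side as leaves, so a concept can gain or lose children without affecting the rest of the tree.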
As seen in Fig. 1, clinical notes in their raw form are situated at the topmost level, followed by the medical concepts. These consist of narrative text descriptions and their related numeric values. XML databases are broadly classified into two categories:
• XML-enabled database: Data is stored in tables consisting of rows and columns.
• Native XML database: Document-centric databases. Data is stored as a list of files.
XML databases are a relatively new data storage technology and overcome certain drawbacks of relational databases. The processing time required to execute queries is found to be considerably lower in XML databases than in RDBMS [23]. They support a hierarchical structure of data that allows a high level of granularity and is also flexible.
Fig. 1 A general structure for clinical notes [23]
6.1 XML Document Architecture
The HL7 clinical document architecture is a standard approach for storing and exchanging healthcare information. The header of the document is consistent across all clinical documents. The COVID-19 narrative data is the primary content stored in the header. The CDA body holds the human-readable content. The data collected from patients tested for the coronavirus, such as the patient’s ID and age, is incorporated within the header. Human-readable narrative data blocks, ranging from a patient’s counts of red blood cells, plasma, and platelets to deficiencies in individual nutrients, make up the whole body. This helps in satisfying CDA standards, as it excludes encoding.
Flexibility is the most conspicuous feature of this document, where
• Each COVID patient record can have zero to many medical records associated with it.
• Each medical record can have zero to many associated symptoms.
• Each symptom can comprise zero or more properties, which can be textual as well as numeric.
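The zero-to-many nesting described above can be sketched with Python's standard xml.etree.ElementTree module. The element and attribute names (`patientRecord`, `header`, `body`, `medicalRecord`, `symptom`, `property`) are placeholders chosen for this example and are not the element names defined by the HL7 CDA schema.

```python
import xml.etree.ElementTree as ET

def build_patient_record(patient_id, age, medical_records):
    """Build a header/body document; every nested level may be empty."""
    root = ET.Element("patientRecord")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "id").text = patient_id
    ET.SubElement(header, "age").text = str(age)
    body = ET.SubElement(root, "body")
    for record in medical_records:                      # zero to many records
        rec_el = ET.SubElement(body, "medicalRecord", date=record["date"])
        for symptom in record.get("symptoms", []):      # zero to many symptoms
            sym_el = ET.SubElement(rec_el, "symptom", name=symptom["name"])
            for key, value in symptom.get("properties", {}).items():
                # zero or more properties, textual or numeric
                ET.SubElement(sym_el, "property", name=key).text = str(value)
    return root

record = build_patient_record(
    "p1", 54,
    [
        {"date": "2020-06-01", "symptoms": [
            {"name": "fever",
             "properties": {"temperature_c": 38.6, "pattern": "intermittent"}},
            {"name": "fatigue"},
        ]},
        {"date": "2020-06-03"},
    ],
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```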
Dealing with XML as large strings is ineffective, whereas native XML databases are customized for its storage and querying. They are specially adapted to XML data and are highly capable of storing, maintaining, and querying XML documents. Relational databases fail to offer such adaptability and cannot provide an effective way of handling, storing, recovering, and analyzing such irregular data. Consequently, XML databases open a promising path for handling medical data that is voluminous and sporadic (Fig. 2).
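The contrast between treating XML as one large string and querying it by structure can be sketched with the limited XPath support in Python's xml.etree.ElementTree. A native XML database would typically expose a richer query language such as XQuery; the in-memory document and its element names below are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<patients>
  <patient id="p1">
    <symptom name="fever"><property name="temperature_c">38.6</property></symptom>
    <symptom name="cough"/>
  </patient>
  <patient id="p2">
    <note>No fever reported.</note>
  </patient>
</patients>
"""

# String search cannot tell a symptom from a mention inside narrative text.
string_hits = XML_TEXT.count("fever")

# A structural query matches only patients that have a fever symptom element.
root = ET.fromstring(XML_TEXT)
fever_patients = [p.get("id") for p in root.findall("./patient[symptom]")
                  if p.find("symptom[@name='fever']") is not None]

# Numeric properties can be read back and compared as numbers.
temps = [float(el.text) for el in root.findall(".//property[@name='temperature_c']")]
print(string_hits, fever_patients, temps)
```

The substring search counts the mention in the second patient's note as well, while the structural query returns only the patient whose record actually contains a fever symptom.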
7 Implementation
Scenario: Consider the following contrived dataset consisting of patients and their various COVID-19 attributes. Due to the sporadic nature of this data, not every patient will have all the attributes populated. It is represented as follows in the RDBMS:
As can be seen, for patient “04235e8a80d92ed,” for example, the attribute influenza B is null. Due to the rigid structure of RDBMS, that particular column cannot be omitted for the patient, which makes the database bulky, inefficient, and full of unwanted null values. Moreover, considering irregular addition and reduction of different attributes,