Chapter 8: Practical & Research based Topics
2.1.1 Terminology used in NoSQL and RDBMS
RDBMS: Partition, Table, Row, Column
NoSQL: Shard, Document root element (JSON/XML), Aggregate, Attribute/field
2.1.2 Databases used in NoSQL
The following types of database are used in NoSQL:
Key-Value database: In this kind of database, all records are stored as key-value pairs. The key is unique and the value can be a whole line of data. This allows data to be retrieved quickly by its key, and the key can also be used to refer to the data later. Examples of key-value pairs are:
(a) URL and its web contents
(b) Account no. and Account holder name
(c) Page no. and contents
(d) Roll no. and student name
Data can be referred to by its key, but there is no implicit ordering. A key can later be changed, provided it remains unique for its record.
The following are examples of key-value stores used in industry:
Amazon's Dynamo is an example of a NoSQL key-value database. It runs on commodity hardware with a standard mode of operation, in a loosely coupled, service-oriented architecture of hundreds of services. Because it runs on commodity hardware, it is scalable. Objects are stored as versioned data.
To maintain consistency during updates, Dynamo uses a quorum-like technique and a protocol for decentralized replica synchronization.
Table 2.1 shows the problems that Dynamo handles as a key-value database.
In Dynamo, all nodes have equal responsibilities; there are no distinguished nodes that perform special roles. In addition, it favours “decentralized peer-to-peer techniques over the centralized control” because the latter has “resulted in outages” in the past at Amazon. Storage hosts added to the system may have heterogeneous hardware, which Dynamo must take into account to distribute work proportionally “to the capabilities of the individual servers”. As Dynamo is operated in Amazon's own administrative domain, the environment and all nodes are considered non-hostile, and thus no security-related features such as authorization and authentication are implemented in Dynamo.
As Dynamo is meant to be “always writable (i.e. a data-store that is extremely available for writes)”, conflict resolution must happen during reads. If application developers do not want to implement their own business-logic-specific reconciliation strategy, Dynamo provides simple strategies that they can use, such as “last write wins”, a timestamp-based reconciliation.
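The “last write wins” strategy can be illustrated with a minimal sketch. The `Version` class, its `timestamp` field and the sample shopping-cart values are assumptions made for this example; they are not Dynamo's own data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: bytes
    timestamp: float  # assumed: time at which this version was written


def last_write_wins(versions):
    """Timestamp-based reconciliation: keep the version written last."""
    return max(versions, key=lambda v: v.timestamp)


# Two conflicting versions of the same key, discovered during a read.
conflicting = [
    Version(b"cart: book", 100.0),
    Version(b"cart: book, pen", 105.5),
]
winner = last_write_wins(conflicting)
```

The sketch simply discards the older version, which is why this strategy suits applications that can tolerate losing a concurrent update.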
The system interface of Dynamo consists of two operations that clients use to interact with it:
get(key), returning a list of objects and a context.
put(key, context, object), with no return value.
The get operation may return more than one object for a key, because several versions can be stored under the same key. It also returns a context that carries system metadata such as the object version. The put operation takes this context as a parameter.
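The shape of this interface can be sketched with a single-process stand-in. This is only an illustration under simplifying assumptions: the class `ToyDynamoStore` is hypothetical, and the context is modelled as the set of version ids the client has seen, which is far simpler than the metadata a real Dynamo context carries.

```python
import itertools


class ToyDynamoStore:
    """Single-process stand-in for the get/put interface (not Dynamo's code)."""

    def __init__(self):
        self._versions = {}             # key -> {version_id: object}
        self._ids = itertools.count(1)  # source of fresh version ids

    def get(self, key):
        """Return (list of stored objects, context naming their versions)."""
        versions = self._versions.get(key, {})
        context = frozenset(versions)
        return list(versions.values()), context

    def put(self, key, context, obj):
        """Store obj, superseding exactly the versions named in context."""
        versions = self._versions.setdefault(key, {})
        for seen in context:
            versions.pop(seen, None)
        versions[next(self._ids)] = obj


store = ToyDynamoStore()
store.put(b"cart", frozenset(), b"book")   # two writers that both started
store.put(b"cart", frozenset(), b"pen")    # from an empty context
objects, ctx = store.get(b"cart")          # both versions are returned
store.put(b"cart", ctx, b"book+pen")       # the reconciled write replaces them
merged, _ = store.get(b"cart")
```

Passing the context back on put is what lets the store tell a reconciling write apart from yet another concurrent one.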
Key and object values are not interpreted by Dynamo but handled as “an opaque array of bytes”. The key is hashed by the MD5 algorithm to determine the storage nodes responsible for this key-value pair.
Table 2.1: Amazon's Dynamo
To provide incremental scalability, Dynamo uses consistent hashing to dynamically partition data across the storage hosts that are present in the system at a given time. To ensure scalability and availability of data, Dynamo replicates each data item N times, where N is a replication factor that can be configured “per-instance” of Dynamo.
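The combination of MD5 hashing, consistent hashing and a replication factor N can be sketched as follows. The `Ring` class, the node names and the rule of walking clockwise to the next N nodes are assumptions for illustration; the sketch also ignores refinements such as weighting hosts by their capacity.

```python
import bisect
import hashlib


def md5_position(data: bytes) -> int:
    """Map bytes to a ring position using the 128-bit MD5 digest."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


class Ring:
    """Toy consistent-hashing ring with one position per node."""

    def __init__(self, nodes, n_replicas):
        self.n = n_replicas
        self.points = sorted((md5_position(name.encode()), name) for name in nodes)

    def preference_list(self, key: bytes):
        """Walk clockwise from the key's position and take the first N nodes."""
        start = bisect.bisect_right(self.points, (md5_position(key), ""))
        count = min(self.n, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], n_replicas=3)
owners = ring.preference_list(b"cart:42")   # three distinct nodes hold this key
```

Because a node only owns the arc of the ring that ends at its position, adding or removing a host moves just the keys on the neighbouring arc, which is what makes the scaling incremental.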
Project Voldemort is a key-value store that was initially developed for, and is still used at, LinkedIn. It offers three operations:
get(key), returning a value object
put(key, value)
delete(key)
The keys and values used in its data store can be complex and consist of lists and maps. The project claims that, compared to relational databases, its data store has a simple design and its key-value API is not complex. Fig. 2.1 shows the architecture of its design.
Fig. 2.1: Design diagram of Voldemort
Every layer of the architecture performs its own function for the get, put and delete operations. For example, the put operation is invoked on the routing layer, which is responsible for distributing the operation to all nodes in parallel and for handling possible errors.
Project Voldemort permits namespaces for key-value pairs known as “stores”, in which keys are unique. While each key is associated with exactly one value, values are allowed to contain lists and maps as well as scalar values. Operations in Project Voldemort are atomic to exactly one key-value pair. Once a get operation is executed, the value is streamed from the server via a cursor. The documentation of Project Voldemort considers this approach not to work well with values consisting of large lists “which should be kept on the server and streamed lazily via cursor”; in this case, breaking the query into sub-queries is seen as more efficient.
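The notion of stores as separate namespaces, with values that hold lists and maps, can be sketched in a few lines. `ToyStore` and the sample data are hypothetical and do not reproduce Project Voldemort's real client API.

```python
class ToyStore:
    """One namespace ("store") in which keys are unique."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    def get(self, key):
        """Return the value for key, or None if the key is absent."""
        return self._data.get(key)

    def put(self, key, value):
        """Set one key-value pair; no other pair is touched."""
        self._data[key] = value

    def delete(self, key):
        """Remove key and report whether it was present."""
        return self._data.pop(key, None) is not None


users = ToyStore("users")
sessions = ToyStore("sessions")

# The same key may appear in two stores without clashing.
users.put("alice", {"emails": ["alice@example.org"], "age": 30})
sessions.put("alice", ["session-1", "session-2"])
```

Each call affects exactly one key-value pair, which mirrors the atomicity guarantee described above.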
Project Voldemort supports the data types shown in Table 2.2.
Table 2.2: JSON Serialization Format Data Types
The data type definition for the JSON (JavaScript Object Notation) serialization format shown in Table 2.2 allows Project Voldemort to check values and store them efficiently, although the data types of values cannot be used for data queries and requests. To prevent data from being invalidated by a redefinition of value data types, Project Voldemort stores a version along with the data, allowing schema migrations.
Tokyo Cabinet and Tokyo Tyrant form a data store built on key-value pairs. Tokyo Cabinet is the core library for data persistence and accesses data through B+ tree structures or hash indexes. The data store compresses pages with the LZW algorithm, which gives a good compression ratio, and partitions data automatically with an approach similar to that of SQL databases. When keys are ordered, lookups can also return the results that match. The Tokyo suite is actively developed, well documented and offers high performance: 1 million records can be stored in 0.7 seconds using the hash-table engine and in 1.6 seconds using the B-tree engine.
Document database: Document databases (sometimes known as document-oriented databases) are very much like key-value databases, except that the value associated with a key contains structured or semi-structured data, which is referred to as a document. In contrast to a key-value database, you can query against the structure of the document as well as the elements inside that structure, and return only parts of the document as the result of the query. An example of a document-oriented database is a book database in which the key is the book title and the value is the book's metadata expressed as an XML or JSON document, as in Fig. 2.2.
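The difference from a plain key-value lookup can be sketched with the book example. The collection, the field names and the dotted-path helper functions below are assumptions for illustration and do not correspond to any particular product's query language.

```python
import json

# Hypothetical book collection keyed by title; each value is a JSON document.
books = {
    "Example Title": json.loads(
        '{"author": {"name": "A. Writer"}, "year": 2011, "tags": ["nosql"]}'
    ),
    "Another Title": json.loads('{"author": {"name": "B. Author"}, "year": 2009}'),
}


def select(document, path):
    """Follow a dotted path into a document and return only that part."""
    part = document
    for field in path.split("."):
        if not isinstance(part, dict) or field not in part:
            return None
        part = part[field]
    return part


def find(collection, path, expected):
    """Return the keys of all documents whose field at path equals expected."""
    return [key for key, doc in collection.items() if select(doc, path) == expected]


author = select(books["Example Title"], "author.name")  # a part of the document
from_2011 = find(books, "year", 2011)                   # query on the structure
```

A key-value store could only hand back the whole value for "Example Title"; here the query reaches inside the value.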
Fig. 2.2: Document database
Apache CouchDB and MongoDB are the two leading representatives of the class of document databases.
Apache CouchDB: CouchDB, whose name is often expanded as “Cluster of unreliable commodity hardware”, is a document database written in Erlang. CouchDB can be considered a descendant of Lotus Notes, whose main developer Damien Katz worked at IBM before he later initiated the CouchDB project on his own. Many ideas from Lotus Notes can be found in CouchDB: documents, views, distribution, and replication between servers and clients. The approach of CouchDB is to build such a document database from scratch with technologies of the web space such as Representational State Transfer (REST), JavaScript Object Notation (JSON) as a data interchange format, and the ability to integrate with infrastructure components such as load balancers and caching proxies.
CouchDB can be briefly characterized as a document database that is accessible via a RESTful HTTP interface and contains schema-free documents in a flat address space. For these documents, JavaScript functions select and aggregate documents and representations of them in a MapReduce manner to build views of the database, which also get indexed. CouchDB is distributed and able to replicate incrementally between server nodes as well as between clients and servers. Multiple concurrent versions of the same document are allowed in CouchDB, and the database is able to detect conflicts and manage their resolution, which is delegated to client applications.
The most notable use of CouchDB in production is Ubuntu One, which provides a cloud storage and replication service for Ubuntu Linux. CouchDB is also part of the BBC's new web application platform. Moreover, some blogs, wikis, social networks, Facebook apps and smaller web sites use CouchDB as their datastore.
CouchDB databases are addressed via a RESTful HTTP interface that allows reading and updating of documents.
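How reading and updating map onto HTTP can be sketched with the Python standard library. The server address, the database name `library` and the document id are assumptions; the requests are only constructed here, because sending them requires a running server.

```python
import json
import urllib.request

BASE = "http://localhost:5984"   # assumed address of a local CouchDB server
DB = "library"                   # hypothetical database name


def read_request(doc_id):
    """Build a GET request that reads one document."""
    return urllib.request.Request(f"{BASE}/{DB}/{doc_id}", method="GET")


def write_request(doc_id, document):
    """Build a PUT request that stores a document from a JSON body."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/{DB}/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = write_request("book-1", {"title": "Example Title", "year": 2011})
# urllib.request.urlopen(req) would send the request to the server.
```

Because every document has its own URL, ordinary web infrastructure such as load balancers and caching proxies can sit in front of the database.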
As its name suggests, the document database CouchDB is built on documents, which consist of fields that each have a key (name) and a value. One limitation is that documents in the database cannot be nested.
Functioning: Besides fields, documents may also have attachments, and CouchDB maintains some metadata such as a unique identifier and a sequence id for every document. The document id is a 128-bit value; the revision number is a 32-bit value determined by a hash function. CouchDB considers itself a semi-structured database. Whereas relational databases are designed for structured and interdependent data, and key-value stores operate on uninterpreted, isolated key-value pairs, document databases like CouchDB pursue a third path: data is contained in documents that do not correspond to a fixed schema (schema-free) but have some inner structure known to applications as well as to the database itself.
This approach has two benefits. First, there is no need for schema migrations, which cause a lot of effort in the relational database world. Second, compared to key-value stores, the data can be evaluated in more sophisticated ways (e.g. in the calculation of views). In the web application field there are many document-oriented applications, which CouchDB addresses because its data model fits this class of applications and because documents can be iteratively extended or changed with less effort than in a relational database. Each CouchDB database consists of exactly one flat, non-hierarchical namespace that contains all the documents, each with a unique identifier calculated by CouchDB.
A CouchDB server can host more than one of these databases.
Documents were once stored as XML documents; nowadays they are serialized to disk in a JSON-like format. Documents are indexed in B-trees, which index the document's id and revision number (sequence id).
Views: Views are CouchDB's way to query, present, combine and report on semi-structured document data. A typical use of views is to separate different kinds of documents (such as journal posts, comments and authors in a weblog system) that are not distinguished by the database itself, as all of them are simply documents to it.
Views are defined by JavaScript functions that neither change, save nor cache the underlying documents but only present them to the requesting user or client application. Thus documents as well as views (which are really special documents, referred to as design documents) can be replicated, and views do not interfere with replication. Views are calculated on demand. There is no limit on the number of views in one database or on the number of representations of documents by views.
The JavaScript functions that define a view are called map and reduce and have similar responsibilities as in Google's MapReduce approach. The map function gets a document as a parameter, can do any calculation and may emit arbitrary data for it if it matches the view's criteria; if the given document does not match these criteria, the map function emits nothing. Examples of emitted data for a document are the document itself, extracts from it, and references to or contents of other documents (e.g. semantically related ones such as the comments of a user in a forum, journal or wiki).
The data structure emitted by the map function consists of the document id, a key and a value, the latter two chosen by the map function. Documents get sorted by the key, which does not have to be unique but can occur for more than one document; as a sorting criterion, the key can be used, for example, to define a view that sorts journal posts descending by date for a blog's home page. The value emitted by the map function is optional and may contain arbitrary data. The document id is set by CouchDB implicitly and represents the document that was given to the emitting map function as an argument. After the map function has been executed, its results are passed to an optional reduce function, which can perform some aggregation on the view.
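The map and reduce steps can be sketched as follows. CouchDB views are written in JavaScript; Python is used here only to keep the illustration self-contained, and the `type`, `date` and `title` fields as well as the `build_view` helper are assumptions for this example.

```python
def map_posts_by_date(doc):
    """Emit (key, value) for journal posts; emit nothing for other documents."""
    if doc.get("type") == "post":
        yield doc["date"], doc["title"]


def build_view(documents, map_fn, reduce_fn=None, descending=False):
    """Run map_fn over all documents and sort the emitted rows by key."""
    rows = [
        {"id": doc["_id"], "key": key, "value": value}  # id is attached implicitly
        for doc in documents
        for key, value in map_fn(doc)
    ]
    rows.sort(key=lambda row: row["key"], reverse=descending)
    if reduce_fn is not None:
        return reduce_fn([row["value"] for row in rows])  # optional aggregation
    return rows


docs = [
    {"_id": "p1", "type": "post", "date": "2010-03-01", "title": "First"},
    {"_id": "c1", "type": "comment", "text": "Nice"},
    {"_id": "p2", "type": "post", "date": "2010-05-09", "title": "Second"},
]
home_page = build_view(docs, map_posts_by_date, descending=True)
post_count = build_view(docs, map_posts_by_date, reduce_fn=len)
```

The comment document yields no row because the map function emits nothing for it, and the view never modifies the documents it reads.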
As all documents of the database are processed by a view's functions, this can be time consuming and resource intensive for large databases. Therefore, a view is not created and indexed when write operations occur but on demand (at the first request directed to it) and is updated incrementally when it is requested again. To provide incremental view updates, CouchDB holds indexes for views. As mentioned before, views are defined and stored in special documents.
These design documents can contain functions for more than one view if they are named uniquely. View indexes are maintained based on these design documents and not on the single views contained in them.
Hence, if a user requests a view, its index and the indexes of all views defined in the same design document get updated. Incremental view updates furthermore have the precondition that the map function is referentially transparent, which means that for the same document it has to emit the same key and value each time it is invoked. To update a view, the component responsible for it (called the view-builder) compares the sequence id of the whole database and checks whether it has changed since the last refresh of the view. If it has, the view-builder determines the documents changed, deleted or created since that time; it passes new and updated documents to the view's map and reduce functions and removes deleted documents from the view.
As changes to the database are written to disk in an append-only fashion, the incremental updates of views can occur efficiently, as the number of disk head seeks is minimal. A further advantage of append-only index persistence is that if the system crashes during an index update, the previous state remains consistent: CouchDB omits the incompletely appended data when it starts up and can update the index when it is next requested. While the view-builder is updating a view, the view's previous state can still be read by clients. It is also possible to present the previous state of the view to one client and the new one to another, as view indexes are also written in an append-only manner and the compaction of view data does not discard a previous index state while a client is still reading from it.
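The on-demand, incremental refresh driven by the database sequence id can be sketched as follows. `ToyDatabase`, `ToyViewIndex` and the in-memory change log are assumptions for illustration; they do not reflect CouchDB's on-disk format or its view-builder code.

```python
class ToyDatabase:
    """Append-only change log; every write gets the next sequence id."""

    def __init__(self):
        self.seq = 0
        self.changes = []   # entries: (seq, doc_id, document or None if deleted)

    def write(self, doc_id, doc):
        self.seq += 1
        self.changes.append((self.seq, doc_id, doc))


class ToyViewIndex:
    def __init__(self, map_fn):
        self.map_fn = map_fn
        self.last_seq = 0   # database sequence id at the last refresh
        self.rows = {}      # doc_id -> list of emitted (key, value) pairs

    def refresh(self, db):
        """Process only the changes made since the last refresh."""
        if db.seq == self.last_seq:
            return 0                          # nothing changed
        new_changes = db.changes[self.last_seq:]
        for _seq, doc_id, doc in new_changes:
            if doc is None:
                self.rows.pop(doc_id, None)   # deleted document leaves the view
            else:
                self.rows[doc_id] = list(self.map_fn(doc))
        self.last_seq = db.seq
        return len(new_changes)


db = ToyDatabase()
index = ToyViewIndex(lambda doc: [(doc["date"], doc["title"])])
db.write("p1", {"date": "2010-03-01", "title": "First"})
first = index.refresh(db)     # first request builds the view
db.write("p2", {"date": "2010-05-09", "title": "Second"})
db.write("p1", None)          # deletion of p1
second = index.refresh(db)    # only the two new changes are processed
third = index.refresh(db)     # sequence id unchanged, nothing to do
```

The refresh is only correct if the map function emits the same rows for the same document every time, which is the referential transparency precondition mentioned above.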
Versioning: Versioning keeps track of the different revisions of a document as it is modified over time.
Documents are updated optimistically, and update operations do not imply any locks. If an update is issued by a client, the contacted server creates a new document revision in a copy-on-modify manner, and a history of recent revisions is stored in CouchDB until the database is next compacted.
A document is therefore identified by a document id/key, which sticks to it until it is deleted, and a revision number, which CouchDB creates when the document is created and every time it is updated. When a document is updated, not only this revision number is stored but also a list of the revision numbers preceding it, which allows the database (when replicating with another node or processing read requests) as well as client applications to reason about the revision history in the presence of conflicting versions.
CouchDB does not regard version conflicts as an exception but rather as a normal case. They can occur not only when different clients operate on the same CouchDB node but also when clients operate on different replicas of the same database. The database does not prohibit an unlimited number of concurrent versions. A CouchDB database can deterministically detect which versions of a document succeed each other and which are in conflict and have to be resolved by the client application. Conflict resolution may occur on any replica node of a database, as the node receiving the resolved version transmits it to all replicas, which have to accept this version as valid. It may happen that conflict resolution is issued on different nodes concurrently; the locally resolved versions on the nodes are then detected to be in conflict and get resolved like all other version conflicts. Version conflicts are detected at read time, and the conflicting versions are returned to the client, which is responsible for conflict resolution. A document whose most recent versions are in conflict is excluded from views.
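How a stored list of preceding revision numbers lets two versions be classified as successor or conflict can be sketched as follows. The `Revision` class, the string revision numbers and the `compare` function are assumptions for illustration, not CouchDB's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    rev: str              # revision number of this version
    history: tuple = ()   # revision numbers that preceded it, oldest first


def update(parent, new_rev):
    """Create the revision that follows parent, extending its history."""
    return Revision(new_rev, parent.history + (parent.rev,))


def compare(a, b):
    """Classify two versions of the same document."""
    if a.rev == b.rev:
        return "same"
    if b.rev in a.history:
        return "a succeeds b"
    if a.rev in b.history:
        return "b succeeds a"
    return "conflict"


base = Revision("r1")
on_node_1 = update(base, "r2a")   # update made on one replica
on_node_2 = update(base, "r2b")   # concurrent update on another replica
```

Neither `r2a` nor `r2b` appears in the other's history, so the two versions are reported as a conflict and left to the client to resolve.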
Distribution and Replication: CouchDB follows a peer approach to setting up distributed servers, without any distinguished roles (as in master/slave setups, standby clusters etc.). Different database nodes can, by design, operate completely independently and process read and write requests.
Two database nodes can replicate their databases (documents, document attachments, views) bidirectionally if they can reach one another via the network.
The replication process works incrementally and can detect conflicting versions in a simple manner, as every update of a document causes CouchDB to create a new revision of the updated document, and a list of outdated revision numbers is stored. Using the current revision number as well as the list of outdated revision numbers, CouchDB can determine whether two versions are in conflict. If there are version conflicts, both nodes have a notion of them and can escalate the conflicting versions to clients for conflict resolution; if there are no version conflicts, the node that does not have the most recent version of the document updates it.
Distribution scenarios for CouchDB include clusters, offline usage on a notebook, and company locations distributed around the world where live access to a company's or organization's local network is slow or unstable. In the latter two scenarios one can work on a disconnected CouchDB instance and is not restricted in its usage. When the network connection to the replica nodes is established again, the database nodes can synchronize their state again.
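Bidirectional replication between two peers that worked offline can be sketched as follows. `ToyNode`, `pull` and `sync` are hypothetical names; the model keeps only the current version or versions per document and ignores attachments, views and the network.

```python
class ToyNode:
    """Holds, per document id, the current version(s) as (rev, history, body)."""

    def __init__(self):
        self.docs = {}   # doc_id -> list of (rev, history tuple, body)

    def write(self, doc_id, rev, body):
        """Local update: the new revision follows the current version."""
        current = self.docs.get(doc_id, [])
        history = current[0][1] + (current[0][0],) if current else ()
        self.docs[doc_id] = [(rev, history, body)]


def pull(target, source):
    """Copy versions from source that target does not know yet."""
    for doc_id, incoming in source.docs.items():
        for rev, history, body in incoming:
            local = target.docs.setdefault(doc_id, [])
            if any(rev == r or rev in h for r, h, _ in local):
                continue   # target already has this version or a successor
            # Drop local versions that the incoming version supersedes.
            local[:] = [v for v in local if v[0] not in history]
            local.append((rev, history, body))


def sync(a, b):
    """Bidirectional replication: each node pulls from the other."""
    pull(a, b)
    pull(b, a)


a, b = ToyNode(), ToyNode()
a.write("d", "r1", "draft")
sync(a, b)                        # both nodes now hold revision r1
a.write("d", "r2a", "edit on a")  # the two nodes work while disconnected
b.write("d", "r2b", "edit on b")
a.write("e", "r1", "note")
sync(a, b)                        # connection restored
```

After the second sync both nodes hold the two conflicting versions of document `d`, while document `e`, which changed on one node only, is simply copied to the other.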