Fundamental Principle and Properties of Big Data


Before we dive into big data tools and technologies, it is necessary to understand the basic principles and properties of big data. It is also essential to know the complexity and scalability issues of traditional data systems in managing large data, as discussed previously. To overcome these issues, several open-source tools have been developed, such as Hadoop, MongoDB, Cassandra, HBase, and MLlib; these systems scale to the processing and management of huge datasets. Designing a robust and scalable big data processing system, in turn, requires a clear grasp of these basic principles and properties.

6.3.3.1 Issues with Traditional Architecture for Big Data Processing

There are many issues with traditional systems when dealing with increasingly complex data. Some major challenges are identified and discussed as follows:

1. Architecture is not completely fault tolerant: As the number of machines increases, it becomes more likely that some machine will go down, and because the architecture is not horizontally scalable, it cannot absorb such failures on its own. Manual interventions are required, such as managing queue failures and setting up replicas, to keep the applications running.

2. Distributed nature of data: Data is scattered in pieces across many clusters, which increases complexity at the application layer, where the appropriate pieces must be selected and processed. Applications must know which data is to be modified, or must inspect the pieces scattered over the clusters, process them, and then merge the partial results into a final result (see the sketch after this list).

3. Insufficient backup and unavoidable mistakes in software: With the introduction of big data technology, complexities are pushed to the application layer, and as the complexity of a system increases, the possibility of making mistakes will also increase. Systems must be built robustly enough to avoid or handle human mistakes and to limit the damage they cause. In addition, a database should be aware of its distributed nature, since managing distributed processes is time consuming.
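To make the second challenge concrete, the following minimal Python sketch shows what application-level scatter-gather looks like when data is fragmented across nodes. All names here (SHARDS, query_shard, the partial counts) are hypothetical stand-ins for a real cluster and its client library:

```python
# Hypothetical sketch of application-level scatter-gather over fragmented
# data. SHARDS, query_shard, and the partial counts are invented stand-ins
# for a real cluster and its client library.

SHARDS = ["node-1", "node-2", "node-3"]  # pieces of the dataset live here

def query_shard(node, user_id):
    """Pretend remote call: fetch this shard's partial result for a user."""
    partial_counts = {"node-1": 2, "node-2": 5, "node-3": 1}
    return partial_counts[node]

def total_page_views(user_id):
    # The application itself must know where the pieces live, query every
    # shard, and merge the partial results -- exactly the complexity that
    # traditional architectures push up into the application layer.
    return sum(query_shard(node, user_id) for node in SHARDS)

print(total_page_views(user_id=42))  # -> 8
```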

The scalability and complexity issues that big data poses for traditional systems are addressed and resolved in a systematic way:

◾ First, replication and fragmentation are managed by the distributed nature of the database and by distributed computation methods. Furthermore, systems are scaled by adding more machines to the existing cluster (horizontal scaling) to cope with increasing data.

◾ Second, the data in such systems should be immutable: large-scale data can then be managed and processed in different ways without changes ever destroying the valuable original data. A minimal sketch of this append-only style is given below.
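As a minimal sketch of the immutability principle (the schema and helper names are invented for illustration), consider recording every change as a new fact and deriving the current value, rather than overwriting in place:

```python
# Every change is appended as a new fact; nothing is overwritten, so the
# original data survives any later update. Schema and helper names are
# invented for illustration.

facts = []  # append-only store of immutable facts

def record_address(user_id, address):
    facts.append({"user": user_id, "address": address, "seq": len(facts)})

def current_address(user_id):
    # The "current" value is derived from the facts, not stored in place.
    user_facts = [f for f in facts if f["user"] == user_id]
    return max(user_facts, key=lambda f: f["seq"])["address"] if user_facts else None

record_address(7, "12 High St")
record_address(7, "99 Low Rd")   # an update never destroys the earlier fact
print(current_address(7))        # -> 99 Low Rd
print(len(facts))                # -> 2 (the full history is preserved)
```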

To manage large amounts of data, many large-scale computation systems, such as Hadoop and Spark, and database systems, such as HBase and Cassandra, have been introduced. Each makes its own trade-offs: Hadoop batch processes large-scale data in parallel but with high computation latency, whereas a database such as Cassandra offers a much more limited data model in order to achieve its scalability.
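For a flavor of what batch processing on such systems looks like, here is a minimal PySpark word count; it assumes a local Spark installation and an input file named events.txt, both of which are illustrative rather than anything prescribed by the text:

```python
# A minimal PySpark batch job (word count). Batch jobs like this trade
# latency for the ability to process very large datasets in parallel.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchWordCount").getOrCreate()

counts = (
    spark.read.text("events.txt").rdd          # one Row per input line
    .flatMap(lambda row: row.value.split())    # split lines into words
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)           # sum the counts per word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()
```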

In addition, such database systems are not human fault tolerant, since their contents are alterable.

Consequently, every system has its own pros and cons. To address these issues, systems must be developed in combination with one another, with the least possible complexity.

6.3.3.2 Fundamental Principle for Scalable Database System

To build scalable database systems [6], we primarily need to understand what a data system does. Basically, database systems are used to store and retrieve information. Rather than being limited to storage, however, new-generation database systems must be able to process large amounts of complex data and extract meaningful information, so that better decisions can be taken in less time.

In general, a data system must answer queries by executing a function that takes the entire dataset as an input. It is defined as

query = function(all data)

To implement the above function over an arbitrary dataset with low latency, the Lambda Architecture provides some general steps for developing scalable big data systems; a toy illustration is given below. It is therefore essential to understand the elementary properties of big data systems in order to develop scalable ones.
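The following toy Python sketch illustrates both the definition above and the Lambda Architecture's way of approximating it: a batch view precomputed over the master dataset, a real-time view covering recent data, and a query that merges the two. The dataset and view contents are invented for illustration:

```python
# Toy illustration of "query = function(all data)" and of the Lambda
# Architecture's approximation of it. The dataset, the view contents, and
# all names here are invented for illustration.

all_data = ["home", "about", "home", "home", "about"]  # immutable master dataset

def query(dataset, page):
    # The conceptual definition: a query is a function over the entire dataset.
    return sum(1 for p in dataset if p == page)

# Lambda Architecture approximation: the batch layer periodically precomputes
# a view over the master dataset, the speed layer holds results for data that
# arrived after the last batch run, and a query merges the two views.
batch_view = {"home": 3, "about": 2}   # precomputed over all_data
realtime_view = {"home": 1}            # one recent event, not yet in the batch view

def lambda_query(page):
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query(all_data, "home"))   # -> 3
print(lambda_query("home"))      # -> 4 (includes the recent event)
```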

1. Fault tolerant and reliable: Systems must be robust enough to tolerate faults and carry on their work when one or two machines are down. The main challenge of distributed systems is to "do the right thing" in the face of complications such as concurrently changing data in a distributed database, replication of data, and concurrency in general. Keeping the master data immutable lets the system work around failures by recalculating its state from the original data. The system must also be tolerant of human errors [27].

2. Minimal latency for reads and updates: Several applications read and update the database, and some need updates to be visible immediately. Without compromising the speed and robustness of the system, database systems should be able to satisfy low-latency reads and updates for such applications.

3. High scalability and performance: Increasing the data size increases the load; however, this should not affect the performance of the system. Scalability and performance are achieved by adding more machines, each contributing processing capacity. To handle increasing data and load, systems are horizontally scaled across all layers of the system stack.

4. Support for a wide range of applications: A system should support a diverse range of applications. To achieve this, systems are built from combinations of components and are generalized to support many kinds of workload. Applications such as financial management systems, hospital management, social media analytics, scientific applications, and social networking all require big data systems to manage their data and extract value from it [28].

5. Compatible system with low cost: Systems should be extensible, so that new functionality can be added at low cost to support a wide range of applications. In such cases, old data may need to be migrated into new formats, and systems must remain easily compatible, supporting past data with minimal upgrade cost.

6. Random queries on a large dataset: The ability to execute arbitrary queries on a large dataset is very important for discovering and learning interesting insights from the data. To find new business insights, applications require ad hoc mining and querying of their datasets.

7. Scalable system with minimal maintenance: Adding machines to scale the system should not increase its maintenance burden. Choosing components with small implementation complexity is key to reducing maintenance: the more complex a system is, the more likely something will go wrong, requiring more debugging and tuning. Minimal maintenance is obtained by keeping the system simple. Keeping processes up, fixing errors, and running efficiently as machines are added are the important factors to consider when developing such systems.

8. Easy to restore: Systems must be able to provide the basic information necessary to restore the data when something goes wrong. Enough replicas of that information should be saved on distributed nodes that the original data can easily be recomputed and restored from them, as the sketch following this list illustrates.
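A minimal sketch of the restore property, with an invented event log standing in for the replicated master dataset: because the source of truth is immutable, a lost derived view is rebuilt simply by replaying it:

```python
# Sketch of the "easy to restore" property: the master dataset is an
# immutable event log (replicated across nodes in a real system), so any
# derived view that is lost can be recomputed by replaying the log.
# The log contents and names are invented for illustration.

event_log = [          # immutable, replicated source of truth
    ("deposit", 100),
    ("withdraw", 30),
    ("deposit", 50),
]

def rebuild_balance(log):
    """Recompute the derived view (an account balance) from scratch."""
    balance = 0
    for kind, amount in log:
        balance += amount if kind == "deposit" else -amount
    return balance

balance_view = rebuild_balance(event_log)
print(balance_view)  # -> 120
# If balance_view is lost to a crash or a human error, nothing is gone:
# replaying the log (or a replica of it) restores the view exactly.
```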
