Challenges in Big Data Analytics

Capturing data generated at high speed from various sources, storing huge volumes of data, and querying, distributing, analyzing, and visualizing that data are the major challenges of a big data system. Data incompleteness and inconsistency, scalability, timeliness, and data security are further challenges [34] in analyzing the large datasets of big data systems.

The first step in big data analytics is to clean and preprocess raw data to obtain quality information. Even so, efficient access, analysis, and visualization remain major challenges for future research. Some of the challenges in each phase of big data analytics are discussed in Sections 6.4.3.1–6.4.3.5.

6.4.3.1 Collect and Store Data

Enterprise storage designs, such as direct-attached storage (DAS), storage area network (SAN), and network-attached storage (NAS), have traditionally been used for collecting and storing data. In large-scale distributed systems, however, all of these storage structures show drawbacks and limitations.

On highly scalable computing clusters, applications need high concurrency and high throughput on each server, but current systems lack these features. Improving data access is one way to improve the performance of data-intensive computing [21]. Data access can be improved through data replication, data distribution, data relocation, and parallel access.
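The ideas of distribution, replication, and parallel access can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster is simulated with in-memory dictionaries, and the names `NUM_NODES`, `REPLICATION_FACTOR`, `place_block`, `build_cluster`, `read_block`, and `parallel_read` are hypothetical and do not come from any storage system discussed here.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4           # hypothetical cluster size
REPLICATION_FACTOR = 2  # each block is stored on this many distinct nodes


def place_block(block_id):
    """Return the nodes holding a block: a hashed primary plus its successors."""
    digest = hashlib.sha256(block_id.encode("utf-8")).digest()
    primary = int.from_bytes(digest[:8], "big") % NUM_NODES
    return [(primary + i) % NUM_NODES for i in range(REPLICATION_FACTOR)]


def build_cluster(blocks):
    """Distribute blocks over in-memory 'nodes', writing every replica."""
    nodes = [{} for _ in range(NUM_NODES)]
    for block_id, payload in blocks.items():
        for node in place_block(block_id):
            nodes[node][block_id] = payload
    return nodes


def read_block(nodes, block_id, failed=frozenset()):
    """Read from the first replica whose node is not marked as failed."""
    for node in place_block(block_id):
        if node not in failed:
            return nodes[node][block_id]
    raise IOError("all replicas of %s are unavailable" % block_id)


def parallel_read(nodes, block_ids, failed=frozenset()):
    """Fetch several blocks concurrently, one task per block."""
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        return list(pool.map(lambda b: read_block(nodes, b, failed), block_ids))


blocks = {"block-%d" % i: ("payload-%d" % i).encode() for i in range(8)}
nodes = build_cluster(blocks)
# Node 0 is treated as failed; every block is still readable from a replica.
result = parallel_read(nodes, list(blocks), failed={0})
```

Because each block is written to two distinct nodes, the loss of any single node leaves every block readable, and independent reads can proceed concurrently. Data relocation (moving blocks when nodes join or leave) is not shown.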

6.4.3.2 Data Management

The traditional methods of managing structured data include two important parts: a schema to store the dataset and a relational database for data retrieval. Data warehouses and data marts are the two standard approaches for managing large-scale structured datasets. SQL is used to perform operations on relational structured data. A data warehouse is used to store and analyze data and to report the outcomes to users. A data mart enables access to and analysis of the data obtained from a warehouse.
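The relationship between a schema, SQL, a warehouse, and a data mart can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch, not a real warehouse deployment: the `sales` table, its columns, and the `north_mart` view are hypothetical, and a plain SQL view stands in for a data mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The schema fixes the structure of the dataset up front.
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 'laptop', 1200.0),
        ('north', 'phone',   800.0),
        ('south', 'laptop', 1000.0),
        ('south', 'phone',   500.0);
    -- A view plays the role of a data mart: a subject-specific slice
    -- of the warehouse table.
    CREATE VIEW north_mart AS
        SELECT product, amount FROM sales WHERE region = 'north';
    """
)

# Warehouse-style report: total sales per region.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Mart-style access: only the rows relevant to one group of users.
mart_rows = conn.execute(
    "SELECT product, amount FROM north_mart ORDER BY product"
).fetchall()
```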

To overcome the rigidity of normalized RDBMS schemas, big data systems adopt NoSQL. NoSQL, also known as “Not Only SQL” [19], is an approach to managing and storing unstructured and non-relational data; the HBase database is one example.

Since SQL is a simple and reliable query language, many big data analytical platforms, such as SQLstream and Cloudera Impala, still use SQL in their database systems.

NoSQL employs many approaches to store and manage unstructured data. Data storage and data management are controlled independently of each other, which improves the scalability of data storage and of the low-level access mechanism in data management. In addition, the schema-free structure of a NoSQL database allows applications to change the structure of tables dynamically without rewriting the existing data. Apache Cassandra [7] is one of the most popular NoSQL databases and is used by many businesses, such as Twitter, LinkedIn, and Netflix.

NoSQL also provides very flexible methods for developing, deploying, and updating applications, and it is used for data modeling as well.
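The schema-free behavior described above can be mimicked with a small in-memory store. The sketch below is purely illustrative: `SchemaFreeStore` and its methods are hypothetical and do not reproduce the API of HBase, Cassandra, or any other NoSQL product. It only shows that rows may carry different fields and that a new field can be added without rewriting existing rows.

```python
class SchemaFreeStore:
    """Toy key-value store in which every row carries its own set of fields."""

    def __init__(self):
        self._rows = {}

    def put(self, key, **fields):
        # New fields are merged into the row; no table-wide schema change
        # and no rewrite of other rows is needed.
        self._rows.setdefault(key, {}).update(fields)

    def get(self, key):
        return dict(self._rows.get(key, {}))

    def columns(self):
        # The set of columns is discovered from the data, not declared.
        return sorted({name for row in self._rows.values() for name in row})


store = SchemaFreeStore()
store.put("user:1", name="Asha", city="Pune")
store.put("user:2", name="Ben", followers=120)  # different fields, same store
store.put("user:1", last_login="2024-01-05")    # field added later
```

In a relational table, adding `last_login` would require altering the schema for every row; here only the affected row changes.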

6.4.3.3 Data Analysis

To cope with increasing data size, researchers have given more attention to speeding up analysis algorithms. Data size is increasing significantly faster than CPU speed, even though processor performance has kept improving in line with Moore’s law, and this has prompted a remarkable change in processor technology. It is therefore essential to develop online, sampling, and multi-resolution analysis methods. At the same time, parallel computing must be developed further as the number of cores per processor increases. Large clusters of processors, distributed computing, and cloud computing are developing fast to aggregate several different workloads.

In real-time applications such as navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and IoT, speed is the top priority.

It is still a big challenge for stream processing to give quick responses within a short span of time.
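Online and sampling-based analysis, as called for above, can be illustrated with two standard single-pass techniques: an incrementally updated mean and reservoir sampling (Algorithm R). The sketch below is an assumption-laden illustration rather than a method from the cited works; the class names `RunningMean` and `Reservoir` and the function `summarize` are hypothetical.

```python
import random


class RunningMean:
    """Mean updated one value at a time, without storing the stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count


class Reservoir:
    """Uniform random sample of fixed size k from a stream of unknown length."""

    def __init__(self, k, seed=0):
        self.k = k
        self.seen = 0
        self.sample = []
        self._rng = random.Random(seed)

    def offer(self, item):
        if self.seen < self.k:
            self.sample.append(item)
        else:
            # Keep the new item with probability k / (seen + 1).
            j = self._rng.randint(0, self.seen)  # inclusive bounds
            if j < self.k:
                self.sample[j] = item
        self.seen += 1


def summarize(stream, k):
    """Consume the stream once, maintaining both summaries."""
    mean, reservoir = RunningMean(), Reservoir(k)
    for value in stream:
        mean.update(value)
        reservoir.offer(value)
    return mean.mean, reservoir.sample


# A generator stands in for a data stream that cannot be re-read.
stream_mean, stream_sample = summarize((x for x in range(1, 100001)), k=5)
```

Both summaries use constant memory regardless of stream length, which is what makes them suitable when data arrives faster than it can be stored and reprocessed.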

6.4.3.3.1 Algorithms for Big Data Analysis

In big data analysis, data mining algorithms play a key role in determining the computational cost, the memory requirements, and the accuracy of the final results.

Problems associated with large-scale data generation have been appearing for the last decade. Fan and Bifet [35] defined the terms big data and big data mining to represent large datasets and the methods of extracting knowledge from large data, respectively. Many machine learning algorithms play a major role in solving big data analysis tasks. Data mining and machine learning algorithms and their importance in big data analytics are described as follows:

1. Clustering algorithms: In data clustering, many challenges are emerging in addressing the characteristics of big data. One of the important issues that needs to be addressed in big data clustering is how to reduce data complexity. Big data clustering is divided into two groups [36]: (i) single-machine clustering using sampling and dimension reduction solutions and (ii) multiple-machine clustering using parallel and MapReduce solutions [37]. Sampling and dimension reduction methods reduce the complexity and memory space required by data analytical processes.

Inappropriate data and dimensions are discarded before the data analysis process starts. Data sampling reduces the number of records to be analyzed, whereas dimension reduction reduces the number of attributes across the whole dataset.

To perform the clustering process in parallel, CloudVista [38] uses cloud computing, which is a common solution for clustering big data. To handle large-scale data, CloudVista uses balanced iterative reducing and clustering using hierarchies (BIRCH) and sampling methods.

2. Classification algorithms: Many researchers are working toward developing new classification algorithms for big data mining and adapting traditional classification algorithms for parallel computing. Classification [39] algorithms are designed in such a way that they take input data from distributed data sources and use various sets of learners to process them. Tekin et al. presented “classify or send for classification” as a novel classification algorithm.

In this distributed data classification method, each learner can process the input data in one of two ways: it either performs the classification itself or forwards the input data to another labeled learner. These kinds of solutions improve the accuracy of big data classification.

For example, to perform big data classification, Rebentrost et al. [40] defined a quantum-based support vector machine and showed that the proposed classification algorithm can be implemented with O(log NM) time complexity, where M is the number of training examples and N is the number of dimensions.

3. Association rules and sequential pattern mining algorithms: The early methods of pattern mining were used to analyze the transaction data of large shopping malls. At the beginning, many researchers tried to use frequent pattern mining methods for processing big datasets. FP-tree (frequent-pattern tree) [41] uses a tree structure to reduce the computation time of association rule mining. Further, the MapReduce method was used in frequent pattern mining algorithms to improve their performance [42,43]. Big data analysis using the MapReduce model significantly improves the performance of these methods compared to traditional frequent pattern mining algorithms running on a single machine.

4. Machine learning algorithms for big data: Machine learning algorithms [44,45] typically work as “search” algorithms for the required solutions and, compared to data mining methods, are used for a wider range of mining and analysis problems. Machine learning algorithms are used to find a fairly accurate solution to an optimization problem. For example, machine learning algorithms, including genetic algorithms, can be used to solve frequent pattern mining problems just as they are used to solve clustering problems. Machine learning can also improve the performance of the other parts of knowledge discovery in databases (KDD), for example through feature reduction for the input operators.

These results indicate that machine learning algorithms have become essential parts of big data analytics. In addition, many statistical methods, data mining algorithms, data processing solutions, graphical user interfaces, and several descriptive tools also play a major role in big data platforms.
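The MapReduce approach to frequent pattern mining mentioned in item 3 can be sketched as follows. This is a simplified illustration under stated assumptions: it counts candidate itemsets by brute force rather than building an FP-tree, it runs the map step sequentially instead of on a cluster, and the names `map_partition`, `reduce_counts`, and `frequent_itemsets` as well as the sample transactions are hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_partition(transactions, max_size=2):
    """Map step: count every itemset of up to max_size items in one partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts


def reduce_counts(partial_counts):
    """Reduce step: sum the per-partition counts for each itemset."""
    return reduce(lambda left, right: left + right, partial_counts, Counter())


def frequent_itemsets(partitions, min_support):
    """Keep the itemsets that occur in at least min_support transactions."""
    totals = reduce_counts(map_partition(part) for part in partitions)
    return {itemset: n for itemset, n in totals.items() if n >= min_support}


# Two partitions stand in for data held on two different machines.
partitions = [
    [["bread", "milk"], ["bread", "butter", "milk"]],
    [["milk", "butter"], ["bread", "milk", "jam"]],
]
frequent = frequent_itemsets(partitions, min_support=3)
```

Each partition is counted independently, so on a real cluster the map step could run on all machines at once and only the compact count tables would need to be combined.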

6.4.3.4 Security for Big Data

To improve data security, several developed and developing countries have implemented data protection laws. The major security issues are the protection of intellectual property, financial information, personal privacy, and commercial secrets. Data security is difficult because digitization in various sectors generates large amounts of data. Hence, the security challenges that arise in many applications from the increasingly distributed nature of big data need to be addressed in future research. It is even more complex to identify threats, which can originate anywhere in the big data network and intensify these problems.

6.4.3.5 Visualization of Data

Information hidden in large and complex datasets can easily be conveyed in both functionality and visual forms. The challenges in data visualization [46] are to

and visualization techniques. For valuable data analysis, information should be abstracted in some schematic form from complex datasets, and it should include variables or attributes for the units of information.

To extract and understand the hidden insights in their data, e-commerce companies such as eBay and Amazon use big data visualization tools such as Tableau [47]. This tool helps to convert large, complex datasets, for example data about thousands of customers, goods sold, feedback, and customer preferences, into interactive results and intuitive pictures. However, current visualization tools still face many challenges, such as scalability, functionality, and response time, that can be addressed in future work.
