Distributed Systems and Database Systems


Machine learning has traditionally been associated with algorithm design, analysis and synthesis. However, with the advent of big data, machine learning is becoming an integral part of many computer systems. In this section, we cover two such computer systems, namely database systems and distributed systems.

Database systems have the ability to store a wide variety of data. Traditionally, data was stored in flat files on the operating system. With the advent of data that is structured according to relations between sets, relational database systems became widely used. The original relational databases were not designed to store voluminous big data in a cost-effective manner. To deal with the problems of big data, relational databases have spawned new generations of systems such as in-memory databases and analytic databases. With the availability of even more data generated by humans and machines, there is a growing need for databases that can deal with the streaming, unstructured, spatiotemporal nature of data sourced from documents, emails, photos, videos, web logs, click streams, sensors and connected devices. Therefore, from the standpoint of database systems, big data has been defined in terms of volume, variety, velocity and veracity. To be intelligible, big data must be mined using machine learning algorithms. These algorithms require substantial processing power to mine the data effectively while balancing trade-offs of speed, scale and accuracy. That processing power is becoming available through distributed software systems built for the coordinated management of networks of commodity hardware clusters. This is why there is a close relationship between the database systems and distributed systems used to support Big Data Analytics in the cloud.

The traditional data processing model of distributed systems is composed of two tiers. A first ‘storage cluster’ tier is used to clean and aggregate the data. A second ‘compute cluster’ tier is used to copy and process the data. However, this data processing model does not work for storing and processing big data at scale. One solution is to merge the two tiers into a single tier where data is both stored and processed. Computing over this single tier is called cloud computing. To deal with the problems of speed, scale and accuracy, cloud computing defines a loosely connected coordination of hardware clusters with well-defined protocols for using the same hardware for both storage and processing.

Modern distributed systems like Hadoop are built for big data processing. These systems do not impose prior conditions on the structure of unstructured data, they scale horizontally so that more processing power is obtained by adding hardware nodes to the cloud, and they integrate distributed storage with distributed computing. Moreover, these systems are cost-effective, allowing the end user to perform reasonably complex computations on off-the-shelf commodity hardware. This in turn allows organizations to capture and store data at a reasonable cost for longer durations. However, the real benefit of the big data captured in such distributed systems lies in the big data analysis frameworks they provide. For example, the big data framework of Apache Hadoop is MapReduce, and the big data framework of Apache Spark is Resilient Distributed Datasets. These frameworks support analytics in programming languages like R, Java, Python and Scala.

Hadoop is an open-source implementation of distributed computing frameworks that originated as proprietary systems. It consists of the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Like Spring and Struts, MapReduce is a Java programming framework; it originated in functional programming constructs. MapReduce takes care of distributing the computations in a program. This allows the programmer to concentrate on programming logic rather than on organizing multiple computers into a cluster.

MapReduce also takes care of hardware and network failures to avoid loss of data and computation. To use MapReduce, the programmer has to break the computational problem into Map and Reduce functions that are processed in parallel on a shared-nothing architecture. Many useful storage and compute components have been built on top of Hadoop. Some of these components that are widely used include Apache HBase, Apache Hive, Apache Pig, Apache Mahout, Apache Giraph, Apache YARN, Apache Tez, Presto, Apache Drill and Apache Crunch [9].
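As a minimal, locally runnable sketch of this Map/Reduce decomposition, the Python fragment below implements the classic word count with a separate map function and reduce function; the sample input and the use of Python rather than Java are illustrative assumptions, and on a real cluster the two functions would run as independent, shared-nothing tasks (for example via Hadoop Streaming) rather than being chained in one process.

    # Word count expressed as Map and Reduce steps.  The mapper emits (word, 1)
    # pairs; the reducer receives pairs grouped by key and sums the counts.
    from itertools import groupby
    from operator import itemgetter

    def mapper(lines):
        """Map step: split each line into words and emit (word, 1) pairs."""
        for line in lines:
            for word in line.strip().split():
                yield word.lower(), 1

    def reducer(pairs):
        """Reduce step: sum counts per word.  Hadoop sorts mapper output by key
        before the reduce phase, so we sort explicitly here."""
        for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        sample = ["big data needs big clusters", "big data needs machine learning"]
        for word, count in reducer(mapper(sample)):
            print(word, count)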

HDFS allows horizontal scalability with fault tolerance on commodity hardware. MapReduce does distributed computing in a fault-tolerant manner. HBase is a key-value database that scales to millions of columns and billions of rows. Hive is a data warehouse that uses a SQL-like query language. Through its Pig Latin scripting language, Pig allows analysis of large volumes of data to create new derivative datasets, solving sophisticated large-scale problems without having to code at the level of MapReduce.
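To give a feel for the key-value access pattern that HBase exposes, the following is a small sketch using the third-party happybase Python client; the host, table name and column family are illustrative assumptions, and a running HBase Thrift service would be needed to execute it.

    # Sketch of HBase-style key-value access via the happybase client.
    # Assumes an HBase Thrift server on localhost and an existing table
    # 'clickstream' with a column family 'event'; all names are illustrative.
    import happybase

    connection = happybase.Connection('localhost')      # Thrift server host
    table = connection.table('clickstream')

    # Write one row: a row key plus columns grouped into a column family.
    table.put(b'user42#2019-01-01T10:00',
              {b'event:url': b'/home', b'event:duration_ms': b'120'})

    # Read the row back by key.
    print(table.row(b'user42#2019-01-01T10:00'))

    # Scan a key range, the access pattern HBase uses to serve billions of rows.
    for key, data in table.scan(row_prefix=b'user42#'):
        print(key, data)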

Mahout is a machine learning library that offers common algorithms for analytics in a distributed environment. Drill and Crunch also allow analytics at scale. Giraph is a graph-driven distributed programming framework implementing the Bulk Synchronous Parallel model of computation. ETL tools for pulling data into Hadoop include Flume, Chukwa, Sqoop and Kafka. Thus, we can see that the Hadoop ecosystem provides rich functionality as far as big data analysis is concerned. Moreover, custom MapReduce programs can be written to extend the capabilities of the Hadoop software stack. We shall survey the above modules in detail in Chap. 4. For storing key–value datasets, some of the alternatives to Hadoop include NoSQL and columnar databases like MongoDB, Cassandra, Vertica and Hypertable. Grid computing frameworks like the Web Services Resource Framework (WSRF), GridGain and JavaSpaces are some of the more complex open-source alternatives to MapReduce. Just like Linux distributions, the Hadoop software stack has been packaged into Hadoop distributions for easy management of the software. Some of the common Hadoop distributions are provided by the Apache Foundation, Cloudera, Hortonworks and MapR.

Spark defines its big data processing framework in terms of Scala transformations and actions on data stored in Resilient Distributed Datasets (RDDs). Internally, the chain of transformations that produces an RDD is recorded as a directed acyclic graph (DAG). DAGs allow a better representation of the data and compute dependencies in a distributed program. Also, Scala is easier to program than Java. Thus, for Big Data Analytics, the functional programming constructs of Spark RDDs are an alternative to Hadoop MapReduce.
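The sketch below shows the same word count expressed as RDD transformations and a single action, here in PySpark rather than Scala; the local-mode SparkContext and the sample lines are illustrative assumptions. Transformations only extend the lineage DAG, and nothing executes until the action is invoked.

    # Word count as RDD transformations (flatMap, map, reduceByKey) plus one
    # action (collect).  The transformations build up the lineage DAG lazily;
    # the action triggers the actual distributed execution.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-sketch")           # local mode for illustration

    lines = sc.parallelize(["big data needs big clusters",
                            "big data needs machine learning"])

    counts = (lines.flatMap(lambda line: line.split())    # transformation
                   .map(lambda word: (word, 1))           # transformation
                   .reduceByKey(lambda a, b: a + b))      # transformation

    print(counts.collect())                               # action: runs the DAG
    sc.stop()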

Data caching is also possible in Spark. The in-memory processing of Spark also shows better benchmarks than Hadoop. Spark has components for accessing data from either a local file system or a cluster file system. The Spark framework can be accessed and programmed in R, Python, Java and Scala. By allowing functional programming over distributed processing, Spark is quickly becoming the de facto standard for machine learning algorithms applied to big data in a distributed system. By contrast, Hadoop is already the standard for database systems storing big data. Spark can also interoperate with Hadoop components for storage and processing, such as HDFS and YARN. Like the Hadoop software stack, the Spark software stack has components such as Spark Streaming, Shark, MLlib, GraphX and SparkR for data storage, data processing and data analytics [10]. The novelty in Spark is a software stack that has compute models for both data streams and analytics in a shared-memory architecture. Spark stream processing also integrates well with both batch processing and interactive queries. Compared with record-at-a-time stream processing models, stream computation in Spark is based on a series of very small deterministic batch jobs. State between batches of data is stored in memory as a fault-tolerant RDD.

The data-parallel MapReduce framework is not suitable for algorithms that require the results of Map functions as input to other Map functions in the same procedure. The MapReduce framework is also not suitable for computations over datasets that are not independent and that impose a specific chronology on data dependencies. Moreover, MapReduce is not designed for iterative execution of Map and Reduce steps. Thus, MapReduce is not suitable for the iterative processing required by machine learning algorithms such as expectation–maximization and belief propagation. These disadvantages of MapReduce have led to richer big data frameworks where more control is left to the framework user at the cost of more programming complexity. Spark's filter, join and aggregate functions for distributed analytics allow better parameter tuning and model fitting in machine learning algorithms that are executed in a distributed fashion. Typically, these algorithms are CPU intensive and need to process data streams. Some of the popular machine learning algorithms in Spark include logistic regression, decision trees, support vector machines, graphical models, collaborative filtering, topic modeling, structured prediction and graph clustering.
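To make this contrast concrete, the sketch below runs an iterative logistic-regression gradient descent over a cached RDD in PySpark: the input is read once, cached in memory and reused on every iteration, which is exactly the access pattern plain MapReduce does not support well. The tiny synthetic dataset, learning rate and iteration count are illustrative assumptions.

    # Iterative machine learning on a cached RDD: logistic-regression gradient
    # descent.  The dataset is cached once and reused across all iterations.
    import math
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-sketch")

    # Tiny synthetic dataset of (label, feature) pairs; values are illustrative.
    points = sc.parallelize([(0.0, -2.0), (0.0, -1.0), (1.0, 1.5), (1.0, 2.5)]).cache()
    n = points.count()

    w, b, lr = 0.0, 0.0, 0.5
    for _ in range(50):
        # Each iteration reuses the cached RDD instead of re-reading the input.
        grad_w, grad_b = points.map(
            lambda p: ((1.0 / (1.0 + math.exp(-(w * p[1] + b))) - p[0]) * p[1],
                       (1.0 / (1.0 + math.exp(-(w * p[1] + b))) - p[0]))
        ).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    print("learned weight", w, "bias", b)
    sc.stop()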

Alternatives to Spark such as GraphLab and DryadLINQ are software components that provide scalable machine learning with graph-parallel processing. All such graph-parallel systems reduce the communication cost of distributed processing by using asynchronous iterative computation. They exploit the graph structure in datasets to reduce inefficiencies in data movement and duplication, achieving orders-of-magnitude performance gains over more general data-parallel systems. To deal with trade-offs in speed, scale and accuracy, graph-based distributed systems combine advances in machine learning with low-level system design constructs such as asynchronous distributed graph computation, prioritized scheduling and efficient data structures. Moreover, graph-parallel systems are computationally suitable for graphs that arise in social networks (e.g., human or animal), transaction networks (e.g., Internet, banking), molecular biological networks (e.g., protein–protein interactions) and semantic networks (e.g., syntax and compositional semantics).
