
5.3.2 Spark Constructs and Components

On a high level, a Spark program has a driver that controls the flow of the application and executors that execute the tasks assigned by the driver. The driver is responsible for maintaining application information and scheduling work for the executors. The executors, in turn, are responsible for executing the tasks assigned by the driver. In doing so, Spark makes use of different constructs and components.

5.3.2.1 Resilient Distributed Datasets

An RDD is an immutable data structure that lives in one or more machines. The data in an RDD does not exist in permanent storage. Instead, there is enough information to compute the RDD. Hence, RDDs can be reconstructed from a reliable store if one or more nodes fail. Moreover, an RDD can be materialized in both memory and disk (Zaharia et al., 2012). An RDD has the following metadata:

Dependencies (the lineage graph): a list of the parent RDDs that were used in computing the RDD.

Computation function: the function to compute the RDD from its parent RDDs.

Preferred locations: information on where to compute each partition with respect to data locality.

Partitioner: specifies whether the RDD is hash- or range-partitioned.

Let us now see an example PySpark application where we create an RDD out of a log file in HDFS. The data would be automatically partitioned, but we can repartition the data. Repartitioning data to a higher number of partitions might be important for distributing data to executors more evenly and potentially using more executors to increase parallelism. In this example, we transform the data two times by filtering. Later, we get the result back to the driver by counting.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
lines = spark.sparkContext \
    .textFile('hdfs://dbdp/dbdp-spark.log')

# transformation to filter logs by warning
warnings = lines.filter(lambda line: line.startswith('WARN'))

# transformation to filter warnings by connection exception
connection_warnings = warnings \
    .filter(lambda line: 'ConnectionException' in line)

# action that triggers execution and returns the count to the driver
connection_warnings.count()
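To make the partitioning and lineage metadata discussed above concrete, here is a minimal sketch that continues from the RDDs created above; the target partition count of 100 is an arbitrary choice for illustration.

# A minimal sketch continuing from the RDDs above; the partition count of 100
# is an arbitrary choice for illustration.
print(lines.getNumPartitions())         # partitioning chosen automatically
repartitioned = lines.repartition(100)  # redistribute into 100 partitions
print(repartitioned.getNumPartitions())

# toDebugString() prints the lineage (dependency) graph of an RDD.
print(connection_warnings.toDebugString().decode())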

A typical Spark application would load data from a data source and then transform the RDD into another one, as shown in Figure 5.5. A new RDD points back to its parent, which creates a lineage graph or directed acyclic graph (DAG). The DAG instructs Spark about how to execute these transformations. After transformations are applied, the Spark application finishes with an action. An action can simply be saving data to external storage or collecting results back to the driver.

Spark transformations are lazy. When we call a transformation method, it does not get executed immediately, so the data is not loaded until an action is triggered. Transformations rather become a step in a plan of transformations, the DAG. Having a plan comes in handy because Spark gains the ability to run the plan as efficiently as possible. There are two kinds of transformations: narrow and wide. A narrow transformation happens when each partition of the parent RDD is used by at most one partition of the child RDD. Narrow transformations do not require shuffling or redistributing data between partitions; for instance, map or filter operations. A wide transformation happens when more than one partition of the parent RDD is used by one partition of the child RDD; for instance, groupByKey or join operations. For wide transformations, Spark has to use a shuffle operation since the data can be in several partitions of the parent RDD. Narrow transformations allow operations to be pipelined on the same node. Moreover, narrow transformations help to recover efficiently since the transformation can happen on one node. In contrast, a node failure during a wide transformation might require computing all of the data again. Narrow and wide transformations are illustrated in Figure 5.6, and a short sketch following the figure contrasts the two in code.

Figure 5.5 Spark RDD flow: a data source (HDFS, S3, local files) is loaded into an RDD, transformations (map, reduce, groupBy, join) are applied, and an action (e.g., writing parquet or saveAsTable) produces the result.

Figure 5.6 Narrow vs. wide transformations: map/filter and union are narrow, whereas groupByKey and join (when inputs are not co-partitioned) are wide.
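To make the distinction concrete, here is a minimal PySpark sketch with one narrow and one wide transformation; the tiny in-memory data set is purely hypothetical.

# A minimal sketch: a narrow transformation (map) followed by a wide one
# (reduceByKey), which forces a shuffle across partitions. The input pairs
# are hypothetical.
pairs = spark.sparkContext.parallelize(
    [('WARN', 1), ('ERROR', 1), ('WARN', 1)], numSlices=2)

normalized = pairs.map(lambda kv: (kv[0].lower(), kv[1]))  # narrow: no shuffle
counts = normalized.reduceByKey(lambda a, b: a + b)        # wide: shuffles by key

print(counts.collect())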

5.3.2.2 Distributed Shared Variables

Distributed shared variables are another element of the low-level API in Spark. There are two types of variables: broadcast variables and accumulators.

Broadcast Variables The general way of using a variable throughout the cluster is simply referencing the variable in closures, e.g. map/foreach. Nevertheless, this method might become inefficient once the variable size passes a certain threshold, for example, a globally shared dictionary. This is where broadcast variables come into play. Broadcast variables are a way to share immutable data across the Spark cluster. Spark serializes broadcast variables and makes them available to each task.

airport_codes = {'IST': 'Istanbul Airport', 'DUB': 'Dublin Airport'}

broadcasted_airport_codes = spark.sparkContext.broadcast(airport_codes)
print(broadcasted_airport_codes.value)
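To show how the broadcast value is consumed inside tasks, the following minimal sketch looks up airport names within a map transformation; the flight_codes RDD is hypothetical.

# A minimal sketch: reading the broadcast dictionary inside a map task.
# The flight_codes RDD below is hypothetical.
flight_codes = spark.sparkContext.parallelize(['IST', 'DUB', 'IST'])
airport_names = flight_codes.map(
    lambda code: broadcasted_airport_codes.value.get(code, 'Unknown'))
print(airport_names.collect())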

Accumulators Accumulators allow propagating a value back to the driver that is updated during computations in the executors. We can use accumulators to get a sum or count for an operation. Accumulator updates should only happen inside actions; for actions, Spark guarantees that each task's update to the accumulator is applied only once. We can use accumulators as follows:

positive_review_count = spark.sparkContext.accumulator(0)

def count_positive_review(review):
    if review.score > 0.75:
        positive_review_count.add(1)

# foreach is an action, so the accumulator is updated while it runs
reviews.foreach(lambda review: count_positive_review(review))
positive_review_count.value

5.3.2.3 Datasets and DataFrames

Spark has a concept of a Dataset that represents a collection of distributed data. On a high level, Datasets are very close to RDDs, but they are backed by a full relational engine. If we want to perform aggregation or SQL-like filtering, we can rely on built-in operations instead of writing them ourselves. We can extend our previous example about log lines and try to determine the count of the different types of exceptions. Since the Dataset API is not available in Python, we will use Scala instead.

import org.apache.spark.sql.functions.col
import spark.implicits._

spark.sparkContext.textFile("hdfs://dbdp/dbdp-spark.log").toDS()
  .select(col("value").alias("word"))
  .filter(col("word").like("%Exception"))
  .groupBy(col("word"))
  .count()
  .orderBy(col("count").desc)
  .show()
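Although the Dataset API is not available in Python, the same computation can be expressed with the DataFrame API in PySpark. A minimal sketch, assuming the same log file as above:

# A minimal PySpark DataFrame sketch equivalent to the Scala Dataset example
# above; it assumes the same hdfs://dbdp/dbdp-spark.log file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

(spark.read.text('hdfs://dbdp/dbdp-spark.log')
    .select(col('value').alias('word'))
    .filter(col('word').like('%Exception'))
    .groupBy(col('word'))
    .count()
    .orderBy(col('count').desc())
    .show())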

A DataFrame is a Dataset with named columns. A DataFrame is synonymous with a table in a relational database. DataFrames are aware of their schema and provide relational operations that might allow some optimizations. DataFrames can be created using a system catalog or through the API. Once we create a DataFrame, we can then apply functions like groupBy or where. Just like RDD operations, DataFrame manipulations are also lazy. In DataFrames, we define manipulation operations, and Spark determines how to execute them on the cluster (Armbrust et al., 2015). In the following example, we try to find the top visiting countries by using the visits table. As you can see, we are making use of SQL-like statements.

val visits = spark.table("dbdp.visits")

visits
  .where($"ds" === "2020-04-19")
  .select($"device_id", $"ip_country")
  .distinct()
  .groupBy($"ip_country")
  .count()
  .sort($"count".desc)
  .write.saveAsTable("dbdp.top_visiting_countries")

We can get the same results through Spark SQL as follows. Note that both options would compile to the same physical plan.

spark.sql("SELECT ip_country, COUNT(1) " +

"FROM (SELECT DISTINCT device_id, ip_country " +

"FROM yaytas.visits WHERE ds=’2020-04-19’)t " +

"GROUP BY t.ip_country " +

"ORDER BY 2")

The compiled physical plan is as follows:

== Physical Plan ==
*(4) Sort [count(1)#43L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count(1)#43L ASC NULLS FIRST, 10009)
   +- *(3) HashAggregate(keys=[ip_country#40], functions=[count(1)])
      +- Exchange hashpartitioning(ip_country#40, 10009)
         +- *(2) HashAggregate(keys=[ip_country#40], functions=[partial_count(1)])
            +- *(2) HashAggregate(keys=[device_id#41, ip_country#40], functions=[])
               +- Exchange hashpartitioning(device_id#41, ip_country#40, 10009)
                  +- *(1) HashAggregate(keys=[device_id#41, ip_country#40], functions=[])
                     +- *(1) Project [device_id#41, ip_country#40]
                        +- *(1) FileScan orc dbdp.visits[ip_country#40, device_id#41, ds#42]
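A plan like the one above can be inspected without executing the query by calling explain() on the resulting DataFrame. A minimal sketch, assuming the same dbdp.visits table:

# A minimal sketch: printing the plan of the earlier query without running it.
# It assumes the dbdp.visits table from the previous examples.
query = spark.sql(
    "SELECT ip_country, COUNT(1) "
    "FROM (SELECT DISTINCT device_id, ip_country "
    "FROM dbdp.visits WHERE ds='2020-04-19') t "
    "GROUP BY t.ip_country "
    "ORDER BY 2")

# explain() prints the physical plan; explain(True) also prints the parsed,
# analyzed, and optimized logical plans.
query.explain(True)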

5.3.2.4 Spark Libraries and Connectors

Being a unified execution engine, Spark supports different computational needs through its library ecosystem, and we can use these libraries together. By default, Spark comes equipped with Spark Streaming, Spark ML, and GraphFrames.

Spark Streaming enables Spark to process live data streams. The data can be ingested into Spark from many sources such as Kafka and Flume. Spark ML offers common machine learning algorithms such as classification, regression, and clustering. Lastly, GraphFrames implements parallel graph computation. GraphFrames extends DataFrame and introduces the GraphFrame abstraction. Spark core enables these libraries to run on a cluster of machines and get the results fast.
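As a flavor of stream processing, the following minimal sketch uses the newer Structured Streaming API (rather than the classic DStream API) to count records read from Kafka; the broker address and topic name are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

# A minimal Structured Streaming sketch, not the classic DStream API.
# The broker address and the "events" topic are hypothetical, and the
# spark-sql-kafka connector package is assumed to be available.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load())

# Kafka delivers binary key/value columns; cast the value to a string and
# count occurrences per value, emitting results to the console.
counts = (events.selectExpr("CAST(value AS STRING) AS value")
    .groupBy("value")
    .count())

query = (counts.writeStream
    .outputMode("complete")
    .format("console")
    .start())
query.awaitTermination()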

Please refer to Figure 5.7 for the relationship between the Spark core and its library ecosystem.

Figure 5.7 Spark layers: access interfaces (JDBC, console, and Spark programs in Scala, Java, Python, and R) sit on top of the libraries (GraphFrames, Spark Streaming, Spark ML) and Spark SQL with its DataFrame API and Catalyst optimizer, which all run on Spark core and connect to data stores such as Cassandra, PostgreSQL, MySQL, and ElasticSearch.

Spark is not a data store, and it supports only a limited number of data stores natively. Instead, it offers a data source API to plug in read or write support for the underlying data source. By implementing the data source API, one can represent the underlying technology as an RDD or DataFrame. Thanks to the huge Spark community, many connectors implement the data source API.
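As an example of such a connector, the following minimal sketch reads a table through the built-in JDBC data source; the PostgreSQL host, database, table name, and credentials are hypothetical, and the PostgreSQL JDBC driver must be on the classpath.

# A minimal sketch of the JDBC data source. Host, database, table, and
# credentials are hypothetical, and the PostgreSQL JDBC driver is assumed
# to be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

visits = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/dbdp")
    .option("dbtable", "visits")
    .option("user", "spark")
    .option("password", "secret")
    .load())

# The result is a regular DataFrame, so the usual transformations apply.
visits.groupBy("ip_country").count().show()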
