To run PySpark commands, let’s create a new notebook using Python 2, as shown in Figure 3-3.
After creating the notebook, you will see the web page in Figure 3-4.
Figure 3-5. Printing the Python list
In Figure 3-5, we are printing pythonList.
Spark Architecture and the Resilient Distributed Dataset
You learned Python in the preceding chapter. Now it is time to learn PySpark and utilize the power of a distributed system to solve problems related to big data. We generally distribute large amounts of data on a cluster and perform processing on that distributed data.
This chapter covers the following recipes:
Recipe 4-1. Create an RDD
Recipe 4-2. Convert temperature data
Recipe 4-3. Perform basic data manipulation
Recipe 4-4. Run set operations
Recipe 4-5. Calculate summary statistics
Recipe 4-6. Start PySpark shell on Standalone cluster manager
Recipe 4-7. Start PySpark shell on Mesos
Recipe 4-8. Start PySpark shell on YARN
Learning about the architecture of Spark will help you understand its various components. Before delving into the recipes, let’s explore this topic.
Figure 4-1 describes the Spark architecture.
The main components of the Spark architecture are the driver and the executors. For each PySpark application, there is one driver program and one or more executors running on the cluster’s slave machines. You might be wondering, what is an application in the context of PySpark? An application is the complete body of code written to solve a problem.
The driver is the process that coordinates with the executors running on various slave machines. Spark follows a master/slave architecture. The driver creates the SparkContext object, which is the main entry point to a PySpark application. You will learn more about SparkContext in upcoming chapters. In this chapter, we will run our PySpark commands in the PySpark shell. After starting the shell, we will find that a SparkContext object has already been created automatically; it is available in the PySpark shell as the variable sc. The shell itself acts as our driver.
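As a quick, illustrative sketch of an interactive session (outputs are omitted and will vary with your Spark installation), sc is ready to use as soon as the shell comes up:

>>> sc.version   # the Spark version this shell is running
>>> sc.master    # the master URL the shell connected to, for example local[*] in local mode
>>> sc.appName   # the application name, typically PySparkShell for the shell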
The driver breaks our application into small tasks; a task is the smallest unit of work in your application. Tasks run on different executors in parallel, and the driver is responsible for scheduling tasks onto those executors.
Executors are slave processes that run tasks. An executor can also cache data in memory by using its BlockManager component. Each executor runs in its own Java Virtual Machine (JVM).
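As a small sketch of that caching capability (numberRDD is just an illustrative name; outputs are omitted), cache() asks the executors to keep an RDD’s partitions in memory once they have been computed:

>>> numberRDD = sc.parallelize([1, 2, 3, 4, 5])
>>> cachedRDD = numberRDD.cache()   # mark the RDD to be kept in executor memory (returns the same RDD)
>>> cachedRDD.count()               # the first action computes the partitions and caches them
>>> cachedRDD.count()               # later actions reuse the in-memory copy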
The cluster manager manages the cluster’s resources. The driver talks to the cluster manager to negotiate resources, and the cluster manager launches executor processes on the slave machines on the driver’s behalf. PySpark ships with the Standalone cluster manager, and it can also be configured to run on YARN and Apache Mesos. In the recipes, you are going to see how to configure PySpark on the Standalone cluster manager and on Apache Mesos. On a single machine, PySpark can also be started in local mode.
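In the PySpark shell, sc is created for you, but a standalone Python script builds its own SparkContext and points it at a cluster manager through the master URL. The following is a minimal sketch, assuming local mode; the Standalone, Mesos, and YARN URLs in the comments are placeholders, and the recipes later in this chapter cover starting the shell against those cluster managers:

from pyspark import SparkConf, SparkContext

# Local mode: run Spark in a single JVM on this machine, using two threads.
conf = SparkConf().setAppName("architectureDemo").setMaster("local[2]")

# Other master URLs (placeholders) would point at a real cluster manager:
#   Standalone:  spark://<master-host>:7077
#   Mesos:       mesos://<mesos-master>:5050
#   YARN:        yarn  (the cluster is picked up from the Hadoop configuration)

sc = SparkContext(conf=conf)
print(sc.master)   # local[2]
sc.stop()          # release the resources when we are done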
Figure 4-1. Spark architecture

The most celebrated component of PySpark is the resilient distributed dataset (RDD). An RDD is a data abstraction over a distributed collection. Python collections such as lists, tuples, and sets can be distributed very easily. An RDD is recomputed on node failure, and only the part of the data that is required is calculated or recalculated. An RDD is created using various functions defined in the SparkContext class. One important method for creating an RDD is parallelize(), which you will encounter again and again in this chapter. Figure 4-2 illustrates the creation of an RDD.
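Here is a minimal sketch of parallelize() in the PySpark shell; dataList and dataRDD are just illustrative names:

>>> dataList = ['Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']
>>> dataRDD = sc.parallelize(dataList)   # distribute the Python list over the cluster as an RDD
>>> dataRDD.count()                      # number of elements in the RDD
7
>>> dataRDD.collect()                    # bring every element back to the driver as a Python list
['Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']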
Figure 4-2. Creating an RDD
Let’s say that we have a Python collection with the elements Data1, Data2, Data3, Data4, Data5, Data6, and Data7. This collection is distributed over the cluster to create an RDD. For simplicity, we can assume that two executors are running. Our collection is divided into two parts. The first executor gets the first part of the collection, which has the elements Data1, Data2, Data3, and Data4. The second part of the collection is sent to the second executor. So, the second executor has the data elements Data5, Data6, and Data7.
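To mirror that two-part split, we can ask parallelize() for two partitions explicitly and inspect them with glom(), which gathers the elements of each partition into a list. This is a sketch; exactly which elements land in which partition is decided by Spark, so the split may not match the figure element for element:

>>> dataRDD = sc.parallelize(['Data1', 'Data2', 'Data3', 'Data4',
...                           'Data5', 'Data6', 'Data7'], 2)
>>> dataRDD.getNumPartitions()        # the RDD is spread over two partitions
2
>>> dataRDD.glom().collect()          # one inner list per partition
[['Data1', 'Data2', 'Data3'], ['Data4', 'Data5', 'Data6', 'Data7']]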
We can perform two types of operations on an RDD: transformations and actions.
A transformation on an RDD returns another RDD. Because RDDs are immutable, an existing RDD can never be changed; a transformation therefore always produces a new RDD. Transformations are lazy, whereas actions are eagerly evaluated. A transformation is lazy in the sense that, when it is applied to an RDD, the operation is not carried out on the data right away. Instead, PySpark records the requested operation, and all the recorded transformations are applied when the first action is called.
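A small sketch of this laziness (numRDD and doubledRDD are illustrative names): the map() call below is only recorded, and nothing is computed on the executors until collect() is called:

>>> numRDD = sc.parallelize([1, 2, 3, 4])
>>> doubledRDD = numRDD.map(lambda x: x * 2)   # transformation: recorded, not executed yet
>>> doubledRDD.collect()                       # action: triggers the actual computation
[2, 4, 6, 8]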
Figure 4-3 illustrates a transformation operation. The transformation on RDD1 creates RDD2. RDD1 has two partitions. The first partition of RDD1 has four data elements: Data1, Data2, Data3, and Data4. The second data partition of RDD1 has three elements: Data5, Data6, and Data7. After transformation on RDD1, RDD2 is created.
RDD2 has six elements, so it is clear that a child RDD might have a different number of data elements than its parent RDD. RDD2 also has two partitions: the first partition of RDD2 has three data points (Data8, Data9, and Data10), and the second partition also has three elements (Data11, Data12, and Data13). In general, don’t be confused if a child RDD ends up with a different number of elements, or even a different number of partitions, than its parent RDD.
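As a hedged sketch of the same idea (illustrative names again), flatMap() produces a child RDD with more elements than its parent, and repartition() produces one with a different number of partitions:

>>> parentRDD = sc.parallelize([1, 2, 3], 2)
>>> childRDD = parentRDD.flatMap(lambda x: [x, x * 10])   # two output elements per input element
>>> childRDD.collect()
[1, 10, 2, 20, 3, 30]
>>> childRDD.getNumPartitions()                           # the element count changed, not the partition count
2
>>> repartitionedRDD = childRDD.repartition(3)            # a child RDD with a different partition count
>>> repartitionedRDD.getNumPartitions()
3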
Figure 4-4 illustrates an action performed on an RDD. In this example, we are applying the summation action. Summed data is returned to the driver. In other cases, the result of an action can be saved to a file or to another destination.
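A minimal sketch of the summation action (numRDD is an illustrative name; the output path in the comment is a placeholder): the result of sum() travels from the executors back to the driver as a single Python number:

>>> numRDD = sc.parallelize([10, 20, 30, 40])
>>> numRDD.sum()                                # action: the summed value is returned to the driver
100
>>> numRDD.reduce(lambda a, b: a + b)           # the same summation, expressed with reduce()
100
>>> # numRDD.saveAsTextFile('some/output/path') # or write the RDD's elements out to files instead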