What is PySpark?
PySpark is the Python API for Apache Spark
It offers the PySpark shell, which connects the Python API to the Spark core and in turn
initializes the SparkContext.
More on PySpark
For any Spark functionality, the entry point is the SparkContext.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.
By default, the PySpark shell has a SparkContext available as sc, so creating a new SparkContext won't work.
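As a quick illustration, the pre-created sc can be inspected directly in the shell; the values shown here are examples and will vary by environment:

sc.version   # e.g. '2.2.0'
sc.master    # e.g. 'local[*]'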
Py4J
PySpark is built on top of Spark's Java API.
Data is processed in Python and cached / shuffled in the JVM.
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods.
More on Py4J
In the Python driver program, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is used on the driver to establish local communication between the Python and Java SparkContext objects.
Installing and Configuring PySpark
PySpark requires Python 2.6 or higher.
PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
By default, PySpark expects python to be available on the system PATH and uses it to run programs.
All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.
Getting Started
We can enter Spark's Python environment by running the following command in the shell:
./bin/pyspark
This will start your PySpark shell.
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.12 (default, Nov 20 2017 18:23:56)
SparkSession available as 'spark'.
>>>
This chapter describes the general methods for loading and saving data using Spark Data Sources.
Generic Load/Save Functions
In most cases, the default data source (Parquet, unless spark.sql.sources.default is configured otherwise) will be used for all operations.
df = spark.read.load("file path")
# Spark loads the data source from the given file path
df.select("column name", "column name").write.save("file name")
# The DataFrame with the selected columns is saved in the defined format
# By default it is saved under the Spark warehouse directory
The file path can refer to the local machine as well as to HDFS.
Manually Specifying Options
You can also manually specify the data source that will be used, along with any extra options that you would like to pass to it.
Data sources are specified by their fully qualified name, but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text), as in the sketch below.
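A minimal sketch of passing an explicit format and extra options to the reader; the file name people.csv and its columns are assumptions, not part of the original example:

df = spark.read.load("people.csv", format="csv", sep=",",
                     inferSchema="true", header="true")
# Equivalent, using the format()/option() builder style
df = spark.read.format("csv").option("header", "true").load("people.csv")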
Specific File Formats
DataFrames loaded from any type of data source can be converted to other types
by using the syntax shown below.
For example, a JSON file can be loaded:
df = spark.read.load("path of json file", format="json")
Apache Parquet
Apache Parquet is a columnar storage format available to every project in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language used.
Spark SQL provides support for both reading and writing Parquet files.
When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
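A short sketch of the Parquet round trip; the file names people.json and people.parquet are assumptions for illustration:

df = spark.read.load("people.json", format="json")
df.write.parquet("people.parquet")                 # write the DataFrame as Parquet
parquetDF = spark.read.parquet("people.parquet")   # read it back
parquetDF.show()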
map(func): Passes each element of the RDD through the supplied function and returns a new RDD of the results.
union(): The new RDD contains the elements of both the source RDD and the argument RDD.
intersection(): The new RDD includes only the elements common to the source RDD and the argument RDD.
cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.
A small sketch of these transformations follows below.
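The sketch assumes an active SparkContext sc, as in the PySpark shell:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 6])
a.map(lambda x: x * 2).collect()   # [2, 4, 6, 8]
a.union(b).collect()               # [1, 2, 3, 4, 3, 4, 5, 6]
a.intersection(b).collect()        # [3, 4]
a.cartesian(b).count()             # 16 pairs, one per (x, y) combination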
Actions
Actions return the final results of RDD computations.
An action triggers execution using the lineage graph: the data is loaded into the original RDD, all intermediate transformations are executed, and the final result is written out to the file system or returned to the driver program.
count, collect, reduce, take, and first are a few of the actions in Spark.
Example of Actions
count(): Get the number of data elements in the RDD.
collect(): Get all the data elements in the RDD as an array.
reduce(func): Aggregate the data elements in the RDD using a function that takes two arguments and returns one.
take(n): Fetch the first n data elements of the RDD, returned to the driver program.
foreach(func): Execute the function for each data element in the RDD; usually used to update an accumulator or to interact with external systems.
first(): Retrieve the first data element in the RDD; similar to take(1).
saveAsTextFile(path): Write the content of the RDD to a text file, or a set of text files, in the local file system or HDFS.
A quick sketch of these actions follows below.
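The sketch again assumes an active SparkContext sc; the output path nums_out is hypothetical:

nums = sc.parallelize([1, 2, 3, 4, 5])
nums.count()                      # 5
nums.collect()                    # [1, 2, 3, 4, 5]
nums.reduce(lambda a, b: a + b)   # 15
nums.take(3)                      # [1, 2, 3]
nums.first()                      # 1
nums.saveAsTextFile("nums_out")   # writes part files to the given path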
What is a DataFrame?
In general, a DataFrame can be defined as a data structure that is tabular in nature.
It consists of rows, each of which holds a number of observations.
Rows can have a variety of data formats (heterogeneous), whereas a column holds data of a single data type (homogeneous).
DataFrames also carry some metadata in addition to the data, such as column and row names.
Why DataFrames?
DataFrames are widely used for processing large collections of structured or semi-structured data.
They are able to handle petabytes of data.
In addition, they support a wide range of data formats for both reading and writing.
from pyspark.sql import Row
# In the PySpark shell a SparkSession is already available as `spark`
Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData = [s1, s2]
df = spark.createDataFrame(StudentData)
df.show()
# In a standalone application, create the SparkSession explicitly first
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
passenger = Row("Name", "age", "source", "destination")
s1 = passenger('David', 22, 'London', 'Paris')
s2 = passenger('Steve', 22, 'New York', 'Sydney')
x = [s1, s2]
df1 = spark.createDataFrame(x)
df1.show()
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the main abstraction
in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once RDDs are created, they are immutable.
There are two ways to create RDDs:
1. Parallelizing a collection in the driver program
2. Referencing a dataset in an external storage system, such as a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat
Features Of RDDs
Resilient, i.e. tolerant to faults using the RDD lineage graph, and therefore able
to recompute damaged or missing partitions caused by node failures.
Dataset - a set of partitioned data with primitive values or values of values,
for example records or tuples.
Distributed - data resides on multiple nodes in a cluster.
Creating RDDs
Parallelizing a collection in the driver program.
For example, here is how to create a parallelized collection holding the numbers 1 to 5:
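A minimal sketch, assuming an active SparkContext sc:

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)        # distribute the local list across the cluster
distData.reduce(lambda a, b: a + b)    # 15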
In conclusion, a DataFrame is data organized into named columns.
Features of DataFrame
Distributed
Lazy Evaluation
Immutable
Features Explained
DataFrames are distributed in nature, which makes them a fault-tolerant and highly
available data structure.
Lazy evaluation is an evaluation strategy that holds the evaluation of an
expression until its value is needed.
DataFrames are immutable in nature, which means that a DataFrame is an object whose
state cannot be modified after it is created.
DataFrame Sources
A wide range of sources is available for constructing a DataFrame, such as:
Structured data files
Spark SQL provides a programming abstraction called DataFrame and can act as a
distributed SQL query engine.
Features of Spark SQL
The main capabilities Spark SQL provides for working with structured and semi-structured data include:
It provides the DataFrame abstraction in Scala, Java, and Python.
It can read and write data in various structured formats, such as Hive tables, JSON, and Parquet.
Data can be queried using Spark SQL; a minimal sketch follows below.
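The sketch registers a DataFrame as a temporary view and queries it with SQL; the file people.json, the view name people, and the Name and age columns are assumptions for illustration:

df = spark.read.load("people.json", format="json")
df.createOrReplaceTempView("people")     # register the DataFrame as a SQL temp view
adults = spark.sql("SELECT Name FROM people WHERE age >= 18")
adults.show()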
For more details about Spark SQL, refer to the Fresco course Spark SQL.
Important Classes of Spark SQL and DataFrames
pyspark.sql.SparkSession : Main entry point for DataFrame and Spark SQL functionality
pyspark.sql.DataFrame : A distributed collection of data grouped into named columns
pyspark.sql.Column : A column expression in a DataFrame
pyspark.sql.Row : A row of data in a DataFrame
pyspark.sql.GroupedData : Aggregation methods, returned by DataFrame.groupBy()
More On Classes
pyspark.sql.DataFrameNaFunctions : Methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions : Methods for statistics functionality
pyspark.sql.functions : A list of built-in functions available for DataFrames
pyspark.sql.types : A list of the available data types
pyspark.sql.Window : For working with window functions
A short sketch using a few of these classes follows below.
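The sketch assumes an existing DataFrame df with hypothetical Name and age columns:

from pyspark.sql import functions as F
df.groupBy("age").agg(F.count("Name").alias("n")).show()   # GroupedData plus a built-in function
df.na.fill(0).show()                                        # DataFrameNaFunctions, reached via df.na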
Creating a DataFrame demo
The entry point into all functionality in Spark is the SparkSession class.
To create a basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Data Frame Example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
More On Creation
Import the sql module from pyspark:
from pyspark.sql import *
An RDD can also be created by referencing an external dataset, for example a text file:
distFile = sc.textFile("data.txt")
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
Similarly, reduce is an action which aggregates all RDD elements using some function and then returns the final result to the driver program.
More On RDD Operations
As a recap to RDD basics, consider the simple program shown below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. The second line defines lineLengths as the result of a map transformation. Finally, in the third line, we run reduce, which is an action.
filter(func): Returns a new dataset (RDD) created by choosing the
elements of the source on which the function returns true; a minimal sketch follows below.
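The sketch assumes an active SparkContext sc:

words = sc.parallelize(["spark", "hadoop", "spark sql", "hive"])
sparkWords = words.filter(lambda s: "spark" in s)   # keep only elements containing "spark"
sparkWords.collect()                                # ['spark', 'spark sql']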