What is PySpark?
PySpark is the Python API for Apache Spark
It offers the PySpark shell, which connects the Python API to the Spark core and in turn
initializes the SparkContext.
More on PySpark
For any Spark functionality, the entry point is the SparkContext.
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.
By default, the PySpark shell has a SparkContext available as sc, so creating a new SparkContext won't work.
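As a quick illustration, the pre-created sc can be inspected directly in the shell; the values shown here are examples and will vary by environment:

sc.version   # e.g. '2.2.0'
sc.master    # e.g. 'local[*]'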
Py4J
PySpark is built on top of Spark's Java API.
Data is processed in Python and cached / shuffled in the JVM.
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods.
More on Py4J
In the Python driver program, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is used on the driver to establish local communication between the Python and Java SparkContext objects.
Installing and Configuring PySpark
PySpark requires Python 2.6 or higher.
PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
By default, PySpark expects python to be available on the system PATH and uses it to run programs.
All of PySpark's library dependencies, including Py4J, are bundled with PySpark and automatically imported.
Getting Started
We can enter Spark's Python environment by running the following command in the shell:
./bin/pyspark
This will start your PySpark shell.
Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.12 (default, Nov 20 2017 18:23:56)
SparkSession available as 'spark'.
>>>
This chapter describes the general methods for loading and saving data using Spark Data Sources.
Generic Load/Save Functions
In most cases, the default data source (Parquet, unless spark.sql.sources.default is configured otherwise) will be used for all operations.
df = spark.read.load("file path")
# Spark loads the data source from the given file path
df.select("column name", "column name").write.save("file name")
# The DataFrame with the selected columns is saved in the defined format
# By default it is saved under the Spark warehouse directory
The file path can refer to the local machine as well as to HDFS.
Manually Specifying Options
You can also manually specify the data source that will be used, along with any extra options that you would like to pass to it.
Data sources are specified by their fully qualified name, but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text), as in the sketch below.
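A minimal sketch of passing an explicit format and extra options to the reader; the file name people.csv and its columns are assumptions, not part of the original example:

df = spark.read.load("people.csv", format="csv", sep=",",
                     inferSchema="true", header="true")
# Equivalent, using the format()/option() builder style
df = spark.read.format("csv").option("header", "true").load("people.csv")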
Specific File Formats
DataFrames loaded from any type of data source can be converted to other types
by using the syntax shown below.
For example, a JSON file can be loaded:
df = spark.read.load("path of json file", format="json")
Apache Parquet
Apache Parquet is a columnar storage format available to every project in the Hadoop ecosystem, regardless of the data processing framework, data model, or programming language used.
Spark SQL provides support for both reading and writing Parquet files.
When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
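A short sketch of the Parquet round trip; the file names people.json and people.parquet are assumptions for illustration:

df = spark.read.load("people.json", format="json")
df.write.parquet("people.parquet")                 # write the DataFrame as Parquet
parquetDF = spark.read.parquet("people.parquet")   # read it back
parquetDF.show()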
map(func): Passes each element of the RDD through the supplied function and returns a new RDD of the results.
union(): The new RDD contains the elements of both the source RDD and the argument RDD.
intersection(): The new RDD includes only the elements common to the source RDD and the argument RDD.
cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.
A small sketch of these transformations follows below.
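The sketch assumes an active SparkContext sc, as in the PySpark shell:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 6])
a.map(lambda x: x * 2).collect()   # [2, 4, 6, 8]
a.union(b).collect()               # [1, 2, 3, 4, 3, 4, 5, 6]
a.intersection(b).collect()        # [3, 4]
a.cartesian(b).count()             # 16 pairs, one per (x, y) combination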
Actions
Actions return the final results of RDD computations.
An action triggers execution using the lineage graph: the data is loaded into the original RDD, all intermediate transformations are executed, and the final result is written out to the file system or returned to the driver program.
count, collect, reduce, take, and first are a few of the actions in Spark.
Example of Actions
count(): Get the number of data elements in the RDD.
collect(): Get all the data elements in the RDD as an array.
reduce(func): Aggregate the data elements in the RDD using a function that takes two arguments and returns one.
take(n): Fetch the first n data elements of the RDD, returned to the driver program.
foreach(func): Execute the function for each data element in the RDD; usually used to update an accumulator or to interact with external systems.
first(): Retrieve the first data element in the RDD; similar to take(1).
saveAsTextFile(path): Write the content of the RDD to a text file, or a set of text files, in the local file system or HDFS.
A quick sketch of these actions follows below.
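The sketch again assumes an active SparkContext sc; the output path nums_out is hypothetical:

nums = sc.parallelize([1, 2, 3, 4, 5])
nums.count()                      # 5
nums.collect()                    # [1, 2, 3, 4, 5]
nums.reduce(lambda a, b: a + b)   # 15
nums.take(3)                      # [1, 2, 3]
nums.first()                      # 1
nums.saveAsTextFile("nums_out")   # writes part files to the given path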
What is a DataFrame?
In general, a DataFrame can be defined as a data structure that is tabular in nature.
It consists of rows, each of which holds a number of observations.
Rows can have a variety of data formats (heterogeneous), whereas a column holds data of a single data type (homogeneous).
DataFrames also carry some metadata in addition to the data, such as column and row names.
Why DataFrames?
DataFrames are widely used for processing large collections of structured or semi-structured data.
They are able to handle petabytes of data.
In addition, they support a wide range of data formats for both reading and writing.
from pyspark.sql import Row
# In the PySpark shell a SparkSession is already available as `spark`
Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData = [s1, s2]
df = spark.createDataFrame(StudentData)
df.show()
# In a standalone application, create the SparkSession explicitly first
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
passenger = Row("Name", "age", "source", "destination")
s1 = passenger('David', 22, 'London', 'Paris')
s2 = passenger('Steve', 22, 'New York', 'Sydney')
x = [s1, s2]
df1 = spark.createDataFrame(x)
df1.show()
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the main abstraction
in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once RDDs are created, they are immutable.
There are two ways to create RDDs:
1. Parallelizing a collection in the driver program
2. Referencing a dataset in an external storage system, such as a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat
Features Of RDDs
Resilient, i.e. tolerant to faults using the RDD lineage graph, and therefore able
to recompute damaged or missing partitions caused by node failures.
Dataset - a set of partitioned data with primitive values or values of values,
for example records or tuples.
Distributed - data resides on multiple nodes in a cluster.
Creating RDDs
Parallelizing a collection in the driver program.
For example, here is how to create a parallelized collection holding the numbers 1 to 5:
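A minimal sketch, assuming an active SparkContext sc:

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)        # distribute the local list across the cluster
distData.reduce(lambda a, b: a + b)    # 15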
In conclusion, a DataFrame is data organized into named columns.
Features of DataFrame
Distributed
Lazy Evaluation
Immutable
Features Explained
DataFrames are distributed in nature, which makes them a fault-tolerant and highly
available data structure.
Lazy evaluation is an evaluation strategy that holds the evaluation of an
expression until its value is needed.
DataFrames are immutable in nature, which means that a DataFrame is an object whose
state cannot be modified after it is created.
DataFrame Sources
A wide range of sources is available for constructing a DataFrame, such as:
Structured data files
Spark SQL provides a programming abstraction called DataFrame and can act as a
distributed SQL query engine.
Features of Spark SQL
The main capabilities Spark SQL provides for working with structured and semi-structured data include:
It provides the DataFrame abstraction in Scala, Java, and Python.
It can read and write data in various structured formats, such as Hive tables, JSON, and Parquet.
Data can be queried using Spark SQL; a minimal sketch follows below.
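The sketch registers a DataFrame as a temporary view and queries it with SQL; the file people.json, the view name people, and the Name and age columns are assumptions for illustration:

df = spark.read.load("people.json", format="json")
df.createOrReplaceTempView("people")     # register the DataFrame as a SQL temp view
adults = spark.sql("SELECT Name FROM people WHERE age >= 18")
adults.show()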
For more details about Spark SQL, refer to the Fresco course Spark SQL.
Important Classes of Spark SQL and DataFrames
pyspark.sql.SparkSession : Main entry point for DataFrame and Spark SQL functionality
pyspark.sql.DataFrame : A distributed collection of data grouped into named columns
pyspark.sql.Column : A column expression in a DataFrame
pyspark.sql.Row : A row of data in a DataFrame
pyspark.sql.GroupedData : Aggregation methods, returned by DataFrame.groupBy()
More On Classes
pyspark.sql.DataFrameNaFunctions : Methods for handling missing data (null values)
pyspark.sql.DataFrameStatFunctions : Methods for statistics functionality
pyspark.sql.functions : A list of built-in functions available for DataFrames
pyspark.sql.types : A list of the available data types
pyspark.sql.Window : For working with window functions
A short sketch using a few of these classes follows below.
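The sketch assumes an existing DataFrame df with hypothetical Name and age columns:

from pyspark.sql import functions as F
df.groupBy("age").agg(F.count("Name").alias("n")).show()   # GroupedData plus a built-in function
df.na.fill(0).show()                                        # DataFrameNaFunctions, reached via df.na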
Creating a DataFrame demo
The entry point into all functionality in Spark is the SparkSession class.
To create a basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Data Frame Example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
More On Creation
Import the sql module from pyspark:
from pyspark.sql import *
An RDD can also be created by referencing an external dataset, for example a text file:
distFile = sc.textFile("data.txt")
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.
Similarly, reduce is an action which aggregates all RDD elements using some function and then returns the final result to the driver program.
More On RDD Operations
As a recap to RDD basics, consider the simple program shown below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. The second line defines lineLengths as the result of a map transformation. Finally, in the third line, we run reduce, which is an action.
filter(func): Returns a new dataset (RDD) created by choosing the
elements of the source on which the function returns true; a minimal sketch follows below.
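The sketch assumes an active SparkContext sc:

words = sc.parallelize(["spark", "hadoop", "spark sql", "hive"])
sparkWords = words.filter(lambda s: "spark" in s)   # keep only elements containing "spark"
sparkWords.collect()                                # ['spark', 'spark sql']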