
PySpark Power Guide



What is PySpark?

PySpark is the Python API for Apache Spark.

It offers the PySpark shell, which connects the Python API to the Spark core and in turn initializes the Spark context.

More on PySpark

• For any Spark functionality, the entry point is the SparkContext.

• The SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.

• By default, the PySpark shell has a SparkContext available as sc, so creating a new SparkContext will not work. A minimal shell sketch follows.
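A minimal sketch of using the shell's pre-created SparkContext (run inside the PySpark shell; the sample numbers are arbitrary):

rdd = sc.parallelize([1, 2, 3, 4])   # `sc` already exists in the shell, just use it
print(rdd.count())                   # 4

# Creating a second context fails because `sc` is already active, e.g.:
# SparkContext()  ->  ValueError: Cannot run multiple SparkContexts at once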

Py4J

• PySpark is built on top of Spark's Java API.

• Data is processed in Python and cached/shuffled in the JVM.

• Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.

• Here, methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods.

More on Py4J

• In the Python driver program, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.

• Py4J is used on the driver to establish local communication between the Python and Java SparkContext objects; a short sketch of this bridge follows.
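A small illustration of the bridge described above. Note that _jvm and _jsc are internal PySpark attributes rather than public API, and the application name is an arbitrary choice; this is only a sketch of how Py4J exposes JVM objects to Python.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-demo").getOrCreate()
sc = spark.sparkContext

jvm = sc._jvm                                             # Py4J view of the JVM launched for this driver
print(jvm.java.lang.System.getProperty("java.version"))   # call a Java method as if it were local Python
print(sc._jsc)                                            # the wrapped JavaSparkContext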

Installing and Configuring PySpark

• PySpark requires Python 2.6 or higher.

• PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.

• By default, PySpark expects python to be available on the system PATH and uses it to run programs.

• All of PySpark's library dependencies, including Py4J, are bundled with PySpark and are imported automatically.

Getting Started

We can enter Spark's Python environment by running the following command in the shell:

./bin/pyspark

This will start your PySpark shell:

Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.12 (default, Nov 20 2017 18:23:56)
SparkSession available as 'spark'.
>>>

Loading and Saving Data

This chapter describes the general methods for loading and saving data using the Spark data sources.

Generic Load/Save Functions

In most cases, the default data source (parquet, unless configured otherwise) will be used for all operations:

df = spark.read.load("file path")

# Spark loads the data source from the defined file path

df.select("column name", "column name").write.save("file name")

# The DataFrame is saved in the defined format

# By default it is saved in the Spark Warehouse

The file path can be on the local machine as well as on HDFS.

Manually Specifying Options

You can also manually specify the data source that will be used, along with any extra options that you would like to pass to the data source.

Data sources are specified by their fully qualified name, but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text), as in the sketch below.
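A short sketch (the file names are placeholders, not part of the original guide):

df_csv = spark.read.load("people.csv", format="csv", sep=",", header="true", inferSchema="true")
df_json = spark.read.format("json").load("people.json")

# Equivalent shortcuts: spark.read.csv("people.csv", header=True), spark.read.json("people.json")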

Specific File Formats

DataFrames loaded from any type of data source can be converted to other types using the syntax shown below.

A JSON file can be loaded:

df = spark.read.load("path of json file", format="json")

Apache Parquet

• Apache Parquet is a columnar storage format available to all projects in the Hadoop ecosystem, irrespective of the data-processing framework, data model, or programming language used.

• Spark SQL provides support for both reading and writing Parquet files; a short read/write sketch follows this list.

• When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
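A minimal sketch, assuming the df DataFrame from the load example above and a hypothetical output path:

df.write.parquet("people.parquet")                  # write the DataFrame as Parquet files

parquetDF = spark.read.parquet("people.parquet")    # read them back
parquetDF.createOrReplaceTempView("parquetTable")
spark.sql("SELECT * FROM parquetTable").show()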

Example of Transformations

map(func): Passes each element of the RDD through the supplied function.

union(): The new RDD contains elements from the source RDD and the argument RDD.

intersection(): The new RDD includes only the elements common to the source RDD and the argument RDD.

cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.

A small sketch of these transformations follows.
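A minimal sketch (assumes the shell's SparkContext sc; the numbers are arbitrary, and element order in the results may vary):

a = sc.parallelize([1, 2, 3])
b = sc.parallelize([3, 4])

squares = a.map(lambda x: x * x)     # [1, 4, 9]
both    = a.union(b)                 # [1, 2, 3, 3, 4]
common  = a.intersection(b)          # [3]
pairs   = a.cartesian(b)             # [(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)]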

Actions

• Actions return the concluding results of RDD computations.

• Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and then either write the final results to the file system or return them to the driver program.

count, collect, reduce, take, and first are a few of the actions in Spark.

Example of Actions

count(): Gets the number of data elements in the RDD.

collect(): Gets all the data elements of the RDD as an array.

reduce(func): Aggregates the data elements of the RDD using a function that takes two arguments and returns one.

take(n): Fetches the first n data elements of the RDD and returns them to the driver program.

foreach(func): Executes the function for each data element of the RDD; usually used to update an accumulator or to interact with external systems.

first(): Retrieves the first data element of the RDD; it is similar to take(1).

saveAsTextFile(path): Writes the content of the RDD to a text file, or a set of text files, on the local file system or HDFS.

A quick sketch of these actions follows.
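A quick sketch (again assuming the shell's sc; the values and the output path are arbitrary):

nums = sc.parallelize([5, 3, 8, 1])

nums.count()                     # 4
nums.collect()                   # [5, 3, 8, 1]
nums.reduce(lambda a, b: a + b)  # 17
nums.take(2)                     # [5, 3]
nums.first()                     # 5
nums.foreach(lambda x: None)     # runs on the executors, returns nothing to the driver
nums.saveAsTextFile("nums_out")  # hypothetical output directory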

What is a DataFrame?

In general, a DataFrame can be defined as a data structure that is tabular in nature.

It represents rows, each of which consists of a number of observations.

Rows can have a variety of data formats (heterogeneous), whereas a column can only hold data of a single data type (homogeneous).

DataFrames mainly contain some metadata in addition to the data, such as column and row names.

Why DataFrames?

• DataFrames are widely used for processing large collections of structured or semi-structured data.

• They are able to handle petabytes of data.

• In addition, they support a wide range of data formats for reading as well as writing.

The snippets below build small DataFrames from Row objects:

from pyspark.sql import SparkSession, Row

spark = SparkSession \
    .builder \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData = [s1, s2]
df = spark.createDataFrame(StudentData)
df.show()

passenger = Row("Name", "age", "source", "destination")
s1 = passenger('David', 22, 'London', 'Paris')
s2 = passenger('Steve', 22, 'New York', 'Sydney')
x = [s1, s2]
df1 = spark.createDataFrame(x)
df1.show()

Resilient Distributed Datasets (RDDs)

• Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.

• An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.

• Once RDDs are created, they are immutable.

There are two ways to create RDDs:

1. Parallelizing a collection in the driver program.

2. Referencing a dataset in an external storage system, such as a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat.

Features of RDDs

Resilient: fault tolerant thanks to the RDD lineage graph, and therefore able to recompute damaged or missing partitions caused by node failures.

Dataset: a set of partitioned data with primitive values or values of values, for example records or tuples.

Distributed: the data resides on multiple nodes in a cluster.

Creating RDDs

Parallelizing a collection in the driver program. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
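The code for that example (the standard pattern from the Spark programming guide, using the shell's sc):

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)      # distData is now an RDD partitioned across the cluster

distData.reduce(lambda a, b: a + b)  # 15 -- the elements can now be operated on in parallel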

In conclusion, a DataFrame is data organized into named columns.

Features of DataFrames

• Distributed

• Lazy evaluation

• Immutable

Features Explained

DataFrames are distributed in nature, which makes them a fault-tolerant and highly available data structure.

Lazy evaluation is an evaluation strategy that holds the evaluation of an expression until its value is needed.

DataFrames are immutable, which means they are objects whose state cannot be modified after they are created (see the short sketch below).
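A short sketch of lazy evaluation and immutability, reusing the df built from the Student rows earlier:

df2 = df.filter(df.age > 22)   # returns a *new* DataFrame; df itself is never modified
                               # nothing is computed yet -- the filter is only recorded (lazy)

df2.show()                     # the action triggers the actual evaluation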

DataFrame Sources

For constructing a DataFrame, a wide range of sources is available, such as:

• Structured data files

Spark SQL

Spark SQL provides a programming abstraction called DataFrame and can act as a distributed SQL query engine.

Features of Spark SQL

The main capabilities Spark SQL provides for working with structured and semi-structured data are:

• It provides the DataFrame abstraction in Scala, Java, and Python.

• Spark SQL can read and write data in various structured formats, such as Hive tables, JSON, and Parquet.

• Data can be queried using SQL, as in the sketch below.
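A sketch of querying with Spark SQL (the view name is an assumption; df is the Student DataFrame from before):

df.createOrReplaceTempView("students")

adults = spark.sql("SELECT firstName, lastName FROM students WHERE age >= 22")
adults.show()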


For more details about Spark SQL, refer to the Fresco course on Spark SQL.

Important Classes of Spark SQL and DataFrames

• pyspark.sql.SparkSession: the main entry point for DataFrame and Spark SQL functionality.

• pyspark.sql.DataFrame: a distributed collection of data grouped into named columns.

• pyspark.sql.Column: a column expression in a DataFrame.

• pyspark.sql.Row: a row of data in a DataFrame.

• pyspark.sql.GroupedData: aggregation methods, returned by DataFrame.groupBy().

More on Classes

• pyspark.sql.DataFrameNaFunctions: methods for handling missing data (null values).

• pyspark.sql.DataFrameStatFunctions: methods for statistics functionality.

• pyspark.sql.functions: the built-in functions available for DataFrames.

• pyspark.sql.types: the available data types.

• pyspark.sql.Window: for working with window functions.

A small sketch using a few of these classes follows.
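A small sketch, reusing the Student df from earlier (the fill value and the ordering columns are assumptions):

from pyspark.sql import functions as F, Window

cleaned = df.na.fill({"telephone": 0})                    # DataFrameNaFunctions via df.na
by_age  = df.groupBy("age").agg(F.count("*").alias("n"))  # GroupedData plus a built-in function

w = Window.partitionBy("age").orderBy("firstName")
ranked = df.withColumn("rank", F.row_number().over(w))    # a window function
ranked.show()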

Creating a DataFrame: Demo

The entry point into all functionality in Spark is the SparkSession class.

To create a basic SparkSession, just use SparkSession.builder:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Data Frame Example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

More on Creation

Import the sql module from pyspark:

from pyspark.sql import *


The second way to create an RDD is to reference an external dataset, for example a text file:

distFile = sc.textFile("data.txt")

RDD Operations

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Similarly, reduce is an action that aggregates all RDD elements using some function and returns the final result to the driver program.

More on RDD Operations

As a recap of RDD basics, consider the simple program shown below:

lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)

The first line defines a base RDD from an external file. The second line defines lineLengths as the result of a map transformation. Finally, in the third line, we run reduce, which is an action.

filter(func): Returns a new dataset (RDD) formed by selecting those elements of the source on which the function returns true.
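A short sketch of filter(), reusing lineLengths from the recap above:

longLines = lineLengths.filter(lambda n: n > 10)   # keep only the lengths greater than 10
longLines.collect()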
