Problem
You want to read table data from HBase.
Solution
We have been given a data table named pysparkTable in HBase. You want to read that table by using PySpark. The data in pysparkTable is shown in Table 6-3.
Table 6-3. pysparkTable Data

RowID    btcf1:btc1    btcf2:btc2
00001    c11           c21
00002    c12           c22
00003    c13           c23
00004    c14           c24
Let’s explain this table data. The pysparkTable table consists of four rows and two column families, btcf1 and btcf2. Column btc1 is under column family btcf1, and column btc2 is under column family btcf2. Remember that the code presented later in this recipe works only with Spark 1.6 and earlier versions of PySpark. Try tweaking the code to run on PySpark 2.x.
We are going to use the newAPIHadoopRDD() function, which is defined on SparkContext sc. This function returns a paired RDD. Table 6-4 lists the arguments of the newAPIHadoopRDD() function.
How It Works
Let’s first define all the arguments that have to be passed into our newAPIHadoopRDD() function:
>>> hostName = 'localhost'
>>> tableName = 'pysparkBookTable'
>>> ourInputFormatClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
>>> ourKeyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
>>> ourValueClass='org.apache.hadoop.hbase.client.Result'
>>> ourKeyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
>>> ourValueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
>>> configuration = {}
>>> configuration['hbase.mapreduce.inputtable'] = tableName
>>> configuration['hbase.zookeeper.quorum'] = hostName
Table 6-4. Arguments of the newAPIHadoopRDD() Function

Argument            Description
inputFormatClass    Fully qualified classname of the Hadoop InputFormat (e.g., "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
keyClass            Fully qualified classname of the key Writable class (e.g., "org.apache.hadoop.io.Text")
valueClass          Fully qualified classname of the value Writable class (e.g., "org.apache.hadoop.io.LongWritable")
keyConverter        Key converter
valueConverter      Value converter
conf                Hadoop configuration, passed in as a dict
batchSize           The number of Python objects represented as a single Java object (default 0, which chooses the batch size automatically)

Now it is time to call the newAPIHadoopRDD() function with its arguments:

>>> tableRDDfromHBase = sc.newAPIHadoopRDD(
...                         inputFormatClass = ourInputFormatClass,
...                         keyClass = ourKeyClass,
...                         valueClass = ourValueClass,
...                         keyConverter = ourKeyConverter,
...                         valueConverter = ourValueConverter,
...                         conf = configuration
... )
Let’s see how our paired RDD, tableRDDfromHBase, looks:
>>> tableRDDfromHBase.take(2)

Here is the output:

[(u'00001', u'{"qualifier" : "btc1", "timestamp" : "1496715394968", "columnFamily" : "btcf1", "row" : "00001", "type" : "Put", "value" : "c11"}\n{"qualifier" : "btc2", "timestamp" : "1496715408865", "columnFamily" : "btcf2", "row" : "00001", "type" : "Put", "value" : "c21"}'),
 (u'00002', u'{"qualifier" : "btc1", "timestamp" : "1496715423206", "columnFamily" : "btcf1", "row" : "00002", "type" : "Put", "value" : "c12"}\n{"qualifier" : "btc2", "timestamp" : "1496715436087", "columnFamily" : "btcf2", "row" : "00002", "type" : "Put", "value" : "c22"}')]
The paired RDD tableRDDfromHBase has the RowID as its key. The value part is a set of JSON strings describing the column families, column qualifiers, and cell values. In a previous recipe, we solved the problem of reading JSON data.
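Since each value in tableRDDfromHBase is a newline-separated list of JSON strings, we can turn it into Python dictionaries with the json module. The following is a minimal sketch, not taken from the original recipe; the helper name parseCells and the RDD name parsedRDDfromHBase are our own, and the sketch assumes tableRDDfromHBase has been created as shown previously:

>>> import json
>>> def parseCells(cellLines):
...     # Each cell of an HBase row arrives as one JSON string per line.
...     return [json.loads(line) for line in cellLines.split('\n')]
...
>>> parsedRDDfromHBase = tableRDDfromHBase.mapValues(parseCells)
>>> firstRecord = parsedRDDfromHBase.first()

Each record of parsedRDDfromHBase now pairs a RowID with a list of dictionaries, one per cell, so fields such as columnFamily, qualifier, and value can be accessed directly.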
■ Note Remember, the code in Recipe 6-12 will work only with Spark 1.6 and earlier. You can get the code on GitHub at https://github.com/apache/spark/blob/ed9d80385486cd39a84a689ef467795262af919a/examples/src/main/python/hbase_inputformat.py.
There is another twist. We are using many classes, so we have to add some JARs while starting the PySpark shell. The following are the JAR files:
• spark-examples-1.6.0-hadoop2.6.0.jar
• hbase-client-1.2.4.jar
• hbase-common-1.2.4.jar
The following is the command to start the PySpark shell:
pyspark --jars 'spark-examples-1.6.0-hadoop2.6.0.jar','hbase-client-1.2.4.jar','hbase-common-1.2.4.jar'
Optimizing PySpark and PySpark Streaming
Spark is a distributed framework that facilitates parallel processing. Parallel algorithms require both computation and communication between machines. While communicating, machines send or exchange data; this is also known as shuffling.
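To make shuffling concrete, here is a minimal sketch (not one of this chapter's recipes); the data and variable names are made up for illustration, and sc is an existing SparkContext:

>>> pairData = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]
>>> pairRDD = sc.parallelize(pairData, 2)
>>> # reduceByKey() is a wide transformation: records that share a key may
>>> # sit on different partitions, so Spark shuffles them to the same machine.
>>> sumsByKey = pairRDD.reduceByKey(lambda x, y: x + y)
>>> sumsByKey.collect()

In contrast, a narrow transformation such as map() works partition by partition and needs no shuffle, which is one reason shuffles matter so much for performance.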
Writing code is easy. But writing a program that is efficient and easy for others to understand requires more effort. This chapter presents some techniques for making PySpark programs clearer and more efficient.
Making decisions is a day-to-day activity. Our data-conscious population wants to include data analysis and result inference at the time of decision-making. We can gather data and analyze it, and we have done all of that in previous chapters. But people are becoming more and more interested in analyzing data as it comes in; that is, in analyzing streaming data.
Handling streaming data requires more robust systems and proper algorithms. The fault tolerance of a batch-processing system is sometimes less complex than the fault tolerance of a streaming-execution system. This is because in stream-data processing, we are reading data from an external source, running the computation, and saving the results, all at the same time. More activities translate into a greater chance of failure.
In PySpark, streaming data is handled by its library, PySpark Streaming. PySpark Streaming is a set of APIs that provide a wrapper over PySpark Core. These APIs are efficient and deal with many aspects of fault-tolerance too. We are going to read streaming data from the console by using PySpark and then analyze it. We are also going to read data from Apache Kafka by using PySpark Streaming and then analyze the data.
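As a preview of the streaming recipes, here is a minimal sketch of reading lines from a network socket with PySpark Streaming and counting words per batch. It is only an illustration, assuming a text source such as nc -lk 9999 is running on localhost port 9999; the exact code used in Recipes 7-3 and 7-4 may differ:

>>> from pyspark.streaming import StreamingContext
>>> # Create a streaming context that groups incoming data into 10-second batches.
>>> ssc = StreamingContext(sc, 10)
>>> lines = ssc.socketTextStream('localhost', 9999)
>>> # Count the words that arrive in each batch and print the counts.
>>> wordCounts = lines.flatMap(lambda line: line.split()) \
...                   .map(lambda word: (word, 1)) \
...                   .reduceByKey(lambda x, y: x + y)
>>> wordCounts.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()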
This chapter covers the following recipes:
Recipe 7-1. Optimize the page-rank algorithm using PySpark code
Recipe 7-2. Implement the k-nearest neighbors algorithm using PySpark
Recipe 7-3. Read streaming data from the console using PySpark Streaming
Recipe 7-4. Integrate Apache Kafka with PySpark Streaming, and read and analyze the data
Recipe 7-5. Execute a PySpark script in local mode
Recipe 7-6. Execute a PySpark script using Standalone Cluster Manager and the Mesos cluster manager