Problem
You want to read table data from HBase.
Solution
We have been given a data table named pysparkTable in HBase. You want to read that table by using PySpark. The data in pysparkTable is shown in Table 6-3.
Table 6-3. pysparkTable Data

RowID    btcf1:btc1    btcf2:btc2
00001    c11           c21
00002    c12           c22
00003    c13           c23
00004    c14           c24
Let’s explain this table data. The pysparkTable table consists of four rows and two column families, btcf1 and btcf2. Column btc1 is under column family btcf1, and column btc2 is under column family btcf2. Remember that the code presented later in this recipe works only with Spark 1.6 and earlier versions of PySpark. Try tweaking the code to run on PySpark 2.x.
We are going to use the newAPIHadoopRDD() function, which is defined on SparkContext sc. This function returns a paired RDD. Table 6-4 lists the arguments of the newAPIHadoopRDD() function.
How It Works
Let’s first define all the arguments that have to be passed into our newAPIHadoopRDD() function:
>>> hostName = 'localhost'
>>> tableName = 'pysparkBookTable'
>>> ourInputFormatClass='org.apache.hadoop.hbase.mapreduce.TableInputFormat'
>>> ourKeyClass='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
>>> ourValueClass='org.apache.hadoop.hbase.client.Result'
>>> ourKeyConverter='org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter'
>>> ourValueConverter='org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter'
>>> configuration = {}
>>> configuration['hbase.mapreduce.inputtable'] = tableName
>>> configuration['hbase.zookeeper.quorum'] = hostName
Table 6-4. Arguments of the newAPIHadoopRDD() Function

Argument            Description
inputFormatClass    Fully qualified classname of the Hadoop InputFormat (e.g., "org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
keyClass            Fully qualified classname of the key Writable class (e.g., "org.apache.hadoop.io.Text")
valueClass          Fully qualified classname of the value Writable class (e.g., "org.apache.hadoop.io.LongWritable")
keyConverter        Key converter
valueConverter      Value converter
conf                Hadoop configuration, passed in as a dict
batchSize           The number of Python objects represented as a single Java object (default 0, which chooses the batch size automatically)

Now it is time to call the newAPIHadoopRDD() function with its arguments:

>>> tableRDDfromHBase = sc.newAPIHadoopRDD(
...                         inputFormatClass = ourInputFormatClass,
...                         keyClass = ourKeyClass,
...                         valueClass = ourValueClass,
...                         keyConverter = ourKeyConverter,
...                         valueConverter = ourValueConverter,
...                         conf = configuration
... )
Let’s see how our paired RDD, tableRDDfromHBase, looks:
>>> tableRDDfromHBase.take(2)

Here is the output:

[(u'00001', u'{"qualifier" : "btc1", "timestamp" : "1496715394968", "columnFamily" : "btcf1", "row" : "00001", "type" : "Put", "value" : "c11"}\n{"qualifier" : "btc2", "timestamp" : "1496715408865", "columnFamily" : "btcf2", "row" : "00001", "type" : "Put", "value" : "c21"}'),
 (u'00002', u'{"qualifier" : "btc1", "timestamp" : "1496715423206", "columnFamily" : "btcf1", "row" : "00002", "type" : "Put", "value" : "c12"}\n{"qualifier" : "btc2", "timestamp" : "1496715436087", "columnFamily" : "btcf2", "row" : "00002", "type" : "Put", "value" : "c22"}')]
The paired RDD tableRDDfromHBase has the RowID as its key. The value part is a set of JSON strings describing the column families, column qualifiers, and cell values. In a previous recipe, we solved the problem of reading JSON data.
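Since each value in tableRDDfromHBase is a newline-separated list of JSON strings, we can turn it into Python dictionaries with the json module. The following is a minimal sketch, not taken from the original recipe; the helper name parseCells and the RDD name parsedRDDfromHBase are our own, and the sketch assumes tableRDDfromHBase has been created as shown previously:

>>> import json
>>> def parseCells(cellLines):
...     # Each cell of an HBase row arrives as one JSON string per line.
...     return [json.loads(line) for line in cellLines.split('\n')]
...
>>> parsedRDDfromHBase = tableRDDfromHBase.mapValues(parseCells)
>>> firstRecord = parsedRDDfromHBase.first()

Each record of parsedRDDfromHBase now pairs a RowID with a list of dictionaries, one per cell, so fields such as columnFamily, qualifier, and value can be accessed directly.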
■ Note Remember, the code in Recipe 6-12 will work only with Spark 1.6 and earlier. You can get the code on GitHub at https://github.com/apache/spark/blob/ed9d80385486cd39a84a689ef467795262af919a/examples/src/main/python/hbase_inputformat.py.
There is another twist. We are using many classes, so we have to add some JARs while starting the PySpark shell. The following are the JAR files:
• spark-examples-1.6.0-hadoop2.6.0.jar
• hbase-client-1.2.4.jar
• hbase-common-1.2.4.jar
The following is the command to start the PySpark shell:
pyspark --jars 'spark-examples-1.6.0-hadoop2.6.0.jar','hbase-client-1.2.4.jar','hbase-common-1.2.4.jar'
Optimizing PySpark and PySpark Streaming
Spark is a distributed framework that facilitates parallel processing. Parallel algorithms require both computation and communication between machines. While communicating, machines send or exchange data; this is also known as shuffling.
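To make shuffling concrete, here is a minimal sketch (not one of this chapter's recipes); the data and variable names are made up for illustration, and sc is an existing SparkContext:

>>> pairData = [('a', 1), ('b', 2), ('a', 3), ('b', 4)]
>>> pairRDD = sc.parallelize(pairData, 2)
>>> # reduceByKey() is a wide transformation: records that share a key may
>>> # sit on different partitions, so Spark shuffles them to the same machine.
>>> sumsByKey = pairRDD.reduceByKey(lambda x, y: x + y)
>>> sumsByKey.collect()

In contrast, a narrow transformation such as map() works partition by partition and needs no shuffle, which is one reason shuffles matter so much for performance.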
Writing code is easy. But writing a program that is efficient and easy for others to understand requires more effort. This chapter presents some techniques for making PySpark programs clearer and more efficient.
Making decisions is a day-to-day activity. Our data-conscious population wants to include data analysis and result inference at the time of decision-making. We can gather data and analyze it, and we have done all of that in previous chapters. But people are becoming more and more interested in analyzing data as it comes in; that is, in analyzing streaming data.
Handling streaming data requires more robust systems and proper algorithms. The fault tolerance of a batch-processing system is sometimes less complex than the fault tolerance of a streaming-execution system. This is because in stream-data processing, we are reading data from an external source, running the computation, and saving the results, all at the same time. More activities translate into a greater chance of failure.
In PySpark, streaming data is handled by its library, PySpark Streaming. PySpark Streaming is a set of APIs that provide a wrapper over PySpark Core. These APIs are efficient and deal with many aspects of fault-tolerance too. We are going to read streaming data from the console by using PySpark and then analyze it. We are also going to read data from Apache Kafka by using PySpark Streaming and then analyze the data.
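As a preview of the streaming recipes, here is a minimal sketch of reading lines from a network socket with PySpark Streaming and counting words per batch. It is only an illustration, assuming a text source such as nc -lk 9999 is running on localhost port 9999; the exact code used in Recipes 7-3 and 7-4 may differ:

>>> from pyspark.streaming import StreamingContext
>>> # Create a streaming context that groups incoming data into 10-second batches.
>>> ssc = StreamingContext(sc, 10)
>>> lines = ssc.socketTextStream('localhost', 9999)
>>> # Count the words that arrive in each batch and print the counts.
>>> wordCounts = lines.flatMap(lambda line: line.split()) \
...                   .map(lambda word: (word, 1)) \
...                   .reduceByKey(lambda x, y: x + y)
>>> wordCounts.pprint()
>>> ssc.start()
>>> ssc.awaitTermination()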
This chapter covers the following recipes:
Recipe 7-1. Optimize the page-rank algorithm using PySpark code
Recipe 7-2. Implement the k-nearest neighbors algorithm using PySpark
Recipe 7-3. Read streaming data from the console using PySpark Streaming
Recipe 7-4. Integrate Apache Kafka with PySpark Streaming, and read and analyze the data
Recipe 7-5. Execute a PySpark script in local mode
Recipe 7-6. Execute a PySpark script using Standalone Cluster Manager and the Mesos cluster manager