Implement the k-Nearest Neighbors Algorithm by Using PySpark

Part of PySpark Recipes: A Problem-Solution Approach with PySpark 2 (pages 184-189)

Problem

You want to implement the k-nearest neighbors (KNN) algorithm by using PySpark.

Solution

The k-nearest neighbors algorithm is one of the simplest data-classification algorithms.

The similarity between two data points is measured on the basis of the distance between two points.

We have been given a dataset of nine records. This dataset is shown in Table 7-1. In this table, you can see a column named RN. That column indicates the record number.

This is not part of the data; the record number is given to help you understand the KNN algorithm.

Table 7-1. Data for Classification by KNN

RN   ivs1   ivs2   ivs3   Group
1    3.09   1.97   3.73   group1
2    2.96   2.15   4.16   group1
3    2.87   1.93   4.39   group1
4    3.02   1.55   4.43   group1
5    1.80   3.65   2.08   group2
6    1.36   4.43   1.95   group2
7    1.71   4.35   1.94   group2
8    1.03   3.75   2.12   group2
9    2.30   3.59   1.99   group2

Let’s say that we have a record (ivs1 = 2.5, ivs2 = 1.7, ivs3 = 4.2). We will call this record the new record. We have to classify this record; it will be in either group1 or group2.

To classify the record, we’ll use the KNN algorithm. Here are the steps:

1. Decide the k.

k is the number of nearest neighbors we are going to choose for deciding the class of the new record. Let’s say k is 5.

2. Find the distance of the new record from each record in the data.

The distance calculation is done using the Euclidean distance method, as shown in Table 7-2.

In this table, we have calculated the distance between the new record and other records. The third column is the distance.

The distance value of the first row of this table is the distance between the new record and record 1.
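As a sanity check, the distance in the first row of Table 7-2 can be reproduced in plain Python with math.sqrt (this snippet is for illustration and is not part of the book's code):

```python
import math

# Distance between the new record (2.5, 1.7, 4.2) and record 1 (3.09, 1.97, 3.73)
new_record = (2.5, 1.7, 4.2)
record1 = (3.09, 1.97, 3.73)

distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(record1, new_record)))
print(round(distance, 2))  # 0.8, matching the first row of Table 7-2
```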

3. Sort the distances.

Now we must sort the distances in increasing order. Simultaneously, we have to maintain the association between the RN column and the Distance column. In Table 7-3, we have sorted the Distance column, which is still associated with the RN and Group columns.

Table 7-2. Distance Calculation

RN   Distance Calculation with New Record                   Distance
1    sqrt((3.09-2.5)^2 + (1.97-1.7)^2 + (3.73-4.2)^2)       0.80
2    sqrt((2.96-2.5)^2 + (2.15-1.7)^2 + (4.16-4.2)^2)       0.65
3    sqrt((2.87-2.5)^2 + (1.93-1.7)^2 + (4.39-4.2)^2)       0.47
4    sqrt((3.02-2.5)^2 + (1.55-1.7)^2 + (4.43-4.2)^2)       0.58
5    sqrt((1.80-2.5)^2 + (3.65-1.7)^2 + (2.08-4.2)^2)       2.96
6    sqrt((1.36-2.5)^2 + (4.43-1.7)^2 + (1.95-4.2)^2)       3.71
7    sqrt((1.71-2.5)^2 + (4.35-1.7)^2 + (1.94-4.2)^2)       3.57
8    sqrt((1.03-2.5)^2 + (3.75-1.7)^2 + (2.12-4.2)^2)       3.26
9    sqrt((2.30-2.5)^2 + (3.59-1.7)^2 + (1.99-4.2)^2)       2.91

Table 7-3. Sorted Distances

RN   Distance   Group
3    0.47       group1
4    0.58       group1
2    0.65       group1
1    0.80       group1
9    2.91       group2
5    2.96       group2
8    3.26       group2
7    3.57       group2
6    3.71       group2

4. Find the k-nearest neighbors.

Now that we have sorted the Distance column, we have to identify the neighbors of the new record. What do I mean by neighbors here? Neighbors are those records in the table that are near the new record, where near means having a small distance between the two points. Now look for the five nearest neighbors in Table 7-3. For the new record, records 3, 4, 2, 1, and 9 are the neighbors. The group for records 3, 4, 2, and 1 is group1. The group for record 9 is group2. The majority of the neighbors are from group1. Therefore, we can classify the new record into group1.
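Before moving to PySpark, the four steps above can be sketched in plain, single-machine Python. This is illustration only; the distributed implementation follows below:

```python
# The nine records from Table 7-1, as (data tuple, group) pairs.
data = [((3.09, 1.97, 3.73), 'group1'), ((2.96, 2.15, 4.16), 'group1'),
        ((2.87, 1.93, 4.39), 'group1'), ((3.02, 1.55, 4.43), 'group1'),
        ((1.80, 3.65, 2.08), 'group2'), ((1.36, 4.43, 1.95), 'group2'),
        ((1.71, 4.35, 1.94), 'group2'), ((1.03, 3.75, 2.12), 'group2'),
        ((2.30, 3.59, 1.99), 'group2')]
new_record = (2.5, 1.7, 4.2)
k = 5                                                  # step 1: decide k

def euclidean(p, q):                                   # step 2: distance
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

distances = [(group, euclidean(point, new_record)) for point, group in data]
distances.sort(key=lambda pair: pair[1])               # step 3: sort by distance
neighbors = [group for group, _ in distances[:k]]      # step 4: k nearest neighbors
print(max(neighbors, key=neighbors.count))             # majority vote -> group1
```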

We have discussed the KNN algorithm in detail. Let’s see how to implement it by using PySpark. We are going to implement KNN in a naive way and then we will optimize it in the “How It Works” section.

First, we are going to calculate the distance between two tuples. We’ll write a Python function, distanceBetweenTuples(). This function will take two tuples, calculate the distance between them, and return that distance:

>>> def distanceBetweenTuples(data1, data2):
...     squaredSum = 0.0
...     for i in range(len(data1)):
...         squaredSum = squaredSum + (data1[i] - data2[i])**2
...     return squaredSum**0.5

Now that we’ve written the function to calculate the distance, let’s test it:

>>> pythonTuple1 = (1.2, 3.4, 3.2)

>>> pythonTuple2 = (2.4, 2.2, 4.2)

>>> distanceBetweenTuples(pythonTuple1, pythonTuple2)

Here is the output:

1.9697715603592207

Our function has been tested. It is a general function: we can run it for tuples of length 4 or 5 as well. In the following lines of code, we’ll create a list. The elements of this list are tuples. Each tuple has two elements. The first element is itself a tuple of data. The second element is the group associated with that data tuple.

>>> knnDataList = [((3.09,1.97,3.73),'group1'),
...                ((2.96,2.15,4.16),'group1'),
...                ((2.87,1.93,4.39),'group1'),
...                ((3.02,1.55,4.43),'group1'),
...                ((1.80,3.65,2.08),'group2'),
...                ((1.36,4.43,1.95),'group2'),
...                ((1.71,4.35,1.94),'group2'),
...                ((1.03,3.75,2.12),'group2'),
...                ((2.30,3.59,1.99),'group2')]

>>> knnDataRDD = sc.parallelize(knnDataList, 4)

The data has been parallelized. We define newRecord as [(2.5, 1.7, 4.2)]:

>>> newRecord = [(2.5, 1.7, 4.2)]

>>> newRecordRDD = sc.parallelize(newRecord, 1)

>>> cartesianDataRDD = knnDataRDD.cartesian(newRecordRDD)

>>> cartesianDataRDD.take(5)

Here is the output:

[(((3.09, 1.97, 3.73), 'group1'), (2.5, 1.7, 4.2)), (((2.96, 2.15, 4.16), 'group1'), (2.5, 1.7, 4.2)), (((2.87, 1.93, 4.39), 'group1'), (2.5, 1.7, 4.2)), (((3.02, 1.55, 4.43), 'group1'), (2.5, 1.7, 4.2)), (((1.8, 3.65, 2.08), 'group2'), (2.5, 1.7, 4.2))]

We have created a Cartesian product by using the older record data and the new record data.

You might be wondering why I have created this Cartesian rather than simply including the new record at the time of defining the list knnDataList. In a real case, you would have a large file. That file might be distributed as well. For that condition, we would have to read the file first and then create the Cartesian.
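For example, if the records lived in a hypothetical CSV file (here called knn_data.csv, with lines such as 3.09,1.97,3.73,group1), we could parse each line into the same (tuple, group) shape before creating the Cartesian. The file name, format, and parseKnnLine helper are assumptions for illustration, not the book's code:

```python
def parseKnnLine(line):
    # Turn "3.09,1.97,3.73,group1" into ((3.09, 1.97, 3.73), 'group1').
    parts = line.strip().split(',')
    return (tuple(float(v) for v in parts[:-1]), parts[-1])

# With an active SparkContext sc, the distributed file would then be read as:
# knnDataRDD = sc.textFile('knn_data.csv', 4).map(parseKnnLine)
# cartesianDataRDD = knnDataRDD.cartesian(newRecordRDD)

print(parseKnnLine('3.09,1.97,3.73,group1'))  # ((3.09, 1.97, 3.73), 'group1')
```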

After creating the Cartesian, we have the older data and the new record data in the same row, so we can easily calculate the distance with the map() method:

>>> K = 5

>>> groupAndDistanceRDD = cartesianDataRDD.map(lambda data : (data[0][1],
...                       distanceBetweenTuples(data[0][0], data[1])))

>>> groupAndDistanceRDD.take(5)

Here is the output:

[('group1', 0.8011866199581719), ('group1', 0.6447480127925947), ('group1', 0.47528938553264566), ('group1', 0.5880476171195661), ('group2', 2.9642705679475347)]

We have calculated the RDD groupAndDistanceRDD; its first element is the group, and the second element is the distance between the new record and older records.

We have to sort it now in increasing order of distance. You might remember the

takeOrdered() function described in Chapter 4. So let’s get five groups in increasing order of distance:

>>> ourClasses = groupAndDistanceRDD.takeOrdered(K, key = lambda data : data[1])

>>> ourClasses

Here is the output:

[('group1', 0.47528938553264566), ('group1', 0.5880476171195661), ('group1', 0.6447480127925947), ('group1', 0.8011866199581719), ('group2', 2.9148241799463652)]

Using the takeOrdered() method, we have fetched five elements of the RDD, with the distance in increasing order. We have to find the group that is in the majority. So we have to first fetch only the group part and then we have to find the most frequent group:

>>> ourClassesGroup = [data[0] for data in ourClasses]

>>> ourClassesGroup

Here is the output:

['group1', 'group1', 'group1', 'group1', 'group2']

The group part has been fetched. The most frequent group can be found using the max() Python function as follows:

>>> max(ourClassesGroup, key=ourClassesGroup.count)

Here is the output:

'group1'

We finally have the group of the new record, and that is group1.
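An equivalent way to find the most frequent group, shown here only as an alternative, is collections.Counter from the Python standard library:

```python
from collections import Counter

ourClassesGroup = ['group1', 'group1', 'group1', 'group1', 'group2']

# most_common(1) returns a list with the single most frequent (value, count) pair.
print(Counter(ourClassesGroup).most_common(1)[0][0])  # group1
```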

You might be thinking: now that we have implemented KNN, what’s next? Next, we should optimize the code. We can optimize different aspects of this code. For this example, we’ll use the broadcasting technique, via a broadcast variable. This is a very good technique for optimizing code.

The Cartesian has been applied to join the older records with the new record.

PySpark provides another way to achieve a similar result. We can send the new record to every executor beforehand. The new record data will then be available to each executor, which can use it for distance calculations. We send the new record tuple to all the executors as a broadcast variable.

Broadcast variables are shared and read-only variables. Read-only means executors cannot change the value of a broadcast variable; they can only read the value of it.

In PySpark, we create a broadcast variable by using the broadcast() function. This broadcast() function is defined on SparkContext. We know that in the PySpark console, we have SparkContext available as sc. We are going to reimplement KNN by using the broadcast technique.
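As a preview, the broadcast-based version might look like the following sketch. This is not the book's exact code; knnWithBroadcast is a hypothetical helper that assumes an active SparkContext is passed in as sc:

```python
def distanceBetweenTuples(data1, data2):
    # Same Euclidean distance helper as before; pure Python, no Spark required.
    return sum((a - b) ** 2 for a, b in zip(data1, data2)) ** 0.5

def knnWithBroadcast(sc, knnDataList, newRecord, k=5):
    # Hypothetical helper: ship the new record to every executor once,
    # as a read-only broadcast variable, instead of using cartesian().
    newRecordBroadcast = sc.broadcast(newRecord)
    groupAndDistance = (sc.parallelize(knnDataList, 4)
                          .map(lambda rec: (rec[1],
                               distanceBetweenTuples(rec[0],
                                                     newRecordBroadcast.value))))
    # Take the k records with the smallest distances, then majority-vote.
    nearest = groupAndDistance.takeOrdered(k, key=lambda pair: pair[1])
    groups = [group for group, _ in nearest]
    return max(groups, key=groups.count)

# Usage in the PySpark console, where sc already exists:
# knnWithBroadcast(sc, knnDataList, (2.5, 1.7, 4.2))   # 'group1'
```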

How It Works

We have already discussed most of the code. Therefore, I will keep the discussion short in the coming steps.
