In my previous article, I talked about Logistic Regression, a classification algorithm. In this article we will explore another classification algorithm: K-Nearest Neighbors (KNN). We will also see its implementation in Python.
K Nearest Neighbors is a classification algorithm that operates on a very simple principle. It is best shown through example! Imagine we had some imaginary data on Dogs and Horses, with heights and weights.
Training Algorithm:
- Store all the Data
Prediction Algorithm:
- Calculate the distance from x to all points in your data
- Sort the points in your data by increasing distance from x
- Predict the majority label of the “k” closest points
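Here is a minimal from-scratch sketch of those three prediction steps, assuming the training data are stored as NumPy arrays (the function name and the default k are just illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from x_new to every stored point (Euclidean distance here)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Sort the points by increasing distance and keep the k closest
    nearest = np.argsort(distances)[:k]
    # 3. Predict the majority label among those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]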
Choosing a K will affect what class a new point is assigned to:
In the above example, if k=3 the new point will be assigned to class B, but if k=6 it will be assigned to class A, because the majority of points inside the k=6 circle belong to class A.
Let's return to our imaginary data on Dogs and Horses:
If we choose k=1 we will pick up a lot of noise in the model. But if we increase the value of k, we achieve a smoother separation, i.e. more bias. This cleaner cut-off is achieved at the cost of mislabeling some data points.
You can read more about the bias-variance tradeoff.
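Here is a minimal sketch of that tradeoff, using a hypothetical synthetic dataset (not the data used later in this article) just to show how k=1 memorizes the training set while a larger k generalizes more smoothly:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical noisy data, purely for illustration
X, y = make_classification(n_samples=500, n_features=5, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k}: train acc={knn.score(X_tr, y_tr):.2f}, test acc={knn.score(X_te, y_te):.2f}")

# k=1 typically scores ~1.0 on the training set (low bias, high variance),
# while a larger k usually gives up a little training accuracy for a smoother
# boundary that generalizes better.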
Pros:
- Very simple
- Training is trivial
- Works with any number of classes
- Easy to add more data
- Few parameters (see the sketch after this list):
  - K
  - Distance metric
Cons:
- High Prediction Cost (worse for large data sets)
- Not good with high dimensional data
- Categorical Features don’t work well
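As noted in the pros above, K and the distance metric are essentially the only knobs to turn, and both map directly onto arguments of scikit-learn's KNeighborsClassifier. A minimal sketch (the values chosen here are just examples):

from sklearn.neighbors import KNeighborsClassifier

# K is n_neighbors; the distance metric is controlled by `metric`
# (Minkowski with p=2, i.e. ordinary Euclidean distance, is the default)
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)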
Suppose we've been given a classified data set from a company! They've hidden the feature column names but have given us the data and the target classes. We'll try to use KNN to create a model that directly predicts the class of a new data point based on its features.
Let’s grab it and use it!
Import Libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Get the Data
Set index_col=0 to use the first column as the index.
df = pd.read_csv("Classified Data",index_col=0)
Check the head of data:
df.head()
Standardize the Variables
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
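As a quick illustration with made-up numbers: if one feature is measured in the thousands and another in single digits, the Euclidean distance between two observations is driven almost entirely by the large-scale feature until we standardize.

import numpy as np

# Two hypothetical observations: (income in dollars, years of experience)
a = np.array([50000, 2])
b = np.array([52000, 10])
print(np.linalg.norm(a - b))                 # ~2000.02, dominated entirely by income

# After putting both features on a comparable scale, the experience gap counts too
a_scaled = np.array([0.50, 0.2])
b_scaled = np.array([0.52, 1.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.80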
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))

StandardScaler(copy=True, with_mean=True, with_std=True)
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_feat.head()
Train Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features, df['TARGET CLASS'],
                                                    test_size=0.30)
Using KNN
Remember that we are trying to come up with a model that predicts whether a new data point belongs to the TARGET CLASS or not. We'll start with k=1.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=1, p=2,
                     weights='uniform')
Predictions and Evaluations
pred = knn.predict(X_test)
Let’s evaluate our KNN model!
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, pred))

[[129  12]
 [ 15 144]]
print(classification_report(y_test, pred))

             precision    recall  f1-score   support

          0       0.90      0.91      0.91       141
          1       0.92      0.91      0.91       159

avg / total       0.91      0.91      0.91       300
Choosing a K Value
Let's go ahead and use the elbow method to pick a good K value. We will check the error rate for k=1 up to about k=40: for each value of k we fit a KNN classifier, record its error rate on the test set, and then choose the value of k with the lowest error rate.
error_rate = []

# Might take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
Let’s plot a Line graph of the error rate.
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Here we can see that after around K>23 the error rate just tends to hover around 0.05-0.06. Let's retrain the model with K=23 and check the classification report!
# FIRST A QUICK COMPARISON TO OUR ORIGINAL K=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print('WITH K=1')
print('\n')
print(confusion_matrix(y_test, pred))
print('\n')
print(classification_report(y_test, pred))

WITH K=1

[[151   8]
 [ 15 126]]

             precision    recall  f1-score   support

          0       0.91      0.95      0.93       159
          1       0.94      0.89      0.92       141

avg / total       0.92      0.92      0.92       300
# NOW WITH K=23
knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print('WITH K=23')
print('\n')
print(confusion_matrix(y_test, pred))
print('\n')
print(classification_report(y_test, pred))

WITH K=23

[[150   9]
 [ 10 131]]

             precision    recall  f1-score   support

          0       0.94      0.94      0.94       159
          1       0.94      0.93      0.93       141

avg / total       0.94      0.94      0.94       300
We were able to squeeze some more performance out of our model by tuning to a better K value. The above notebook is available here on GitHub.
Thank you!