k-Nearest Neighbours

The \(k\)-Nearest Neighbours (k-NN) is a simple yet effective algorithm for regression and classificaion tasks (with identical code for each task). The k-NN algorithm is a non-parametric method proposed by Thomas Cover 1. Essentially, we use a similarity measure between some query point \(\mathbf{x}^{*}\) and some training data \(\mathbf{X}\) to predict the associated target value(s). As usual, assume that our dataset \(\mathbf{X}\in\mathbb{R}^{n,m}\) and \(\mathbf{Y}\in\mathbb{R}^{n,c}\) where \(n\) is the number of observations, \(m\) is the number of features, and \(c\) is the number of targets.

Note

The similarity measure we use is known as the Minkowski distance and is defined by:

\begin{align} d(\mathbf{x}_i,\mathbf{x}_j;p) :&= ||\mathbf{x}_i - \mathbf{x}_j||_p,\\ &= \bigg[\sum_{k = 1}^m |x_{ik} - x_{jk}|^p\bigg]^{1 / p}.\label{eq:minkowski} \end{align}

Essentially, Eq. \(\eqref{eq:minkowski}\) is defined by the vector \(p\)-norm so the Minkowski distance and \(p\)-norm are synonymous.

Regression

For visual aid, lets consider a one-dimensional toy problem. Lets generate the observations \(x\sim U(0,2\pi)\) and the targets \(y|x\sim\mathcal{N}(\sin(x),\epsilon^2)\) where we set \(\epsilon=0.3\) and group them into a train, validation, and test datasets.

Python

Generating synthetic regression data

# Vital imports
from   sklearn.model_selection import train_test_split
import numpy as np

# Reproducibility
np.random.seed(2020)

N = 200 # Total number of points
n = 50  # Total number of samples
X = np.linspace(0, 2 * np.pi, N).reshape(-1, 1)
T = np.sin(X)

# Sample random indices to be our data
r = np.random.choice(N, size = n)
x = X[r]
t = T[r]
y = t + np.random.normal(scale = 0.3, size = t.shape) # Add noise to observations

# Train / Val / Test
x_train, x_test, y_train, y_test = train_test_split(x, y)
x_train, x_val , y_train, y_val  = train_test_split(x_train, y_train)

train = x_train, y_train
val   = x_val  , y_val
test  = x_test , y_test
../../_images/example-regression.png

Fig. 9 (a) Generated noisy \(\sin\) function. (b) Train / validation / test split.

As the algorithm name suggests, when querying a new data point in observation space, we use the \(k\) nearest data points in observation space to predict in target space.

../../_images/example-regression-query.png

Fig. 10 (a) Training data. (b) Example new query point (purple) using the 5 nearest neighbours from the train for inference.

Python

Coding of Eq. \({\eqref{eq:minkowski}}\)

def numpy_distance_neighbours(X1, X2, k, p):
    """ Computes the k-nearest distances and neighbours based on the numpy implementation of the minkowski distance """
    distances  = np.power(np.sum(np.power(np.fabs(X1[:,None] - X2[None,:]), p), axis = -1), 1 / p)
    neighbours = distances.argsort(axis = 1)[:,:k]
    rows       = np.arange(len(distances))[:,None]
    return distances[rows, neighbours], neighbours

Python

k-Nearest Neighbours Class
class kNearestNeighbours():
    """
    k-Nearest Neighbours class

    Parameters
    =============
        k        : int
                   Number of neighbours to consider when running model inference.

        p        : int, float [1, \infty]
                   The p parameter in the Minkowski distance metric.

        weighted : bool
                   If True, weight neighbours by their inverse distance.

        func     : function
                   Function call to use to compute the k-nearest distances and neighbours.
    """
    def __init__(self, k = 5, p = 2, weighted = True, func = numpy_distance_neighbours):
        self._params = dict(k = k, p = p, weighted = weighted)
        self.__func  = func

    def fit(self, X, y):
        self.X = X
        self.y = y
        return self

    def __call__(self, X):
        distances, neighbours = self.__func(X, self.X, k = self._params['k'], p = self._params['p'])
        if self._params['weighted']:
            is0             = np.where(distances == 0)                    # find where the distance is 0
            distances[is0]  = 1
            weights         = 1 / distances                               # weight by inverse distance
            weights[is0[0]] = 0                                           # zero out all rows that had initially 0 distance
            weights[is0]    = 1                                           # correct inverse weights to be 1 where distance is 0
            weights        /= weights.sum(axis = 1, keepdims = True)      # ensure that weights sum to 1
        else:
            weights         = np.ones_like(distances) / self._params['k']
        weights.resize(*weights.shape, 1)
        return (weights * self.y[neighbours]).sum(axis = 1)

    def rmse(self, *data):
        return [np.sqrt(np.mean((np.square(self(X) - y)))) for (X, y) in data]

Python

Model results on toy regression data

k_range       = range(1, 13)
rmse_uniform  = [kNearestNeighbours(k = k, weighted = False).fit(x_train, y_train).rmse(train, val, test) for k in k_range]
rmse_weighted = [kNearestNeighbours(k = k, weighted = True ).fit(x_train, y_train).rmse(train, val, test) for k in k_range]

Plotting the rmse results yields the following plot.

../../_images/example-regression-1.png

Fig. 11 RMSE results of toy regression data over the number of neighbours \(k\) with constant \(p=2\).

Not bad! The scatter points along the black dashed line represent the results when the validation error is minimised. It is normal practice to choose the model that minimises the validation data whilst training on the training data. We see that the weighted solution has marginally better results. Lets try the same code on a more realistic regression problem - the load_boston dataset.

Dataset

load_boston

Data Set Characteristics:

Number of Instances

506

Number of Attributes

13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

Attribute Information (in order)
  • CRIM per capita crime rate by town

  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS proportion of non-retail business acres per town

  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX nitric oxides concentration (parts per 10 million)

  • RM average number of rooms per dwelling

  • AGE proportion of owner-occupied units built prior to 1940

  • DIS weighted distances to five Boston employment centres

  • RAD index of accessibility to radial highways

  • TAX full-value property-tax rate per $10,000

  • PTRATIO pupil-teacher ratio by town

  • B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • LSTAT % lower status of the population

  • MEDV Median value of owner-occupied homes in $1000’s

Missing Attribute Values

None

Creator

Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

  • Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980. 244-261.

  • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Python

Loading sklearn’s load_boston dataset

from sklearn.datasets      import load_boston
from sklearn.preprocessing import StandardScalar()

X, y = load_boston(return_X_y = True)
y    = y.reshape(-1, 1)

def train_val_test_split(X, y, *args, **kwargs):
    X_train, X_test, y_train, y_test = train_test_split(X      , y      , *args, **kwargs)
    X_train, X_val , y_train, y_val  = train_test_split(X_train, y_train, *args, **kwargs)
    return X_train, X_val, X_test, y_train, y_val, y_test, (X_train, y_train), (X_val, y_val), (X_test, y_test)

X_train, X_val, X_test, y_train, y_val, y_test, train, val, test = train_val_test_split(X, y, random_state = 2020)

# scale values so each dimension has an equal say when computing the Minkowski distances
scaler = StandardScaler().fit(X_train)

X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

del X, y

Python

Model results on the load_boston data

k_range       = range(1, len(X_train))
rmse_uniform  = [kNearestNeighbours(k = k, weighted = False).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
rmse_weighted = [kNearestNeighbours(k = k, weighted = True ).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
../../_images/boston.png

Fig. 12 RMSE results of the load_boston data over the number of neighbours \(k\) with constant \(p=2\).

From the above plots, it seems that both the uniform and weighted cases yield the same errors. Lets try varying \(p\) to see if they really are the same.

Python

Model results on the load_boston data varying \(p\)

rmse_1 = [kNearestNeighbours(k = k, weighted = True, p = 1).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
rmse_2 = [kNearestNeighbours(k = k, weighted = True, p = 2).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
rmse_3 = [kNearestNeighbours(k = k, weighted = True, p = 3).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
rmse_4 = [kNearestNeighbours(k = k, weighted = True, p = 4).fit(X_train, y_train).rmse(train, val, test) for k in k_range]
../../_images/boston-p.png

Fig. 13 RMSE results of the load_boston data over the number of neighbours :k and Minkowski distance parameter \(p\).

From the above, we can see that as \(p\) varies, the curves do change with slightly different optimal results with the trend that the RMSE values decrease as \(p\) increases. Running this code however proves to be slow so lets look at how we can make the code more efficient for our use case. One route may be to simply do the same code in another language i.e. C and see how that provides a speed up. Alternatively, we can consider partitioning the \(m\) dimensional observation data space into a k-dimensional trees <https://en.wikipedia.org/wiki/K-d_tree>. Lets see how the numpy, C, and KDTree implementations costs in terms of time to run.

Python

Implementations of Nearest Neighbours

from time                     import time as _time

from sklearn.spatial.distance import cdist
from sklearn.spatial          import cKDTree

def timeit(func, *args, n_iterations = 1000):
    """ Times the function func, with parsed args and returns the n_iterations worth of times in seconds """
    times = np.empty(n_iterations)
    for i in range(n_iterations):
        start    = _time()
        func(*args)
        times[i] = _time() - start
    return times

def numpy_distance_neighbours(X1, X2, k, p):
    """ Computes the k-nearest distances and neighbours based on the numpy implementation of the minkowski distance """
    distances  = np.power(np.sum(np.power(np.fabs(X1[:,None] - X2[None,:]), p), axis = -1), 1 / p)
    neighbours = distances.argsort(axis = 1)[:,:k]
    rows       = np.arange(len(distances))[:,None]
    return distances[rows, neighbours], neighbours

def cdist_distance_neighbours(X1, X2, k, p):
    """ Computes the k-nearest distances and neighbours based on the c implementation of the minkowski distance """
    distances  = cdist(X1, X2, metric = 'minkowski', p = p)
    neighbours = distances.argsort(axis = 1)[:,:k]
    rows       = np.arange(len(distances))[:,None]
    return distances[rows, neighbours], neighbours

def ckdtree_distance_neighbours(X1, X2, k, p):
    """ Computes the k-nearest distances and neighbours using the c implementation of the KD Tree """
    distances, neighbours = cKDTree(X2).query(X1, k = k, p = p, n_jobs = -1)
    if k == 1:
        distances, neighbours = distances.reshape(-1, 1), neighbours.reshape(-1, 1)
    return distances, neighbours


numpy_time  = timeit(numpy_distance_neighbours  , X_train, X_test, 2, 2)
cdist_time  = timeit(cdist_distance_neighbours  , X_train, X_test, 2, 2)
kdtree_time = timeit(ckdtree_distance_neighbours, X_train, X_test, 2, 2)
../../_images/boston-times.png

Fig. 14 Time cost of computing the nearest neighbours and distances across different implementations.

From the above, it is clear that the k-d tree method is the fastest of the three with the C implementation already being a significant improvement over the numpy implementation. We redefine our earlier defined class implementation with the k-d tree method.

Python

Efficient implementation of k-Nearest Neighbours Class
class kNearestNeighbours():
    """
    k-Nearest Neighbours class

    Parameters
    =============
        k        : int
                   Number of neighbours to consider when running model inference.

        p        : int, float [1, \infty]
                   The p parameter in the Minkowski distance metric.

        weighted : bool
                   If True, weight neighbours by their inverse distance.
    """
    def __init__(self, k = 5, p = 2, weighted = True):
        self._params = dict(k = k, p = p, weighted = weighted)

    def fit(self, X, y):
        self.cKDTree = cKDTree(X).query
        self.y       = y
        return self

    def __call__(self, X):
        distances, neighbours = self.cKDTree(X, k = self._params['k'], p = self._params['p'], n_jobs = -1)
        if self._params['k'] == 1:
            distances, neighbours = distances.reshape(-1, 1), neighbours.reshape(-1, 1)
        if self._params['weighted']:
            is0             = np.where(distances == 0)
            distances[is0]  = 1
            weights         = 1 / distances
            weights[is0[0]] = 0
            weights[is0]    = 1
            weights        /= weights.sum(axis = 1, keepdims = True)
        else:
            weights         = np.ones_like(distances) / self._params['k']
        weights.resize(*weights.shape, 1)
        return (weights * self.y[neighbours]).sum(axis = 1)

    def rmse(self, *data):
        return [np.sqrt(np.mean((np.square(self(X) - y)))) for (X, y) in data]

    def acc(self, *data):
        return [(self(X).round(0) == y).mean() for (X, y) in data]

Classification

In this section we consider the target variable \(\mathbf{y}\in\{0,1\}^n\) i.e. the binary classification problem. This is easily extendible to the case where there are more than two classes. We now generate a toy non-linear boundaried classification problem below. The toy problem has the target variable dependent on a two dimensional observation. \(x_1\overset{\text{iid}}{\sim} U(0,1)\), \(x_2\overset{\text{iid}}{\sim} U(0,1)\) and \(p(y=1|x_1,x_2) = p\big(\sqrt{x_1^2 + x_2^2} - r > \epsilon\big)\) where we set \(r = 0.8\) and \(\epsilon\overset{\text{iid}}{\sim}\mathcal{N}(0, 0.2^2)\).

Python

Generating synthetic classification data

# radius of boundary
r = 0.8

def gen_classification(n, random_state = None):
    np.random.seed(random_state)

    x1 = np.random.uniform(size = n)
    x2 = np.random.uniform(size = n)
    X  = np.c_[x1, x2]
    y = np.sqrt(x1 ** 2 + x2 ** 2 + np.random.normal(scale = 0.2, size = n)) > r
    y = y.astype(int).reshape(n, 1)
    return X, y

x = np.linspace(0, r, 100)
b = np.sqrt(r ** 2 - x ** 2) # boundary

X, y = gen_classification(500, random_state = 2020)
../../_images/example-classification.png

Fig. 15 Generated classification data. The classes have been noisily assigned if the distance from the origin is more than 0.8 or not. For less cluttered visual display, only the first 60 scatter points have been visualised.

Python

Model results on toy classification data

train = X_train, y_train
val   = X_val  , y_val
test  = X_test , y_test

rmse_uniform  = []
acc_uniform   = []

rmse_weighted = []
acc_weighted  = []

for k in range(1, len(X_train)):
    model = kNearestNeighbours(k = k).fit(X_train, y_train)
    rmse_uniform.append(model.rmse(train, val, test))
    acc_uniform.append(model.acc(train, val, test))

    model = kNearestNeighbours(k = k, weighted = True).fit(X_train, y_train)
    rmse_weighted.append(model.rmse(train, val, test))
    acc_weighted.append(model.acc(train, val, test))
../../_images/example-classification-1.png

Fig. 16 RMSE and accuracy results of fitting a \(k\)-NN to the toy classication data over the number of neighbours \(k\) with constant \(p=2\).

A similar analysis can be done where we vary \(p\) yielding the below.

Python

Model results on the toy classification data varying \(p\)

p_range = range(1, 5)
k_range = range(1, len(X_train))

results = dict(rmse = {}, acc = {})
for p in p_range:
    rmse = results['rmse'][p] = []
    acc  = results['acc' ][p] = []
    for k in k_range:
        model = kNearestNeighbours(k = k, weighted = True, p = p).fit(X_train, y_train)
        rmse.append(model.rmse(train, val, test))
        acc.append(model.acc(train, val, test))
../../_images/example-classification-rmse.png
../../_images/example-classification-acc.png

Fig. 17 RMSE and accuracy results of fitting a \(k\)-NN to the toy classication data over the number of neighbours \(k\) with varying \(p\).

1
  1. Cover, P. Hart, Nearest Neighbor Pattern Classification, 1967, https://ieeexplore.ieee.org/document/1053964?arnumber=1053964