A New Hashing based Nearest Neighbors Selection Technique for Big Datasets

Jude Tchaye-Kondi, Yanlong Zhai and Liehuang Zhu, Beijing Institute of Technology, China; Jude Tchaye-Kondi, Yanlong Zhai and Liehuang Zhu, Beijing Institute of Technology, China

A New Hashing based Nearest Neighbors Selection Technique for Big Datasets

Authors

Jude Tchaye-Kondi, Yanlong Zhai and Liehuang Zhu, Beijing Institute of Technology, China

Abstract

KNN has the reputation of being a simple and powerful supervised learning algorithm used for either classification or regression. Although KNN prediction performance highly depends on the size of the training dataset, when this one is large, KNN suffers from slow decision making. This is because each decision-making process requires the KNN algorithm to look for nearest neighbors within the entire dataset. To overcome this slowness problem, we propose a new technique that enables the selection of nearest neighbors directly in the neighborhood of a given data point. The proposed approach consists of dividing the data space into sub-cells of a virtual grid built on top of the dataset. The mapping between data points and sub-cells is achieved using hashing. When it comes to selecting the nearest neighbors of a new observation, we first identify the central cell where the observation is contained. Once that central cell is known, we then start looking for the nearest neighbors from it and the cells around. From our experimental performance analysis of publicly available datasets, our algorithm outperforms the original KNN with a predictive quality as good and offers competitive performance with solutions such as KDtree.

Keywords

Machine learning, Nearest neighbors, Hashing, Big data.

CS&IT Conference Proceedings

A New Hashing based Nearest Neighbors Selection Technique for Big Datasets