How to view the nearest neighbors in R? - McMap

How to view the nearest neighbors in R?

Asked 28/8, 2012 at 5:27 Answered 28/8, 2012 at 6:3

Solved r kaggle


Let me start by saying I have no experience with R, KNN or data science in general. I recently found Kaggle and have been playing around with the Digit Recognition competition/tutorial.

In this tutorial they provide some sample code to get you started with a basic submission:

# makes the KNN submission

library(FNN)

train <- read.csv("c:/Development/data/digits/train.csv", header=TRUE)
test <- read.csv("c:/Development/data/digits/test.csv", header=TRUE)

labels <- train[,1]
train <- train[,-1]

results <- (0:9)[knn(train, test, labels, k = 10, algorithm="cover_tree")]

write(results, file="knn_benchmark.csv", ncolumns=1)

My questions are:

How can I view the nearest neighbors that have been selected for a particular test row?
How can I modify which of those ten is selected for my results?

These questions may be too broad. If so, I would welcome any links that could point me down the right road.

It is very possible that I have said something that doesn't make sense here. If this is the case, please correct me.

Suntan answered 28/8, 2012 at 5:27 Comment(0)


1) You can get the nearest neighbors of a given row like so:

k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices <- attr(k, "nn.index")

Then, to see the training-set indices of the 10 nearest neighbors of row 20 of the test set:

print(indices[20, ])

(You get 10 neighbors because you selected k = 10.) For example, if you run with only the first 1000 rows of the training and test sets (to make it computationally easier):

train <- read.csv("train.csv", header=TRUE)[1:1000, ]
test <- read.csv("test.csv", header=TRUE)[1:1000, ]

labels <- train[,1]
train <- train[,-1]

k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices = attr(k, "nn.index")

print(indices[20, ])
# output:
#  [1] 829 539 784 487 293 882 367 268 201 277

Those are the indices within the training set of 1000 that are closest to the 20th row of the test set.
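To make the meaning of these indices concrete, here is a brute-force sketch in base R that finds the nearest training rows for a single test row. It assumes plain Euclidean distance (which, as far as I know, is what FNN's knn uses), and train_toy and test_row are made-up toy data, not the Kaggle files:

```r
# Toy data (hypothetical): 5 training points and 1 test point in 2-D.
train_toy <- matrix(c(0, 0,
                      1, 0,
                      0, 2,
                      5, 5,
                      3, 3), ncol = 2, byrow = TRUE)
test_row <- c(0.9, 0.1)

# Euclidean distance from the test row to every training row.
# sweep() subtracts test_row[j] from column j of train_toy.
d <- sqrt(rowSums(sweep(train_toy, 2, test_row)^2))

# Row numbers of the 3 closest training points, nearest first.
nearest <- order(d)[1:3]
print(nearest)
print(d[nearest])
```

On the real data, the row numbers found this way for test row 20 should correspond to what indices[20, ] reports, although this loop over all rows is far slower than the cover tree.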

2) It depends what you mean by "modify". For starters, you can get the labels of the 10 closest training points for each test row like this:

closest.labels = apply(indices, 2, function(col) labels[col])

You can then see the labels of the 10 closest training points to the 20th test point like this:

closest.labels[20, ]
# [1] 0 0 0 0 0 0 0 0 0 0

This indicates that all 10 of the closest points to test row 20 are in the group labeled 0. knn simply chooses the label by majority vote (with ties broken at random), but you could apply your own voting or weighting scheme to these labels if you prefer.
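As a starting point for a custom scheme, the majority vote itself can be reproduced by hand. This is only a sketch: closest_toy is a hypothetical stand-in for closest.labels, and majority_vote is a helper name invented for illustration. Unlike knn, it breaks ties deterministically rather than at random.

```r
# Toy stand-in (hypothetical) for closest.labels:
# 3 test rows, k = 5 neighbor labels each.
closest_toy <- matrix(c(0, 0, 0, 0, 0,
                        3, 8, 3, 3, 8,
                        1, 7, 7, 1, 7), nrow = 3, byrow = TRUE)

# Plain majority vote for one row of neighbor labels.
# which.max() returns the first maximum, so a tie goes to the
# smallest label here (knn would pick among tied labels at random).
majority_vote <- function(neighbor_labels) {
  counts <- table(neighbor_labels)
  as.integer(names(counts)[which.max(counts)])
}

# One predicted label per test row.
votes <- apply(closest_toy, 1, majority_vote)
print(votes)
```

Swapping in a different function for majority_vote is one way to change which of the neighbors decides the result.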

ETA: If you're interested in weighting the closer elements more heavily in your voting scheme, note that you can also get the distances to each of the k neighbors like this:

dists = attr(k, "nn.dist")
dists[20, ]
# output:
# [1] 1238.777 1243.581 1323.538 1398.060 1503.371 1529.660 1538.128 1609.730
# [9] 1630.910 1667.014
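One possible weighting scheme is inverse-distance voting, sketched below in base R. This is an illustration rather than anything FNN provides: weighted_vote, the eps guard, and the toy labels and distances are all assumptions made for the example.

```r
# Distance-weighted vote for one test row: each neighbor adds
# 1 / (distance + eps) to the score of its label.
# eps is a hypothetical guard against division by zero
# when a neighbor is an exact duplicate of the test row.
weighted_vote <- function(neighbor_labels, neighbor_dists, eps = 1e-8) {
  weights <- 1 / (neighbor_dists + eps)
  scores <- tapply(weights, neighbor_labels, sum)
  as.integer(names(scores)[which.max(scores)])
}

# Toy neighbors (hypothetical): two very close 3s and three distant 8s.
toy_labels <- c(3, 3, 8, 8, 8)
toy_dists  <- c(100, 120, 900, 950, 1000)

# A plain majority vote would pick 8; the close 3s outweigh it here.
print(weighted_vote(toy_labels, toy_dists))
```

With the objects from this answer, something like sapply(seq_len(nrow(indices)), function(i) weighted_vote(closest.labels[i, ], dists[i, ])) would apply it to every test row.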

Flameproof answered 28/8, 2012 at 6:3 Comment(9)

Wonderful response, thank you! I had a few questions. Any time I try to print indices it returns NULL; should I be doing anything different from your example? Can you recommend any resources for learning more about creating a custom weighting scheme? Or examples of someone creating one that I can look at? – Suntan 28/8, 2012 at 17:46

That's very strange. What do you get if you do print(k)? As for other weighting schemes, you'd have as much luck as I would searching for the phrase "KNN weighted" on Google. But I'm writing a little more about weighting into my answer. – Flameproof 28/8, 2012 at 17:50

Ok, so just to clarify, I am actually using results instead of k. I assume this doesn't make a difference, but figured I should throw that out there. When I do print(results), it prints out the 1000 elements that are eventually written to my csv file. – Suntan 28/8, 2012 at 17:52

Looks like I was doing something wrong. I copied/pasted your code exactly and it worked. Sorry for the confusion. – Suntan 28/8, 2012 at 17:55

Sounds good. I edited in above how to get the distances to each of the points; you could use that to create an alternative weighting scheme. However, you should make sure you have a good reason to do so. (One thing you could do is come up with three or four different weighting schemes and see which one works best on the training data.) – Flameproof 28/8, 2012 at 17:58

Right, the problem was the line results <- (0:9)[knn.... Indexing like this throws away the extra information (the attributes) about which points were nearest, and keeps only the label assignments. – Flameproof 28/8, 2012 at 18:0

Also, could you explain what is happening with this line: train <- train[,-1]? – Suntan 28/8, 2012 at 20:40

The first column of the provided training dataset is the label of each item (open it in Excel and you'll see). The line labels <- train[, 1] grabs that column as a labels vector, and the line train <- train[,-1] removes that column to leave only the actual training data. – Flameproof 28/8, 2012 at 20:43

let us continue this discussion in chat – Suntan 28/8, 2012 at 20:47
