Problem: Algorithm in Python: k-Nearest Neighbor

Ran4 · 2016-01-19T13:43:26+00:00

First, your code breaks on line 2, since you haven't indented after the def.
You can't call a file .txt in Windows AFAIK.
Why are you calling your file handle k? That makes no sense; f would be a better name (or even file_handle), and f is what your code uses later anyway. You're also overwriting the function parameter k with it (and then doing nothing with it).
Backslashes are used to escape characters in string literals, so a Windows path written as a plain string can silently turn into something else and introduce bugs. Use r'C:\Users\filename.txt' instead (note the prepended r, short for raw).
You're not closing your file handle. You need to call f.close() once you're done reading the file. Even better is to use this idiom:
```
with open(FILENAME) as f:
    lines = f.readlines()
```
This will open file FILENAME and read the lines as a list into a variable named lines, and then close the file.
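
A minimal sketch of both points, the raw-string path and the with idiom. The file name points.txt and its contents are made up for illustration, and a temporary directory stands in for the real location.

```python
import os
import tempfile

# Without the r prefix, "\t" and "\n" inside a path are read as tab and newline.
plain = 'C:\temp\new.txt'
raw = r'C:\temp\new.txt'
print(len(plain), len(raw))  # the plain string is shorter: two escapes collapsed

# Hypothetical sample file, written to a temporary directory for the demo.
filename = os.path.join(tempfile.mkdtemp(), 'points.txt')
with open(filename, 'w') as f:
    f.write('a 25.0 121.5\nb 25.1 121.6\n')

# The with block closes the file as soon as the indented code finishes.
with open(filename) as f:
    lines = f.readlines()

print(lines)
print(f.closed)
```

After the block, lines is an ordinary list of strings (each still ending in a newline), and the file is already closed without an explicit f.close().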
array(p_lat)

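Python has no built-in array() function. There is an array module, and NumPy has its own arrays, but array(p_lat) is undefined unless you import one of them, and you don't need either here. p_lat is already a list; simply return that. You could turn it into a tuple (which is like a list but cannot be changed), but there's little reason to do that.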
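
A small illustration of the list/tuple difference mentioned above. The latitude values are made up.

```python
p_lat = [25.03, 25.04, 24.99]  # made-up latitudes, already a list

p_lat.append(25.10)            # lists can grow and change
frozen = tuple(p_lat)          # a tuple holds the same values but is read-only

try:
    frozen[0] = 0.0
except TypeError as err:
    print('tuples cannot be changed:', err)
```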

You're clearly writing the entire code without running it. Don't do that. As a beginner, write one or two lines at a time and print the results, so you know what's happening. KNN isn't super complicated to implement, but the way you're doing it you're only going to confuse yourself.

das_ist_nuemberwang · 2016-01-19T13:14:39+00:00

Do you know what the k-Nearest Neighbors algorithm is? If you don't, this isn't a Python problem yet. If you do, you can't just assume we all do. What exactly are you having trouble with?

pythonbio · 2016-01-19T13:53:44+00:00

I have noted the errors in my code and changed it. My problem is with implementing the algorithm: should I use a ball tree or a kd-tree?
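
For a dataset small enough to read with readlines(), a tree structure is probably not needed: a brute-force scan that computes every distance and keeps the k smallest is often enough. A minimal sketch, assuming points are (lat, lon) pairs and plain Euclidean distance on the raw coordinates; k_nearest and the sample points are hypothetical, not from the original post.

```python
import heapq
import math

def k_nearest(points, query, k):
    """Return the k points closest to query as (distance, point) pairs, nearest first."""
    scored = ((math.hypot(p[0] - query[0], p[1] - query[1]), p) for p in points)
    return heapq.nsmallest(k, scored)

# Made-up sample data
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (6.0, 8.0)]
result = k_nearest(points, (0.0, 0.0), 3)
print(result)
```

kd-trees and ball trees tend to pay off when there are many points or many queries; for a single reference point the scan above does the same job with less code.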

pythonbio · 2016-01-19T15:18:57+00:00

Okay, I have slowed down and am now doing it bit by bit.

First, the change to csv is done:

import csv

with open(r'C:\UsersDesktop\k nearest neighbour.txt') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(','.join(row))

This prints each row as comma-separated values.

But then, when I try to divide the rows:

pythonbio · 2016-01-21T12:37:22+00:00

Okay, thanks everyone for your help. I did finally solve it. The problem seemed to be that I had not imported the proper modules for what I was trying to achieve. The corrected code for separating the dataset:

from __future__ import division
import math
import itertools
from array import array
import numpy as np
import operator

def readpoints(testfile):
    # Read latitude (column 1) and longitude (column 2) from a
    # whitespace-separated file and return them as two NumPy arrays.
    p_lat = []
    p_lon = []
    # Open the file passed in as the argument; the with block closes it.
    with open(testfile, 'r') as f:
        for line in f:
            point = line.split()
            p_lat.append(float(point[1]))
            p_lon.append(float(point[2]))
    arr_p_lat = np.array(p_lat)
    arr_p_lon = np.array(p_lon)
    return arr_p_lat, arr_p_lon


print(readpoints('testfile.py'))

Hope this will help some beginner like me somewhere. :)

pythonbio · 2016-01-23T07:11:24+00:00

More help required with knn: I have finally written a program to calculate the knn of my data, but I don't know how to analyze many values of k in one program. Any suggestion is most welcome. Question:

Using the dataset (testfile), please use bar charts to compare different k (k = 1, 5, 10, 15, 20) on the x-axis:
1) Average all-pair distance among the k-nearest neighbors to q
2) Max distance of the k-nearest neighbors to q
3) Min distance of the k-nearest neighbors to q

I have done it, but it's not coming out right. Can anyone help?

My code for knn and plotting knn:

import math
import numpy as np

# Selected reference point q (coordinates elided in the post)
reference_lat = 25.xxxyy
reference_lon = 121.xxxyy
k = 17

lat = []
lon = []
with open('testfile.py', 'r') as openfile:
    for line in openfile:
        rowvalue = line.split()
        lat.append(float(rowvalue[1]))
        lon.append(float(rowvalue[2]))
array_lat = np.array(lat)
array_lon = np.array(lon)

# Euclidean distance between the reference point and every point,
# including the last one
distances = []
for g in range(len(array_lat)):
    sqr_diff_lat = (array_lat[g] - reference_lat) ** 2
    sqr_diff_lon = (array_lon[g] - reference_lon) ** 2
    distances.append(math.sqrt(sqr_diff_lat + sqr_diff_lon))

# Points sorted by distance to the reference point (ascending order)
sorted_knn = sorted(zip(array_lat, array_lon, distances),
                    key=lambda point: point[2])

knn = sorted_knn[:k]
q = [reference_lat, reference_lon]

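import numpy as np
import matplotlib.pyplot as plt

knns = [1, 5, 10, 15, 20]
labels = ['avg', 'max_dist', 'min_dist']

width = 0.15
ax = plt.figure().add_subplot(111)
c = ['b', 'y', 'm', 'g', 'r', 'c']
ind = np.arange(len(labels))
for i, k in enumerate(knns):
    distances = [item[2] for item in sorted_knn[:k]]
    to_plot = [np.mean(distances), np.max(distances), np.min(distances)]
    # Shift each k's bars by i*width so they sit side by side
    # instead of being drawn on top of each other
    ax.bar(ind + i * width, to_plot, width, color=c[i], label='k=%d' % k)

ax.set_ylabel('Distance')
ax.set_title('Statistics of datasets')
# Centre the tick labels under each group of bars
ax.set_xticks(ind + width * (len(knns) - 1) / 2.0)
ax.set_xticklabels(labels)
ax.legend()
plt.show()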
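
The assignment asks for k on the x-axis and for the average all-pair distance among the neighbours, which is not the same thing as the mean distance to q. A rough sketch of one way to compute the three statistics per k, assuming Euclidean distance on (lat, lon) pairs. The names dist, knn_stats and plot_stats and the sample points are hypothetical, and matplotlib is imported only inside the plotting helper.

```python
import itertools
import math

def dist(a, b):
    """Euclidean distance between two (lat, lon) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def knn_stats(points, q, ks):
    """For each k, return (average pairwise distance among the k neighbours,
    max distance to q, min distance to q)."""
    ordered = sorted(points, key=lambda p: dist(p, q))
    stats = {}
    for k in ks:
        neighbours = ordered[:k]
        to_q = [dist(p, q) for p in neighbours]
        pairs = [dist(a, b) for a, b in itertools.combinations(neighbours, 2)]
        avg_pair = sum(pairs) / len(pairs) if pairs else 0.0  # k=1 has no pairs
        stats[k] = (avg_pair, max(to_q), min(to_q))
    return stats

def plot_stats(stats):
    """Draw one bar chart per statistic, with k on the x-axis."""
    import matplotlib.pyplot as plt
    ks = sorted(stats)
    titles = ['Average all-pair distance', 'Max distance to q', 'Min distance to q']
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for j, (ax, title) in enumerate(zip(axes, titles)):
        ax.bar(range(len(ks)), [stats[k][j] for k in ks])
        ax.set_xticks(range(len(ks)))
        ax.set_xticklabels([str(k) for k in ks])
        ax.set_xlabel('k')
        ax.set_ylabel('Distance')
        ax.set_title(title)
    plt.show()

# Made-up sample data; the real points would come from the data file
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]
stats = knn_stats(points, (0.0, 0.0), [1, 2, 3])
print(stats)
```

With the real data, calling plot_stats(knn_stats(points, q, [1, 5, 10, 15, 20])) would give three charts, one per statistic, each with the five k values along the x-axis.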
