The KNN Machine Learning Algorithm

k-nearest neighbors (kNN) is the most basic of the instance-based learning methods.

Basic Description

The k-nearest-neighbor (kNN, k-NearestNeighbor) classification algorithm works as follows: given a test sample, find the k training samples closest to it in the training set under some distance metric, then make a prediction based on the information from these k "neighbors".

For regression tasks, the "averaging method" can be used: the mean of the real-valued outputs of these k samples is taken as the prediction. Alternatively, the neighbors can be weighted by distance in a weighted average or weighted vote, with closer samples receiving larger weights. kNN is at once one of the simplest machine learning algorithms, the most basic of the instance-based learning methods, and one of the best text classification algorithms.
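The averaging and distance-weighting ideas for the regression case can be sketched in a few lines. This is a minimal illustration, not code from this post; the function name `knn_regress` and the 1/d weighting scheme are my own choices:

```python
import numpy as np

def knn_regress(query, X, y, k=3, weighted=True):
    """Predict a real value as the (optionally distance-weighted)
    mean of the k nearest neighbors' outputs."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))  # Euclidean distance to each row
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    if not weighted:
        return y[nearest].mean()                     # plain "averaging method"
    # Closer neighbors get larger weights; the small epsilon avoids
    # division by zero when a training point coincides with the query.
    w = 1.0 / (dists[nearest] + 1e-8)
    return np.dot(w, y[nearest]) / w.sum()
```

With `weighted=True`, a query sitting almost on top of one training point essentially inherits that point's output, which is the "closer means heavier weight" behavior described above.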

From the description above, the algorithm proceeds in three steps:

1) Compute distances: given a test object, compute its distance to every object in the training set.
2) Find neighbors: take the k training objects with the smallest distances as the test object's nearest neighbors.
3) Classify: assign the test object to the majority class among these k neighbors.
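The three steps above can be sketched directly in a few lines of pure Python on toy data (a minimal illustration; the name `knn_predict` and the tuple-based points are mine, and the full NumPy implementation comes later in the post):

```python
import math
from collections import Counter

def knn_predict(test_point, train_points, train_labels, k):
    # 1) compute distances: test point vs. every training point
    dists = [math.dist(test_point, p) for p in train_points]
    # 2) find neighbors: indices of the k smallest distances
    neighbors = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # 3) classify: majority vote among the neighbors' labels
    votes = Counter(train_labels[i] for i in neighbors)
    return votes.most_common(1)[0][0]
```

For example, with training points `[(0, 0), (0, 1), (5, 5), (6, 5)]` labeled `['a', 'a', 'b', 'b']`, the query `(1, 0)` has neighbors 0, 1, 2 at k=3 and is voted into class `'a'`.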

Pros and Cons

First, the advantages:

① Simple: easy to understand and implement, with no parameter estimation and no training phase.
② Accurate and insensitive to outliers: a few noisy data points have little effect on the result.
③ Well suited to classifying rare events.
④ Particularly suited to multi-class problems (multi-modal, where an object can carry multiple class labels), where kNN tends to outperform SVM.

Now the disadvantages:

① kNN is a lazy learning method (it does essentially no learning up front), so eager learning algorithms can be much faster at prediction time.
② Class scores are not normalized (unlike probabilistic scores).
③ The output is hard to interpret; a decision tree, for example, is far more interpretable.
④ Another problem that is easy to see: when the classes are very imbalanced, the k neighbors of a new input may be dominated by samples from the large class. Because the algorithm only looks at the "nearest" neighbors, a class with a very large sample count will crowd the neighborhood whether or not its samples are truly close to the target sample. This can be mitigated by weighting the votes, e.g. giving the nearest neighbors higher weight.
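The weighted-vote fix for imbalance in point ④ can be sketched as follows. This is an illustrative standalone snippet (the name `weighted_knn_classify` and the 1/d vote weight are my own choices): each neighbor votes with weight 1/d instead of a raw count, so a single very close minority-class sample can outvote several distant majority-class samples.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(query, X, y, k=5):
    """kNN with distance-weighted voting: closer neighbors cast
    bigger votes, which softens class-imbalance effects."""
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y[i]] += 1.0 / (dists[i] + 1e-8)  # closer => bigger vote
    return max(scores, key=scores.get)
```

With one minority point right next to the query and four majority points far away, a plain majority vote at k=5 returns the majority class, while the weighted vote correctly returns the minority class.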


Implementation

The algorithm is easy to implement. Let's look at an implementation for the Kaggle handwritten-digit recognition task.

First, a function that converts an image file into a vector, for convenient loading:

```python
import numpy as np

# convert a 32x32 text image to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    with open(filename) as fileIn:
        for row in range(rows):
            lineStr = fileIn.readline()
            for col in range(cols):
                imgVector[0, row * cols + col] = int(lineStr[col])
    return imgVector
```

Next, we load the whole dataset:

```python
import os
import numpy as np

# load the training and testing sets
def loadDataSet():
    ## step 1: get the training set
    print("---Getting training set...")
    dataSetDir = './'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')  # load the training set
    numSamples = len(trainingFileList)

    train_x = np.zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]

        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)

        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        train_y.append(label)

    ## step 2: get the testing set
    print("---Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')  # load the testing set
    numSamples = len(testingFileList)
    test_x = np.zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]

        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)

        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y
```

Then comes the kNN algorithm itself:

```python
import numpy as np

# classify newInput using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: compute distances
    # np.tile(A, reps) builds an array by repeating A reps times;
    # here the query is copied numSamples times so it can be
    # subtracted from every training row element-wise
    diff = np.tile(newInput, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2                 # square the differences
    squaredDist = squaredDiff.sum(axis=1)   # one squared distance per row
    distance = squaredDist ** 0.5

    ## step 2: sort
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # vote counts per label
    for i in range(k):
        ## step 3: pick the k nearest neighbors
        voteLabel = labels[sortedDistIndices[i]]

        ## step 4: tally the labels
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the most frequent label
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key
    return maxIndex
```
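As a side note on step 1, here is the `tile` distance trick in isolation, on a made-up 3-point training set so the intermediate shapes are easy to see:

```python
import numpy as np

train = np.array([[0.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])
query = np.array([1.0, 1.0])

# repeat the query once per training row, subtract element-wise,
# then reduce each row to a single Euclidean distance
diff = np.tile(query, (train.shape[0], 1)) - train
dist = np.sqrt((diff ** 2).sum(axis=1))
order = np.argsort(dist)  # neighbor indices, nearest first
```

Worth noting: `np.tile` is not strictly required here, since NumPy broadcasting makes `query - train` produce the same `diff` array; the explicit tile just makes the row-for-row subtraction visible.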

Finally, the test function:

```python
# test the handwriting classifier
def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    ## step 2: training (kNN has no training phase)
    print("step 2: training...")

    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    ## step 4: show the result
    print("step 4: show the result...")
    print('The classification accuracy is: %.2f%%' % (accuracy * 100))
```

Because the data in this kind of problem is complete and the vectors already have a fixed format (32×32), no test example has missing values and every feature lies in [0, 1], so no data-preprocessing stage is needed here. Whether feature engineering would help, I'm honestly not sure.
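If the features had not already shared a common [0, 1] scale, a normalization step would normally precede kNN, because distances are dominated by the features with the largest ranges. A sketch of min-max scaling (a hypothetical helper, not part of the code above):

```python
import numpy as np

def min_max_scale(X):
    """Scale each feature column to [0, 1]. Not needed for the digit
    data above, where every pixel is already 0 or 1."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant columns
    return (X - lo) / span
```

After scaling, a column ranging over [10, 30] and a column ranging over [0, 10] contribute to the distance on equal terms.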
