Machine Learning: The KNN Algorithm
I wrote about the KNN algorithm a long time ago, but it has bothered me ever since, so today this post is meant to settle the algorithm once and for all.
More than one classmate has told me that KNN is not really that slow, yet my own implementation takes close to an hour per run, and I cannot tell whether the fault lies with the algorithm or with my machine: a ThinkPad with an i5 processor and 4 GB of RAM. The code below comes from the internet and took about an hour and a half to run. Fortunately its structure is fairly clear, so I will use it for the hand-written walkthrough.
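On the speed question: a large part of the cost usually comes from computing document similarities pair by pair in pure Python loops. A minimal sketch of the alternative, stacking the vectors into matrices so a single NumPy product computes all pairwise cosine similarities at once (toy random data; all names here are hypothetical, not from the original code):

```python
import numpy as np

# Hypothetical toy data: 1000 train docs, 200 test docs, 500 features.
rng = np.random.default_rng(0)
train = rng.random((1000, 500))
test = rng.random((200, 500))

# Cosine similarity of every test doc against every train doc in one
# matrix product, instead of a Python loop over all 200*1000 pairs.
train_norm = train / np.linalg.norm(train, axis=1, keepdims=True)
test_norm = test / np.linalg.norm(test, axis=1, keepdims=True)
sims = test_norm @ train_norm.T  # shape (200, 1000)

# Indices of the k most similar train docs for each test doc.
k = 20
nearest = np.argsort(-sims, axis=1)[:, :k]
print(sims.shape, nearest.shape)
```

On dense data this replaces hundreds of thousands of Python-level iterations with one BLAS call, which is typically where an hour-long runtime goes.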
The code applies a fair amount of preprocessing, as follows.

The code below writes the text of each source file into a new folder, one word per line.

One part of the source-file handling is done particularly well: it uses stemming, an approach well worth borrowing whenever you process English text later on. The code is as follows:
import re
import nltk

def lineProcess(line):
    stopwords = nltk.corpus.stopwords.words('english')  # stop-word removal
    porter = nltk.PorterStemmer()  # stemming
    splitter = re.compile('[^a-zA-Z]')  # split on non-letter characters
    words = [porter.stem(word.lower()) for word in splitter.split(line)
             if len(word) > 0 and
             word.lower() not in stopwords]
    return words
The code above uses a regular expression to strip non-letter characters and split the line, lowercases every token, and checks each token against the stopword list. Taken together, this is the standard preprocessing pipeline for English text.

from os import listdir, mkdir, path

def createFiles():
    srcFilesList = listdir(originalSample)  # originalSample: raw-corpus root, defined elsewhere
    for i in range(len(srcFilesList)):
        if i == 0:
            continue
        dataFilesDir = originalSample + '/' + srcFilesList[i]  # path of each of the 20 folders
        dataFilesList = listdir(dataFilesDir)
        targetDir = 'processedSample_includeNotSpecial' + '/' + srcFilesList[i]  # path of each of the 20 new folders
        if path.exists(targetDir) == False:
            mkdir(targetDir)
        else:
            print('%s exists' % targetDir)
        for j in range(len(dataFilesList)):
            createProcessFile(srcFilesList[i], dataFilesList[j])  # process the text into the new folder
            print('%s %s' % (srcFilesList[i], dataFilesList[j]))
def createProcessFile(srcFilesName, dataFilesName):
    srcFile = originalSample + '/' + srcFilesName + '/' + dataFilesName
    targetFile = 'processedSample_includeNotSpecial/' + srcFilesName \
                 + '/' + dataFilesName
    fw = open(targetFile, 'w')
    dataList = []
    try:
        dataList = open(srcFile).readlines()
    except:
        print('error occur')
    for line in dataList:
        resLine = lineProcess(line)  # process each line with lineProcess()
        for word in resLine:
            fw.write('%s\n' % word)  # one word per line
    fw.close()
Next comes feature extraction:
def filterSpecialWords():
    fileDir = 'processedSample_includeNotSpecial'
    wordMapDict = {}
    sortedWordMap = countWords()
    for i in range(len(sortedWordMap)):
        wordMapDict[sortedWordMap[i][0]] = sortedWordMap[i][0]  # only membership matters here
    sampleDir = listdir(fileDir)
    for i in range(len(sampleDir)):
        targetDir = 'processedSampleOnlySpecial' + '/' + sampleDir[i]
        srcDir = 'processedSample_includeNotSpecial' + '/' + sampleDir[i]
        if path.exists(targetDir) == False:
            mkdir(targetDir)
        sample = listdir(srcDir)
        for j in range(len(sample)):
            targetSampleFile = targetDir + '/' + sample[j]
            fr = open(targetSampleFile, 'w')
            srcSampleFile = srcDir + '/' + sample[j]
            for line in open(srcSampleFile).readlines():
                word = line.strip('\n')
                if word in wordMapDict:  # keep only the selected feature words
                    fr.write('%s\n' % word)
            fr.close()
def countWords():
    wordMap = {}
    newWordMap = {}
    fileDir = 'processedSample_includeNotSpecial'
    sampleFilesList = listdir(fileDir)
    for i in range(len(sampleFilesList)):
        sampleFilesDir = fileDir + '/' + sampleFilesList[i]
        sampleList = listdir(sampleFilesDir)
        for j in range(len(sampleList)):
            sampleDir = sampleFilesDir + '/' + sampleList[j]
            for line in open(sampleDir).readlines():
                word = line.strip('\n')
                wordMap[word] = wordMap.get(word, 0.0) + 1.0
    # return only words that occur more than 4 times
    for key, value in wordMap.items():
        if value > 4:
            newWordMap[key] = value
    sortedNewWordMap = sorted(newWordMap.items())
    print('wordMap size : %d' % len(wordMap))
    print('newWordMap size : %d' % len(sortedNewWordMap))
    return sortedNewWordMap
This kind of feature selection is fairly naive, but it works! When picking features I had previously tried all sorts of methods, TF-IDF, mutual information and so on, yet none did as well. Why? Because this scheme essentially keeps the entire vocabulary as features and only discards the words that occur truly rarely. Moreover, the whole program is designed to preprocess the data layer by layer and keep the intermediate files, which is what makes an algorithm with KNN's high time complexity feasible to implement at all.
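The same document-frequency threshold can be seen in isolation on a toy corpus (data hypothetical), mirroring what countWords() and filterSpecialWords() do together:

```python
from collections import Counter

# Hypothetical toy corpus: one list of tokens per document.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "banana", "banana"],
]

# Count total occurrences across the whole corpus, as countWords() does.
counts = Counter(w for doc in docs for w in doc)

# Keep only words whose total count exceeds a threshold; everything
# else is dropped, which is the entire feature-selection step.
threshold = 2
vocab = {w for w, c in counts.items() if c > threshold}
print(sorted(vocab))  # → ['apple', 'banana']  (cherry occurs only once)
```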
Through the steps above we have created documents that contain only the feature words, rooted at processedSampleOnlySpecial.

Next we need a routine to split the data into a training set and a test set:
def createTestSample(indexOfSample, classifyRightCate, trainSamplePercent=0.9):
    fr = open(classifyRightCate, 'w')
    fileDir = 'processedSampleOnlySpecial'
    sampleFilesList = listdir(fileDir)
    for i in range(len(sampleFilesList)):
        sampleFilesDir = fileDir + '/' + sampleFilesList[i]
        sampleList = listdir(sampleFilesDir)
        m = len(sampleList)
        testBeginIndex = indexOfSample * (m * (1 - trainSamplePercent))
        testEndIndex = (indexOfSample + 1) * (m * (1 - trainSamplePercent))
        for j in range(m):
            # Files whose index falls inside the test window become test
            # samples; a category/index line is written for each so accuracy
            # can be computed later, one line per file.
            if (j > testBeginIndex) and (j < testEndIndex):
                fr.write('%s %s\n' % (sampleList[j], sampleFilesList[i]))  # file name + its category
                targetDir = 'TestSample' + str(indexOfSample) + \
                            '/' + sampleFilesList[i]
            else:
                targetDir = 'TrainSample' + str(indexOfSample) + \
                            '/' + sampleFilesList[i]
            if path.exists(targetDir) == False:
                mkdir(targetDir)
            sampleDir = sampleFilesDir + '/' + sampleList[j]
            sample = open(sampleDir).readlines()
            sampleWriter = open(targetDir + '/' + sampleList[j], 'w')
            for line in sample:
                sampleWriter.write('%s\n' % line.strip('\n'))
            sampleWriter.close()
    fr.close()

Call the function above to generate the label file plus the training and test sets:
def test():
    for i in range(10):
        classifyRightCate = 'classifyRightCate' + str(i) + '.txt'
        createTestSample(i, classifyRightCate)
This way of splitting the original and test data is also well worth borrowing; in particular, when doing cross-validation this routine can be reused directly. The KNN code does not actually call it standalone; the same split logic is embedded in 'computeTFMultiIDF()'.
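The split arithmetic itself is easy to check on its own. A small sketch (function name hypothetical) replicating the test-window computation used in createTestSample, assuming 10 folds of 10% each:

```python
def fold_bounds(index_of_sample, m, train_percent=0.9):
    # Fold i covers roughly indices [i * 0.1 * m, (i + 1) * 0.1 * m)
    # of each category's m files; everything else is training data.
    begin = index_of_sample * (m * (1 - train_percent))
    end = (index_of_sample + 1) * (m * (1 - train_percent))
    return begin, end

# With 100 files per category, fold 3 holds files around 30..40. Note
# the strict comparisons (j > begin and j < end) in the original code
# leave the boundary files in the training set.
b, e = fold_bounds(3, 100)
print(round(b), round(e))  # → 30 40
```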
The following code computes the IDF values:
from math import log

def computeIDF():
    fileDir = 'processedSampleOnlySpecial'
    wordDocMap = {}
    IDFPerWordMap = {}
    countDoc = 0.0
    cateList = listdir(fileDir)
    for i in range(len(cateList)):
        sampleDir = fileDir + '/' + cateList[i]
        sampleList = listdir(sampleDir)
        for j in range(len(sampleList)):
            sample = sampleDir + '/' + sampleList[j]
            for line in open(sample).readlines():
                word = line.strip('\n')
                wordDocMap.setdefault(word, set())
                wordDocMap[word].add(sampleList[j])  # record which documents contain the word
        print('just finished %d round' % i)
    for word in wordDocMap.keys():
        countDoc = len(wordDocMap[word])  # document frequency
        IDF = log(20000 / countDoc) / log(10)  # IDF = log10(N / df), N = 20000 documents
        IDFPerWordMap[word] = IDF
    return IDFPerWordMap
When computing the IDF values this code uses sets, which does keep it quite concise.
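The formula being applied is IDF(w) = log10(N / df(w)), with N hard-coded to 20000 documents and df(w) recovered as the size of the word's document set. A toy check (the word-document map here is hypothetical):

```python
from math import log

# Hypothetical word-to-document-set map, built the same way as
# wordDocMap in computeIDF: sets deduplicate repeated occurrences
# within one document.
word_doc_map = {
    "apple": {"doc1", "doc2"},
    "banana": {"doc1"},
}

N = 20000  # total document count hard-coded in computeIDF
idf = {w: log(N / len(docs)) / log(10) for w, docs in word_doc_map.items()}

# Mathematically: idf["apple"] = log10(20000 / 2) = log10(10000) = 4
print(idf["apple"], idf["banana"])
```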
The following code saves the computed IDF values so they can be reused later:
import time

def main():
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    IDFPerWordMap = computeIDF()
    end = time.perf_counter()
    print('runtime: ' + str(end - start))
    fw = open('IDFPerWord', 'w')
    for word, IDF in IDFPerWordMap.items():
        fw.write('%s %.6f\n' % (word, IDF))
    fw.close()

def computeTFMultiIDF(indexOfSample, trainSamplePercent):
    IDFPerWord = {}
    for line in open('IDFPerWord').readlines():
        word, IDF = line.strip('\n').split(' ')
        IDFPerWord[word] = IDF
    fileDir = 'processedSampleOnlySpecial'
    trainFileDir = 'docVector/' + 'wordTFIDFMapTrainSample' + str(indexOfSample)
    testFileDir = 'docVector/' + 'wordTFIDFMapTestSample' + str(indexOfSample)
    tsTrainWriter = open(trainFileDir, 'w')
    tsTestWriter = open(testFileDir, 'w')
    cateList = listdir(fileDir)
    for i in range(len(cateList)):
        sampleDir = fileDir + '/' + cateList[i]
        sampleList = listdir(sampleDir)
        testBeginIndex = indexOfSample * (len(sampleList) * (1 - trainSamplePercent))
        testEndIndex = (indexOfSample + 1) * (len(sampleList) * (1 - trainSamplePercent))
        for j in range(len(sampleList)):
            TFPerDocMap = {}
            sumPerDoc = 0  # total word count of this document
            sample = sampleDir + '/' + sampleList[j]
            for line in open(sample).readlines():
                sumPerDoc += 1
                word = line.strip('\n')
                TFPerDocMap[word] = TFPerDocMap.get(word, 0) + 1
            if (j >= testBeginIndex) and (j <= testEndIndex):
                tsWriter = tsTestWriter
            else:
                tsWriter = tsTrainWriter
            tsWriter.write('%s %s ' % (cateList[i], sampleList[j]))  # category + doc name
            for word, count in TFPerDocMap.items():
                TF = float(count) / float(sumPerDoc)
                tsWriter.write('%s %f ' % (word, TF * float(IDFPerWord[word])))  # word + TF-IDF weight
            tsWriter.write('\n')
        print('just finished %d round' % i)
    tsTrainWriter.close()
    tsTestWriter.close()
This code reads the data back from processedSampleOnlySpecial, computes a TF-IDF weight for each word of each document, and at the same time splits the files into training and test data stored under the docVector directory.

Now for the classification part of the code:
from numpy import mat, linalg

def computeSim(testDic, trainDic):
    testList = []
    trainList = []
    for word, weight in testDic.items():
        if word in trainDic:  # keep only the words the two documents share
            testList.append(float(weight))
            trainList.append(float(trainDic[word]))
    testVect = mat(testList)
    trainVect = mat(trainList)
    num = float(testVect * trainVect.T)  # dot product
    denom = linalg.norm(testVect) * linalg.norm(trainVect)
    return float(num) / (1 + float(denom))  # cosine similarity; the +1 guards against a zero denominator
from operator import itemgetter

def KNNComputeCate(cate_Doc, testDic, trainMap):
    simMap = {}
    for item in trainMap.items():
        similarity = computeSim(testDic, item[1])
        simMap[item[0]] = similarity
    sortedSimMap = sorted(simMap.items(), key=itemgetter(1), reverse=True)
    k = 20
    cateSimMap = {}
    for i in range(k):
        cate = sortedSimMap[i][0].split('_')[0]
        cateSimMap[cate] = cateSimMap.get(cate, 0) + sortedSimMap[i][1]  # sum similarities per category
    sortedCateSimMap = sorted(cateSimMap.items(), key=itemgetter(1), reverse=True)
    return sortedCateSimMap[0][0]
def doProcess():
    trainFiles = 'docVector/wordTFIDFMapTrainSample2'
    testFiles = 'docVector/wordTFIDFMapTestSample2'
    kNNResultFile = 'docVector/KNNClassifyResult'
    trainDocWordMap = {}
    for line in open(trainFiles).readlines():
        lineSplitBlock = line.strip('\n').split(' ')
        trainWordMap = {}
        m = len(lineSplitBlock) - 1
        for i in range(2, m, 2):  # word/weight pairs start at index 2
            trainWordMap[lineSplitBlock[i]] = lineSplitBlock[i + 1]
        temp_key = lineSplitBlock[0] + '_' + lineSplitBlock[1]  # category_docName
        trainDocWordMap[temp_key] = trainWordMap
    testDocWordMap = {}
    for line in open(testFiles).readlines():
        lineSplitBlock = line.strip('\n').split(' ')
        testWordMap = {}
        m = len(lineSplitBlock) - 1
        for i in range(2, m, 2):
            testWordMap[lineSplitBlock[i]] = lineSplitBlock[i + 1]
        temp_key = lineSplitBlock[0] + '_' + lineSplitBlock[1]
        testDocWordMap[temp_key] = testWordMap
        with open('log', 'a') as f:
            f.write(temp_key)
    count = 0
    rightCount = 0
    KNNResultWriter = open(kNNResultFile, 'w')
    for item in testDocWordMap.items():
        classifyResult = KNNComputeCate(item[0], item[1], trainDocWordMap)
        count += 1
        print('this is %d round' % count)
        classifyRight = item[0].split('_')[0]  # the true category, recovered from the key
        KNNResultWriter.write('%s %s\n' % (classifyRight, classifyResult))
        if classifyRight == classifyResult:
            rightCount += 1
        print('%s %s rightCount:%d' % (classifyRight, classifyResult, rightCount))
    accuracy = float(rightCount) / float(count)
    print('rightCount : %d , count : %d , accuracy : %.6f' % (rightCount, count, accuracy))
    return accuracy
What is worth noting in this code is the voting idea used in the K-nearest-neighbor step: roughly speaking, the aggregate similarity to each category is what gets compared, as implemented in KNNComputeCate.
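That voting scheme sums similarity per category over the top k neighbors rather than counting them, so two highly similar neighbors can outvote three weakly similar ones. A minimal sketch on hypothetical, already-sorted similarity scores (category and document names invented for illustration):

```python
from operator import itemgetter

# Hypothetical (doc_key, similarity) pairs sorted in descending order,
# with keys prefixed "category_" as in the TF-IDF vector files.
sorted_sim = [
    ("sci.space_1", 0.9),
    ("rec.autos_2", 0.5),
    ("rec.autos_3", 0.45),
    ("sci.space_4", 0.4),
]

k = 4
cate_sim = {}
for doc_key, sim in sorted_sim[:k]:
    cate = doc_key.split("_")[0]
    cate_sim[cate] = cate_sim.get(cate, 0) + sim  # sum, not count

best = max(cate_sim.items(), key=itemgetter(1))[0]
print(best)  # → sci.space  (1.3 total similarity vs 0.95 for rec.autos)
```

By plain majority vote rec.autos and sci.space would tie 2-2 here; the similarity sum breaks such ties in favor of the closer neighbors.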
[Screenshot of the accuracy results]