Python小脚本:计算两个文件内容的相似率

来源:互联网 发布:linux 定时重启tomcat 编辑:程序博客网 时间:2024/06/04 18:03

     之前拿到一堆源码文件,在整理的时候需要计算一些文件之间的相似率。与之相关第一个想到的就是Linux上的” diff “命令。把diff命令的输出拿过来简单计算一下,就有结果了。于是用Python调用一下命令行,短短几行代码,就OK了。

# !/usr/bin/python# -*- coding: utf-8 -*-"""create_author : 蛙鳜鸡鹳狸猿create_time   : 2017-06-01program       : *_* evaluate the similarity between two files *_*"""import commandsdef simieval(lf, rf):    """    UDF of evaluating the similarity between two files.    Rely on "diff" handler on Linux platform.    Default to ignore blanks by "-bB".    :param lf: string        File to diff at left.    :param rf: string        File to diff at right.    :return: float        Similarity ratio between two files.    """    le = float(commands.getoutput("diff -bB %s %s | awk 'BEGIN {cnt = 0}; /^[0-9]/ {cnt -= 1}; /^[<>]/ {cnt += 1}; END {print cnt}'" % (lf, rf)))    re = float(commands.getoutput("nl %s %s | awk '{print $2}' | grep -v '^$' | wc -l" % (lf, rf)))    result = 1 - le / re    return resultif __name__ == "__main__":    writer = "/home/student/"    ll = """    Ubuntu    MySQL    C    JAVA    Shell    """    wl = open(writer + "lf", 'w')    wl.writelines(ll)    wl.close()    lr = """    CentOS    MySQL    C    Python    """    wr = open(writer + "rf", 'w')    wr.writelines(lr)    wr.close()    se = simieval(lf=writer + "lf", rf=writer + "rf")    print(se)

原创粉丝点击