Testing the comparisons module from the ChatterBot source code
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 11:22
comparisons depends on utils.py.
The code of utils.py is as follows:
"""
ChatterBot utility functions
"""


def import_module(dotted_path):
    """
    Imports the specified module based on the
    dot notated import path for the module.
    """
    import importlib

    module_parts = dotted_path.split('.')
    module_path = '.'.join(module_parts[:-1])
    module = importlib.import_module(module_path)

    return getattr(module, module_parts[-1])


def initialize_class(data, **kwargs):
    """
    :param data: A string or dictionary containing a import_path attribute.
    """
    if isinstance(data, dict):
        import_path = data.pop('import_path')
        data.update(kwargs)
        Class = import_module(import_path)

        return Class(**data)
    else:
        Class = import_module(data)

        return Class(**kwargs)


def validate_adapter_class(validate_class, adapter_class):
    """
    Raises an exception if validate_class is not a
    subclass of adapter_class.

    :param validate_class: The class to be validated.
    :type validate_class: class

    :param adapter_class: The class type to check against.
    :type adapter_class: class

    :raises: Adapter.InvalidAdapterTypeException
    """
    from .adapters import Adapter

    # If a dictionary was passed in, check if it has an import_path attribute
    if isinstance(validate_class, dict):
        original_data = validate_class.copy()
        validate_class = validate_class.get('import_path')

        if not validate_class:
            raise Adapter.InvalidAdapterTypeException(
                'The dictionary {} must contain a value for "import_path"'.format(
                    str(original_data)
                )
            )

    if not issubclass(import_module(validate_class), adapter_class):
        raise Adapter.InvalidAdapterTypeException(
            '{} must be a subclass of {}'.format(
                validate_class,
                adapter_class.__name__
            )
        )


def input_function():
    """
    Normalizes reading input between python 2 and 3.
    The function 'raw_input' becomes 'input' in Python 3.
    """
    import sys

    if sys.version_info[0] < 3:
        user_input = str(raw_input())  # NOQA

        # Avoid problems using format strings with unicode characters
        if user_input:
            user_input = user_input.decode('utf-8')

    else:
        user_input = input()  # NOQA

    return user_input


def nltk_download_corpus(resource_path):
    """
    Download the specified NLTK corpus file
    unless it has already been downloaded.

    Returns True if the corpus needed to be downloaded.
    """
    from nltk.data import find
    from nltk import download
    from os.path import split, sep
    from zipfile import BadZipfile

    # Download the NLTK data only if it is not already downloaded
    _, corpus_name = split(resource_path)

    # From http://www.nltk.org/api/nltk.html
    # When using find() to locate a directory contained in a zipfile,
    # the resource name must end with the forward slash character.
    # Otherwise, find() will not locate the directory.
    #
    # Helps when resource_path == 'sentiment/vader_lexicon'
    if not resource_path.endswith(sep):
        resource_path = resource_path + sep

    downloaded = False

    try:
        find(resource_path)
    except LookupError:
        download(corpus_name)
        downloaded = True
    except BadZipfile:
        raise BadZipfile(
            'The NLTK corpus file being opened is not a zipfile, '
            'or it has been corrupted and needs to be manually deleted.'
        )

    return downloaded


def remove_stopwords(tokens, language):
    """
    Takes a language (i.e. 'english'), and a set of word tokens.
    Returns the tokenized text with any stopwords removed.
    Stop words are words like "is, the, a, ..."
    """
    from nltk.corpus import stopwords

    # Get the stopwords for the specified language
    stop_words = stopwords.words(language)

    # Remove the stop words from the set of word tokens
    tokens = set(tokens) - set(stop_words)

    return tokens


def get_response_time(chatbot):
    """
    Returns the amount of time taken for a given
    chat bot to return a response.

    :param chatbot: A chat bot instance.
    :type chatbot: ChatBot

    :returns: The response time in seconds.
    :rtype: float
    """
    import time

    start_time = time.time()

    chatbot.get_response('Hello')

    return time.time() - start_time


def generate_strings(total_strings, string_length=20):
    """
    Generate a list of random strings.

    :param total_strings: The number of strings to generate.
    :type total_strings: int

    :param string_length: The length of each string to generate.
    :type string_length: int

    :returns: The generated list of random strings.
    :rtype: list
    """
    import random
    import string

    statements = []
    for _ in range(0, total_strings):
        text = ''.join(
            random.choice(string.ascii_letters + string.digits + ' ')
            for _ in range(string_length)
        )
        statements.append(text)
    return statements
The code of comparisons.py is as follows:
# -*- coding: utf-8 -*-

"""
This module contains various text-comparison algorithms
designed to compare one statement to another.
"""


class Comparator:

    def __call__(self, statement_a, statement_b):
        return self.compare(statement_a, statement_b)

    def compare(self, statement_a, statement_b):
        return 0

    def get_initialization_functions(self):
        """
        Return all initialization methods for the comparison algorithm.
        Initialization methods must start with 'initialize_' and
        take no parameters.
        """
        initialization_methods = [
            (
                method,
                getattr(self, method),
            ) for method in dir(self) if method.startswith('initialize_')
        ]

        return {
            key: value for (key, value) in initialization_methods
        }


class LevenshteinDistance(Comparator):
    """
    Compare two statements based on the Levenshtein distance
    of each statement's text.

    For example, there is a 65% similarity between the statements
    "where is the post office?" and "looking for the post office"
    based on the Levenshtein distance algorithm.
    """

    def compare(self, statement, other_statement):
        """
        Compare the two input statements.

        :return: The percent of similarity between the text of the statements.
        :rtype: float
        """
        import sys

        # Use python-Levenshtein if available
        try:
            from Levenshtein.StringMatcher import StringMatcher as SequenceMatcher
        except ImportError:
            from difflib import SequenceMatcher

        PYTHON = sys.version_info[0]

        # Return 0 if either statement has a falsy text value
        if not statement or not other_statement:
            return 0

        # Get the lowercase version of both strings
        if PYTHON < 3:
            statement_text = unicode(statement.lower())  # NOQA
            other_statement_text = unicode(other_statement.lower())  # NOQA
        else:
            statement_text = str(statement.lower())
            other_statement_text = str(other_statement.lower())

        similarity = SequenceMatcher(
            None,
            statement_text,
            other_statement_text
        )

        # Calculate a decimal percent of the similarity
        percent = round(similarity.ratio(), 2)

        return percent


class SynsetDistance(Comparator):
    """
    Calculate the similarity of two statements.
    This is based on the total maximum synset similarity between each word in each sentence.

    This algorithm uses the `wordnet`_ functionality of `NLTK`_ to determine the similarity
    of two statements based on the path similarity between each token of each statement.
    This is essentially an evaluation of the closeness of synonyms.
    """

    def initialize_nltk_wordnet(self):
        """
        Download required NLTK corpora if they have not already been downloaded.
        """
        from utils import nltk_download_corpus

        nltk_download_corpus('corpora/wordnet')

    def initialize_nltk_punkt(self):
        """
        Download required NLTK corpora if they have not already been downloaded.
        """
        from utils import nltk_download_corpus

        nltk_download_corpus('tokenizers/punkt')

    def initialize_nltk_stopwords(self):
        from utils import nltk_download_corpus

        # NLTK stores the stopwords corpus under corpora/, not tokenizers/
        nltk_download_corpus('corpora/stopwords')

    def compare(self, statement, other_statement):
        """
        Compare the two input statements.

        :return: The percent of similarity between the closest synset distance.
        :rtype: float

        .. _wordnet: http://www.nltk.org/howto/wordnet.html
        .. _NLTK: http://www.nltk.org/
        """
        from nltk.corpus import wordnet
        from nltk import word_tokenize
        # Use the local utils.py, consistent with the initialize_ methods above
        import utils
        import itertools

        tokens1 = word_tokenize(statement.lower())
        tokens2 = word_tokenize(other_statement.lower())

        # Remove all stop words from the list of word tokens
        tokens1 = utils.remove_stopwords(tokens1, language='english')
        tokens2 = utils.remove_stopwords(tokens2, language='english')

        # The maximum possible similarity is an exact match
        # Because path_similarity returns a value between 0 and 1,
        # max_possible_similarity is the number of words in the longer
        # of the two input statements.
        max_possible_similarity = max(
            len(statement.split()),
            len(other_statement.split())
        )

        max_similarity = 0.0

        # Get the highest matching value for each possible combination of words
        for combination in itertools.product(*[tokens1, tokens2]):

            synset1 = wordnet.synsets(combination[0])
            synset2 = wordnet.synsets(combination[1])

            if synset1 and synset2:

                # Get the highest similarity for each combination of synsets
                for synset in itertools.product(*[synset1, synset2]):
                    similarity = synset[0].path_similarity(synset[1])

                    if similarity and (similarity > max_similarity):
                        max_similarity = similarity

        if max_possible_similarity == 0:
            return 0

        return max_similarity / max_possible_similarity


class SentimentComparison(Comparator):
    """
    Calculate the similarity of two statements based on the closeness of
    the sentiment value calculated for each statement.
    """

    def initialize_nltk_vader_lexicon(self):
        """
        Download the NLTK vader lexicon for sentiment analysis
        that is required for this algorithm to run.
        """
        from utils import nltk_download_corpus

        nltk_download_corpus('sentiment/vader_lexicon')

    def compare(self, statement, other_statement):
        """
        Return the similarity of two statements based on
        their calculated sentiment values.

        :return: The percent of similarity between the sentiment value.
        :rtype: float
        """
        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        sentiment_analyzer = SentimentIntensityAnalyzer()
        statement_polarity = sentiment_analyzer.polarity_scores(statement.lower())
        statement2_polarity = sentiment_analyzer.polarity_scores(other_statement.lower())

        statement_greatest_polarity = 'neu'
        statement_greatest_score = -1
        for polarity in sorted(statement_polarity):
            if statement_polarity[polarity] > statement_greatest_score:
                statement_greatest_polarity = polarity
                statement_greatest_score = statement_polarity[polarity]

        statement2_greatest_polarity = 'neu'
        statement2_greatest_score = -1
        for polarity in sorted(statement2_polarity):
            if statement2_polarity[polarity] > statement2_greatest_score:
                statement2_greatest_polarity = polarity
                statement2_greatest_score = statement2_polarity[polarity]

        # Check if the polarity is of a different type
        if statement_greatest_polarity != statement2_greatest_polarity:
            return 0

        values = [statement_greatest_score, statement2_greatest_score]
        difference = max(values) - min(values)

        return 1.0 - difference


class JaccardSimilarity(Comparator):
    """
    Calculates the similarity of two statements based on the Jaccard index.

    The Jaccard index is composed of a numerator and denominator.
    In the numerator, we count the number of items that are shared between the sets.
    In the denominator, we count the total number of items across both sets.
    Let's say we define sentences to be equivalent if 50% or more of their tokens are equivalent.
    Here are two sample sentences:

        The young cat is hungry.
        The cat is very hungry.

    When we parse these sentences to remove stopwords, we end up with the following two sets:

        {young, cat, hungry}
        {cat, very, hungry}

    In our example above, our intersection is {cat, hungry}, which has count of two.
    The union of the sets is {young, cat, very, hungry}, which has a count of four.
    Therefore, our `Jaccard similarity index`_ is two divided by four, or 50%.
    Given our similarity threshold above, we would consider this to be a match.

    .. _`Jaccard similarity index`: https://en.wikipedia.org/wiki/Jaccard_index
    """

    SIMILARITY_THRESHOLD = 0.5

    def initialize_nltk_wordnet(self):
        """
        Download the NLTK wordnet corpora that is required for this algorithm
        to run only if the corpora has not already been downloaded.
        """
        from utils import nltk_download_corpus

        nltk_download_corpus('corpora/wordnet')

    def initialize_nltk_averaged_perceptron_tagger(self):
        from utils import nltk_download_corpus

        # NLTK stores this tagger under taggers/, not corpora/
        nltk_download_corpus('taggers/averaged_perceptron_tagger')

    def compare(self, statement, other_statement):
        """
        Return the calculated similarity of two
        statements based on the Jaccard index.
        """
        from nltk.corpus import wordnet
        import nltk
        import string

        a = statement.lower()
        b = other_statement.lower()

        # Get default English stopwords and extend with punctuation
        stopwords = nltk.corpus.stopwords.words('english')
        stopwords.extend(string.punctuation)
        stopwords.append('')
        lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

        def get_wordnet_pos(pos_tag):
            if pos_tag[1].startswith('J'):
                return (pos_tag[0], wordnet.ADJ)
            elif pos_tag[1].startswith('V'):
                return (pos_tag[0], wordnet.VERB)
            elif pos_tag[1].startswith('N'):
                return (pos_tag[0], wordnet.NOUN)
            elif pos_tag[1].startswith('R'):
                return (pos_tag[0], wordnet.ADV)
            else:
                return (pos_tag[0], wordnet.NOUN)

        ratio = 0
        pos_a = map(get_wordnet_pos, nltk.pos_tag(nltk.tokenize.word_tokenize(a)))
        pos_b = map(get_wordnet_pos, nltk.pos_tag(nltk.tokenize.word_tokenize(b)))
        lemma_a = [
            lemmatizer.lemmatize(
                token.strip(string.punctuation),
                pos
            ) for token, pos in pos_a if pos == wordnet.NOUN and token.strip(
                string.punctuation
            ) not in stopwords
        ]
        lemma_b = [
            lemmatizer.lemmatize(
                token.strip(string.punctuation),
                pos
            ) for token, pos in pos_b if pos == wordnet.NOUN and token.strip(
                string.punctuation
            ) not in stopwords
        ]

        # Calculate Jaccard similarity; the prints are debug output for this test
        try:
            intersection = len(set(lemma_a).intersection(lemma_b))
            union = float(len(set(lemma_a).union(lemma_b)))
            ratio = intersection / union
            print("intersection=", intersection)
            print("union=", union)
        except Exception as e:
            print('Error', e)
        print("Jaccard ratio=", ratio)

        return ratio >= self.SIMILARITY_THRESHOLD


# ---------------------------------------- #


levenshtein_distance = LevenshteinDistance()

synset_distance = SynsetDistance()

sentiment_comparison = SentimentComparison()

jaccard_similarity = JaccardSimilarity()
The code of compare_test.py is as follows:
# -*- coding:utf-8 -*-

# Edit distance
from comparisons import LevenshteinDistance
L1 = LevenshteinDistance()
result_L1 = L1.compare("go home", "go to school")
print ("LevenshteinDistance=", result_L1)

# Synonym distance
from comparisons import SynsetDistance
s1 = SynsetDistance()
s1.initialize_nltk_wordnet()
s1.initialize_nltk_punkt()
s1.initialize_nltk_stopwords()
result_s1 = s1.compare("bad", "worse")
print ("SynsetDistance=", result_s1)

# Sentiment analysis
from comparisons import SentimentComparison
s2 = SentimentComparison()
s2.initialize_nltk_vader_lexicon()
result_s2 = s2.compare("bad mood", "good weather")
print ("sentiment=", result_s2)

# Jaccard similarity
from comparisons import JaccardSimilarity
j = JaccardSimilarity()
j.initialize_nltk_wordnet()
j.initialize_nltk_averaged_perceptron_tagger()
result_j = j.compare("I go to school", "I go to school")
print("Jaccard factor=", result_j)
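The test script calls each initialize_ method by hand. As a possible alternative, the base class's get_initialization_functions can be used to run them all. The sketch below uses a trimmed copy of Comparator and a hypothetical DemoComparator whose initializers only record that they ran (the real ones download NLTK data); run_initializers is an illustrative helper, not part of ChatterBot.

```python
class Comparator:
    """Trimmed copy of the base class from comparisons.py."""

    def get_initialization_functions(self):
        # Collect every attribute whose name starts with 'initialize_'.
        return {
            name: getattr(self, name)
            for name in dir(self)
            if name.startswith('initialize_')
        }


class DemoComparator(Comparator):
    """Hypothetical subclass used only for this illustration."""

    def __init__(self):
        self.calls = []

    def initialize_first(self):
        self.calls.append('first')

    def initialize_second(self):
        self.calls.append('second')


def run_initializers(comparator):
    """Call each initialization method once, in name order."""
    functions = comparator.get_initialization_functions()
    for name in sorted(functions):
        functions[name]()
    return sorted(functions)


demo = DemoComparator()
print(run_initializers(demo))
print(demo.calls)
```

With the real classes, the same loop would replace the three explicit s1.initialize_... calls in the script.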
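comparisons contains four comparison classes (all subclasses of Comparator), each performing one kind of semantic analysis or text processing. The test results are as follows: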
class LevenshteinDistance(Comparator)
Edit distance test:
Test input:
go home and go to school
Test result:
('LevenshteinDistance=', 0.53)
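The 0.53 can be reproduced with the standard library alone. The sketch below assumes the difflib fallback branch of LevenshteinDistance.compare (python-Levenshtein not installed); the function name levenshtein_style_similarity is made up for this illustration.

```python
from difflib import SequenceMatcher


def levenshtein_style_similarity(text_a, text_b):
    """Mirror the difflib branch of LevenshteinDistance.compare."""
    # Falsy input yields 0, as in the listing.
    if not text_a or not text_b:
        return 0
    matcher = SequenceMatcher(None, text_a.lower(), text_b.lower())
    # ratio() is 2 * matched_characters / total_characters.
    return round(matcher.ratio(), 2)


print(levenshtein_style_similarity("go home", "go to school"))  # 0.53
```

The matched blocks are "go " (3 characters) and "ho" (2 characters), and the two strings have 7 + 12 = 19 characters in total, so the ratio is 2 * 5 / 19 ≈ 0.526, which rounds to 0.53.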
-----------------------------------------------------------------------------------------------------------
Synonym distance test:
class SynsetDistance(Comparator)
Test input:
bad
worse
Output:
('SynsetDistance=', 1.0)
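To see why a single-word pair can score 1.0, it may help to isolate the scoring rule: only the best word pair counts, and it is divided by the word count of the longer statement. The sketch below replaces WordNet with a hypothetical lookup table (TOY_SIMILARITY, with made-up scores) and skips stopword removal, so it illustrates the normalization only, not real path_similarity values.

```python
import itertools

# Hypothetical pairwise scores standing in for WordNet path_similarity.
TOY_SIMILARITY = {
    ('bad', 'worse'): 1.0,
    ('bad', 'weather'): 0.1,
}


def toy_similarity(word_a, word_b):
    """Look up a made-up similarity; identical words score 1.0."""
    if word_a == word_b:
        return 1.0
    return TOY_SIMILARITY.get(
        (word_a, word_b),
        TOY_SIMILARITY.get((word_b, word_a), 0.0)
    )


def synset_style_score(statement, other_statement):
    """Best pairwise score divided by the longer statement's word count."""
    tokens_a = set(statement.lower().split())
    tokens_b = set(other_statement.lower().split())
    max_possible = max(len(statement.split()), len(other_statement.split()))
    if max_possible == 0:
        return 0
    best = 0.0
    for word_a, word_b in itertools.product(tokens_a, tokens_b):
        best = max(best, toy_similarity(word_a, word_b))
    return best / max_possible


print(synset_style_score("bad", "worse"))       # 1.0
print(synset_style_score("bad mood", "worse"))  # 0.5
```

Under this rule, adding words to either statement lowers the score even when the best pair is unchanged.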
-----------------------------------------------------------------------------------------------------------
Sentiment comparison test:
class SentimentComparison(Comparator)
Test input:
bad mood
good weather
Output:
('sentiment=', 0)
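The 0 comes from the rule that two statements with different dominant polarities are treated as completely dissimilar. The sketch below reproduces that rule on hand-written score dictionaries; the numbers are illustrative and are not actual VADER output, and the helper names are assumptions made for this example.

```python
def dominant_polarity(scores):
    """Return the (label, score) pair with the highest score."""
    best_label, best_score = 'neu', -1
    # Iterate in sorted key order, as the listing does; the
    # 'compound' entry takes part in the comparison too.
    for label in sorted(scores):
        if scores[label] > best_score:
            best_label, best_score = label, scores[label]
    return best_label, best_score


def sentiment_similarity(scores_a, scores_b):
    """0 if the dominant polarities differ, else 1 minus the score gap."""
    label_a, score_a = dominant_polarity(scores_a)
    label_b, score_b = dominant_polarity(scores_b)
    if label_a != label_b:
        return 0
    return 1.0 - abs(score_a - score_b)


# Illustrative score dictionaries, not actual VADER output.
negative = {'neg': 0.8, 'neu': 0.2, 'pos': 0.0, 'compound': -0.5}
positive = {'neg': 0.0, 'neu': 0.3, 'pos': 0.7, 'compound': 0.4}
mildly_negative = {'neg': 0.6, 'neu': 0.4, 'pos': 0.0, 'compound': -0.3}

print(sentiment_similarity(negative, positive))
print(sentiment_similarity(negative, mildly_negative))
```

The first call returns 0 because one statement is dominated by 'neg' and the other by 'pos'; the second returns roughly 0.8 because both are dominated by 'neg' and their scores differ by about 0.2.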
---------------------------------------------------------------------------------------------------------------------------------
Jaccard similarity test:
class JaccardSimilarity(Comparator)
Test input:
I go to school
I go to school
Output:
('intersection=', 1)
('union=', 1.0)
('Jaccard ratio=', 1.0)
('Jaccard factor=', True)
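Note that compare returns a boolean (ratio >= SIMILARITY_THRESHOLD), which is why 'Jaccard factor=' prints True rather than a number. The index itself can be sketched on plain token sets without NLTK; the example below uses the two sets from the class docstring, and jaccard_index is a name chosen for this illustration.

```python
def jaccard_index(tokens_a, tokens_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    # Two empty inputs have no union; treat them as not similar.
    if not union:
        return 0.0
    return len(set_a & set_b) / float(len(union))


SIMILARITY_THRESHOLD = 0.5  # same value as in JaccardSimilarity

ratio = jaccard_index(['young', 'cat', 'hungry'], ['cat', 'very', 'hungry'])
print(ratio, ratio >= SIMILARITY_THRESHOLD)  # 0.5 True
```

The intersection {cat, hungry} has two items and the union has four, giving 0.5, which just meets the threshold.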