Final Assignment


1 Systematic Literature Review

1.1 Background

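The paper I surveyed is mainly about natural language processing, and specifically about text matching: for example, given the first half of a sentence, the machine finds the best-matching response among a set of documents, or finds the sentence most semantically related to it. Text matching matters because it is closely tied to the semantics of text, which is central to many natural language processing problems. If this semantic information can be exploited well, it can help with many other problems, such as improving the precision of text clustering or the accuracy of recommendation algorithms. This is the significance of the paper's research.
Deep learning has achieved very good results in areas such as images and speech, but because of the particular nature of natural language, it has not yet produced a comparable breakthrough in this field. Many scholars and research institutes are working hard on the problem. Natural language processing is an important area of artificial intelligence, and it plays an extremely important role in making machines think like humans, in automatic real-time translation, in automatic question answering and in similar areas. I believe that as deep learning and related technologies develop, the problems of natural language processing will gradually be solved.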

1.2 Aims and objectives

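Propose a new convolutional neural network model for processing natural language, with the aim of improving the accuracy of text matching.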

1.3 Summary of the Literature Review

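Matching structured data is usually done by comparing the semantics of the objects. In the natural language field, many current methods still use the inner product of the text vectors to represent the semantic relatedness of two texts. Only a few works use deep neural networks to model the relationship between texts, and these all use the bag-of-words model, so the improvement is not significant. In addition, some works use recursive neural networks to process text, but mainly for classification tasks; some other CNN architectures require very careful tuning of hyperparameters and network structure.
To address these shortcomings, the authors propose a new model that aims to improve text semantic matching.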
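The inner-product relevance measure mentioned in this review can be sketched as follows. This is a minimal illustration, assuming a naive whitespace tokenizer and raw word counts as the text vector; the function names are hypothetical and are not taken from the surveyed paper.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

def inner_product(a, b):
    """Inner product of two sparse count vectors."""
    return sum(count * b[token] for token, count in a.items())

def cosine_similarity(text_a, text_b):
    """Length-normalised inner product; 0.0 when either text is empty."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    norm = math.sqrt(inner_product(a, a) * inner_product(b, b))
    return inner_product(a, b) / norm if norm else 0.0
```

Because word order is discarded, "dog bites man" and "man bites dog" receive the maximum score under this measure, which illustrates the kind of limitation of bag-of-words representations that the review points to.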

1.4 Summary of the paper

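Convolutional neural networks have achieved good results in both image processing and speech recognition. Borrowing these ideas, the authors propose a method that processes text in the way images are processed, hoping to improve the accuracy of text semantic similarity computation.
The authors mainly propose two models in the paper: (1) Convolve each of the two texts independently, combine the two texts at the last layer of the network, and then feed the result into a multilayer perceptron to compute the semantic similarity of the texts. (2) Combine the two texts from the very beginning and apply convolution to each combination obtained in a sliding-window fashion; after the first convolutional layer the two texts take on an image-like form, which is then processed with image-processing techniques.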
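The data flow of model (1) can be sketched in plain Python. This is only an illustration under stated assumptions: the weights are random and untrained, a single sigmoid unit stands in for the multilayer perceptron, one convolution with max pooling over positions stands in for the full network, and all names and sizes are hypothetical. It is not the authors' implementation, and model (2) is not shown.

```python
import math
import random

def conv_maxpool(word_vectors, filters, width):
    """Apply each filter to every window of `width` consecutive word vectors,
    clip negative responses to zero (ReLU), and keep the maximum over positions."""
    features = []
    for weights in filters:
        best = 0.0  # ReLU floor
        for start in range(len(word_vectors) - width + 1):
            window = [x for vec in word_vectors[start:start + width] for x in vec]
            response = sum(w * x for w, x in zip(weights, window))
            best = max(best, response)
        features.append(best)
    return features

def match_score(sent_a, sent_b, filters, width, out_weights):
    """Encode each sentence separately, join the two encodings at the end,
    then score the pair with one sigmoid unit standing in for the MLP."""
    joined = conv_maxpool(sent_a, filters, width) + conv_maxpool(sent_b, filters, width)
    logit = sum(w * x for w, x in zip(out_weights, joined))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy input: random embeddings for a 5-word and a 7-word sentence.
rng = random.Random(0)
dim, width, n_filters = 4, 2, 3
filters = [[rng.uniform(-1, 1) for _ in range(dim * width)] for _ in range(n_filters)]
out_weights = [rng.uniform(-1, 1) for _ in range(2 * n_filters)]
sent_a = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
sent_b = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(7)]
score = match_score(sent_a, sent_b, filters, width, out_weights)
```

The point of the sketch is the order of operations: the two sentences do not interact until their fixed-length encodings are joined, which is what distinguishes model (1) from model (2).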
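Experimental results: the authors ran experiments and compared the performance of different models. The results show that both proposed architectures outperform existing models at modeling text semantics, and that the second model outperforms the first.
Text matching represents the semantic relationship between texts well. With the authors' method, once the semantic relationships between texts have been computed, they can be used in tasks such as text clustering and text classification to improve accuracy. The method can also be applied to related areas of natural language understanding.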

2 Identifying potential threat

2.1 Organizational situation given

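A security guard who decides whether a person may enter the office building. He does not use computers often, but is very interested in learning about security.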

2.2 Example proposed

Social engineering, phishing

2.3 How to do it

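First, find an opportunity to talk with the security guard. Tell him that you are a security expert who teaches security courses and inspects security measures, and ask whether he would be interested in taking part.
After several conversations his trust can be won, and it then becomes possible to get inside and carry out the "attack".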

2.4 Confidence level about proposal

High.

3. KAOS Model

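3.1. Problem description

Current text clustering does not take the semantic relationships between texts into account, so clustering accuracy is unsatisfactory.

3.2. Goal model

The overall goal model of the system is shown in Figure 1.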

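The goal model for "Accurate clustering satisfied" is shown in Figure 2.

The goal model for "Texts being clustered inputed to system" is shown in Figure 3.

The goal model for "Texts assigned to one cluster" is shown in Figure 4.

The goal model for "Robust and reliable" is shown in Figure 5.

The goal model for "Usable system" is shown in Figure 6.

The goal model for "Efficient system" is shown in Figure 7.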

The goal patterns used are:
(1) Generic Goal Pattern
(2) Usable system
(3) Cheap system
(4) Efficient system

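3.3 Responsibility model

The responsibility model for the Programmer is shown in Figure 8.

3.4 One of the object models

The object model for the Clustering System is shown in Figure 9.

3.5 Operation model

The operation model for the main objects in the system is shown in Figure 10.

3.6 Potential obstacles to goal achievement

3.7 System requirements specification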

3.7.1 Introduction

3.7.1.1 Document purpose

The purpose of this document is to present a detailed description of the Short Text Clustering System based on Deep Learning Technology. It will explain the purpose and features of the system, the interfaces of the system, what the system will do, the constraints under which it must operate and how the system will react to external stimuli.
This document is intended for both the stakeholders and the developers of the system.

3.7.1.2 System purpose

This software system will be a Short Text Clustering System based on Deep Learning Technology. It is designed to provide more accurate clustering results than present methods, whose results are not good enough, by using the similarity between texts and clustering centroids, calculated with deep learning technology.
By providing more accurate clustering results, the system can do preliminary work for other relevant tasks such as Question Answering Track, Sentiment Analysis and so on.

3.7.1.3 Definitions, acronyms, and abbreviations

The concepts and their corresponding definitions are listed in Table 1.

Concept Name | Concept Definition
Centroids | Texts representing some particular clusters
Clustering Algorithm | Put similar texts together and split different texts
Clustering Module | The module of the system that performs the Clustering Algorithm
Deep Learning Technology | A kind of technology based on Artificial Neural Networks, which has the ability to automatically learn some knowledge and then apply it to do some work
Programmer | Person developing this system
Short Texts | Texts containing not too many words (for example, fewer than 100 words)
Similarity | Whether two texts talk about the same theme. If yes, the similarity is high. If not, the similarity is low
Similarity Module | The module of the system that performs the similarity calculation
Software Requirements Specification | A document that completely describes all of the functions of a proposed system and the constraints under which it must operate. For example, this document
Stakeholder | Any person with an interest in the project who is not a developer
User | Any person who interacts with this system

3.7.1.4 References

[1] IEEE. IEEE Std 830-1998 IEEE Recommended Practice for Software Requirements Specifications. IEEE Computer Society, 1998.
[2] Gregor v. Bochmann. Requirements Specification with the IEEE 830 Standard DRAFT. University of Ottawa, 2010.

3.7.1.5 Overview

The next chapter of this document is the Overall Description section, which gives an overview of the functionality of the product. It describes the informal requirements and is used to establish a context for the technical requirements specification in the next chapter.
The third chapter of this document is the Specific Requirements section, which is written primarily for the developers and describes in technical terms the details of the functionality of the product.
Both sections of the document describe the same software product in its entirety, but are intended for different audiences and thus use different language.

3.7.2. Overall description

3.7.2.1 System perspective

This system is independent and self-contained, except that it needs to interact with TensorFlow. The user only needs to input the texts to be clustered, and the system outputs the clustering results. No other external system or software is needed.

3.7.2.1.1 System interfaces

Since the similarity calculation model of this system is based on deep learning technology, the system needs to interact with a deep learning framework, TensorFlow.
• TensorFlow with GPU support enabled should already be installed on the machine.

3.7.2.1.2 User interfaces

The system should provide an interface for the user to interact with, for example to upload texts. All user interfaces are operated on the machine on which the system is installed.

3.7.2.1.3 Hardware interfaces

Since the system is based on deep learning technology, it requires interfaces with the hardware of the computer on which it is installed. The hardware interfaces required are the following:
• A CUDA-compatible GPU to speed up the computation.

3.7.2.1.4 Memory Constraints

The system needs at least 8 GB of memory to satisfy both TensorFlow and the system itself.

3.7.2.2 User requirements

Requirements are clearly distinguished in the text; each requirement is presented by giving its name, definition and unique identifying number. Requirements presented in this section are functional and nonfunctional.

3.7.2.2.1 Read data

The system should be able to read various types of data, such as "csv", "excel" and so on.
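As a rough sketch of the CSV case only, the reader below uses Python's standard `csv` module. The column name `text` and the function name `read_texts` are assumptions made for illustration; the requirement does not fix a file layout, and reading Excel files would need a third-party library that is not shown here.

```python
import csv
import io

def read_texts(stream, column="text"):
    """Return the non-empty values of `column` from a CSV stream with a header row."""
    texts = []
    for row in csv.DictReader(stream):
        # Missing cells come back as None, so normalise before stripping.
        value = (row.get(column) or "").strip()
        if value:
            texts.append(value)
    return texts

# In-memory sample standing in for an uploaded file.
sample = io.StringIO("id,text\n1,hello world\n2,\n3, short text \n")
loaded = read_texts(sample)
```

Rows whose text cell is empty are skipped rather than passed on to clustering.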

3.7.2.2.2 Calculate similarity

After the data is read into the system, the system should be able to calculate the similarity between texts and clustering centroids.
This function takes two texts as input and outputs the similarity between them.

3.7.2.2.3 Clustering texts

The system should have the function to cluster given texts using different clustering algorithms.
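One simple way to combine the similarity function with clustering is to assign every text to its most similar centroid. The sketch below assumes the centroids are already known and uses a token-overlap score as a stand-in for the deep learning similarity module; both function names are hypothetical, and any function that takes two texts and returns a number could be passed in instead.

```python
def token_overlap(text_a, text_b):
    """Jaccard overlap of token sets; a stand-in for the learned similarity."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_to_centroids(texts, centroids, similarity=token_overlap):
    """Map each centroid index to the texts that are most similar to that centroid."""
    clusters = {index: [] for index in range(len(centroids))}
    for text in texts:
        best = max(range(len(centroids)), key=lambda i: similarity(text, centroids[i]))
        clusters[best].append(text)
    return clusters

centroids = ["cheap flight tickets", "python code error"]
texts = ["book cheap flight", "python error message", "flight tickets sale"]
clusters = assign_to_centroids(texts, centroids)
```

Keeping `similarity` as a parameter means the clustering step does not depend on how the similarity is computed.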

3.7.2.2.4 Output clustering results

The system should have the function to display the clustering results.

3.7.2.2.5 Store clustering results

The system should be able to store clustering results in different types of file, such as "csv", "excel" and so on.
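For the CSV case, storing the results could look like the sketch below. The two-column layout (`cluster`, `text`) and the function name `write_clusters` are assumptions for illustration; the requirement does not prescribe a layout.

```python
import csv
import io

def write_clusters(clusters, stream):
    """Write one `cluster,text` row per clustered text, preceded by a header row."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["cluster", "text"])
    for cluster_id in sorted(clusters):
        for text in clusters[cluster_id]:
            writer.writerow([cluster_id, text])

# In-memory buffer standing in for an output file.
buffer = io.StringIO()
write_clusters({1: ["b, with comma"], 0: ["a"]}, buffer)
stored = buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand means texts that contain commas or quotes are escaped correctly.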

3.7.2.2.6 Efficient

The system should be efficient enough that the user does not have to wait too long.

3.7.2.2.7 Reliable and Robust

The system should be reliable and robust: it should give a message when the input is wrong or when clustering the texts takes too much time.
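The two kinds of message described here might be produced along the following lines. This is a sketch under assumptions: the 100-word default echoes the example given in the definition of Short Texts, while the time budget, the message wording and both function names are hypothetical.

```python
import time

def validate_texts(texts, max_words=100):
    """Return a list of human-readable problems; an empty list means the input is usable."""
    if not texts:
        return ["No texts were provided."]
    problems = []
    for position, text in enumerate(texts, start=1):
        if not isinstance(text, str) or not text.strip():
            problems.append(f"Text {position} is empty or not a string.")
        elif len(text.split()) > max_words:
            problems.append(f"Text {position} has more than {max_words} words.")
    return problems

def over_budget(started_at, budget_seconds, now=None):
    """True once more than `budget_seconds` have elapsed since `started_at`."""
    now = time.monotonic() if now is None else now
    return now - started_at > budget_seconds
```

A clustering loop could call `over_budget` between iterations and show a message to the user once it returns True.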

3.7.2.3 User characteristics

No authorization is needed and there is nothing sensitive that must be protected, so there is only one type of User.
The User is expected to have basic knowledge of clustering, for example its concepts and the input and output of a clustering algorithm.
The User is expected to be Ubuntu literate and to be able to use buttons, pull-down menus, and similar tools.

3.7.2.4 Constraints

There are a number of constraints which the system must abide by during development. The system must be developed within their bounds. These constraints dictate a number of the functional and nonfunctional requirements specified by this document. All are important to be aware of during the implementation of the software system.
• GPU compatible with CUDA: Because one module of this system is based on deep learning technology, its computation should run on a GPU to speed it up, and the GPU needs to be CUDA-compatible to enable GPU acceleration.
• Operating system: Because one module of this system is based on deep learning technology and needs to be compatible with TensorFlow, the operating system should be Ubuntu.
• Text length: If a text is too long, performance will decrease, since the similarity calculation needs a lot of computation and takes a lot of time.

3.7.2.5 Assumptions and dependencies

The assumptions and dependencies are listed below:
• The system will be installed on a machine running the Ubuntu operating system.
• The machine on which the system is installed already has a compatible, GPU-enabled version of TensorFlow installed.
• The texts to be clustered are already formatted before they are input into the system.
• While one task is executing, the user cannot execute another clustering task.
• The clustering results are output as a standard "csv" file.
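The one-task-at-a-time assumption in the list above could be enforced with a lock, as in this sketch. The class name `ClusteringRunner` and the error message are hypothetical; the sketch refuses a second task instead of queueing it, which is one possible reading of the assumption.

```python
import threading

class ClusteringRunner:
    """Runs at most one clustering task at a time; a second request is refused."""

    def __init__(self):
        self._busy = threading.Lock()

    def run(self, task, *args):
        # Non-blocking acquire: fail fast instead of queueing a second task.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("A clustering task is already running.")
        try:
            return task(*args)
        finally:
            self._busy.release()

runner = ClusteringRunner()

def task_that_starts_another():
    """Try to start a second task while the first is still running."""
    try:
        runner.run(len, [])
    except RuntimeError as error:
        return str(error)
    return "second task was accepted"

message = runner.run(task_that_starts_another)
```

The lock is released in a `finally` clause, so a task that fails does not leave the system permanently busy.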

3.7.2.6 Apportioning of requirements

The requirements presented in the User requirements section, sorted by priority level, are listed below:

Requirement Name | Priority
Read data | 1
Calculate similarity | 2
Clustering texts | 3
Output clustering results | 4
Store clustering results | 5
Efficient | 6
Reliable and Robust | 6

3.7.3. Specific requirements

This section specifies the detailed requirements which the system shall meet.

3.7.3.1 Functional Requirements

System functional requirements are specified by use cases and specific requirements. The use case helps understand system behavior, and the specific requirements extend the information from the use case.

3.7.3.1.1 Read data

3.7.3.1.2 Calculate similarity

3.7.3.1.3 Clustering texts

3.7.3.1.4 Output clustering results

3.7.3.1.5 Store clustering results

3.7.3.2 Non-Functional Requirements

3.7.3.2.1 Efficient

The system should be efficient enough that the user does not have to wait too long.

3.7.3.2.2 Reliable and Robust

The system should be reliable and robust: it should give a message when the input is wrong or when clustering the texts takes too much time.

4 Discussion

4.1 Topics

Prioritization of requirements and consensus among stakeholders.

4.2 How to do it, advantages and potential limitations

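The priority of a requirement can be defined along many different, related, or even opposing dimensions, and these dimensions can be assessed by different stakeholders. Each method listed below defines requirement priority from a different angle.

First, determining requirement priority.
1. Cost-value approach
  This method considers cost and output. The basic idea is to analyze the cost of a requirement and the benefit it can produce, and to set priority accordingly: the lower the cost and the greater the benefit, the higher the priority.
  
  Advantage: it maximizes the benefit obtained for the least expenditure.
  Disadvantage: it does not consider stakeholders' special demands. For example, stakeholders may ask for certain requirements to be completed first; with this method, if such a requirement is costly, it may be deferred.
2. Voting schemes
  This method takes the stakeholders' point of view. The basic idea is to have the stakeholders involved in the project vote on requirement priorities, and to aggregate the votes into a priority for each requirement.
  
  Advantage: it takes stakeholders' demands into account.
  Disadvantage: the requirements of different stakeholders may conflict.
3. Pairwise comparison
  The basic idea is to pair up the requirements, analyze and compare the two requirements in each pair from every aspect, and derive the priority of each requirement from the comparisons.
  
  Advantage: it evaluates the priority of each requirement fairly comprehensively.
  Disadvantage: every pair of requirements has to be compared, so the workload is large.
4. Business Case Analysis
  The basic idea is to ask of each requirement, "If this requirement is developed, what effect will it have on the subsequent development of the project?" Requirements that will have a better effect on the project going forward are given high priority, which yields the priority of each requirement.
  
  Advantage: it evaluates the priority of each requirement fairly comprehensively.
  Disadvantage: each requirement needs a fairly detailed analysis, so the workload is large.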
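Two of the methods above lend themselves to a small numeric sketch: the cost-value approach ranks requirements by benefit per unit of cost, and the workload of pairwise comparison follows from the number of distinct pairs. The function names are hypothetical, the value and cost figures in the example are invented purely for illustration, and costs are assumed to be positive.

```python
def rank_by_value_per_cost(requirements):
    """Sort (name, value, cost) triples by value/cost ratio, highest first."""
    return sorted(requirements, key=lambda item: item[1] / item[2], reverse=True)

def pairwise_comparisons(count):
    """Number of distinct pairs among `count` requirements: n * (n - 1) / 2."""
    return count * (count - 1) // 2

# Invented example figures: (requirement name, estimated value, estimated cost).
estimates = [
    ("Read data", 8, 2),
    ("Calculate similarity", 9, 6),
    ("Store clustering results", 3, 1),
]
ranking = [name for name, _value, _cost in rank_by_value_per_cost(estimates)]
```

With these invented figures, "Calculate similarity" has the highest value but the lowest ratio, so it ranks last, which is exactly the kind of deferral the disadvantage of the cost-value approach describes. The number of pairs grows quadratically with the number of requirements, which is why pairwise comparison becomes laborious for longer lists.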
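Next, reaching consensus on the requirements of different stakeholders.
An approach similar to requirement prioritization can be used to assign priorities to the stakeholders themselves. When the requirement priorities of different stakeholders conflict, first satisfy the requirements of the higher-priority stakeholder, then satisfy those of the remaining stakeholders in priority order.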
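The rule just described can be sketched as a merge of ordered requirement lists. The representation is an assumption: each stakeholder is identified by a numeric priority (1 being the highest) mapped to that stakeholder's ordered list of requirements, and the function name is hypothetical.

```python
def merge_by_stakeholder_priority(preferences):
    """Merge ordered requirement lists, serving higher-priority stakeholders first.

    `preferences` maps a stakeholder priority (1 = highest) to that stakeholder's
    ordered requirement list. A requirement already placed by a higher-priority
    stakeholder is not moved by a lower-priority one.
    """
    merged = []
    for priority in sorted(preferences):
        for requirement in preferences[priority]:
            if requirement not in merged:
                merged.append(requirement)
    return merged

merged_order = merge_by_stakeholder_priority({
    2: ["Efficient", "Read data"],
    1: ["Read data", "Clustering texts"],
})
```

Here the priority-2 stakeholder wants "Efficient" first, but the priority-1 stakeholder's ordering is applied before it.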

0 0