About Text mining
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 03:20
What Text Mining Can Do
Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analysing a large collection of documents to discover previously unknown information. The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover. Text mining can be used to analyse natural language documents about any subject, although much of the interest at present is coming from the biological sciences.
Take interactions between proteins, for example. This area of research is important for the development of drugs to modify protein interactions that are linked to disease. Text mining can not only extract information on protein interactions from documents, but it can also go one step further to discover patterns in the extracted interactions. Information may be discovered that would have been extremely difficult to find, even if it had been possible to read all the documents. This information could help to answer existing research questions or suggest new avenues to explore.
How Text Mining Works
Text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction and data mining. These various stages of a text-mining process can be combined into a single workflow. We will now look in more detail at each of these areas and how, together, they form a text-mining pipeline.
Information retrieval (IR) systems identify the documents in a collection which match a user’s query. The best-known IR systems are search engines such as Google, which identify those documents on the WWW that are relevant to a set of given words. IR systems are often used in libraries, where the documents are typically not the books themselves but digital records containing information about the books. This is changing, however, with the advent of digital libraries, where the documents being retrieved are digital versions of books and journals.
IR systems allow us to narrow down the set of documents that are relevant to a particular problem. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents for analysis. For example, if we are interested in mining information only about protein interactions, we might restrict our analysis to documents that contain the name of a protein, or some form of the verb ‘to interact’ or one of its synonyms.
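The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration: the protein lexicon, the list of interaction verbs and the sample documents are assumptions made up for the example, and a real IR system would use an index rather than scanning every document.

```python
import re

# Hypothetical lexicon of protein names (an assumption for illustration only).
PROTEIN_NAMES = {"myosin", "actin"}

# Forms of 'to interact' plus one assumed synonym ('bind').
INTERACTION_PATTERN = re.compile(r"\b(interact\w*|bind\w*|bound)\b", re.IGNORECASE)


def is_relevant(text):
    """Return True if the text names a known protein or uses an interaction verb."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(words & PROTEIN_NAMES) or bool(INTERACTION_PATTERN.search(text))


def filter_documents(documents):
    """Keep only the documents worth passing on to the more expensive stages."""
    return [doc for doc in documents if is_relevant(doc)]


# Made-up sample documents.
docs = [
    "Myosin binds actin in muscle cells.",
    "The library opens at nine.",
    "These two factors interact strongly.",
]
relevant = filter_documents(docs)
```

Here `relevant` keeps the first and third documents and drops the second, so the later stages have less text to process.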
Natural language processing (NLP) is one of the oldest and most difficult problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do. Although this goal is still some way off, NLP can perform some types of analysis with a high degree of success. For example:
Part-of-speech tagging classifies words into categories such as noun, verb or adjective
Word sense disambiguation identifies the meaning of a word, given its usage, from the multiple meanings that the word may have
Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence
The role of NLP in text mining is to provide the systems in the information extraction phase (see below) with linguistic data that they need to perform their task. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools.
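To make the idea of annotation concrete, here is a toy sketch that marks sentence boundaries and attaches part-of-speech tags. The hand-made lexicon, the tag names and the function names are all assumptions for illustration; real NLP tools use trained models rather than a lookup table.

```python
import re

# Tiny hand-made lexicon (an assumption); unknown words are tagged 'UNK'.
LEXICON = {
    "myosin": "NOUN",
    "actin": "NOUN",
    "binds": "VERB",
    "forms": "VERB",
    "to": "PREP",
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_tokens(sentence):
    """Look each token up in the lexicon and return (token, tag) pairs."""
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    return [(tok, LEXICON.get(tok.lower(), "UNK")) for tok in tokens]


def annotate(text):
    """Produce one annotation record per sentence for later extraction tools."""
    return [{"sentence": s, "tags": tag_tokens(s)} for s in split_sentences(text)]


annotations = annotate("Myosin binds to actin. Actin forms filaments.")
```

Each record in `annotations` holds one sentence and its tagged tokens, which is the kind of structure an information extraction tool could read.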
Information extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information that we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems. Tasks that IE systems can perform include:
Term analysis, which identifies the terms in a document, where a term may consist of one or more words. This is especially useful for documents that contain many complex multi-word terms, such as scientific research papers
Named-entity recognition, which identifies the names in a document, such as the names of people or organizations. Some systems are also able to recognize dates and expressions of time, quantities and associated units, percentages, and so on
Fact extraction, which identifies and extracts complex facts from documents. Such facts could be relationships between entities or events
A very simplified example of the form of a template and how it might be filled from a sentence is shown in Figure 1. Here, the IE system must be able to identify that ‘bind’ is a kind of interaction, and that ‘myosin’ and ‘actin’ are the names of proteins. This kind of information might be stored in a dictionary or an ontology, which defines the terms in a particular field and their relationship to each other. The data generated during IE are normally stored in a database ready for analysis in the final stage, data mining.
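A rough sketch of template filling, assuming a small dictionary that lists protein names and maps verb forms to interaction types. The template fields, the dictionary contents and the example sentence are assumptions for illustration and are not taken from the figure.

```python
import re

# Assumed dictionary: known protein names and verb forms mapped to interaction types.
PROTEINS = {"myosin", "actin"}
INTERACTION_VERBS = {"bind": "bind", "binds": "bind", "bound": "bind"}


def fill_template(sentence):
    """Fill a simple interaction template, or return None if the sentence does not fit."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    proteins = [t for t in tokens if t in PROTEINS]
    interactions = [INTERACTION_VERBS[t] for t in tokens if t in INTERACTION_VERBS]
    if len(proteins) >= 2 and interactions:
        return {
            "interaction": interactions[0],
            "protein_1": proteins[0],
            "protein_2": proteins[1],
        }
    return None


filled = fill_template("Myosin binds to actin.")
unfilled = fill_template("The weather is nice.")
```

The filled template is a plain dictionary, so it could be written to a database row as it stands. This version ignores word order and grammar, which is where the NLP annotations would come in.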
Data mining (DM) (often also known as knowledge discovery) is the process of identifying patterns in large sets of data. The aim is to uncover previously unknown, useful knowledge. When used in text mining, DM is applied to the facts generated by the information extraction phase. Continuing with our protein interaction example, we may have extracted a large number of protein interactions from a document collection and stored these interactions as facts in a database. By applying DM to this database, we may be able to identify patterns in the facts. This may lead to new discoveries about the types of interactions that can or cannot occur, or the relationship between types of interactions and particular diseases and so on.
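As a very small illustration of looking for patterns in extracted facts, the sketch below counts how often each interaction type and each protein pair occurs. The facts are made up, the names `protein_x` and `protein_y` are placeholders, and simple frequency counting stands in for the far richer techniques a real DM system would apply.

```python
from collections import Counter

# Made-up facts of the form (protein_1, interaction, protein_2).
facts = [
    ("myosin", "bind", "actin"),
    ("actin", "bind", "myosin"),
    ("protein_x", "inhibit", "protein_y"),
    ("myosin", "bind", "actin"),
]


def count_interaction_types(facts):
    """Count how many extracted facts there are for each interaction type."""
    return Counter(kind for _, kind, _ in facts)


def count_pairs(facts):
    """Count facts per unordered protein pair and interaction type."""
    return Counter((frozenset((a, b)), kind) for a, kind, b in facts)


type_counts = count_interaction_types(facts)
pair_counts = count_pairs(facts)
most_common_type = type_counts.most_common(1)[0]
```

Using a `frozenset` for the pair treats "myosin binds actin" and "actin binds myosin" as the same interaction, which is a design choice that suits symmetric relations but not directed ones.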
We put the results of our DM process into another database that can be queried by the end-user via a suitable graphical interface. The data generated by such queries can also be represented visually, for example, as a network of protein interactions.
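A network of protein interactions can be held as a simple adjacency structure before it is drawn. The sketch below assumes made-up facts and placeholder names such as `protein_x`; it only builds and queries the network and leaves the drawing to whatever visualisation tool is used.

```python
from collections import defaultdict

# Made-up facts of the form (protein_1, interaction, protein_2).
facts = [
    ("myosin", "bind", "actin"),
    ("protein_x", "bind", "actin"),
    ("protein_x", "inhibit", "protein_y"),
]


def build_network(facts):
    """Build an undirected network: each protein maps to its (partner, interaction) pairs."""
    network = defaultdict(set)
    for a, kind, b in facts:
        network[a].add((b, kind))
        network[b].add((a, kind))
    return dict(network)


def partners(network, protein):
    """Return the sorted interaction partners of one protein (empty list if unknown)."""
    return sorted(network.get(protein, set()))


network = build_network(facts)
actin_partners = partners(network, "actin")
```

A query such as `partners(network, "actin")` corresponds to the kind of question an end-user might ask through the graphical interface.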