About Text Mining


　　What Text Mining Can Do

　　Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analysing a large collection of documents to discover previously unknown information. The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover. Text mining can be used to analyse natural language documents about any subject, although much of the interest at present is coming from the biological sciences.

　　Take interactions between proteins, for example. This area of research is important for the development of drugs to modify protein interactions that are linked to disease. Text mining can not only extract information on protein interactions from documents, but it can also go one step further to discover patterns in the extracted interactions. Information may be discovered that would have been extremely difficult to find, even if it had been possible to read all the documents. This information could help to answer existing research questions or suggest new avenues to explore.

　　How Text Mining Works

　　Text mining involves the application of techniques from areas such as information retrieval, natural language processing, information extraction and data mining. These various stages of a text-mining process can be combined into a single workflow. We will now look in more detail at each of these areas and how, together, they form a text-mining pipeline.

　　Information retrieval (IR) systems identify the documents in a collection which match a user’s query. The best-known IR systems are search engines such as Google, which identify those documents on the WWW that are relevant to a set of given words. IR systems are often used in libraries, where the documents are typically not the books themselves but digital records containing information about the books. This is changing, however, with the advent of digital libraries, where the documents being retrieved are digital versions of books and journals.

　　IR systems allow us to narrow down the set of documents that are relevant to a particular problem. As text mining involves applying very computationally intensive algorithms to large document collections, IR can speed up the analysis considerably by reducing the number of documents for analysis. For example, if we are interested in mining information only about protein interactions, we might restrict our analysis to documents that contain the name of a protein, or some form of the verb ‘to interact’ or one of its synonyms.
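
　　The filtering step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a real IR system: the protein list, the verb pattern and the function names are assumptions made for the example, and a production system would draw its vocabulary from a curated dictionary and use an index rather than scanning every document.

```python
import re

# Hypothetical vocabulary for illustration only; a real system would load
# protein names and interaction verbs from a curated dictionary or ontology.
PROTEIN_NAMES = {"myosin", "actin", "kinesin"}
INTERACTION_PATTERN = re.compile(r"\b(interact\w*|bind\w*|bound)\b", re.IGNORECASE)


def is_relevant(document: str) -> bool:
    """Return True if the text names a known protein or uses an interaction word."""
    words = set(re.findall(r"[a-z0-9\-]+", document.lower()))
    return bool(words & PROTEIN_NAMES) or bool(INTERACTION_PATTERN.search(document))


def filter_documents(documents: list[str]) -> list[str]:
    """Keep only the documents worth passing to the more expensive later stages."""
    return [doc for doc in documents if is_relevant(doc)]
```

　　Only the documents that pass this cheap test would then be handed to the more expensive NLP and extraction stages.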

　　Natural language processing (NLP) is one of the oldest and most difficult problems in the field of artificial intelligence. It is the analysis of human language so that computers can understand natural languages as humans do. Although this goal is still some way off, NLP can perform some types of analysis with a high degree of success. For example:

　　Part-of-speech tagging classifies words into categories such as noun, verb or adjective

　　Word sense disambiguation identifies the meaning of a word, given its usage, from the multiple meanings that the word may have

　　Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence

　　The role of NLP in text mining is to provide the systems in the information extraction phase (see below) with linguistic data that they need to perform their task. Often this is done by annotating documents with information such as sentence boundaries, part-of-speech tags and parsing results, which can then be read by the information extraction tools.
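
　　As a rough sketch of what such annotation might look like, the Python function below records sentence boundaries and tokens as character offsets into the original text. The function name and the output layout are assumptions made for this example. The splitting rules are deliberately naive (abbreviations such as ‘e.g.’ would be split wrongly), and part-of-speech tags and parse results are left out because they normally require a trained model.

```python
import re


def annotate(text: str) -> dict:
    """Return standoff annotations: character offsets of sentences and tokens."""
    # Naive rule: a sentence runs up to and including the next '.', '!' or '?'.
    sentences = [
        (m.start(), m.end())
        for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)
    ]
    # A token is either a run of word characters or a single punctuation mark.
    tokens = [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]
    return {"sentences": sentences, "tokens": tokens}
```

　　Because the annotations are kept separate from the text, a downstream extraction tool can read them without the original document being modified.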

　　Information extraction (IE) is the process of automatically obtaining structured data from an unstructured natural language document. Often this involves defining the general form of the information that we are interested in as one or more templates, which are then used to guide the extraction process. IE systems rely heavily on the data generated by NLP systems. Tasks that IE systems can perform include:

　　Term analysis, which identifies the terms in a document, where a term may consist of one or more words. This is especially useful for documents that contain many complex multi-word terms, such as scientific research papers

　　Named-entity recognition, which identifies the names in a document, such as the names of people or organizations. Some systems are also able to recognize dates and expressions of time, quantities and associated units, percentages, and so on

　　Fact extraction, which identifies and extracts complex facts from documents. Such facts could be relationships between entities or events

　　A very simplified example of the form of a template and how it might be filled from a sentence is shown in Figure 1. Here, the IE system must be able to identify that ‘bind’ is a kind of interaction, and that ‘myosin’ and ‘actin’ are the names of proteins. This kind of information might be stored in a dictionary or an ontology, which defines the terms in a particular field and their relationship to each other. The data generated during IE are normally stored in a database ready for analysis in the final stage, data mining.
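
　　The idea of filling a template from a sentence can be sketched in Python as follows. Figure 1 is not reproduced here, so the template fields, the tiny dictionaries and the example sentence are all assumptions made for illustration. A real IE system would rely on the NLP annotations and on a proper ontology rather than on simple word lookup.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical dictionaries standing in for a domain ontology.
PROTEINS = {"myosin", "actin"}
INTERACTION_VERBS = {"bind": "binding", "binds": "binding", "interacts": "interaction"}


@dataclass
class InteractionTemplate:
    interaction_type: str
    protein_a: str
    protein_b: str


def fill_template(sentence: str) -> Optional[InteractionTemplate]:
    """Fill the template from a '<protein> ... <verb> ... <protein>' sentence, else None."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    verb_positions = [i for i, t in enumerate(tokens) if t in INTERACTION_VERBS]
    if not verb_positions:
        return None
    v = verb_positions[0]
    # Take the nearest known protein on each side of the first interaction verb.
    before = [t for t in tokens[:v] if t in PROTEINS]
    after = [t for t in tokens[v + 1:] if t in PROTEINS]
    if not before or not after:
        return None
    return InteractionTemplate(INTERACTION_VERBS[tokens[v]], before[-1], after[0])
```

　　For a sentence such as ‘Myosin binds to actin.’, this sketch yields a template with interaction type ‘binding’ and the two proteins ‘myosin’ and ‘actin’. A sentence with no known verb or protein yields nothing.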

　　Data mining (DM), often also known as knowledge discovery, is the process of identifying patterns in large sets of data. The aim is to uncover previously unknown, useful knowledge. When used in text mining, DM is applied to the facts generated by the information extraction phase. Continuing with our protein interaction example, we may have extracted a large number of protein interactions from a document collection and stored these interactions as facts in a database. By applying DM to this database, we may be able to identify patterns in the facts. This may lead to new discoveries about the types of interactions that can or cannot occur, or the relationship between types of interactions and particular diseases, and so on.
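
　　One of the simplest patterns to look for is a fact that is reported repeatedly across the collection. The Python sketch below counts how often each extracted fact occurs and keeps those that reach a minimum support. The facts, the `frequent_facts` name and the threshold are invented for illustration; real data mining would apply far richer techniques to a much larger database.

```python
from collections import Counter

# Invented facts (interaction type, protein A, protein B), in the shape the
# IE stage might store them; they are not taken from any real corpus.
facts = [
    ("binding", "myosin", "actin"),
    ("binding", "myosin", "actin"),
    ("binding", "tropomyosin", "actin"),
    ("phosphorylation", "kinase_x", "myosin"),
]


def frequent_facts(facts, min_support=2):
    """Return the facts that occur at least min_support times, with their counts."""
    counts = Counter(facts)
    return {fact: n for fact, n in counts.items() if n >= min_support}
```

　　With the toy data above, only the myosin and actin binding fact reaches the default support of 2.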

　　We put the results of our DM process into another database that can be queried by the end-user via a suitable graphical interface. The data generated by such queries can also be represented visually, for example, as a network of protein interactions.
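
　　A network of protein interactions can be represented as a simple adjacency map. The Python sketch below builds one from pairs of interacting proteins and renders it in Graphviz DOT format so that a graph-drawing tool can display it. The function names and the input pairs are assumptions for the example, and the interactions are treated as undirected.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def to_dot(network):
    """Render the network as an undirected graph in Graphviz DOT format."""
    # Sort each pair so that every edge is emitted exactly once.
    edges = sorted({tuple(sorted((a, b))) for a, nbrs in network.items() for b in nbrs})
    lines = ["graph interactions {"]
    lines += [f'    "{a}" -- "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)
```

　　The DOT text returned by `to_dot` can be saved to a file and drawn with any tool that reads that format.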