NLP keyword extraction tutorial with RAKE and Maui


TABLE OF CONTENTS

  • 1 Introduction
    • 1.1 Why extract keywords?
  • 2 How does keyword extraction work?
  • 3 Keyword extraction with Python using RAKE
    • 3.1 Setting up RAKE
    • 3.2 Applying RAKE on a piece of text
    • 3.3 RAKE: Behind the scenes
  • 4 Keyword extraction with Java using Maui
    • 4.1 Setting up Maui
    • 4.2 Maui: Keyword extraction from text
    • 4.3 Maui: Keyword extraction with a controlled vocabulary
  • 5 What’s next?
Alyona Medelyan
Alyona runs New Zealand-based NLP consultancy Entopix, holds a Master's in Computational Linguistics and a PhD in Computer Science, and is the author of the topic indexing tool Maui.


1 Introduction

In this tutorial you will learn how to extract keywords automatically using both Python and Java, and you will also get to know related tasks such as keyphrase extraction with a controlled vocabulary (or, in other words, text classification into a very large set of possible classes) and terminology extraction.

The tutorial is organized as follows: First, we discuss a little bit of background — what are keywords, and how does a keyword algorithm work? Then we demonstrate a simple, but in many cases effective, keyword extraction with a Python library called RAKE. And finally, we show how a Java tool called Maui extracts keywords using a machine-learning technique.

1.1 Why extract keywords?

Extracting keywords is one of the most important tasks when working with text. Readers benefit from keywords because they can judge more quickly whether the text is worth reading. Website creators benefit from keywords because they can group similar content by its topics. Algorithm programmers benefit from keywords because they reduce the dimensionality of text to the most important features. And these are just some examples…

By definition, keywords describe the main topics expressed in a document. The terminology can get a little confusing, so the image below compares related tasks in terms of the source of terminology and the number of topics selected per document.

In this tutorial we will focus on two specific tasks and their evaluation:

  • Extracting the most significant words and phrases that appear in given text
  • Identifying a set of topics from a predefined vocabulary that match a given text

If consistency of keywords across many documents is important, I always recommend that you use a vocabulary — or a lexicon or a thesaurus — unless it’s not possible for some reason.

A couple of words for those interested in text categorization (also called text classification), another popular task when working with text: if the number of categories is very large, you will struggle to collect enough training data for supervised classification. So, if you have 100 or more categories, and you can name these categories (they are not abstract), you are dealing with fine-grained categorization. We can treat this task as keyword extraction with a controlled vocabulary, or term assignment. So, read on, this tutorial is also for you!

2 How does keyword extraction work?

A typical keyword extraction algorithm has three main components:

  1. Candidate selection: Here, we extract all possible words, phrases, terms or concepts (depending on the task) that can potentially be keywords.
  2. Properties calculation: For each candidate, we need to calculate properties that indicate that it may be a keyword. For example, a candidate appearing in the title of a book is a likely keyword.
  3. Scoring and selecting keywords: All candidates can be scored by either combining the properties into a formula, or using a machine learning technique to determine the probability of a candidate being a keyword. A score or probability threshold, or a limit on the number of keywords, is then used to select the final set of keywords.

Finally, parameters such as the minimum frequency of a candidate, its minimum and maximum length in words, or the stemmer used to normalize the candidates help tweak the algorithm’s performance to a specific dataset.
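To make these three components concrete, here is a deliberately naive sketch in Python. It is not RAKE, Maui, or any other real library: candidates are single non-stopword tokens, the only property is frequency, and the top-scoring candidates are selected.

    # A toy illustration of the three components above, not a real algorithm.
    from collections import Counter

    STOPWORDS = {"the", "of", "a", "and", "in", "to", "is"}   # toy stop list

    def extract_keywords(text, limit=5, min_frequency=2):
        # 1. Candidate selection: every non-stopword token is a candidate
        tokens = [t.strip(".,;:!?").lower() for t in text.split()]
        candidates = [t for t in tokens if t and t not in STOPWORDS]

        # 2. Properties calculation: here just the frequency of each candidate
        frequencies = Counter(candidates)

        # 3. Scoring and selection: apply the frequency threshold, keep the top `limit`
        scored = {c: f for c, f in frequencies.items() if f >= min_frequency}
        return sorted(scored, key=scored.get, reverse=True)[:limit]

Real systems differ mainly in how clever steps 1 and 2 are; RAKE and Maui below are two quite different answers to that question.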

3 Keyword extraction with Python using RAKE

For Python users, there is an easy-to-use keyword extraction library called RAKE, which stands for Rapid Automatic Keyword Extraction. The algorithm itself is described in the Text Mining Applications and Theory book by Michael W. Berry (free PDF). Here, we follow the existing Python implementation. There is also a modified version that uses the natural language processing toolkit NLTK for some of the calculations. For this tutorial, I have forked and extended the original RAKE repository into RAKE-tutorial in order to use additional parameters and evaluate its performance.

3.1 Setting up RAKE

First, you will need to get the RAKE-tutorial repo from https://github.com/zelandiya/RAKE-tutorial.

Then, following the instructions in rake_tutorial.py, import RAKE, and import the operator module for the “Behind the scenes” part of this tutorial:
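A minimal sketch of the imports, assuming the module is the rake.py file that comes with the repository:

    import rake      # rake.py from the RAKE-tutorial repository
    import operator  # used later to sort candidates by score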

3.2 Applying RAKE on a piece of text

First, let us initialize RAKE with a path to a stop words list and set some parameters:
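A sketch of the initialization, assuming the stop words file SmartStoplist.txt that ships with the repository and a Rake(stop_words_path, min_char_length, max_words_length, min_keyword_frequency) constructor in the fork:

    rake_object = rake.Rake("SmartStoplist.txt", 5, 3, 4)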

Now, we have a RAKE object that extracts keywords where:

  • Each word has at least 5 characters
  • Each phrase has at most 3 words
  • Each keyword appears in the text at least 4 times

These parameters depend on the text you have at hand, and it is essential to choose these parameters carefully (try running this example with the default parameters and you will understand). More about this in the next section.

Next, once we have a piece of text stored in a variable (in this example we read it in from a file), we can apply RAKE and print the keywords:
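A sketch of this step; the file path is just a placeholder for whatever document you want to analyse, and run() is assumed to be the fork's method that returns (keyword, score) pairs:

    sample_file = open("data/docs/sample_document.txt", "r")  # placeholder path
    text = sample_file.read()

    keywords = rake_object.run(text)
    print("Keywords:", keywords)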

The output is a list of keywords, each with its score according to the algorithm.

3.3 RAKE: Behind the scenes

This time we will use a short piece of text, and we can use the default parameters here:
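Something along these lines, with a sample text containing the phrases discussed below and only the stop words file passed in (its name is assumed):

    text = ("Compatibility of systems of linear constraints over the set of "
            "natural numbers. Criteria of compatibility of a system of linear "
            "Diophantine equations are considered.")

    rake_object = rake.Rake("SmartStoplist.txt")  # default parameters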

First, RAKE splits the text into sentences and generates the candidates:
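A sketch of this step using the module-level helpers of the reference rake.py implementation (the fork may pass extra filtering parameters to them):

    sentence_list = rake.split_sentences(text)
    stopword_pattern = rake.build_stop_word_regex("SmartStoplist.txt")
    phrase_list = rake.generate_candidate_keywords(sentence_list, stopword_pattern)
    print(phrase_list)  # e.g. ['compatibility', 'systems', 'linear constraints', ...]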

Here, various punctuation marks will be treated as sentence boundaries. This works in most cases, but will not work for phrases in which these boundaries are part of the actual phrase (e.g. .Net or Dr. Who).

All words listed in the stopwords file will be treated as phrase boundaries. This helps generate candidates that consist of one or more non-stopwords, such as 'compatibility,' 'systems,' 'linear constraints,' 'set,' 'natural numbers,' and 'criteria' in this text. Most of the candidates will be valid phrases; however, it won’t work in cases where the stopword is part of the phrase. For example, ‘new’ is listed in RAKE’s stopword list. This means that neither ‘New York’ nor ‘New Zealand’ can ever be a keyword.

Second, RAKE computes the properties of each candidate, which is the sum of the scores for each of its words. The words are scored according to their frequency and the typical length of a candidate phrase in which they appear.
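Continuing the sketch with the corresponding helpers (again, names as in the reference implementation):

    word_scores = rake.calculate_word_scores(phrase_list)
    keyword_candidates = rake.generate_candidate_keyword_scores(phrase_list, word_scores)
    # keyword_candidates maps each candidate phrase to the sum of its word scores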

One issue here is that the candidates are not normalized in any way. As a result we may have keywords that look nearly identical: small scale production and small scale producers, or skim milk powder and skimmed milk powder. Ideally, a keyword extraction algorithm should apply stemming and other ways of normalizing keywords first.

Finally, we rank the keyword candidates based on RAKE’s scores. The keywords then can be either the top 5 scored candidates, or those above a chosen score threshold, or the top third, as in the example below:
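A sketch of the ranking step, using the operator module imported earlier:

    sorted_keywords = sorted(keyword_candidates.items(),
                             key=operator.itemgetter(1), reverse=True)
    total_keywords = len(sorted_keywords)
    for keyword, score in sorted_keywords[0:total_keywords // 3]:
        print("Keyword:", keyword, ", score:", score)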

The output again lists each keyword with its score, this time only for the top-scoring third of the candidates.

There are two more scripts of interest. The first one evaluates RAKE’s accuracy using a directory with documents and their manually assigned keywords, as well as the number of top ranked keywords that should be evaluated. For example:
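The invocation looks roughly like this; the script name, test directory and number of top keywords are placeholders, so check the repository for the actual names:

    # placeholder script name, dataset directory and number of top keywords to evaluate
    python evaluate_rake.py data/docs/test/ 10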

Precision tells us the percentage of correct keywords among those extracted, Recall tells us the percentage of correctly extracted keywords among all correct ones, and F-measure is a combination of both.
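As a quick reference, this is how the three measures relate for a single document's keywords; this helper is not part of the RAKE-tutorial scripts, just an illustration:

    def evaluate_keywords(extracted, manual):
        """Precision, recall and F-measure for one document's keywords."""
        extracted, manual = set(extracted), set(manual)
        matched = len(extracted & manual)
        precision = matched / len(extracted) if extracted else 0.0
        recall = matched / len(manual) if manual else 0.0
        if precision + recall == 0:
            return precision, recall, 0.0
        return precision, recall, 2 * precision * recall / (precision + recall)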

In order to improve RAKE’s performance, we can run another script I have prepared for this tutorial. It cycles through runs with different sets of parameters and evaluates the quality of keywords for each run. It then returns the parameters that performed best on this dataset. For example:
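Again, roughly along these lines; the script name below is a placeholder for the parameter-sweep script in the repository:

    # placeholder name; the script tries different parameter combinations and
    # reports the best-performing set for the given dataset
    python evaluate_rake_parameters.py data/docs/test/ 10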

These values indicate that on such long documents, RAKE is better off not including any candidates with more than 5 words and only taking into account candidates that appear fewer than 6 times.

To summarize, RAKE is a simple keyword extraction library which focuses on finding multi-word phrases containing frequent words. Its strengths are its simplicity and ease of use, whereas its weaknesses are its limited accuracy, the need to configure its parameters, and the fact that it throws away many valid phrases and doesn’t normalize candidates.

4 Keyword extraction with Java using Maui

Maui stands for Multi-purpose automatic topic indexing. It’s a GPL-licensed library written in Java, and at its core is the machine learning toolkit Weka. It’s a reincarnation of the keyword extraction algorithm KEA, developed after years of research on this topic. Compared to RAKE, Maui allows one to:

  • Extract keywords not just from text, but also with a reference to a controlled vocabulary
  • Improve the accuracy by training Maui on manually chosen keywords

4.1 Setting up Maui

Maui’s source is available on GitHub and Maven Central, but the easiest way to get it is to download the maui-standalone jar from GitHub, then copy the jar into the RAKE-tutorial working directory.

4.2 Maui: Keyword extraction from text

For comparison, let’s apply Maui to the same piece of text we used with RAKE. However, because Maui requires a training model, we first need to create such a training model. To train Maui, we execute the following command:
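Something along these lines; the jar file name, heap size and directory paths are placeholders, and the flags are explained below:

    java -Xmx1024m -jar maui-standalone.jar train \
         -l data/train/ -m keyword_extraction_model -v none -o 2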

The parameters are: train to indicate that we are training a model; -l for the path to the directory with documents and their manual keywords; -m for the path to the output model; -v none, which stands for 'no vocabulary', i.e. plain keyword extraction; and -o 2, which discards any candidates that appear fewer than two times. Because the training directory is quite large, I increased the Java heap space accordingly.

Once this command has completed (it could take several minutes), we have a training model and we can apply it to the same document we used in the RAKE example, as follows:
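For example, with placeholder paths; the exact flag for passing the input document may differ, so check the standalone jar's usage message:

    java -Xmx1024m -jar maui-standalone.jar run \
         -l data/test/document.txt -m keyword_extraction_model -v none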

The run command can be used on either a file path or a text string, and its output lists the extracted topics for the document.

We can evaluate the quality of keywords by running the test command, which triggers the evaluation built into Maui:
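For example, pointing it at a held-out test directory (placeholder paths again):

    java -Xmx1024m -jar maui-standalone.jar test \
         -l data/test/ -m keyword_extraction_model -v none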

You can get significantly better performance if you train Maui on the entire set of manually annotated documents. But make sure to exclude from the training set the files you are using for testing.

If you are interested in extracting terminology with Maui, simply increase the probability threshold, and you will get many more important terms in this document. Then count the most frequent ones in the document collection.

This gives a good indication of the kind of terminology used in these documents.

4.3 Maui: Keyword extraction with a controlled vocabulary

Assume we are extracting keywords for each document in a document collection. If we extract keywords from document text, they are bound to be inconsistent. One author may be talking about “cultivated forests” and another about “artificial forests”. In order to consistently use the same keyword for both documents, it’s a good idea to use a controlled vocabulary: a terminology list, a thesaurus, or a taxonomy. Another advantage of using a vocabulary is the fact that it contains semantics that help the extraction process. For example, by knowing that “artificial forests” is related to “forest land” and other candidates in a document, algorithms can differentiate these topics from others that are less interconnected.

Maui can work with any vocabulary in the RDF SKOS format, and there are many such vocabularies available in various fields.

Using a vocabulary with Maui is easy. Use -v to specify the path to the vocabulary file and -f to specify its format, e.g. “skos”:
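For example, training and then applying a term assignment model against a SKOS vocabulary; all file names below are placeholders, and the flag used to pass the input document to run may differ, so check the jar's usage message:

    java -Xmx1024m -jar maui-standalone.jar train \
         -l data/train/ -m term_assignment_model \
         -v data/vocabulary.rdf.gz -f skos -o 2

    java -Xmx1024m -jar maui-standalone.jar run \
         -l data/test/document.txt -m term_assignment_model \
         -v data/vocabulary.rdf.gz -f skos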

The two main advantages of these keywords are that a) they are tied to unique IDs associated with the actual meaning of these phrases, and b) they will be consistent for any other document analyzed using the same vocabulary.

Here, we have tested Maui using just the default features and parameters, after training it on just 50 documents. After training on more documents, as well as experimenting with the parameters, the performance can be improved to 35% or beyond.

5 What’s next?

We have learned the key principles of a keyword extraction algorithm and have experimented with a Python and a Java library specifically designed for this task. You can now start applying these principles in your project, regardless of the library or programming language you are using. But if you’d like to continue learning, here are some options:

  • To find out more about how Maui works, feel free to refer to my PhD thesis which explains various aspects in great detail. There is additional documentation on Maui's website.
  • To run more experiments, you could download and try out different keyword extraction datasets and vocabularies.
  • To evaluate your own keyword extraction algorithm against the state of the art, you could benchmark it on the data used in the SemEval’s keyword extraction competition.

And if you get stuck on any of the above, feel free to contact me through AirPair!
