The Ten Most Common Data Mining Questions

Abstract

While a myriad of different data mining techniques have been proposed, just a few simple questions can shed light on the key attributes and the power of each technique. In this paper Information Discovery, Inc. analyzes approaches to data mining, providing two sets of business and technical questions that dissect each technique. Information Discovery, Inc. is the leading provider of large scale data mining oriented decision support software and solutions, introducing pattern management with its breakthrough Pattern Warehouse™ technology and offering two comprehensive product suites. The Data Mining Suite™ of products directly accesses very large multi-table SQL repositories to find powerful multi-form patterns. The Knowledge Access Suite™ incrementally stores these pre-mined patterns in a Pattern Warehouse™ for access by business users. The company also offers a wide range of discovery and data mining solutions, strategic consulting and warehouse architecture design, as well as customized solutions for banking, financial services, retail, consumer packaged goods, manufacturing and web-log analysis.

Introduction

The past year has seen a dramatic surge in the level of interest in data mining, with business users wanting to take advantage of the technology for a competitive edge. The IT departments in most Fortune 500 companies are suddenly tasked with responding to deployment questions relating to data mining. The growing interest in data mining has also resulted in the introduction of a myriad of commercial products, each described with a set of terms that sound similar but in fact refer to very different functionality and are based on distinct technical approaches.

The IT managers charged with selecting a decision support system often face a challenge in responding to the needs of the business users, because the underlying concepts of data mining are far more complex than traditional query and reporting. To add to the pressure, the needs of the business users are usually urgent, and decisions must be made quickly.

However, while various approaches to data mining seem to offer distinct features and benefits, in fact just a few fundamental techniques form the basis of most data mining systems, and asking a few simple questions will help clarify the nature of each system. These questions need to be asked from the viewpoints of both business and technical users.

These questions may be viewed in the context of two companion articles. A related business article on Measuring the Dollar Value of Mined Information illustrates how the benefits of a data mining system can be quantified as tangible corporate assets. A technical article on the Characterization of Data Mining Techniques separates the technologies used in most data mining systems into three classes: equations, logic and cross-tabulation, and shows how these techniques are used in some commercial products.

Here are two sets of "Top Ten Data Mining Questions" from business and technical perspectives. Each question has three parts which together highlight one specific aspect of a data mining system's power and capability. These questions aim to bring out the character of a data mining system and help business and technical users understand how to deploy such systems.

The Top Ten Data Mining Business Questions

The top ten business questions should be asked by business users about the benefits, quality and usability of the system. They are:


Question 1: Business Benefits
a) How will this system help us?
b) How well does this system work for our industry-specific applications?
c) What information can we get that we do not already have?

It is essential to ask this question again and again. You should, of course, get new refined information, but it is not enough just to know something -- you should have information that allows you to "act" within the context of your industry. You should also measure the bottom-line dollar benefits delivered by a data mining system. See the paper "Measuring the Dollar Value of Mined Information" for a framework for this.


Question 2: Technical Know-how
a) How technically sophisticated do we need to be to use it?
b) Can business users operate it without calling the IS group all the time?
c) Is it as easy to use as an internet browser?

Business users should be empowered with direct, on-demand access to refined knowledge. They should not have to know statistics, yet should be given consistent and correct answers. The system interface should be as easy to use as a web-browser.


Question 3: Understandability and Explanations
a) Are the results intuitive or difficult to understand?
b) Do we get clear explanations for any information item presented?
c) Will the explanations be in technical statistical terms or in a form that we can understand?

Results should be presented to business users in plain English, accompanied by graphs. The system should be able to explain each piece of information it presents in clear, English-like terms that business users can easily comprehend and use.


Question 4: Follow-up Questions
a) What kinds of follow-up questions can we ask of the system?
b) Do we need to go to an analyst to get further questions answered?
c) How fast can we drill down on the fly to see more patterns?

Responses to follow-up questions must be immediate. Business users should not need to use intermediaries such as analysts to get more information after they have seen some results. If follow-up questions take time and involve intermediaries, the business users' effectiveness will suffer. Business users should get refined information as they need it, when they need it.


Question 5: Business Users
a) How many business users can this system support?
b) Can the business users tailor their own questions for the system?
c) Can users utilize the knowledge for day-to-day decision making?

The system should be able to use the same fundamental knowledge to support a few hundred business users, each with a different group perspective. Yet all of these users must be given consistent answers as they ask their own questions. The information must be presented so that it can be used for day-to-day actions.


Question 6: Accuracy, Completeness and Consistency
a) How accurate are the results the system delivers?
b) Can some patterns be missed by the system?
c) Are the results always consistent or can 100 users get 100 different answers?

The system must cover a wide range of patterns and should provide high-quality information. The knowledge provided to business users should be derived from the entire data set (and not samples) in order to increase accuracy. All business users should access the same knowledge so that they all receive consistent answers, increasing the quality of corporate information.
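
The point about mining the entire data set rather than samples can be illustrated with a small sketch. It is not taken from any product described here: the record layout, the `is_rare_event` field and the every-100th-record extract are all assumptions, chosen so that the outcome is reproducible. A different sample could happen to catch some of the rare records, but it offers no guarantee of doing so.

```python
# Hypothetical data set: 1,000 records, 5 of which carry a rare pattern.
RARE_IDS = {137, 412, 588, 733, 901}  # assumed positions of the rare records
records = [{"id": i, "is_rare_event": i in RARE_IDS} for i in range(1000)]


def count_rare(rows):
    """Count the records that show the rare pattern."""
    return sum(1 for row in rows if row["is_rare_event"])


# A 1% systematic extract: every 100th record (ids 0, 100, ..., 900).
sample = records[::100]

full_count = count_rare(records)
sample_count = count_rare(sample)

print(f"full data set: {full_count} rare records")
print(f"1% extract:    {sample_count} rare records")
```

In this constructed case the extract contains none of the rare records, so any pattern built on them would be invisible to a system that only sees the extract.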


Question 7: Incremental Analysis
a) Can we automatically analyze weekly / monthly data as it becomes available?
b) Can the system compare the "month to month" results and patterns by itself?
c) Can we get automatic pattern detection over time, every week or month?

The system should analyze data as it becomes available every week or month and perform ongoing trend analysis, highlighting the key items and the factors that drive significant changes. The incremental analysis should be performed automatically in the background, informing the user of significant trends and the underlying causes.
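
As a rough illustration of month-to-month comparison, the sketch below flags items whose value moved by more than a chosen relative threshold between two periods. The function name `significant_changes`, the 10% threshold and the regional sales figures are hypothetical; a real system would compare mined patterns rather than simple totals, and would also look for the underlying causes.

```python
def significant_changes(previous, current, threshold=0.10):
    """Return the relative change for each item that moved by more than
    `threshold` between two periods. Only items present in both periods
    are compared; items with a previous value of zero are skipped because
    the relative change is undefined."""
    changes = {}
    for key in sorted(previous.keys() & current.keys()):
        old, new = previous[key], current[key]
        if old == 0:
            continue
        change = (new - old) / old
        if abs(change) > threshold:
            changes[key] = round(change, 3)
    return changes


# Hypothetical monthly sales totals by region.
january = {"north": 1200.0, "south": 950.0, "west": 400.0}
february = {"north": 1230.0, "south": 760.0, "west": 520.0}

flagged = significant_changes(january, february)
print(flagged)  # {'south': -0.2, 'west': 0.3}
```

Running such a comparison each time a new period of data arrives is one simple way to surface changes without waiting for a user to ask.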


Question 8: Data Handling
a) How much data can the system deal with?
b) Can it work directly on our database, or do we need to extract data?
c) If it works on extracts, how do we know that some patterns are not missed?

The system should handle moderate to large volumes of data on a powerful server -- of course, small servers should not be expected to manage large data volumes. The system should work directly on the SQL database, without extracts, so that patterns are not missed and performance is improved.
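
To make "working directly on the SQL database" concrete, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database standing in for a production DBMS. The `customers` and `orders` tables, their columns and the sample rows are assumptions made for illustration. The aggregate is computed inside the database across both tables, with no extract file in between.

```python
import sqlite3

# In-memory database standing in for the corporate SQL repository.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL, returned INTEGER);
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "east"), (2, "east"), (3, "west"), (4, "west")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, 0), (1, 80.0, 0), (2, 120.0, 0),
     (3, 90.0, 1), (3, 60.0, 1), (4, 110.0, 0)],
)

# Multi-table aggregate evaluated by the database engine itself:
# order count and return rate per region.
rows = conn.execute("""
    SELECT c.region, COUNT(*) AS n_orders, AVG(o.returned) AS return_rate
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

for region, n_orders, return_rate in rows:
    print(f"{region}: {n_orders} orders, return rate {return_rate:.2f}")

conn.close()
```

Because the join and grouping run over every row in both tables, the result reflects the full data rather than whatever subset an extract happened to include.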


Question 9: Integration
a) How will it integrate into our computing environment?
b) Will it just work on our existing SQL database?
c) How easily will the system work on our intranet?

The system should run smoothly on existing open server platforms (e.g. Unix) and popular DBMS engines (e.g. Oracle, Sybase, Informix, etc.) on the server. The system should present results to users on the corporate intranet. The absence of data conditioning requirements and extract files will make integration much easier.


Question 10: Support Staff
a) What staff do I need to keep this system installed and running?
b) How do we get support and training to get started?
c) What happens after we install the system?

After the initial system design, the support personnel for the system should be kept to a minimum. One database administrator should be able to manage the DBMS, and one analyst should occasionally help in setting up discovery models, etc. Thereafter, business users should be able to use the system on their own. There should be no need for a large number of resident support analysts to act as intermediaries for the business users.