如何创建一个成功的数据仓库(data warehouse) (想了解数据仓库的人士快看)

来源:互联网 发布:淘宝助理怎么商品分类 编辑:程序博客网 时间:2024/04/29 06:24

       如何创建一个成功的数据仓库(data warehose),下面的故事将告诉你!

     

The company's first data warehouse project began with a casual conversation between several executives on their way to lunch. The people involved were the IT manager for decision support as well as several members of a department that had just decided to install a data warehouse. They had planned to install their data warehouse without any involvement from the IT department; nonetheless, the following conversation ensued:

"We urgently need a data warehouse to analyze our data!"
"In that case, why don't you take an OLAP tool with a multidimensional database? "
"Is it possible to make the sales figures available to our sales people?"
"Yes, of course, that's no problem because of its Web capacity."
"We need our answers very fast."
"Performance isn't an issue, the data requested can be made available on a local server."
"Great, when can we start our analyses?"
"Installing such a system shouldn't take more than a few weeks."

Encouraged by these casual tips from an expert, the department decided to build a data warehouse that corresponded exactly to its specific needs. Some months later, the data warehouse was installed according to the original specifications. After the first successes had become public knowledge within the company, other departments began to show interest in the data warehouse. Proudly, the system was displayed, and enthusiasm was spreading. Suddenly, each department wanted their own data warehouse, and requests began to pile up on the desks of the IT managers. However, apart from the casual conversation previously, the IT department had not been involved in the development of this first data warehouse. The project itself had been implemented by the department with the help of an external system integrator. Nobody had planned on integrating additional user groups. It had become imperative to further develop this successful data warehouse. At this point it became clear that the department had locked itself into a data mart with only limited scalability, instead of building a data warehouse with unlimited capacity for expansion. This difference between data marts and data warehouses is a basic issue that the whole company now had to face.

What to do Next?

Based on the situation mentioned, some questions arise, such as: Can a data warehouse originally conceived only for one department be used for the whole company, or should new data marts be built for each department? If the latter solution is preferred, how would one department access data from another department? Who will guarantee that all users will receive exactly the same information? The question about whether to start with data marts or a data warehouse has been widely discussed.1, 2, 3 It can only be answered by clearly defining the project抯 goals: does it have to cover the information needs for certain departments or is it seen as the first step toward a shared enterprise information pool. If only departmental needs are the issue, it will suffice to install some isolated data marts. However, if a company regards access to an integrated, company-wide database as critical for its future survival in the market, then an enterprise data warehouse is the solution to implement.

My thoughts, so far, may have created the impression that there are only two options: either quickly install a few data marts to cover a few departments' current needs for information or embark on the expensive adventure of installing an enterprise data warehouse. There is a third option that combines the best of both worlds and can be implemented quickly without sacrificing future growth options. This third option is to lay down the foundation of an enterprise data warehouse by starting with a scalable data warehouse framework in a pilot project.

How to Proceed

The procedures that lead to this scalable data warehouse pilot project are specifically designed to satisfy two seemingly contradictory requirements ?fast delivery and expandability. Assuming the project is well prepared, it should not take more than three or four months to implement a fully operational pilot for an enterprise data warehouse. After it is finished, a company-wide data warehouse platform will be available allowing users to execute concrete analyses and develop a better understanding of their real and shared needs.

What is the difference between a data warehouse pilot and a departmental data mart? In fact, these two approaches differ more in their strategic goals than in total expenditure required for conceptualization and implementation. To run a project within a department means that you do not have to negotiate with other departments and IT managers ?something that can prove time-consuming. By contrast, if you want to establish a company-wide project, you must coordinate this effort with other departments, IT management and top executives.

Figure 1: The process of creating a pilot for an enterprise.

Figure 1 shows a preparatory phase to start the data warehouse pilot project. Thorough preparations will ensure that the pilot project will not exceed a three to four month time frame. Among other things, the project team will have to clarify technical issues regarding the system and the contracts with the system partners selected, issues mostly arising out of the chosen data warehouse architecture. The selection criteria for the central database computer will depend upon the amount of data anticipated now (and in the possibly known future), the number and types of users and the complexity of the queries. From the user's point of view, the selection of the analysis tool is the most critical issue; however, thanks to standardized interfaces like ODBC, it is not mandatory to stay with a chosen tool forever. In the beginning, it is sufficient to have a suitable OLAP tool for multidimensional analysis and software programs for accessing the data warehouse database directly.

Project Design

During the design phase, all the information necessary for implementing the data warehouse must be gathered, such as:

  • Requirements of the departments regarding the potential information uses;
  • Description of source data used;
  • Definition of business terms, data definitions and transformation rules;
  • Data models for the central data warehouse and the local data marts.

Simultaneously, the necessary hardware and software must be installed. Basically, the design phase can be broken down into four steps:

Business questions from departments. In order to increase the pilot project's chance for success, the selected business questions need to be of the greatest potential usefulness. Business questions do not necessarily have to be stated as questions. Existing reports that contain key figures or concrete suggestions as to analyses not possible before can also be used.

Data sources available. After the users' requests have been roughly analyzed, the IT department must investigate the source systems and interfaces available within the company. Due to the time constraints the pilot project faces, only data can be considered that is available and meets certain quality standards, such as completeness and correct contents. For a successful pilot installation, it is important to focus on the most important requirements. Therefore, the business questions must be correlated to the available data.

Business data model. The business data model reflects the real objects customer, order, product, etc. and their relationships to each other. In order to represent them correctly in the business data model, business rules have to be applied, such as "each order relates to one customer," or "each customer can belong to various categories."4

Logical data model. The logical data model in its normalized form is based on the business data model, and all objects are represented with their attributes. Usually, not all attributes available from the source systems are needed for answering the business questions submitted. However, potentially useful data elements will be integrated so it will not later be necessary to repeat all the analyses performed for the pilot project.

Figure 2: All the data is 
available through the access layer.
Figure 2: All the data is available through the access layer.

For reasons of performance or because the query tool requires it, denormalized data models are needed alongside a normalized data model.5 One possibility is to complement the normalized data model with summary tables. In another approach, so-called "star schema" or "star models" are created in addition to the normalized data model (see Figure 2). Together with the normalized data, they are available to uses through views on the database. Each time, the data warehouse is accessed through the security layer, in which all the access authorizations are stores. The normalized approach provides a magnitude of much greater capability and scalability in allowing for any question to be asked of the data and also to easily add more data in the future to the data warehouse database.

Figure 2 represents the data marts as logical structures, i.e., the numbers are recalculated each time they are called up. When starting the pilot project, the first step should be to create the data marts logically. Only if performance is really lacking, will they be physically implemented by using fact tables, since optimizing performance is no the pilot project抯 top priority. If the performance is acceptable, it is sufficient to begin analyzing the data selected. More fine-tuning of both the database and the tools utilized should be accomplished AFTER the users/managers have begun to report their experiences and new value.

A note with regard to these two different data models: the normalized data model represents all the business relationships and, therefore, should not be changed without very good reasons for doing so. By contrast, data marts are for the most part based on star schema models and contain data for specific subject areas. If an organization utilized star models and there are changing business requirements, the data marts have to be redesigned and/or re- adjusted to meet any new business requirements. This can be very expensive and also limit the future scalability and growth of your data warehouse.

Checking the Results

The last step before adopting the logical data model is to check it by using selected business queries. A typical business query may be: "Give me all sales in the a specific month, broken down into industries (i.e.,hotels and restaurants only); number of transactions; types of customers; and mode of payment."

Using this query, system integrator and users check the model table by table to find possible interpretative errors of the data modeler. Experience shows that end users are very well able to understand a logical data model, even if they have never seen one before. Particularly for direct queries to the database, a profound understanding of the data warehouse data model is a necessity.

Implementation of the Design

After finishing the design phase with the system check outlined, the design will be implemented on the target system with the first users creating analysis and reports. The implementation phase consists of all steps necessary to transfer data from the operational systems into the data warehouse.

Core steps for success in using this method are:

  1. The transformation of the logical data model into a physical data structure on the target system;
  2. The creation of extraction and transformation programs;
  3. The implementation of the required control procedures to periodically update the data in the data warehouse;
  4. Users defining and testing their new analysis and reporting.

A pilot project for an enterprise data warehouse will usually contain only a few gigabytes of data, involve one or two departments and two to four source systems.

It may require more coordination between departments, IT managers and company executives than does a quickly installed, isolated data mart, but these efforts will really pay off once the pilot data warehouse is up and running. Your company will be using this platform both for its current informational needs and with an eye to the future as your business and your requirements expand.