《精通数据仓库设计》中英对照_第2章

来源:互联网 发布:淘宝网页版免费使用 编辑:程序博客网 时间:2024/04/29 07:34

《精通数据仓库设计》中英对照_2

2 关系的基本概念

每一种数据建模技术都有一套自己的术语、定义与技术。这些习语允许我们理解复杂与困难的概念,并用它们设计复杂的数据库。这本书提供关系数据建模技术,用于开发数据仓库数据模型。为了这个目的,这一章介绍关系数据建模的术语和专有名词。然后,接着介绍规范化技术和不同的规范化等级的规则(例如,第一范式、第二范式、第三范式)及每个范式的目标。我们会给出示例数据模型,演示规范化的过程。本章最后,讨论规范化数据模型和它的好处。

在我们进入各种数据仓库数据模型之前,我们必须理解数据模型为什么重要,及你在开发BI环境时会创建的各种模型。

Chapter 2 Fundamental Relational Concepts

Every data-modeling technique has its own set of terms, definitions, and techniques. This vernacular permits us to understand complex and difficult concepts and to use them to design complex databases. This book applies relational data-modeling techniques for developing the data warehouse data model. To that end, this chapter introduces the terms and terminology of relational data modeling. It then continues with an overview of normalization techniques and the rules for the different normalization levels (for example, first, second, and third normal form) and the purpose for each. Sample data models will be given, showing the progression of normalization. The chapter ends with a discussion of normalization of the data model and the associated benefits.

Before we get into the various types of data models we use in creating a data warehouse, it is necessary to first understand why a data model is important and the various types of data models you will create in developing your BI environment.

 

为什么需要数模型

模型是一个事物的抽象与表现,它的外观和行为与原物整体或部分相象,例如概念车、建筑模型等等。所有模型都有一个普通的客观性。设计模型用于帮助人们想象各部件如何连接起来,帮助人们理解怎么使用或应用最终的产品,减少开发的风险,保证建立的产品与需求有同样的期望。让我们更详细的介绍这些好处:

■■模型减少整体风险,保证最终产品的需求得到满足。检查最终产品的实体模型,有意的用户可以做出可靠的决定,即产品是否能真正的满足他们的需要与目标。

■■模型帮助开发者想象最终产品与其他系统或功能的接口。如果创建了一个详细的模型,可以可靠估计创建这个接口难度及可行性。(在数据仓库环境下,这个接口包括数据获取与数据交付程序,何时、何地执行清洗、审计、数据维护过程等等)。

■■模型帮助所有有关人员理解怎样与最终的产品关联起来,及如何与它们的工作联系起来。模型同时帮助开发者理解最终观众需要的技能,及需要进行哪些培训,以保证对产品正确使用。

■■最后,模型保证创建产品的人和需求产品的人对最终的成果有同样的期望。通过检查模型,错失机会的情况会大大减少,各部分人信任和信心增加,大大提高最终产品的满意度。模型是如此重要,尤其是在承担一个复杂的项目(如BI)的建设时,我们建议在模型没有完成之前,所有的项目都要搁置或者延迟。

Why Do You Need a Data Model?

A model is an abstraction or representation of a subject that looks or behaves like all or part of the original. Examples include a concept car and a model of a building. All models have a common set of objectives. They are designed to help people envision how the parts fit together, help people understand how to use or apply the final product, reduce the development risk, and ensure that the people building the product and those requesting it have the same expectations. Let’s look more closely at these benefits:

■■ A model reduces overall risk by ensuring that the requirements of the final product will be satisfactorily met. By examining a “mock-up” of the ultimate product, the intended users can make a reasonable determination of whether the product will indeed fulfill their needs and objectives.

■■ A model helps the developers envision how the final product will interface with other systems or functions. The level of effort needed to create the interfaces and their feasibility can be reasonably estimated if a detailed model is created. (In the case of a data warehouse, these interfaces include the data acquisition and the data delivery programs, where and when to perform data cleansing, audits, data maintenance processes, and so on.)

■■ A model helps all the people involved understand how to relate to the ultimate product and how it will pertain to their work function. The model also helps the developers understand the skills needed by the ultimate audience and what training needs to occur to ensure proper usage of the product.

■■ Finally a model ensures that the people building the product and those requesting it have the same expectations about the ultimate outcome of the effort. By examining the model, the potential for a missed opportunity is greatly reduced, and the belief and trust by all parties that the ultimate product will be satisfactory is greatly enhanced. We feel that a model is so important, especially when undertaking a set of projects as complex as building a business intelligence (BI) environment, that we recommend a project be halted or delayed until the justification for a solid set of models is made, signed off on, and funded.

 

关系数据建模目标

在理解了为何需要模型之后,让我们回到这个特殊的模型——数据模型。在描述各种层次的模型之前,我们先提出一套普遍的术语。

注意

本书无意替代很多有名的、权威的关于通用数据建模的书,这一节仅仅给新手提供一些我们贯通全书的术语。如果需要更多细节,有大量的数据建模书籍参考,也可以参考本书“推荐阅读”介绍的书籍。

 

Relational Data-Modeling Objects

Now that we understand the need for a model, let’s turn our attention to a specific type of model—the data model. Before describing the various levels of models, we need to come up with a common set of terms for use in describing these models.

 

CNOTE

This book is not intended to replace the many significant and authoritative books written on generic data modeling; rather this section should only serve as a refresher on some of the more significant terms we will use throughout the book. If more detail is needed, please refer to the wealth of data-modeling books at your disposal and listed in the “Recommended Reading” section in this book.

 

 

主题

第一个描述的术语是主题。你会看到我们使用面向主题的数据仓库和主题域模型。这两种情况下,术语“主题”指企业的一个数据对象,或者是一个主要的数据分类。“主题域”指企业的一个数据子集,包含相关的实体和关系。顾客、销售、产品等都是主体域。

 

Subject

The first term to describe is a subject. You will see us refer to a subject-oriented data warehouse and a subject area model. In both cases, the term subject refers to a data subject or a major category of data relevant to the business. A subject area is the subset of the enterprise’s data and consists of related entities and relationships. Customers, Sales, and Products are examples of subject areas.

实体

实体一般定义为一个人、地方、事情、概念、事件,企业对它感兴趣并且能够抓取及存贮信息。在一个模型内,实体是唯一的。对第三范式模型,有一个,也只有一个词目表示一个实体。在传统的Code Date的实体关系图(ERD)或逻辑数据建模中,有四种类型的实体,用于创建逻辑或业务数据模型与数据仓库模型。

■■首要的或基础实体定义为不依赖于任何其他实体存在的实体。一般来说,每一个主题域由一个同名的首要实体表示(除非主题域名是复数,而实体名是单数),例如顾客、销售、产品。这些实体是一些独立的数据的分组,这些数据是各异的。

■■子类实体是一个父实体(超实体)的逻辑部分或分类。例如,客户的子类实体有零售客户、批发客户。子类实体往往继承父实体的特征(属性)和关系,也就是说,零售客户会继承父实体的所有属性,客户(例如客户号、客户姓名),还有关系,例如“客户获得产品”。

■■表示属性或特征的实体,是一类依赖于其他实体的实体。对于它的父实体,会有一组发生多次的数据。顾客地址是一个表示属性的实体,因为每一个顾客可能有多个地址。

■■联合或交叉实体是一个依赖于两个或多个实体的实体。如,订单是一个联合实体,它的主键由两个父实体——客户与产品——加一个限定词,如日期,其他属性有产品数据、购买日期等。

使用这四类实体,我们有了所有创建业务或数据仓库数据模型的部件。我们会在本章的下一节描述这些模型及在第3 张与第4章描述创建的步骤。

Entity

An entity is generally defined as a person, place, thing, concept, or event in which the enterprise has both the interest and the capability to capture and store information. An entity is unique within the data model. For the third normal form data model, there is one and only one entry representing that entity. In entity-relationship diagrams (ERD) or logical data modeling in the classical Codd and Date sense, there are four types of entities from which to build logical or business data models and data warehouse models (see Figure 2.1).

■■ A Primary or Fundamental Entity is defined as an entity that does not depend on any other entity for its existence. Generally each subject area is represented by a primary entity that has the same name (except that the subject area name is pluralized and the entity name is singular), such as Customer, Sale, and Product. These entities are a grouping of dependent data occurring singularly.

■■ A Subtype Entity is a logical division or category of a parent (supertype) entity. Examples of subtypes for the Customer entity are Retail Customer and Wholesale Customer. The subtypes always inherit the characteristics, or attributes and relationships, of the parent entity; that is, the Retail Customer will inherit any attributes that describe the more generic parent entity, Customer (for example, Customer ID, Customer Name), as well as relationships such as “Customer acquires Product.”

■■ An Attributive or Characteristic Entity is an entity whose existence depends on another entity. It is created to handle a group of data that could occur multiple times for each instance of its parent entity. Customer Address is Fan attributive entity of Customer since each customer may have multiple addresses.

 

■■ An Associative or Intersection Entity is an entity that is dependent upon two or more entities for its existence, and that records data at the point of intersection. Order is an associative entity. Its key is composed of the keys of the two parent entities—Customer and Item—and a qualifier such as Date. Attributes that could be retained include the Quantity of the Item and Purchase Date.

With these four types of entities, we have all we will need in terms of components to create the business and data warehouse data models. We describe these models in the next section of this chapter and go through the steps to create them in Chapters 3 and 4.

 

元素或属性

元素或属性是关于实体最低层次的信息。它表示一个具体的信息片或一个具体实体的特性。元素或属性在一个实体内实现几个目标:

■■一个主键唯一标示一个实体,用于物理数据库在存贮和访问时定位一条记录,如客户号是客户实体的主键,产品号是产品实体的主键。

Element or Attribute

An element or attribute is the lowest level of information relating to any entity. It models a specific piece of information or a property of a specific entity. Elements or attributes serve several purposes within an entity.

■■ A primary key serves to uniquely identify the entity and is used in the physical database to locate a record for storage or access. Examples include Customer ID for the Customer entity and Item ID for the Item entity.

Chapter 2

(个人说明,图2.1 Primary Entity Item 应该画错了,主键应该是ItemId,其他属性也错了)。

 

注意

键可能是单个元素,也可能是多个元素的组合,这种情况,叫复合键。主键可能有,也可能没有意义(也即智能)。必须注意智能主键,例如一个表示地理区域或部门的帐目编码常常让人迷惑及不正确使用,关于一个好的键的规则讨论请参见补充材料。

■■外键存在于一对实体父子关系中,子实体的外键是父实体的主键,这样把两个实体连接起来。例如,客户实体的客户号,也存在订单实体内,连接这两个实体。

■■非键元素或属性是一个实体不用于唯一定义实体的部分,用于进一步描述实体的特征。例如,客户实体的非键属性包括客户姓名、客户类型等;产品实体的非键属性有产品颜色、产品数量等。

 

NOTE

The key may be a single element or it may consist of multiple elements that are combined, in which case it is called a concatenated key. Finally, primary keys may or may not have meaning or intelligence. Care must be taken with intelligent primary keys. For example, an Account Code that also depicts geographic area or department is both confusing and erroneous in this data model. See the sidebar for further rules for good keys.

■■ A foreign key is a key that exists because of a parent-child relationship between a pair of entities. The foreign key in the child entity is the primary key in the parent entity and links the two entities together. For example, the Customer ID of the Customer entity is also found in the Order entity, relating the two.

■■ A nonkey element or attribute is not needed to uniquely identify the entity but is used to further describe or characterize information about the entity. Examples of nonkey elements or attributes are Customer Name, Customer Type, Item Color, and Item Quantity.

 

一个好的键的特征

下面列出一个“行为良好”键的特征,在操作型系统里,这些键在整个生命周期是可维护的、可持续的,数据仓库也一样:

键在整个集成范围内不能为空。在任何情况下,不能有空键值。

键在整个集成范围内要唯一。在任何情况下,不能产生重复的键值。

键是由设计来决定唯一性,而不是由环境来决定唯一性。键产生器通过了仔细的思考和整个环境的测试。

键是持久稳定的。这在数据仓库内是强制的,因为数据有一个很长的生命期。

键使用可管理的格式,也就是说,在创建或维护键结构时没有不适当的花销,它由直接的整数或字符串组成,没有嵌入符号或特殊字符。

键不应该隐含智能,而只是一个普通的串。(它可能基于某些智能创建,但是,一旦创建,内置在键内的智能永不再使用)。

Characteristics of a Good Key

The following are characteristics of “well-behaved” keys—those keys that are maintainable and sustainable over the lifetime of the operational system and therefore, the data warehouse:

The key is not null over the scope of integration. It is imperative that there can never be a situation or event that could cause a null key.

The key is unique over the scope of integration. It is also imperative that there can never be a situation where duplicate keys could be generated.

The key is unique by design not by circumstance. Key generation has been carefully thought out and tested under all circumstances.

The key is persistent over time. This is mandatory in the data warehouse environment where data has a very long lifetime.

The key is in a manageable format, that is, there is no undue overhead produced in the creation or maintenance of the key structures. It consists of straightforward integers or character strings, no embedded symbols or odd characters.

The key should not contain embedded intelligence but rather is a generic string. (It may be created based on some intelligence but, once created, the intelligence embedded in the key is never used.)l

关系

关系表示两个实体间的业务规则。关系用于描述两个实体是如何自然地关联起来。图21中,客户下订单,订单包含产品,这些都是关系。在描述企业业务规则时,关系有不同的特征:

■■基数表示一个实体关联到另一个实体的最大对象数。一般地,表示为“1”或者“多。图2.1,一个用户有多个地址(帐单地址、送货地址等),而每一个地址属于一个用户。

■■可选性或必须性指示一个实体对象是否必须参与到另一个关系,这种特征告诉你关系中的最小对象数。

有几种不同的关系类型:

■■确定的关系是指父实体的主键,也作为子实体的主键。

■■非确定的关系是指父实体的主键,作为子实体的非主键。这种关系类型的一个示例是递归关系,即一个实体与自身关联。一个客户关联到其他客户(例如,公司的子公司,家庭或者家族)就是递归关系。这用于表示一个实体对象关联到同一个实体的另一个对象,如图2.2

在数据模型中,一个关系包含一个表示业务规则的动词(安排、拥有、包含等)、一个基数、可选性或必要性。

 

Relationships

A relationship documents the business rule associating two entities together. The relationship is used to describe how the two entities are naturally linked to each other. Customer places Order and Order is for Items are examples of relationships in Figure 2.1.

There are different characteristics of relationships used in documenting the business rules of the enterprise:

■■ Cardinality denotes the maximum number of occurrences of one entity that can be related to another entity. Usually these are expressed as “one” or “many.” In Figure 2.1, a Customer has many addresses (Bill-to, Ship-to) and every address belongs to one customer.

■■ Optionality or modality indicates whether an entity occurrence must participate in a relationship. This characteristic tells you the minimum number (zero or optional) of occurrences in the relationship. There are also different types of relationships:

■■ An identifying relationship is one in which the primary key of the parent entity becomes a part of the primary key of the child entity.

■■ A nonidentifying relationship is one in which the primary key of the parent entity becomes a nonkey attribute of the child entity. An example of this type of relationship is a recursive relationship, that is, a situation in which an entity is related to itself. Customers who are related to other customers

(for example, subsidiaries of corporations and families or households) are examples of recursive relationships. These are used to denote an entity occurrence that is related to another entity occurrence of the same entity. See Figure 2.2 for more on these types of relationships.

 The components of a relationship in a data model consist of a verb phrase denoting the business rule (places, has, contains), the cardinality, and the modality or optionality of the relationship.

 

数据模型的类型

数据模型是某个环境下数据的抽象与表示,它是数据需求的集合及验证,也是沟通的方法,帮助创建一个准确的、有效的的数据数据库。数据模型由实体、属性、与关系组成。在一个完整的数据模型内,要定义所有这些项目的元数据,如定义与物理特征等。

如我们早已申明的那样,数据模型对BI 的成功起关键作用,从一开始到长时间的维护与持续过程。既然数据模型如此重要,为什么没有总是被开发呢?有很多的原因:

■■不容易。创建数据模型需要IT技术组与企业团体付出很大的努力,企业必须有经过数据建模训练的员工或者外部资源。

■■需要规章和工具。一旦获得了数据建模技术,必须保持一致和依从,企业必须创建一系列详细的标准用于创建数据模型,给出指定的标准的示例,冲突解决过程、数据管家角色和责任(更详细的讨论见第3章)、元数据获取和维护过程。

■■需要相当的业务参与。一个公司的数据模型必须——重复——必须有业务团体参与。毕竟,我们是在设计重要的部件,会成为企业最终的竞争武器,我们是在业务人员创建大量的信息财富。

■■它延迟了看得见的工作。数据建模没有为企业产生切实的产品,模型提供给技术队伍关于业务与需求的信息。老笑话这样说“开始编写代码——我要继续发现它们需要什么”

■■它需要广阔的视野。数据仓库的数据建模必须围绕整个企业,它会用于创建最终的决策系统——数据集市——用于所有战略分析。因此,必须要有跨部门、跨流程的长远眼光。

■■数据建模的好处往往在在第一个项目被认识。真正的产出在于重用和企业远景。

 

Types of Data Models

A data model is an abstraction or representation of the data in a given environment. It is a collection and a subsequent verification and communication method to fully document the data requirements used in the creation of accurate, effective, and efficient physical databases. The data model consists of entities, attributes, and relationships. Within the complete data model, appropriate meta data, such as definitions and physical characteristics, is defined for each of these.

As we stated earlier, we feel that the data models you create for your BI environment are critical to the overall success of your initiative as well as the long term maintenance and sustainability of the environment. If the data model is so important, why isn’t it always developed? There are a number of reasons for this:

■■ It’s not easy. Creating the data model takes significant effort from the IT technical staff and business community. Data modelers must be either hired or internal resources trained in the disciplines of data modeling.

■■ It requires discipline and tools. Once the techniques for data modeling are learned, they must be applied with conformity and compliance. The enterprise must create a set of documents detailing the standards it will use in the creation of its data models. Examples of these are naming standards, conflict resolution procedures, data steward roles and responsibilities (see Chapter 3 for more on this topic), and meta data capture and maintenance procedures.

■■ It requires significant business involvement. A company’s data model must—repeat—must have business community involvement. We are, after all, designing the critical component of the business community’s ultimate competitive weapon. It is for them that we are creating this vast wealth of information.

■■ It postpones the visible work. Data modeling does not create tangible products that can be used by the business community. The models provide the technical staff creating the environment with information about the business environment and some requirements. The old joke goes something like this: “Start coding—I’ll go find out what they want.”

■■ It requires a broad view. The data model for the BI environment must encompass the entire enterprise. It will be used to create the ultimate decision-making components—the data marts—for all strategic analysis. Therefore, it must have a multi department and multi process perspective.

■■ The benefits of a data model are often not realized with the first project. The real productivity comes in its reuse and its enterprise perspective.

说了这些,没有开发数据模型的影响是什么?

■■抽取想要的数据非常困难。很容易会导致不能满足用户的期望,或者只能部分满足。

■■要花费相当的努力在接口上,而通常只能提供很小,甚至没有商业价值。

■■环境复杂性增长迅速。当缺乏数据模型作为地图指南,了解数据仓库里已经有什么,还需要什么,会变得非常困难,或者说不可能。

■■实际的代价缺乏数据集成,因为你无法直观的看到事情是如何结合在一起。数据仓库的开发会没有效率,甚至可能不可行。

■■最重要的障碍之一是:没有数据模型,作为资产的数据不能有效地管理。

Having said all this, what is the impact of not developing a data model?

■■ It becomes very difficult to extract desired data. It is easy to implement something that either misses the users’ expectations or only partially satisfies them.

■■ Significant effort is spent on interfaces that generally provide little or no business value.

■■ The environment’s complexity increases significantly. When there is no data model to serve as a roadmap, it becomes difficult, if not impossible, to know what you already have in your data warehouse and what needs to be added.

■■ It virtually guarantees lack of data integration because you cannot visualize how things fit together. Data warehouse development will not be effective and efficient, and may not even be feasible.

■■ One of the most significant drawbacks is that, without a data model, data will not be effectively managed as an asset.

 

现在,已经说明了需要数据模型,数据仓库的数据模型有哪些类型呢?图2.3 显示了我们推荐的数据模型类型,及模型之间的交互作用。以下章节描述一个完全的、成功的、可维护的BI环境下的不同的数据模型。重要的一点是要注意双向箭头,指向低级模型的箭头,暗示集成上一级模型的特征(基本实体、属性与关系),这保证我们使用同样的曲谱表示格式、定义与业务规则;向上指的箭头暗示我们在把模型变成现实时常常发生变化,这些变化必须反映并结合到后续的模型之中,使模型保持生命力。

Now, having explained the need for data models, what are the types of data models will you need for your data warehouse implementation? Figure 2.3 shows the types of data models we recommend and the interaction between the models. The following sections describe the different data models necessary for a complete, successful, and maintainable BI environment. It is important to note the two-way arrows. The arrows pointing to the next lower level of models indicate that the characteristics (basic entities, attributes, and relationships) are inherited from the upper model. This ensures that we are all singing from the same sheet of music in terms of format, definition, and business rules. The upward-pointing arrows indicate that changes constantly occur as we implement these models into reality and that the changes must be reflected or incorporated into the preceding models for them to remain viable.

 

主题域模型

主题域是企业感兴趣的主要事情的分组,这些感兴趣的事情最终会描述成实体。一般企业有15 20个主题域。主题域模型一个好处就是能够快速开发(一般在几天之内)。这个初始模型作为企业数据模型的蓝图,以此为基础进行提炼。主题域模型能够快速开发的原因之一是一些主题在很多公司都有,公司可以从这些现成的主题域开始。

Subject Area Model

Subject areas are major groupings of things1 of interest to the enterprise. These things of interest are eventually depicted in entities. The typical enterprise has between 15 and 20 subject areas. One of the beauties of a subject area model is that it can be developed very quickly (typically within a few days). The initial model serves as a blueprint for the business data model, and refinements in the subject area model should be expected. One of the reasons that the subject area model can be developed quickly is that there are some subjects that are common to many organizations, and a company embarking on the development of a subject area model can begin with these.

Fundamental Relational Concept

这些主题在主题域模型里保持一致的标准:

■■主题域名称使用复数名词。

■■主题域的定义指明过去、现在、未来。

■■主题域大致相当于同样层次的抽象。

■■结构化的定义,使主题域互相排斥。

These subject areas conform to standards governing the subject area model:

■■ Subject area names are plural nouns.

■■ Definitions apply implicitly to the past, present, and future.

■■ Subject areas are at approximately the same level of abstraction.

■■ Definitions are structured so that the subject areas are mutually exclusive.

主题域模型的好处

不管主题域模型创建如何快速,也只有在有好处的时候才值得努力。下面列出主题域模型的主要好处:

指导业务数据模型的开发。业务数据模型是更详细的模型,用于指导操作型系统与数据仓库模型的开发。这样做,有利于数据仓库实现它的目标之一——数据一致性。

通常有几种人影响业务数据模型。主题域模型的一个应用是拆分主题域的工作量。在这种方式下,每个人成为某个领域的专家,如客户、产品、销售。建模员有时忙于业务功能,这样每一个人的工作会涉及到多个主题域。确定每个主题域的首要人员,可以最小化重复的劳动,提高协调性。即使工作没有分拆到人,主题域模型对一致性和避免冗余提供保证。当建模员判断是否需要一个新的实体时,他会根据实体定义检查合适的主题域。在真正创建这个新实体前,建模员仅仅需要浏览那个主题域的模型(一般少于30个),而不要浏览所有上百个模型。有了这些信息,建模员或者创建新的实体,或者指出已存在的哪个实体能满足需要。

 

Subject Area Model Benefits

Regardless of how quickly the subject area model can be developed, the effort should only be undertaken if there are benefits to be gained. Following are some of the major benefits provided by the subject area model.

Guide the Business Data Model Development

The business data model is the detailed model used to guide the development of the operational systems and the data warehouse. By doing so, it helps the data warehouse accomplish one of its major generic objectives—data consistency.

Often, there are several people working on the business data model. One application of the subject area model is to divide the workload by subject area. In this manner, each person becomes an expert for a particular area such as Customers, Products, and Sales. The modelers sometimes address business functions, and hence each person’s work could involve multiple subject areas. By establishing a primary person for each subject area, duplication of effort is minimized and coordination is improved. Even if the workload is not divided by person, the subject area model helps ensure consistency and avoid redundancy. When a modeler identifies the need for a new entity, the modeler determines the appropriate subject area based on the definition. Before actually creating the new entity, the modeler need only review the entities in that subject area (typically less than 30) rather than reviewing the hundreds of entities that may exist in the full model. Armed with that information, the modeler can either create the new entity or ensure that the existing entity addresses the needs.

 

指导数据仓库项目选择

公司常常面临要推动多个数据仓库,考虑是需求组合成一个项目及建立优先级的问题。主题域模型提供一个高层次的方法,依据每个项目包含的数据把他们组合起来。这些信息可以和业务优先级、技术难度、人员可用性等一起考虑,用于建立最终的项目顺序。第3章会更详细的讨论这个问题。

Guide Data Warehouse Project Selection

Companies often contemplate multiple data warehouse initiatives and struggle with both grouping the requirements into projects and with establishing the priorities. The subject area model provides a high-level approach for grouping projects based on the data they encompass. This information should

be considered along with the business priority, technical difficulty, availability of people, and so on in establishing the final project sequence. Chapter 3 will cover this in more detail.

 

指导数据仓库开发项目。

主题问题专家常常基于已存在的数据,例如,金融主管是公司内的“金融”专家,人力资源部的某个人是“人力资源”专家,从事销售、市场与客户服务的人员会成为“客户”专家。精通所从事的领域帮助项目队伍定义需要包含的业务模块。同时,数据主文件(如客户主文件、产品主文件)趋于包含具体领域的数据。

Guide Data Warehouse Development Projects

Subject matter experts often exist based on the data that is being addressed. For example, someone in the chief financial officer’s organization would be the expert for “Financials”; someone in the Human Resources Department would be the expert for “Human Resources”; people from Sales, Marketing, and Customer Service would provide the expertise for “Customers.” Understanding the subject areas being addressed helps the project team identify the business representatives that need to be involved. Also, data master files (for example, Customer Master File, Product Master File) tend to contain data related to specific subjects.

 

业务数据模型

业务数据模型是另一种类型的模型,是数据在给定业务环境的抽象和表示,它对任何模型提供好处。它帮助人们直观地理解一个业务信息如何和另一个业务信息关联(“各部分如何互相连接起来”)。使用业务数据模型的产品包括操作型系统、数据仓库、数据集市数据库,模型提供这些数据库的元数据(或者叫数据的信息),帮助人们理解怎么使用或应用最终的产品。业务数据模型减少开发风险,因为它确保实现的系统正确反映业务环境。最后,它用于指导开发工作,它给开发者提供一个基础,用于解释业务信息关系,确保项目有关人员共享一个共同的远景。

Business Data Model

The business data model is another type of model. It is an abstraction or representation of the data in a given business environment, and it provides the benefits cited for any model. It helps people envision how the information in the business relates to other information in the business (“how the parts fit together”). Products that apply the business data model include operational systems, data warehouse, and data mart databases, and the model provides the meta data (or information about the data) for these databases to help people understand how to use or apply the final product. The business data model reduces the development risk by ensuring that all the systems implemented correctly reflect the business environment. Finally, when it is used to guide development efforts, it provides a basis to confirm the developers’ interpretation of the business information relationships to ensure that the key stakeholders share a common set of expectations.

业务数据模型的好处

业务数据模型提供一个关于业务信息及其关系一致的、稳定的视图。它用于认识、评估、反映业务变化。下面列出数据仓库业务数据模型的特殊益处:

Business Data Model Benefits

The business data model provides a consistent and stable view of the business information and business information relationships. It can be used as a basis for recognizing, evaluating, and responding to business changes. Specific benefits of the data model for data warehousing efforts follow.

 

范围定义

任何项目都应该包括范围定义作为第一步,数据仓库项目也不例外。若一个业务数据模型已经存在,它能用于传达随后的数据仓库的信息。范围定义文档应该有专门的章节用于列出数据仓库要包含的实体,另一章节专门列出那些有人期望包含在数据仓库但是没有包含在内的实体。显式的声明包含哪些实体,不包含哪些实体,确保对数据仓库的关注没有意外。这个实体列表对确定需要的主题问题专家与潜在的源系统非常重要。而且,这个清单可以用于帮助项目建立。很多活动(如数据仓库模型开发,数据转换逻辑)都依赖于数据元素的数量。使用数据实体(如果有的话,包括属性)作为项目的开始,给管理人员对项目工作量的估算提供基本依据。例如,开发数据仓库模型的公式,可能由实体与属性的数量乘以某一个数值,结果可能会根据复杂性、可用的文档等做出调整。虽然第一个数据仓库工作量的计算公式可能很粗糙,如果在实际工作中取得了数据,可以提炼这个公式,提供以后估算工作量的可靠性。

Scope Definition

Every project should include a scope definition as one of its first steps, and data warehouse projects are no exception. If a business data model already exists, it can be used to convey the information that will be addressed by the resultant data warehouse. A section of the scope document should be devoted to listing the entities that will be included within the data warehouse; another section should be devoted to listing the entities that someone could reasonably expect to be included in the data warehouse but which have been excluded. The explicit statement of the entities that are included and excluded ensures that there are no surprises with respect to the content of the data warehouse. The list of entities is useful for identifying the needed subject matter experts and for identifying the potential source systems that will be needed. Additionally, this list can be used to help in estimating the project. A number of activities (for example, data warehouse model development, data transformation logic) are dependent on the number of data elements. Using the data entities (and attributes if available) as a starting point provides the project manager with a basis for estimating the effort. For example, the formula for developing the data warehouse model may consist of the number of entities and attributes2

multiplied by the number of hours for each. The result can then be adjusted based on anticipated complexity, available documentation, an so on. While the formula for the first data warehouse effort may be very rough, if data is maintained on the actual effort, the formula can be refined, and the reliability of the estimates can be improved in future implementations.

 

集成的基础

在设计一个企业级的数据模型时,设计者很快会碰到同名异物(同样名称的实体或属性表示的不同事物)与同物异名(不同的实体或属性名称表示同一个事物)的问题。如图2.4,设计者会看到在“普通分类帐”和“订单明细”实体中都一个属性“帐目编号”,这两个属性是不是相同呢?也许不是,一个用于表示各种财务账目,另一个用于表示带有组织的客户账目。

类似的,在图2.5 中,订单明细和帐单实体都分别有叫账目编码和客户ID 的属性,是否是同一个意思呢?答案时也许是的。在创建数据模型的时候,设计者必须确定这些属性是同名异物的,并保证给他们不同的名字。(如果使用本章推荐的命名协定,在新的模型里不会产生同名异物的现象)。同样的,在模型里,一个属性必须由一个且只有一个名称来表示,所以设计者必须协调同物异名问题,给每个属性一个唯一的名字。这样,数据模型用于管理冗余的实体和属性,给每一个实例“全局”的名字,减少冗余。同时,数据模型对于同名异物情形下,清洗容易混淆及误用的实体、属性名称也非常有用。实体和属性有唯一的名称确保作为整体的企业不会做出错误的假设,这些假设将导致关于数据坏的决策。

Integration Foundation

In designing any enterprise’s data model, the designer will immediately run into situations where homonyms (entities or attributes that have the same name but mean very different things) and synonyms (entities or attributes that have different names but mean exactly the same thing) are encountered. In Figure 2.4, the designer may see that the General Ledger and the Order Entry systems both have an attribute called “Account Number.” Are these the same? Probably not! One is used to denote the field used for various financial accounts, and the other is used to denote the customer’s account with the organization. Similarly, in Figure 2.5, the Order Entry and Billing systems have attributes called Account Number and Customer ID, respectively. Are these the same? The answer is probably yes. In the data model being created, the designer must identify those attributes that are homonyms and ensure that they have distinctly different names. (If the naming convention for attributes recommended in this chapter is used, there will be no homonyms in the new models.) By the same token, an attribute must be represented once and only once in the model so the designer must reconcile the synonyms as well and represent each attribute by a single

 

2If the number of attributes is not known, an anticipated average number of attributes per entity can be used.

 

name. Thus, the data model is used to manage redundant entities and attributes rendering the “universal” name for each instance, reducing the redundancy in the environment. The data model is also very useful for clearing up confusing and misleading names for entities and attributes in the homonym situation as well. Ensuring that all entities and attributes have unique names guarantees that the enterprise as a whole will not make erroneous assumptions, which lead to bad decisions, about the data.

 

 

多项目协调

一个数据仓库程序由多个数据仓库项目组成,有时这些程序同时管理。当多个团队在数据仓库上工作时,主题域模型不能用于确定项目的交叠与裂缝部分。业务数据模型用于建立项目之间的交叠部分,用于每个项目要使用的数据。当一个实体被不只一个项目使用时,它的设计、定义与实现应该只分配给一个团队。由其他项目发现的这部分数据的变更由那个团队协调一致。数据模型还能帮助指出系统的裂缝,即实体和属性没有加入任何系统。是否所有的实体、属性、关系都在某地被创建?如果没有,你的系统真正有问题。在这个系统内他们是否被其他地方变更或使用?如果这样,在系统是否有正确的接口创建数据的流?最后,他们在系统内会被删除会丢弃吗?创建一个数据模型之间的交叉矩阵能帮助你回答这些问题。

Multiple Project Coordination

A data warehouse program consists of multiple data warehouse implementation projects, and sometimes several of these are managed simultaneously. When multiple teams are working on the data warehouse, the subject area model can be used to initially identify where the projects overlap and gaps that will remain following completion of the projects. The business data model is then used to establish where the projects overlap to fine-tune what data each project will use. Where the same entity is used by more than one project, its design, definition, and implementation should be assigned to only one team. Changes to that piece of data discovered by other projects can be coordinated by that team. The data model can also help to identify gaps in your systems where entities and attributes are not addressed at all. Are all entities, attributes, and relationships created somewhere? If not, you have a real problem in your systems. Are they updated or used somewhere else within the systems? If so, do you have the right interfaces between systems to handle the flow of created data? Finally, are they deleted or disposed of somewhere in your systems? The creation of a matrix based upon the crossing of your data model with your systems’ processes will give you a sound basis from which to answer these questions.

 

依赖识别

数据模型帮助识别不同的实体与属性之间的依赖关系。这种方式,用于帮助评估变更的影响。当你改变或创建一个过程,你必须能回答这个问题:它是否会影响到其他过程的数据。数据模型能帮助你确保依赖的实体和属性在设计和实现一个新系统或者系统变更时被考虑,

Dependency Identification

The data model helps to identify dependencies between various entities and attributes. In this fashion, it can be used to help assess the impact of change. When you change or create a process, you must be able to answer the question of whether it will have any impact on sets of data used by other processes. The data model can help ensure that dependent entities and attributes are considered in the design or implementation of new or changed systems.

 

冗余管理

数据模型尽力去处所有的冗余。每个实体、属性、关系在模型里只出现一次,除非他们在另一个实体里用作外键。在创建这个模型时,你能快速看到必须解决的叠交与冲突问题,同时还有在继续之前必须去除的冗余。在“关系建模指导方针”里提出的范式规则用于确保设计一个非冗余的模型。有非常多的理由在系统模型和技术模型里引入冗余,最常见的理由是查询的性能。明白何地及为和引入冗余非常重要,通过数据模型,冗余得到控制、提前关注、在整个设计阶段检验它的影响。

Redundancy Management

The business data model strives to remove all redundancies. Entities, attributes, and relationships appear only once in this model unless they are used as foreign keys into other entities. By creating this model, you can immediately see overlaps and conflicts that must be resolved, as well as redundancies that must be removed, before going forward. The normalization rules specified in the “Relational Modeling Guidelines” section are designed to ensure a nonredundant data model. There are many reasons to introduce redundancy back into system and technology data models; the most common one is to improve the performance of queries or requests for data. It is important to understand where and why any redundancy is introduced, and it is through the data model that redundancy can be controlled, thought out ahead of time, and examined for its impact on the overall design.

 

变更管理

数据模型同时作为记录实体、属性、关系变更的最好方式。系统建立之后,我们可能发现新的业务规则,需要增加实体和属性。当这些变更记录在技术模型和系统模型(如2.3 )时,必须强迫返回到数据模型链——到业务模型,甚至可能要到主题域模型本身。很明显,如果没有各个层次数据模型的变更控制,很快就会发生混乱,数据模型所有的好处都会丢失。

Change Management

Data models also serve as your best way to document changes to entities, attributes, and relationships. As systems are created, we may discover new business rules in effect and the need for additional entities and attributes. As these changes are documented in the technology and system data models (see Figure 2.3), these changes must be enforced all the way back up the data model chain—to the business data model and maybe even to the subject area diagram itself. Without solid change control over all levels of the data models, it should be clear that chaos will quickly take over and all the benefits of the data models will be lost.

 

系统模型

2.3中下一层次的数据模型由一系列系统模型组成。一个系统模型是一个信息的集合,这些信息与具体的系统或功能有关,如账单系统、数据仓库、或者数据集市,独立于任何具体的技术或者DBMS 环境。例如,账单系统和数据仓库系统模型很可能不会发现有每个企业感兴趣的数据片,因为系统模型是从业务模型开发而来的,它必须与业务模型一致。第4章更详细介绍数据仓库系统模型。我们构建的每个系统或数据库都有它自己唯一的系统模型,描述所支持的那个系统或功能模块具体的数据需求。一般每个系统只有一个系统模型,也就是说,数据仓库只有一个系统模型,账单系统只有一个系统模型,等等。我们可以选择物理上实现这个模型的很多版本(请见下一节关于技术模型的介绍),但是系统模型仍然只有一个,它用于实现真正的系统。

System Model

The next level of data models in Figure 2.3 consists of the set of system models. A system model is a collection of the information being addressed by a specific system or function such as a billing system, data warehouse, or data mart. The system model is an electronic representation of the information needed by that system. It is independent of any specific technology or DBMS environment. For example, the billing system and data warehouse system models will most likely not have every scrap of data of interest to the enterprise found in them. Because the system model is developed from the business data model, it must, by default, be consistent with that model. See Chapter 4 for more detail on the construction of the data warehouse system model. It is also important to note that there will be more than one system model. Each system or database that we construct will have its own unique system model denoting the specific data requirements for that system or the function it supports. Alternatively, there typically is only one system model per system. That is, there is only one system model for the data warehouse, one for the billing system, and so on. We may choose to physically implement many versions of the system model (see the next section on technology model) but still have only one system model from which to implement the actual system(s).

技术模型

最后要开发的模型是技术模型。这个模型是一个具体信息的集合,这些信息被存贮在一个特殊的系统里,用一个具体的平台实现。现在,我们必须考虑所有有关数据库的技术,包括:

硬件。你对平台的选择意味着依照你的平台技术,你必须考虑单个数据文件的尺寸,并且在技术模型里标记这些规格

数据库管理系统(DBMS。为数据仓库选择的数据库管理系统对你最终设计的数据库有很大影响,你必须做一些决定:

Technology Model

The last model to be developed is a technology model. This model is a collection of the specific information being addressed by a particular system and implemented on a specific platform. Now, we must consider all of the technology that is brought to bear on this database including:

Hardware. Your choice of platform means that you must consider the sizes of the individual data files according to your platform technology and notate these specifications in the technology model.

Database management system (DBMS). The DBMS chosen for your data warehouse will have a great impact upon the ultimate design of your database. You must make the following determinations:

■■反规范化的数量。一些DBMS环境在很少或没有反规范化时性能较好,另一些需要显著的反规范化以得到好的性能。

■■物化视图。依赖于你使用的DBMS技术,你可能创建物化视图或虚拟数据集市,用于加速查询性能。

■■分区策略。你应该使用分区加快数据装入数据仓库及交付到数据集市的速度,或者选择水平分区,或者选择垂直分区。第5章更详细的讨论这个话题。

■■索引策略。依赖于你的DBMS,索引策略有很多选择。位图、编码矢量、稀疏矩阵、哈西算法、簇、连接索引等,都是可选的。

■■参照完整性。绑定(DBMS绑定参照完整性——在父记录装入前不能装入子记录)与不绑定(装入数据到后备区,用程序检查完整性,然后再装入数据仓库)是两种选择,时间是一个限定因素。

■■数据交付技术。怎样把数据从数据仓库交付到各种数据集市对数据库的设计有影响,要考虑是通过门户交付还是管理的查询过程。

■■安全性。数据仓库常常包含高度敏感的数据。你可以选择在DBMS层,通过物理分离这部分数据来调用安全性,或者使用视图、存贮过程确保安全。如果数据是极端敏感的,你可能使用加密技术确保安全。

技术模型必须与控制系统模型一致,也就是说,从它的系统模型继承基本需求。同样地,技术模型里基础实体、属性、关系的任何变更,都必须返回影响模型链,如2.3所示(向上的箭头)。正象有很多系统模型一样——每个系统一个——每一个系统模型有很多技术模型。例如,你可以选择在分开的实例中实现企业数据仓库的子集;你可以选择按主题域实现数据——例如,分别为客户、产品、订单使用物理上不同的实例;或者你可以选择按地理区域分开数据子集——一个数据仓库用于北美,另一个用于欧洲,第三个用于亚洲。每一个物理实例有它自己的技术模型,这些技术模型都基于系统模型,而根据你选择的实现技术进行修改。

 

■■ Amount of denormalization. Some DBMS environments will perform better with minimal or no denormalization; others will require significant denormalization to achieve good performance.

■■ Materialized views. Depending on the DBMS technology you use, you may create materialized views or virtual data marts to speed up query performance.

■■ Partitioning strategy. You should use partitioning to speed up the loading of data into the data warehouse and delivery to the data marts. You have two choices—either horizontal or vertical partitioning. Chapter 5 discusses this topic in more detail.

■■ Indexing strategy. There are many choices, depending on the DBMS you use. Bitmap, encoded vector, sparse, hashing, clustered, and join indexes are some of the possibilities.

■■ Referential integrity. Bounded (the DBMS binds the referential integrity for you—you can’t load a child until the parent is loaded) and unbounded (you load the data in a staging area to programmatically check for integrity and then load it into the data warehouse) are two possibilities. You must make sure that time is one of the qualifiers.

■■ Data delivery technology. How you deliver the data from the data warehouse into the various data marts will have an impact on the design of the database. Considerations include whether the data is delivered via a portal or through a managed query process.

■■ Security. Many times the data warehouse contains highly sensitive data. You may choose to invoke security at the DBMS level by physically separating this data from the rest, or you can use views or stored procedures to ensure security. If the data is extremely sensitive, you may choose to use encryption techniques to secure the data. The technology model must be consistent with the governing system model. That is, it inherits its basic requirements from its system model. Likewise, any changes in the fundamental entities, attributes, and relationships discovered as the technology model is implemented must be reflected back up the chain of models as shown in Figure 2.3 (upward arrows). Just as there are many system models—one per system—there may be multiple technology models for a single system model. For example, you may choose to implement subsets of the enterprise data warehouse in physically separate instances. You may choose to implement data by subject area—for example, using a physically different instance for customer, product, and order. Or you may choose to separate subsets of data by geographic area—one warehouse for North America, another for Europe, and a third for Asia. Each of these physical instances will have its own technology model that is based upon the system model and modified according to the technology upon which you implement.

 

关系数据建模方针

数据建模是一个非常抽象的过程,不是所有的IT专家都拥有创建一个坚实的模型的资格。数据建模员需要有概念化的能力,把那些关于企业需要用于处理业务及包含在其中的规则的无形思想概念化。而且,数据建模是非确定性的——只有一个正确的方式建立一个数据模型,而错误的方式有很多。关于数据建模一个普遍关心的问题是发生变更的数量。当我们越来越了解企业,这些知识会反过来导致已有的数据模型的变更。数据建模员不要把这个看作威胁,而必须准备好变更,并欣然接受它——事实上,模型是更有洞察力地、更接近地象征一个企业整体。

数据建模员在创建各种数据模型时,必须坚持一系列原则和规则。我们已经建议,在你开始建模实践前,建立这些“基础规则”,以免后面的混淆和情绪上的争论。对这些规则的任何背离都必须记录文档及列出例外的原因,之后采取的用于减少或消除这些例外的任何行动也都要记录文档。最后,数据建模需要判断力,即使判断力的理由不清晰或不能文档化。当出现这种情况,数据建模员应该重新检查后面列出的三个指导方针。如从一个模型增加或者删除某些东西,能提供模型的沟通效用与能力,那么就应该那样做。

本书的目标是确保读者在开始数据仓库设计前有一个坚实的基础并一步一步指导读者处理这些问题。下面这些指导意见是我们从事多年建模经验中得来的,让我们开始吧。

 

Relational Data-Modeling Guidelines

Data modeling is a very abstract process, and not all IT professionals have the qualifications to create a solid model. Data modelers require the ability to conceptualize intangible notions about what the business requires to perform its business and what its rules are in doing business. Also, data modeling is nondeterministic—there is one right way to create a data model. There are many wrong ways.

A common concern in data modeling is the amount of change that occurs. As we learn more and more about the enterprise, this knowledge will be reflected in changes to the existing data models. Data modelers must not see this aspect as a threat but rather be prepared for change and embrace it as a good sign—a sign that the model is, in fact, more insightful and that it more closely resembles the enterprise as a whole.

Data modelers must adhere to a set of principles or rules in creating the various data models. It is recommended that you establish these “ground rules” before you start your modeling exercise to avoid confusion and emotional arguments later on. Any deviation from these rules should be documented and the reasons for the exception noted. Any mitigating or future actions that reduce or eliminate the exception later on should be documented as well. Finally, data modeling also requires judgment calls even when the reasons for the judgment are not clear or cannot be documented. When faced with this situation, the data modeler should revisit the three guidelines described in the next section. If adding or deleting something from the model improves its utility or ability to be communicated, then it should be done.

It is the goal of this book to ensure that you have the strong foundation and footing you need to deal with these issues before you begin your data warehouse design. Let’s start with a set of guidelines garnered from the many years of data modeling we have performed.

 

方针与最佳实践

数据模型的目标是完全地、准确地反应数据需求与业务规则,使业务高效处理。为了这个目标,我们相信在数据模型设计时应该遵守三个方针:

沟通工具。数据模型应该用作业务团体和IT成员之间的沟通工具。数据需求必须很好的形成文档,并且被所有有关人员理解,必须面向业务,必须包含适当的细节层次。数据模型应当用于传达业务团体的关于企业数据的观点,给实现系统的技术人员。当开发这些模型时,目标必须始终清晰准确。当往一个数据模型增加信息时,建模员必须问自己,是会提高清晰性还是降低清晰性。

粒度的层次。数据模型应该反应企业使用的信息的“最小公分母”。聚合、导出、汇总数据元素应该从这些基本部分分解出来,无必要的冗余和重复数据元素应该去除。当我们为了使用和性能要求,通过增加向后的聚合、导出项、汇总项等进行“反规范化”时,我们清楚地知道每个组件里加入了什么数据元素。换句话说,数据必须根据需要细化,以支持本性和最终的使用。而最终的技术模型可能有一些明显的聚合、汇总、导出项,这些都要数据建模文档指向最初的明细数据。

面向业务。模型本身就表示了企业视图,而不需要物理限制。我们总是努力按业务的需要建模,而不是根据现有的系统、技术、数据库等限制业务来建模。不以业务团体的要求为基础的项目往往必定要失败。我们常常没有和业务团体一致,因为我们简单地相信我们已经知道分析的结果是什么(类似于“如果我们建好了,他们就会来”的信念)。这些方针应该永远铭记在建模员心中,从他/她从事建模开始。只要一出现问题或意见,建模员应当回想这些方针,确定模型的能力是增加了还是减少了。

把这些记在心中,让我们看看数据建模的最佳实践有哪些:

 

Guidelines and Best Practices

The goal of any data model is to completely and accurately reflect the data requirements and business rules for handling that data so that the business can perform its functions effectively. To that end, we believe that there are three guidelines that should be followed when designing your data models:

Communication tool. The data models should be used as a communication tool between the business community and the IT staff and within the IT staff. Data requirements must be well documented and understood by all involved, must be business-oriented, and must consist of the appropriate level of detail. The data model should be used to communicate the business community’s view of the enterprise’s data to the technical people implementing their systems. When developing these models, the objectives must always be clarity and precision. When adding information to a data model, the modeler should ask whether the addition adds to clarity or subtracts from it.

Level of granularity. The data models should reflect the “lowest common denominator” of information that the enterprise uses. Aggregated, derived, or summarized data elements should be decomposed to their basic parts, and unnecessary redundancy or duplication of data elements should be removed. When we “denormalize” the model by adding back aggregations, derivations, or summarization according to usage and performance objectives, we know precisely what elements went into each of these components. In other words, the data should be as detailed as necessary to understand its nature and ultimate usage. While the ultimate technology model may have significant aggregations, summarizations, and derivations in it, these will be connected back to the ultimate details through the data modeling documentation.

Business orientation. It is paramount that the models represent the enterprise’s view of itself without physical constraints. We strive always to model what the business wants to be rather than model what the business is forced to be because of its existing systems, technologies, or databases. Projects that are not grounded in what the business community wants are usually doomed to fail. Generally, we miss the boat with our business community because we cut corners in the belief that we already know what the results of analysis will be (the “if we build it, they will come” belief). These guidelines should always be at the forefront of the modeler’s mind when he or she commences the modeling process. Whenever questions or judgment calls come into play, the modeler should fall back to these guidelines to determine whether the resolution adds or detracts to the overall usability of the models.

With these in mind, let’s look at some of the best practices in data modeling:

 

业务用户参与。首先,必须了解,业务团体必须花时间与资源,帮助建立各种模型,数据建模不只是IT人员的技术活。如果业务团体不能找到时间,拒绝参与,或者宣称IT人员应当对他们需要的数据“负责”,那么需要一个明智的项目经理把他们拉进来。缺席业务团体的数据建模是浪费时间、资源和精力,而且极有可能失败。而且,业务团体加入得越快越好。第一步,你必须决定业务团体的哪些人应该加入进来。这些人可能愿意也可能不愿意参加,如果他们公开抵制,你可能需要举办一些培训,消除他们的恐惧,或者寻找其他资源。典型的参与者包括发起人高管、具有领域专业的经理、业务分析师。

面谈与小型会议。在短时间内取得大量信息的一个最普通的方式是面谈及小型会议。面谈一般从一个多两个人那里得到信息,会议可能得到更深的信息。小型会议一般有5——10人参加,一般用于决定方向或取得一致意见,或者用于培训。这些会议的记录文档被确认,并添加到信息库,作为数据模型的依据。

验证。提交的模型然后通过面谈或小型会议直接反馈,或者通过正式的通道确认。请一些业务团体成员确认时,你可能更关注于业务规则与限制,而不是真正的数据模型本身;而对另一些人,你应当确认真正的数据模型结构和关系是否正确。

数据模型维护。在任何建模工作中,变更都是常有的,你应当准备应对这种情况。变更管理应当制定正式的规章,包含签入、签出过程,正式的变更请求及解决冲突的过程。

知道什么时候“适可而止”。也许数据建模员应该学会的最重要的实践是什么时候说模型已经足够好。因为我们正在设计一个抽象的、有争议的结构,数据建模员很容易发现自己处于“分析麻痹”状态。数据模型什么时候算完成了?永远没有!因此,建模员必须做一个艰难的决定:模型是否足够支持已经实现的功能需求。知道变更会发生,并且准备下次处理变更。

 

Business users’ involvement. It must be understood up front that the business community must set aside time and resources to help create the various data models; data modeling is not just a technical exercise for IT people. If the business community cannot find the time, refuses to participate, or basically declares that IT should “divine” what data they need, it is the wise project manager who pulls the plug on the project. Data modeling in a business community vacuum is a waste of time, resources, and effort, and is highly likely to fail. Furthermore, the sooner the business community gets involved, the better. As a first step, you must identify who within the business community should be involved. These people may or may not be willing to participate. If they are openly resistant, you may need to perform some education, carry out actions to mitigate their fears, or seek another resource. Typical participants are sponsoring executives, managers with subject matter expertise, and business analysts.

Interviews and facilitated sessions. One of the most common ways to get a lot of information in a short amount of time is to perform interviews and use facilitated sessions. The interviews typically obtain information from one or two people at a time. More depth information can be obtained from these sessions. The facilitated sessions are usually for 5 to 10 attendees and are used to get general direction and consensus, or even for educational purposes. The documentation from these sessions is verified and added to the bank of information that contributes to the data models.

Validation. The proposed data model is then verified by either immediate feedback from the interviews or facilitated sessions, or by formal walkthroughs. It may be that you focus on just the verification of the business rules and constraints rather than the actual data model itself with some of the business community members. With others though, you should verify that the actual data model structures and relationships are appropriate.

Data model maintenance. Because change becomes a common feature in any modeling effort, you should be prepared to handle these occurrences. Change management should be formalized by documented procedures that have check-in and check-out processes, formal requests for changes, and processes to resolve conflicts.

Know when “enough is enough.” Perhaps the most important practice any data modeler should learn is when to say the model is good enough. Because we are designing an abstract, debatable structure, it is very easy for the data modeler to find him- or herself in “analysis paralysis.” When is the data model finished? Never! Therefore it is mandatory that the modeler make the difficult determination that the model is sufficient to support the needs of the function being implemented, knowing that changes will happen and that he or she is prepared to handle them at a later date.

 

规范化

规范化是一个确保数据模型满足准确性、一致性、简单性、非冗余性、稳定性等目标的方法,它是一个物理数据库设计技术,在关系技术中应用数学规则,定义和减少插入、修改、删除异常。

我们常用的第三范式简单的定义是:所有属性必须依赖于键,所有键,不能依赖除了键之外的其他属性。规范化是一种从根本上保证属性在合适的实体里,使设计对关系型DBMS有用及有效。我们会在本章的下一节讨论数据模型设计中规范化的过程。规范化有如下特征:

■■验证数据模型结构的正确性与一致性。

■■独立于任意物理限制。

■■最小化存贮空间,通过减少数据在多个地方存贮而实现。

最后,规范化带来:

■■移除数据的不一致性,因为数据制存贮一次,因此消除了数据冲突的可能性。

■■减少插入、修改、删除异常,因为数据只存贮一次。

■■提高数据结构稳定性,因为属性放入实体位置是基于内部的特性而不是具体的应用需求。

 

Normalization

Normalization is a method for ensuring that the data model meets the objectives of accuracy, consistency, simplicity, nonredundancy, and stability. It is a physical database design technique that applies mathematical rules to the relational technology to identify and reduce insertion, update, or deletion anomalies.

The mantra we use to get to third normal form is that all attributes must depend on the key, the whole key, and nothing but the key—to put it simply. Fundamentally this means that normalization is a way of ensuring that the attributes are in the proper entity and that the design is efficient and effective for a relational DBMS. We will walk through the steps to get to this data model design in the next sections of this chapter. Normalization has these characteristics as well:

■■ Verification of the structural correctness and consistency of the data model

■■ Independence from any physical constraints

■■ Minimization of storage space requirement by eliminating the storage of data in multiple places

Finally, normalization:

■■ Removes data inconsistencies since data is stored only once, thus eliminating the possibility of conflicting data

■■ Diminishes insertion, updating, and deletion anomalies because data is stored only once

■■ Increases the data structure stability because attributes are positioned in entities based on their intrinsic properties rather than on specific application requirements

 

关系数据模型的规范化

在业务数据模型里规范化非常有用:

■■它不指示任何物理的处理方向,因此使业务模型有一个好的开始,它面向所有的应用和数据库。

■■它减少聚合、汇总或导出元素,保证在数据模型里没有隐含的过程。

■■它禁止所有的属性和实体的复制或冗余的发生。

系统模型与技术模型从业务数据模型继承了这些属性,这样开始一个完全规范化的数据模型。然而,因为各种原因,有时需要加入反规范化的属性,这会在第3章与第4章介绍,最要的是必须认识到何时、何地进行反规范化,并记录这样做的原因。失去控制的冗余和反规范化会导致混乱及不可执行的数据库设计。

规范化应当在业务数据模型设计阶段采用,然而,有一点非常重要,你不应该修改业务规则以适应规范化的严格限制,就是说,不要仅仅为了满足规范化而创建对象。

Normalization of the Relational Data Model

Normalization is very useful for the business data model because:

■■ It does not instruct any physical processing direction, thus making the business model a good starting place for all applications and databases.

■■ It reduces aggregated, summarized, or derived elements to their basic components, ensuring that no hidden processes are contained in the data model.

■■ It prevents all duplicated or redundant occurrences of attributes and entities.

The system and technology models inherit their characteristics from the business data model and so start out as a fully normalized data model. However, denormalized attributes will be designed into these data models for a variety of reasons, as described in Chapters 3 and 4, and it is important to recognize where and when the denormalization occurs and to document the reasons for that denormalization. Uncontrolled redundancy or denormalization will result in a chaotic and nonperforming database design.

Normalization should be undertaken during the business data model design. However, it is important to note that you should not alter the business rules just to follow strict normalization rules. That is, do not create objects just to satisfy normalization.

第一范式

第一范式(1NF)是建立数据模型的第一步——属性依赖于键。这需要两个条件:所有实体都有一个主键,唯一标示一个对象;实体不包含会反复出现的或者多值的组。每一个属性应该在它的最低层次的细节,并且有唯一的意义和名字。1NF是所有其他规范化技术的基础。图2.5显示模型如何转换为1NF

First Normal Form

First normal form (1NF) takes the data model to the first step described in our mantra—the attribute is dependent on the key. This requires two conditions— that every entity have a primary key that uniquely identifies it and that the entity contain no repeating or multivalued groups. Each attribute should be at its lowest level of detail and have a unique meaning and name. 1NF is the basis for all other normalization techniques. Figure 2.6 shows the conversion of our model to 1NF.

 

在图2.5中,我们发现在课程实体里包含一些课程安排的属性,而不是课程本身的属性(课程安排,阶段、教授标识、教授名称)。这些属性不依赖于课程实体的键而存在,因此,应该把他们放到自己的实体内(课程安排)。

In Figure 2.6, we see that the Course entity contains the attributes that deal with a specific offering of the course rather than the generic course itself (Course Offering, Period, Professor Identifier, and Professor Name). These attributes are not dependent on the Course entity key for their existence, and therefore should be put into their own entity (Course Offering).

 

第二范式

第二范式(2NF)对模型进行进一步提炼,根据我们的宗旨:属性必须依赖于所有键。为达到2NF,实体必须满足1NF,而且每一个非主键属性必须依赖整个主键而存在。2NF 进一步减少可能的冗余,通过移除依赖于部分主键的属性,把他们放到自己单独的实体里。注意学科名称仅仅依赖于学科标识,如果保留在模型里,学科标识和学科名称必须为每一个课程重复一次。把他们放在自己的实体里,他们只需要存贮一次。图2.7 显示了把模型转换为2NF

Second Normal Form

Second normal form (2NF) takes the model to the next level of refinement according to our mantra—the attributes must be dependent on the whole key. To attain 2NF, the entity must be in 1NF and every nonprimary attribute must be dependent on the entire primary key for its existence. 2NF further reduces possible redundancy in the data model by removing attributes that are dependent on part of the key and placing them in their own entity. Notice that Discipline Name was only dependent on the Discipline Identifier. If this remains in the model, then Discipline Identifier and Name must be repeated for every course. By placing these in their own entity, they are stored only once. Figure 2.7 shows the conversion of our model to 2NF.

 

第三范式

第三范式(3NF)使模型达到最后层次的提高,根据我们的宗旨——属性不能依赖于任何主键外的属性。为达到3NF ,实体必须满足2NF,而且非键属性必须仅仅依赖于主键,而不依赖于实体里的任何其他属性而存在。这样移除了传递依赖,也就是非主键属性不只依赖于主键,同时也依赖于其他非主键属性的情况。图2.8显示模型转换为3NF的情形,注意课程安排教授(标识)与课程安排教授名称是重复属性,教授名称和教授标识都不依赖于课程安排,因此,我们把这些属性从课程安排实体移出,把他们放入自己独立的实体,实体名为教授。在这一点,为任何技术实现做好了准备——帐务、订单、普通分类帐等操作型系统,数据仓库、数据集市等商业智能,或者操作型存贮等其它环境。

Third Normal Form

Third normal form (3NF) takes the data model to the last level of improvement referred to in our mantra—the attribute must be dependent on nothing but the key. To attain 3NF, the entity must be in 2NF, and the nonkey fields must be dependent on only the primary key, and not on any other attribute in the entity, for their existence. This removes any transitive dependencies in which the nonkey attributes depend on not only the primary key but also on other nonkey attributes. Figure 2.8 shows the conversion of our model to 3NF. In Figure 2.8, notice that Course Offering Professor and Course Offering Professor Name are recurring attributes. Neither the Professor Name or the Professor Identifier depend on the Course Offering. Therefore, we remove these attributes from the Course Offering entity and place them in their own entity, titled Professor. At this point, the data model is in 3NF in which all attributes are dependent on the key, the whole key, and nothing but the key. Your business data model should be presented in 3NF at a minimum. At this point, it is ready for use in any of your technological implementations— operational systems such as billing, order entry, or general ledger (G/L); business intelligence such as the data warehouse and data marts; or any other environment such as the operational data store.

 

 

其他范式

我们为一个组织设计业务模型时,常常到3NF为止。然而,你应该注意到,其它层次的规范化及他们好处决定了3NF是否能满足你的组织。有几本好的数据建模书讨论BC范式、第四范式和第五范式。我们不会在本书的后面章节讨论这些。

 

注意

我们再次警告你不要过度热心的使用这些规范化技术。换句话说,不要过度规范化你的模型。你应该使用结构化的一致性平衡考虑业务意义,你应该永远把模型基于业务概念放在首位,然后使用规范化技术去验证结构的完整性和一致性。

 

Other Normalization Levels

We usually stop with 3NF when we design the business model for organizations. However, you should be aware of other levels of normalization and their benefits to determine if 3NF will be sufficient for your organization. There are several good data-modeling books that discuss the merits of Boyce/Codd and fourth and fifth normal forms. We will not discuss these further in this book.

 

WARNING

We caution you against overzealous usage of these normalization techniques. In other words, don’t overnormalize your models. You should balance the consideration of business meanings with structural consistency. You should always base your model on business concepts first, then apply the normalization techniques to verify the structural integrity and consistency.

小结

我们在这一章讨论了数据模型作为公司资产是非常必要的。没有这一套数据模型,业务用户和创建系统的技术人员不能开发一个复杂的、准确的东西来表示信息结构、业务规则、数据之间的关系。只有在设计时记住重用、一致性、完整性等概念,并严格遵守本章提出的建模技术才能完成。我们创建各种数据模型来满足这个需要:从主题域图形开始,到业务数据模型、到系统模型、到技术模型——每一类模型定义一个不同的抽象和转换层次——最后引导致一个一致的、完全集成的数据库模式。我们看到主题域图形可以很快开发出来,一般在几天内就可以完成;全部业务模型需要稍长一些时间,包含整个企业的业务规则,能被数据仓库的所有应用使用。有一点非常重要,如果你的组织没有这些模型,我们建议你创建一个企业范围内的主题域图,而把焦点放在数据仓库需要的那些主题上,用来建立业务模型。当新的主题域加入到数据仓库时,你可以继续把他们加入数据模型,而不会牵制到整个数据模型。第4章会介绍更多这方面的信息。

Summary

We have discussed in this chapter the fact that data models are essential for managing data as a corporate asset. Without the set of data models described here, the business users and technical staff creating systems cannot develop a comprehensive and precise representation of information structures, business rules, and relationships among the data. This can only be accomplished when the databases are designed with the concept of reusability, consistency, and integration in mind and with rigorous of compliance to the modeling techniques contained in this chapter. We covered the need for various data models, starting with a subject area diagram, and migrating to a business data model, the system, and the technology models—each one defining a different level of abstraction and transformation— ultimately leading to a coordinated and fully integrated database schema. We see that the subject area diagram can be developed very quickly, usually within a few days. The full business data model will take a bit longer and, when fully developed, contains the business rules for the entire enterprise and can be used by all applications, including the data warehouse. It is important to note here that if your organization does not have either of these models, we recommend that you create an enterprise-wide subject area diagram but focus on only the subject area(s) needed for your data warehouse for the business data model. You will continue to fill out the business and data warehouse data models as new areas are needed in the data warehouse but should not get sidetracked into trying to create the entire business data model before you generate your first data warehouse subject area. See Chapter 4 for more information on this.

 

开发任何模型时,我们必须记住并遵循三个方针:数据模型是沟通的工具,包含最低明细级的数据,必须面向企业。当面对不同的决策时,应该考虑这三个方针。我们还在本章学到规范化及其给数据库设计带来的好处,我们建议你的业务模型满足3NF。在3NF,属性的依赖关系如下:

1NF。属性依赖于键,通过移出重复的组得到。

2NF。属性依赖于全部主键,通过移出依赖于部分主键的属性得到。

3NF。属性只依赖于键,通过移出依赖于非主键的属性而得到。

我们对过度规范化及分析麻痹提出警告。最后,键保持模型的完整性、一致性、可重用性。这些模型快速产生稳定的、可维护的数据库(或者数据库的子集),这样,才可以继续数据仓库的分析和交付工作。

There are three important guidelines to follow in developing any of the models we discuss. These are to remember that the data model is a communication tool, that it contains the lowest common denominator of detail, and that it reflects a solid business orientation. When confronted with difficult decisions, these three guidelines should rule the day. We also learned in this chapter about normalization and its benefits for database design. Our recommendation is to develop your business data model in 3NF. In 3NF, attributes are dependent:

1NF. On the key, accomplished by removing repeating groups

2NF. The whole key, accomplished by removing attributes dependent on part of the key

3NF. Nothing but the key, accomplished by removing attributes dependent on nonkey attributes

We warn against overzealous normalization and analysis paralysis. At the end of the day, the key is to get a set of models that is fully integrated, consistent, and reusable. These models produce stable and maintainable databases (or a subset of them) quickly so that work can proceed on providing timely business deliverables from the data warehouse.

 

 

原创粉丝点击