Normalization vs. Denormalization


Denormalization is the process of attempting to optimize the read performance of a database by adding redundant data or by grouping data.[1][2] In some cases, denormalization helps cover up the inefficiencies inherent in relational database software. A relational normalized database imposes a heavy access load over physical storage of data even if it is well tuned for high performance.

A normalized design will often store different but related pieces of information in separate logical tables (called relations). If these relations are stored physically as separate disk files, completing a database query that draws information from several relations (a join operation) can be slow. If many relations are joined, it may be prohibitively slow. There are two strategies for dealing with this. The preferred method is to keep the logical design normalized, but allow the database management system (DBMS) to store additional redundant information on disk to optimize query response. In this case it is the DBMS software's responsibility to ensure that any redundant copies are kept consistent. This method is often implemented in SQL as indexed views (Microsoft SQL Server) or materialized views (Oracle). A view represents information in a format convenient for querying, and the index ensures that queries against the view are optimized.

The more usual approach is to denormalize the logical data design. With care this can achieve a similar improvement in query response, but at a cost: it is now the database designer's responsibility to ensure that the denormalized database does not become inconsistent. This is done by creating rules in the database, called constraints, that specify how the redundant copies of information must be kept synchronized. It is the increase in logical complexity of the database design and the added complexity of the additional constraints that make this approach hazardous. Moreover, constraints introduce a trade-off, speeding up reads (SELECT in SQL) while slowing down writes (INSERT, UPDATE, and DELETE). This means a denormalized database under heavy write load may actually offer worse performance than its functionally equivalent normalized counterpart.
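The trade-off above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a recommended design: the posts and comments tables, the comment_count column, and the trigger names are all hypothetical, and triggers are assumed here as the mechanism for the synchronization rules. Moving a comment to another post (an UPDATE of post_id) is deliberately not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    comment_count INTEGER NOT NULL DEFAULT 0  -- redundant: derivable from comments
);
CREATE TABLE comments (
    id      INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),
    body    TEXT NOT NULL
);
-- Rules that keep the redundant count synchronized with the comments table.
CREATE TRIGGER comments_after_insert AFTER INSERT ON comments
BEGIN
    UPDATE posts SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
END;
CREATE TRIGGER comments_after_delete AFTER DELETE ON comments
BEGIN
    UPDATE posts SET comment_count = comment_count - 1 WHERE id = OLD.post_id;
END;
""")

conn.execute("INSERT INTO posts (id, title) VALUES (1, 'Denormalization')")
conn.executemany(
    "INSERT INTO comments (post_id, body) VALUES (?, ?)",
    [(1, "first"), (1, "second"), (1, "third")],
)
conn.execute("DELETE FROM comments WHERE body = 'second'")

# Denormalized read: a single-row lookup, no join or aggregation.
stored_count = conn.execute(
    "SELECT comment_count FROM posts WHERE id = 1"
).fetchone()[0]

# Normalized read: the same answer, computed from the comments table.
computed_count = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE post_id = 1"
).fetchone()[0]

print(stored_count, computed_count)
```

Every INSERT or DELETE on comments now also writes to posts, which is the write-side cost the paragraph describes. The read of comment_count, in exchange, never touches the comments table.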

A denormalized data model is not the same as a data model that has not been normalized. Denormalization should take place only after a satisfactory level of normalization has been reached and after any constraints or rules needed to deal with the design's inherent anomalies have been created. For example, all the relations should be in third normal form, and any relations with join and multi-valued dependencies should be handled appropriately.

Examples of denormalization techniques include:

- Materialized views, which may implement the following:
  - Storing the count of the "many" objects in a one-to-many relationship as an attribute of the "one" relation
  - Adding attributes to a relation from another relation with which it will be joined
- Star schemas, which are also known as fact-dimension models and have been extended to snowflake schemas
- Prebuilt summarization or OLAP cubes
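As a rough sketch of the last two techniques, the following builds a tiny star schema and a prebuilt summary table with Python's sqlite3 module. The table names, columns, and sample rows are hypothetical. The summary here is an ordinary table created from a query, not a DBMS-maintained materialized view, so it goes stale until it is rebuilt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes of each product.
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
-- Fact table: one row per sale, pointing at the dimension.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    amount     INTEGER NOT NULL  -- in cents, to avoid float rounding
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'games');
INSERT INTO fact_sales (product_id, amount) VALUES
    (1, 1000), (2, 2500), (3, 4000), (3, 1500);

-- Prebuilt summary: the join and aggregation are paid once, at build time.
CREATE TABLE summary_sales_by_category AS
    SELECT d.category AS category, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category;
""")

# Reporting queries read the small summary table instead of re-joining.
summary = dict(conn.execute(
    "SELECT category, total_amount FROM summary_sales_by_category"
))
print(summary)
```

Any sale recorded after the summary is built is missing from it, so a design like this needs a rebuild schedule or synchronization rules of the kind discussed earlier.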

Denormalization techniques are often used to improve the scalability of Web applications.[3]

Original source: http://en.wikipedia.org/wiki/Denormalization

Example: a shopping cart order

Suppose that we are designing a schema for a shopping cart application. Our application stores orders in MongoDB, but what information should an order contain?

Normalized schema

A product:

{
    "_id" : productId,
    "name" : name,
    "price" : price,
    "desc" : description
}

An order:

{
    "_id" : orderId,
    "user" : userInfo,
    "items" : [
        productId1,
        productId2,
        productId3
    ]
}

We store the _id of each item in the order document. Then, when we display the contents of an order, we query the orders collection to get the correct order and then query the products collection to get the products associated with our list of _ids. There is no way to get the full order in a single find query with this schema.

If the information about a product is updated, all of the documents referencing this product will “change,” as these documents merely point to the definitive document.

Normalization gives us slower reads and a consistent view across all orders; multiple documents can atomically change (as only the referenced document is actually changing).
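The normalized read path can be sketched in plain Python. The dictionaries below are in-memory stand-ins for the products and orders collections, not a MongoDB client, and the sample data and the load_order helper are hypothetical.

```python
# In-memory stand-ins for the "products" and "orders" collections.
products = {
    "p1": {"_id": "p1", "name": "Pen", "price": 2, "desc": "Blue ballpoint"},
    "p2": {"_id": "p2", "name": "Notebook", "price": 5, "desc": "A5, ruled"},
    "p3": {"_id": "p3", "name": "Stapler", "price": 12, "desc": "Desktop stapler"},
}
orders = {
    "o1": {"_id": "o1", "user": "alice", "items": ["p1", "p2"]},
    "o2": {"_id": "o2", "user": "bob", "items": ["p2", "p3"]},
}


def load_order(order_id):
    """Resolve an order's product references; returns (order, items, lookups)."""
    order = orders[order_id]                            # lookup 1: the order
    items = [products[pid] for pid in order["items"]]   # lookup 2: products by _id
    return order, items, 2


order, items, lookups = load_order("o1")
names_before = [item["name"] for item in items]

# One write to the definitive product document...
products["p2"]["name"] = "Notebook (A5)"

# ...is seen by every order that references it.
names_o1 = [p["name"] for p in load_order("o1")[1]]
names_o2 = [p["name"] for p in load_order("o2")[1]]
print(lookups, names_before, names_o1, names_o2)
```

Displaying an order costs two lookups, but the rename is a single write and both orders reflect it immediately.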

Denormalized schema

A product (same as previous):

{
    "_id" : productId,
    "name" : name,
    "price" : price,
    "desc" : description
}

An order:

{
    "_id" : orderId,
    "user" : userInfo,
    "items" : [
        {"_id" : productId1, "name" : name1, "price" : price1},
        {"_id" : productId2, "name" : name2, "price" : price2},
        {"_id" : productId3, "name" : name3, "price" : price3}
    ]
}

We store the product information as an embedded document in the order. Then, when we display an order, we only need to do a single query.

If the information about a product is updated and we want the change to be propagated to the orders, we must update every order that embeds that product separately.

Denormalization gives us faster reads and a less consistent view across all orders; product details cannot be changed atomically across multiple documents.
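The denormalized counterpart can be sketched the same way. Again the dictionaries are in-memory stand-ins for collections, and the sample data and the embed and rename_product helpers are hypothetical.

```python
products = {
    "p1": {"_id": "p1", "name": "Pen", "price": 2, "desc": "Blue ballpoint"},
    "p2": {"_id": "p2", "name": "Notebook", "price": 5, "desc": "A5, ruled"},
}


def embed(product_id):
    """Copy the fields an order displays out of the product document."""
    product = products[product_id]
    return {key: product[key] for key in ("_id", "name", "price")}


orders = {
    "o1": {"_id": "o1", "user": "alice", "items": [embed("p1"), embed("p2")]},
    "o2": {"_id": "o2", "user": "bob", "items": [embed("p2")]},
}

# Reading an order is a single lookup; the product details are already inside it.
o1_names = [item["name"] for item in orders["o1"]["items"]]


def rename_product(product_id, new_name):
    """Update the product, then every order that embeds a copy of it.

    Returns the number of documents written. Each write is a separate step,
    so a reader could observe some orders updated and others not.
    """
    products[product_id]["name"] = new_name
    writes = 1
    for order in orders.values():
        touched = False
        for item in order["items"]:
            if item["_id"] == product_id:
                item["name"] = new_name
                touched = True
        if touched:
            writes += 1
    return writes


writes = rename_product("p2", "Notebook (A5)")
print(o1_names, writes)
```

The read is one lookup, but the same rename now takes three writes (the product plus two orders), and the number grows with every order that embeds the product.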

So, given these options, how do you decide whether to normalize or denormalize?

Decision factors

There are three major factors to consider:

• Are you paying a price on every read for the very rare occurrence of data changing? You might read a product 10,000 times for every one time its details change. Do you want to pay a penalty on each of 10,000 reads to make that one write a bit quicker or guaranteed consistent? Most applications are much more read-heavy than write-heavy: figure out what your proportion is.

  How often does the data you’re thinking of referencing actually change? The less it changes, the stronger the argument for denormalization. It is almost never worth referencing seldom-changing data such as names, birth dates, stock symbols, and addresses.

• How important is consistency? If consistency is important, you should go with normalization. For example, suppose multiple documents need to atomically see a change. If we were designing a trading application where certain securities could only be traded at certain times, we’d want to instantly “lock” them all when they were untradable. Then we could use a single lock document as a reference for the relevant group of securities documents. This sort of thing might be better to do at an application level, though, as the application will need to know the rules for when to lock and unlock anyway.

  Consistency also matters in applications where inconsistencies are difficult to reconcile. In the orders example, we have a strict hierarchy: orders get their information from products, and products never get their information from orders. If there were multiple “source” documents, it would be difficult to decide which should win.

  However, in this (somewhat contrived) order application, consistency could actually be detrimental. Suppose we want to put a product on sale at 20% off. We don’t want to change any information in the existing orders; we just want to update the product description. So, in this case, we actually want a snapshot of what the data looked like at a point in time (see “Tip #5: Embed ‘point-in-time’ data” in the source book).

• Do reads need to be fast? If reads need to be as fast as possible, you should denormalize. In this application, they don’t, so this isn’t really a factor. Real-time applications should usually denormalize as much as possible.
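The point-in-time case from the consistency factor can be sketched as follows. The product data and the place_order helper are hypothetical, prices are assumed to be integer cents, and the sale is modeled as a price change purely for illustration.

```python
product = {"_id": "p1", "name": "Pen", "price": 200}  # price in cents


def place_order(order_id, user, product_doc):
    """Embed a copy of the product fields as they are at order time."""
    snapshot = {key: product_doc[key] for key in ("_id", "name", "price")}
    return {"_id": order_id, "user": user, "items": [snapshot]}


order_before_sale = place_order("o1", "alice", product)

# Put the product on sale at 20% off; only the product document is written.
product["price"] = product["price"] * 80 // 100

order_during_sale = place_order("o2", "bob", product)

price_paid_before = order_before_sale["items"][0]["price"]
price_paid_during = order_during_sale["items"][0]["price"]
print(price_paid_before, price_paid_during)
```

Because each order holds its own copy, the earlier order still records the price that applied when it was placed. With a referenced product, both orders would show the sale price.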

Source: 50 Tips & Tricks for MongoDB Developers