SQL Server Physical Schema Notes (creating databases, tables, indexes, and more)


SQL Server 2008 supports several new data types, and an entirely new way of storing data in the data
pages —sparse columns.

Translating the logical database schema into a physical database schema may involve the following
changes:
Converting complex logical designs into simpler, more agile table structures
Converting logical many-to-many relationships to two physical one-to-many relationships with
an associative, or junction, table
Converting logical composite primary keys to surrogate (computer-generated) single-column
primary keys
Converting the business rules into constraints or triggers, or, better yet, into data-driven
designs


The key task in designing a physical database schema is brainstorming multiple possible designs, each
of which meets the user requirements and ensures data integrity. Each design is evaluated based on its
simplicity, performance of possible query paths, flexibility, and maintainability.


Designing for performance is greatly influenced by the simplicity or complexity of the design. Each
unnecessary complexity requires additional code, extra joins, and breeds even more complexity.


Denormalization:
A popular myth is that the primary task of translating a logical design into a physical schema is denormalization.
Denormalization, purposefully breaking the normal forms, is the technique of duplicating data within the database to make it easier to retrieve.

Some developers regularly denormalize portions of the database in an attempt to improve performance.
While it might seem that this would improve performance because it reduces the number of joins, I
have found that in practice the additional code (procedures, triggers, constraints, etc.) required to keep
the data consistent, or to renormalize data for use in set-based queries, actually costs performance. In
my consulting practice, I’ve tested a normalized design vs. a denormalized design several times. In every
case the normalized design was about 15% faster than the denormalized design.

Best Practice:
Never denormalize, apart from the two cases identified as responsible denormalization:

Denormalize aggregate data — such as account balances, or inventory on hand quantities
within OLTP databases — for performance even though such data could be calculated from
the inventory transaction table or the account transaction ledger table. These may be calculated
using a trigger or a persisted computed column.

If the data is not original and is primarily there for OLAP or reporting purposes, data
consistency is not the primary concern. For performance, denormalization is a wise move.
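
The first case above is often implemented with an AFTER trigger that re-aggregates the ledger rows whenever they change. The following is a minimal sketch only; the table and column names (dbo.Account, dbo.AccountTransaction, Balance, Amount) are hypothetical placeholders, not from any specific schema:

CREATE TRIGGER dbo.AccountTransaction_Balance
ON dbo.AccountTransaction
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recalculate the denormalized balance only for accounts touched by this statement
    UPDATE a
       SET Balance = (SELECT COALESCE(SUM(t.Amount), 0)
                        FROM dbo.AccountTransaction AS t
                       WHERE t.AccountID = a.AccountID)
      FROM dbo.Account AS a
     WHERE a.AccountID IN (SELECT AccountID FROM inserted
                           UNION
                           SELECT AccountID FROM deleted);
END;

A persisted computed column works for values derivable from the same row; cross-row aggregates like this one need a trigger (or a separately maintained reporting structure).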

The architecture of the databases, and which databases or tables are being used for which purpose, are
the driving factors in any decision to denormalize a part of the database.

If the database requires both OLTP and OLAP, the best solution might just be to create a few tables that
duplicate data for their own distinct purposes. The OLTP side might need its own tables to maintain the
data, but the reporting side might need that same data in a single, wide, fast table from which it can
retrieve data without any joins or locking concerns. The trick is to correctly populate the denormalized
data in a timely manner.


Designing for extensibility:
The following techniques may reduce the cost of database maintenance:
Enforce a strong T-SQL based abstraction layer.
Always normalize the schema.
Data-driven designs are more flexible, and therefore more extensible, than rigid designs.
Use a consistent naming convention.
Avoid data structures that are overly complex, as well as unwieldy data structures, when
simpler data structures will suffice.
Develop with scripts instead of using Management Studio’s UI.
Enforce the data integrity constraints from the beginning. Polluted data is a bear to clean up
after even a short time of loose data-integrity rules.
Develop the core feature first, and once that’s working, then add the bells and whistles.
Document not only how the procedure works, but also why it works.


Database-file:
Never store the transaction log on the same disk subsystem as the data file. For the sake
of the transactional-integrity ACID properties and the recoverability of the database, it’s
critical that a failing disk subsystem not be able to take out both the data file and the transaction file.

Configuring file growth:
Growing by percent is the best option for smaller databases. With very large files,
this option may add too much space in one operation and hurt performance while the data
file is being resized.

Growing by a fixed size is a good option for larger data files.

Best Practice:
The default setting is to grow the data file by 1 MB. Autogrow events require database locks, which
severely impact performance. Imagine a database that grows by a couple of gigabytes. It will have to
endure 2,048 tiny autogrow events. On the other hand, a large autogrowth event will consume more time.
The best solution is to turn autogrow off and manually increase the file size during the database maintenance
window. However, if that’s not expedient, then I recommend setting Autogrow to a reasonable mid-size
growth.

Setting a maximum size can prevent the data file or transaction log file from filling the entire disk subsystem, which would cause trouble for the operating system.
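
As a rough T-SQL sketch of these settings (the database name MyDB and logical file name MyDB_Data are placeholders), a fixed growth increment, a size cap, and a manual size increase might look like this:

-- Fixed-size autogrowth with an upper limit as a safety net
ALTER DATABASE MyDB
    MODIFY FILE (NAME = MyDB_Data, FILEGROWTH = 256MB, MAXSIZE = 50GB);

-- Manually grow the file during the maintenance window
ALTER DATABASE MyDB
    MODIFY FILE (NAME = MyDB_Data, SIZE = 20GB);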

The maximum size for a data file is 16 terabytes, and log files are limited to 2 terabytes. This does not
limit the size of the database because a database can include multiple files.


An easy way to determine the files and file sizes for all databases from code is to query the
sys.database_files catalog view.

To list the databases using code, query the sys.databases catalog view.
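
For example (size is stored in 8KB pages, so multiplying by 8 and dividing by 1,024 converts it to MB):

-- Files of the current database
SELECT name, type_desc, physical_name, size * 8 / 1024 AS size_mb
FROM sys.database_files;

-- All databases on the server
SELECT name, state_desc, recovery_model_desc
FROM sys.databases;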

Best Practice:
Many DBAs detest the default autogrowth settings because the database is locked during an autogrowth.
Growing by 1MB per growth is probably far too small and will cause frequent interruptions. Another
common error is to set the autogrowth to a percentage. As the database grows, so will the growth size. For
serious production databases the best solution is to monitor the data size and manually grow the database,
but leave autogrowth on as a safety factor.



Using multiple files:
If the database uses multiple data files, then the first, or primary, file will contain the system tables.
SQL Server attempts to balance the I/O load by splitting the inserts
among the multiple files according to the free space available in each file. As SQL Server balances the
load, rows for a single table may be split among multiple locations. If the database is configured for
automatic growth, all of the files will fill up before SQL Server increases the size of the files.

Before deleting a database file, run DBCC SHRINKFILE to empty it, and then remove the file.
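
A sketch of that sequence, using a placeholder database MyDB and file MyDB_Data2:

-- Move the file's data to the other files in the same filegroup, then remove it
DBCC SHRINKFILE (MyDB_Data2, EMPTYFILE);
ALTER DATABASE MyDB REMOVE FILE MyDB_Data2;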

multiple filegroups:
By configuring a database with multiple filegroups, new objects (tables, indexes, and so on) can be created on a specified filegroup.

Using multiple filegroups can increase performance by separating heavily used tables or
indexes onto different disk subsystems.

Using multiple filegroups can organize the backup and recovery plan by containing static data
in one filegroup and more active data in another filegroup.

Best Practice:
Create a single secondary filegroup (I call mine "Data") and set it as the default filegroup. This leaves the
primary filegroup dedicated to system objects.
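
A minimal sketch of that best practice; the database name, file path, and sizes are placeholders:

ALTER DATABASE MyDB ADD FILEGROUP [Data];

ALTER DATABASE MyDB ADD FILE
    (NAME = MyDB_Data2,
     FILENAME = 'D:\SQLData\MyDB_Data2.ndf',
     SIZE = 1GB,
     FILEGROWTH = 256MB)
    TO FILEGROUP [Data];

-- New objects are now created on [Data] unless another filegroup is specified
ALTER DATABASE MyDB MODIFY FILEGROUP [Data] DEFAULT;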


Modifying filegroups:
A filegroup may be removed if it is empty.
Emptying a filegroup is more difficult than shrinking a file: if there's data in the
filegroup, shrinking a file only moves the data to another file in the same filegroup. The tables and indexes
must be dropped from the filegroup before the filegroup can be deleted.
T-SQL: ALTER DATABASE ... ADD FILEGROUP or ALTER DATABASE ... REMOVE FILEGROUP
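
For example, assuming a hypothetical filegroup ArchiveFG that has already been emptied of tables and indexes, and a placeholder file MyDB_Archive1:

ALTER DATABASE MyDB REMOVE FILE MyDB_Archive1;   -- drop the (now empty) file first
ALTER DATABASE MyDB REMOVE FILEGROUP ArchiveFG;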

Creating Tables:
A SQL Server database may include up to 2,147,483,647 objects, including tables


Designing tables using Management Studio:
There are two ways to design tables: the Table Designer tool and the Database Designer tool.
After completing a design in either designer, you can save the script that creates the design.

To list the tables for the current database using code, query the sys.objects catalog view,
filtering for type_desc = 'USER_TABLE'.
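
For example:

SELECT SCHEMA_NAME(schema_id) AS schema_name, name, create_date
FROM sys.objects
WHERE type_desc = 'USER_TABLE'
ORDER BY schema_name, name;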

Best Practice:
I consider the schema to be code, and as such it should be handled as code and checked into a version
control system. I never develop using the graphic user interfaces in Management Studio. I strictly develop
using T-SQL scripts.


Schemas:
A schema is an object that exists purely to own database objects, most likely to segment a large database
into manageable modules, or to implement a segmented security strategy.

Beginning with SQL Server 2005, users and schemas are separated; a user no longer owns objects directly.
The schema name is the third part of the four-part name:
Server.database.schema.object
Using a two-part name (schema.object) improves performance.

Best Practice:
Creating objects in a schema other than dbo can improve security. Getting the correct schema is one
more obstacle that helps prevent SQL injection.
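
A small sketch; the schema name Sales and table Orders are hypothetical:

CREATE SCHEMA Sales AUTHORIZATION dbo;
GO
CREATE TABLE Sales.Orders (
    OrderID INT IDENTITY NOT NULL CONSTRAINT PK_Orders PRIMARY KEY
);
GO
-- Referencing the table with a two-part name (schema.object)
SELECT OrderID FROM Sales.Orders;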

To list the schema for the current database using code, query the sys.schemas catalog  view.

Column names:
SQL Server is very liberal with table and column names, allowing up to 128 Unicode characters and
spaces, as well as both uppercase and lowercase letters.

Best Practice:
Consistency is the database developer’s holy grail. The purpose of naming conventions, constraints,
referential integrity, relational design, and even column data type is to bring order and consistency to
the data we use to model reality. When faced with a database decision, asking "Which choice is the most
consistent?" is a good step toward a solution.


Filegroups:
When creating a table, filegroups can be used to organize the data. For example, place data that rarely changes in one filegroup and frequently changing data in another, so that only the frequently changing filegroup needs to be backed up often.


Creating Keys:
The design of these keys has a critical effect on the performance and usability of the physical database.
The structure of the primary and foreign keys is often the crux of any redesign. Keys are very
difficult to modify once the database is in production. Getting the primary keys right during the
development phase is a battle worth fighting.

Primary keys:
In general, avoid composite primary keys; they tend to hurt performance. Primary keys come in two kinds: meaningful (natural) keys, such as a national ID number, and computer-generated surrogate keys: identity and GUID.
An identity primary key is small, so it is fast, but an identity int primary key tends to require extra joins in queries. Also, since SQL Server 2005 the newsequentialid() function can generate sequential GUIDs, so the performance gap between GUID and identity primary keys has narrowed, and sequential GUIDs also solve the page-split problem when inserting data. Identity primary keys should still be somewhat faster, though.

Database design layers:
Domain Integrity (lookup) layer: Use natural keys — short abbreviations work well. The
advantage is that the abbreviation, when used as a foreign key, can avoid a join.

Business Entity (visible) layer: For any table that stores operational data, use a surrogate
key, probably an identity. If there’s a potential natural key (also called a candidate key), it
should be given a unique constraint/index.

Supporting (associative tables) layer: If the associative table will never serve as the primary
table for another table, then it’s a good idea to use the multiple foreign keys as a composite
primary key. It will perform very well. But if the associative table is ever used as a primary
table for another table, then apply a surrogate primary key to avoid a composite foreign key.

CREATE TABLE dbo.Guide (
    GuideID INT IDENTITY NOT NULL PRIMARY KEY
);
A primary key created this way receives a system-generated constraint name; you can explicitly declare the primary key constraint name instead.
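
A sketch of the same table with an explicitly named primary key constraint (the name PK_Guide is only a convention):

CREATE TABLE dbo.Guide (
    GuideID INT IDENTITY NOT NULL
        CONSTRAINT PK_Guide PRIMARY KEY
);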

To list the primary keys for the current database using code, query the sys.objects and
sys.key_constraints catalog views.

Identity column surrogate primary keys:
SET IDENTITY_INSERT TableName ON must be executed before explicit values can be inserted into an identity column.
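
A minimal sketch, reusing the dbo.Guide table from above:

SET IDENTITY_INSERT dbo.Guide ON;
INSERT INTO dbo.Guide (GuideID) VALUES (100);  -- explicit value for the identity column
SET IDENTITY_INSERT dbo.Guide OFF;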


Using uniqueidentifier surrogate primary keys:
It’s a 16-byte hexadecimal number that is essentially unique among all tables, all databases, all servers, and all planets.

GUIDs offer several advantages:
Tables can be replicated without a major overhaul. With identity keys, merging rows replicated from another database can produce duplicate primary key values and cause errors.
GUIDs are more unique than integers. The scope of an integer’s uniqueness is limited to the
local table. A GUID is unique in the universe.
GUIDs are forever. The table based on a typical integer-based identity column will hold only
2,147,483,648 rows.
Because the GUID can be generated by either the column default, the SELECT statement
expression, or code prior to the SELECT statement, it’s significantly easier to program with
GUIDs than with identity columns. Using GUIDs circumvents the data-modification problems
of using identity columns.

The drawbacks of unique identifiers are largely performance based:
Unique identifiers are large compared to integers, so fewer of them fit on a page. As a result,
more page reads are required to read the same number of rows.
Unique identifiers generated by NewID(), like natural keys, are essentially random, so data
inserts will eventually cause page splits, hurting performance. However, natural keys will have
a natural distribution so the page split problem is worse with natural keys.

There are two primary methods of generating Uniqueidentifiers:
The versatile NewID() function may be used as a column default, passed to an insert
statement, or executed as a function within any expression.

NewSequentialID() is similar to NewID(), but it guarantees that every new
uniqueidentifier is greater than any other uniqueidentifier for that table.
The NewSequentialID() function can be used only as a column default. This makes sense
because the value generated is dependent on the greatest uniqueidentifier in a specific
table.

Best Practice:
The NewSequentialID() function, introduced in SQL Server 2005, solves the clustered index page-split
problem.
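
A sketch of a uniqueidentifier surrogate key that uses NewSequentialID() as the column default; the table and constraint names are hypothetical:

CREATE TABLE dbo.Contact (
    ContactID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Contact_ContactID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Contact PRIMARY KEY,
    ContactName NVARCHAR(50) NOT NULL
);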

Creating foreign keys:
SQL Server tables may have up to 253 foreign key constraints.
The foreign key can reference primary keys, unique constraints, or unique indexes of any table except, of course, a temporary table.
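
In T-SQL, a foreign key is declared in the CREATE TABLE statement or added afterward with ALTER TABLE. A sketch with hypothetical dbo.Product and dbo.WorkOrder tables:

ALTER TABLE dbo.WorkOrder
    ADD CONSTRAINT FK_WorkOrder_Product
    FOREIGN KEY (ProductID) REFERENCES dbo.Product (ProductID);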

To establish or modify a foreign key constraint in Management Studio, use one of three methods:
Using the Database Designer, select the primary-key column and drag it to the foreign-key
column.
In the Object Explorer, right-click to open the context menu in the DatabaseName > Tables >
TableName > Keys node and select New Foreign Key.
Using the Table Designer, click on the Relationships toolbar button, or select Table Designer >
Relationships.

To list the foreign keys for the current database using code, query the sys.foreign_key_
columns catalog view.


Optional foreign keys:
If the foreign key is mandatory, the column should not allow nulls. An optional foreign key allows nulls.
Best Practice:
Although I’ve created databases with optional foreign keys, there are strong opinions that this is a worst
practice. My friend Louis Davidson argues that it's better to make the foreign key not null and add a row
to the lookup table to represent the Does-Not-Apply value. I see that as a surrogate lookup and would prefer
the null.


---Cascading deletes and updates
When a primary-key row that is referenced by a foreign key is deleted or updated, there are four options:
No Action: The operation on the primary-key row is blocked.
Cascade: The delete or update is cascaded to the foreign-key rows. Use this when the foreign-key rows have no meaning once the primary-key row is gone.
Set Null: The foreign-key values are set to null. Use this to delete the primary-key row while keeping the foreign-key rows.
Set Default: The foreign-key values are set to their default value.
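
The option is declared with the foreign key constraint. A sketch, again with hypothetical tables:

ALTER TABLE dbo.WorkOrderDetail
    ADD CONSTRAINT FK_WorkOrderDetail_WorkOrder
    FOREIGN KEY (WorkOrderID) REFERENCES dbo.WorkOrder (WorkOrderID)
    ON DELETE CASCADE      -- deleting a work order removes its detail rows
    ON UPDATE NO ACTION;   -- changing the work order key is blocked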


---Creating User-Data Columns
SQL Server tables may have up to 1,024 columns, but well-designed relational-database tables seldom
have more than 25, and most have only a handful.

To list the columns for the current database using code, query the sys.objects and
sys.columns catalog views.


---Column data types
The column's data type enforces the first level of data integrity.
It also determines the amount of disk storage allocated to the column.

Character Data Types:
text and ntext are deprecated and have been replaced by varchar(max) and nvarchar(max), which can each hold up to 2GB. The n-prefixed (Unicode) types use 2 bytes per character; the non-n types use 1 byte. char and varchar can hold at most 8,000 characters; nchar and nvarchar at most 4,000.
Sysname is a Microsoft-defined data type equivalent to nvarchar(128).
Unicode data types are very useful for storing multilingual data; otherwise, use char and varchar.


---Numeric data types
Best Practice:
When working with monetary values, be very careful with the data type. Using float or real data
types for money will cause rounding errors. The data types money and smallmoney are accurate
to one hundredth of a U.S. penny. For some monetary values, the client may request precision only to the
penny, in which case decimal is the more appropriate data type.


---Date/Time data types
Some programmers (non-DBAs) choose character data types for date columns. This can
cause a horrid conversion mess. Use the IsDate() function to sort through the bad data.

New date/time data types in SQL Server 2008:
DateTime2(), date, time(2), Datetimeoffset
For details, see SQL Server Books Online: ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.zh-CHS/s10de_6tsql/html/83e378a2-6e89-4c80-bc4f-644958d9e0a9.htm

Other Data Types:
Timestamp or Rowversion, Uniqueidentifier, Binary(n), VarBinary(max), VarBinary, Image (deprecated), Sql_variant

---Calculated columns
By default, a calculated column doesn’t actually store any data; instead, the data is calculated when
queried. However, since SQL Server 2005, calculated columns may be optionally persisted, in which
case they are calculated when entered and then stored as regular, but read-only, row data. They may
even be indexed.
Calculated columns are easy to use and fast.
Syntax:
ColumnName AS Expression
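
A small sketch of a calculated column, persisted so it is stored and indexable; the table and column names are hypothetical:

CREATE TABLE dbo.OrderDetail (
    OrderDetailID INT IDENTITY NOT NULL CONSTRAINT PK_OrderDetail PRIMARY KEY,
    Quantity INT NOT NULL,
    UnitPrice MONEY NOT NULL,
    ExtendedPrice AS (Quantity * UnitPrice) PERSISTED
);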


---Sparse columns
Sparse columns are new in SQL Server 2008. When most of a column's values are null, a sparse column saves a great deal of disk space. A sparse column is declared with the SPARSE keyword and must be nullable.
sparse columns use a completely different method for storing data within the page.
Normal columns have a predetermined designated location for the data. If there’s no data, then
some space is wasted. Even nullable columns use a bit to indicate the presence or absence of a null for
the column.

Sparse columns, however, store nothing on the page if no data is present for the column for that row.
To accomplish this, SQL Server essentially writes the list of sparse columns that have data into a list for
the row (5 bytes + 2–4 bytes for every sparse column with data). If the columns usually hold data,
then sparse columns actually require more space than normal columns. However, if the majority of
rows are null (I’ve heard a figure of 50%, but I’d rather go much higher), then the sparse column will
save space.

Because sparse columns are intended for columns that infrequently hold data, they can be used for very
wide tables — up to 30,000 columns.

To create a sparse column, add the SPARSE keyword to the column definition. The sparse column must
be nullable.
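
For example, assuming a hypothetical attributes table where most rows leave the optional columns empty:

CREATE TABLE dbo.ProductAttribute (
    ProductAttributeID INT IDENTITY NOT NULL CONSTRAINT PK_ProductAttribute PRIMARY KEY,
    ProductID INT NOT NULL,
    DiscontinuedDate DATETIME SPARSE NULL,        -- rarely populated
    SpecialInstructions NVARCHAR(400) SPARSE NULL
);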

Worst Practice:
Any table design that requires sparse columns is a horrible design. A different pattern, probably a
super-type subtype pattern, should be used instead. Please don’t ever implement a table with sparse
columns. Anyone who tells you they need to design a database with sparse columns should get a job flipping
burgers. Don’t let them design your database.


---Column constraints and defaults
There are five kinds of column constraints:
Primary key constraint, Foreign key constraint, Nullability, Check constraint, Unique constraint
Columns may also have default values.


---Column nullability
New columns in SQL Server default to not null,
but setting the connection property ANSI_NULL_DFLT_ON makes new columns default to nullable. The ANSI standard is to default to null.

Best Practice:
Because the default column nullability differs between ANSI SQL and SQL Server, it’s best to avoid relying
on the default behavior and explicitly declare null or not null when creating tables.



---Creating Indexes
SQL Server uses two types of indexes: clustered indexes, which reflect the logical sort order of the table,
and non-clustered indexes, which are additional b-trees typically used to perform rapid searches of
non-key columns. The columns by which the index is sorted are referred to as the key columns.

On the Index Properties pages:
The Filter page is for SQL Server 2008’s new WHERE clause option for indexes.
In the property pages of an existing index:
The Fragmentation page displays detailed information about the health of the index.
Indexes are created in code with the CREATE INDEX command.

To retrieve fascinating index information from T-SQL code, use the following catalog views
and functions: sys.indexes, sys.index_columns, sys.stats, sys.stats_columns,
sys.dm_db_index_physical_stats, sys.dm_db_index_operational_stats, and the functions
INDEXKEY_PROPERTY() and INDEX_COL().
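
For example, the fragmentation shown on the Fragmentation page can also be queried directly; dbo.WorkOrder is a placeholder table name:

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.WorkOrder'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;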


Composite indexes
A composite index is a clustered or non-clustered index that includes multiple key columns. Most nonclustered
indexes are composite indexes.
The order of the columns in a composite index is important. In order for a search to take advantage
of a composite index it must include the index columns from left to right.
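
A sketch of a composite index on hypothetical columns, with the left-to-right rule noted in comments:

CREATE NONCLUSTERED INDEX IX_WorkOrder_ProductID_Status
    ON dbo.WorkOrder (ProductID, Status);
-- WHERE ProductID = 42                      -> can seek on this index (leftmost key column)
-- WHERE ProductID = 42 AND Status = 'Open'  -> can seek on both key columns
-- WHERE Status = 'Open'                     -> cannot seek; ProductID is not supplied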

Primary keys:
A primary key can be initially defined as a clustered or non-clustered index. However, in order for the
index type to be changed, the primary key constraint must be dropped and recreated — a painful task if
numerous foreign keys are present or the table is replicated.


Filegroup location:
If the database uses multiple named filegroups, the index may be created on a certain filegroup with the
ON filegroupname option:
CREATE NONCLUSTERED INDEX IndexName
ON Table (Columns)
ON filegroupname;

This option is useful for spreading the disk I/O throughput for very heavily used databases.
Placing each table and index on its own disk subsystem can improve performance.
Remember that a clustered index must be in the same location as the table, because the clustered index's leaf pages are the data pages themselves.


---Index options
Unique indexes:
A UNIQUE INDEX option is more than just an index with a unique constraint; index optimizations are
available to unique indexes. A primary key or a unique constraint automatically creates a unique index.

Filtered indexes:
New for SQL Server 2008, filtered indexes enable DBAs to create indexes that include less than the
entire table’s worth of data. Essentially, a WHERE clause is added to the index definition. These are
perfect for highly tuned covering indexes. Filtered indexes can be configured in the Filter page of the
Management Studio’s Index Properties dialog, or using T-SQL.

CREATE INDEX IxActiveProduction
ON Production.WorkOrders (WorkOrderID, ProductID)
WHERE Status = 'Active';


Index fill factor and pad index:
An index needs a little free space in the tree so that new entries don’t require restructuring of the index.
When SQL Server needs to insert a new entry into a full page, it splits the page into two pages and
writes two half-full pages back to the disk. This causes three performance problems: the page split itself,
the new pages are no longer sequential, and less information is on each page so more pages must be
read to read the same amount of data.
Because the index is a balanced tree (b-tree), each page must hold at least two rows. The fill factor and
the pad index affect both the intermediate pages and the leaf node

Fill Factor   Intermediate Page(s)                              Leaf Node
0             One free entry                                    100% full
1–99          One free entry, or ≤ fill factor if pad index     ≤ fill factor
100           One free entry                                    100% full

The fill factor only applies to the detail, or leaf, node of the index, unless the PAD INDEX option is
applied to the fill factor. The PAD INDEX option directs SQL Server to apply the looseness of the fill
factor to the intermediate levels of the b-tree as well.

Best Practice:
The best fill factor depends on the purpose of the database and the type of clustered index. If the database
is primarily for data retrieval, or the primary key is sequential, a high fill factor will pack as much as
possible in an index page. If the clustered index is nonsequential (such as a natural primary key), then the
table is susceptible to page splits, so use a lower page fill factor and defragment the pages often.
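
Fill factor and pad index are set as index options; a sketch with placeholder table and index names:

CREATE NONCLUSTERED INDEX IX_Customer_LastName
    ON dbo.Customer (LastName, FirstName)
    WITH (FILLFACTOR = 80, PAD_INDEX = ON);  -- leave 20% free space, including the intermediate pages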

Limiting index locks and parallelism
The locking behavior of queries using the index may be controlled using the ALLOW_ROW_LOCKS and
ALLOW_PAGE_LOCKS options. Normally these locks are allowed.

Index sort order
SQL Server can create the index as a descending index.
Any query using an ORDER BY clause will still be sorted ascending unless the query’s ORDER BY
specifically states DESC.
The ASC or DESC option follows the column name in the CREATE INDEX DDL command.

The Ignore Dup Key index option:
The IGNORE_DUP_KEY option directs INSERT transactions to succeed for all
rows accepted by the unique index, and to ignore any rows that violate the unique index.
This option does not break the unique index. Duplicates are still kept out of the table, so the consistency
of the database is intact, but the atomicity of the transaction is violated.
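
A sketch of a unique index that silently discards duplicate rows during inserts; the names are placeholders:

CREATE UNIQUE NONCLUSTERED INDEX IX_Customer_Email
    ON dbo.Customer (Email)
    WITH (IGNORE_DUP_KEY = ON);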


The Drop Existing index option:
The DROP EXISTING option directs SQL Server to drop the current index and rebuild the new index
from scratch. This may cause a slight performance improvement over rebuilding every index if the index
being rebuilt is a clustered index and the table also has nonclustered indexes, because rebuilding a clustered
index forces a rebuild of any non-clustered indexes.

The Statistics Norecompute index option
The SQL Server Query Optimizer depends on data-distribution statistics to determine which index is
most significant for the search criteria for a given table. Normally, SQL Server updates these statistics
automatically. However, some tables may receive large amounts of data just prior to being queried, and
the statistics may be out of date. For situations that require manually initiating the statistics update, the
STATISTICS NORECOMPUTE = ON index option disables automatic statistics, but for nearly all indexes
this option should be ignored.

Sort in tempdb
The SORT_IN_TEMPDB = ON option modifies the index-creation method by forcing it to use tempdb as
opposed to memory. If the index is routinely dropped and recreated, this option may shorten the
index-creation time. For most indexes, this option is neither required nor important.


Disabling an index:
During some intensive data import operations, it’s faster to drop the index and recreate it than to
update the index with every newly inserted row. The benefit of disabling an index is that the metadata
for the index is maintained within the database, rather than depending on the code to recreate the
correct index.
Disabling a clustered index effectively disables the table.

To re-enable an index, use the ALTER INDEX ... REBUILD command.
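
A sketch of the disable/rebuild pattern around a bulk load; the index and table names are placeholders:

ALTER INDEX IX_Customer_LastName ON dbo.Customer DISABLE;

-- ... perform the intensive data import here ...

ALTER INDEX IX_Customer_LastName ON dbo.Customer REBUILD;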

Set up a separate filegroup for user tables and set it as the default filegroup, and keep user
tables off the primary filegroup.

Natural keys are good for lookup tables. Use surrogate keys for the other tables.

Constraints are faster and more reliable than application code. Always enforce the database
rules with constraints.