[MSSQL] Efficiency analysis of SELECT COUNT(*) vs. SELECT COUNT(column)

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 16:18

Original:

     

Advice on using COUNT( )
In the SQL Server community, one question that I sometimes encounter is whether you should use COUNT(*) or COUNT(columnname), where columnname is a column in the table whose rows you want to count. Often the advice given to people in forums and mailing lists is that COUNT(columnname) will perform better than COUNT(*). This is not always the correct advice, though; in many cases it is even entirely wrong. Although there are situations where you can (or even should) use COUNT(columnname), you definitely shouldn't always use it. This incorrect piece of advice is probably based on a lack of understanding of how SQL Server handles data internally.

Description of COUNT( )
The first thing that you need to know is that there is a difference between the alternative ways of using COUNT( ), and what this difference is. The complete syntax for COUNT( ) is this:

  COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )

The word expression means any expression except for uniqueidentifier, text, ntext or image data, and it may not use aggregate functions or subqueries. Most often, though, expression is just a column in the table. ALL is the default, which means that writing COUNT(expression) is equivalent to writing COUNT(ALL expression).

COUNT(*) returns the total number of rows in the table, while COUNT(expression) returns the number of rows where the result of the expression is not NULL. Naturally, COUNT(DISTINCT expression) means that duplicates are only counted once. This means that COUNT( ) can return different results depending on how you write it.
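As a small illustration of how the forms can disagree, the following sketch counts the same rows in several ways. The table variable @Readings and its column Value are hypothetical names invented for this example; they do not come from the article.

```sql
-- Hypothetical sample data: five rows, one NULL, one duplicate value.
DECLARE @Readings TABLE (Value INT NULL);

INSERT INTO @Readings (Value) VALUES (10);
INSERT INTO @Readings (Value) VALUES (20);
INSERT INTO @Readings (Value) VALUES (20);
INSERT INTO @Readings (Value) VALUES (30);
INSERT INTO @Readings (Value) VALUES (NULL);

SELECT COUNT(*)              AS AllRows,        -- 5: every row, NULL included
       COUNT(Value)          AS NonNullValues,  -- 4: the NULL row is skipped
       COUNT(ALL Value)      AS SameAsAbove,    -- 4: ALL is the default
       COUNT(DISTINCT Value) AS DistinctValues  -- 3: 10, 20 and 30
FROM @Readings;
```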

Myths and facts
As I said earlier, many people believe COUNT(columnname) is faster than COUNT(*), because COUNT(*) would have to read all columns of each row (just like executing a SELECT * FROM MYTABLE statement), while COUNT(columnname) only needs to read the specified column. This is not true, though, for several reasons.

First of all, SQL Server can't read just the contents of a single column without reading the entire row. SQL Server stores the rows with the data on 8 KB data pages on disk. These pages contain one or more rows (depending on the size of each individual row, which may be up to 8060 bytes, with some exceptions), and these pages are placed in the internal memory (RAM) when SQL Server needs to access them for any reason. To check the value of a single column (or several, of course), an entire page has to be read from disk and placed in memory. The pages may of course already be cached in memory, in which case the read will be much faster, but SQL Server still needs to read an entire page from memory just to check a single column of a row.

Now, to avoid having to read these entire data pages when all you are really interested in is how many rows there are in a table, SQL Server will use an index instead, if one exists. Indexes are stored in the same way as data, on 8 KB index pages. Since an index is probably not as wide as a data row (the index only consists of some or even one of the columns in the row), an index page can usually fit a lot more rows per page than the data pages can. This means that SQL Server doesn't have to read as many pages to check the number of rows in the index as it does with the data pages, which is of course a good thing.

This does not only apply to COUNT(columnname_with_an_index_defined_on_it); COUNT(*) will of course also use the index to count the rows. In some cases there may not be an index that covers the specified column in a COUNT(columnname) query, but there is an index defined on another column of the table. In this case COUNT(*) would use this other index to count the number of rows, but COUNT(columnname_without_an_index) would have to read the data pages to check the column for NULL values and count the rows.
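The situation described above can also be reproduced on a throwaway table. What follows is only a sketch under stated assumptions: the temporary table #CountDemo, its columns and the index name are all hypothetical, and the row generator assumes SQL Server 2005 or later (it relies on ROW_NUMBER and sys.all_objects). The actual page counts will vary from one system to another.

```sql
-- Hypothetical throwaway table; every name here is made up for the example.
CREATE TABLE #CountDemo (
    Code    INT         NOT NULL,             -- narrow column, indexed below
    Note    VARCHAR(50) NULL,                 -- nullable and not indexed
    Padding CHAR(500)   NOT NULL DEFAULT 'x'  -- widens each row so a data page holds few rows
);

-- Fill the table with rows numbered 1..5000; every tenth row gets a NULL Note.
INSERT INTO #CountDemo (Code, Note)
SELECT n, CASE WHEN n % 10 = 0 THEN NULL ELSE 'note' END
FROM (SELECT TOP (5000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
      ORDER BY n) AS numbers;

CREATE NONCLUSTERED INDEX IX_CountDemo_Code ON #CountDemo (Code);

SET STATISTICS IO ON;

SELECT COUNT(*)    FROM #CountDemo;  -- 5000; free to scan the narrow index on Code
SELECT COUNT(Note) FROM #CountDemo;  -- 4500; must inspect Note, which only the data pages hold

SET STATISTICS IO OFF;

DROP TABLE #CountDemo;
```

Comparing the logical reads reported for the two SELECT statements should show the second one touching noticeably more pages, since it cannot be answered from the index on Code.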

To try this for yourself, run the following script in SQL Query Analyzer (if it is not already set to show the results in text mode, use Ctrl-T to set it that way):

USE Northwind
GO

SET STATISTICS IO ON

SELECT COUNT(*) FROM Orders
SELECT COUNT(CustomerId) FROM Orders
SELECT * FROM Orders

SET STATISTICS IO OFF

The statement SET STATISTICS IO ON configures SQL Server to output statistics showing the amount of I/O that was required to execute each query, and you can use it to compare the resources used by different queries when deciding which one to use. You can find this output directly after the results of each statement executed. The statistics we are interested in here are the numbers of logical and physical page reads. Logical page reads is the number of pages (data and/or index pages) that were read from memory, and physical page reads is the number of pages read from disk.

On my computer the result of COUNT( ) shows 830 rows for both alternatives, which is probably also what you got if you haven't added or deleted any rows from the Orders table. Now note the number of logical page reads for these statements (if you are getting physical page reads, run the script a couple of times so that the data is cached in memory). I have 3 logical page reads for the first alternative, and 21 logical page reads for the second one! Also note that the third statement, which SELECTs all of the rows from the table, also resulted in 21 logical page reads. This shows us that the second statement had to read all of the data pages just to count the number of rows in Orders because there is no index on the CustomerId column, but the first statement is able to use an index (on my computer the index ShippersOrders was used; I checked the execution plan for the query to find that out) to count the rows.
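One way to check which index a query uses, without opening the graphical plan, is SET SHOWPLAN_TEXT. This is a sketch of that approach rather than the exact procedure the author followed; the index names that appear in the output depend on how the Northwind database on your server is set up.

```sql
USE Northwind
GO

-- While SHOWPLAN_TEXT is ON, the statements are not executed;
-- SQL Server returns the text of the estimated plan instead.
-- The SET statement has to be alone in its batch, hence the GO.
SET SHOWPLAN_TEXT ON
GO

SELECT COUNT(*) FROM Orders
SELECT COUNT(CustomerId) FROM Orders
GO

SET SHOWPLAN_TEXT OFF
GO
```

The plan text for each statement names the index or table that is scanned, which makes it easy to see whether the two queries read the same structure.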

Which one to use?
As I have shown, using COUNT(*) certainly does not mean poor performance. On the contrary, in some cases you may instead get poor performance from using COUNT(expression). Normally you probably won't encounter the problem in the example above, as you will probably have an index on the column you specified. What is worse, though, is that you may receive a different result from what you were expecting! Let's say that you have a legacy application that uses COUNT(columnname) to count the number of rows of a table, where columnname represents a column that does not allow NULL values. Now, sometime later, the definition for the column is changed to allow NULL values. As soon as someone enters a NULL value in the column, your application will no longer show the number of rows in the table but instead the number of rows with non-NULL values in the specified column! That may not be what the designers of the application intended and expected, and could possibly cause major problems.
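The legacy-application scenario can be played through in a few statements. This is only an illustrative sketch: the temporary table #Members and its columns are hypothetical and stand in for whatever table such an application would actually query.

```sql
-- Hypothetical table: Email starts out as NOT NULL.
CREATE TABLE #Members (Id INT NOT NULL, Email VARCHAR(100) NOT NULL);
INSERT INTO #Members (Id, Email) VALUES (1, 'a@example.com');
INSERT INTO #Members (Id, Email) VALUES (2, 'b@example.com');
GO

-- The "legacy" row count: correct as long as Email can never be NULL.
SELECT COUNT(Email) AS MemberCount FROM #Members;  -- 2
GO

-- Later the column definition is relaxed and a NULL gets inserted.
ALTER TABLE #Members ALTER COLUMN Email VARCHAR(100) NULL;
GO
INSERT INTO #Members (Id, Email) VALUES (3, NULL);
GO

-- The same query now silently undercounts the rows.
SELECT COUNT(Email) AS CountOfEmail,  -- 2: the new row is not counted
       COUNT(*)     AS CountOfRows    -- 3: the real number of rows
FROM #Members;
GO

DROP TABLE #Members;
GO
```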

But...
So, normally there is no reason not to use COUNT(*). But as I mentioned in the beginning of the article, there are situations where you want to (or rather should) use COUNT(expression). One obvious example is of course if you are really only interested in the number of rows where the column value is not NULL. A typical example of a situation like that is when you use COUNT( ) together with another aggregate function. Let's say we have a table with some sort of measure data, with NULL values in some rows. Now we're looking for an average of these values. Normally, we would use AVG( ) for this, but to see the point we'll say we're not allowed to use it. Compare these two statements and see if you spot the problem:

SELECT SUM(column) / COUNT(*) FROM table

SELECT SUM(column) / COUNT(column) FROM table


These statements will return different average results, since SUM( ) ignores NULL values (they are not counted as 0). If the sum is 1500, and the number of rows is 150, of which 50 have NULL in the specified column, the result of the first query will be 10 (1500/150) and the result of the second query will be 15 (1500/100). This is actually a problem that I encounter quite often in my work as a database consultant, and most often it exists because the person who wrote the SQL statement was not aware of how NULL values are handled differently in different aggregate functions (SUM( ) and COUNT( ) in the example above).
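The same effect shows up on a much smaller data set. In the sketch below, the table variable @Measurements and its column Reading are hypothetical names chosen for the example; the values are picked so that the two results come out as 10 and 15, like the figures in the text.

```sql
-- Hypothetical measure data: three rows, one of them NULL.
DECLARE @Measurements TABLE (Reading DECIMAL(10, 2) NULL);

INSERT INTO @Measurements (Reading) VALUES (10.00);
INSERT INTO @Measurements (Reading) VALUES (20.00);
INSERT INTO @Measurements (Reading) VALUES (NULL);

SELECT SUM(Reading) / COUNT(*)       AS DividedByAllRows,  -- 30 / 3 = 10
       SUM(Reading) / COUNT(Reading) AS DividedByNonNull,  -- 30 / 2 = 15
       AVG(Reading)                  AS BuiltInAverage     -- 15: AVG also ignores NULLs
FROM @Measurements;
```

Only the second expression agrees with AVG( ), because both of them leave the NULL row out of the divisor.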


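SELECT COUNT(*) AS [*] FROM TEST                     -- all rows, including NULLs and duplicates
SELECT COUNT(Test) AS [Test] FROM TEST               -- rows where Test is not NULL
SELECT COUNT(1) AS [1] FROM TEST                     -- all rows; the constant 1 is never NULL
SELECT COUNT(DISTINCT Test) AS [distinct] FROM TEST  -- unique non-NULL values of Test
SELECT * FROM TEST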

 

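COUNT(*) returns the number of items in a group, including NULL values and duplicates.

COUNT(ALL expression) evaluates expression for each row in a group and returns the number of non-null values.

COUNT(DISTINCT expression) evaluates expression for each row in a group and returns the number of unique non-null values.

For return values greater than 2^31-1, COUNT produces an error. Use COUNT_BIG in that case.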
