Tuesday, March 27, 2012

Estimated row count vs. actual row count

Hello!
I was trying to troubleshoot a query. The very last step shows an estimated
rowcount of 142386. When the query is executed, it returns only 7234 rows. I
executed UPDATE STATISTICS ... WITH FULLSCAN against all tables involved in
this query, but this didn't make a difference. I am not sure what else I can
do at this point.
Any advice is greatly appreciated,
Igor
|||If you're referring to "estimated rowcount" in the Show Execution Plan
screen of Query Analyzer, that is just an estimate of how many rows will be
processed by that specific step in the processing... The number of rows the
query eventually produces can be, and most often is, very different.
This is nothing to be concerned about.
"Igor Marchenko" wrote:

> Hello!
> I was trying to troubleshoot a query. The very last step shows an estimated
> rowcount of 142386. When the query is executed, it returns only 7234 rows. I
> executed UPDATE STATISTICS ... WITH FULLSCAN against all tables involved in
> this query, but this didn't make a difference. I am not sure what else I can
> do at this point.
> Any advice is greatly appreciated,
> Igor
>
>|||Yes, I am referring to the execution plan in Query Analyzer. I got concerned
after reading the following article:
http://www.windowsitpro.com/sqlserv...61.html

For other queries there is no discrepancy between the estimated and actual
number of rows.
"CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
news:0DF221BF-921A-4D4A-9963-6FD335621588@.microsoft.com...
> If you're referring to "estimated rowcount" in the Show Execution Plan
> screen of Query Analyzer, that is just an estimate of how many rows will be
> processed by that specific step in the processing... The number of rows the
> query eventually produces can be, and most often is, very different.
> This is nothing to be concerned about.
>
> "Igor Marchenko" wrote:
>|||OK, I read the article... The author's first comment, "Yes, go ahead and
worry...", is a bit of overkill, in that he is talking to a point extraneous to
the question he was asked...
He was asked whether there's an issue simply because the numbers are
different, but he answered based on the fact that the "estimated" number is
HIGH.
Any time you see an "estimated rowcount" number much higher than the actual
number of rows generated, the Query Optimizer might not be choosing the
fastest, most efficient option for processing the query... The accuracy or
completeness of the results will not be affected, only the speed at which
they are generated.
"Igor Marchenko" wrote:

> Yes, I am referring to the execution plan in Query Analyzer. I got concerned
> after reading the following article:
> http://www.windowsitpro.com/sqlserv...61.html
> For other queries there is no discrepancy between the estimated and actual
> number of rows.
> "CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
> news:0DF221BF-921A-4D4A-9963-6FD335621588@.microsoft.com...
>
>|||That's exactly why I am worried: 'the Query Optimizer might not be choosing
the fastest, most efficient option for processing the query'. I do know
accuracy won't be affected.
"CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
news:2BE6C2AB-8D39-4D79-ACC6-D0DBB427F651@.microsoft.com...
> OK, I read the article... The author's first comment, "Yes, go ahead and
> worry...", is a bit of overkill, in that he is talking to a point extraneous
> to the question he was asked...
> He was asked whether there's an issue simply because the numbers are
> different, but he answered based on the fact that the "estimated" number is
> HIGH.
> Any time you see an "estimated rowcount" number much higher than the actual
> number of rows generated, the Query Optimizer might not be choosing the
> fastest, most efficient option for processing the query... The accuracy or
> completeness of the results will not be affected, only the speed at which
> they are generated.
> "Igor Marchenko" wrote:
>|||Igor, I found this link:
http://msdn.microsoft.com/library/d...enetchapt14.asp
From it I extracted this bit of text...
Compare Actual vs. Estimated Rows and Executions
When you read the output from a SHOWPLAN statement, start from the
most-indented row that has the highest incremental change in the
TotalSubtreeCost column. Carefully evaluate both the index selection and the
optimizer's estimate by using the SET STATISTICS PROFILE ON command. This
command runs the statement, so only use it on SELECT statements or T-SQL
code that does not modify data, or you can preface the command with a
BEGIN TRAN/ROLLBACK statement.
As an alternative, use the new profiler Performance:Showplan Statistics
event in SQL 2000. This event belongs to event class 98. This event reports
four columns that show estimated and actual rows and executions. You must
select the Binary Data column before the profiler event adds data to the
T-SQL or SP:stmtcompleted events.
Substantial differences in the estimated row count may indicate the
optimizer had out-of-date statistics or skewed statistics. For example, if
the estimated row count is 2 rows, and the actual row count is 50,000, the
optimizer may have had out-of-date statistics or skewed statistics. Try
using the UPDATE STATISTICS WITH FULLSCAN command.
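
A minimal sketch of the SET STATISTICS PROFILE ON approach described above,
using a hypothetical query (dbo.Customers and dbo.Orders are made-up names,
not tables from Igor's database):

SET STATISTICS PROFILE ON
GO

-- Run the statement under investigation; a plain SELECT is safe to run here.
SELECT c.CustomerID, o.OrderDate
FROM dbo.Customers AS c
JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '20050101'
GO

SET STATISTICS PROFILE OFF
GO

-- The result set is followed by a plan listing; compare the Rows column
-- (actual rows per operator) against the EstimateRows column (the
-- optimizer's estimate) to find the operators with the largest gaps.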
Hth,
Charly
"Igor Marchenko" wrote:

> That's exactly why I am worried: 'the Query Optimizer might not be choosing
> the fastest, most efficient option for processing the query'. I do know
> accuracy won't be affected.
> "CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
> news:2BE6C2AB-8D39-4D79-ACC6-D0DBB427F651@.microsoft.com...
>
>|||Charly,
I did execute UPDATE STATISTICS WITH FULLSCAN, but it didn't correct the
discrepancies between the estimated and actual number of rows.
Regards,
Igor
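
For reference, a minimal sketch of the kind of statement being described,
with a hypothetical table name (the actual tables are not given in the
thread):

-- dbo.Orders is a placeholder; run this against each table in the query.
UPDATE STATISTICS dbo.Orders WITH FULLSCAN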
"CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
news:A427092E-9686-4F08-B9F3-115B8C313DA7@.microsoft.com...
> Igor, I found this link:
> http://msdn.microsoft.com/library/d...enetchapt14.asp
> From it I extracted this bit of text...
> Compare Actual vs. Estimated Rows and Executions
> When you read the output from a SHOWPLAN statement, start from the
> most-indented row that has the highest incremental change in the
> TotalSubtreeCost column. Carefully evaluate both the index selection and
> the
> optimizer's estimate by using the SET STATISTICS PROFILE ON command. This
> command runs the statement, so only use it on SELECT statements or T-SQL
> code
> that does not modify data, or you can preface the command with a BEGIN
> TRAN
> /ROLLBACK statement.
> As an alternative, use the new profiler Performance:Showplan Statistics
> event in SQL 2000. This event belongs to event class 98. This event
> reports
> four columns that show estimated and actual rows and executions. You must
> select the Binary Data column before the profiler event adds data to the
> T-SQL or SP:stmtcompleted events.
> Substantial differences in the estimated row count may indicate the
> optimizer had out-of-date statistics or skewed statistics. For example, if
> the estimated row count is 2 rows, and the actual row count is 50,000, the
> optimizer may have had out-of-date statistics or skewed statistics. Try
> using
> the UPDATE STATISTICS WITH FULLSCAN command.
> Hth,
> Charly
> "Igor Marchenko" wrote:
>|||When the optimizer's rowcount estimates are incorrect, often
statistics cannot remedy the problem. For many query plan
operators, the actual rowcount depends on the distribution
of values in two separate columns (perhaps in two separate
tables), and as far as I know, SQL Server keeps no statistics
on how the values in multiple columns are correlated. In
addition, there is no feedback mechanism to do anything with
the information you're seeing - that the actual and estimated
counts are very different.
I have often seen suboptimal query plans be chosen as a
result of this situation, but more often than not, the best
query plan is still being chosen, even though its cost is not
estimated correctly. When there is a better plan the optimizer
is missing, it's my experience that the culprit is less often the
bad rowcounts than that the indexes available are not ideal.
For instance, many queries benefit from nonclustered indexes
on more than a single column, but many designers never define
indexes on more than one column at a time.
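
As a hedged illustration of the multi-column nonclustered index Steve
mentions (the table, column, and index names here are hypothetical):

-- A composite, two-column nonclustered index on a made-up Orders table.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
ON dbo.Orders (CustomerID, OrderDate)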
Without seeing the query and actual query plan, and assuming
that some basic tuning has already been done to make sure
no more filters can be made SARGable than already are, I can
suggest a couple of things to look for - beyond that, if you post the
query and plan, along with the create table and create index
statements, we can take a further look.
First, if the bad rowcounts appear in parallel operators, you
might try adding the hint OPTION (MAXDOP 1) and see if
performance improves. Bad rowcounts can cause parallelism
to be used in counterproductive ways. Second, if you can
identify the particular WHERE conditions that the optimizer
is costing poorly (the optimizer expects many more rows to
satisfy the conditions than actually do), try adding one or more
non-SARGable dummy conditions at the same point in the
query and see if the estimates are more accurate. For example,
if the condition A.col1 = B.col1 is estimated poorly, where
these two columns are integers, try
(A.col1 = B.col1 AND -A.col1 = -B.col1), or try (A.col1 =
B.col1 AND A.col1%2 < 2 AND B.col1%2 < 2). If date
columns are involved, add something like ...
AND MONTH(A.datecolumn) > 0. Be sure to use parentheses
so you don't change the meaning of your query due to the way
AND and OR are associated. The query optimizer handles
non-SARGable predicates generically and this may help out.
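
A sketch of the two experiments just described, written against a
hypothetical join between tables A and B (all object names are assumptions,
not the actual query):

-- 1. Rule out counterproductive parallelism by forcing a serial plan.
SELECT A.col1, B.col2
FROM dbo.A
JOIN dbo.B ON A.col1 = B.col1
OPTION (MAXDOP 1)

-- 2. Add redundant, non-SARGable conditions that cannot change the result
--    (an integer modulo 2 is always -1, 0, or 1, so col1 % 2 < 2 is always
--    true) and check whether the estimates move closer to the actuals.
SELECT A.col1, B.col2
FROM dbo.A
JOIN dbo.B ON (A.col1 = B.col1 AND A.col1 % 2 < 2 AND B.col1 % 2 < 2)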
A simpler situation where the rowcounts can be
wrong is for non-SARGable conditions
like MONTH(datecolumn) = 1 or like
DATEADD(d,30,datecolumn) > GETDATE(). Here
the optimizer will generally use generic estimates (perhaps
something like "gee, I think about a third of the rows will
satisfy a > condition, no matter what the condition is").
Sometimes the condition can be rewritten so the optimizer
can use statistics instead of a generic estimate.
The second of these examples can better be written as
datecolumn > DATEADD(d,-30,GETDATE()), which
is SARGable. The first can't be SARGed on its own, but
if it is there for something like MONTH(datecolumn) = 1
AND YEAR(datecolumn) = 2005, it can be rewritten as
(datecolumn >= '20050101' and datecolumn < '20050201').
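
The rewrites Steve describes, sketched against a hypothetical table dbo.T
with a column named datecolumn:

-- Non-SARGable forms: the optimizer falls back on generic estimates and
-- cannot seek an index on datecolumn.
SELECT * FROM dbo.T WHERE DATEADD(d, 30, datecolumn) > GETDATE()
SELECT * FROM dbo.T WHERE MONTH(datecolumn) = 1 AND YEAR(datecolumn) = 2005

-- SARGable equivalents: statistics and any index on datecolumn can be used.
SELECT * FROM dbo.T WHERE datecolumn > DATEADD(d, -30, GETDATE())
SELECT * FROM dbo.T WHERE datecolumn >= '20050101' AND datecolumn < '20050201'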
Feel free to post more details.
Steve Kass
Drew University
Igor Marchenko wrote:

>Charly,
>I did execute UPDATE STATISTICS WITH FULLSCAN, but it didn't correct the
>discrepancies between the estimated and actual number of rows.
>Regards,
>Igor
>
>"CBretana" <cbretana@.areteIndNOSPAM.com> wrote in message
>news:A427092E-9686-4F08-B9F3-115B8C313DA7@.microsoft.com...
>
>
>|||"Steve Kass" <skass@.drew.edu> wrote in message
news:ev8gWnZNFHA.3380@.TK2MSFTNGP15.phx.gbl...

> (A.col1 = B.col1 AND -A.col1 = -B.col1), or try (A.col1 =
> B.col1 AND A.col1%2 < 2 AND B.col1%2 < 2). If date
> columns are involved, add something like ...
> AND MONTH(A.datecolumn) > 0. Be sure to use parentheses
> so you don't change the meaning of your query due to the way
> AND and OR are associated.
Since AND has a higher precedence than OR, I think this caveat would only
apply if NOTs were involved. It never hurts to include extra parentheses, of
course, but in this case, I don't see that they're necessary.
Excellent post, as usual.
|||Thank you, Steve, for your thorough answer! I was able to speed the query
up by implementing the clustered index on a different set of columns. The
query plan has changed completely: instead of using hash and merge joins,
SQL Server is now using nested loop joins. The overall query cost went down
from 137 to 56.1. I am currently verifying all queries that access the table
in question to make sure there are no side effects as a result of the
different clustered index. I was wondering if you could provide a reference
on where to read more about query optimization techniques (statistics, better
index selection, etc.).
Thanks,
Igor
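
A minimal sketch of the kind of change Igor describes (the table, index, and
column names below are hypothetical; the actual columns are not given in the
thread):

-- Drop the existing clustered index and rebuild it on a different key.
-- If the clustered index backs a PRIMARY KEY constraint, the constraint
-- would have to be dropped and re-created instead.
DROP INDEX BigTable.CIX_BigTable
CREATE CLUSTERED INDEX CIX_BigTable ON dbo.BigTable (CustomerID, OrderDate)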
