Thursday, March 29, 2012

eternal lock?

Hi,

does sb recognize (aspects of) the following problem? Or better, know a
solution or direction to search?

At work I've inherited a series of delphi applications that access a common
database using SQL Server 2000 (sp3, sp4 update in preparation due to this
problem). Applications run on one server, db on the second. Both are dual
xeon 2.8 or 3 GHz with 2 GB ram. The apps use about one GB (800MB) memory,
the db too (is configured to use more, 1.8GB, but doesn't.) The db is also
replicated to a third machine.

The problem is that sometimes, after a cascade of query timeouts (recorded
by the apps in the eventlog, cause is the commandtime set on all
components), the whole applications seems to stop responding. Restarting the
apps doesn't solve the problem, rebooting the application server does, which
leads me to believe the problem is in MDAC on the app server? The app server
has an own unused sql server instance (used in migrations) btw.

The problems occur during busier times, but nothing spectacular (up to
ten-thousand of queries per hour maybe).

The problem sometimes goes away after a few minutes in about half of the
cases, but if not, it seems perpetual till reboot (at least 13 hours).

Another notable point is that not all queries time out, most writes (which
append a row or change a row) seem to go ok, same with selects that get a
record for a primary key value, and pure read selects flagged with NOLOCK.
The queries that go wrong all get lists that touch central tables (either
directly or via joins).

The behaviour is consistent with an external row/page lock somewhere that
doesn't go away.

Database layout is fairly uninteresting. A db or 3 (one read-only), the
larger one having say 30 tables,

cardinality of the tables is not that much of a problem. Tens of thousands
of rows max, except a logging table with maybe 300000 tuples. (which is only
traversed for mgmnt info, and not during busy hours) No binary or other
disproportionally large fields, Most db access done based on primary/foreign
keys.

Other details:
- Replication overhead can be considered low (we are talking about
thousand(s) mutationsper day, nothing significant.
- commandtimeout on all db components is set (to 30s)
- all cursors are clientside, except the component used for getting lists,
that has
location=cluseserver; cursortype=ctopenforwardonly;
cachesize=250; locktype=readonly
- the apps are not threaded.
- D6 patched with all three patches

Thnks in advanceMarco van de Voort (marcov@.stack.nl) writes:
> The problem is that sometimes, after a cascade of query timeouts
> (recorded by the apps in the eventlog, cause is the commandtime set on
> all components), the whole applications seems to stop responding.
> Restarting the apps doesn't solve the problem, rebooting the application
> server does, which leads me to believe the problem is in MDAC on the app
> server? The app server has an own unused sql server instance (used in
> migrations) btw.

Have you examined blocking?

With this superficial information about the system it is difficult to
say for sure, but it does smell of a well-known gotcha (been there, done
that myself).

To wit, if a query times out, and there is a transaction in progress,
the transaction is not rolled back automatically. It is irrelevant
whether the transaction was started prior to the submission of the
query batch, or started within the query batch that timed out.

The application must handle this by submitting

IF @.@.trancount > 0 ROLLBACK TRANSACTION

in case of a query timeout.

If the application fails to observe this, the result is chaos.
Transactions never commits, which means that processes keeps on
acquiring more and more locks, and you get blocking galore. And
when you finally restart something, you lose all the updates...

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||On 2006-05-22, Erland Sommarskog <esquel@.sommarskog.se> wrote:
> To wit, if a query times out, and there is a transaction in progress,
> the transaction is not rolled back automatically. It is irrelevant
> whether the transaction was started prior to the submission of the
> query batch, or started within the query batch that timed out.
> The application must handle this by submitting
> IF @.@.trancount > 0 ROLLBACK TRANSACTION

I call the rollback of the relevant ADO connection on the exception caused
by the timeout.|||Marco van de Voort (marcov@.stack.nl) writes:
>> To wit, if a query times out, and there is a transaction in progress,
>> the transaction is not rolled back automatically. It is irrelevant
>> whether the transaction was started prior to the submission of the
>> query batch, or started within the query batch that timed out.
>>
>> The application must handle this by submitting
>>
>> IF @.@.trancount > 0 ROLLBACK TRANSACTION
> I call the rollback of the relevant ADO connection on the exception caused
> by the timeout.

So, did you investiagate if you have any blocking?

Also, I have you examined the SQL Server error log?

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||On 2006-05-23, Erland Sommarskog <esquel@.sommarskog.se> wrote:
> Marco van de Voort (marcov@.stack.nl) writes:
>>> To wit, if a query times out, and there is a transaction in progress,
>>> the transaction is not rolled back automatically. It is irrelevant
>>> whether the transaction was started prior to the submission of the
>>> query batch, or started within the query batch that timed out.
>>>
>>> The application must handle this by submitting
>>>
>>> IF @.@.trancount > 0 ROLLBACK TRANSACTION
>>
>> I call the rollback of the relevant ADO connection on the exception caused
>> by the timeout.

(for the record: I already did this all the time, it is not a change)

> So, did you investiagate if you have any blocking?

It occurs less than once a month (unfortunately on a painful moment). IOW, I
can't reproduce it at will. Which is why I asked if sb recognized the
problems.

> Also, I have you examined the SQL Server error log?

Yes, nothing. But maybe my logging settings are wrong.|||Marco van de Voort (marcov@.stack.nl) writes:
> It occurs less than once a month (unfortunately on a painful moment).
> IOW, I can't reproduce it at will. Which is why I asked if sb recognized
> the problems.

I'm afraid that without access to the real situation, it is difficult
to say that much intelligent. The fact that it occurs only rarely,
indicates that it is not a general problem with unhandled query timeouts.
But it still smells of transactions that are not committed/rolled back
properly.

The only thing I can suggest is that when it sets try to collect as
much data you can before the reboot panic sets in. I have a stored
procedure on my web site which is good for this purpose:
http://www.sommarskog.se/sqlutil/aba_lockinfo.html.

>> Also, I have you examined the SQL Server error log?
> Yes, nothing. But maybe my logging settings are wrong.

It's always good to have trace flags 1204 and 3605 enabled to get
deadlock information written to the error log, but that was not I
had in mind. I was thinking of error 17883, which indicates that
SQL Server is in bad shape at the moment. This diagnostic message
was added in SQL 2000 SP3, and was augmented in some hotfixes soon
thereafter. SP4 has an even wider set of these messages.

The fact that you don't have message 17883 in the log amplifies the
impression that the problem is related to the application.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspxsql

No comments:

Post a Comment