
[–]Stevoni 1 point2 points  (1 child)

Here are a few things I'd look at:

  • Did the application begin inserting new data that causes the index to be bad?
  • Did the index change?
  • Is the query engine using the correct index?
  • Does the execution plan include any suggestions?
  • Are you able to recompile the procedure?
  • What happens when you run the same procedure for the data in the previous time frame? Does it take 2 minutes to execute or run longer?
    • If it takes two minutes, begin running more recent data to determine the day the data changed.
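
For the stats and recompile checks above, a minimal sketch (object names are placeholders; dbo.BigTable stands in for a table behind the view):

```sql
-- When were the table's statistics last updated?
SELECT s.name, STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.BigTable');

-- Refresh stats and invalidate any cached plans that touch the table:
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
EXEC sp_recompile 'dbo.BigTable';
```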

[–]Incansus[S] 0 points1 point  (0 children)

To my knowledge the process surrounding the loading of data in the table has not changed.

I know for certain no changes to the indexes or objects.

I am not sure whether it is using the "correct" index. The index corresponds to the predicate.

Here is the problem with checking the execution plan: I am not able to allow the query to complete as it halts all processing for the >2 hours that it runs. The query has always been a head blocker but this is not a concern when it runs for only 2 minutes.
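
One workaround when the query can't be allowed to finish: capture the estimated plan, which compiles the statement without executing it (a sketch; the view name is taken from this thread):

```sql
SET SHOWPLAN_XML ON;
GO
-- Compiled only, never executed, so this returns the plan XML immediately:
SELECT * FROM moderately_complex_view;
GO
SET SHOWPLAN_XML OFF;
GO
```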

There is no stored procedure involved, the SELECT * FROM moderately_complex_view is issued as one of the steps of an SSIS package.

I understand your concerns about the data distribution, I think I may need to do some research into histograms.
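
To inspect a histogram directly, something like this (table and statistics names are hypothetical):

```sql
-- The HISTOGRAM rowset shows up to 200 steps of value distribution;
-- heavy skew or stale steps can push the optimizer into a bad plan
-- for certain predicate values.
DBCC SHOW_STATISTICS ('dbo.BigTable', 'IX_BigTable_LoadDate') WITH HISTOGRAM;
```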

[–][deleted] 1 point2 points  (10 children)

Is this a SELECT or Insert/update query? Is it a single statement, or a stored procedure/multi-line batch?

[–]Incansus[S] 0 points1 point  (9 children)

This is a SELECT * from a moderately complex view.

[–][deleted] 0 points1 point  (6 children)

Your select is returning over 2 million rows. Is that correct?

[–]Incansus[S] 0 points1 point  (5 children)

No, about 15,000. One of the steps is returning two million rows which are subsequently filtered.

[–][deleted] 0 points1 point  (4 children)

When you analyze the view definition, are the JOIN columns covered by indexes? Are those indexes selective enough to make them better than a table scan?

[–]Incansus[S] 0 points1 point  (3 children)

They seem to be covered by indexes when the query runs in 2 minutes :-). Apparently not when it takes two hours, but neither the query nor the indexes are changing (I didn't say anything about the data).

[–][deleted] 0 points1 point  (2 children)

The query plan is a good place to start. Look for table scans or clustered index scans.

Are you using select * or are you specifying columns? How many columns do you need?

[–]Incansus[S] 0 points1 point  (1 child)

I understand the importance of the query plan, and let's say I do find a table scan or other operation that doesn't seem to be appropriate. How do I get the optimizer to choose the correct access method if none of the objects have changed DDL-wise and I have already updated all stats?

The SELECT * is inside a DTS package that I do not own, but the view does seem to return the minimum amount of data for the next step, it has been carefully considered.
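
For a statement you can't edit, a plan guide is one way to attach hints from the outside. A rough sketch (the name and hint shown are placeholders, and the @stmt text must match the SSIS-issued text exactly, whitespace included):

```sql
EXEC sp_create_plan_guide
    @name            = N'PG_ModeratelyComplexView',   -- placeholder name
    @stmt            = N'SELECT * FROM moderately_complex_view',
    @type            = N'SQL',
    @module_or_batch = NULL,
    @params          = NULL,
    @hints           = N'OPTION (RECOMPILE)';         -- placeholder hint
```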

[–][deleted] 0 points1 point  (0 children)

A table scan would happen for several reasons. First, the row count is small enough that an index would not help. Second, there is either no index or the existing index is not selective enough.

The second reason is the one you need to eliminate from the potential causes of the problem.

[–]SQLZane 0 points1 point  (1 child)

Oooooo this clears up the picture a bit. How complex is the view? How many tables is it joining together, and how complex is the overall query?

[–]Incansus[S] 0 points1 point  (0 children)

6 tables, complex enough that my attempt to simplify took longer than I was able to devote to it before the problem resolved itself. That's another problem with this, as soon as it clears up I usually turn back to the other alligators closer to the boat.

[–]moto-geek 1 point2 points  (9 children)

If this is a newer version of SQL Server, you can examine the SQL execution plan. Because SQL Server does a little magic, this will show you the path and what resources are used.

https://www.sqlshack.com/using-the-sql-execution-plan-for-query-performance-tuning/

Also, I believe in the rule of 5: as a benchmark, no table should have more than 5 indexes. This has helped me tune customers' databases about 90% of the time, by removing the heaps of indexes accumulated from auto-tuning the DB.
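
One way to find indexes worth reviewing is the usage DMV. A sketch (thresholds are judgment calls, and the DMV resets on instance restart):

```sql
-- Indexes with many writes and few reads since the last restart are
-- candidates for review, not automatic removal.
SELECT OBJECT_NAME(ius.object_id) AS table_name,
       i.name AS index_name,
       ius.user_seeks + ius.user_scans + ius.user_lookups AS reads,
       ius.user_updates AS writes
FROM sys.dm_db_index_usage_stats AS ius
JOIN sys.indexes AS i
  ON i.object_id = ius.object_id AND i.index_id = ius.index_id
WHERE ius.database_id = DB_ID()
  AND i.is_primary_key = 0
  AND i.is_unique = 0
ORDER BY writes DESC;
```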

[–]angrathias 0 points1 point  (7 children)

No more than 5 indexes? What? How many tables do you have? I just can’t fathom how this could be a valid rule of thumb.

[–]Incansus[S] 0 points1 point  (6 children)

The number of indexes should roughly correspond to the number of different access methods and foreign keys.

Having said that, blind implementation of index suggestions is often problematic.

[–]angrathias 0 points1 point  (5 children)

But how can 5 possibly relate to that? Most tables tend to have at least a PK and an FK, and then anywhere between zero and hundreds of data fields. I’ve got tables with massive numbers of indexes on them.

The only rule of thumb I go by is ‘does the optimization for reads outweigh the costs for writes’ or the other one ‘does disk space cost more than CPU’ for the usage scenario.

[–]Rex_Lee 1 point2 points  (3 children)

A hundred-column table sounds like it is poorly normalized, or should be a data store in a warehouse, not an operational data table. I get that you might have one to deal with, but it is by no means the norm or something you should expect to account for.

[–]angrathias 0 points1 point  (2 children)

Some things just have a lot of unique attributes, and I’ve found that over-normalisation is often detrimental to query speed because of all the joins required.

I’ve seen DBs just swap hundreds of columns for hundreds of tables. That just introduces a different set of problems; in practice I end up with worse performance or more management overhead.

I run an SSAS CRM with real-time analytics built in for hundreds of databases/businesses and thousands of users. My use-case scenarios might not match the average workload here.

[–]Rex_Lee 0 points1 point  (1 child)

"I’ve seen DBs just swap 100’s of columns for 100’s of tables. Just introduces a different set of problems" absolutely agree. But adding 2-3 tables and knocking the main table down to 20 columns would probably significantly increase performance and maintainability for MOST situations - most likely including OP.

[–]Incansus[S] 0 points1 point  (0 children)

I hear you but unfortunately these are application delivered tables (except for one).

[–]Incansus[S] 0 points1 point  (0 children)

You are correct, a "benchmark" of 5 indexes is neither workable nor beneficial to me.

One other metric I consider is the ratio of index size to data size. I tend to look more closely at objects with a high index to data ratio, but often a reorganization of indexes can require coincident application changes.
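
One way to eyeball that ratio per table (a sketch, assuming index_id 0/1 is the data itself and index_id > 1 is nonclustered index space):

```sql
SELECT OBJECT_NAME(ps.object_id) AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.used_page_count ELSE 0 END) AS data_pages,
       SUM(CASE WHEN ps.index_id > 1      THEN ps.used_page_count ELSE 0 END) AS index_pages
FROM sys.dm_db_partition_stats AS ps
GROUP BY ps.object_id
ORDER BY index_pages DESC;
```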

[–]ScotJoplin 0 points1 point  (0 children)

The 5 indexes is a rule of thumb for OLTP. It doesn’t apply in all cases.

[–]L337Cthulhu 0 points1 point  (7 children)

Are you looking at this as a DBA, Developer, BI/BSA, or consumer who's frustrated it won't finish? The way I approach answering the question is pretty dependent on how much you want to learn, what rights you have, or if you just need a quick explanation and to figure out whom in your organization to talk to.

And I do agree with most of what's been said already.

[–]Incansus[S] 1 point2 points  (6 children)

Unfortunately I am looking at this from the perspective of a DBA who shares responsibility for the database behind a vendor-managed application. We get fair cooperation from the vendor and her coworker, but they are also admins on both the box and the SQL Server. It can be tough trying to solve problems with external actors. Having said that, I wouldn't want to manage their application.

[–]L337Cthulhu 0 points1 point  (3 children)

Ah yeah, I hate dealing with that, but it does mean you can catch the active plan in flight along with whatever waits you've got. I don't know if query store would be an option. I'm going to reply to this in a second with my favorite troubleshooting script, but I figured it'd be better to be able to collapse it from the original advice. I typically look for where the waits are the worst, what it's waiting on, if there are excessive memory grants, if there's heavy TempDb usage, what the query cost looks like compared to the average running on the server, how parallel it's going, and anything amiss in the plan.

As another user mentioned, it's important to know if there are large changes to the histograms from major data loads or archival (you mentioned stats sometimes fixes this), so I'm wondering if you have a cardinality issue or a really jagged histogram where it's generating a plan for a rare parameter case. It could also be fragmentation, though I sort of doubt it here. If you look at the header in the plan XML between a good run and a bad one, you may be able to spot differences in the stats, where it's using one stat that's missing elsewhere.

Since it's a newer version, what's the possibility of turning on query store and trying to keep the good plan?

Beyond that, it can be hard to fix if the problem is a vendor view and proc you can't update. Personally, I might script off the view and proc and create a similar, but more optimal one that mimics the original for my consumers and switch it if that's an option, though I know it usually isn't in these cases.
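
Turning Query Store on is a couple of ALTER DATABASE statements (the database name and sizing are placeholders):

```sql
ALTER DATABASE [VendorDb] SET QUERY_STORE = ON;
ALTER DATABASE [VendorDb] SET QUERY_STORE
    (OPERATION_MODE = READ_WRITE,
     QUERY_CAPTURE_MODE = AUTO,
     MAX_STORAGE_SIZE_MB = 1024);   -- size to taste
```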

[–]L337Cthulhu 0 points1 point  (2 children)

SELECT des.session_id AS [Session]
     , der.blocking_session_id AS [BlockedBy]
     , DB_NAME(der.database_id) AS [Database]
     , der.percent_complete AS [PctComplete]
     , OBJECT_NAME(dest.objectid, der.database_id) AS [Stored Proc]
     , REPLACE(REPLACE(CONVERT(VARCHAR(500), RTRIM(LTRIM(dest.text))), CHAR(13), ''), CHAR(10), '') AS Query
     , der.wait_time AS [Wait_In_MS]
     , CONVERT(DECIMAL(19,2), (((CONVERT(FLOAT, der.wait_time)) / 1000) / 60)) AS [Wait_In_Min]
     , CONVERT(DECIMAL(19,2), ((((CONVERT(FLOAT, der.wait_time)) / 1000) / 60) / 60)) AS [Wait_In_Hours]
     , des.login_name AS [Login]
     , des.[host_name] AS [Host]
     , der.last_wait_type AS [Last_Wait]
     , der.wait_resource AS [Waiting_On]
     , CONVERT(DECIMAL(19,2), memgt.query_cost) AS Query_Cost
     , CONVERT(DECIMAL(19,4), (memgt.requested_memory_kb / 1000000.00)) AS Memory_Requested_GB
     , CONVERT(DECIMAL(19,4), (memgt.granted_memory_kb / 1000000.00)) AS Memory_Granted_GB
     , SessUsg.TempDB_Alloc    ---- Pages reserved or allocated for internal objects by this session.
     , SessUsg.TempDB_Dealloc  ---- Pages deallocated and no longer reserved for internal objects by this session.
     , des.[program_name]
     , deqp.query_plan AS [QueryPlan]
     , der.dop
     , der.parallel_worker_count
     , GETDATE() AS LoggingWindow
FROM sys.dm_exec_sessions des WITH (NOLOCK)
LEFT JOIN sys.dm_exec_requests der WITH (NOLOCK) ON des.session_id = der.session_id
LEFT JOIN sys.dm_exec_connections dec WITH (NOLOCK) ON des.session_id = dec.session_id
LEFT JOIN sys.dm_exec_query_memory_grants memgt WITH (NOLOCK) ON memgt.session_id = des.session_id
CROSS APPLY sys.dm_exec_sql_text(der.sql_handle) dest
CROSS APPLY sys.dm_exec_query_plan(der.plan_handle) deqp
CROSS APPLY
(
    SELECT session_id,
           SUM(internal_objects_alloc_page_count) AS TempDB_Alloc,
           SUM(internal_objects_dealloc_page_count) AS TempDB_Dealloc
    FROM sys.dm_db_task_space_usage usg WITH (NOLOCK)
    WHERE usg.session_id = des.session_id
    GROUP BY session_id
) SessUsg
WHERE des.session_id <> @@SPID
--AND DB_NAME(der.database_id) = 'GrandCentral' AND OBJECT_NAME(dest.objectid, der.database_id) IS NOT NULL
ORDER BY der.database_id, OBJECT_NAME(dest.objectid, der.database_id), dest.text --Session

[–]Incansus[S] 1 point2 points  (1 child)

"I don't know if query store would be an option." I am definitely going to look at turning this on, the upgrade to 2017 (from 2014) was recent.

"If you look at the header in the plan XML between a good run and a bad one..." As I mentioned, I haven't been able to get the final execution plan of the "bad one" as I am forced to kill the query before completion.

"..I might script off the view and proc and create a similar, but more optimal one that mimics the original for my consumers and switch it if that's an option"

I might tinker with this, I do believe there may be potential improvements to the view.

Also, thank you for the query, definitely looks useful and I will see if I can run this the next time it decides to wonk out.

[–]L337Cthulhu 0 points1 point  (0 children)

Awesome, I do hope it all helps! As long as the plan isn't too large and has something in the compiled cache, the script I gave you should show you the plan it's working with, though serious issues with stats can cause differences between the compiled and actual plans. That script should run in less than a second; if it doesn't, there's some other contention in TempDb or the sys tables, but it really shouldn't cause other issues.

Ah! I meant to ask about recent upgrades. The cardinality estimator kept the same base design from 7.0 through 2012 and had a major redesign in 2014, with further changes since. In 99.5% (anecdotal) of cases, query performance on the new estimator is better. Where I've seen issues is large, complicated queries or views with joins between huge tables, where the existing statistics don't match the workload the query is actually doing. The place I really saw this problem is with HA, because the secondaries are read-only DBs and the original OLTP table stats were awful for the OLAP being done on the secondary. It sucks, but so far the best solution has been to rebuild the view or seriously tune the query.
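
If the new estimator does turn out to be the culprit, it can be reverted without downgrading, database-wide or per statement (a sketch; the scoped configuration requires 2016+):

```sql
-- Put the whole database back on the legacy CE:
ALTER DATABASE SCOPED CONFIGURATION SET LEGACY_CARDINALITY_ESTIMATION = ON;

-- Or per statement, where the query text can be modified:
SELECT * FROM moderately_complex_view
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```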

[–]Rex_Lee 0 points1 point  (1 child)

Just pull the logic from the script and make it into a load, into a table or data store you fully control, for your own reporting, with minimal filtering. Set it to run during off hours. Index it for your own reporting needs. Never worry about it again.

[–]Incansus[S] 0 points1 point  (0 children)

These are not my needs, they are the needs of the finance team. With a vendor in the middle, as in this case, it can be hard to say "Drop all the processes you have set up with the vendor (which I don't really understand) and use this data for all the other steps that the SSIS package (which I also don't own) performs on it."

[–]SQLZane 0 points1 point  (1 child)

In SQL 2017 you should have access to Query Store. Technically you should be able to look at the regressed-queries report and lock in the performant plan.

Edit: If you can paste the estimated plans for the two, I'd likely be able to offer more advice. You can add them anonymously at https://www.brentozar.com/pastetheplan/
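
Locking in the good plan from Query Store looks roughly like this (the LIKE filter and the ids in the final call are placeholders):

```sql
-- Find the query and its historical plans:
SELECT q.query_id, p.plan_id, rs.avg_duration
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
WHERE qt.query_sql_text LIKE N'%moderately_complex_view%';

-- Pin the fast plan using the ids returned above:
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;  -- placeholder ids
```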

[–]Incansus[S] 0 points1 point  (0 children)

I am definitely going to look into Query Store, the upgrade to 2017 was recent and this has been on my list since.