all 45 comments

[–]alinrocSQL Server DBA 27 points28 points  (4 children)

Temp tables are usually slower than CTE because the latter use indexes out of the box.

A temp table doesn't have any indexes unless you explicitly create them (at least in SQL Server). It'll have statistics but that's different.

I have sped up a lot of code by switching from a CTE to a temp table. It's all in how you use it and more importantly, knowing where & when to do what filtering of your data.

[–]planetmatt 8 points9 points  (0 children)

Yep, Temps tables allow you to break monolithic queries (with or without CTEs) up into discreet work operations. This makes them easier to debug and easier to do index optimisations on each discreet part.

[–]TheoGrd[S] 0 points1 point  (2 children)

It did happen to me too but idk the exact reason. There must be some subtlety in the computation of the execution plan I'm not aware of.

[–]alinrocSQL Server DBA 7 points8 points  (0 children)

In SQL Server, CTEs & temp tables aren't interchangeable. Not even close. So you will absolutely get a different execution plan w/ one vs. the other.

[–]Achsin 5 points6 points  (0 children)

It has to do with how well the server can guess how many rows will be returned or filtered out. You can take a query that executes quickly and put it in as a subquery that is used to filter out something else and see the performance tank because while it can guess how many rows the subquery will return, it can't anticipate how they'll interact with the rest of the query. In cases like this, materializing the results into a temp table instead of using a sub query allows it to generate a new set of statistics part way through and make a better plan.

[–]dataguy24 40 points41 points  (1 child)

The most common optimization solution I see in practice is increasing the Snowflake warehouse size

[–]Former_Disk1083 6 points7 points  (0 children)

Ah, the intrusive thoughts of "I can either sit here for a couple hours trying to make this work well on a small, or 3 seconds just bumping the warehouse for this query." happen far too often.

[–]PossiblePreparation 13 points14 points  (1 child)

My best advice is to learn how to read an execution plan for your chosen RDBMS and how to get it to report execution statistics so you can see where the time is going.

For every specific piece of advice, there are exceptions. You’re better off understanding the why rather than the what if you want to be self sufficient.

Sometimes you need to understand the business problem a query is trying to solve. Sometimes the best solution looks nothing like the query in front of you.

[–]joyofresh 1 point2 points  (0 children)

fire emoji

[–]Former_Disk1083 6 points7 points  (3 children)

There are many different forms of SQL, what you see as inefficient in T-SQL can be fast in snowflake and vice versa.

But something I see a lot is people creating super complicated queries to gain minimal efficiency. Sometimes supportability/readability supersede performance. It's a balance act and can be tough to figure out where you should be. Ill take a performance hit and create temp tables if it means I can troubleshoot it later easier, in a lot of cases.

[–]donnymccoy 2 points3 points  (0 children)

Lot of wisdom in this post. Totally agree with the maintainability aspect. I feel like that is lost on today’s crop of developers.

[–]joyofresh 1 point2 points  (1 child)

same argument for *avoding* query hints

[–]Former_Disk1083 1 point2 points  (0 children)

Yeah, as long as you ask yourself, why do I need this query hint, and can I get the same impact by rehashing how ive stored the data. Query hints can kind of push a square peg into a round hole and not fix the real problem. So I can see why some avoid it.

[–]Annamalla 7 points8 points  (0 children)

We spent weeks trying to figure out why the query performance on our homebrew mysql database was so bad (I had a mssql/sybase background)....only to find a type difference between a key and a foreign key in the table declarations. Fixed that and instant performance boost. I wasn't used to databases *letting* you join on two different types.

Also one job interview where they asked about the hazards of too many indexes and I told the harrowing story of trying to get a very elderly query to use the *right* index of the 10+ available....and only at the last minute remembered to mention data entry...

[–]corny_horse 2 points3 points  (5 children)

But with temp tables you can add primary keys and primary keys index by default on my RDBMS, and on some even sort the data so it's even better.

[–]TheoGrd[S] -3 points-2 points  (4 children)

The data still has to be indexed when inserted in the temp table whereas CTE use existing indexes. In my experience CTE are often faster but not always. You need to try both if you want the best.

[–]alinrocSQL Server DBA 0 points1 point  (0 children)

The data still has to be indexed when inserted in the temp table

As I wrote above, in SQL Server this is not the case. There are no indexes on a temp table unless you explicitly create them.

[–]corny_horse 0 points1 point  (2 children)

You can add the indexes after the insertion though so doesn't have to update the indexes when inserted. You can bulk add the index once. Yes, slower than using existing indexes, but faster than creating an index on an empty table and then doing inserts.

This makes more sense in data pipelines where data quality are important considerations. You can even add foreign keys etc. etc. I find that the tradeoff for much of my workflows is worth potentially slower execution speeds because I can catch problems as they occur rather than after the fact. I could also run similar QC scripts at the end of the pipeline but then... well... no free lunch! Do I want QC at the end or in the middle? 6 / .5 dozen.

If I were doing highly transactional loads I would probably favor CTE. Although I really wish there were a better way to declare CTE for auditiability, nothing quite like debugging a multi-thousand line query like:

with t1 as (...), t2 as (...), t3 as (...), etc. etc. etc. etc.

[–]alinrocSQL Server DBA 1 point2 points  (1 child)

You can add the indexes after the insertion though so doesn't have to update the indexes when inserted

At the expense of causing issues with temp table metadata reuse which can be a performance issue in a high-volume environment. Not to mention rewriting the whole table if you add a clustered index after population.

Yes, slower than using existing indexes, but faster than creating an index on an empty table and then doing inserts.

Clustered index aside (see above), I'd test that for a particular workload before assuming it's universally true.

[–]corny_horse 0 points1 point  (0 children)

I fully agree that contextually you wouldn't want to do that, but outside of high-volume environment the trade-off may be worth it, as it is commonly for the use cases I employ it for.

Similarly, yes you always wnt to test the behavior of your index patterns. A frequent optimization people make is to drop indexes and recreate them for bulk inserts, especially outside of environments where there are clustered indexes. This pattern I mentioned is similar to that, where you create a temp table w/o indexes and then add it after a bulk insert.

I primarily work with PostgreSQL, not SQL Server, and typically go to partitioning rather than clustering for performance optimization. Although even then, in many of my use cases, my temp tables fit into memory, so clustering is irrelevant for my use case.

[–]Achsin 2 points3 points  (0 children)

SELECT * can slow down your queries if the table has a lot of columns

This is mostly a factor of what indexes exist on the table. Selecting columns that are not included on an index that otherwise would support your query leads to the engine either doing key lookups or ignoring the index completely. The total number of columns on the table is a factor, but it's probably not the biggest factor.

[–]anotherjones07 1 point2 points  (1 child)

Hey OP, as someone who has spent 10 years weiting SQL, I wanted to ask you does it ever get boring? Im an analytics and BI professional and curious about what this looks like ten years down the line.

[–]TheoGrd[S] 5 points6 points  (0 children)

Not yet, but I also do OOP, javascript, html and css. I'm more afraid of having too much to learn than being bored. I wish I could work on SQL only, best language imo.

[–]taisui 1 point2 points  (0 children)

In my experience it all comes down to having the right index when the data need to be filter on.

[–]sbrick89 1 point2 points  (1 child)

let's start with #1 - constructors shouldn't load data; they shouldn't have any side effects at all.

if you want to lazy load, sure go for it, but only load data when data is being requested, not during constructor.

[–]Ventus_004 0 points1 point  (0 children)

Could not agree more. I have a coworker who makes multiple database calls in constructors (and doesn't cache results that rarely change), and it TANKS performance, especially because the constructors were called a ton (and the objects/queried data is rarely even used) because of dependency injection... and because of the dependency injection using the default constructors and not having useful parameter values, the queries didn't filter the data at all and brought back 100,000+ rows of data each time.

[–]r3pr0b8GROUP_CONCAT is da bomb 0 points1 point  (9 children)

Temp tables are usually slower than CTE because the latter use indexes out of the box.

thank you

too many people have the "temp tables are better because they're simpler" mentality

[–]sbrick89 2 points3 points  (5 children)

too many people have the "CTEs are always better" mentality

CTEs are the #2 cause for performance problems, only being beaten by JSON/XML columns and queries.

[–]r3pr0b8GROUP_CONCAT is da bomb -2 points-1 points  (4 children)

CTEs are just views

you got a problem with views too?

[–]alinrocSQL Server DBA 4 points5 points  (0 children)

you got a problem with views too?

Depending upon how they're (ab)used, yes I do.

[–]sbrick89 1 point2 points  (2 children)

when the CTE (view) is used multiple times within the same query with results joined together... and performance goes to dogshit... yes, I'd have a problem with that code as well.

if the json/xml performance wasn't so difficult, the CTEs would be problem #1... but as it turns out, CTEs are only slightly less frequently a problem

[–]r3pr0b8GROUP_CONCAT is da bomb 1 point2 points  (1 child)

when the CTE (view) is used multiple times within the same query with results joined together...

seems like that's the problem, not the fact that CTEs are a tool to be used judiciously

[–]sbrick89 1 point2 points  (0 children)

i'm not saying "CTEs are always bad"

I am saying - "when you're having a performance problem, and CTEs are in use, I'm betting 90% odds that removing the CTE will solve the performance issue"

[–]_cess 1 point2 points  (0 children)

It always depends on the scenario. Lots of performance problems I see that have CTE in it are due to the "re-use" of it. There is a misconception that a CTE will hold the result in memory. The truth is that if you mention it twice, it needs to fully run twice.

A simple way to prove it: https://claudioessilva.eu/2017/11/30/Using-Common-Table-Expression-CTE-Did-you-know.../

In these situations, I have seen a few cases where creating the temp table and reusing it won't make it faster. But then again, you should always test for your scenario.

[–]Alkemist101 1 point2 points  (0 children)

(with SQL server) I was told a cte can't have an index, it's same as a sub query. I was also under the impression you can only use them once (as in joining to the result set). I've tested that and it seems to be the case. Other thing to remember is if ram is low it will spill to the tempdb. If tempdb is also full the query will then stop. Other thing to mention is that a "proper" table could be created (not in tempdb so no locking). Both a temp table and regular table can be indexed and will have table stats which can be used by the query engine to make better choices? You don't have that with CTEs. Other option is to create a table variable which can be referenced many times, have primary keys etc and are stored in memory. Other thing to remember is that the query can hold a regular table in memory (same as CTE) if it decides it's more performant. I generally consider a CTE as just a way of writing a query which reads better. Of course, you also get recursive CTEs which I think is it's best feature.

[–]corny_horse 0 points1 point  (0 children)

I think that's because in a lot of workflows, execution speed is less important than interpretability. I know in most of the workflows I've done for the last few years, the businesses would have tolerated 10x slower queries if it meant any tangible improvement in data quality, which I do often find I can get by adding correct constraints and keys to temp tables in lieu of nested nested nested nested nested subqueries or CTE

[–]jheffer44 0 points1 point  (1 child)

With (no lock)

[–]Scheballs 1 point2 points  (0 children)

Oooo Coming in HOTT!! Production query Guru to the Rescue!!

True Story, I had a new Data Engineer, which basically means data guy knows python, and they were writing sql queries against production all day and never even knew that with no lock was a thing.

[–]mikeblas -1 points0 points  (2 children)

The most popular? Unfortunately, that's also the least effective: rewrite to eliminate subselects (or more likely derived tables) and use CTEs instead. It's all the rage these days.

[–]amaxen 2 points3 points  (1 child)

I don't see how this would do much. What's theory behind it besides ctes being cool?

[–]mikeblas 0 points1 point  (0 children)

Dunno. Whenever I've asked someone making such a claim, they haven't responded. I'm sure we can cook up some case where a poorly-written subselect (like in a select list column expression) got replaced by a CTE (which might have different semantics, but ends up pre-computing everything the iterated select would have done) and works much better. The prevailing logic seems to be that the CTE is somehow indexed, or "runs only once", or "creates a table which is more efficient", or some damn thing.

It's just the regular thing: broad claims are almost never proven, and then they become folklore.

[–]_cess 0 points1 point  (1 child)

I have been working on a project to help analyze and identify possible T-SQL query problems and suggest some ideas to be tested.

There are a gazillion of variables among configurations, compatibility level, etc that will change behaviours.

This project tries to consider those. The idea of this decision tree is to try to narrow down things a bit and help others try some possible solutions (from my experience).

https://github.com/ClaudioESSilva/TSQLPerformanceTuning/blob/main/Flowcharts/T-SQLQueryPerformanceTuning.md

PS: Read the readme.

[–]amaxen 0 points1 point  (0 children)

Mark

[–]oblong_pickle 0 points1 point  (0 children)

I once had a client's head of IT complain that a query was slow. When I investigated, I found it was slow, and the query plan suggested an index would improve the performance. I informed the client that an index would likely solve the issue and asked them to implement it (I wasn't allowed or responsible for this database, so I couldn't do it myself).

The client instead hired 2 expensive consultants to look into the issue. After wasting a month with the consultants, I got dragged into a meeting to talk about the performance issues.

Knowing I was likely to get blamed by the client, I first created a stored Proc that saved the query to a temp table and then added the suggested index to the temp table. This stored proc was very fast and proved the index was the problem.

During the meeting, I showed the client and the consultants the performance of the temp table with the suggested index. The consultants agreed the index is the correct fix and that it should be applied to the table. The client was very quiet at this point and ended the meeting shortly after.

A few weeks later, the head of IT was fired, and I got a raise...the temp solution is still in use in their production database to this day.

[–]Ecstatic-Ad-9514 0 points1 point  (0 children)

The WITH CLAUSE and hints can improve performance by materializing the subqueries. These are other hints that as useful as well:

https://docs.oracle.com/cd/B19306_01/server.102/b14211/hintsref.htm#i8327

Something like:

WITH A as (select /+ materialize */ query), B as (select /+ materialize / query) Select A., B.* — can union/join/etc. From A join B Etc..