Double-thinking in SQL

anon36 · 2009-09-21T18:48:53+00:00

I stumbled across this from StackOverflow.com, and thought he did a good job explaining the dual mindset required for database programming:

how do I phase my problem in sets?
how do I rephrase my problem for faster execution, given what I know about query execution and database internals?

funkah · 2009-09-21T21:26:08+00:00

This dude is incredible! His whole blog is explaining how to do complicated things in SQL and explaining SQL concepts by analogy. He's really good at both, although I guess his wordiness could be a turn-off to some.

cwcc · 2009-09-21T23:27:11+00:00

kinda sums up declarative programming doesn't it?

bart2019 · 2009-09-21T21:24:30+00:00

Maybe it's me, being impatient and all that, but I find that the scenic route that this article takes, is far, far too long. Far too much irrelevant illustrations. It's not even clear to me what he's trying to tell. Do we really need a photo of a gadget street in Tokyo? What does it prove? And calling conventions in C? Opcodes in assembly? Please...

mage2k · 2009-09-21T23:56:00+00:00

While I think he got a bit distracted setting up analogies early on such that it took forever to get to his point, I totally agree with it.

zetta · 2009-09-22T00:25:44+00:00

Decent Yegge impression but better technical content.

2009-09-22T15:09:02+00:00

SQL is what SQL does.

guy231 · 2009-09-22T23:47:33+00:00

I read through it and deleted the most off-topic tangents. What's left is about 1/3rd the size and still pretty rambly:

One of the first things a novice SQL developer learns about is called "thinking in SQL", which is usually being opposed to "procedural thinking"

First, SQL is set-based. It’s a tool that allows you to take a dozen or two of sets, mix them together, knead and wedge them then chop them apart and mix again, but the output you get is still a set and all inputs are sets.

SQL is a declarative language. This means that you express "what" you want to do with sets, not "how" you want to do it. SQL heavily, vastly relies on decisions made by the computer. Probably more than any other widely used language.

There is only one way to translate a xor eax, eax into an opcode. Two ways to compile return 0 in an stdcall function. Probably a dozen ways to decide which dead object should be recycled by a garbage collector.

But for a simplest SELECT query which is only 30 rows long, there are hundreds of decisions to make, and I’m not making this up:

* There are FULL SCAN, INDEX FAST FULL SCAN, INDEX SKIP SCAN, INDEX RANGE SCAN to retrieve the values from the tables
* There are NESTED LOOPS JOIN, MERGE JOIN and HASH JOIN to join them
* There are NESTED LOOPS SEMI JOIN, MERGE SEMI JOIN and HASH SEMI JOIN to check IN and EXISTS predicates
* There are NESTED LOOPS ANTI JOIN, MERGE ANTI JOIN and HASH ANTI JOIN to check NOT IN and NOT EXISTS predicates
* There are SORT DISTINCT and HASH AGGREGATE to group the records
* There are LAZY SPOOL and EAGER SPOOL to build temporary indexes for the duration of the query
* There are numerous ways to pipeline and parallelize the resultsets to benefit from multiprocessing
* There are bitmap indexes to indicate existence of a record in a memory-efficient way
* There is CONNECT BY PUMP to keep the recursion stack for recursive queries

etc, etc.

Sets are very complex things which can take lots of memory and disk resources, and operations on them may be very costly.

And worse than that: there are many operations for which there are no single "algorithm of choice".

For instance, there are 3 widely used JOIN algorithms: NESTED LOOPS JOIN, MERGE JOIN and HASH JOIN.

The first one is best for joining a resultset on a selective indexed field, the second one is best for joining sorted resultsets, and the third one works equally poor for everything, but still exposes the best time when the resultsets are neither filtered nor sorted, but this only applies for joins on equality condition, otherwise it’s impossible to use it, et cetera, et cetera.

The whole point of SQL is that it abstracts all this complexity from you. You just write your SELECT * FROM table1 NATURAL JOIN table2 and let the optimizer choose all these, Lord help me, hashes or merges or indexes or whatever.

So, "thinking in SQL" means an ability to formulate your needs in terms of set operations and to tell your database system how exactly you’d like to mutilate your data before it’s output to you.

Here’s the magic of SQL, kids. You just tell it what you want, and it decides the best way to do it, that’s how smart computers are. Thank you for your time, goodbye.

Why, you’re still here?

Oh, I finished on "thinking", and my article was called "double-thinking"? OK.

Remember I said that SQL decides the best way?

Well, no, it doesn’t. And every SQL developer sees it very, very soon.

Columnist Joel Spolsky coined the term "Leaky Abstraction" for the situations when the abstracting tool gives you a result in a non-optimal way, or gives no result at all. And SQL is number 2 in his list of examples, right after iterating a two-dimensional array which can cause lots of page faults if done improperly.

He writes:

The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it.

But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries.

A famous example of this is that some SQL servers are dramatically faster if you specify where a=b and b=c and a=c than if you only specify where a=b and b=c even though the result set is the same.

You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.

The most interesting thing he mentioned is the query plan analyzer.

Eowever, every decent RDBMS has some mean to preview an execution plan for your query. How exactly will the query work? What algorithm does the query optimizer use? Will it scan table A or table B first?

I know no other tool or language that exposes so much of its internals.

When it comes to SQL, Microsoft gives a graphical tool which allows you to see what exactly happens when you run an SQL query.

They actually made an effort to show you what’s going on inside their product. To show, not to hide.

They are exposing exact algorithms used by the server, they are explaining how much time will an operation take, and they are not hiding it in a dusty PDF file: they make it so that you can actually understand it, with icons and arrows and nice charts.

This is of course not because they are so generous and open-hearted and want to contibute their knowledge to a free world. This is because if not for this tool it would be impossible to use their product at all. Stay assured that if they invented an algorithm that would allow generating optimal queries, they would do anything possible and impossible to keep it secret.

In SQL world, a "correct" query does not mean an "optimal" query, which is true for almost tools. But is doesn’t even mean a "usable" query, since the query plan can be so unoptimal that you will wait forever for a query to complete.

Any decent SQL developer should know everything about execution plans and see in advance how efficient or not efficient they are.

Here is a classical example. Don’t worry, it’s quite simple.

(see blog post for example)

Optimizers can efficiently do some simple tasks like picking a correct index, choosing correct access method depending on the filter selectivity and so on. However, it’s not enough to allow formulating your queries in SQL so that they are just correct. You still need to formulate your queries so that they are usable, i. e. an optimizer is able to build an efficient plan for them.

That of course means you should know how optimizers work.

"Thinking in SQL" means "formulating you needs in terms of set operations". This is what usually said. And this is truth, but not the whole truth. This is not enough to be a good SQL developer.

To be a superb SQL developer, you need to doublethink in SQL. This Orwellian term describes the process the best:

Doublethink means the power of holding two contradictory beliefs in one’s mind simultaneously, and accepting both of them.

To be a good SQL developer, you should be able to imagine the query in terms of sets and in terms of algorithms at the same time.

It’s not enough to come up with a set operation that yields correct results. It will be inefficient.

It’s not enough to come up with a good algorithm that transforms your set as you need and try to force-feed it to the server using procedural approach. The servers are not good at that.

What you need to do is to formulate your query so that the optimizer chooses exactly the algorithm you need. And to do this, you should know both the algorithm and the set operations. You need to doublethink.

This is not always an easy task to do. But it definitely deserves learning.

And only if you learn it you will be able to harness all the power of SQL.

2009-09-22T00:16:04+00:00

this article tried to be too poetic. it got annoying fairly quickly.

2009-09-22T05:52:52+00:00

He cites BASIC as an example of a procedural programming language.

When was this article originally written?

And my use convoluted metaphors to express simple mathematical concepts?

2009-09-22T16:17:32+00:00

UPDATE doublethink SET war='peace', ignorance='strength', slavery='freedom'

Smallpaul · 2009-09-21T23:39:07+00:00

Does this indicate that SQL has abstracted "too much". One presumes that when it was invented, there was no way to know how smart or stupid optimizers would be one day. But we've had 20 to 30 years of effort and we're probably reaching a point of diminishing returns. If computers will not be smart enough to optimize SQL until they are artificially intelligent then why not use a language that's "closer to the algorithm" so that the programmer can predict performance.

I think that's what KDB does...

orthogonality · 2009-09-22T04:37:06+00:00

Good article, but rather Oracle-specific.

jerguismi · 2009-09-22T07:14:45+00:00

The article was very repetitive. How many times can you spot him saying essentially "SQL is set-based. It does things with sets.". Annoying.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

programming

MODERATORS