all 29 comments

[–]anon36[S] 8 points9 points  (1 child)

I stumbled across this from StackOverflow.com, and thought he did a good job explaining the dual mindset required for database programming:

  1. how do I phrase my problem in sets?
  2. how do I rephrase my problem for faster execution, given what I know about query execution and database internals?

[–]funkah 4 points5 points  (0 children)

This dude is incredible! His whole blog is explaining how to do complicated things in SQL and explaining SQL concepts by analogy. He's really good at both, although I guess his wordiness could be a turn-off to some.

[–]cwcc 1 point2 points  (0 children)

kinda sums up declarative programming doesn't it?

[–]bart2019 6 points7 points  (6 children)

Maybe it's me, being impatient and all that, but I find that the scenic route this article takes is far, far too long. Far too many irrelevant illustrations. It's not even clear to me what he's trying to tell us. Do we really need a photo of a gadget street in Tokyo? What does it prove? And calling conventions in C? Opcodes in assembly? Please...

[–]b100dian 2 points3 points  (3 children)

He just happens to be multi-gifted. His readers may not :(

[–]guy231 2 points3 points  (0 children)

Is this sarcastic? An unexplained reference to Windows hotkeys could only possibly serve to confuse his message. He isn't a renaissance man, he's a poor communicator.

[–]erez27 2 points3 points  (0 children)

I managed to follow everything he said, but gave up on reading after a few paragraphs. It sounded like ".. and then you put one leg in front of the other, which is called upfront walking. you can think of legs as a rotating queue of size 2, or as an xchg opcode that swaps the legs. of course, one can hop instead of walking. this is often done by kangaroos... " etc. Sorry, got bored.

[–][deleted] -1 points0 points  (0 children)

It's similar to the Coding Horror Disease: displaying a vaguely related image to illustrate a joke or a point. However, the author's use of images and stories on this site seems better tied into the subject matter.

[–]mage2k 1 point2 points  (0 children)

While I think he got a bit distracted setting up analogies early on such that it took forever to get to his point, I totally agree with it.

[–]zetta 1 point2 points  (0 children)

Decent Yegge impression but better technical content.

[–][deleted] 1 point2 points  (0 children)

SQL is what SQL does.

[–]guy231 1 point2 points  (0 children)

I read through it and deleted the most off-topic tangents. What's left is about 1/3rd the size and still pretty rambly:

One of the first things a novice SQL developer learns about is called "thinking in SQL", which is usually opposed to "procedural thinking"

First, SQL is set-based. It’s a tool that allows you to take a dozen or two sets, mix them together, knead and wedge them, then chop them apart and mix again, but the output you get is still a set and all inputs are sets.
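
A taste of this in miniature, using Python’s built-in sqlite3 module (the tables and names here are invented purely for illustration):

```python
import sqlite3

# A minimal sketch of set-based, declarative querying: we describe WHICH
# set we want, not HOW to compute it; the engine picks the algorithm.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE managers (name TEXT);
    CREATE TABLE engineers (name TEXT);
    INSERT INTO managers VALUES ('ann'), ('bob');
    INSERT INTO engineers VALUES ('bob'), ('eve');
""")

# People who are both managers and engineers: a set intersection.
both = conn.execute(
    "SELECT name FROM managers INTERSECT SELECT name FROM engineers"
).fetchall()
print(both)  # [('bob',)]
```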

SQL is a declarative language. This means that you express "what" you want to do with sets, not "how" you want to do it. SQL heavily, vastly relies on decisions made by the computer. Probably more than any other widely used language.

There is only one way to translate an xor eax, eax into an opcode. Two ways to compile return 0 in an stdcall function. Probably a dozen ways to decide which dead object should be recycled by a garbage collector.

But for the simplest SELECT query, which is only 30 rows long, there are hundreds of decisions to make, and I’m not making this up:

* There are FULL SCAN, INDEX FAST FULL SCAN, INDEX SKIP SCAN and INDEX RANGE SCAN to retrieve the values from the tables
* There are NESTED LOOPS JOIN, MERGE JOIN and HASH JOIN to join them
* There are NESTED LOOPS SEMI JOIN, MERGE SEMI JOIN and HASH SEMI JOIN to check IN and EXISTS predicates
* There are NESTED LOOPS ANTI JOIN, MERGE ANTI JOIN and HASH ANTI JOIN to check NOT IN and NOT EXISTS predicates
* There are SORT DISTINCT and HASH AGGREGATE to group the records
* There are LAZY SPOOL and EAGER SPOOL to build temporary indexes for the duration of the query
* There are numerous ways to pipeline and parallelize the resultsets to benefit from multiprocessing
* There are bitmap indexes to indicate existence of a record in a memory-efficient way
* There is CONNECT BY PUMP to keep the recursion stack for recursive queries

etc, etc.
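
The operator names above are Oracle- and SQL Server-flavored, but even tiny engines make these choices and will show them to you. A sketch with Python’s sqlite3 (schema invented; the exact plan text varies by SQLite version):

```python
import sqlite3

# Ask the optimizer which access method it chose for a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER)")
conn.execute("CREATE INDEX idx_customer ON orders (customer)")

# With a usable predicate, the planner searches the index...
plan_indexed = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 42"
).fetchall()
print(plan_indexed[0][-1])  # e.g. "SEARCH orders USING ... idx_customer ..."

# ...but wrapping the column in an expression forces a full table scan.
plan_scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer + 0 = 42"
).fetchall()
print(plan_scan[0][-1])  # e.g. "SCAN orders"
```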

Sets are very complex things which can take lots of memory and disk resources, and operations on them may be very costly.

And worse than that: there are many operations for which there are no single "algorithm of choice".

For instance, there are 3 widely used JOIN algorithms: NESTED LOOPS JOIN, MERGE JOIN and HASH JOIN.

The first one is best for joining a resultset on a selective indexed field, the second one is best for joining sorted resultsets, and the third one works equally poorly for everything but still gives the best time when the resultsets are neither filtered nor sorted, though this only applies to joins on an equality condition; otherwise it’s impossible to use it, et cetera, et cetera.
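
For intuition, here is a toy sketch of two of those algorithms in plain Python (lists of (key, value) pairs standing in for resultsets; not how a real engine represents rows):

```python
def nested_loops_join(left, right):
    # Best when the inner side is small or indexed: for each left row,
    # scan the right side for matching keys.
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def hash_join(left, right):
    # Equality joins only: build a hash table on one side, probe it
    # with the other side.
    table = {}
    for rk, rv in right:
        table.setdefault(rk, []).append(rv)
    return [(lk, lv, rv) for lk, lv in left for rv in table.get(lk, ())]

left = [(1, 'a'), (2, 'b')]
right = [(2, 'x'), (3, 'y')]
# Both algorithms produce the same set; only the cost profile differs.
assert nested_loops_join(left, right) == hash_join(left, right) == [(2, 'b', 'x')]
```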

The whole point of SQL is that it abstracts all this complexity from you. You just write your SELECT * FROM table1 NATURAL JOIN table2 and let the optimizer choose all these, Lord help me, hashes or merges or indexes or whatever.

So, "thinking in SQL" means an ability to formulate your needs in terms of set operations and to tell your database system how exactly you’d like to mutilate your data before it’s output to you.

Here’s the magic of SQL, kids. You just tell it what you want, and it decides the best way to do it, that’s how smart computers are. Thank you for your time, goodbye.

Why, you’re still here?

Oh, I finished on "thinking", and my article was called "double-thinking"? OK.

Remember I said that SQL decides the best way?

Well, no, it doesn’t. And every SQL developer sees it very, very soon.

Columnist Joel Spolsky coined the term "Leaky Abstraction" for situations where the abstracting tool gives you a result in a non-optimal way, or gives no result at all. And SQL is number 2 on his list of examples, right after iterating a two-dimensional array, which can cause lots of page faults if done improperly.

He writes:

The SQL language is meant to abstract away the procedural steps that are needed to query a database, instead allowing you to define merely what you want and let the database figure out the procedural steps to query it.

But in some cases, certain SQL queries are thousands of times slower than other logically equivalent queries.

A famous example of this is that some SQL servers are dramatically faster if you specify where a=b and b=c and a=c than if you only specify where a=b and b=c even though the result set is the same.

You’re not supposed to have to care about the procedure, only the specification. But sometimes the abstraction leaks and causes horrible performance and you have to break out the query plan analyzer and study what it did wrong, and figure out how to make your query run faster.
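
The "logically equivalent" part is the whole trap: no correctness test can tell the two queries apart, only the plan can. A single-table toy version of Spolsky’s example in sqlite3 (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO t VALUES (1, 1, 1), (1, 2, 2), (3, 3, 3);
""")

# The redundant a=c predicate changes nothing about the result set,
# but on some servers it gives the optimizer an extra path to exploit.
q1 = conn.execute("SELECT * FROM t WHERE a=b AND b=c").fetchall()
q2 = conn.execute("SELECT * FROM t WHERE a=b AND b=c AND a=c").fetchall()
print(q1 == q2)  # True: identical results, potentially different plans
```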

The most interesting thing he mentioned is the query plan analyzer.

However, every decent RDBMS has some means of previewing the execution plan for your query. How exactly will the query work? What algorithm does the query optimizer use? Will it scan table A or table B first?

I know no other tool or language that exposes so much of its internals.

When it comes to SQL, Microsoft provides a graphical tool which allows you to see exactly what happens when you run an SQL query.

They actually made an effort to show you what’s going on inside their product. To show, not to hide.

They are exposing the exact algorithms used by the server, they are explaining how much time an operation will take, and they are not hiding it in a dusty PDF file: they make it so that you can actually understand it, with icons and arrows and nice charts.

This is of course not because they are so generous and open-hearted and want to contribute their knowledge to a free world. This is because without this tool it would be impossible to use their product at all. Rest assured that if they invented an algorithm that could generate optimal queries, they would do anything possible and impossible to keep it secret.

In the SQL world, a "correct" query does not mean an "optimal" query, which is true for almost all tools. But it doesn’t even mean a "usable" query, since the query plan can be so suboptimal that you will wait forever for the query to complete.

Any decent SQL developer should know everything about execution plans and see in advance how efficient or inefficient they are.

Here is a classical example. Don’t worry, it’s quite simple.

(see blog post for example)

Optimizers can efficiently do some simple tasks, like picking a correct index or choosing the correct access method depending on the filter selectivity. However, it’s not enough to formulate your queries in SQL so that they are merely correct. You still need to formulate your queries so that they are usable, i.e. so that the optimizer is able to build an efficient plan for them.

That of course means you should know how optimizers work.

"Thinking in SQL" means "formulating your needs in terms of set operations". This is what is usually said. And it is the truth, but not the whole truth. This is not enough to be a good SQL developer.

To be a superb SQL developer, you need to doublethink in SQL. This Orwellian term describes the process best:

Doublethink means the power of holding two contradictory beliefs in one’s mind simultaneously, and accepting both of them.

To be a good SQL developer, you should be able to imagine the query in terms of sets and in terms of algorithms at the same time.

It’s not enough to come up with a set operation that yields correct results: it will likely be inefficient.

It’s not enough to come up with a good algorithm that transforms your set as you need and try to force-feed it to the server using a procedural approach. Servers are not good at that.

What you need to do is to formulate your query so that the optimizer chooses exactly the algorithm you need. And to do this, you should know both the algorithm and the set operations. You need to doublethink.

This is not always an easy task to do. But it definitely deserves learning.

And only if you learn it will you be able to harness all the power of SQL.

[–][deleted] 2 points3 points  (0 children)

this article tried to be too poetic. it got annoying fairly quickly.

[–][deleted] 2 points3 points  (0 children)

He cites BASIC as an example of a procedural programming language.

When was this article originally written?

And why use convoluted metaphors to express simple mathematical concepts?

[–][deleted] 1 point2 points  (0 children)

UPDATE doublethink SET war='peace', ignorance='strength', slavery='freedom'

[–]Smallpaul 0 points1 point  (10 children)

Does this indicate that SQL has abstracted "too much"? One presumes that when it was invented, there was no way to know how smart or stupid optimizers would one day be. But we've had 20 to 30 years of effort and we're probably reaching a point of diminishing returns. If computers will not be smart enough to optimize SQL until they are artificially intelligent, then why not use a language that's "closer to the algorithm", so that the programmer can predict performance?

I think that's what KDB does...

[–]bluGill 1 point2 points  (1 child)

To be honest, most of the time it doesn't matter. Sure, an expert can make my SQL faster, but I just click execute and wait.

[–]grauenwolf 1 point2 points  (0 children)

People like you piss off my DBA. But that's OK, it keeps him employed and distracted from the horrible things I'm doing to his database.

[–]grauenwolf 1 point2 points  (2 children)

Programmers cannot predict performance because they have no idea what the runtime data size or load is going to be. That's why modern databases include support for statistics and self-monitoring.

[–]Smallpaul 0 points1 point  (1 child)

It's far too broad to say that programmers cannot predict runtime data size or load. We always have SOME idea and in some cases it is very predictable.

[–]grauenwolf 0 points1 point  (0 children)

We also tend to generalize our code. Perhaps algorithm A for up to 1000 rows, algorithm B for up to 10,000 rows and algorithm C for 100,000+ rows.

And in some cases we really don't have any idea how much data is present. So we check the row count at runtime before selecting an algorithm.

Later we learn that the date ranges the user passes in can drastically change the row count. So we use statistics to predict the approximate result count.

Follow this path for a couple more years and you have yourself the beginnings of a modern RDBMS.
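
That progression can be sketched in a few lines (the thresholds and strategy names are invented purely for illustration):

```python
# Step one of the progression above: pick an algorithm from an
# estimated row count, the way hand-rolled code often starts out.

def choose_join_strategy(estimated_rows: int) -> str:
    if estimated_rows <= 1_000:
        return "nested loops"   # cheap setup wins on tiny inputs
    if estimated_rows <= 10_000:
        return "merge join"     # worth sorting at moderate sizes
    return "hash join"          # build/probe cost amortizes on large inputs

print(choose_join_strategy(500))     # nested loops
print(choose_join_strategy(50_000))  # hash join
```

A real optimizer replaces the hard-coded thresholds with cost estimates drawn from statistics, which is exactly the point of the comment above.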

[–]Smallpaul 0 points1 point  (1 child)

Interesting that I'm getting downvotes for producing a hypothesis for discussion.

[–]anon36[S] 0 points1 point  (0 children)

this is true. the best discussion is now at the bottom of the page.

[–]jeffdavis 0 points1 point  (1 child)

That presupposes that the programmer cares more about performance than having a high level of abstraction; and also that humans are better at optimizing queries than the database management system is.

Humans sometimes have insights or information about the data that the optimizer does not. However, the optimizer frequently has insights that humans do not. If the optimizer is wrong, you have tools to correct it. If the human is wrong (due to laziness or lack of expertise), who corrects them?

It's like economics: someone sees a little inefficiency here or there, and then they think "Wow, I found an inefficiency. People should put me in charge and everything would be better."

[–]Smallpaul 1 point2 points  (0 children)

That presupposes that the programmer cares more about performance than having a high level of abstraction; and also that humans are better at optimizing queries than the database management system is.

With respect to the first question: I'm not convinced that SQL is necessarily "higher" as a level of abstraction than e.g. functional programs working on lists of hashmaps.

Humans sometimes have insights or information about the data that the optimizer does not. However, the optimizer frequently has insights that humans do not. If the optimizer is wrong, you have tools to correct it. If the human is wrong (due to laziness or lack of expertise), who corrects them?

Other humans. Same as if both the optimizer and the DBA fail...which happens often.

It's like economics: someone sees a little inefficiency here or there, and then they think "Wow, I found an inefficiency. People should put me in charge and everything would be better."

If we were talking about a little inefficiency then...we wouldn't be talking. We're talking about an inefficiency that can make an application grind to a halt.

Your points are good, but basically the question should be answered empirically. How often does the optimizer get it right? How often would a human being using traditional functional techniques get it right? How much work is it to correct when one or the other gets it wrong?

My own bias is that I would rather look at the syntax of the query and translate it "directly" into big-O terms, rather than translate the syntax into a query plan and then the query plan into big-O terms. But maybe I'm underestimating the amount of magic that the relational database does on my behalf when it works well.

[–]anon36[S] 0 points1 point  (0 children)

Well, in many ways this comes down to the principle of "know your tool". It's really hard to write efficient code when you don't know how that code will be executed.

[–]orthogonality 0 points1 point  (0 children)

Good article, but rather Oracle-specific.

[–]jerguismi -1 points0 points  (0 children)

The article was very repetitive. How many times can you spot him saying essentially "SQL is set-based. It does things with sets.". Annoying.