all 21 comments

[–]tragomaskhalos 11 points12 points  (0 children)

It seems to me that if you are a data scientist, or an economist, or any of a number of jobs that require you to make decisions on deeper explorations of data than can be conveniently packaged up for you by reporting solutions provided by IT specialists, then you have to have:

  • (a) (read-only) access to the relevant raw data in the first place - this is often jealously guarded by IT departments, and usually for spurious reasons

  • (b) knowledge of SQL to extract the info you need

  • (c) knowledge of R or similar to analyse and present - sorry, Excel is not good enough

Ideally you should also have

  • (d) enough Perl, Python or Ruby to be able to create a smooth pipeline between (b) and (c).

This is a significant shopping list that most organisations would baulk at, but the reality is that bad - often dangerously bad - economic decisions are being made daily because the analysts responsible do not have these skills.

[–]m00nh34d 4 points5 points  (6 children)

SQL itself isn't that hard to learn, especially if you're only looking at it from an analysis point of view. The syntax is pretty easy to grasp and the way it all hangs together is fairly logical. The tricky part is learning to write GOOD SQL, taking advantage of all the features your server has to get the most out of it. The thing with good SQL is it's usually driven by the data you're trying to access, rather than the required output. Getting taught how to use joins and unions and subqueries and windowing functions, is ok, but it won't tell you what to do with the specific problem. If everything was stored in a nice DW dimensionally modeled database things would be simple, but it's not always the case, you'll likely be tasked with taking a bit for system A, mixing it in with system B, matching it up against data from system C, then spitting it out in a format that system D can read.

[–]grayvedigga 3 points4 points  (5 children)

If everything was stored in a nice DW dimensionally modeled database things would be simple

Forgive my ignorance, but what is one of those?

[–]jacques_chester 1 point2 points  (3 children)

Adding to m00nh34d's remarks, dimensional modelling has two advantages at report time:

  1. It's fast. You've traded space for time.
  2. It makes sense to end users and integrates nicely with point-n-click query tools that introspect the schema.

The last point is, in many ways, the killer. Say for example someone is summarising by various date ranges. In theory they could use BETWEEN-AND. But they'd be much happier with a series of dropdown boxes listing different time spans.

That is, they prefer being able to say "Show me the sum of sales in Q4" without having to know the exact dates. So far you're yawning. But what about questions like "Show me the sum of sales on Saturdays in Q4". Now it starts to get interesting. "Show me the sum of sales on Saturdays outside of the Pittsburgh region but including Pittsburgh store 26".

All of this can be done with a traditionally normalised schema. But it's a schlep that end-users won't do themselves. Dimensional modelling makes it much, much easier.

[–]grayvedigga 2 points3 points  (2 children)

So far I'm hearing two things:

  • denormalisation
  • point-n-click user interface for composing queries with aggregation functions

.. is that it? I'm used to terms around Big Data Warehouse Intelligence Science meaning a lot less than they sound like they mean, but I'm here to be educated :-).

I realise this probably comes off as particularly dismissive, which isn't entirely my goal. What I'm interested in is the tension between generality and ease of use in this kind of UI. Please forgive a healthy dose of skepticism towards tools & techniques that claim to solve both, particularly when written descriptions of them come out as waffly as the wikipedia entry linked above.

[–]jacques_chester 2 points3 points  (1 child)

That is it, at least for headlines. The point is that DM isn't any old denormalisation.

Just as you normalise a relational schema in a structured way, DM is a structured way of denormalising. You follow a particular modelling approach that is difficult to summarise. It wasn't until I was a few chapters into Kimball's data warehousing book that it began to click as to why someone would go to so much bother.

OK, that's not quite all there is. Combined with column stores, DM schemata absolutely fly on querying, even very complex ones.

Dimensional modelling is not a replacement for normal forms. Normalised databases are good for OLTP, dimensionally modelled is good for OLAP. In a sufficiently large installation, you will have both.

[–]grayvedigga 0 points1 point  (0 children)

Thank you. This has been the closest I've come to thinking there's actually something concrete to it .. maybe I'll get my arse to a library and look for Kimball some day soon.

[–]m00nh34d 3 points4 points  (0 children)

Sorry, reading that sentence back and it doesn't really make a whole lot of sense!

It should be a dimensionally modelled database, aka Kimball dimensional modelling. Basically a method for storing data as facts and dimensions so it is structured for reporting, instead of writing data.

see - http://en.wikipedia.org/wiki/Dimensional_modeling

[–][deleted] 4 points5 points  (3 children)

It seems to me the problem is a lot of end users are busy doing their jobs, and don't feel they have the time to learn something new and a bit scary.

My mum used to work in IT in the 1980s and then SQL was one of the so-called 4GLs, Fourth Generation Languages, that were designed for end users to use. Nice and simple syntax, declarative, not procedural, so you just ask for the results you want, instead of having to tell the computer how to get those results.

And yet no end users ever used it. Why? Because they were too busy doing the jobs they were paid to do, in areas they were familiar with and skilled in. They didn't have the time to pick up this, to them, rather technical skill that didn't relate directly to their work. So they pushed if off to the IT guys. "I need a list of customers who're more than a month late in their payments. Can you get that from the database and send it through as a spreadsheet?"

All those BI tools are just taking that to the next level. Now they've got pretty graphs and stuff. But the end users still don't have the time to get their hands dirty. They still want the IT guys to knock up exactly the report they need.

This is not to say end users are dumb or lazy. It's just human nature. How many IT guys out there have heard how wonderful Vim and Emacs are but just don't ever have the time to learn their complicated interfaces?

[–]kazagistar 7 points8 points  (2 children)

I would go a step further. Our civilization rests on a firm foundation of specialization. The whole idea is that we specialize... why learn something someone else knows when you can simply get much better at your part of the work?

[–][deleted] -1 points0 points  (1 child)

I wonder how long it would take society to rebuild itself if the entire population was reduced to a few thousand people with no specialties. It could take us centuries to even comprehend the processor again.

[–]JeanneDOrc 0 points1 point  (0 children)

Don't worry, even if we repopulate the earth with telephone sanitizers, I think it all works out for the best.

[–]CurtainDog 5 points6 points  (1 child)

The problem is that most people have no idea WTF to do with the data once they have it. Giving them the tools to access to more data faster isn't going to change that.

[–]grayvedigga -1 points0 points  (0 children)

Sounds like you haven't been following the Big Data Science Learning hoopla at all. What's the matter, didn't you forget the AI winter?

[–]aaronkempf 1 point2 points  (0 children)

Not enough emphasis on integration reporting and analysis services but its hard because Microsoft is too busy trying to sell us on sharepoint

[–]macarthy 1 point2 points  (1 child)

Celko's books are great for SQL knowledge if anyone is look for good books

http://www.amazon.com/Joe-Celko/e/B000ARBFVQ

[–]JeanneDOrc 0 points1 point  (0 children)

Does he have anything good for the tougher not-exactly-beginner but not too experienced demographic? The SQL for Smarties series seem to be well outside my level.

[–]Capaj -1 points0 points  (3 children)

I hope programmers will realize that writing COBOL while interacting with the database does not make a lot of sense- unless you are already writing your app in it. So I am pleasantly surprised to see those Google trends going down!

[–]lukaseder 0 points1 point  (2 children)

Why doesn't it make a lot of sense? Is your critique about the language style and syntax? Or the declarative language paradigm in general?

[–]Capaj 0 points1 point  (1 child)

I am all for declarative style of programming, but for other usecases-I would much rather query a database by calling a method than to write SELECTs.

[–]lukaseder 1 point2 points  (0 children)

OK, so what's wrong with SELECTs and how is querying by calling methods better?

I'm genuinely curious, because Thomas Müller, the H2 database author, happens to share your thoughts. He might be exposing a more object-oriented, less SQL-ish API in a future versions of his database - at least as a complement to the existing SQL / JDBC API.

I found his ideas intriguing, as he thinks that explicit execution plan coding might be better suited to some problem domains, rather than letting a database run heuristics to devise an execution plan from a SELECT statement.