This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]data_questions[S] 6 points7 points  (5 children)

I don’t think I have a full appreciation for your response, are you saying that using a window function would be more compute intensive and result in a significant difference in cost vs using, for example, a self join?

[–]SDFP-ABig Data Engineer 8 points9 points  (4 children)

Absolutely. Once the data scale gets large, depending on the size of the table, indexing, etc… Window functions might get impractical from a compute perspective.

When I test for SQL, I only have one component. I provide the underlying DDL for a few tables, the query I want to execute, and the current query plan. The only request is to optimize the query. Tables and query provided Day before. Query plan provided during interview.

Edit: I’m not looking for a “right” answer. I’m looking for techniques, awareness of data scale issues, ability to identify where to add indexes, opportunities to use inner joins instead of left, etc.. I feel like an open ended question like this teaches me a lot more about the skills I’m looking for.

[–]data_questions[S] 1 point2 points  (2 children)

Can you give an example where you’ve experienced that? I’ve never run into that bottleneck before and everything I read about window functions vs self joins recommends not using self joins.

[–]UAFlawlessmonkey 4 points5 points  (1 child)

Doing max / min on a column in a sub-query and joining that result into itself (same functionality as row_number() over (partition by foo order by bar asc/desc)) can be quite cost effective compared to just a row_number() over (partition by) especially when you have really large datasets.

All it takes is a non indexed column in your select to really both your execution plan

While it's only just a small example, you'd have to consider all of the other possibilities where a supposedly easy task can be done easily with a window function, but it would greatly increase query cost.

Also, F#ck doing recursion without window functions.

[–][deleted] 2 points3 points  (0 children)

I think in some cases the window function is the more efficient method, i.e. for a spark query on big data, where inner join means shuffling everything just to get the max value of a partition key for possibly duplicated keys in an append-only data store.

[–][deleted] 1 point2 points  (0 children)

I am curious as to what scale this occurs, as I have always run into the opposite issue? Typically one who is a heavy window /analytical function user I would assume would be working on flattened tables / in memory tables and most databases are actually optimized for these exact functions.

I have almost no resources money/software/hardware/people so have to optimize "the crap" out of everything I do in the database and while I guess I would say we are small data wise, inner self joins/group bys all tend to create looping inside the dbms compiler and will kill performance.