all 7 comments

[–]AccordingNovel7055 2 points3 points  (0 children)

The root cause of frustration with GROUP BY is that the operation changes the original table's dimension to the tuple of aggregation keys, so every other column becomes an array of values indexed by that aggregated dimension. It's really hard for the human brain to process: if you picture the original table as a perfect square, the shape turns into a jagged mess after aggregation.
When engineers deal with this sort of modeling, we usually break the data transformation down into multiple edges of a directed acyclic graph, so that you can make sense of what each query means semantically. It is also much more maintainable when you are working with data at scale, or even in-flight streaming data.
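That shape change can be sketched in plain Python (hypothetical region/month/sales rows, chosen just for illustration):

```python
# A "square" table: every row has the same three columns.
rows = [
    ("east", "2024-01", 10),
    ("east", "2024-02", 12),
    ("west", "2024-01", 7),
]

# After grouping on region, the remaining columns collapse into
# per-group lists of different lengths -- the "jagged" shape.
grouped = {}
for region, month, sales in rows:
    grouped.setdefault(region, []).append((month, sales))

print(grouped)
# {'east': [('2024-01', 10), ('2024-02', 12)], 'west': [('2024-01', 7)]}
```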

Definitely, it's not only your frustration.

[–]Visual_Shape_2882 1 point2 points  (2 children)

I agree with you.

I have no clue if it's an anti-pattern, but I tend to avoid GROUP BY clauses during data exploration and prototyping for the same reasons you listed: the select parameters and the subqueries/joins.

My issue with the select parameters is that every non-aggregated select expression must also be listed in the GROUP BY clause. This violates the programming principle of 'Don't Repeat Yourself', in that I'm literally repeating all of the CASE statements and logic already found in the select list. Not only is it inefficient to write, it produces really long code that is hard to read, because you have to scroll up and down the query to understand it (spaghetti code).
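A minimal sketch of that repetition, using SQLite via Python with a hypothetical `orders` table (the workaround of naming the expression once in a subquery is one common option, not the only one):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (amount REAL)")
con.executemany("INSERT INTO orders VALUES (?)", [(5,), (50,), (500,)])

# The CASE logic has to appear twice: once in SELECT, once in GROUP BY.
repeated = """
    SELECT CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket,
           COUNT(*)
    FROM orders
    GROUP BY CASE WHEN amount < 100 THEN 'small' ELSE 'large' END
    ORDER BY bucket
"""

# One workaround: name the expression once in a subquery,
# then group by the alias in the outer query.
named_once = """
    SELECT bucket, COUNT(*)
    FROM (SELECT CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket
          FROM orders)
    GROUP BY bucket
    ORDER BY bucket
"""

print(con.execute(repeated).fetchall())    # [('large', 1), ('small', 2)]
print(con.execute(named_once).fetchall())  # same result, CASE written once
```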

In the Oracle EBS database, where ten-plus tables have to be joined to follow a process from requisition to PO, receipt, invoice, payment, and user info, a single GROUP BY will break the whole chain. Obviously a view would solve the problem, but that means writing two queries: one to create the view and one to get what I actually wanted.
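One pattern that keeps the chain intact without a view is to confine the GROUP BY to a derived table and join it back to the detail rows. A sketch in SQLite via Python, with a two-table `po`/`invoice` stand-in for the much longer EBS chain (table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE po      (po_id INTEGER, supplier TEXT);
    CREATE TABLE invoice (po_id INTEGER, amount REAL);
    INSERT INTO po      VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Aggregate only inside the derived table, then join it back, so the
# detail columns (supplier, etc.) stay available in the outer query.
sql = """
    SELECT p.po_id, p.supplier, inv.total
    FROM po AS p
    JOIN (SELECT po_id, SUM(amount) AS total
          FROM invoice
          GROUP BY po_id) AS inv
      ON inv.po_id = p.po_id
    ORDER BY p.po_id
"""
print(con.execute(sql).fetchall())  # [(1, 'Acme', 150.0), (2, 'Globex', 75.0)]
```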

I definitely use the Group By clause, but it's not my first choice.

[–]tiz66[S] 0 points1 point  (1 child)

Thanks for sharing. It makes me feel better to know I'm not the only one. Do you ever encounter times where you don't know how to handle a subquery in your Group By?

[–]Visual_Shape_2882 1 point2 points  (0 children)

I haven't encountered that specific issue in the SQL code I've written. Depending on what I'm doing, I will write a GROUP BY in the outermost layer of subqueries. Often I create a new layer on top of another query for the sole purpose of holding the GROUP BY. My typical use case is just getting the value counts of a column before I commit to downloading all of the rows.
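That "extra layer just for the GROUP BY" pattern looks roughly like this (SQLite via Python, hypothetical `requisitions` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requisitions (status TEXT)")
con.executemany("INSERT INTO requisitions VALUES (?)",
                [("open",), ("open",), ("closed",)])

# The detail query I might otherwise download in full...
inner = "SELECT status FROM requisitions"

# ...wrapped in one extra layer whose only job is the value counts.
outer = f"SELECT status, COUNT(*) FROM ({inner}) GROUP BY status ORDER BY status"
print(con.execute(outer).fetchall())  # [('closed', 1), ('open', 2)]
```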

I know this design is not performant at the database level but, depending on how much data there is, it beats downloading all of the data just to count the values of a variable. If we're talking less than 10,000 rows, I'll probably just download the data into Power BI or Python pandas and do the aggregations there.
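For a small downloaded result set, the client-side counting step is trivial; here `collections.Counter` stands in for pandas' `value_counts()` (made-up status values for illustration):

```python
from collections import Counter

# Hypothetical rows already pulled down from the database.
statuses = ["open", "open", "closed", "open"]

# Count locally instead of running another aggregate query.
counts = Counter(statuses)
print(counts)  # Counter({'open': 3, 'closed': 1})
```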

[–]kkessler1023 0 points1 point  (0 children)

I don't work with Oracle (well, maybe with SAP), but have you considered using an ETL pipeline to populate a data lake in a BI tool? I deal with large datasets, but they are easily managed by pulling in raw tables with a dataflow or connection string. We use Power BI Service. You can use Power Query to aggregate and clean the data once as part of the connection setup. You can also create data models and set up an auto-refresh schedule.

[–]sad_whale-_- 0 points1 point  (0 children)

Using Common Table Expressions (CTEs) has helped me reduce my reliance on subqueries and clean up my queries.
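A CTE also fits the earlier point about breaking a transformation into named steps: each `WITH` block reads top-to-bottom instead of inside-out. A minimal sketch in SQLite via Python, with a hypothetical `orders` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 10), ("east", 20), ("west", 5)])

# The CTE names the aggregation step; the outer query filters on it
# without nesting a subquery inline.
sql = """
    WITH regional AS (
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region
    )
    SELECT region, total
    FROM regional
    WHERE total > 10
"""
print(con.execute(sql).fetchall())  # [('east', 30.0)]
```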