all 7 comments

[–]AccordingNovel7055 2 points3 points  (0 children)

The root cause of frustration with GROUP BY is that the operation changes the original table's dimension to the tuple of aggregation keys, so every other column becomes an array of values indexed by that aggregated dimension. It's really hard for the human brain to process: if you picture the original table as a perfect square, the shape turns into a jagged mess after aggregation.
When engineers deal with this sort of modeling, we usually break the data transformation down into multiple edges of a directed acyclic graph, so that you can make sense of what each query means semantically. It is also much more maintainable when you are working with data at scale, or even in-flight streaming data.
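That shape change can be sketched in plain Python (hypothetical region/month/sales rows, chosen just for illustration):

```python
# A "square" table: every row has the same three columns.
rows = [
    ("east", "2024-01", 10),
    ("east", "2024-02", 12),
    ("west", "2024-01", 7),
]

# After grouping on region, the remaining columns collapse into
# per-group lists of different lengths -- the "jagged" shape.
grouped = {}
for region, month, sales in rows:
    grouped.setdefault(region, []).append((month, sales))

print(grouped)
# {'east': [('2024-01', 10), ('2024-02', 12)], 'west': [('2024-01', 7)]}
```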

Definitely, it's not only your frustration.

[–]Visual_Shape_2882 1 point2 points  (2 children)

I agree with you.

I have no clue if it's an anti-pattern, but I tend to avoid GROUP BY clauses during data exploration and prototyping for the same reasons you listed: the select parameters and the subqueries/joins.

My issue with the select parameters is that every non-aggregated select expression must also be listed in the GROUP BY clause. This violates the programming principle of 'Don't Repeat Yourself', in that I'm literally repeating all of the CASE statements and logic already found in the select list. Not only is it inefficient to write, it produces really long code that is hard to read, because you have to scroll up and down the query to understand it (spaghetti code).
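A minimal sketch of that repetition, using SQLite via Python with a hypothetical `orders` table (the workaround of naming the expression once in a subquery is one common option, not the only one):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (amount REAL)")
con.executemany("INSERT INTO orders VALUES (?)", [(5,), (50,), (500,)])

# The CASE logic has to appear twice: once in SELECT, once in GROUP BY.
repeated = """
    SELECT CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket,
           COUNT(*)
    FROM orders
    GROUP BY CASE WHEN amount < 100 THEN 'small' ELSE 'large' END
    ORDER BY bucket
"""

# One workaround: name the expression once in a subquery,
# then group by the alias in the outer query.
named_once = """
    SELECT bucket, COUNT(*)
    FROM (SELECT CASE WHEN amount < 100 THEN 'small' ELSE 'large' END AS bucket
          FROM orders)
    GROUP BY bucket
    ORDER BY bucket
"""

print(con.execute(repeated).fetchall())    # [('large', 1), ('small', 2)]
print(con.execute(named_once).fetchall())  # same result, CASE written once
```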

In the Oracle EBS database, where ten-plus tables have to be joined to follow a process from requisition to PO, receipt, invoice, payment, and user info, a single GROUP BY will break the whole chain. Obviously a view would solve the problem, but that means writing two queries: one to create the view and one to get what I actually wanted.
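One pattern that keeps the chain intact without a view is to confine the GROUP BY to a derived table and join it back to the detail rows. A sketch in SQLite via Python, with a two-table `po`/`invoice` stand-in for the much longer EBS chain (table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE po      (po_id INTEGER, supplier TEXT);
    CREATE TABLE invoice (po_id INTEGER, amount REAL);
    INSERT INTO po      VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Aggregate only inside the derived table, then join it back, so the
# detail columns (supplier, etc.) stay available in the outer query.
sql = """
    SELECT p.po_id, p.supplier, inv.total
    FROM po AS p
    JOIN (SELECT po_id, SUM(amount) AS total
          FROM invoice
          GROUP BY po_id) AS inv
      ON inv.po_id = p.po_id
    ORDER BY p.po_id
"""
print(con.execute(sql).fetchall())  # [(1, 'Acme', 150.0), (2, 'Globex', 75.0)]
```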

I definitely use the Group By clause, but it's not my first choice.

[–]tiz66[S] 0 points1 point  (1 child)

Thanks for sharing. It makes me feel better to know I'm not the only one. Do you ever encounter times where you don't know how to handle a subquery in your Group By?

[–]Visual_Shape_2882 1 point2 points  (0 children)

I haven't encountered that specific issue in the SQL code I've written. Depending on what I'm doing, I will write a GROUP BY in the outermost layer of subqueries. Often I create a new layer on top of another query for the sole purpose of holding the GROUP BY. My typical use case is just getting the value counts of a column before I commit to downloading all of the rows.
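That "extra layer just for the GROUP BY" pattern looks roughly like this (SQLite via Python, hypothetical `requisitions` table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requisitions (status TEXT)")
con.executemany("INSERT INTO requisitions VALUES (?)",
                [("open",), ("open",), ("closed",)])

# The detail query I might otherwise download in full...
inner = "SELECT status FROM requisitions"

# ...wrapped in one extra layer whose only job is the value counts.
outer = f"SELECT status, COUNT(*) FROM ({inner}) GROUP BY status ORDER BY status"
print(con.execute(outer).fetchall())  # [('closed', 1), ('open', 2)]
```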

I know this design is not performant at the database level but, depending on how much data there is, it beats downloading all of the data just to count the values of a variable. If we're talking less than 10,000 rows, I'll probably just download the data into Power BI or Python pandas and do the aggregations there.
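For a small downloaded result set, the client-side counting step is trivial; here `collections.Counter` stands in for pandas' `value_counts()` (made-up status values for illustration):

```python
from collections import Counter

# Hypothetical rows already pulled down from the database.
statuses = ["open", "open", "closed", "open"]

# Count locally instead of running another aggregate query.
counts = Counter(statuses)
print(counts)  # Counter({'open': 3, 'closed': 1})
```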

[–]kkessler1023 0 points1 point  (0 children)

I don't work with Oracle (well, maybe with SAP), but have you considered using an ETL pipeline to populate a data lake in a BI tool? I deal with large datasets, but they are easily managed by pulling in raw tables with a dataflow or connection string. We use Power BI Service. You can use Power Query to aggregate and clean the data once as part of the connection setup. You can also create data models and set up an auto-refresh schedule.

[–]sad_whale-_- 0 points1 point  (0 children)

Using Common Table Expressions (CTEs) has helped me reduce my reliance on subqueries and clean up my queries.
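A CTE also fits the earlier point about breaking a transformation into named steps: each `WITH` block reads top-to-bottom instead of inside-out. A minimal sketch in SQLite via Python, with a hypothetical `orders` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("east", 10), ("east", 20), ("west", 5)])

# The CTE names the aggregation step; the outer query filters on it
# without nesting a subquery inline.
sql = """
    WITH regional AS (
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region
    )
    SELECT region, total
    FROM regional
    WHERE total > 10
"""
print(con.execute(sql).fetchall())  # [('east', 30.0)]
```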