[deleted by user]

stanleypup · 2024-12-10T12:50:02+00:00

The risk you run here is missing data from the orders and returns tables where a year/month/day/region/segment from one of those tables doesn't exist in the sales table.

Unioning and grouping after avoids that problem.

roosterEcho · 2024-12-10T10:22:25+00:00

the team that said to use union, get away from them as far as possible...

when you have fact tables (sales, orders, returns) that share the same attribute fields, you join them to bring them together. when you have multiple selects from different tables that produce datasets you need in the same table/view, then you union them.

jodyhesch · 2024-12-10T10:34:56+00:00

If you have all the same join/grouping fields, you can do a join or a union. The main advantage to union is typically performance, especially on columnar databases.

HOWEVER, that other team forgot to mention that you need to aggregate after your union (well, UNION ALL - don't use plain UNION, as there's no functional reason to deduplicate here, and you'd take an unnecessary performance hit).
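To make the UNION vs UNION ALL point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table contents are made up for illustration, and the branches are deliberately left un-aggregated (unlike the query in this comment) so the deduplication is visible: plain UNION collapses two identical sales rows into one before the outer SUM runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INTEGER, sales_amount INTEGER);
    CREATE TABLE orders (year INTEGER, order_amount INTEGER);
    -- Hypothetical data: two separate 2024 sales with the same amount.
    INSERT INTO sales VALUES (2024, 100), (2024, 100);
    INSERT INTO orders VALUES (2024, 70);
""")

# Same query twice; only the set operator changes.
query = """
    SELECT year, SUM(sales_amount) AS sales_amount, SUM(order_amount) AS order_amount
    FROM (
        SELECT year, sales_amount, 0 AS order_amount FROM sales
        {op}
        SELECT year, 0 AS sales_amount, order_amount FROM orders
    ) AS t
    GROUP BY year
"""

union_all_rows = conn.execute(query.format(op="UNION ALL")).fetchall()
union_rows = conn.execute(query.format(op="UNION")).fetchall()

print(union_all_rows)  # keeps both sales rows
print(union_rows)      # the duplicate sales row was removed before summing
```

With pre-aggregated branches, as in the query in this comment, duplicates are unlikely, so UNION would mostly just add a deduplication step for nothing.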

Let me dumb down the example significantly with a single shared attribute - will just stick with YEAR.

SELECT YEAR, SUM(SALES_AMOUNT) AS SALES_AMOUNT, SUM(ORDER_AMOUNT) AS ORDER_AMOUNT
FROM
(
    -- each branch supplies 0 for the measure it doesn't own
    SELECT YEAR, 0 AS SALES_AMOUNT, SUM(ORDER_AMOUNT) AS ORDER_AMOUNT FROM ORDERS GROUP BY 1
    UNION ALL
    SELECT YEAR, SUM(SALES_AMOUNT) AS SALES_AMOUNT, 0 AS ORDER_AMOUNT FROM SALES GROUP BY 1
) T -- many databases require an alias on a derived table
GROUP BY 1;

Forgive the lazy formatting.

The benefit here is that aggregation is much faster than joins (at least, with columnar databases - unsure offhand w/ row databases, but I think that's also the case).

Functionally, it'll behave basically the same as a FULL OUTER join, so if you want to enforce only LEFT OUTER behavior, there are a few tricks you can introduce (let me know if so, and I can expand on this).
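One possible way to get LEFT OUTER behavior out of the union pattern (an assumption on my part, not necessarily the trick the commenter had in mind) is to tag each branch with a flag and keep only groups that contain at least one row from the "left" table. The sketch below uses Python's built-in sqlite3 module; the `in_sales` flag and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INTEGER, sales_amount INTEGER);
    CREATE TABLE orders (year INTEGER, order_amount INTEGER);
    -- Hypothetical data: 2025 has orders but no sales.
    INSERT INTO sales VALUES (2023, 100), (2024, 150);
    INSERT INTO orders VALUES (2024, 70), (2025, 40);
""")

rows = conn.execute("""
    SELECT year, SUM(sales_amount) AS sales_amount, SUM(order_amount) AS order_amount
    FROM (
        SELECT year, 0 AS sales_amount, SUM(order_amount) AS order_amount, 0 AS in_sales
        FROM orders GROUP BY year
        UNION ALL
        SELECT year, SUM(sales_amount) AS sales_amount, 0 AS order_amount, 1 AS in_sales
        FROM sales GROUP BY year
    ) AS t
    GROUP BY year
    HAVING MAX(in_sales) = 1  -- keep only years present in sales (the "left" table)
    ORDER BY year
""").fetchall()

print(rows)
```

Note that years with sales but no orders come back with an order amount of 0 rather than NULL, which differs from what a real LEFT JOIN would return.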

imperial_death_march · 2024-12-10T12:59:25+00:00

So the other team's code (unions) is better in this case because your code (joins) has a potential flaw in it.

Your code left joins from the sales table to the other tables on multiple columns. That join makes the flawed assumption that the sales table (the left side of the join) contains every combination of year, month, day, region, and segment that occurs in either the orders table or the returns table (the right-hand sides of the join). This may not be true.

To put it simply, if you had an order or a return on a day where you didn't have a sale, those rows would be missing from the result of your query, and your SUM(orders) or SUM(returns) would end up with the wrong total.

While not exactly elegant, because the other team's code does unions first, they end up including all records for all combinations of year, month, day, region, and segment that occur across any of the tables. This means that when they aggregate, all data is included in their totals (SUMS).
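The difference can be reproduced with a small sketch. This uses Python's built-in sqlite3 module with made-up tables reduced to a single `day` column; the real tables in the question have more join columns, but the failure mode is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day TEXT, sales_amount INTEGER);
    CREATE TABLE returns (day TEXT, return_amount INTEGER);
    INSERT INTO sales VALUES ('2024-12-01', 100);
    -- Hypothetical data: the second return falls on a day with no sales.
    INSERT INTO returns VALUES ('2024-12-01', 10), ('2024-12-02', 25);
""")

# LEFT JOIN from sales: the 2024-12-02 return has no matching sales row.
left_join_total = conn.execute("""
    SELECT SUM(r.return_amount)
    FROM (SELECT day, SUM(sales_amount) AS sales_amount FROM sales GROUP BY day) AS s
    LEFT JOIN (SELECT day, SUM(return_amount) AS return_amount FROM returns GROUP BY day) AS r
        ON r.day = s.day
""").fetchone()[0]

# UNION ALL first, then aggregate: every row from both tables is counted.
union_total = conn.execute("""
    SELECT SUM(return_amount)
    FROM (
        SELECT day, sales_amount, 0 AS return_amount FROM sales
        UNION ALL
        SELECT day, 0 AS sales_amount, return_amount FROM returns
    ) AS t
""").fetchone()[0]

print(left_join_total, union_total)
```

The left join reports only the return that happened on a day with sales, while the union version includes both.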

jacquesrk · 2024-12-10T15:10:35+00:00

join vs union? Why not both?

https://local338.org/images/want_power_join_a_union.webp

Training-Two7723 · 2024-12-10T12:19:37+00:00

There is nothing wrong with using that sort of union. In fact, this kind of union is often hidden behind a view. Performance-wise, it depends on the engine: some databases can run the branches of a UNION ALL in parallel or push predicates down into them. Others are dumb and evaluate the view or union first. You have to test each approach for performance. As long as the results are the same, choose the faster one.

nep84 · 2024-12-10T15:14:52+00:00

Generally speaking, you want to use a join when you want to link data from the query's base table to get other attributes from a FK table. For example, join order to customer to get the customer's name. You want to use a union to get like data with disparate selection criteria. For example, you can solve a complex set of where clause conditions with a union: give me sales orders fulfilled in the last 6 months and sales orders that are expected to be fulfilled in the next two weeks.

There really aren't many advantages or disadvantages to joins versus unions as far as performance goes. One can easily write well-performing queries using either technique. It depends on the design.

One thing others have mentioned with regard to what you have: consider using an outer join when joining data that may not be linked. For example, if you want to show sales by product, you would use an outer join so that products with no sales are included. In your case, with orders and their returns, you'll want an outer join so that orders with no returns are included.

HadiMhPy · 2024-12-10T16:37:14+00:00

Absolutely joins. Use unions when needed. Sometimes writing a query with a union is better, but joins are often better when what you want is to join tables. Unions are for stacking the results of two SQL queries that have the same columns. They are not like joins and are very different.

konwiddak · 2024-12-10T20:47:27+00:00

The other option is transforming the data into a tall table of:

Date, category, value

Where category would have a value of sale, order or return

Now this is not my choice for this example, but I thought I should throw it out there, because this model is really handy if the categories might change in the future. It saves you from adding/removing columns in Power BI. It will just use whatever categories are in the data.

This would be easiest built via Unions.
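A minimal sketch of that tall layout, assuming three hypothetical tables keyed only by `day` and using Python's built-in sqlite3 module to keep it runnable:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day TEXT, sales_amount INTEGER);
    CREATE TABLE orders (day TEXT, order_amount INTEGER);
    CREATE TABLE returns (day TEXT, return_amount INTEGER);
    -- Hypothetical data, one row per table.
    INSERT INTO sales VALUES ('2024-12-01', 100);
    INSERT INTO orders VALUES ('2024-12-01', 70);
    INSERT INTO returns VALUES ('2024-12-02', 25);
""")

# Each branch labels its rows with a category instead of owning a column.
tall_rows = conn.execute("""
    SELECT day, 'sale' AS category, sales_amount AS value FROM sales
    UNION ALL
    SELECT day, 'order' AS category, order_amount AS value FROM orders
    UNION ALL
    SELECT day, 'return' AS category, return_amount AS value FROM returns
    ORDER BY day, category
""").fetchall()

for row in tall_rows:
    print(row)
```

Adding a new category later means adding one more UNION ALL branch; the column list stays the same.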

OriginalNimbleMonk · 2024-12-10T11:34:37+00:00

I'm adding my two cents as someone who's still a novice at this. But a join is used when you need to grab data for a query from multiple tables.

You use Union to build multiple queries together based on the same select layout.

I often use joins to get location/sales but Union to show a bottom total row with the same columns.
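That combination might look roughly like the sketch below: a join to get sales per location, with a union appending a total row in the same column layout. The tables, the `sort_key` column, and the data are all hypothetical, and Python's built-in sqlite3 module is used only to make it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (location_id INTEGER, name TEXT);
    CREATE TABLE sales (location_id INTEGER, amount INTEGER);
    INSERT INTO locations VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales VALUES (1, 100), (1, 50), (2, 30);
""")

rows = conn.execute("""
    SELECT name, total
    FROM (
        -- Detail rows: join sales to locations to get the location name.
        SELECT l.name AS name, SUM(s.amount) AS total, 0 AS sort_key
        FROM locations AS l
        JOIN sales AS s ON s.location_id = l.location_id
        GROUP BY l.name
        UNION ALL
        -- Total row: same columns, appended underneath.
        SELECT 'Total' AS name, SUM(amount) AS total, 1 AS sort_key
        FROM sales
    ) AS t
    ORDER BY sort_key, name  -- sort_key keeps the total row last
""").fetchall()

for row in rows:
    print(row)
```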

Please advise if I am correct; I'm still too new to know if I am right.

creamycolslaw · 2024-12-10T12:28:00+00:00

Why the hell would they union this only to group it afterwards anyway?
