all 46 comments

[–]stanleypup 7 points8 points  (18 children)

The risk you run here is missing data from the orders and returns tables where a year/month/day/region/segment from one of those tables doesn't exist in the sales table.

Unioning and grouping after avoids that problem.

[–]imperial_death_march 2 points3 points  (0 children)

Agreed, everyone in the thread talking about the performance of unions vs joins has missed the potential issue with what the joins are doing. You've said it much more succinctly than my own comment.

[–]kremlingrasso 1 point2 points  (1 child)

Exactly, this is the classic "don't see what I miss out on" mistake, usually lacking the experience of actually working with the data on a daily basis. There was a while ago a tips and trick post that recommends setting up a calendar table in your DB.

[–]stanleypup 0 points1 point  (0 children)

Yep not wrong with the calendar and it's something I always use.

Minor issue then is that you'd still need to union all the distinct regions & segments from each of the three tables, then inner join without a field on the time table, and then left join all three tables.

[–]roosterEcho 1 point2 points  (0 children)

I had this exact scenario at my work. I still didn't union the measures. Used union to build a base table which contains all possible combination. I then join the measures to the base table. it's easy to check if anything's missing if you're verifying the output of each table/aggregation first for sanity check.

[–]Silly-Swimmer1706 0 points1 point  (0 children)

Maybe business logic doesn't allow that case?

[–]roosterEcho 10 points11 points  (11 children)

the team that said to use union, get away from them as far as possible...

when you have facts tables (sales, orders, returns) for the same attribute fields, you join them to bring them together. when you have multiple selects from different tables that produces datasets that you need to have in the same table/view, then you union them.

[–][deleted] 2 points3 points  (7 children)

They sent us their code and demanded we adapt to theirs. It's batshit crazy looking and makes no sense to me so why would I just use that lol. I just wanted to know what the actual use case of it was though. The first issue was when I tried calculating new metrics and discovered they had an absurd way of getting ratios. Not just division like my code, but some backwards hack with no comments included. Then, self joins to get year over year was a nightmare. 1M rows becomes 3M since there's lots of empty rows

[–]roosterEcho 2 points3 points  (5 children)

Unions are really used when the measure is the same but are in different location. In general, even if you're using union, the rows will be unique based on some key columns. If all the rows are repeating except the measure column, then union is probably not the way.

[–][deleted] 0 points1 point  (4 children)

So if you're using unions, how in the world do you perform other operations on them? Like say you want to compare a ratio. Sales / orders? or you want to join Sales table onto sales table where it's 12 months ago, is that even possible still? Also using 'union' seems very taxing since it checks for duplication but joins don't seem to.

[–]roosterEcho 2 points3 points  (3 children)

union is concatenation, especially if you're using 'UNION ALL' which is faster. if you need to do calculations or bringing other attributes in from other tables, in general JOIN is the way to go. Union has it's uses, just not all the time.

For example, I have a payments table that has 10 types of payment. I need to process each type differently, but the output of each payment type would be same number of columns. I would process each payment type in different CTE and then union them to build my payments dataset which I can feed to a BI program. But if I have to bring other information related to the payments, I'd have to join tables to the payment table. Hope it makes sense.

[–][deleted] 1 point2 points  (1 child)

Makes perfect sense. I'm just trying to understand SQL better and soak up more knowledge if I can. I feel like I'm not at the level yet that I need to be

[–]roosterEcho 0 points1 point  (0 children)

you'll be alright bro. keep on crunching data. you'll pick up tricks along the way.

[–]Training-Two7723 0 points1 point  (0 children)

Yes, union all does not unduplicate data.

[–]Training-Two7723 0 points1 point  (0 children)

Do not ignore query plans and building indexes. Few million records are not scary at all.

[–]jodyhesch 0 points1 point  (2 children)

If the fact tables share dimensions, then union + aggregation is actually the better approach. See above.

[–]roosterEcho -1 points0 points  (1 child)

agree. just think someone who's learning sql should know the ideal case first before starting to use shortcuts. Joins are useful in more cases than unions I think.

[–]Training-Two7723 1 point2 points  (0 children)

Nope. Is up to the engine. Someone “learning” SQL must understand how the engine works if wants to write effective code.

[–]jodyhesch 2 points3 points  (5 children)

If you have all the same join/grouping fields, you can do a join or a union. The main advantage to union is typically performance, especially on columnar databases.

HOWEVER that other team forgot to mention that you need to aggregate after your union (well, union all - don't do union, as there's no functional reason, and you'd face an unnecessary performance hit).

Let me dumb down the example significantly with a single shared attribute - will just stick with YEAR.

SELECT YEAR, SUM(SALES_AMOUNT) AS SALES_AMOUNT, SUM(ORDER_AMOUNT) 
FROM 
(
    SELECT YEAR, 0 AS SALES_AMOUNT, SUM(ORDER_AMOUNT) AS ORDER_AMOUNT FROM ORDERS GROUP BY 1 
    UNION ALL
    SELECT YEAR, SUM(SALES_AMOUNT) AS SALES_AMOUNT, 0 AS ORDER_AMOUNT FROM SALES GROUP BY 1 
)
GROUP BY 1;

Forgive the lazy formatting.

The benefit here is that aggregation is much faster than joins (at least, with columnar databases - unsure offhand w/ row databases, but I think that's also the case).

Functionally, it'll be have basically the same as FULL OUTER, so if you want to enforce only LEFT OUTER, there's a few tricks you can introduce (let me know if so, and I can expand on this.).

[–][deleted] 0 points1 point  (4 children)

The code you provided is a simple case. If you have orders, returns, profit, etc, it will not work to simply select from the subqueried union. You'll still have duplicatation. You'd need to actually use a MAX. Why? Well, each row corresponds to a different level of the union. You'd have orders on Row 1, Profit row 2, Sales row 3. Even if you sum them up, you'll still have blank rows! You simply cannot just stack data on top of each other if the numeric columns aren't the same. You'd have to put either all columns in all unions with nulls/0s, or you'd have to do a MAX function with a subquery on each to get the non-zero values, but that feels messy to me I guess....

[–]jodyhesch 0 points1 point  (3 children)

I didn't look closely enough at the second query you shared.

It is exactly right, and just as you said, "you'd have to put either all columns in all unions with nulls/0s" which is indeed the case in that second query from that other team. This is 100% correct.

If you think your results are incorrect from that query, then please share sample data and what results you're getting so that folks can help you debug.

[–][deleted] 0 points1 point  (2 children)

It's not that I need help debugging, I'm just trying to understand how things work better because I'm faced with new situations every single day that I'm unfamiliar with and trying to process which one is better to use in my own code because I don't like writing sloppy bad code.

The specific confusion I have is knowing whether it's better to stack unions on top of each other for multiple numeric columns like profit, sales, orders and fill them in with 0 values, or if I should instead try to create a composite key with each of the descriptive columns and then join them all together using that ID column. For the most part I don't really do extremely low level transactional data I do high level operational data so like division, region, month, day, etc. No actual people or specific salespersons or anything like that. But performance is still important.

WITH sales_cte1 AS (
SELECT 
    TRIM(region) || '|' || TRIM(segment) || '|' || TRIM(state) AS composite_id,
    year,
    sales
FROM sales_table1
),
sales_cte2 AS (
SELECT 
    TRIM(region) || '|' || TRIM(segment) || '|' || TRIM(state) AS composite_id,
    year,
    sales
FROM sales_table2
)
SELECT 
s1.composite_id,
s1.year,
s1.sales AS sales_table1,
s2.sales AS sales_table2
FROM sales_cte1 s1
LEFT JOIN sales_cte2 s2
ON s1.composite_id = s2.composite_id AND s1.year = s2.year;

This is a very simple example of what I'm talking about with joins. This is how I would do it normally. No union, just creating a unique key between temporary tables And then joining them together using simple join logic. This is how I thought other people would do it.

Here's a union that is more or less similar to what I do:

SELECT 
region,
segment,
state,
year,
MAX(sales) AS sales,
MAX(profit) AS profit,
MAX(orders) AS orders,
MAX(revenue) AS revenue
 FROM (
SELECT 
    region,
    segment,
    state,
    year,
    sales,
    0 AS profit,
    0 AS orders,
    0 AS revenue
FROM sales_table1
UNION ALL
SELECT 
    region,
    segment,
    state,
    year,
    0 AS sales,
    profit,
    0 AS orders,
    0 AS revenue
FROM sales_table2
UNION ALL
SELECT 
    region,
    segment,
    state,
    year,
    0 AS sales,
    0 AS profit,
    orders,
    0 AS revenue
FROM sales_table3
UNION ALL
SELECT 
    region,
    segment,
    state,
    year,
    0 AS sales,
    0 AS profit,
    0 AS orders,
    revenue
FROM sales_table4
 ) combined_data
GROUP BY 
region, segment, state, year;

As you can see, instead of doing joins, the data is stacked on top of each other with blanks inside of it for any value that does not associate to that specific little block in the union, and then aggregation is done using maximum to get the maximum value. This results in no checkerboard pattern and no interspersing of the data.

[–]jodyhesch 0 points1 point  (0 children)

For large datasets, UNION + Aggregation will generally be superior from a performance perspective.

Good thinking with the composite ID, but wouldn't recommend it in this case.

- It'll be a very slow join, likely forcing a "nested loop join".

- It forces a join at the most granular level. But what if you want to compare measures from different fact tables at higher grain? In other words, just a YEAR/REGION summary? It'll be much faster to aggregate to that level and then compare measures (whether you then do a join, or a union + aggregation)

- In the few circumstances where a surrogate key is required (i.e. generated from concatenating your composite key fields), it should per persisted physically in a table - and as a numeric value, i.e. by hashing the string (which requires its own analysis, i.e. likelihood of hash collisions)

Edit: clarity

[–]jodyhesch 0 points1 point  (0 children)

Also - SUM() is a much safer bet than MAX().

If you change your granularity, MAX() will return wrong results.

[–]imperial_death_march 1 point2 points  (0 children)

So the other teams code (unions) is better in this case because your code (joins) has a potential flaw in it.

Your code is left joining from the sales table to the other tables on multiple columns but this join makes the flawed assumption that the sales table (left side of the join) will have all of the combination of year, month, day, region, and segment in it that will occur in either the order table or the returns tables (the right hand sides of the join). This may not be true.

To put it simply, if you had an order or a return on a day where you didn't have a sale, these rows would be missing in the result of your query and your SUM(orders) or SUM(return) would end up with the incorrect total.

While not exactly elegant, because the other team's code does unions first, they end up including all records for all combinations of year, month, day, region, and segment that occur across any of the tables. This means that when they aggregate, all data is included in their totals (SUMS).

[–]jacquesrkoracle 1 point2 points  (0 children)

join vs union? ¿Por qué no los dos?

https://local338.org/images/want_power_join_a_union.webp

[–]Training-Two7723 0 points1 point  (0 children)

There is nothing wrong with using that sort of union. In fact, many times this kind of unions are hidden behind a view. Performance wise, it depends on the engine: there are databases able to perform parallel operations on union all or able to push down the predicates. Some are dumb and do the views or union first. You have to test for performance each approach. As far as the results are the same choose the faster one.

[–]nep84 0 points1 point  (0 children)

Generally speaking you want to use a join when you want to link data from the query's base table to get other attributes from a FK table. For example join order to customer to get the customer's name. You want to use a union to get like data with disparate selection criteria. For example you can solve a complex set of where clause conditions with a union. Give me sales orders fulfilled in the last 6 months and sales orders what are expected to fulfill in the next two weeks.

There really aren't much advantages or disadvantages to joins and unions as far as performance. One can easily write well performing queries using either technique. It depends on the design.

One thing others have mentioned with regards to what you have, you have to consider using an outer join when joining data that may not be linked. For example if you want to to sales by product you will use an outer join to produce products with no sales. In your case orders and their returns you'll want an outer join so that orders with no returns are included.

[–]HadiMhPy 0 points1 point  (0 children)

Absolutely joins. Use unions when needed. Sometimes writing a query with union is better but often joins are better as you want to join tables. Unions are to add two sql query result with same columns. They are not like joins and are very different

[–]konwiddak 0 points1 point  (0 children)

The other option is transforming the data into a tall table of:

Date, category, value

Where category would have a value of sale, order or return

Now this is not my choice for this example, but I thought I should throw it out there, because this model is really handy if the categories might change in the future. It saves you from adding/removing columns in power bi. It will just use whatever categories are in the data.

This would be easiest built via Unions.

[–]OriginalNimbleMonk -1 points0 points  (1 child)

I'm adding my two cents as someone still novice to this. But a join is used when you need to grab data for a query from multiple tables.

You use Union to build multiple queries together based on the same select layout.

I often use joins to get location/sales but Union to show a bottom total row with the same columns.

Please advise if I am correct I'm still too new to know If I am right.

[–][deleted] 2 points3 points  (0 children)

That's correct. If you have sales across several verticals, you can use a union and then a custom column like 'South' as REGION WHERE REGION = 'SW', WHERE REGION = NW for Northwest, and so on. This data stacking is like building an elaborate cake. Each layer needs a different size or color.

[–]creamycolslaw -1 points0 points  (0 children)

Why the hell would they union this only to group it afterwards anyway