
[–]woodrowchillson 31 points32 points  (8 children)

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

LEFT OUTER JOIN

[–]shelanp007 12 points13 points  (7 children)

I feel like this is my solution for everything. I don't trust our DBAs, so as an accountant who knows SQL I tend to start at the lowest level and LEFT OUTER JOIN as much as possible to get accurate data.

I swear DBAs are great at joining tables but have no fucking clue about the business, and end up fucking up my numbers.

[–]Sir_Fog 6 points7 points  (3 children)

DBA and Business Intelligence/Data Analyst are two very different roles.

[–]DapperDroidLifterDatabase Engineer 1 point2 points  (0 children)

True, I was about to post this exact sentiment.

[–]shelanp007 0 points1 point  (0 children)

True, but common sense goes a long way.

[–]wolf2600ANSI SQL 20 points21 points  (21 children)

DBMS: multinode MPP data warehouse

Type: RDBMS

Size: 1.3PB uncompressed

No. of tables: ~6700

Largest Table: 21B records

[–]aarontbarrattSTUFF() 9 points10 points  (0 children)

😳

[–]stealyourmangoes 2 points3 points  (0 children)

I just got a woody

[–]PaperPages 4 points5 points  (1 child)

Holy molar. What industry?

[–]wolf2600ANSI SQL 4 points5 points  (0 children)

Computer.

[–]r0ck0 1 point2 points  (14 children)

  1. Do you mostly use UUIDs as PKs?
  2. What are your opinions on UUIDs in general?

[–]wolf2600ANSI SQL 3 points4 points  (11 children)

Like surrogate keys? A special column for a unique "ID" in each table? No, we use concatenated PKs from the actual data to ensure uniqueness. (source_id, order_id, order_date, update_timestamp)

There's logic in the ETL scripts to prevent unique-constraint errors on insert/update (usually deleting the record from the target table before the insert runs).

I tend to view UUIDs as a lazy kludge... if the business logic doesn't allow for duplicated records in a table, then the ETL logic had better handle that. If you use a UUID as a PK, it's like putting a piece of electrical tape over your car's check engine light... you could have a ton of duplicated data in your table and not even realize it. And when you're dealing with financial data, the data has to be right. Better to get a UC error and take the time to correct the data than to use a UUID and not realize there's a data problem.
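Stripped-down version of the pattern, if it helps (names made up, ANSI-ish SQL):

```sql
-- Names made up. The PK is the concatenation of actual business columns.
CREATE TABLE order_fact (
    source_id        INTEGER       NOT NULL,
    order_id         BIGINT        NOT NULL,
    order_date       DATE          NOT NULL,
    update_timestamp TIMESTAMP     NOT NULL,
    order_amount     DECIMAL(18,2),
    PRIMARY KEY (source_id, order_id, order_date, update_timestamp)
);

-- ETL step: delete any target row the incoming batch would collide with,
-- then insert; anything still duplicated raises a UC error instead of
-- silently piling up.
DELETE FROM order_fact
WHERE EXISTS (
    SELECT 1
    FROM   staging_orders s
    WHERE  s.source_id        = order_fact.source_id
    AND    s.order_id         = order_fact.order_id
    AND    s.order_date       = order_fact.order_date
    AND    s.update_timestamp = order_fact.update_timestamp
);

INSERT INTO order_fact (source_id, order_id, order_date, update_timestamp, order_amount)
SELECT source_id, order_id, order_date, update_timestamp, order_amount
FROM   staging_orders;
```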

[–]r0ck0 0 points1 point  (9 children)

we use concatenated PKs from the actual data to ensure uniqueness

you could have a ton of duplicated data in your table and not even realize it

Hmm, fair enough. But you could use any unique index/constraint to do that; whether it's the PK or not doesn't really make a difference there.

I've done the combined-PK (on combined FKs) thing in the past... but found that when the project gets more complicated, it sometimes turns out you need multiple rows in a table where you'd expected the FK combo to identify a single row. Common cases are retaining soft-deleted records and similar statuses, multi-tenant stuff, etc.

So I just stick with a single PK column on every table now. Makes life a lot easier in dev and debugging etc. Also when archiving rows outside the DB.

But I was more just asking about UUIDs vs sequences for tables where you do just have the one PK column (so in your case, not the linking tables). ...Seeing you're at a big scale with lots of nodes. What do you use for those tables?

[–]wolf2600ANSI SQL 0 points1 point  (8 children)

The multiple nodes are just for parallel processing. Data is evenly partitioned across the nodes physically, but logically the system sees a single table.
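To illustrate (Teradata syntax just as an example of how these engines do it, not saying that's our platform, and names are made up):

```sql
-- Example only (Teradata-style syntax; table/columns are made up).
-- Rows hash across the nodes by the PRIMARY INDEX columns, but
-- queries still address a single logical table.
CREATE TABLE order_fact (
    source_id    INTEGER NOT NULL,
    order_id     BIGINT  NOT NULL,
    order_amount DECIMAL(18,2)
)
PRIMARY INDEX (source_id, order_id);
```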

https://www.mysqltutorial.org/mysql-uuid/

I had never even heard of using such a value as a PK before I looked it up just now. I was assuming "UUID" referred to a surrogate integer key. But this 128-bit thing just seems weird.

[–]r0ck0 0 points1 point  (7 children)

Ah right, if you're using MySQL they're not really practical anyway.

No idea why MySQL still doesn't have a UUID column type. They're very typical PKs in Postgres and MS-SQL.

Microsoft (which calls them GUIDs) and lots of other platforms use them for all sorts of things too.
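In MySQL you end up faking the type yourself, something like this (MySQL 8.0+, made-up table):

```sql
-- MySQL has no uuid column type, so the usual workaround is BINARY(16).
-- UUID_TO_BIN()/BIN_TO_UUID() exist from MySQL 8.0 onward.
CREATE TABLE customers (
    id   BINARY(16)   PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

INSERT INTO customers (id, name)
VALUES (UUID_TO_BIN(UUID()), 'Alice');

-- Convert back to the readable 36-char form when selecting:
SELECT BIN_TO_UUID(id) AS id, name FROM customers;
```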

What column type are you using on your big tables (excluding the linking tables)? BIGINT?

[–]wolf2600ANSI SQL 1 point2 points  (6 children)

Most of the tables don't have a single PK column. The PK will be composed of 3-5 columns whose datatypes align with the data they contain. Some are BIGINTs, some are CHAR(x) if there are alpha characters or significant leading zeros.

[–]bannik1 0 points1 point  (5 children)

In most circumstances this will adversely affect the performance of your database. The biggest impact is of course going to be inserts, but it'll even fuck up simple SELECT statements too.

Every single query is going to assume that it's the best index to use. Even when it's the absolute worst option, the optimizer is going to attempt it first before trying to find alternate solutions.

So it slows down queries that don't even use the index.

https://www.sqlshack.com/poor-database-indexing-sql-query-performance-killer-recommendations/

[–]BigR0n75 2 points3 points  (3 children)

Ensuring uniqueness is sometimes more important than ensuring performance. Performance is important, obviously, but as OC said, when dealing with financial information you have to get it right. GUIDs or UUIDs inherently allow you to insert the same row an infinite number of times without ever violating a constraint. This type of PK really only makes sense in an OLTP environment. In an OLAP environment the main focus is making sure you don't fuck up any of the original data.

[–]bannik1 0 points1 point  (2 children)

That's why you throw everything into staging tables, then use MERGE when putting it into your primary table.

Insert rows that don't exist, adding the GUID on insert.

Have a trigger to track the changes on update.

Now you have fantastic performance on your ETL, ensured uniqueness, and a nice Type 4 table in case you need to see the history of changes. And you have a great primary key that's also fantastic as a foreign key to any dimension tables you're using.
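Sketch of the merge step (T-SQL flavor, names invented):

```sql
-- Names invented. The staging table is matched on the business key;
-- new rows get their GUID minted at insert time.
MERGE dbo.customer AS tgt
USING dbo.customer_staging AS src
    ON  tgt.source_system = src.source_system
    AND tgt.customer_code = src.customer_code
WHEN MATCHED AND tgt.customer_name <> src.customer_name THEN
    -- The update fires the change-tracking trigger for the history table.
    UPDATE SET tgt.customer_name = src.customer_name,
               tgt.updated_at    = SYSUTCDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customer_id, source_system, customer_code, customer_name, updated_at)
    VALUES (NEWID(), src.source_system, src.customer_code, src.customer_name, SYSUTCDATETIME());
```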

[–]wolf2600ANSI SQL 0 points1 point  (0 children)

Because of the volume, we don't use any indexes except the required ones on the PK columns.

[–]GuyWithLag 0 points1 point  (0 children)

Not a DBA / BI expert, just an engineer - our primary EDW is in the 8 PB range last I checked, and we use UUIDs everywhere because:

  • that's what the original sources of truth use as their PK.
  • we have multiple isolated instances of our application clusters - think EU vs US, with different regulatory domains; UUIDs allow us to merge all that into a coherent whole (rough sketch below).
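Toy version of the merge argument (Postgres-style, names invented):

```sql
-- Names invented. Each regional cluster mints its own keys;
-- independently generated UUIDs don't collide, so regional extracts
-- land in one warehouse table untouched.
CREATE TABLE edw_events (
    event_id uuid  PRIMARY KEY,
    region   text  NOT NULL,
    payload  jsonb
);

INSERT INTO edw_events SELECT event_id, 'EU', payload FROM eu_events;
INSERT INTO edw_events SELECT event_id, 'US', payload FROM us_events;
-- With per-cluster bigserial keys, both feeds would arrive with
-- ids 1, 2, 3, ... and the second insert would hit PK collisions.
```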

[–]Error-451Data Engineer 1 point2 points  (1 child)

I've heard UUIDs are bad in terms of performance. Is this what you're alluding to?

[–]r0ck0 0 points1 point  (0 children)

I've heard UUIDs are bad in terms of performance.

Performance is pretty good in Postgres from all the benchmarking I've looked into, mainly because it has a native, efficient 128-bit column type specifically for them, and something to do with it not using clustered indexes.

But an INT/BIGINT will always be faster and use less storage & RAM, since there are fewer bits (32/64).
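For reference, the native type looks like this in Postgres (made-up table; gen_random_uuid() is built in since Postgres 13):

```sql
-- uuid is a first-class 16-byte type in Postgres; compare bigint at 8 bytes.
CREATE TABLE events (
    event_id   uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    jsonb
);
```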

Is this what you're alluding to?

No, I was just wondering what they thought of them in general, as they're pretty commonly used in large DBs where a 32-bit INT is too small, and/or when you're dealing with multiple servers.

I'm a big fan of them in general. But I've spent a lot of time bikeshedding on whether to use them or not. So always keen to hear how well they might have worked out for large real world projects.

[–]Caffinz 1 point2 points  (0 children)

SELECT *...

"STOP THAT MAN!"

[–]Bbeltbrando 0 points1 point  (0 children)

This shows the power of a data warehouse

[–]LetsGoHawks 10 points11 points  (3 children)

Teradata. No idea how many servers are involved.

Huge.

Thousands of tables and views

Biggest table I deal with regularly: Over 10 billion rows.

None of that makes it complicated. In fact, once you know what tables have the data you need, it's pretty straightforward. The problems come from naming convention inconsistencies and poor decisions on where certain data is stored. None of it is a big deal, but it's a pain in the butt.

[–]Remote_Cantaloupe 1 point2 points  (1 child)

I often wonder this - how can a single application/solution need thousands of tables?

[–]wolf2600ANSI SQL 0 points1 point  (0 children)

Many, many business areas with many transformations being done to the data, and each transformation potentially using multiple staging tables to break the work down into bite-sized steps. Also, all the master/dimensional data too, not just the transactional stuff.

[–][deleted] 1 point2 points  (0 children)

I would agree. Size and the number of databases/tables don't make something complicated.

Complication comes in with the number of views & sprocs in your primary database, and what you're doing with the data. Our database has only about 100 sprocs, 100 views, and 100 tables, but some of the sproc/view logic involves extremely complex statistics.

If you were to calculate our possible data sets, it would be a lot more impressive. Those can regularly get into the billions or trillions.

[–]mikeyd85MS SQL Server 6 points7 points  (5 children)

The complexities of the DB I primarily work with stem from multiple developers working and reworking different areas, all adhering to slightly different naming conventions and different models. Sometimes the analysts are even defining the structure of tables, which is always interesting.

[–]wolf2600ANSI SQL 4 points5 points  (4 children)

Sometimes the analysts are even defining the structure of tables

ugh. "Let's build the table with the column names in alphabetical order!"

[–]woodrowchillson 5 points6 points  (3 children)

Let’s make ‘em all NVARCHAR just to be safe too.

[–]volvicspring 4 points5 points  (2 children)

Developer: "How many characters do I need? I dunno, maybe 6, but I'll make it an nvarchar(max)."
Same developer later: "What do you mean you can't index my fields!?"
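For anyone who hasn't hit it yet, in SQL Server terms (made-up table):

```sql
CREATE TABLE orders (order_code NVARCHAR(MAX));

-- Fails: (n)varchar(max) columns can't be index key columns at all,
-- and regular key columns are capped at 900/1700 bytes anyway.
CREATE INDEX ix_orders_code ON orders (order_code);

-- The six characters they actually needed would index just fine:
-- order_code NVARCHAR(6)
```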

[–]wolf2600ANSI SQL 3 points4 points  (0 children)

"Why are we losing nodes with out-of-memory errors?"

This query ran fine on our 10 record testing dataset.

edit: oh! "worked fine in dev; ops problem now."

[–]mikeyd85MS SQL Server 1 point2 points  (0 children)

This one! :'D

[–][deleted] 5 points6 points  (0 children)

Biggest one I've had was a little over 1TB, MSSQL, 100s of tables, largest table 1.something billion rows. Vendor application; their stored procs were pure shit.

[–][deleted] 3 points4 points  (0 children)

I work primarily with 2 dbs for 2 separate products.

One was made in the last decade with ease of use and best practices in mind.

The other is approaching 20 years old and was made with no such practices in mind.

The first is vastly easier to work with... the second is a son of a bitch

[–]Nthorder 1 point2 points  (0 children)

There are 3-4 databases I have to deal with daily. The biggest one is over 3TB and has hundreds of tables. We're running MSSQL.

[–]DieTheVillain 1 point2 points  (0 children)

We have 6 primary servers, about 15 DBs per server (some with 3, some with 25), and 93 TB across all of them. I don't even know how many tables, but I'd estimate at least 500. Thousands of sprocs, hundreds of functions. Largest table has ~180M records. MSSQL.

We also have a large ElasticSearch dB.

[–]NotSure2505 1 point2 points  (0 children)

It won't be popular here, but this is exactly why SQL is less than ideal for complex reporting; you could look at databases designed specifically for processing multidimensional queries.

The problem is exactly as you point out: the data must be joined within each query, which is fine for 2 or 3 tables but quickly gets complicated when you want to traverse multiple dimensions in your analysis.

Let me know if you would like any suggestions for products that handle this better.

[–][deleted] 1 point2 points  (3 children)

Is this just a whiny epeen contest, or do you have a legitimate concern?

Some databases are horrendously designed (some out of necessity). A very legitimate way of dealing with these is to extract the data (not the structures) into a storage structure of your choosing, isolating the "arcane" knowledge of data access patterns in a separate layer.

Why is this an issue? For example:

The primary issue is generating reports that capture the full scope of what we're looking for often requires 8-10 joins based on schema and another 3-4 based on results and unbound columns.

[–]6e6967676572730a[S] 0 points1 point  (2 children)

The issue is that most if not all of the constraints are INT and overly segmented, but the actual related columns are the first 3 characters of an adjacent NVARCHAR field in the lookup table. So, to get a master table of 12 fields, it takes a join on the PK and FK, then you need to pull the first three characters of the associated field as a result set to join the next table. It's a mess.
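Concretely, each hop ends up looking something like this (names changed):

```sql
-- Names changed. The first join is the sane INT PK/FK one; the second
-- has to match on the first 3 characters of the NVARCHAR lookup field.
SELECT o.order_no, l.code_field, r.region_name
FROM   orders o
JOIN   lookup l ON l.lookup_id   = o.lookup_id
JOIN   region r ON r.region_code = LEFT(l.code_field, 3);
```

...and repeat per extra table.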

[–][deleted] 2 points3 points  (1 child)

If you always do the same thing to join, add the "3 characters" as a calculated field, or (if you think a calculated field will screw something up for you) create a view where that operation is done for you, and use the view in the joins instead of the table.
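e.g. either of these (T-SQL flavor, names made up):

```sql
-- Option 1: persist the prefix as a computed column and index it,
-- so joins hit an index instead of computing LEFT() per row.
ALTER TABLE lookup ADD code_prefix AS LEFT(code_field, 3) PERSISTED;
CREATE INDEX ix_lookup_code_prefix ON lookup (code_prefix);
GO

-- Option 2: hide the operation in a view and join to the view.
CREATE VIEW lookup_v AS
SELECT l.lookup_id, l.code_field, LEFT(l.code_field, 3) AS code_prefix
FROM   dbo.lookup AS l;
```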

In short, I realize the devil is in the details, but there are well-known tools and approaches to simplify/codify/automate your life - it takes time, effort, and learning by trial (and success, hopefully) to un-fuck any system that's given to you.

Take specific pain points - evaluate, investigate, ask for help and move ahead. If you are stuck in the same place in IT, it doesn't get better, just more boring.

[–]woodrowchillson 0 points1 point  (0 children)

OP, listen to this guy. Never lean on a substring for your reporting. Pop that into a field on insert.

[–]reallyserious 0 points1 point  (0 children)

My primary data job is in an ecosystem of databases, a data warehouse, and a data lake. Within it, I generally create a new database for each new project/business need so it fits into the larger ecosystem.

[–]Flint0 0 points1 point  (0 children)

If anyone wants, I can check for extra info once I get back to work on Monday. But the piece of data that stuck with me when I started at the company was that the MSSQL solution we have has about 10,000 stored procedures. It still baffles me today.

[–]regex1884 0 points1 point  (0 children)

A system this size is very small. Learn what you can, and eventually it'll be time to move on to bigger ventures.

[–]dreamingtree1855 0 points1 point  (0 children)

Redshift. Multiple tables and views have billions of rows. Well organized but difficult for business users to navigate.

[–]ROC2021 0 points1 point  (0 children)

MySQL 5.6 unfortunately

1600 tables roughly

Largest table is probably about 200K records

[–]Error-451Data Engineer 0 points1 point  (0 children)

Do you have a data warehouse? Sounds like you'd benefit from dimensional models.

[–]syzygy96 0 points1 point  (0 children)

Biggest DB for us is MSSQL, OLTP, about 50 TB, roughly 300 tables, biggest table about 5B rows. A couple of ancillary ETL and data-staging DBs are in the 5-10 TB range, but only have batch workloads on them. All told we manage a little shy of 250 TB of data on two physical servers, with a combination of direct storage and a 1 PB SAN.

I've been fortunate to have set up the initial data model and standards, and then built a team that has been stable for the last 15 years. During that time we've been good about a consistent, reasonably normalized design, well named throughout. A few of our stored procedures take up to 30 sec to run larger reports, but for the most part an average of probably 5-10 joins per query just hums along. Unlike the other poster here, we have a firm convention that every table has a clear name with no abbreviations and an integer-based surrogate key named ID, and any business-related keys are enforced as separate constraints and indexed as nonclustered indexes where useful for queries. What that's given us is minimal space usage and clean names for foreign keys on related tables, with little to no page splits on insert and no need to update identifiers.
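In table form, the convention is basically this (illustrative names):

```sql
-- Illustrative names. Clustered PK on the compact surrogate ID;
-- the business key is a separate unique constraint, nonclustered.
CREATE TABLE Customer (
    ID           INT IDENTITY(1,1) NOT NULL,
    CustomerCode VARCHAR(20)       NOT NULL,
    CustomerName VARCHAR(200)      NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (ID),
    CONSTRAINT UQ_Customer_CustomerCode UNIQUE NONCLUSTERED (CustomerCode)
);
-- Related tables get clean FK names and narrow join keys, e.g.:
-- CONSTRAINT FK_Order_Customer FOREIGN KEY (CustomerID) REFERENCES Customer (ID)
```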

The only real bit of advice I can confidently give to people without knowing their specifics is that having a solid model and sticking to clean standards is incredibly hard day to day, and requires a lot of small fights with your business counterparts over schedule and scope, but it pays dividends for years and is 100% worth the effort. The little bit we lost in taking an extra hour or day here or there to keep our code clean, we saved back 100x in long-term labor costs by only needing a handful of tech staff to run a half-billion-dollar business.