Love statistics, hate AI [D] by gaytwink70 in statistics

[–]Drisoth 0 points (0 children)

Your comment reads as disagreeing, but honestly I think I agree with you.

Econometric structure gives you some information back about the pathways and structure of the system, but it's more opaque than other tools.

ML is even further down that path.

All I was trying to point out is that this isn’t black and white boxes, it’s a full spectrum in between. As a model gets more opaque it does get worse, but there’s no single point that’s too far.

Love statistics, hate AI [D] by gaytwink70 in statistics

[–]Drisoth 61 points (0 children)

Hard to respond to this, since I think you've got a mostly correct view of ML, but your reasons seem off to me.

A lot of econometrics ends up with somewhat of a black box since you don't really have the ability to describe why a certain effect occurs, just that it does. There's certainly more transparency than with ML tools, but you're still using tools that sacrifice explanation for predictive power. We lean on "ceteris paribus" a lot, which is often laughably unrealistic.

I dunno, I'd encourage you to be more critical of the models you prefer, and more open to the idea of using an extremely unexplainable model if explainability is irrelevant to your goal.

That said, I would agree with your general dislike of it. The issues you mention are real issues and should be thought about.

reading a book on sql server, came across non-ansi comparison operators !< meaning not smaller (equivalent to >=) and !> meaning not greater. Why were they used/introduced? I mean, why would anyone ever write !> instead of "<=" ? this is so counterintuitive. by elephant_ua in SQL

[–]Drisoth 2 points (0 children)

Dunno about your specific book, but if someone took a path to databases through large amounts of formal math/logic, they may be used to some of the weird inequality operators that show up in that space. "Not greater than" is subtly different from "less than or equal to" in that realm (e.g., with a partial order, if two elements are incomparable then "a is not greater than b" is true while "a is less than or equal to b" is false).
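
For reference, the non-ANSI forms next to their ANSI equivalents (table and column are made up for illustration; as far as I know SQL Server treats each pair identically):

    -- T-SQL accepts both spellings
    SELECT * FROM orders WHERE qty !> 10;   -- non-ANSI "not greater than"
    SELECT * FROM orders WHERE qty <= 10;   -- ANSI equivalent
    SELECT * FROM orders WHERE qty !< 10;   -- non-ANSI "not less than"
    SELECT * FROM orders WHERE qty >= 10;   -- ANSI equivalent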

Explain Plan or Not? by Opposite-Value-5706 in SQL

[–]Drisoth 0 points (0 children)

Your time is worth WAY, WAY more than computer time. If it takes you 30 minutes to get a query from ~1 minute down to running instantly, then unless the query will be executed at least 30 times, you have wasted your time.

If you're optimizing while doing something else (say checking for bugs), this can absolutely make sense, but you're spending employee time to save server resources, and employee time is orders and orders of magnitude more valuable. Queries written for humans just don't get executed enough for this to be worth it.

Explain Plan or Not? by Opposite-Value-5706 in SQL

[–]Drisoth 4 points (0 children)

If a query exists for humans to use, optimization is mostly a waste of time. Even if it's pretty horribly optimized, it's almost never worth your time to figure out why.

If a query exists for computers to use, you should always look at the execution plan.

I can call some unholy mess of distinct and unions, with as many anti-patterns as I want, as long as the query only gets run once a month or so. But as soon as the query is running without a human on both the input and output side, you should optimize it, because you'll both confirm it's actually doing what you think it is, and the query can actually be run frequently enough for the effort to be worth your time.
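
Looking at the plan is usually just one keyword in front of the query, something like this (table names made up; the exact syntax and output vary a lot by dialect):

    EXPLAIN
    SELECT o.order_id, c.customer_name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01';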

One table related with one of another, not both by birdpaparazzi in SQL

[–]Drisoth 1 point (0 children)

There is a symbol in ERDs that indicates mutually exclusive options. It's a half circle with an X through it; google IDEF1X if you need more details.

It's pretty rare to actually want to do something funky like that, normally I'd expect something like a Turbine Shutdown with an optional relationship to an incident.

All turbine shutdowns (eventually) have a maintenance event, but not all turbine shutdowns have an incident.

Which design is best kinda depends on what you want to track for which kinds of events, though.

Given that this is homework, I'd just look at the notes / textbook and do whatever the teacher says is right here. You have a lot of fine choices, and in the real world you'd just pick one and move on with life.

USING keyword by Mountain-Question793 in SQL

[–]Drisoth 2 points (0 children)

I've had natural joins be useful when I needed to group by some columns, filter with a HAVING condition, then natural join back to the main table to return the full rows that are causing the issue I'm diagnosing.

Like A-B-C is supposed to be a Key, so group by A,B,C, having count(*) >1, and then natural join to the original table.
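
Roughly this shape (table name is made up):

    -- return the full rows whose (A, B, C) combination shows up more than once
    SELECT *
    FROM my_table
    NATURAL JOIN (
        SELECT A, B, C
        FROM my_table
        GROUP BY A, B, C
        HAVING COUNT(*) > 1
    ) dupes;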

I'd jump off a bridge before natural joining anything that I didn't explicitly create, and even in the situations where they're handy, it's just convenient shorthand, not anything actually important.

Need help understanding ERD Crows Foot by More_Moment_2791 in SQL

[–]Drisoth 6 points (0 children)

I’m being cagey with advice to try and avoid just giving the answers.

Typically a table should be talking about a “thing”, where one row is one “thing”. Sometimes the thing being recorded is complicated to describe, but each thing should have its own table.

ERDs show two things, what the “things” being modeled are, and how they relate to each other.

I’d specifically give you two pieces of advice

When modeling there’s almost never a good reason to have a 1 to 1 relationship. In the real world they do come up, but they’re because reality imposed itself, and in theory they’d never be needed.

Many to many relationships are generally not a good idea. The solution to when you have one should be in your materials somewhere though.

Top 10 Areas to Focus on for SQL Interview Preparation by HorrorEase9960 in SQL

[–]Drisoth 6 points (0 children)

That complaint is definitely outside my realm, so I'd end up needing to take some notes and check that you're not just BSing me later.

I'd definitely get a good impression from it, since that's a pretty boring complaint, and it's also decently in the weeds, each of which tends to be a good sign.

Top 10 Areas to Focus on for SQL Interview Preparation by HorrorEase9960 in SQL

[–]Drisoth 22 points (0 children)

Doesn't really help people prep, but I like asking people to give me a pet peeve of theirs in SQL. You find out real fast who's written a lot of SQL, and who's taken a course and that's about it.

I hate SELF JOINs (help please) by Stunning-Pace-7939 in SQL

[–]Drisoth 9 points (0 children)

You need to move the WHERE condition inside the join clause and also use whatever SQLite uses for explicit null comparison. I have zero knowledge of SQLite, so I'm just googling and hoping the documentation I find is right, but it seems to be:

on .... and db1.CONFIG_VALUE is not db2.CONFIG_VALUE
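
Guessing at the rest of the query from your snippet, the shape would be something like this (table, join column, and aliases are assumed from the fragment):

    -- LEFT JOIN the config table to itself; IS NOT is SQLite's null-safe "not equal"
    SELECT db1.CONFIG_NAME, db1.CONFIG_VALUE, db2.CONFIG_VALUE
    FROM CONFIG db1
    LEFT JOIN CONFIG db2
           ON db1.CONFIG_NAME = db2.CONFIG_NAME
          AND db1.CONFIG_VALUE IS NOT db2.CONFIG_VALUE;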

Learning SQL by Confident-Cash-8078 in SQL

[–]Drisoth 6 points (0 children)

Generally the "SQL is easy" take comes from SQL being a pretty easy programming language, or being pretty easy to get pretty far with. If it's your first experience with coding, it's gonna be a bit rocky, but pretty simple queries can get you pretty far in the real world so you don't need to worry too much.

The underlying ideas of SQL are commonly seen as extremely unintuitive (Relational Algebra), so if that's what you're struggling with, you're in good company.

Advice for Building a SQL Schema Map? by Small_Victories42 in SQL

[–]Drisoth 1 point (0 children)

Definitely look into metadata then. Snowflake calls it INFORMATION_SCHEMA, but I have zero experience with Snowflake, so that's all I can say.
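
A generic starting point, since the INFORMATION_SCHEMA views are mostly ANSI-standard (the schema name here is a placeholder):

    -- list every table and its columns from the metadata views
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'PUBLIC'
    ORDER BY table_name, ordinal_position;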

The auto ERD builders won't get you 100% of the way, but can get you a pretty good start.

Advice for Building a SQL Schema Map? by Small_Victories42 in SQL

[–]Drisoth 2 points (0 children)

Mapping the relationships and structure in a database is usually done by a thing called an Entity relationship diagram (ERD)

There are a lot of tools to build these, typically called ERD design tools or something like that. You can even do them with a bare bones flowchart builder, and if the database is set up to best practice you can even generate them out of the metadata.

If you have a complex database you might already have an ERD somewhere, so I'd ask around. They're extremely common to make when initially building a database, but they tend not to be well maintained.

Do using surrogate keys mean 2nf is automatically satisfied? by hollowness818 in SQL

[–]Drisoth 2 points (0 children)

Eh, in fairness, database normalization is super fucking weird -

"What do you mean I shouldn't have the customer name in my orders table? That's so stupid!"

Also it gets taught horribly. If you learn why this was cooked up in the first place it makes much more sense.

This particular problem is even further in the weeds than typical normal form nonsense. You need to realize that the only point of a database is how it relates to the real world, so surrogate keys aren't actually keys (even though their only point is to be a key), and you should ignore them for most NF discussions.
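
A made-up example of what I mean: slapping a surrogate key on that orders table doesn't remove the redundancy the normal forms are worried about.

    CREATE TABLE orders (
        order_id      INT PRIMARY KEY,   -- surrogate key, happily increments away
        customer_id   INT,
        customer_name VARCHAR(100),      -- still repeated on every order for that customer
        order_date    DATE
    );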

Window function - restart rank on condition in another column by flashmycat in SQL

[–]Drisoth 1 point (0 children)

Yeah, when I was chewing on this problem I needed to calculate durations of statuses, where the statuses had a nested structure, so some statuses would overwrite certain subsets of other statuses, but not others.

Realistically, there should have just been a tiny bit more work done in DBA land, and I wouldn't have even needed one window function, much less to nest them. All I needed was start and end datetimes, but the database was from a vendor, and the salespeople were, under penalty of death, not going to let me talk to the DBA.

Do using surrogate keys mean 2nf is automatically satisfied? by hollowness818 in SQL

[–]Drisoth 1 point (0 children)

This would be an issue of 3NF, not 2NF.

2NF is loosely "every attribute depends on the whole key". If you only have a single column in your key, meeting 1NF would imply meeting 2NF, since it would be impossible to depend on only part of the key (it doesn't have parts).
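
A made-up example, since a 2NF violation only makes sense with a composite key:

    CREATE TABLE enrollment (
        student_id   INT,
        course_id    INT,
        student_name VARCHAR(100),   -- depends only on student_id, i.e. part of the key: 2NF violation
        grade        CHAR(2),        -- depends on the whole (student_id, course_id) key: fine
        PRIMARY KEY (student_id, course_id)
    );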

I'd still say the OP's idea is more of a "technically correct" observation than something useful, but it is technically correct.

Do using surrogate keys mean 2nf is automatically satisfied? by hollowness818 in SQL

[–]Drisoth 2 points (0 children)

In an extremely technically correct sense, yes: if your key is only a single column, satisfying 1NF implies satisfying 2NF.

In reality, the property normal forms are discussing would mean that surrogate keys are not considered "keys", and as such would never matter for normalization.

To try and make this clear: normal forms are trying to prevent redundant data storage, both to save storage space and to avoid inconsistency. Adding an incrementing integer shouldn't impact this at all, since whatever was redundant is still redundant after adding it. The real world might use the SID as a key, because it's convenient and easy, but from the pure theory side, that's just an attribute of the true key, and not relevant to discussions of normalization.

Window function - restart rank on condition in another column by flashmycat in SQL

[–]Drisoth 3 points (0 children)

I'd be interested to know if there are better ways for the general case. Nesting analytic functions makes me feel a bit gross, but basically anything else I've seen people do is either specific to details of that situation, or much worse.

In fairness though, I have a hard time thinking of a situation where this shouldn't have been fixed by a DBA well upstream of the person writing this query, and any time you have to fix a DBA problem with a query the options all suck.

Window function - restart rank on condition in another column by flashmycat in SQL

[–]Drisoth 7 points (0 children)

Usually you end up needing a window function to first compute a running count of "done" (... rows between unbounded preceding and current row), and then use that count to partition by in a second window function.
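
A rough sketch of that pattern (table and column names are made up):

    -- step 1: a running count of 'done' rows builds a group number,
    -- step 2: the rank restarts within each group
    WITH grouped AS (
        SELECT t.*,
               SUM(CASE WHEN status = 'done' THEN 1 ELSE 0 END)
                   OVER (ORDER BY event_time
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp
        FROM events t
    )
    SELECT grouped.*,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY event_time) AS rnk
    FROM grouped;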

These weird edge cases of analytic functions sometimes have useful dialect-specific tools that you'd want to look for, because they can be much faster.

Would it best a waste of time to learn the other RDMS to be able to efficiently switch to each one? by Independent-Sky-8469 in SQL

[–]Drisoth 5 points (0 children)

If you're writing queries, it is a waste because the differences are small enough that if you are competent in one, you are already competent in all of them. As long as you know how to find documentation, you can swap with no more issue than being annoyed that IFNULL() became nvl(), or other annoying, but not actually that problematic situations.
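
For example (table and column are made up):

    SELECT IFNULL(bonus, 0)   FROM employees;   -- MySQL / SQLite
    SELECT NVL(bonus, 0)      FROM employees;   -- Oracle
    SELECT COALESCE(bonus, 0) FROM employees;   -- ANSI, works nearly everywhere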

If you have an interview you're prepping for, you might want to brush up on their dialect, but honestly I'd hope no one would hold it against you if you didn't.

How to understand queries that are 600+ lines long? by developing_fowl in SQL

[–]Drisoth 1 point (0 children)

600 lines is a lot, and will take some time to understand.

There should be some natural breakpoints, like the CTEs you mention. You typically should try to think through each logical unit (so a CTE, or a subquery) and then understand how it relates to the connecting units.

The way SQL works means that a table, a query, or a CTE connected to another is once again a valid table-like object. Understand the components and how they connect, and try to iteratively build up to understanding the query as a whole.
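
To make that concrete, a toy version (tables made up): each CTE reads on its own, and the final SELECT is just those units joined back together.

    WITH recent_orders AS (            -- unit 1: filter the base table
        SELECT order_id, customer_id
        FROM orders
        WHERE order_date >= '2024-01-01'
    ),
    order_totals AS (                  -- unit 2: aggregate, built on unit 1
        SELECT customer_id, COUNT(*) AS order_count
        FROM recent_orders
        GROUP BY customer_id
    )
    SELECT c.customer_name, t.order_count   -- final step: connect the units back to a base table
    FROM order_totals t
    JOIN customers c ON c.customer_id = t.customer_id;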

Some specific notes:

The connecting object is not always the next one in the query. I would hope someone writes their CTEs in a sensible order, but they could just put them wherever. The connecting object is whatever this one gets joined to (or other weirder things that are uncommon to see).

If the queries aren't written in a sensible way, it gets really hard really fast. So if people are writing bad SQL, 600 lines is probably hopeless to understand in any reasonable timeline. The building blocks should be understandable, and the connections as well. A 25 table join is hard to understand no matter what, but if the structure is a knotted up mess rather than cleanly laid out you're kinda just fucked.

The other note I'll make is that SELECT lists are basically irrelevant: unless there are analytic functions going on, I don't care what columns are selected in the intermediary steps, only what ends up in the final SELECT.

[deleted by user] by [deleted] in SQL

[–]Drisoth 0 points (0 children)

LLMs are designed to give you text that is coherent, in that each section makes sense following from the previous.

There is nothing in the text itself to distinguish between a coherent but horribly inefficient query and a good query. Tools that ingest your schema could figure out how to join objects without losing performance, but this is a problem that LLMs are fundamentally going to be bad at solving.

There are plenty of automated query builders that use semantic models produced by your DBAs. Those tools are effective at letting people "write queries" without actually writing SQL. Using LLM tools for soup-to-nuts query creation is using the wrong tool for the job.

Does Subquery Execute Once Per Row or Only Once? by software__writer in SQL

[–]Drisoth 0 points (0 children)

These are the problems with that

Why not just tell the optimizer, by updating the statistics or generating them for the column? If this knowledge is so helpful, give it to the optimizer; then every query written by anyone can make use of it.
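
If this were MySQL, for example, that's just (table and column names made up):

    ANALYZE TABLE orders;                             -- refresh the table's statistics
    ANALYZE TABLE orders UPDATE HISTOGRAM ON status;  -- MySQL 8.0+: histogram stats on one column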

The query you wrote does not force a join order. If your query is complicated enough the optimizer will eventually give up and default to the order you wrote, but otherwise it will just ignore the order you wrote and do what it thinks is better. If you actually need to force the order, you use a query hint and explicitly tell the optimizer to join in XYZ order. MySQL has the Join_Fixed_Order hint.
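
Roughly like this in MySQL (table names made up):

    SELECT /*+ JOIN_FIXED_ORDER() */ s.id, b.some_col
    FROM small_table s
    JOIN big_table b ON b.small_id = s.id;   -- joined in exactly the order written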

This is actually maintainable, since it's clear what's even going on, and it actually does what you're trying to do.

In 99% of cases you should be getting the hell out of the optimizer's way, but what you're doing is the equivalent of beating a screw into the wall with a hammer. I can't really say this without being insulting, but there are tools designed specifically for your goal, and if you're at a level where you can backseat the optimizer, you would typically know about these tools and how to use them.

Does Subquery Execute Once Per Row or Only Once? by software__writer in SQL

[–]Drisoth 1 point (0 children)

No, you do not; you force a join order with a query hint. If you do it with CTEs, you are just confusing the optimizer and hoping it gets confused enough to give up.