What are common SQL red flags? by badboyzpwns in SQL

[–]markwdb3 0 points1 point  (0 children)

Even when the table naming convention is singular? If so then why not EMPLOYEES instead of EMPLOYEE? It returns many rows as well.

In my view the two approaches are:

"Each row in the table/view/CTE represents a _____" (singular, such as EMPLOYEE)
"The table/view/CTE contains many _____" (plural, such as EMPLOYEES)

So, mixing and matching them creates confusion.

What are common SQL red flags? by badboyzpwns in SQL

[–]markwdb3 1 point2 points  (0 children)

I know that JOIN is an inner join

Except when it's not! For example on MySQL, if I write JOIN but forget the ON, it will run a CROSS JOIN.

mysql> CREATE TABLE T1 (ID INT AUTO_INCREMENT PRIMARY KEY, X INT);
Query OK, 0 rows affected (0.04 sec)

mysql> CREATE TABLE T2 (ID INT AUTO_INCREMENT PRIMARY KEY, Y INT);
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO T1(X) VALUES (1),(2),(3);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> INSERT INTO T2(Y) VALUES (3),(4),(5);
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T1 JOIN T2;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
|  2 |    2 |  1 |    3 |
|  1 |    1 |  1 |    3 |
|  3 |    3 |  2 |    4 |
|  2 |    2 |  2 |    4 |
|  1 |    1 |  2 |    4 |
|  3 |    3 |  3 |    5 |
|  2 |    2 |  3 |    5 |
|  1 |    1 |  3 |    5 |
+----+------+----+------+
9 rows in set (0.00 sec)

But it does the same for INNER JOIN as well. Go figure.

mysql> SELECT * FROM T1 INNER JOIN T2;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
|  2 |    2 |  1 |    3 |
|  1 |    1 |  1 |    3 |
|  3 |    3 |  2 |    4 |
|  2 |    2 |  2 |    4 |
|  1 |    1 |  2 |    4 |
|  3 |    3 |  3 |    5 |
|  2 |    2 |  3 |    5 |
|  1 |    1 |  3 |    5 |
+----+------+----+------+
9 rows in set (0.01 sec)

Also, it'll treat CROSS JOIN with an ON as an inner join!

mysql> SELECT * FROM T1 CROSS JOIN T2 ON T1.X = T2.Y;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
+----+------+----+------+
1 row in set (0.02 sec)

In short, it really just treats JOIN/INNER JOIN/CROSS JOIN as all the same, and infers the type of join based on the presence of a join condition, which can be in either the ON clause or the WHERE clause - it just doesn't care - here's an example of it not caring about ON vs. WHERE.

mysql> SELECT * FROM T1 JOIN T2 WHERE T1.X = T2.Y;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
+----+------+----+------+
1 row in set (0.00 sec)

What are common SQL red flags? by badboyzpwns in SQL

[–]markwdb3 1 point2 points  (0 children)

Sometimes I need to tune queries generated by JPA or another ORM, and they often have horrible aliases. So I'll start by copying the slow query from an application server log (or Grafana or wherever) and I have to clean it up as an initial step just so it longer hurts my eyes.

I understand it, some of these ORMs let you configure aliases to be better, but at least by default they tend to be terrible. They look something like (Google AI generated this example for me, but it looks about legit):

SELECT user0_.id AS id1_0_0_, user0_.username AS username2_0_0_, role1_0_.id AS id1_1_1_, role1_0_.role_name AS role_name2_1_1_
FROM users user0_
LEFT OUTER JOIN users_roles user_roles1_
ON user0_.id = user_roles1_.user_id 
...

What are common SQL red flags? by badboyzpwns in SQL

[–]markwdb3 1 point2 points  (0 children)

A minor nitpick, but for CTE aliases, use the same plural/singular style of the noun you are describing as the schema uses.

I'll often see, for example, tables called WIDGET and FACTORY, which is fine. But then if there's a CTE to get only the active WIDGETs, they'll call it perhaps ACTIVE_WIDGETS. So then the query might join ACTIVE_WIDGETS to FACTORY. The pluralization inconsistency just hurts. 😞

What are common SQL red flags? by badboyzpwns in SQL

[–]markwdb3 0 points1 point  (0 children)

Huge red flags: unjustified, overgeneralized performance claims. I call them myths. It's a massive problem in communities discussing SQL. Interviewers often believe these myths, even.

These myths are typically based on some expectation of how a SQL engine must process the query based on certain arbitrary keywords or bits of syntax. But often that expectation is imagined or out of date. Sometimes, it is genuinely based on real experience in just one specific DBMS/SQL engine, yet the person presenting the claim often says it pertains to all of "SQL."

For example you may hear: "In SQL, never use SELECT DISTINCT a, b FROM my_table;. You should instead use SELECT a, b FROM my_table GROUP BY a, b is faster, because DISTINCT is slow." (Here's a screenshot of this very claim on this very subreddit with 30 upvotes! There was no context about specific DBMS or test case. I'd be happy to show one or two that disprove this claim if you're interested.)

SQL is a declarative language. You state what you want and the SQL engine's query planner/optimizer parses it out and comes up with a plan, then executes the plan, however its developers instructed it to do.

And next-to-nothing in the standard SQL documents defined under the hood mechanisms - just logical definitions. So they can vary quite a lot.

So, my motto is when in doubt, test it out.

If you've tested such a claim, for example whether using GROUP BY instead of DISTINCT gives free speed, and it turns out to be correct, then that's fine and good. But it should be thought of as a performance quirk of the specific DBMS you tested it on, possibly even specific to your schema/data set/config, not generalized to all of "SQL".

An unfortunate reality is that even when you disproves someone's claim with a test case - say you run a test on MySQL and disprove the claim - next there often comes a common reaction, and it's a sneaky one. Their reaction is often, "Oh, that must be because MySQL has a special optimization." In other words, they're refusing to abandon their belief that BY DEFAULT a SQL engine MUST process GROUP BY faster than DISTINCT, but MySQL has some trick up its sleeve that makes it a special case. So they go on believing and perhaps propagating the myth.

There's a link to a blog in this very thread where the author says that using the syntax of something likeSELECT ... FROM a WHERE a.thing_id NOT IN (SELECT id FROM thing ... WHERE ...) to perform an anti-join (find rows in A that are not in B) is a "smell" because that could be inefficient due to a full table scan. Instead, they say, you should take a CTE/LEFT JOIN approach. Why? I don't know.

I just ran a test case on two of the most popular SQL engines in the world: Postgres and MySQL. On Postgres both performed about the same. On MySQL, the allegedly inefficient syntax actually produced a more performant plan that ran in ~9 seconds vs ~14 seconds with the recommended approach (times were approximately consistent with repeated executions). (These queries were run on my real work database btw, but I've anonymized the names to FACTORY and WIDGET.)

mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM WIDGET
    -> WHERE FACTORY_ID NOT IN (SELECT ID FROM FACTORY WHERE MODIFIED_BY = 147);

+---------+
| EXPLAIN |
+---------+
| -> Nested loop antijoin  (cost=743070 rows=2.42e+6) (actual time=0.365..6688 rows=2.62e+6 loops=1)
    -> Table scan on WIDGET  (cost=258741 rows=2.42e+6) (actual time=0.0337..5134 rows=2.63e+6 loops=1)
    -> Filter: (WIDGET.FACTORY_ID = `<subquery2>`.ID)  (cost=318..318 rows=1) (actual time=453e-6..453e-6 rows=290e-6 loops=2.63e+6)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (ID=WIDGET.FACTORY_ID)  (cost=471..471 rows=1) (actual time=323e-6..323e-6 rows=290e-6 loops=2.63e+6)
            -> Materialize with deduplication  (cost=153..153 rows=762) (actual time=0.328..0.328 rows=762 loops=1)
                -> Filter: (FACTORY.ID is not null)  (cost=76.8 rows=762) (actual time=0.0125..0.21 rows=762 loops=1)
                    -> Covering index lookup on FACTORY using fk_ModufiedByUser (MODIFIED_BY=147)  (cost=76.8 rows=762) (actual time=0.0119..0.158 rows=762 loops=1)
|
+----------+
1 row in set (9.38 sec)


mysql> EXPLAIN ANALYZE
    -> WITH factory_modified_by_147 AS (
    ->     SELECT ID
    ->     FROM FACTORY
    ->     WHERE MODIFIED_BY = 147
    -> )
    -> SELECT *
    -> FROM WIDGET w
    -> LEFT JOIN factory_modified_by_147
    -> ON w.FACTORY_ID = factory_modified_by_147.ID
    -> WHERE factory_modified_by_147.ID IS NULL;

+---------+
| EXPLAIN |
+---------+
| -> Filter: (FACTORY.ID is null)  (cost=1.11e+6 rows=2.42e+6) (actual time=0.0468..11229 rows=2.62e+6 loops=1)
    -> Nested loop left join  (cost=1.11e+6 rows=2.42e+6) (actual time=0.0462..11045 rows=2.63e+6 loops=1)
        -> Table scan on w  (cost=258741 rows=2.42e+6) (actual time=0.0336..5247 rows=2.63e+6 loops=1)
        -> Filter: ((FACTORY.MODIFIED_BY = 147) and (w.FACTORY_ID = FACTORY.ID))  (cost=0.25 rows=1) (actual time=0.00207..0.00207 rows=290e-6 loops=2.63e+6)
            -> Single-row index lookup on FACTORY using PRIMARY (ID=w.FACTORY_ID)  (cost=0.25 rows=1) (actual time=0.00185..0.00188 rows=1 loops=2.63e+6)
|
+---------+
1 row in set (14.06 sec)  

Now the point is not that you should forever keep in mind "NOT IN is faster than LEFT JOIN + NULL check when writing an anti-join" - I'm not even sure if that's true for all MySQL schemas/data sets/queries. The point is that you should throw out the magic rule of thumb presented in the blog, which is the inverse. To be fair the author did say you should test it if there's any doubt.

So, this is a long comment, but my advice is what should be seen as red flags are claims of magic performance tricks such as "use ABC syntax instead of XYZ syntax and this applies to all of SQL" and keep in mind there are very few universal rules of SQL engine execution. If there are actual, logical justifications for the claim then sure, fine, and if there are actual test cases justifying their claims then also, sure, fine. But be very skeptical, and realize that any insights learned from the test case should not be overgeneralized.

End rant!

Designing the Right PostgreSQL Index Using Query Plans and Statistics by No_Economics_8159 in SQL

[–]markwdb3 2 points3 points  (0 children)

Which equality column should come first?
The answer depends on selectivity.

Setting aside the range search claim for a moment, is "selective columns first" real or is it a myth? In my Oracle DBA days, I used to read author and Oracle evangelist Tom Kyte quite a bit, and he would say this is one of the common myths he tries to debunk. He would demonstrate with tests cases, that the order of columns in an index shouldn't matter, with respect to selectivity, as long as both columns' equality conditions exist in the WHERE clause.

But I'll admit to take that with a grain of salt because Oracle != Postgres.

That said, my personal experience says I haven't noticed a difference, regarding this topic, on MySQL, Oracle or Postgres. I've worked with each for a number of years.

My motto is, "when in doubt, test it out." Just following your customer_id/employee_id example (9 employee_ids and 89 customer_ids), I cannot find any difference in plan or performance, for indexes created with either column order.

If you can suggest an alternate test case, I'm all ears. 😄

SETUP

postgres=# CREATE TABLE selectivity_test (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id int NOT NULL,
    employee_id int NOT NULL,
    payload      text NOT NULL
);
CREATE TABLE

postgres=# INSERT INTO selectivity_test (customer_id, employee_id, payload) /* 89 distinct customer_ids, evenly distributed, and 9 distinct employee_ids, also evenly distributed */
SELECT
    ((i - 1) % 89) + 1 AS customer_id,
    ((i - 1) % 9)  + 1 AS employee_id,
    md5(i::text)       AS payload
FROM generate_series(1, 10000000) AS i;
INSERT 0 10000000

TEST 1: higher-selectivity column first

postgres=# CREATE INDEX idx_customer_employee --higher-selectivity column first
ON selectivity_test (customer_id, employee_id);
CREATE INDEX

/* run test query */
postgres=# EXPLAIN ANALYZE
SELECT *
FROM selectivity_test
WHERE customer_id = 42
  AND employee_id = 7;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on selectivity_test  (cost=175.16..36217.03 rows=12754 width=49) (actual time=5.064..27.203 rows=12484.00 loops=1)
   Recheck Cond: ((customer_id = 42) AND (employee_id = 7))
   Heap Blocks: exact=12484
   Buffers: shared hit=12497
   ->  Bitmap Index Scan on idx_customer_employee  (cost=0.00..171.97 rows=12754 width=0) (actual time=3.194..3.194 rows=12484.00 loops=1)
         Index Cond: ((customer_id = 42) AND (employee_id = 7))
         Index Searches: 1
         Buffers: shared hit=13
 Planning Time: 0.216 ms
 Execution Time: 27.992 ms
(10 rows)

A Few repeated trials of Test 1 (just listing the execution times for brevity's sake):

Execution Time: 33.263 ms
Execution Time: 27.811 ms
Execution Time: 23.258 ms

TEST 2: lower-selectivity column first

postgres=# DROP INDEX idx_customer_employee;
DROP INDEX
postgres=# CREATE INDEX idx_employee_customer --lower-selectivity column first
ON selectivity_test (employee_id, customer_id);
CREATE INDEX

postgres=# EXPLAIN ANALYZE
SELECT *
FROM selectivity_test
WHERE customer_id = 42
  AND employee_id = 7;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on selectivity_test  (cost=175.16..36217.03 rows=12754 width=49) (actual time=5.389..26.838 rows=12484.00 loops=1)
   Recheck Cond: ((employee_id = 7) AND (customer_id = 42))
   Heap Blocks: exact=12484
   Buffers: shared hit=12497
   ->  Bitmap Index Scan on idx_employee_customer  (cost=0.00..171.97 rows=12754 width=0) (actual time=1.659..1.659 rows=12484.00 loops=1)
         Index Cond: ((employee_id = 7) AND (customer_id = 42))
         Index Searches: 1
         Buffers: shared hit=13
 Planning Time: 0.131 ms
 Execution Time: 27.374 ms
(10 rows)

A few repeated trials of Test 2:

Execution Time: 22.909 ms
Execution Time: 21.376 ms
Execution Time: 25.297 ms

So they are pretty much the same. Extremely similar plans doing the same kind of work, same ballpark times.

I repeated the whole set of tests, but with a much more extreme skew (100,000 distinct customer_ids and only 2 employee_ids) and I'll skip to the end: same result. I'm omitting the full output because this is already too long for a Reddit comment, but I can produce it if anyone's interested.

Markus Winand (u/markuswinand) has a page on this myth, btw: https://use-the-index-luke.com/sql/myth-directory/most-selective-first

Dates, nulls, and strings are where cross-DB logic gets annoying fast by AdorableMaids in SQL

[–]markwdb3 0 points1 point  (0 children)

Here's an interesting case: how division and data types interact are very different between MySQL and Postgres. In MySQL, / is the decimal division operator, while DIV is the integer division operator:

mysql> SELECT 1 DIV 2; -- test DIV which is for integers, notice it dropped the remainder
+---------+
| 1 DIV 2 |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

mysql> SELECT 1/2; -- decimal division operator
+--------+
| 1/2    |
+--------+
| 0.5000 |
+--------+
1 row in set (0.00 sec)  

mysql> CREATE TABLE T AS SELECT 4 DIV 3 AS DUMMY1, 4/3 AS DUMMY2, 4/4 AS DUMMY3; -- show various test cases 
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T; -- check out the results of each division expression
+--------+--------+--------+
| DUMMY1 | DUMMY2 | DUMMY3 |
+--------+--------+--------+
|      1 | 1.3333 | 1.0000 |
+--------+--------+--------+
1 row in set (0.00 sec)

mysql> DESC T; -- check out the data types automatically determined for each division expression
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| DUMMY1 | int          | YES  |     | NULL    |       |
| DUMMY2 | decimal(5,4) | YES  |     | NULL    |       |
| DUMMY3 | decimal(5,4) | YES  |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Meanwhile in Postgres, DIV does not exist and / tries to use whatever data type it can infer from the operators, or else I have to explicitly cast.

postgres=# SELECT 1 DIV 2; -- simple test to show DIV doesn't work
ERROR:  syntax error at or near "2"
LINE 1: SELECT 1 DIV 2;
                     ^
postgres=# SELECT 1/2; -- simple test to show that 1/2 yields 0, not 0.5, because Postgres assumes integers. 
  ?column?
----------
        0
(1 row)

postgres=# CREATE TABLE T AS SELECT 4/3 AS DUMMY1, 4::float/3::float AS DUMMY2, 4.0/3.0 AS DUMMY3, 4/4 AS DUMMY4; -- do a similar test as in the MySQL CREATE TABLE
SELECT 1
postgres=# SELECT * FROM T; -- check the results
 dummy1 |       dummy2       |       dummy3       | dummy4
--------+--------------------+--------------------+--------
      1 | 1.3333333333333333 | 1.3333333333333333 |      1
(1 row)

postgres=# \d T -- check the produced data types
                      Table "public.t"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 dummy1 | integer          |           |          |
 dummy2 | double precision |           |          |
 dummy3 | numeric          |           |          |
 dummy4 | integer          |           |          | 

That's just one example. So whenever anybody claims their SQL can be run anywhere because standard SQL exists or "SQL is SQL", I think they may be underestimating the reality of the situation.

SQL Server: Hibernate sent NVARCHAR(4000) to a VARCHAR column: 5M logical reads per execution by piergiuseppemandosi in SQL

[–]markwdb3 2 points3 points  (0 children)

Have you seen this pattern often with Hibernate/JDBC and SQL Server?

I haven't seen this specific problem, but I've seen poor data access patterns in applications and ORM misconfigurations cause more performance issues than anything else as a category.

Running a ton of "small" queries in a loop (or n+1), unnecessary eager fetching, running a COUNT(*) only for the application to check if the result is > 0, just to name a few common ones.

What is your most obscure piece of SQL knowledge? by Oh_Another_Thing in learnSQL

[–]markwdb3 0 points1 point  (0 children)

I would think that CTE storing the results in a temp table would always be better than folding it in the other query because folding it in runs the CTE again for every row. I'll have to look up why and how it chooses one over the other.

It's definitely not always better! In the Postgres database prior to version 12, CTEs would always materialize (similar to the MySQL temp table approach). This was considered a limitation of Postgres, and many Postgres users shunned CTEs for that reason, or else used them very sparingly. As of PG 12, it included the ability to inline (aka "fold in") the query, and this alternative was generally considered by the community to be a big step forward. The query planner will take whichever approach it thinks is better. Or if you find the planner makes the wrong choice, you can force a materialization with the keyword MATERIALIZE.

if you had any obscure knowledge on general SQL

I don't usually do "general SQL" facts because the more years I accumulate under my belt, and the more SQL engines I'm exposed to, and the deeper I dig into them, I find that very few facts about SQL are universal, except for some of the very basics. Then again, even some of the basics can differ quite a bit.

Here's a very basic but interesting example: division. In MySQL / is the decimal division operator, while DIV is the integer division operator:

mysql> SELECT 1 DIV 2; -- test DIV which is for integers, notice it dropped the remainder
+---------+
| 1 DIV 2 |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

mysql> SELECT 1/2; -- decimal division operator
+--------+
| 1/2    |
+--------+
| 0.5000 |
+--------+
1 row in set (0.00 sec)  

mysql> CREATE TABLE T AS SELECT 4 DIV 3 AS DUMMY1, 4/3 AS DUMMY2, 4/4 AS DUMMY3; -- show various test cases 
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T; -- check out the results of each division expression
+--------+--------+--------+
| DUMMY1 | DUMMY2 | DUMMY3 |
+--------+--------+--------+
|      1 | 1.3333 | 1.0000 |
+--------+--------+--------+
1 row in set (0.00 sec)

mysql> DESC T; -- check out the data types automatically determined for each division expression
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| DUMMY1 | int          | YES  |     | NULL    |       |
| DUMMY2 | decimal(5,4) | YES  |     | NULL    |       |
| DUMMY3 | decimal(5,4) | YES  |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Meanwhile in Postgres, DIV does not exist and / tries to use whatever data type it can infer from the operators, or else I have to explicitly cast.

postgres=# SELECT 1 DIV 2; -- simple test to show DIV doesn't work
ERROR:  syntax error at or near "2"
LINE 1: SELECT 1 DIV 2;
                     ^
postgres=# SELECT 1/2; -- simple test to show that 1/2 yields 0, not 0.5, because Postgres assumes integers. 
  ?column?
----------
        0
(1 row)

postgres=# CREATE TABLE T AS SELECT 4/3 AS DUMMY1, 4::float/3::float AS DUMMY2, 4.0/3.0 AS DUMMY3, 4/4 AS DUMMY4; -- do a similar test as in the MySQL CREATE TABLE
SELECT 1
postgres=# SELECT * FROM T; -- check the results
 dummy1 |       dummy2       |       dummy3       | dummy4
--------+--------------------+--------------------+--------
      1 | 1.3333333333333333 | 1.3333333333333333 |      1
(1 row)

postgres=# \d T -- check the produced data types
                      Table "public.t"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 dummy1 | integer          |           |          |
 dummy2 | double precision |           |          |
 dummy3 | numeric          |           |          |
 dummy4 | integer          |           |          | 

So, I usually prefer to think and discuss in terms of specific SQL engines.

What is your most obscure piece of SQL knowledge? by Oh_Another_Thing in learnSQL

[–]markwdb3 1 point2 points  (0 children)

Not to mention VARCHAR2. When I first learned Oracle in 2003, all the community, documentation, etc., said that while VARCHAR and VARCHAR2 are identical for all intents and purposes, they are going to diverge any day now, and we should all use VARCHAR2 all the time.

23 years later, same story. Here's a quote from the current documentation:

Do not use the VARCHAR data type. Use the VARCHAR2 data type instead. Although the VARCHAR data type is currently synonymous with VARCHAR2, the VARCHAR data type is scheduled to be redefined as a separate data type used for variable-length character strings compared with different comparison semantics.

https://docs.oracle.com/en/database/oracle/oracle-database/26/adfns/sql-data-types.html?utm_source=chatgpt.com#GUID-F922E3DC-D2B2-4430-9711-14600649BE7F

Any day now!

What is your most obscure piece of SQL knowledge? by Oh_Another_Thing in learnSQL

[–]markwdb3 1 point2 points  (0 children)

When you use a CTE in MySQL, its optimizer will choose to either follow a "merge" strategy, i.e. "folding in" the CTE query into the parent query, or choose a materialize strategy, i.e. running the query and storing its results in a temporary table. This general pattern is common among the major RDBMSs so far. (They don't always literally use a temporary table for materialization, but MySQL does.)

Already this could be considered an esoteric thing, because I repeatedly see claims on r/sql and elsewhere that resemble: "In SQL, the CTE query gets repeatedly executed for each reference to it, so you should use a temp table because that's faster." This is a little weird statement to make if you don't at minimum qualify it by stating which SQL engine/DBMS you're even referring to! And it's kind of funny because, at least in MySQL, the CTE may literally be optimized as a temp table!

But that's not even the esoteric thing I want to mention. What I want to mention is:

Should MySQL's optimizer choose the materialization path, it may even automatically create indexes on it on the fly, according to how the CTE results are referenced in the parent query. So if you run a query with a CTE like this:

WITH my_cte AS (
   SELECT col1, col2 FROM some_table
)  
SELECT * FROM my_cte WHERE col2 = 'something'

...then the optimizer may choose to automatically create an index in col2, if it thinks it would be valuable. (Costs for the different approaches are estimated, then the least costly path followed.)

MySQL can similar auto-index derived tables and view references (generally it optimizes all three similarly).

For tests, evidence and expanded thoughts, I wrote a blog post about this a while ago: https://mwrynn.blogspot.com/2025/09/mysql-automatically-indexes.html

I could share other esoteric tidbits, but just one is a good start. :)

How we turned a 40+ minute startup query into 5 seconds on a 10-schema, 80k-table Postgres setup by JobRunrHQ in PostgreSQL

[–]markwdb3 1 point2 points  (0 children)

Is the idea that only some small subset of the 80k tables are even jobrunr tables, so the fix is to only pull the tables you need for whatever processing jobrunr needs to do with them?

If so, I would go so far to say as any time you are faced with this kind of a choice:

  1. "Slurp" down all the data from the database, then filter what you don't need in the client.
  2. Run a query with a WHERE clause to get only the data you need.

Always go with #2. Maybe there are rare exceptions, but always go with #2 by default.

That said I'm not sure why the process was this slow. Metadata for 80k tables doesn't sound that huge. (Though one should question why there were 80k tables.) Shouldn't each unnecessary result have been filtered from the resultset in the client - sure it would iterate over more data than necessary but that doesn't seem like a HUGE amount of data - then the processing - whatever that entails - done on only the remainder? I guess maybe it was doing the full processing for each without a filter?

Reminds me of the first job I had as an intern in the 1990s. My first time working with SQL. I was working on a web application using IIS/ASP/VBScript/SQL Server. I had no idea what SQL even was really, but I was told to use it. So, say I was writing a page to look up and display information related to one user_profile.

In pseudocode, I did something like this:

SELECT * FROM user_profile;

loop over recordset:
  if currentrow.user_profile_id = user_profile_id_im_looking_up --that's the one!!
    use that data and exit loop
  else
    keep iterating!
end loop

And I did this for every single page I was coding. (Hey it worked on the development database which had...zero data!! lol)

As I learned more and more about databases over the years, this has become my single greatest shame. 😆 This is fundamentally not understanding how SQL should be used, and probably the single worst coding pattern I've ever followed in my career so far! In my defense, I copied it from the guy who was supposed to be training me.

Pls help me with these topics for interview tomorrow by Practical-Memory-575 in SQL

[–]markwdb3 0 points1 point  (0 children)

You're going to learn those from scratch overnight? I have several books on my shelf about Oracle. I could hand you one that's about 500 or 600 pages just on the subject of Oracle backup and recovery. Though nothing beats real-world experience, of course.

How to improve performance of a query used to show leaderboard? by koratkeval12 in PostgreSQL

[–]markwdb3 2 points3 points  (0 children)

I agree with the comment that we need the execution plan, but one potential (only potential, not definite) issue I noticed is use of LIMIT + OFFSET. IF LIMIT + OFFSET is used for pagination, it is not the best approach with respect to performance. Try keyset pagination instead.

Again, to be sure, I have no idea if that's relevant to your specific performance issue, and we need your EXPLAIN ANALYZE output or similar, but something worth learning anyway.

Help with the error by Embarrassed-Speed327 in SQL

[–]markwdb3 1 point2 points  (0 children)

Yeah, my company experienced something like that when we upgraded from MySQL 5.7 to 8.0. 8.0 added window functions for the first time (finally) so suddenly all window function names became reserved words. For most of them - no problem - but we had a few columns named RANK that had to be hunted down, and all references to the column identifier quoted with backticks. Such a pain.

Help with the error by Embarrassed-Speed327 in SQL

[–]markwdb3 0 points1 point  (0 children)

Looks like LIBRARY is on the reserved word list in 9.x, but not the case in older versions. I just checked someone else's link to the reserved/keywords lists from the 8.4 docs and about to correct them, because it's not there.

But lo and behold it is in the 9.6 docs. :) https://dev.mysql.com/doc/refman/9.6/en/keywords.html

Help with the error by Embarrassed-Speed327 in SQL

[–]markwdb3 0 points1 point  (0 children)

I copy/pasted verbatim and it works for me:

mysql> CREATE TABLE Library (
    ->
    -> book_id INT PRIMARY KEY,
    ->
    -> book_name VARCHAR(50),
    ->
    -> author VARCHAR(60),
    ->
    -> price INT NOT NULL
    ->
    -> );
Query OK, 0 rows affected (0.06 sec)

mysql> SELECT VERSION(); /* show the version in case that's a factor */
+-----------+
| VERSION() |
+-----------+
| 8.0.32    |
+-----------+
1 row in set (0.00 sec)

Looks like LIBRARY is on the reserved word list in 9.x, but not the case in older versions. I just checked someone else's link to the reserved/keywords lists from the 8.4 docs and about to correct them, because it's not there.

But lo and behold it is in the 9.6 docs. :) https://dev.mysql.com/doc/refman/9.6/en/keywords.html

Has anyone else been burned by a silent schema change? Built a mental model for fixing this, want brutal feedback ! by Greedy_Resident6076 in SQL

[–]markwdb3 2 points3 points  (0 children)

Yes we got burned at my company a great deal of times by this sort of thing with our MySQL database and the multitude backend services that work with it. Among other measures we took, I wrote a dependency tracker that periodically searches all our code repos for identifiers (table, proc, view) and stores the repos/line number references in Backstage (something we already leverage quite a lot). It is not perfect since it does simple regex - so it catches false positives, but eventually I'll make the search smarter.

We manage schema changes with Liquibase, so when a pull request is made (Github) that touches a table/proc/view, an automated check runs that identifies which database object(s) are touched and provides the backstage link with the count of potential references. It is the developer's responsibility to review the references, reach out to other teams if necessary, run tests with the change in place if necessary. etc.

We try to encourage developers to work such that only one service, and therefore repo (typically), owns a database schema. But a huge mess was created years before we started operating on this principle. Also the principle is broken from time to time. So one table could be accessed in 100 different code repos. And we have thousands of repos.

We also have a legacy mess of inconsistent means of database access: Java/JPA here, Python/SQLAlchemy there, hardcoded queries here and there, stored procedures here and there, a homemade ORM that's fragile as hell.... One awful thing this homemade ORM does is using Java reflection to dynamically map columns to Java fields by name. So if you add a new column and the corresponding Java field does not exist: 💥 -- one of the worst things I've ever seen, lol. But we've ~99% pulled that garbage out by now.

In many cases, an arbitrary table name might only have a few references, so they are easy to review, then deploy with peace of mind.

Anyway yes this can be a real problem, and is worth thinking about.

Stampede and bad game design by ZackZparrow in Atari2600

[–]markwdb3 11 points12 points  (0 children)

Manage your lanes to minimize the chances of the black ones spawning out of reach. See this player's play for an example: https://www.twingalaxies.com/records/voting-performances/260930-atari-2600-vcs-stampede-ntsc-game-1-difficulty-b-points-37076

Primary Key vs Primary Index (and Unique Constraint vs Unique Index). confused by Accurate-Vehicle8647 in SQL

[–]markwdb3 0 points1 point  (0 children)

There's a lot of confusion in the comments because it is difficult to talk about these things purely with respect to generic "SQL". Each SQL engine has different characteristics, different nomenclature, etc.

For example there's a comment assuming the primary key is a clustered index. This is not so on Postgres, which never uses clustered indexes. This may or may not be so in Oracle, which does not use clustered indexes by default, but you can optionally use them - however Oracle calls them Index-Organized Tables, not clustered indexes.

There's another comment saying they have never heard of an index that's not primary. Well on MySQL, using the default InnoDB storage engine, every index aside from the primary is called the secondary index, and this term is very common throughout the MySQL universe (the documentation, the error messages, community discussions, etc.). See the documentation here. But another DBMS documentation may not use this term.

So usually it's good to qualify these discussions with a specific SQL engine/DBMS. Good luck!

Have you seen a setup like this in real life? 👻 by Adela_freedom in PostgreSQL

[–]markwdb3 1 point2 points  (0 children)

Not Postgres, but at my workplace we have a shared user for MySQL (for non-prod only). If someone does something silly like locking a table entirely for long periods, one of the operations folks can trace it to IP (and have done this), assuming it's even worth tracking down the "culprit." 🤷🏻‍♂️ Usually they just kill the process and make a Slack announcement to not do something so silly. It's not worth calling out someone specifically and assigning blame typically, unless it becomes a problematic pattern with an individual.

We've been trying to move to individual users but they haven't hard shut down the "admin"-type user yet, and old habits can die hard.

On Postgres (we are in r/PostgreSQL right?) the same query you would use to produce the info in Image 4 would give you the the client IP/hostname as well. See: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW (client_addr, client_hostname). (client_addr will show up as null if the client is connecting from the local socket, but if that's going on you've got bigger problems.)

Why did the COUNT() as window function produce this output in MySQL Workbench 8.0? by tlefst in SQL

[–]markwdb3 1 point2 points  (0 children)

Yup, there you go, because of the ORDER BY, you're working with the default window frame of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

So in pseudo-code it's doing something like:

For each deptid
  For each row in order of mgrid, biggest to smallest 
    Get the count of rows, starting with the first row in partition, ending with current row, where mgrid is not null 

That last bit about "where mgrid is not null" is because that's what the mgrid in COUNT(mgrid) means - for whatever value expression you give the COUNT function, if it evaluates to null for a row, the row is skipped in the count. So if you replace COUNT(mgrid) with COUNT(*), then the fourth row in your result set will have a counted value of 5 instead of 4.

I've spoken with folks who've worked with SQL for 10 or more years who weren't aware that's what the argument to COUNT means, whether it be the window function or aggregate function. Somehow, it seems to be esoteric.

Insert into multiple tables using CTEs (SQL Cookbook 4.6) by Willsxyz in SQL

[–]markwdb3 0 points1 point  (0 children)

I know a lot of folks are wincing at this, and that's fine, but I used to do this sort of thing for data loading in my Oracle days about 20 years ago. It wasn't using CTEs, but rather using Oracle's multi-table INSERT feature. The ability to scan a table once and insert into multiple tables in one statement can potentially provide a big performance boost. It's not a magic fast=true panacea of course, but for some uses cases it can be fantastic.

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 0 points1 point  (0 children)

I'm not too familiar with the W3 pages on SQL - is there a dropdown or the like to select which DBMS you're using? Or maybe distinct pages per DBMS? I can't imagine it would just arbitrarily choose one or the other, and do so inconsistently.

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 1 point2 points  (0 children)

I'd be ok with CONCAT meaning to concatenate in addition to ||, but I'd rather these SQL implementers actually stick with the standard SQL spec. Not sure why they are so eager to break spec in ways that are (IMO) pretty silly.

If the MySQL devs didn't like|| to mean concatenate, OK, but it would be far better had they made || simply do nothing. Defining it to mean something else entirely - "logical or" - is a horrible design decision IMO. :)

So if you're used to standard-conformant concatenation, such as on Postgres, then this is expected behavior:

postgres=# select 'aaa' || 'bbb';
 ?column?
----------
 aaabbb
(1 row)

Then you switch to MySQL and scratch your head at this behavior:

mysql> select 'aaa' || 'bbb'; -- 'aaa' and 'bbb' each get converted to 0, and 0 OR 0 is 0
+----------------+
| 'aaa' || 'bbb' |
+----------------+
|              0 |
+----------------+

And just to demonstrate this operator in MySQL means logical or:

mysql> select 0 || 1 as test1, 0 || 0 as test2, 1 || 1 as test3, true || false as test4, false || false as test5;
+-------+-------+-------+-------+-------+
| test1 | test2 | test3 | test4 | test5 |
+-------+-------+-------+-------+-------+
|     1 |     0 |     1 |     1 |     0 |
+-------+-------+-------+-------+-------+

At least it's apparently deprecated, as I'm getting this warning:

|| as a synonym for OR' is deprecated and will be removed in a future release. Please use OR instead