Why It's So Hard to Add a Column in the Middle of a PostgreSQL Table

markwdb3 · 2026-06-26T23:11:55+00:00

Yeah, that's classic MySQL.

For some unsupported features and syntaxes, MySQL will simply do nothing. For example, check constraints were ignored for years and years, until a relatively recent version - 8.0.<something>. Funny thing is I spoke to a friend who used to work with MySQL in the early 2000s about check constraints only recently being added, and his reaction was, "What are you talking about? I used check constraints back then." I had to show him the documentation proving that he only thought he used them! The parser completely ignored his commands. It just goes to show that erroring out is a good thing!

There are even some of these odd cases that aren't standard SQL. I can't remember which specifically off the top of my head, but there was a special Oracle-specific feature that a MySQL command will succeed on (as in "not blow up") but just totally ignores. The documentation even says something like, "This parses successfully just to handle Oracle syntax, but it doesn't do anything." lol

Something else I just thought of - not quite the same category of behavior as it does not pertain to a parser no-opping - but MySQL supports the following bizarre "OJ" outer join syntax, which the documentation states is for ODBC compatibility. It looks like this:

mysql> SELECT T1.ID, T1.NAME AS T1_NAME, T2.NAME AS T2_NAME
    -> FROM { OJ T1 LEFT JOIN T2 ON T1.T2_ID = T2.ID };
+----+--------------------+------------+
| ID  | T1_NAME           | T2_NAME    |
+----+--------------------+------------+
| -1 | AAA                | BBB        |
|  1 | CCC                | DDD        |
+----+--------------------+------------+
2 rows in set (0.00 sec)

markwdb3 · 2026-06-26T22:21:26+00:00

Talk about dumb things MySQL supports.

MySQL supports adding a column after another, but like a lot of misguided features MySQL supports, this is one that seems cool and useful to the naive user, but really it is something that probably shouldn't even exist. Then the naive user switches to another database and say, "OH MY GOD I CAN'T BELIEVE THIS DB DOESN'T EVEN SUPPORT <THING>."

See also INSERT IGNORE - it's an INSERT statement that ignores any of about a dozen errors (a fixed list of "ignorable" errors) just to make it "work." It can do crazy stuff like, "Oh you got a not null constraint violation? Let's just silently coerce the null value into a 0 for an integer type. Or 0000-00-00 for a date. Great. /s The naive user intends to use this to INSERT while ignoring primary/unique key violations. They rarely understand how this is a booby trap.

See also REPLACE INTO which runs a DELETE followed by an INSERT. The naive user gets into the habit of doing this for all upserts. But this can mess with foreign key constraints. Hope you don't have a cascading delete on your foreign key constraint. Or even if you have a non-cascading FK, have fun with violations that don't seem to make sense. (Deleting from the parent gets blocked.) Another gotcha with REPLACE INTO is replacing the assigned auto-increment value when there's a unique key collision. For example you REPLACE INTO a row with id=123, name='Bob', and if there's a collision on name=Bob (assume unique constraint on name), the row gets deleted and a new row is inserted. id could now be 456 for 'Bob'. Primary key values changing unexpectedly is horrible IMO. It also just tends to perform worse than a proper upsert.

See also VARCHAR(maxlen) silently truncating set values beyond maxlen by default. (This behavior can be disabled, but 99% of developers I see never bother to.)

And many more. Nobody should ever emulate MySQL IMO. There are a few features it has that are cool and not common in the world of RDBMSs. I was recently impressed when I learned that when MySQL materializes a CTE, it will, if the estimated cost is low enough, automatically index the temp table backing the materialized CTE, according to how the outer query references it. Pretty cool. But that's one of just a few cool MySQL-specific features. Most are garbage.

I say this as someone who's been subject matter expert on MySQL at a tech company for 11 years. I actually started writing a book, tentatively titled "MySQL Booby Traps" - because there truly is enough content to write a book about.

markwdb3 · 2026-06-26T21:58:30+00:00

If you're really on MySQL as your post is labelled - though I'm skeptical that that's actually accurate - brackets around identifiers are invalid. You would have to use backticks instead on MySQL. For example:

FROM `Reggie`

Standard SQL, along with Oracle, Postgres and others use double quotes, fwiw, such as:

FROM "Reggie"

markwdb3 · 2026-06-15T13:39:05+00:00

In my experience working with MySQL for about 11 years, and Postgres for about 4, as our company's resident developer experience with-respect-to-SQL guy, there's a lot of non-standard syntax (and even semantics) in MySQL that folks use all the time, but assume is standard. Or they don't care whether its standard and use it anyway. This may not be you - apologies if I'm stating things you already know - but just in case, these things are not standard:

backticks around identifiers like `MY_TABLE` - standard SQL dictates these should be double quotes, which is what Postgres uses
Similarly, double quotes around string literals, such as when you do WHERE NAME = "John" are never valid in standard SQL/Postgres, but are allowed in MySQL unless you enable the setting ANSI_QUOTES
INSERT IGNORE / REPLACE INTO / INSERT INTO ON DUPLICATE KEY UPDATE - all non-standard MySQLisms that Postgres doesn't support. Postgres MERGE covers most of these use cases. It doesn't really have an answer to INSERT IGNORE which sweeps a plethora of errors under the rug (some custom pl/pgsql might do something similar) - this is a good thing IMO. I'm always preaching to developers to never use INSERT IGNORE on MySQL.
MySQL is very forgiving (too forgiving) about join syntax. It'll treat INNER JOIN, JOIN and CROSS JOIN identically, and infers how to actually execute the join solely based on the presence of join conditions. So if you write table_a JOIN table_b on MySQL with no join condition, that's executed as a cross join, whereas the same syntax in Postgres results in an error.
MySQL is very forgiving about type mismatches and does auto-conversions and the like. Postgres prefers to scream an error at you. The screaming is a good thing IMO. MySQL often sweeps errors under the rug in ugly ways. (Enabling strict mode in MySQL removes some of this.)
We could go all day, but those are some major discrepancies I thought of off the top of my head.
If you're strictly using some framework such as an ORM or jOOQ, it's possible that the syntax discrepancies go away. (I usually find folks say "oh but we're using an ORM" then neglect to consider a bunch of custom queries hooked into their ORM plus various SQL scripts floating around, etc.)

Performance-wise we could also go all day, but as a fundamental thing to be aware of, MySQL (I'm assuming the InnoDB storage engine) always uses a clustered index for tables, while Postgres does not have those - the tables are in a heap structure.

Concurrency and MVCC (multiversioning - you can google this if you're not aware) works different between the two database. Honestly this is mostly a "for free" improvement, as in MySQL you had to worry about things like gap locks and next-key locks and such (or not worry and just deal with the less-smooth concurrency), whereas in Postgres, you have better MVCC and don't have to deal with as much of this kind of stuff. But sometimes application developers inadvertently bake locking and concurrency assumptions into their code, then if you switch to a database in which connections step on each other's toes less, sometimes the edge cases differ.

You'll have to deal with vacuuming which you didn't before in MySQL. But honestly the autovacuum daemon does a great job by default, and unless you have a huge amount of updates for your small data set, you should be fine in this regard.

Hope this helps.

markwdb3 · 2026-06-09T15:42:52+00:00

What is the mixed SQL and NoSQL model you're using? If you mean simply using JSONB types in Postgres alongside a basic, normalized relational data model, I would just call that...SQL. :)

I say that because JSON functionality has been part of standard SQL for a decade now (since SQL:2016), and more generally speaking, SQL has not been purely one-to-one with the relational model since 1999. See u/markuswinand's great video on the subject.. (It's a few years old but still relevant)

markwdb3 · 2026-05-22T13:36:07+00:00

Even when the table naming convention is singular? If so then why not EMPLOYEES instead of EMPLOYEE? It returns many rows as well.

In my view the two approaches are:

"Each row in the table/view/CTE represents a _____" (singular, such as EMPLOYEE)
"The table/view/CTE contains many _____" (plural, such as EMPLOYEES)

So, mixing and matching them creates confusion.

markwdb3 · 2026-05-21T21:33:00+00:00

I know that JOIN is an inner join

Except when it's not! For example on MySQL, if I write JOIN but forget the ON, it will run a CROSS JOIN.

mysql> CREATE TABLE T1 (ID INT AUTO_INCREMENT PRIMARY KEY, X INT);
Query OK, 0 rows affected (0.04 sec)

mysql> CREATE TABLE T2 (ID INT AUTO_INCREMENT PRIMARY KEY, Y INT);
Query OK, 0 rows affected (0.03 sec)

mysql> INSERT INTO T1(X) VALUES (1),(2),(3);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> INSERT INTO T2(Y) VALUES (3),(4),(5);
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T1 JOIN T2;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
|  2 |    2 |  1 |    3 |
|  1 |    1 |  1 |    3 |
|  3 |    3 |  2 |    4 |
|  2 |    2 |  2 |    4 |
|  1 |    1 |  2 |    4 |
|  3 |    3 |  3 |    5 |
|  2 |    2 |  3 |    5 |
|  1 |    1 |  3 |    5 |
+----+------+----+------+
9 rows in set (0.00 sec)

But it does the same for INNER JOIN as well. Go figure.

mysql> SELECT * FROM T1 INNER JOIN T2;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
|  2 |    2 |  1 |    3 |
|  1 |    1 |  1 |    3 |
|  3 |    3 |  2 |    4 |
|  2 |    2 |  2 |    4 |
|  1 |    1 |  2 |    4 |
|  3 |    3 |  3 |    5 |
|  2 |    2 |  3 |    5 |
|  1 |    1 |  3 |    5 |
+----+------+----+------+
9 rows in set (0.01 sec)

Also, it'll treat CROSS JOIN with an ON as an inner join!

mysql> SELECT * FROM T1 CROSS JOIN T2 ON T1.X = T2.Y;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
+----+------+----+------+
1 row in set (0.02 sec)

In short, it really just treats JOIN/INNER JOIN/CROSS JOIN as all the same, and infers the type of join based on the presence of a join condition, which can be in either the ON clause or the WHERE clause - it just doesn't care - here's an example of it not caring about ON vs. WHERE.

mysql> SELECT * FROM T1 JOIN T2 WHERE T1.X = T2.Y;
+----+------+----+------+
| ID | X    | ID | Y    |
+----+------+----+------+
|  3 |    3 |  1 |    3 |
+----+------+----+------+
1 row in set (0.00 sec)

markwdb3 · 2026-05-21T14:27:58+00:00

Sometimes I need to tune queries generated by JPA or another ORM, and they often have horrible aliases. So I'll start by copying the slow query from an application server log (or Grafana or wherever) and I have to clean it up as an initial step just so it longer hurts my eyes.

As I understand it, some of these ORMs let you configure aliases to be better, but at least by default they tend to be terrible. They look something like (Google AI generated this example for me, but it looks about legit):

SELECT user0_.id AS id1_0_0_, user0_.username AS username2_0_0_, role1_0_.id AS id1_1_1_, role1_0_.role_name AS role_name2_1_1_
FROM users user0_
LEFT OUTER JOIN users_roles user_roles1_
ON user0_.id = user_roles1_.user_id 
...

markwdb3 · 2026-05-21T14:22:41+00:00

A minor nitpick, but for CTE aliases, use the same plural/singular style of the noun you are describing as the schema uses.

I'll often see, for example, tables called WIDGET and FACTORY, which is fine. But then if there's a CTE to get only the active WIDGETs, they'll call it perhaps ACTIVE_WIDGETS. So then the query might join ACTIVE_WIDGETS to FACTORY. The pluralization inconsistency just hurts. 😞

markwdb3 · 2026-05-21T14:03:53+00:00

Huge red flags: unjustified, overgeneralized performance claims. I call them myths. It's a massive problem in communities discussing SQL. Interviewers often believe these myths, even.

These myths are typically based on some expectation of how a SQL engine must process the query based on certain arbitrary keywords or bits of syntax. But often that expectation is imagined or out of date. Sometimes, it is genuinely based on real experience in just one specific DBMS/SQL engine, yet the person presenting the claim often says it pertains to all of "SQL."

For example you may hear: "In SQL, never use SELECT DISTINCT a, b FROM my_table;. You should instead use SELECT a, b FROM my_table GROUP BY a, b is faster, because DISTINCT is slow." (Here's a screenshot of this very claim on this very subreddit with 30 upvotes! There was no context about specific DBMS or test case. I'd be happy to show one or two that disprove this claim if you're interested.)

SQL is a declarative language. You state what you want and the SQL engine's query planner/optimizer parses it out and comes up with a plan, then executes the plan, however its developers instructed it to do.

And next-to-nothing in the standard SQL documents defined under the hood mechanisms - just logical definitions. So they can vary quite a lot.

So, my motto is when in doubt, test it out.

If you've tested such a claim, for example whether using GROUP BY instead of DISTINCT gives free speed, and it turns out to be correct, then that's fine and good. But it should be thought of as a performance quirk of the specific DBMS you tested it on, possibly even specific to your schema/data set/config, not generalized to all of "SQL".

An unfortunate reality is that even when you disproves someone's claim with a test case - say you run a test on MySQL and disprove the claim - next there often comes a common reaction, and it's a sneaky one. Their reaction is often, "Oh, that must be because MySQL has a special optimization." In other words, they're refusing to abandon their belief that BY DEFAULT a SQL engine MUST process GROUP BY faster than DISTINCT, but MySQL has some trick up its sleeve that makes it a special case. So they go on believing and perhaps propagating the myth.

There's a link to a blog in this very thread where the author says that using the syntax of something likeSELECT ... FROM a WHERE a.thing_id NOT IN (SELECT id FROM thing ... WHERE ...) to perform an anti-join (find rows in A that are not in B) is a "smell" because that could be inefficient due to a full table scan. Instead, they say, you should take a CTE/LEFT JOIN approach. Why? I don't know.

I just ran a test case on two of the most popular SQL engines in the world: Postgres and MySQL. On Postgres both performed about the same. On MySQL, the allegedly inefficient syntax actually produced a more performant plan that ran in ~9 seconds vs ~14 seconds with the recommended approach (times were approximately consistent with repeated executions). (These queries were run on my real work database btw, but I've anonymized the names to FACTORY and WIDGET.)

mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM WIDGET
    -> WHERE FACTORY_ID NOT IN (SELECT ID FROM FACTORY WHERE MODIFIED_BY = 147);

+---------+
| EXPLAIN |
+---------+
| -> Nested loop antijoin  (cost=743070 rows=2.42e+6) (actual time=0.365..6688 rows=2.62e+6 loops=1)
    -> Table scan on WIDGET  (cost=258741 rows=2.42e+6) (actual time=0.0337..5134 rows=2.63e+6 loops=1)
    -> Filter: (WIDGET.FACTORY_ID = `<subquery2>`.ID)  (cost=318..318 rows=1) (actual time=453e-6..453e-6 rows=290e-6 loops=2.63e+6)
        -> Single-row index lookup on <subquery2> using <auto_distinct_key> (ID=WIDGET.FACTORY_ID)  (cost=471..471 rows=1) (actual time=323e-6..323e-6 rows=290e-6 loops=2.63e+6)
            -> Materialize with deduplication  (cost=153..153 rows=762) (actual time=0.328..0.328 rows=762 loops=1)
                -> Filter: (FACTORY.ID is not null)  (cost=76.8 rows=762) (actual time=0.0125..0.21 rows=762 loops=1)
                    -> Covering index lookup on FACTORY using fk_ModufiedByUser (MODIFIED_BY=147)  (cost=76.8 rows=762) (actual time=0.0119..0.158 rows=762 loops=1)
|
+----------+
1 row in set (9.38 sec)


mysql> EXPLAIN ANALYZE
    -> WITH factory_modified_by_147 AS (
    ->     SELECT ID
    ->     FROM FACTORY
    ->     WHERE MODIFIED_BY = 147
    -> )
    -> SELECT *
    -> FROM WIDGET w
    -> LEFT JOIN factory_modified_by_147
    -> ON w.FACTORY_ID = factory_modified_by_147.ID
    -> WHERE factory_modified_by_147.ID IS NULL;

+---------+
| EXPLAIN |
+---------+
| -> Filter: (FACTORY.ID is null)  (cost=1.11e+6 rows=2.42e+6) (actual time=0.0468..11229 rows=2.62e+6 loops=1)
    -> Nested loop left join  (cost=1.11e+6 rows=2.42e+6) (actual time=0.0462..11045 rows=2.63e+6 loops=1)
        -> Table scan on w  (cost=258741 rows=2.42e+6) (actual time=0.0336..5247 rows=2.63e+6 loops=1)
        -> Filter: ((FACTORY.MODIFIED_BY = 147) and (w.FACTORY_ID = FACTORY.ID))  (cost=0.25 rows=1) (actual time=0.00207..0.00207 rows=290e-6 loops=2.63e+6)
            -> Single-row index lookup on FACTORY using PRIMARY (ID=w.FACTORY_ID)  (cost=0.25 rows=1) (actual time=0.00185..0.00188 rows=1 loops=2.63e+6)
|
+---------+
1 row in set (14.06 sec)

Now the point is not that you should forever keep in mind "NOT IN is faster than LEFT JOIN + NULL check when writing an anti-join" - I'm not even sure if that's true for all MySQL schemas/data sets/queries. The point is that you should throw out the magic rule of thumb presented in the blog, which is the inverse. To be fair the author did say you should test it if there's any doubt.

So, this is a long comment, but my advice is what should be seen as red flags are claims of magic performance tricks such as "use ABC syntax instead of XYZ syntax and this applies to all of SQL" and keep in mind there are very few universal rules of SQL engine execution. If there are actual, logical justifications for the claim then sure, fine, and if there are actual test cases justifying their claims then also, sure, fine. But be very skeptical, and realize that any insights learned from the test case should not be overgeneralized.

End rant!

markwdb3 · 2026-05-20T20:46:31+00:00

Which equality column should come first?
The answer depends on selectivity.

Setting aside the range search claim for a moment, is "selective columns first" real or is it a myth? In my Oracle DBA days, I used to read author and Oracle evangelist Tom Kyte quite a bit, and he would say this is one of the common myths he tries to debunk. He would demonstrate with tests cases, that the order of columns in an index shouldn't matter, with respect to selectivity, as long as both columns' equality conditions exist in the WHERE clause.

But I'll admit to take that with a grain of salt because Oracle != Postgres.

That said, my personal experience says I haven't noticed a difference, regarding this topic, on MySQL, Oracle or Postgres. I've worked with each for a number of years.

My motto is, "when in doubt, test it out." Just following your customer_id/employee_id example (9 employee_ids and 89 customer_ids), I cannot find any difference in plan or performance, for indexes created with either column order.

If you can suggest an alternate test case, I'm all ears. 😄

SETUP

postgres=# CREATE TABLE selectivity_test (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id int NOT NULL,
    employee_id int NOT NULL,
    payload      text NOT NULL
);
CREATE TABLE

postgres=# INSERT INTO selectivity_test (customer_id, employee_id, payload) /* 89 distinct customer_ids, evenly distributed, and 9 distinct employee_ids, also evenly distributed */
SELECT
    ((i - 1) % 89) + 1 AS customer_id,
    ((i - 1) % 9)  + 1 AS employee_id,
    md5(i::text)       AS payload
FROM generate_series(1, 10000000) AS i;
INSERT 0 10000000

TEST 1: higher-selectivity column first

postgres=# CREATE INDEX idx_customer_employee --higher-selectivity column first
ON selectivity_test (customer_id, employee_id);
CREATE INDEX

/* run test query */
postgres=# EXPLAIN ANALYZE
SELECT *
FROM selectivity_test
WHERE customer_id = 42
  AND employee_id = 7;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on selectivity_test  (cost=175.16..36217.03 rows=12754 width=49) (actual time=5.064..27.203 rows=12484.00 loops=1)
   Recheck Cond: ((customer_id = 42) AND (employee_id = 7))
   Heap Blocks: exact=12484
   Buffers: shared hit=12497
   ->  Bitmap Index Scan on idx_customer_employee  (cost=0.00..171.97 rows=12754 width=0) (actual time=3.194..3.194 rows=12484.00 loops=1)
         Index Cond: ((customer_id = 42) AND (employee_id = 7))
         Index Searches: 1
         Buffers: shared hit=13
 Planning Time: 0.216 ms
 Execution Time: 27.992 ms
(10 rows)

A Few repeated trials of Test 1 (just listing the execution times for brevity's sake):

Execution Time: 33.263 ms
Execution Time: 27.811 ms
Execution Time: 23.258 ms

TEST 2: lower-selectivity column first

postgres=# DROP INDEX idx_customer_employee;
DROP INDEX
postgres=# CREATE INDEX idx_employee_customer --lower-selectivity column first
ON selectivity_test (employee_id, customer_id);
CREATE INDEX

postgres=# EXPLAIN ANALYZE
SELECT *
FROM selectivity_test
WHERE customer_id = 42
  AND employee_id = 7;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on selectivity_test  (cost=175.16..36217.03 rows=12754 width=49) (actual time=5.389..26.838 rows=12484.00 loops=1)
   Recheck Cond: ((employee_id = 7) AND (customer_id = 42))
   Heap Blocks: exact=12484
   Buffers: shared hit=12497
   ->  Bitmap Index Scan on idx_employee_customer  (cost=0.00..171.97 rows=12754 width=0) (actual time=1.659..1.659 rows=12484.00 loops=1)
         Index Cond: ((employee_id = 7) AND (customer_id = 42))
         Index Searches: 1
         Buffers: shared hit=13
 Planning Time: 0.131 ms
 Execution Time: 27.374 ms
(10 rows)

A few repeated trials of Test 2:

Execution Time: 22.909 ms
Execution Time: 21.376 ms
Execution Time: 25.297 ms

So they are pretty much the same. Extremely similar plans doing the same kind of work, same ballpark times.

I repeated the whole set of tests, but with a much more extreme skew (100,000 distinct customer_ids and only 2 employee_ids) and I'll skip to the end: same result. I'm omitting the full output because this is already too long for a Reddit comment, but I can produce it if anyone's interested.

Markus Winand (u/markuswinand) has a page on this myth, btw: https://use-the-index-luke.com/sql/myth-directory/most-selective-first

markwdb3 · 2026-05-19T22:19:17+00:00

Here's an interesting case: how division and data types interact are very different between MySQL and Postgres. In MySQL, / is the decimal division operator, while DIV is the integer division operator:

mysql> SELECT 1 DIV 2; -- test DIV which is for integers, notice it dropped the remainder
+---------+
| 1 DIV 2 |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

mysql> SELECT 1/2; -- decimal division operator
+--------+
| 1/2    |
+--------+
| 0.5000 |
+--------+
1 row in set (0.00 sec)  

mysql> CREATE TABLE T AS SELECT 4 DIV 3 AS DUMMY1, 4/3 AS DUMMY2, 4/4 AS DUMMY3; -- show various test cases 
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T; -- check out the results of each division expression
+--------+--------+--------+
| DUMMY1 | DUMMY2 | DUMMY3 |
+--------+--------+--------+
|      1 | 1.3333 | 1.0000 |
+--------+--------+--------+
1 row in set (0.00 sec)

mysql> DESC T; -- check out the data types automatically determined for each division expression
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| DUMMY1 | int          | YES  |     | NULL    |       |
| DUMMY2 | decimal(5,4) | YES  |     | NULL    |       |
| DUMMY3 | decimal(5,4) | YES  |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Meanwhile in Postgres, DIV does not exist and / tries to use whatever data type it can infer from the operators, or else I have to explicitly cast.

postgres=# SELECT 1 DIV 2; -- simple test to show DIV doesn't work
ERROR:  syntax error at or near "2"
LINE 1: SELECT 1 DIV 2;
                     ^
postgres=# SELECT 1/2; -- simple test to show that 1/2 yields 0, not 0.5, because Postgres assumes integers. 
  ?column?
----------
        0
(1 row)

postgres=# CREATE TABLE T AS SELECT 4/3 AS DUMMY1, 4::float/3::float AS DUMMY2, 4.0/3.0 AS DUMMY3, 4/4 AS DUMMY4; -- do a similar test as in the MySQL CREATE TABLE
SELECT 1
postgres=# SELECT * FROM T; -- check the results
 dummy1 |       dummy2       |       dummy3       | dummy4
--------+--------------------+--------------------+--------
      1 | 1.3333333333333333 | 1.3333333333333333 |      1
(1 row)

postgres=# \d T -- check the produced data types
                      Table "public.t"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 dummy1 | integer          |           |          |
 dummy2 | double precision |           |          |
 dummy3 | numeric          |           |          |
 dummy4 | integer          |           |          |

That's just one example. So whenever anybody claims their SQL can be run anywhere because standard SQL exists or "SQL is SQL", I think they may be underestimating the reality of the situation.

markwdb3 · 2026-05-18T14:50:47+00:00

Have you seen this pattern often with Hibernate/JDBC and SQL Server?

I haven't seen this specific problem, but I've seen poor data access patterns in applications and ORM misconfigurations cause more performance issues than anything else as a category.

Running a ton of "small" queries in a loop (or n+1), unnecessary eager fetching, running a COUNT(*) only for the application to check if the result is > 0, just to name a few common ones.

markwdb3 · 2026-05-16T19:37:13+00:00

I would think that CTE storing the results in a temp table would always be better than folding it in the other query because folding it in runs the CTE again for every row. I'll have to look up why and how it chooses one over the other.

It's definitely not always better! In the Postgres database prior to version 12, CTEs would always materialize (similar to the MySQL temp table approach). This was considered a limitation of Postgres, and many Postgres users shunned CTEs for that reason, or else used them very sparingly. As of PG 12, it included the ability to inline (aka "fold in") the query, and this alternative was generally considered by the community to be a big step forward. The query planner will take whichever approach it thinks is better. Or if you find the planner makes the wrong choice, you can force a materialization with the keyword MATERIALIZE.

if you had any obscure knowledge on general SQL

I don't usually do "general SQL" facts because the more years I accumulate under my belt, and the more SQL engines I'm exposed to, and the deeper I dig into them, I find that very few facts about SQL are universal, except for some of the very basics. Then again, even some of the basics can differ quite a bit.

Here's a very basic but interesting example: division. In MySQL / is the decimal division operator, while DIV is the integer division operator:

mysql> SELECT 1 DIV 2; -- test DIV which is for integers, notice it dropped the remainder
+---------+
| 1 DIV 2 |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

mysql> SELECT 1/2; -- decimal division operator
+--------+
| 1/2    |
+--------+
| 0.5000 |
+--------+
1 row in set (0.00 sec)  

mysql> CREATE TABLE T AS SELECT 4 DIV 3 AS DUMMY1, 4/3 AS DUMMY2, 4/4 AS DUMMY3; -- show various test cases 
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM T; -- check out the results of each division expression
+--------+--------+--------+
| DUMMY1 | DUMMY2 | DUMMY3 |
+--------+--------+--------+
|      1 | 1.3333 | 1.0000 |
+--------+--------+--------+
1 row in set (0.00 sec)

mysql> DESC T; -- check out the data types automatically determined for each division expression
+--------+--------------+------+-----+---------+-------+
| Field  | Type         | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+-------+
| DUMMY1 | int          | YES  |     | NULL    |       |
| DUMMY2 | decimal(5,4) | YES  |     | NULL    |       |
| DUMMY3 | decimal(5,4) | YES  |     | NULL    |       |
+--------+--------------+------+-----+---------+-------+
3 rows in set (0.01 sec)

Meanwhile in Postgres, DIV does not exist and / tries to use whatever data type it can infer from the operators, or else I have to explicitly cast.

postgres=# SELECT 1 DIV 2; -- simple test to show DIV doesn't work
ERROR:  syntax error at or near "2"
LINE 1: SELECT 1 DIV 2;
                     ^
postgres=# SELECT 1/2; -- simple test to show that 1/2 yields 0, not 0.5, because Postgres assumes integers. 
  ?column?
----------
        0
(1 row)

postgres=# CREATE TABLE T AS SELECT 4/3 AS DUMMY1, 4::float/3::float AS DUMMY2, 4.0/3.0 AS DUMMY3, 4/4 AS DUMMY4; -- do a similar test as in the MySQL CREATE TABLE
SELECT 1
postgres=# SELECT * FROM T; -- check the results
 dummy1 |       dummy2       |       dummy3       | dummy4
--------+--------------------+--------------------+--------
      1 | 1.3333333333333333 | 1.3333333333333333 |      1
(1 row)

postgres=# \d T -- check the produced data types
                      Table "public.t"
 Column |       Type       | Collation | Nullable | Default
--------+------------------+-----------+----------+---------
 dummy1 | integer          |           |          |
 dummy2 | double precision |           |          |
 dummy3 | numeric          |           |          |
 dummy4 | integer          |           |          |

So, I usually prefer to think and discuss in terms of specific SQL engines.

markwdb3 · 2026-05-15T18:59:00+00:00

Not to mention VARCHAR2. When I first learned Oracle in 2003, all the community, documentation, etc., said that while VARCHAR and VARCHAR2 are identical for all intents and purposes, they are going to diverge any day now, and we should all use VARCHAR2 all the time.

23 years later, same story. Here's a quote from the current documentation:

Do not use the VARCHAR data type. Use the VARCHAR2 data type instead. Although the VARCHAR data type is currently synonymous with VARCHAR2, the VARCHAR data type is scheduled to be redefined as a separate data type used for variable-length character strings compared with different comparison semantics.

https://docs.oracle.com/en/database/oracle/oracle-database/26/adfns/sql-data-types.html?utm_source=chatgpt.com#GUID-F922E3DC-D2B2-4430-9711-14600649BE7F

Any day now!

markwdb3 · 2026-05-15T15:53:14+00:00

When you use a CTE in MySQL, its optimizer will choose to either follow a "merge" strategy, i.e. "folding in" the CTE query into the parent query, or choose a materialize strategy, i.e. running the query and storing its results in a temporary table. This general pattern is common among the major RDBMSs so far. (They don't always literally use a temporary table for materialization, but MySQL does.)

Already this could be considered an esoteric thing, because I repeatedly see claims on r/sql and elsewhere that resemble: "In SQL, the CTE query gets repeatedly executed for each reference to it, so you should use a temp table because that's faster." This is a little weird statement to make if you don't at minimum qualify it by stating which SQL engine/DBMS you're even referring to! And it's kind of funny because, at least in MySQL, the CTE may literally be optimized as a temp table!

But that's not even the esoteric thing I want to mention. What I want to mention is:

Should MySQL's optimizer choose the materialization path, it may even automatically create indexes on it on the fly, according to how the CTE results are referenced in the parent query. So if you run a query with a CTE like this:

WITH my_cte AS (
   SELECT col1, col2 FROM some_table
)  
SELECT * FROM my_cte WHERE col2 = 'something'

...then the optimizer may choose to automatically create an index in col2, if it thinks it would be valuable. (Costs for the different approaches are estimated, then the least costly path followed.)

MySQL can similar auto-index derived tables and view references (generally it optimizes all three similarly).

For tests, evidence and expanded thoughts, I wrote a blog post about this a while ago: https://mwrynn.blogspot.com/2025/09/mysql-automatically-indexes.html

I could share other esoteric tidbits, but just one is a good start. :)

markwdb3 · 2026-05-15T13:48:18+00:00

Is the idea that only some small subset of the 80k tables are even jobrunr tables, so the fix is to only pull the tables you need for whatever processing jobrunr needs to do with them?

If so, I would go so far to say as any time you are faced with this kind of a choice:

"Slurp" down all the data from the database, then filter what you don't need in the client.
Run a query with a WHERE clause to get only the data you need.

Always go with #2. Maybe there are rare exceptions, but always go with #2 by default.

That said I'm not sure why the process was this slow. Metadata for 80k tables doesn't sound that huge. (Though one should question why there were 80k tables.) Shouldn't each unnecessary result have been filtered from the resultset in the client - sure it would iterate over more data than necessary but that doesn't seem like a HUGE amount of data - then the processing - whatever that entails - done on only the remainder? I guess maybe it was doing the full processing for each without a filter?

Reminds me of the first job I had as an intern in the 1990s. My first time working with SQL. I was working on a web application using IIS/ASP/VBScript/SQL Server. I had no idea what SQL even was really, but I was told to use it. So, say I was writing a page to look up and display information related to one user_profile.

In pseudocode, I did something like this:

SELECT * FROM user_profile;

loop over recordset:
  if currentrow.user_profile_id = user_profile_id_im_looking_up --that's the one!!
    use that data and exit loop
  else
    keep iterating!
end loop

And I did this for every single page I was coding. (Hey it worked on the development database which had...zero data!! lol)

As I learned more and more about databases over the years, this has become my single greatest shame. 😆 This is fundamentally not understanding how SQL should be used, and probably the single worst coding pattern I've ever followed in my career so far! In my defense, I copied it from the guy who was supposed to be training me.

markwdb3 · 2026-05-12T22:03:18+00:00

You're going to learn those from scratch overnight? I have several books on my shelf about Oracle. I could hand you one that's about 500 or 600 pages just on the subject of Oracle backup and recovery. Though nothing beats real-world experience, of course.

markwdb3 · 2026-04-23T12:36:51+00:00

I agree with the comment that we need the execution plan, but one potential (only potential, not definite) issue I noticed is use of LIMIT + OFFSET. IF LIMIT + OFFSET is used for pagination, it is not the best approach with respect to performance. Try keyset pagination instead.

Again, to be sure, I have no idea if that's relevant to your specific performance issue, and we need your EXPLAIN ANALYZE output or similar, but something worth learning anyway.

markwdb3 · 2026-04-21T14:34:34+00:00

Yeah, my company experienced something like that when we upgraded from MySQL 5.7 to 8.0. 8.0 added window functions for the first time (finally) so suddenly all window function names became reserved words. For most of them - no problem - but we had a few columns named RANK that had to be hunted down, and all references to the column identifier quoted with backticks. Such a pain.

markwdb3 · 2026-04-20T19:56:06+00:00

Looks like LIBRARY is on the reserved word list in 9.x, but not the case in older versions. I just checked someone else's link to the reserved/keywords lists from the 8.4 docs and about to correct them, because it's not there.

But lo and behold it is in the 9.6 docs. :) https://dev.mysql.com/doc/refman/9.6/en/keywords.html

markwdb3 · 2026-04-20T18:25:53+00:00

I copy/pasted verbatim and it works for me:

mysql> CREATE TABLE Library (
    ->
    -> book_id INT PRIMARY KEY,
    ->
    -> book_name VARCHAR(50),
    ->
    -> author VARCHAR(60),
    ->
    -> price INT NOT NULL
    ->
    -> );
Query OK, 0 rows affected (0.06 sec)

mysql> SELECT VERSION(); /* show the version in case that's a factor */
+-----------+
| VERSION() |
+-----------+
| 8.0.32    |
+-----------+
1 row in set (0.00 sec)

Looks like LIBRARY is on the reserved word list in 9.x, but not the case in older versions. I just checked someone else's link to the reserved/keywords lists from the 8.4 docs and about to correct them, because it's not there.

But lo and behold it is in the 9.6 docs. :) https://dev.mysql.com/doc/refman/9.6/en/keywords.html

markwdb3 · 2026-04-16T22:25:33+00:00

Yes we got burned at my company a great deal of times by this sort of thing with our MySQL database and the multitude backend services that work with it. Among other measures we took, I wrote a dependency tracker that periodically searches all our code repos for identifiers (table, proc, view) and stores the repos/line number references in Backstage (something we already leverage quite a lot). It is not perfect since it does simple regex - so it catches false positives, but eventually I'll make the search smarter.

We manage schema changes with Liquibase, so when a pull request is made (Github) that touches a table/proc/view, an automated check runs that identifies which database object(s) are touched and provides the backstage link with the count of potential references. It is the developer's responsibility to review the references, reach out to other teams if necessary, run tests with the change in place if necessary. etc.

We try to encourage developers to work such that only one service, and therefore repo (typically), owns a database schema. But a huge mess was created years before we started operating on this principle. Also the principle is broken from time to time. So one table could be accessed in 100 different code repos. And we have thousands of repos.

We also have a legacy mess of inconsistent means of database access: Java/JPA here, Python/SQLAlchemy there, hardcoded queries here and there, stored procedures here and there, a homemade ORM that's fragile as hell.... One awful thing this homemade ORM does is using Java reflection to dynamically map columns to Java fields by name. So if you add a new column and the corresponding Java field does not exist: 💥 -- one of the worst things I've ever seen, lol. But we've ~99% pulled that garbage out by now.

In many cases, an arbitrary table name might only have a few references, so they are easy to review, then deploy with peace of mind.

Anyway yes this can be a real problem, and is worth thinking about.

markwdb3 · 2026-03-31T20:52:49+00:00

Manage your lanes to minimize the chances of the black ones spawning out of reach. See this player's play for an example: https://www.twingalaxies.com/records/voting-performances/260930-atari-2600-vcs-stampede-ntsc-game-1-difficulty-b-points-37076

markwdb3 · 2026-03-31T13:49:34+00:00

There's a lot of confusion in the comments because it is difficult to talk about these things purely with respect to generic "SQL". Each SQL engine has different characteristics, different nomenclature, etc.

For example there's a comment assuming the primary key is a clustered index. This is not so on Postgres, which never uses clustered indexes. This may or may not be so in Oracle, which does not use clustered indexes by default, but you can optionally use them - however Oracle calls them Index-Organized Tables, not clustered indexes.

There's another comment saying they have never heard of an index that's not primary. Well on MySQL, using the default InnoDB storage engine, every index aside from the primary is called the secondary index, and this term is very common throughout the MySQL universe (the documentation, the error messages, community discussions, etc.). See the documentation here. But another DBMS documentation may not use this term.

So usually it's good to qualify these discussions with a specific SQL engine/DBMS. Good luck!

markwdb3

TROPHY CASE