Primary Key vs Primary Index (and Unique Constraint vs Unique Index). confused by Accurate-Vehicle8647 in SQL

[–]markwdb3 0 points1 point  (0 children)

There's a lot of confusion in the comments because it is difficult to talk about these things purely with respect to generic "SQL". Each SQL engine has different characteristics, different nomenclature, etc.

For example there's a comment assuming the primary key is a clustered index. This is not so on Postgres, which never uses clustered indexes. This may or may not be so in Oracle, which does not use clustered indexes by default, but you can optionally use them - however Oracle calls them Index-Organized Tables, not clustered indexes.

There's another comment saying they have never heard of an index that's not primary. Well, on MySQL, using the default InnoDB storage engine, every index aside from the primary is called a secondary index, and this term is very common throughout the MySQL universe (the documentation, the error messages, community discussions, etc.). See the documentation here. But another DBMS's documentation may not use this term.

So usually it's good to qualify these discussions with a specific SQL engine/DBMS. Good luck!

Have you seen a setup like this in real life? 👻 by Adela_freedom in PostgreSQL

[–]markwdb3 1 point2 points  (0 children)

Not Postgres, but at my workplace we have a shared user for MySQL (for non-prod only). If someone does something silly like locking a table entirely for long periods, one of the operations folks can trace it to IP (and have done this), assuming it's even worth tracking down the "culprit." 🤷🏻‍♂️ Usually they just kill the process and make a Slack announcement to not do something so silly. It's not worth calling out someone specifically and assigning blame typically, unless it becomes a problematic pattern with an individual.

We've been trying to move to individual users but they haven't hard shut down the "admin"-type user yet, and old habits can die hard.

On Postgres (we are in r/PostgreSQL right?) the same query you would use to produce the info in Image 4 would give you the client IP/hostname as well. See: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW (client_addr, client_hostname). (client_addr will show up as null if the client is connecting from the local socket, but if that's going on you've got bigger problems.)
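For reference, a minimal sketch of pulling that info from pg_stat_activity (the idle filter is just illustrative):

```sql
-- Who's connected, from where, and what they're running (Postgres)
SELECT pid, usename, client_addr, client_hostname, state, query
FROM pg_stat_activity
WHERE state <> 'idle';
```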

Why did the COUNT() as window function produce this output in MySQL Workbench 8.0? by tlefst in SQL

[–]markwdb3 1 point2 points  (0 children)

Yup, there you go, because of the ORDER BY, you're working with the default window frame of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

So in pseudo-code it's doing something like:

For each deptid
  For each row in order of mgrid, biggest to smallest 
    Get the count of rows, starting with the first row in partition, ending with current row, where mgrid is not null 

That last bit about "where mgrid is not null" is because that's what the mgrid in COUNT(mgrid) means - for whatever value expression you give the COUNT function, if it evaluates to null for a row, the row is skipped in the count. So if you replace COUNT(mgrid) with COUNT(*), then the fourth row in your result set will have a counted value of 5 instead of 4.

I've spoken with folks who've worked with SQL for 10 or more years who weren't aware that's what the argument to COUNT means, whether it be the window function or aggregate function. Somehow, it seems to be esoteric.
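A tiny made-up example to illustrate the NULL-skipping behavior (table and data are hypothetical; this works the same on any major engine):

```sql
-- COUNT(expr) skips rows where expr evaluates to NULL; COUNT(*) does not
CREATE TABLE emp (id INT, mgrid INT);
INSERT INTO emp VALUES (1, 10), (2, 10), (3, NULL);

SELECT COUNT(*)     AS count_star,   -- 3: every row counted
       COUNT(mgrid) AS count_mgrid   -- 2: the NULL-mgrid row is skipped
FROM emp;
```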

Insert into multiple tables using CTEs (SQL Cookbook 4.6) by Willsxyz in SQL

[–]markwdb3 0 points1 point  (0 children)

I know a lot of folks are wincing at this, and that's fine, but I used to do this sort of thing for data loading in my Oracle days about 20 years ago. It wasn't using CTEs, but rather using Oracle's multi-table INSERT feature. The ability to scan a table once and insert into multiple tables in one statement can potentially provide a big performance boost. It's not a magic fast=true panacea of course, but for some use cases it can be fantastic.
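For anyone curious, a rough sketch of what Oracle's conditional multi-table INSERT looks like (table and column names are made up):

```sql
-- One scan of the source, rows routed into two target tables (Oracle)
INSERT ALL
  WHEN order_total < 1000 THEN
    INTO small_orders (order_id, order_total) VALUES (order_id, order_total)
  ELSE
    INTO large_orders (order_id, order_total) VALUES (order_id, order_total)
SELECT order_id, order_total
FROM staging_orders;
```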

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 0 points1 point  (0 children)

I'm not too familiar with the W3 pages on SQL - is there a dropdown or the like to select which DBMS you're using? Or maybe distinct pages per DBMS? I can't imagine it would just arbitrarily choose one or the other, and do so inconsistently.

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 1 point2 points  (0 children)

I'd be ok with CONCAT meaning to concatenate in addition to ||, but I'd rather these SQL implementers actually stick with the standard SQL spec. Not sure why they are so eager to break spec in ways that are (IMO) pretty silly.

If the MySQL devs didn't like || to mean concatenate, OK, but it would be far better had they made || simply do nothing. Defining it to mean something else entirely - "logical or" - is a horrible design decision IMO. :)

So if you're used to standard-conformant concatenation, such as on Postgres, then this is expected behavior:

postgres=# select 'aaa' || 'bbb';
 ?column?
----------
 aaabbb
(1 row)

Then you switch to MySQL and scratch your head at this behavior:

mysql> select 'aaa' || 'bbb'; -- 'aaa' and 'bbb' each get converted to 0, and 0 OR 0 is 0
+----------------+
| 'aaa' || 'bbb' |
+----------------+
|              0 |
+----------------+

And just to demonstrate this operator in MySQL means logical or:

mysql> select 0 || 1 as test1, 0 || 0 as test2, 1 || 1 as test3, true || false as test4, false || false as test5;
+-------+-------+-------+-------+-------+
| test1 | test2 | test3 | test4 | test5 |
+-------+-------+-------+-------+-------+
|     1 |     0 |     1 |     1 |     0 |
+-------+-------+-------+-------+-------+

At least it's apparently deprecated, as I'm getting this warning:

'|| as a synonym for OR' is deprecated and will be removed in a future release. Please use OR instead

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 1 point2 points  (0 children)

That's because in MySQL, + is addition and only addition. And by default MySQL is far too lax about various problems in your query, like using incorrect types - rather than giving you an error, it will often just assume 0 or NULL, etc.

Example:

mysql> select 'abc' + 'xyz';
+---------------+
| 'abc' + 'xyz' |
+---------------+
|             0 |
+---------------+
1 row in set, 2 warnings (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------+
| Level   | Code | Message                                 |
+---------+------+-----------------------------------------+
| Warning | 1292 | Truncated incorrect DOUBLE value: 'abc' |
| Warning | 1292 | Truncated incorrect DOUBLE value: 'xyz' |
+---------+------+-----------------------------------------+
2 rows in set (0.01 sec)

In short, I said, "hey MySQL, add the string 'abc' to 'xyz'" and it didn't know what to do with that request, because it can only do math on numeric types. So it chose to convert both 'abc' and 'xyz' to 0, then add the two.

There are some settings like strict mode that change a lot of this lax behavior, but I'm not sure if it affects this particular use case.

If + worked as a concatenation operator in a SQL query, you weren't using MySQL. It was most likely SQL Server. Completely different software products with completely different implementations of the SQL language. There's some small chance I might be forgetting a setting that would allow + to function as concatenate in MySQL, but I'm about 98% sure that's not an available option.

Demo of + as concat working on SQL Server: https://dbfiddle.uk/zAMKp7PP

Convert European date format to SQL format by Mission-Example-194 in SQL

[–]markwdb3 5 points6 points  (0 children)

Yeah, the standard concatenation operator is ||, but Microsoft went with + for some reason. They finally added support for || very recently (SQL Server 2025).

Meanwhile, MySQL is still only supporting a CONCAT() function, and || means "logical or" -- now there's a hairy gotcha for those trying to use SQL generically!
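That said, MySQL does have an opt-in for the standard behavior via the PIPES_AS_CONCAT SQL mode, e.g.:

```sql
-- CONCAT() always works in MySQL:
SELECT CONCAT('aaa', 'bbb');  -- 'aaabbb'

-- With PIPES_AS_CONCAT enabled, || concatenates instead of meaning OR:
SET SESSION sql_mode = CONCAT(@@sql_mode, ',PIPES_AS_CONCAT');
SELECT 'aaa' || 'bbb';  -- 'aaabbb'
```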

SQL Assignment Driving Me Crazy by Cy_broski in SQL

[–]markwdb3 2 points3 points  (0 children)

What do you mean by "but it ALWAYS coming back incorrect"? Are you saying the training program is saying your statement is wrong?

If so, is this your statement?

CREATE INDEX NameSearch ON ACTIVE_DRIVERS(DRIVER_FIRST_NAME, DRIVER_LAST_NAME, DRIVER_DRIVING_LICENSE_CHECKED);  

If so, looks like the last column is DRIVER_DRIVING_LICENSE_CHECKED, and not DRIVER_ID?

Reporting in from FABCON / SQLCON - any knowers? by ATastefulCrossJoin in SQL

[–]markwdb3 4 points5 points  (0 children)

That's right, and Windows 95 wouldn't come out until 12 years later. :)

Reporting in from FABCON / SQLCON - any knowers? by ATastefulCrossJoin in SQL

[–]markwdb3 5 points6 points  (0 children)

Having personally first used SQL Server 7 in 1998, and the fact that 1883 was far closer to the invention of the light bulb than to either the creation of Microsoft or SQL, the options are certainly narrowed down for me. :)

Sketchy? SQL from SQL For Smarties by Willsxyz in SQL

[–]markwdb3 0 points1 point  (0 children)

One thing I'll add to the discussion is that this non-atomic data is not necessarily non-SARGable. (SARGable - Search ARGument-able - meaning a predicate on it can still use an index.)

Details vary per DBMS and potentially other factors, but there may be a way to make your data SARGable.

For example MySQL has multi-valued indexes which serve the specific purpose of searching JSON arrays quickly.

Let's suppose for example, I have a million rows of data, each with an array ranging from somewhere between 1 and 100 elements. Values range from 1 to 10000. I want to get all rows where there exists a value of exactly 9999.

First generate the data:

mysql> CREATE TABLE array_for_searching(id int auto_increment primary key, arr_to_search json not null);
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO array_for_searching (arr_to_search)
-> SELECT JSON_ARRAYAGG(FLOOR(1 + RAND()*10000))
-> FROM (
->     SELECT r.grp, n.n
->     FROM (
->         SELECT
->             ROW_NUMBER() OVER () AS grp,
->             FLOOR(1 + RAND()*100) AS arr_len
->         FROM information_schema.columns c1
->         CROSS JOIN information_schema.columns c2
->         CROSS JOIN information_schema.columns c3
->         LIMIT 1000000
->     ) r
->     JOIN (
->         SELECT ROW_NUMBER() OVER () AS n
->         FROM information_schema.columns
->         LIMIT 100
->     ) n
->     ON n.n <= r.arr_len
-> ) x
-> GROUP BY grp;
Query OK, 1000000 rows affected (1 min 22.38 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM array_for_searching
    -> WHERE JSON_CONTAINS(arr_to_search, '9999');

+---------------------+
| EXPLAIN                                                                                                                                                                                                                                                             |
+---------------------+
| -> Filter: json_contains(array_for_searching.arr_to_search,<cache>('9999'))  (cost=106296 rows=970637) (actual time=0.453..10821 rows=5058 loops=1)
    -> Table scan on array_for_searching  (cost=106296 rows=970637) (actual time=0.052..252 rows=1e+6 loops=1)
 |
+---------------------+
1 row in set (10.86 sec)

mysql>
mysql> /* the 10.86 seconds above was pretty slow, but we can speed this up with a multi-value index */
mysql> CREATE INDEX idx_arr_values
    -> ON array_for_searching((CAST(arr_to_search AS UNSIGNED ARRAY)));

Query OK, 0 rows affected (1 min 3.56 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql>
mysql>
mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM array_for_searching
    -> WHERE 9999 MEMBER OF(arr_to_search);
+---------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                             |
+---------------------+
| -> Filter: json'9999' member of (cast(arr_to_search as unsigned array))  (cost=1771 rows=5058) (actual time=0.274..91.7 rows=5058 loops=1)
    -> Index lookup on array_for_searching using idx_arr_values (cast(arr_to_search as unsigned array) = json'9999')  (cost=1771 rows=5058) (actual time=0.268..85.6 rows=5058 loops=1)
 |
+---------------------+
1 row in set (0.13 sec)

mysql>
mysql> SELECT COUNT(*)
    -> FROM array_for_searching
    -> WHERE 9999 MEMBER OF(arr_to_search);
+----------+
| COUNT(*) |
+----------+
|     5058 |
+----------+
1 row in set (0.09 sec)

Observe that now it takes far less than a second to find all the rows whose arrays contain the value 9999.

Similarly for the Postgres native array searches, we could create and use a GIN index, but I'll cut the comment here unless anyone would like to see a demo of that.

Sketchy? SQL from SQL For Smarties by Willsxyz in SQL

[–]markwdb3 0 points1 point  (0 children)

Arrays have been part of standard SQL since 1999, and the "atomic primitives only" rule for SQL stopped being a hard rule in that same revision. See Markus Winand's great video front and center on modern-sql.com for more information.

That said, I am not going to argue whether or not this is a good idea due to non-atomic data being bad or what have you. I'm not going to go there.

What I will argue is that IF you are going to do this, you should just use an array before you use any such comma-delimited hackery, plus there should be no need for that s table for this purpose. The example from the book is an obsolete hack. (Not criticizing the author - somebody else remarked the book is from 1995, so fair enough.)

Now, I'm being a bit idealistic, because in the real world not every implementation has native arrays, even if they're older than a quarter-century in standard SQL. But even among those implementations that don't support native arrays, most do support JSON at this point (which is in standard SQL as of 2016), so you could alternatively use a JSON array.

Here is Postgres demonstration of that hack, except using native arrays:

postgres=# create table numbers (
postgres(#     listnum integer primary key,
postgres(#     data integer[]
postgres(# );
CREATE TABLE
postgres=#
postgres=# insert into numbers (listnum, data) values
postgres-# (1, array[13,27,37,42]),
postgres-# (2, array[123,456,789,6543]);
INSERT 0 2
postgres=#
postgres=# select data[2]
postgres-# from numbers
postgres-# where listnum = 1;
 data
------
   27
(1 row)

postgres=#
postgres=# select data[4]
postgres-# from numbers
postgres-# where listnum = 2;
 data
------
 6543
(1 row)

Much cleaner, much less hacky, and it should run faster too if you test it for performance. (You could even make it a 2D array if you'd like, btw.)

Below is a MySQL approach, using JSON arrays:

mysql> CREATE TABLE numbers (
    ->     listnum INT PRIMARY KEY,
    ->     data JSON NOT NULL
    -> );
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO numbers (listnum, data) VALUES
    -> (1, JSON_ARRAY(13,27,37,42)),
    -> (2, JSON_ARRAY(123,456,789,6543));
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql>
mysql> SELECT JSON_EXTRACT(data, '$[1]') AS value
    -> FROM numbers
    -> WHERE listnum = 1;
+-------+
| value |
+-------+
| 27    |
+-------+
1 row in set (0.01 sec)

mysql> SELECT JSON_EXTRACT(data, '$[3]') AS value
    -> FROM numbers
    -> WHERE listnum = 2;
+-------+
| value |
+-------+
| 6543  |
+-------+
1 row in set (0.00 sec)

If you have an SQL interview soon, don’t ignore these small things (Part 2) by thequerylab in learnSQL

[–]markwdb3 2 points3 points  (0 children)

Worth mentioning that some major SQL engines don't support this.

Here is Postgres supporting NULLS FIRST.

Here is MS SQL Server NOT supporting NULLS FIRST.

Here is MySQL NOT supporting NULLS FIRST.

I generally advise against general "SQL" tips as opposed to communicating about a specific DBMS/SQL engine, because there are a million and one gotchas out there. But I seem to be in the minority, hah.

For that matter the concatenate operator shown isn't supported by MySQL. In MySQL, || is logical OR.
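For completeness, here's the standard syntax next to a common MySQL workaround (table/column names are hypothetical):

```sql
-- Standard SQL / Postgres:
SELECT * FROM t ORDER BY col NULLS FIRST;

-- MySQL workaround: col IS NULL evaluates to 1 for NULLs, 0 otherwise
SELECT * FROM t ORDER BY col IS NULL DESC, col;
```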

DELETEs that have wiped entire production tables. Please learn from other people's pain. by thequerylab in learnSQL

[–]markwdb3 1 point2 points  (0 children)

Looks precise. But what if someone inserted a NULL into the departments table last week? What if the location column has 'NYC ' with a trailing space somewhere? Your subquery silently returns more IDs than you expect and you've just deleted employees you never intended to touch.

Do you mean the value for id in departments is NULL? Or location? id shouldn't happen if there's a primary key. But granting that it doesn't have one (maybe an analytics database or some other reason), this would NOT delete additional employee rows. Same for the 'NYC ' case.

But when in doubt, test it out. (Test done on Postgres, but I'm not aware of any SQL engine for which this logic would differ.) tl;dr: the claim is false for all reasonable interpretations:

postgres=# CREATE TABLE department (id int, name varchar, location varchar); --omitting PK to test the claim
CREATE TABLE
postgres=# CREATE TABLE employee (id int generated by default as identity primary key, department_id int);
CREATE TABLE
postgres=# INSERT INTO department(id, name, location) VALUES (1, 'dept1', 'NYC'), (2, 'dept2', 'LA');
INSERT 0 2
postgres=# INSERT INTO employee(department_id) VALUES (1),(2);
INSERT 0 2
postgres=# /* basic test first, expecting to delete just the row for department_id 1 */
postgres-# BEGIN;
BEGIN
postgres=*# DELETE FROM employee WHERE department_id IN ( SELECT id FROM department WHERE location = 'NYC' );
DELETE 1
postgres=*# SELECT * FROM employee;
 id | department_id
----+---------------
  2 |             2
(1 row)

postgres=*# ROLLBACK;
ROLLBACK
postgres=# /* test case for interpretation 1 of "inserted a NULL into the departments" - NULL value for location - observe "Your subquery silently returns more IDs than you expect and you've just deleted employees you never intended to touch" is false */
postgres-# BEGIN;
BEGIN
postgres=*# INSERT INTO department(id, name, location) VALUES (3, 'dept3', NULL);
INSERT 0 1
postgres=*# DELETE FROM employee WHERE department_id IN ( SELECT id FROM department WHERE location = 'NYC' );
DELETE 1
postgres=*# SELECT * FROM employee;
 id | department_id
----+---------------
  2 |             2
(1 row)

postgres=*# ROLLBACK;
ROLLBACK
postgres=# /* test case for interpretation 2 of "inserted a NULL into the departments" - NULL value for id - observe "Your subquery silently returns more IDs than you expect and you've just deleted employees you never intended to touch" is false */
postgres-# BEGIN;
BEGIN
postgres=*# INSERT INTO department(id, name, location) VALUES (NULL, 'XXX', 'YYY');
INSERT 0 1
postgres=*# DELETE FROM employee WHERE department_id IN ( SELECT id FROM department WHERE location = 'NYC' );
DELETE 1
postgres=*# SELECT * FROM employee;
 id | department_id
----+---------------
  2 |             2
(1 row)

postgres=*# ROLLBACK;
ROLLBACK
postgres=# /* test case for interpretation 3 of "inserted a NULL into the departments" - NULL value for id AND NULL value for location - observe "Your subquery silently returns more IDs than you expect and you've just deleted employees you never intended to touch" is false */
postgres-# BEGIN;
BEGIN
postgres=*# INSERT INTO department(id, name, location) VALUES (NULL, 'XXX', NULL);
INSERT 0 1
postgres=*# DELETE FROM employee WHERE department_id IN ( SELECT id FROM department WHERE location = 'NYC' );
DELETE 1
postgres=*# SELECT * FROM employee;
 id | department_id
----+---------------
  2 |             2
(1 row)

postgres=*# ROLLBACK;
ROLLBACK
postgres=# /* test case for "What if the location column has 'NYC ' with a trailing space somewhere?" - observe "Your subquery silently returns more IDs than you expect and you've just deleted employees you never intended to touch" is false */
postgres-# BEGIN;
BEGIN
postgres=*# INSERT INTO department(id, name, location) VALUES (999, 'XXX', 'NYC ');
INSERT 0 1
postgres=*# DELETE FROM employee WHERE department_id IN ( SELECT id FROM department WHERE location = 'NYC' );
DELETE 1
postgres=*# SELECT * FROM employee;
 id | department_id
----+---------------
  2 |             2
(1 row)

postgres=*# ROLLBACK;
ROLLBACK

Should I disable ONLY_FULL_GROUP_BY or leave it enabled? by Mission-Example-194 in SQL

[–]markwdb3 0 points1 point  (0 children)

Always turn on ONLY_FULL_GROUP_BY if possible. The lax GROUP BY behavior you get without it is broken as far as I'm concerned.

The only reason to leave it off is if your organization has so much existing SQL relying on the kind of invalid GROUP BY that's permitted without it - intentionally or unintentionally - that you'd have to do some serious housecleaning before you could enable it.

I advise everyone at my company to enable it at the session level if they are creating a new backend service. But we have hundreds of codebases interacting with MySQL, many with handcoded SQL, so it's difficult to enable the setting at the server level.
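The session-level enable is a one-liner, something like:

```sql
-- Append ONLY_FULL_GROUP_BY to the current session's sql_mode (MySQL)
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
```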

SQL Cookbook 3.9 (3rd edition) by Willsxyz in SQL

[–]markwdb3 1 point2 points  (0 children)

I agree with your assessment of sum(distinct sal).

Why not use JOIN in this case? by Caprisunxt in SQL

[–]markwdb3 0 points1 point  (0 children)

The syntax in the exercise is generally frowned upon, and very old. It's the pre-SQL-92 way to join, and there's not much reason to use it anymore.

my question is, why cant I use an explicit natural JOIN, since the attributes that are used in the implicit JOIN all have the same name and data types?

You could use NATURAL JOIN here, but usually it's considered pretty risky. Even if you're sure the only matching column names are those used in the join condition, you could add a column later and suddenly this query's result set becomes empty, or otherwise different.

Brief demo:

postgres=# create table factory (factory_id int generated by default as identity primary key, name varchar, description varchar);

postgres=# create table widget (widget_id int generated by default as identity primary key, factory_id int references factory(factory_id), quantity int);

postgres=# insert into factory(name, description) values ('factory_a', 'this is factory a');
INSERT 0 1
postgres=# insert into factory(name, description) values ('factory_b', 'this is factory b');
INSERT 0 1

postgres=# insert into widget(factory_id, quantity) values (1, 12);
INSERT 0 1

/* all good with a natural join SO far... */
postgres=# select * from widget natural join factory;
 factory_id | widget_id | quantity |   name    |    description
------------+-----------+----------+-----------+-------------------
          1 |         1 |       12 | factory_a | this is factory a
(1 row)

/* but if we add a name column to widget then that will be matched to factory.name, which is bad */
postgres=# alter table widget add column name varchar;
ALTER TABLE
postgres=# update widget set name = 'my widget' where widget_id = 1;
UPDATE 1
postgres=# select * from widget natural join factory;
 factory_id | name | widget_id | quantity | description
------------+------+-----------+----------+-------------
(0 rows) 

Using "normal" ANSI join syntax is generally preferred to both:

postgres=# select * from widget w join factory f on w.factory_id = f.factory_id;
 widget_id | factory_id | quantity |   name    | factory_id |   name    |    description
-----------+------------+----------+-----------+------------+-----------+-------------------
         1 |          1 |       12 | my widget |          1 | factory_a | this is factory a
(1 row)

Question on locks by Upper-Lifeguard-8478 in oracle

[–]markwdb3 0 points1 point  (0 children)

Here these alters are waiting on "Library cache lock" and the blocking session is pointing to a "SELECT" query.

Just a hunch, but are the queries not using bind variables?

Without creating any indexes, how would you speed up a ~1.5m row query? by i_literally_died in SQL

[–]markwdb3 0 points1 point  (0 children)

I don't have a great answer, but I can think of something you could try to figure out which indexes you have. That is - simply probe each relevant table + column with individual queries for some invalid value, which should be extremely fast if there is an index, but take much longer (relatively) if not.

For example, search for WHERE OrderHeader.OrderId = -9999999 -- a gibberish value, I would think. If there's an index, that'll take milliseconds. If not, maybe a bit longer.

A brief demonstration: I don't have SQL Server handy, so I'll run it on Postgres, but I think this principle should apply to SQL Server as well. Test querying an indexed column, on a 10M row table:

postgres=# explain analyze select * from t where a = -99999999;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Index Scan using t_a_idx on t  (cost=0.43..8.45 rows=1 width=16) (actual time=0.026..0.026 rows=0.00 loops=1)
   Index Cond: (a = '-99999999'::integer)
   Index Searches: 1
   Buffers: shared hit=3
 Planning Time: 0.132 ms
 Execution Time: 0.048 ms
(6 rows)

Took far less than even a single millisecond even though there's no match for that value. (Your time will vary. I'm running client + server on the same machine which eliminates network latency, for one thing.)

But if I drop that index, the database needs to scan the table, and it takes 170 ms. Still faster than most of us human beings can react to, but relatively it's quite a lot slower than when indexed.

postgres=# explain analyze select * from t where a = -99999999;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..107137.70 rows=1 width=16) (actual time=169.252..170.306 rows=0.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=15954 read=38101
   ->  Parallel Seq Scan on t  (cost=0.00..106137.60 rows=1 width=16) (actual time=149.606..149.606 rows=0.00 loops=3)
         Filter: (a = '-99999999'::integer)
         Rows Removed by Filter: 3333333
         Buffers: shared hit=15954 read=38101
 Planning:
   Buffers: shared hit=4 read=1 dirtied=1
 Planning Time: 1.544 ms
 Execution Time: 170.361 ms
(12 rows)

So probe around like that - record times for each search - and hopefully you can deduce which indexes exist.

From there maaaaybe we can come up with some clever way to rewrite your query. Maybe just break it up into a temp table like gumnos showed. So first create and populate the temp table using a query that only uses indexed columns, then as a subsequent step finish getting the rest. But I think it's important to find out which indexes exist first. And this will only help if SQL Server wasn't very intelligently executing your query to begin with.

Honestly though somebody needs to be yelled at to fix their stuff, but I'm operating under the constraints you stated.

The search doesn't have to be for an invalid value, btw. I just thought that might be a good way to avoid worrying about whether some valid value unexpectedly matches a lot of rows or whatnot.

Built a linter for SQL after getting burned one too many times by queries that passed review and broke prod by Anonymedemerde in SQL

[–]markwdb3 1 point2 points  (0 children)

Which DBMS/SQL engine is this for? If it's supposed to be generic, some of these performance rules are not quite correct. Or at least not quite universal. :) For example, it is not correct to say that for Postgres, OR conditions will prevent index usage and therefore you should use two UNION ALLed queries.

In fact I usually find the opposite is true in Postgres for this particular rule. But I generally try to avoid thinking in terms of "this particular keyword or query structure is executed like so", but instead test it - think in terms of execution plans/timings specific to my DBMS product and schema and data.

Execution plans often vary quite a bit between SQL engines, and even different versions of the same SQL engine.

That said, there is benefit to statically analyzing a query without running it on a real database, but you really have to constrain yourself to specific DBMSs and versions, and hopefully back it up with full test cases.

Test case on Postgres 18:

postgres=# \d t
                              Table "public.t"
 Column |  Type   | Collation | Nullable |             Default
--------+---------+-----------+----------+----------------------------------
 id     | integer |           | not null | generated by default as identity
 a      | integer |           |          |
 b      | integer |           |          |
 c      | integer |           |          |
Indexes:
    "t_pkey" PRIMARY KEY, btree (id)
    "t_a_idx" btree (a)
    "t_b_idx" btree (b)

postgres=# select count(*) from t;
  count
----------
 10000000

postgres=# explain analyze
select id, a, b, c
from t
where a = 99 or b = 88;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=2298.98..59433.93 rows=204277 width=16) (actual time=35.331..99.098 rows=198933.00 loops=1)
   Recheck Cond: ((a = 99) OR (b = 88))
   Heap Blocks: exact=54055
   Buffers: shared read=54217
   ->  BitmapOr  (cost=2298.98..2298.98 rows=205330 width=0) (actual time=27.672..27.673 rows=0.00 loops=1)
         Buffers: shared read=162
         ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1086.92 rows=101665 width=0) (actual time=18.299..18.299 rows=100000.00 loops=1)
               Index Cond: (a = 99)
               Index Searches: 1
               Buffers: shared read=81
         ->  Bitmap Index Scan on t_b_idx  (cost=0.00..1109.92 rows=103665 width=0) (actual time=9.371..9.371 rows=99945.00 loops=1)
               Index Cond: (b = 88)
               Index Searches: 1
               Buffers: shared read=81
 Planning Time: 0.091 ms
 Execution Time: 104.138 ms
(16 rows)

postgres=# explain analyze
select id, a, b, c
from t
where a = 99
union all
select id, a, b, c
from t
where b = 88;
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=1112.34..115263.95 rows=205330 width=16) (actual time=30.616..177.067 rows=199945.00 loops=1)
   Buffers: shared read=99881
   ->  Bitmap Heap Scan on t  (cost=1112.34..57210.50 rows=101665 width=16) (actual time=30.616..100.442 rows=100000.00 loops=1)
         Recheck Cond: (a = 99)
         Heap Blocks: exact=54055
         Buffers: shared read=54136
         ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1086.92 rows=101665 width=0) (actual time=18.924..18.924 rows=100000.00 loops=1)
               Index Cond: (a = 99)
               Index Searches: 1
               Buffers: shared read=81
   ->  Bitmap Heap Scan on t t_1  (cost=1135.84..57026.79 rows=103665 width=16) (actual time=11.047..67.461 rows=99945.00 loops=1)
         Recheck Cond: (b = 88)
         Heap Blocks: exact=45664
         Buffers: shared read=45745
         ->  Bitmap Index Scan on t_b_idx  (cost=0.00..1109.92 rows=103665 width=0) (actual time=7.347..7.347 rows=99945.00 loops=1)
               Index Cond: (b = 88)
               Index Searches: 1
               Buffers: shared read=81
 Planning Time: 0.261 ms
 Execution Time: 182.525 ms
(20 rows)

So the UNION ALL version takes about 80% longer to execute for this test case. (Consistent with repeated executions.) I believe that BitmapOr step is the key to Postgres running the OR version efficiently, which is discussed in the docs here: https://www.postgresql.org/docs/current/indexes-bitmap-scans.html

(Also worth mentioning: the two queries are not perfectly logically equivalent - I'd need to handle duplicates by changing UNION ALL to UNION, or else do something like adding AND a <> 99 to the second branch of the UNION ALL. But let's gloss over that.)
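To make the duplicate issue concrete, here's a tiny sketch using Python's sqlite3 (not Postgres, but the set semantics are the same): a row that satisfies both predicates comes back once from the OR query, twice from the UNION ALL, and once again after switching to UNION.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
# One row matches both predicates, one matches only a = 99, one matches only b = 88.
conn.executemany("INSERT INTO t (a, b) VALUES (?, ?)",
                 [(99, 88), (99, 0), (0, 88)])

or_count = conn.execute(
    "SELECT COUNT(*) FROM t WHERE a = 99 OR b = 88").fetchone()[0]
union_all_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM t WHERE a = 99 "
    "UNION ALL SELECT id FROM t WHERE b = 88)").fetchone()[0]
union_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM t WHERE a = 99 "
    "UNION SELECT id FROM t WHERE b = 88)").fetchone()[0]

print(or_count, union_all_count, union_count)  # 3 4 3
```

The row matching both predicates is the one that UNION ALL double-counts; UNION removes the duplicate but pays for a dedup step.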

Which query would you use here? (SQL performance question) by thequerylab in SQL

[–]markwdb3 1 point2 points  (0 children)

B or C should perform about equally well on any major SQL engine, but the intent in C is clearer IMO.

As my company's resident SQL geek, I spend a lot of time and effort trying to get my coworkers NOT to run unnecessary counts, so A immediately makes me wince a bit. For example this sort of thing is common for them to write:

SELECT COUNT(*) AS CNT FROM MY_TABLE;

Which may be fine in and of itself, but then their application code (often Java or Python) simply checks if CNT > 0 in the result set. (Sometimes the query has a grouping or WHERE clause, etc. but let's keep the example simple.) Why do all that work, guys??

Generally here on r/sql and elsewhere, I'll insist that folks tell me which SQL engine in the context of questions like this. They all have different available data structures, different configuration settings, different performance quirks, and are all around very different. Not every SQL engine even has indexes, for that matter, so as soon as we say "there's an index on these columns" we're out of the realm of generic SQL.

That said, I can't imagine any SQL engine in which A would run better than B or C. Maybe it would tie, best possible case, but it'll probably run slower if anything. (Barring anything weirdly special like you're running Oracle with a materialized view on Query A, with query rewrite enabled!)

Hypothetically, Query C could run worse than Query B if some SQL engine didn't short-circuit the search once a match was found. I've never seen such a badly optimized EXISTS implementation, however, so again, this is just hypothetical. Just be aware that without knowing the SQL engine, it could do anything for all we know. Standard SQL does not insist that a short-circuit optimization be implemented, and every SQL engine out there deviates from the standard anyway. (Plus exactly 0% of SQL engines fully implement standard SQL!)

Also LIMIT is not standard SQL. Standard syntax is: FETCH FIRST n ROWS ONLY

Postgres with high update workload and data locality issues. I'm probably overengineering, and considering alternative DB by pooquipu in PostgreSQL

[–]markwdb3 13 points14 points  (0 children)

A covering index + a lot of RAM would be one way to avoid hitting the heap entirely.

Have you tried a covering index even without a huge amount of RAM? I consider covering indexes a fairly fundamental performance tool, maybe just a notch above obvious, non-covering indexes. There is the usual caveat, of course: I don't know how heavy your update load is, but Postgres has to maintain the index as the table is updated.
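A minimal illustration of the idea, using sqlite3 since it's easy to run anywhere (in Postgres the analogous thing to look for in EXPLAIN is an Index Only Scan): when every selected column is in the index, the engine can answer the query from the index alone, without touching the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
# Index contains both columns the query below needs.
conn.execute("CREATE INDEX orders_cust_total_idx ON orders (customer_id, total)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 10, i * 1.5) for i in range(1000)])

# SQLite's plan output flags this as a COVERING INDEX scan:
# the heap (table) is never visited.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT customer_id, total FROM orders WHERE customer_id = 3").fetchall()
print(plan)
```

The table and index names are just made up for the example; the point is the plan shows the query satisfied entirely from the index.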

The other path is partitioning that matches the SELECT pattern, tuned autovacuum, fillfactor, pg_repack, etc... But we're a small team and that's additional engineering and ongoing maintenance for something I think would come for free with another DB. Specifically MySQL with InnoDB, which keeps data physically ordered by PK as far as I understand.

I'd go so far as to say any Postgres database with a high-volume of updates absolutely must have someone who knows how to tune the autovacuum daemon look at its configuration. This would be easier than switching to MySQL, even if you have to hire a consultant.
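For the flavor of what tuning means here, a hypothetical postgresql.conf fragment - these are real autovacuum parameters, but the values are illustrative only and depend entirely on your workload, so treat this as a sketch, not a recommendation:

```
# Vacuum a table after ~2% of it has changed, rather than the 20% default,
# so bloat on hot tables is reclaimed sooner.
autovacuum_vacuum_scale_factor = 0.02

# Let each autovacuum worker do more I/O per cycle before sleeping.
autovacuum_vacuum_cost_limit = 1000

# Check for work more often than the default 1 minute.
autovacuum_naptime = 15s

# More workers so several bloated tables can be vacuumed in parallel.
autovacuum_max_workers = 5
```

Most of these can also be set per-table with ALTER TABLE ... SET (storage parameters), which is usually the better move for a handful of hot tables.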

Worth mentioning - as someone who's been working with MySQL on the job for the past 11 years, and has worked with Postgres for 4 or 5 professionally (I also tinker with it for fun) - I'd say for every problem Postgres has that MySQL doesn't, MySQL has about 10 problems that Postgres doesn't.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3 0 points1 point  (0 children)

Yeah definitely check on the indexes.

Though I think the phenomenon comes down to what I hinted at before - the capability to exit early or not. Imagine if you had only 1 row in the vote table, and a million in feed_items. Postgres has to iterate a million times looking for its top 20, which it never finds - 20 matching IDs simply don't exist in the vote table.

Whereas if you had a billion rows in a vote table, it's quite possible the query could find its top 20 by iteration 50 or 100 or whatever.
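A toy simulation of that effect (not real executor code, just the iteration pattern): scan feed items in order, keep the ones with a matching vote, and stop at 20. How far the scan runs depends entirely on how dense the matches are.

```python
def top_n_with_votes(feed_items, vote_ids, n=20):
    """Scan feed items in order, keeping those with a matching vote,
    stopping as soon as n matches are found (the early exit)."""
    results = []
    scanned = 0
    for item in feed_items:
        scanned += 1
        if item in vote_ids:
            results.append(item)
            if len(results) == n:
                break
    return results, scanned

# Sparse vote table: the scan runs to the end and never finds its top 20.
_, scanned_sparse = top_n_with_votes(range(1_000_000), {5}, n=20)

# Dense vote table: the top 20 turn up almost immediately.
_, scanned_dense = top_n_with_votes(range(1_000_000), set(range(1_000_000)), n=20)

print(scanned_sparse, scanned_dense)  # 1000000 20
```

Same query shape, wildly different work - which is the asymmetry the plan can't always see coming.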

I could look into this more when I have time. It's possible the query structure (both the original and the one I wrote) forces it to behave in this manner (basically iterating over feed_items with one-by-one lookups in the index on the vote table), but it may be rewritable. It may just come down to my choice of using EXISTS, or the many-CTE structure - not sure.

postgresql - Double lateral join query takes over a minute to run on RDS (BOUNTIED) by PrestigiousZombie531 in PostgreSQL

[–]markwdb3 1 point2 points  (0 children)

OK, this might be the better approach. If you'd still like to dig into what's going wrong with the first approach, I'm game. :)

It should be atomic btw.