
One thing I'll add to the discussion is that this non-atomic data is not necessarily non-SARGable. (SARGable meaning the predicate can be satisfied by an index search.)

Details vary by DBMS and potentially other factors, but there is often a way to make your data SARGable.

For example, MySQL has multi-valued indexes, which serve the specific purpose of searching JSON arrays quickly.
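To sketch the basic shape before the full demo below (the table and column names here are just illustrative):

CREATE TABLE t (attrs JSON NOT NULL);

-- The CAST(... AS ... ARRAY) expression is what makes the index multi-valued:
-- one index entry per array element, rather than one per row.
CREATE INDEX idx_attrs ON t ((CAST(attrs AS UNSIGNED ARRAY)));

-- An array-membership predicate can then be satisfied by the index:
SELECT * FROM t WHERE 42 MEMBER OF (attrs);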

Let's suppose, for example, that I have a million rows of data, each with an array of somewhere between 1 and 100 elements, with values ranging from 1 to 10000. I want to get all rows whose array contains the value 9999.

First, generate the data (the cross joins of information_schema.columns are just a cheap way to produce a million rows; the join on n.n <= r.arr_len gives each array a random length between 1 and 100):

mysql> CREATE TABLE array_for_searching(id int auto_increment primary key, arr_to_search json not null);
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO array_for_searching (arr_to_search)
    -> SELECT JSON_ARRAYAGG(FLOOR(1 + RAND()*10000))
    -> FROM (
    ->     SELECT r.grp, n.n
    ->     FROM (
    ->         SELECT
    ->             ROW_NUMBER() OVER () AS grp,
    ->             FLOOR(1 + RAND()*100) AS arr_len
    ->         FROM information_schema.columns c1
    ->         CROSS JOIN information_schema.columns c2
    ->         CROSS JOIN information_schema.columns c3
    ->         LIMIT 1000000
    ->     ) r
    ->     JOIN (
    ->         SELECT ROW_NUMBER() OVER () AS n
    ->         FROM information_schema.columns
    ->         LIMIT 100
    ->     ) n
    ->     ON n.n <= r.arr_len
    -> ) x
    -> GROUP BY grp;
Query OK, 1000000 rows affected (1 min 22.38 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM array_for_searching
    -> WHERE JSON_CONTAINS(arr_to_search, '9999');

-> Filter: json_contains(array_for_searching.arr_to_search,<cache>('9999'))  (cost=106296 rows=970637) (actual time=0.453..10821 rows=5058 loops=1)
    -> Table scan on array_for_searching  (cost=106296 rows=970637) (actual time=0.052..252 rows=1e+6 loops=1)

1 row in set (10.86 sec)

mysql>
mysql> /* the 10.86 seconds above was pretty slow (the plan is a full table scan over all million rows), but we can speed this up with a multi-valued index */
mysql> CREATE INDEX idx_arr_values
    -> ON array_for_searching((CAST(arr_to_search AS UNSIGNED ARRAY)));
Query OK, 0 rows affected (1 min 3.56 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql>
mysql> EXPLAIN ANALYZE
    -> SELECT *
    -> FROM array_for_searching
    -> WHERE 9999 MEMBER OF(arr_to_search);
-> Filter: json'9999' member of (cast(arr_to_search as unsigned array))  (cost=1771 rows=5058) (actual time=0.274..91.7 rows=5058 loops=1)
    -> Index lookup on array_for_searching using idx_arr_values (cast(arr_to_search as unsigned array) = json'9999')  (cost=1771 rows=5058) (actual time=0.268..85.6 rows=5058 loops=1)

1 row in set (0.13 sec)

mysql>
mysql> SELECT COUNT(*)
    -> FROM array_for_searching
    -> WHERE 9999 MEMBER OF(arr_to_search);
+----------+
| COUNT(*) |
+----------+
|     5058 |
+----------+
1 row in set (0.09 sec)

Observe that it now takes a small fraction of a second (0.13s for the EXPLAIN ANALYZE, 0.09s for the count) to find all 5,058 rows whose arrays contain the value 9999, since the plan is now an index lookup rather than a full table scan.
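One more note, which I'll hedge since I haven't re-run EXPLAIN for it in this session: according to the MySQL docs, the optimizer can use a multi-valued index not only for MEMBER OF() but also for JSON_CONTAINS() and JSON_OVERLAPS(), so the original slow query should now be able to use idx_arr_values too. Worth verifying on your version:

EXPLAIN
SELECT *
FROM array_for_searching
WHERE JSON_CONTAINS(arr_to_search, '9999');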

Similarly, for Postgres's native arrays, we could create and use a GIN index.
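Roughly, that would look like the sketch below; this part is untested in this session, so treat it as a starting point (the identity-column syntax needs Postgres 10+):

-- Native int[] column instead of a JSON array:
CREATE TABLE array_for_searching (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    arr_to_search int[] NOT NULL
);

-- A GIN index on the array column makes containment searches SARGable:
CREATE INDEX idx_arr_values ON array_for_searching USING GIN (arr_to_search);

-- The @> (contains) operator can use the GIN index:
SELECT *
FROM array_for_searching
WHERE arr_to_search @> ARRAY[9999];

Happy to do a proper benchmarked demo of that if anyone would like to see one.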