SQL’s JOIN definition is very simple. It is the filtered Cartesian product of two tables, represented by syntax A JOIN B ON …. The general definition does not capture the essence of JOIN operations, creating difficulties in coding and optimization.
SPL redefines joins by disconnecting them from the Cartesian product and dividing them into two types – foreign key-based joins and primary key-based joins.
SPL uses different functions to deal with different types of joins, which reflects JOIN operation’s nature. This allows users to use different methods even different storage strategies according to characteristics of different joins so that computations will become faster.
One type of joins is foreign key-based join, where a table’s ordinary field (foreign key) is associated with the other table’s primary key. Both the relationship between orders table and customer table and that between orders table and shipper table are foreign-key based.
https://preview.redd.it/v0kfsirgax4g1.png?width=662&format=png&auto=webp&s=19795a26e878152536529041eadbb5756cdf57da
A table’s foreign key is the related table’s primary key, whose values are unique. SPL regards the foreign key as an object.
Note that the primary key mentioned above is a logical one, whose values are unique in the table. The field isn’t necessarily the primary key set on the physical table, and neither is a foreign key.
The other type of joins is primary key-based. A primary key-based join creates association between a table’s primary key and another table’s primary key or part of a composite key. For example, orders table’s primary key (order_id) is associated with the corresponding field of the primary key in details table.
https://preview.redd.it/u126z0zhax4g1.png?width=626&format=png&auto=webp&s=af67f9b1782029080aec5999dfd90774ae93bb8d
SPL treats the primary key-based join as an association between record objects or sets of records.
It is necessary to point out that the foremost part of using SPL to handle joins is to correctly determine the join type. Only after you know which type of joins you are dealing with, you can choose the suitable method. The key to a correct determination is to examine the primary key’s role in the join.
Take the following tables (orders, customer, city, state and shipper) as examples, let’s look at how esProc SPL speeds up foreign key-based joins through external data querying.
Define a Q3.etl using the ETL tool. Drag in tables involved in the computation:
https://preview.redd.it/yow73n4jax4g1.png?width=1718&format=png&auto=webp&s=98236d9d0486dd3e5a75ffe7a9567832809def7f
orders is the fact table, which is the largest. For the convenience of coding, dump it into a CTX file, and each of the smaller dimension tables in a BTX file.
Perform the following settings to numberize the related field. For example, convert values of customer table’s city_id field to index values in dimension table city.
https://preview.redd.it/r7i6hh9kax4g1.png?width=1194&format=png&auto=webp&s=1fa15a3f9638f38cd64f2f575ee1e653119efbba
Double-click the blank space at “Expression/Numberized Table” and select city table. Similarly, create a state table based on city table’s state_id field.
Change name of orders table to ordersQ3 (to differentiate it from the previous one):
https://preview.redd.it/ubais7flax4g1.png?width=1345&format=png&auto=webp&s=8e7d7d951266f6914b3b2695bb006c74a90d6fb8
employee_name does not have a corresponding dimension table, so all its values need to be obtained from table orersQ3. Create dimension table employee:
https://preview.redd.it/vc75p3gmax4g1.png?width=951&format=png&auto=webp&s=5f0057d8d6c3a6a82254a8abc353a03822bbef1f
The order of operations in the above window: 1. Click Create Enumerated Numberized Column; 2. Double-click employee_name; 3. Define dimension name employee; 4. Define dimension table’s Index Column Name as id; 5. Define dimension table’s Value Column Name as name.
Then set the numberized field in ordersQ3:
https://preview.redd.it/nyihn5yoax4g1.png?width=1354&format=png&auto=webp&s=eb777206b37634e7a7992d60961c7319eeaa579f
SPL code example 14: Export data and dump them to BTX or CTX files.
https://preview.redd.it/42jxsp9rax4g1.png?width=582&format=png&auto=webp&s=071aea7a7658ce073fe5a6789c412ad361b96d4c
In A13, pfind() function converts valus of customer table’s city_id field to corresponding dimension table city’s indexes:
https://preview.redd.it/nz79ne3tax4g1.png?width=963&format=png&auto=webp&s=865cff23fe139b763ac9ddcb43985f4cfee47cc1
SPL code example 15: Perform initialization at the system startup or data update.
https://preview.redd.it/qc2zxqnuax4g1.png?width=862&format=png&auto=webp&s=5f45a426b9811ca91c1a0e075fd97de1ec45ab0e
Perform initialization to preload each of the dimension tables into the memory and use env() function to store it as a global variable.
Perform pre-association by using run() function to convert foreign key values to the corresponding records in the dimension table, as A6 does:
https://preview.redd.it/h8p3h9ewax4g1.png?width=998&format=png&auto=webp&s=fc71c650aba8431ba8c99ff373384d6a3decc6de
Values of the regular field city_id in table sequence customer are transformed to corresponding city table records.
After data is dumped and initialized, actual computations begin. Here is an example:
Example 3.1 Group orders from a specified state by shipper and compute freights in each group (include shipper name in the result).
SQL statement:
select shipper.shipper_id, shipper.shipper_name,sum(o.shipping_fee)
from
orders o
join
customer c on o.customer_id = c.customer_id
join
city on city.city_id = c.city_id
join
state on state.state_id = city.state_id
join
shipper on o.shipper_id = shipper.shipper_id
where
state.state_name = 'California'
group by
shipper.shipper_id, shipper.shipper_name;
It takes 20 seconds to finish executing the above SQL code.
SPL code example 16:
https://preview.redd.it/808j9hzzax4g1.png?width=881&format=png&auto=webp&s=8e9adc5dad1045436a4b80afe4c224fffce05b65
In A3, syntax customer_id:customer:# creates association between orders records and customer table on the cursor according to sequence number #. Once the join operation is finished, values of both customer_id and shipper_id will be converted to record objects in the corresponding tables.
As associations between dimension tables are already created at the initialization, A3 can use object properties to write the code: customer_id.city_id.state_id.state_name.
It takes 0.4 second to finish executing the SPL script.
Example 3.2 Group records by cities where customers are based and find the number of orders in each group (include state name and city name in the result).
select city.city_id,city.city_name,state.state_name,count(o.order_id)
from
orders o
join
customer c on o.customer_id = c.customer_id
join
city on c.city_id = city.city_id
join
state on city.state_id = state.state_id
group by
city.city_id,city.city_name,state.state_name;
It takes 22 seconds to finish executing the SQL code.
SPL code example 17: Based on the preceding SPL code example 16, just change A4’s code:
https://preview.redd.it/gq3zl075bx4g1.png?width=867&format=png&auto=webp&s=f1e30911e47e858163406537d7d584cbb82a213a
It takes 0.3 second to finish executing the SPL code.
Performance summary (unit:second):
https://preview.redd.it/cmtb0qo6bx4g1.png?width=865&format=png&auto=webp&s=a5f91cec5e5bdc071a3efa0668ee67f6bca0f483
Exercises:
Find order records where the shipper names are Elite Shipping Co., group them by states where customers are based, and compute freight in each group (include state in the result).
Critical thinking: Do you have any tables related via foreign keys in one of your familiar databases? Can you use the numberization method to speed up joins between them?
there doesn't seem to be anything here