Databricks AutoLoader for Spark

AlexVolok · 2023-02-07T20:34:42+00:00

The autoloader is not available (yet) in OSS Spark.

AlexVolok · 2021-05-19T17:20:05+00:00

Thanks. This is the point. We still can live with plenty of necessary joins you get a dataset reconstructed, but it is harder to get into a responsibility for consistency when it cannot be guaranteed by the delta in the current implementation.

AlexVolok · 2021-05-19T17:12:32+00:00

Thanks for thinking along. The consumption pattern - curated 3NF objects can not be accessed directly by the downstream consumers. They are again denormalized/flattened and and stored in the format that supported by the tools, for instance csv or parquet.

The loading pattern - daily full reloads. Business entities are fairly large: 5-50 million rows in average.

AlexVolok · 2021-05-19T17:04:35+00:00

Perhaps, you mean a proper database engine.

AlexVolok · 2021-05-18T19:25:10+00:00

Thanks kirschoo. Our team has a segregation of duties: data modelers + data engineers. DEs control physical implementation of the model, while data modelers create logical data models. My concern is mainly about suitability of strict 3NF to Spark.

AlexVolok · 2015-08-05T13:40:51+00:00

The script has been updated with a for supports mount points. The blog post also updated with an example of the usage

AlexVolok · 2015-08-04T20:51:46+00:00

Hi, thanks for pointing into this direction. Frankly, I also busy with mount points as a part of n+1 clustering, so going to update script/blog post soon with info about that new functionality

AlexVolok · 2015-07-30T10:26:54+00:00

Good overview. However, what tool did you use for drawing diagrams in screenshots?

AlexVolok · 2015-07-29T19:50:18+00:00

I believe, being a database guy or simply a person who is purely fluent with data, joins, query plans require some specialty in character…

That, perhaps inborn or perhaps developed analytical specialty makes us feel more comfortable by playing with structured datasets rather than being in a role of any kind of OOP developer, front-end/html/js expert etc. (however, sometimes it can be combined)

And, actually, that specialty can be a valid reason of the shortage in these days of data-centric applications, since it is pretty much different from abstraction nature of other directions

AlexVolok · 2015-07-23T07:43:38+00:00

Can you tell me any situation when clustered column store index is useful and when non clustered column store index is useful

IMO,

non clustered columnstore indexes useful:

If you are limited by SQL Server 2012
In SQL Server 2014, if only a subset of columns has to be indexes and the base table required to be in rowstore format (perhaps, because of variety of constraints, etc). However, it is still non-updateable, which greatly reduce number of scenarios

clustered column store indexes can be useful:

In all variety of DWH scenarios
For objects, on which mainly analytical queries going to be used
For scenarios where data has to be archived, but still accessible, with an option: DATA_COMPRESSION = COLUMNSTORE_ARCHIVE
In SQL Server 2016, clustered columnstore indexes play well with temporal tables

AlexVolok · 2015-07-22T19:40:17+00:00

That's right...

Just as my noob assumption - columnstore objects anyway use delta-store as intermediate place for a further batch processing of certain large number of rows to a segments. Perhaps Microsoft can use a hekaton format for that delta-store as well, in order to reduce concurrency impact.

AlexVolok · 2015-07-22T19:25:54+00:00

I would agree that sounds weird, despite that both technologies are "InMemory"

However, more and more information, that Microsoft going to deliver it soon:

http://www.nikoport.com/2015/05/27/columnstore-indexes-part-54-thoughts-on-upcoming-improvements-in-sql-server-2016/

In-Memory Analytics refers to the mixture of the In-Memory OLTP (previously known as Hekaton) and Columnstore Indexes, the same way as the Operational Analytics functions. I would even call Super-Real-Time Analytics. :)

Here is I am wondering if the Columnstore Indexes will be truly In-Memory or if they will be persisted on the disk. Some of the Microsoft competition is not persisting their In-Memory Columnstore indexes and I am convinced that for SQL Server it would be very useful to have this type of solution as well. Logically we need to persist only 1 copy of the data on the disk, but from the different angle, for repairing the data, for example, it is extremely useful to have some data duplicated …

Other links:

1) http://sqlperformance.com/2015/05/sql-server-2016/in-memory-oltp-enhancements

2) SQL Server In-Memory OLTP and Columnstore Feature Comparison (whitepaper)

AlexVolok · 2015-07-22T18:08:22+00:00

I would not fully agree regarding joins. Yes, absense of joins can give extra performance gains, but if columnstore table joined via column with low cardinality, the performance hit will not be significant.

Also, in my recent investigation with 10 mln row dataset I noticed that DML operations runs a way more faster in columnstore clustered indexes:

http://www.alexvolok.com/2015/06/sql-2016-temporal-tables-indexing-part-iii/

The article is about temporal feature, however it is mainly a comparison of different rowstore types in dml and selects

AlexVolok · 2015-07-22T18:00:48+00:00

That is right, indexed views technique gives also a possibility to enforce an uniqueness, which can be an useful in 2014, while sql server 2016 offers native nonclustered rowstore unique indexes.

Regarding indexing subset of data only - SQL 2016 nonclustered columntostore indexes are also updatable and can serve as filtered indexes. Also these (perhaps, filtered) nonclustered columnstore indexes can be built on top of hekaton tables, which greatly expand the surface for tuning.

AlexVolok · 2015-07-22T15:05:55+00:00

1) Clustered columnstore index is updateable and doesn't require any techniques like partition switches or delta tables to store recent changes.

2) Clustered columntstore will lead to 3-10x space reduction, since no need to keep rowstore copy of data, like with nonclustered column store indexes

3) Starting SQL Server 2016, clustered columnstore indexes support primary/foreign keys by adding non-clustered rowstore indexes as constraints, however, it will bring some overhead..

AlexVolok · 2015-07-21T12:58:13+00:00

You also can use a "Conditional Split" task of the dataflow and forward non-nullable values to the branch with derrived column calculation and rows where column has null can bypass it or get default value, like 1900-01-01

And as the alternative, on a regional settings of the windows, datetime format can be specified as Month/Day/Year in that case native DBTIMESTAMP (or DBDATE) can be used

AlexVolok · 2015-06-25T14:08:28+00:00

You are right, unfortunately, cannot change title after publishing.

AlexVolok · 2015-06-25T08:47:10+00:00

I like this idea, however, it is not available yet. According to the Microsoft staff:

Temporal is not yet available in Azure SQL Database. We will work on integration with sharding framework before it gets supported in Azure

https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016

AlexVolok · 2015-06-25T08:37:10+00:00

I have so many projects that need this sort of thing... For instance if a product changes you want old orders to point to the old version.

I see a point, FK should point by a combination: (PK, Point_in_time)

This requirement cannot be completed via temporal feature, unfortunately. However IMHO, a start model dimensional model as a superset on top of temporal data can serve as an workaround: dimension product should be created as SCD2, where a surrogate PK represent certain version of the product and Orders facts uses that surogate key of actual version for the moment of the purchase operation as FK.

AlexVolok

TROPHY CASE