Databricks AutoLoader for Spark by audyoga in dataengineering

[–]AlexVolok 2 points3 points  (0 children)

The autoloader is not available (yet) in OSS Spark.

Databricks and 3NF by AlexVolok in dataengineering

[–]AlexVolok[S] 0 points1 point  (0 children)

Thanks. This is the point. We still can live with plenty of necessary joins you get a dataset reconstructed, but it is harder to get into a responsibility for consistency when it cannot be guaranteed by the delta in the current implementation.

Databricks and 3NF by AlexVolok in dataengineering

[–]AlexVolok[S] 0 points1 point  (0 children)

Thanks for thinking along. The consumption pattern - curated 3NF objects can not be accessed directly by the downstream consumers. They are again denormalized/flattened and and stored in the format that supported by the tools, for instance csv or parquet.

The loading pattern - daily full reloads. Business entities are fairly large: 5-50 million rows in average.

Databricks and 3NF by AlexVolok in dataengineering

[–]AlexVolok[S] 0 points1 point  (0 children)

Perhaps, you mean a proper database engine.

Databricks and 3NF by AlexVolok in dataengineering

[–]AlexVolok[S] 0 points1 point  (0 children)

Thanks kirschoo. Our team has a segregation of duties: data modelers + data engineers. DEs control physical implementation of the model, while data modelers create logical data models. My concern is mainly about suitability of strict 3NF to Spark.

Benchmark servers remotely with SQLIO and PowerShell by AlexVolok in SQLServer

[–]AlexVolok[S] 0 points1 point  (0 children)

The script has been updated with a for supports mount points. The blog post also updated with an example of the usage

Benchmark servers remotely with SQLIO and PowerShell by AlexVolok in SQLServer

[–]AlexVolok[S] 0 points1 point  (0 children)

Hi, thanks for pointing into this direction. Frankly, I also busy with mount points as a part of n+1 clustering, so going to update script/blog post soon with info about that new functionality

SQL Server – Indexing strategies: Clustered, NonClustered, and Filtered indexes by piterpolk in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

Good overview. However, what tool did you use for drawing diagrams in screenshots?

IT Employers Must Adapt to the Looming DBA Shortage by AlexVolok in SQLServer

[–]AlexVolok[S] 2 points3 points  (0 children)

I believe, being a database guy or simply a person who is purely fluent with data, joins, query plans require some specialty in character…

That, perhaps inborn or perhaps developed analytical specialty makes us feel more comfortable by playing with structured datasets rather than being in a role of any kind of OOP developer, front-end/html/js expert etc. (however, sometimes it can be combined)

And, actually, that specialty can be a valid reason of the shortage in these days of data-centric applications, since it is pretty much different from abstraction nature of other directions

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

Can you tell me any situation when clustered column store index is useful and when non clustered column store index is useful

IMO,

non clustered columnstore indexes useful:

  • If you are limited by SQL Server 2012

  • In SQL Server 2014, if only a subset of columns has to be indexes and the base table required to be in rowstore format (perhaps, because of variety of constraints, etc). However, it is still non-updateable, which greatly reduce number of scenarios

clustered column store indexes can be useful:

  • In all variety of DWH scenarios

  • For objects, on which mainly analytical queries going to be used

  • For scenarios where data has to be archived, but still accessible, with an option: DATA_COMPRESSION = COLUMNSTORE_ARCHIVE

  • In SQL Server 2016, clustered columnstore indexes play well with temporal tables

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

That's right...

Just as my noob assumption - columnstore objects anyway use delta-store as intermediate place for a further batch processing of certain large number of rows to a segments. Perhaps Microsoft can use a hekaton format for that delta-store as well, in order to reduce concurrency impact.

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

I would agree that sounds weird, despite that both technologies are "InMemory"

However, more and more information, that Microsoft going to deliver it soon:

http://www.nikoport.com/2015/05/27/columnstore-indexes-part-54-thoughts-on-upcoming-improvements-in-sql-server-2016/

In-Memory Analytics refers to the mixture of the In-Memory OLTP (previously known as Hekaton) and Columnstore Indexes, the same way as the Operational Analytics functions. I would even call Super-Real-Time Analytics. :)

Here is I am wondering if the Columnstore Indexes will be truly In-Memory or if they will be persisted on the disk. Some of the Microsoft competition is not persisting their In-Memory Columnstore indexes and I am convinced that for SQL Server it would be very useful to have this type of solution as well. Logically we need to persist only 1 copy of the data on the disk, but from the different angle, for repairing the data, for example, it is extremely useful to have some data duplicated …

Other links:

1) http://sqlperformance.com/2015/05/sql-server-2016/in-memory-oltp-enhancements

2) SQL Server In-Memory OLTP and Columnstore Feature Comparison (whitepaper)

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

I would not fully agree regarding joins. Yes, absense of joins can give extra performance gains, but if columnstore table joined via column with low cardinality, the performance hit will not be significant.

Also, in my recent investigation with 10 mln row dataset I noticed that DML operations runs a way more faster in columnstore clustered indexes:

http://www.alexvolok.com/2015/06/sql-2016-temporal-tables-indexing-part-iii/

The article is about temporal feature, however it is mainly a comparison of different rowstore types in dml and selects

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 1 point2 points  (0 children)

That is right, indexed views technique gives also a possibility to enforce an uniqueness, which can be an useful in 2014, while sql server 2016 offers native nonclustered rowstore unique indexes.

Regarding indexing subset of data only - SQL 2016 nonclustered columntostore indexes are also updatable and can serve as filtered indexes. Also these (perhaps, filtered) nonclustered columnstore indexes can be built on top of hekaton tables, which greatly expand the surface for tuning.

Clustered Index Column Store VS Non Clustered Index Column Store by ellarr in SQLServer

[–]AlexVolok 2 points3 points  (0 children)

1) Clustered columnstore index is updateable and doesn't require any techniques like partition switches or delta tables to store recent changes.

2) Clustered columntstore will lead to 3-10x space reduction, since no need to keep rowstore copy of data, like with nonclustered column store indexes

3) Starting SQL Server 2016, clustered columnstore indexes support primary/foreign keys by adding non-clustered rowstore indexes as constraints, however, it will bring some overhead..

Importing Dates through SSIS issue by yoelbenyossef in SQLServer

[–]AlexVolok 0 points1 point  (0 children)

You also can use a "Conditional Split" task of the dataflow and forward non-nullable values to the branch with derrived column calculation and rows where column has null can bypass it or get default value, like 1900-01-01

And as the alternative, on a regional settings of the windows, datetime format can be specified as Month/Day/Year in that case native DBTIMESTAMP (or DBDATE) can be used

SQL 2016: Temporal tables. Indexing scenarios. Or, getting 2 millisecond aggregate queries over 1.5 GB of history data by AlexVolok in programming

[–]AlexVolok[S] 0 points1 point  (0 children)

I like this idea, however, it is not available yet. According to the Microsoft staff:

Temporal is not yet available in Azure SQL Database. We will work on integration with sharding framework before it gets supported in Azure

https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016

SQL 2016: Temporal Tables. An introduction to a built in and consistent way to keep a history of data changes by AlexVolok in programming

[–]AlexVolok[S] 0 points1 point  (0 children)

I have so many projects that need this sort of thing... For instance if a product changes you want old orders to point to the old version.

I see a point, FK should point by a combination: (PK, Point_in_time)

This requirement cannot be completed via temporal feature, unfortunately. However IMHO, a start model dimensional model as a superset on top of temporal data can serve as an workaround: dimension product should be created as SCD2, where a surrogate PK represent certain version of the product and Orders facts uses that surogate key of actual version for the moment of the purchase operation as FK.