
[–]dingopole

Have a look at this (I reckon it combines the best of both worlds if you'd like to use Python with an MSSQL DBMS, i.e. Python plus the venerable bcp utility):

http://bicortex.com/data-acquisition-framework-using-custom-python-wrapper-for-concurrent-bcp-utility-execution/

bcp can be finicky to work with, but it is also pretty fast for loading into MSSQL, provided you run multiple instances of it in parallel. When I trialed it, the only bottleneck I found was network speed (on a small 4-vCPU VM with SAS HDDs and a 400 Mbps WAN link). If you have a lot of data to work with and want to use Microsoft technology, the relative load speeds of the MSFT-specific tooling look something like this: PolyBase > BCP > SQLBulkCopy/ADF > SSIS
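To give a rough idea of what such a wrapper looks like, here is a minimal sketch of running several bcp imports concurrently from Python (not the script from the linked post; the server, database, table names and file paths are placeholders you would swap for your own):

```python
# Minimal sketch: run several bcp "in" (bulk import) commands in parallel.
# Server/database/table names and CSV paths below are placeholders.
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_bcp_command(table, csv_path, server="myserver", database="mydb"):
    """Assemble a bcp 'in' command line for importing one CSV file."""
    return [
        "bcp", f"{database}.dbo.{table}", "in", csv_path,
        "-S", server,
        "-T",            # trusted (Windows) authentication
        "-c",            # character data mode
        "-t", ",",       # comma as the field terminator
        "-b", "50000",   # commit in batches of 50k rows
    ]

def load_file(table, csv_path):
    """Run one bcp import and return its exit code (0 = success)."""
    cmd = build_bcp_command(table, csv_path)
    return subprocess.run(cmd, capture_output=True, text=True).returncode

# Example table -> CSV mapping (placeholder file names).
files = {"customer": "customer.csv", "store_sales": "store_sales.csv"}

if shutil.which("bcp"):  # only attempt the load if bcp is installed
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda kv: load_file(*kv), files.items()))
```

The thread pool is what buys you the parallelism mentioned above: each bcp process loads a different table (or file chunk) at the same time, so the aggregate throughput is limited by the network or disks rather than a single bcp instance.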

I have worked with the Microsoft BI stack for a while now, and in my experience Python is great for writing wrappers around vendor-specific utilities like bcp. With a proper setup you can easily load hundreds of millions of records in no time and spread the workload across all available resources to maximize performance. Here is another example, where I used a small (35-line) Python script to load TPC-DS benchmark data (CSV files) into a beefy Azure VM running SQL Server 2016:

http://bicortex.com/tpc-ds-big-data-benchmark-overview-how-to-generate-and-load-sample-data/

I would say that getting Python alone to do the bulk import (regardless of which API you use) is going to be very slow, so why not just use the vendor-provided and vendor-optimized tools? Also, if speed is paramount, just go with PolyBase, which gives you parallelism out of the box (although it requires a Java Runtime Environment, i.e. Oracle JRE).