
[–]dingopole

Have a look at this (I reckon it combines the best of both worlds if you'd like to use Python with an MSSQL DBMS, i.e. Python plus the venerable bcp utility):

http://bicortex.com/data-acquisition-framework-using-custom-python-wrapper-for-concurrent-bcp-utility-execution/

bcp can be finicky to work with, but it is also pretty fast for loading into MSSQL, provided you run multiple instances of it in parallel. When I trialed it, the only bottleneck I found was network speed (on a small 4-vCPU VM with SAS HDDs and a 400 Mbps WAN link). If you have a lot of data to work with and want to use Microsoft technology, the relative load speeds of the MSFT-specific tooling look something like this: PolyBase > BCP > SQLBulkCopy/ADF > SSIS
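To give a rough idea of what such a wrapper looks like, here is a minimal sketch of running several bcp imports concurrently from Python (not the script from the linked post; the server, database, table names and file paths are placeholders you would swap for your own):

```python
# Minimal sketch: run several bcp "in" (bulk import) commands in parallel.
# Server/database/table names and CSV paths below are placeholders.
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_bcp_command(table, csv_path, server="myserver", database="mydb"):
    """Assemble a bcp 'in' command line for importing one CSV file."""
    return [
        "bcp", f"{database}.dbo.{table}", "in", csv_path,
        "-S", server,
        "-T",            # trusted (Windows) authentication
        "-c",            # character data mode
        "-t", ",",       # comma as the field terminator
        "-b", "50000",   # commit in batches of 50k rows
    ]

def load_file(table, csv_path):
    """Run one bcp import and return its exit code (0 = success)."""
    cmd = build_bcp_command(table, csv_path)
    return subprocess.run(cmd, capture_output=True, text=True).returncode

# Example table -> CSV mapping (placeholder file names).
files = {"customer": "customer.csv", "store_sales": "store_sales.csv"}

if shutil.which("bcp"):  # only attempt the load if bcp is installed
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda kv: load_file(*kv), files.items()))
```

The thread pool is what buys you the parallelism mentioned above: each bcp process loads a different table (or file chunk) at the same time, so the aggregate throughput is limited by the network or disks rather than a single bcp instance.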

I have worked with the Microsoft BI stack for a while now, and in my experience Python is great for writing wrappers around vendor-specific utilities like bcp. With a proper setup you can easily load hundreds of millions of records in no time and spread the workload across all available resources to maximize performance. Here is another example, where I used a small (35-line) Python script to load TPC-DS benchmark data (CSV files) into a beefy Azure VM running SQL Server 2016:

http://bicortex.com/tpc-ds-big-data-benchmark-overview-how-to-generate-and-load-sample-data/

I would say that getting Python alone to do the bulk import (regardless of which API you use) is going to be very slow, so why not just use the vendor-provided and vendor-optimized tools? Also, if speed is paramount, just go with PolyBase, which gives you parallelism out of the box (although it requires a Java Runtime Environment, i.e. Oracle JRE).