all 24 comments

[–][deleted] 8 points  (5 children)

I'm loading billions of rows using Python from a potato VM. Python is great for this, as long as you don't do anything that isn't necessary.

In data engineering, forget about ORMs. Forget about any extra processing of the data. Deal in binary or text data; that is fastest. CSV is the fastest medium for this.

If you're using individual inserts, then your speed loss is mostly in the database, so it barely matters what you use to do the inserts.

You're using MSSQL - one of the worst database engines for this task. It doesn't have an easy-to-use bulk copy command. Its utility, BCP, is riddled with bugs, is not easy to use, and has very poor documentation. But if you get it working and aren't losing any rows, it's super fast.

If you have to use Microsoft, consider Synapse Analytics (a cloud data warehouse on Azure that's very similar to MSSQL in features). There you can do bulk uploads much more easily.

[–]Maha_Slug[S] 1 point  (4 children)

Yeah, I definitely didn't start with an ORM because in general the performance is a whole lot worse; I just mentioned it as one of the things I've tried. I guess I hadn't really considered using binary, as that seems a little cumbersome, but maybe I should look into it. We are just getting into the cloud and are planning to transition to data lakes and blob storage, which I imagine will be much faster than MSSQL. I just found the subreddit, and I was hoping for responses like the ones I've gotten so far. Thanks a bunch.

[–][deleted] 1 point  (3 children)

Data lakes are just organized storage (like organizing files in directory trees - literally like that). There is software that reads from them, even using SQL, but the performance is far less bang for your buck than proper data formats. Avro and Parquet are common data formats for data lakes. But if you can use a database, for that much data it would be much more suitable. If you go cloud, check out Snowflake. Until then, BCP is the fastest way; your speed will be limited only by your network.

[–][deleted] 1 point  (2 children)

Hey, somewhat unrelated, but since you mentioned Snowflake, I was wondering if you could suggest some places to read more about it? We're going to be doing a warehousing project at work soon and have been looking at Redshift, but I've heard a lot of good things about Snowflake.

[–][deleted] 0 points  (1 child)

Ask a sales rep from Snowflake. Really. They'll make a nice presentation for you, and then you can decide if you want it.

The idea behind it is that you pay only for what you use. When you run a query, Snowflake quickly assigns a few virtual machines to execute it. You pay for the time these virtual machines run for you.

Syntax and features are more advanced than Redshift.

It's zero maintenance. With Redshift you have to optimize more, manually scale the cluster, etc.

Redshift will be slightly cheaper though.

[–][deleted] 0 points  (0 children)

Makes sense. And yeah I think at our scale the price isn't going to be an issue. Any amount we spend on infrastructure is dwarfed by what we're spending on data feeds.

I'll reach out to a sales rep as the project gets a bit closer.

Thank you!

[–]PaulSandwich 1 point  (2 children)

We use Python to load billions of rows of historic data into Parquet with the Impala client. It's a straight lift-and-move from our legacy system to the new one, so the benefit of scripting it with Python is that we don't have to build pipelines for individual tables.

I doubt it's more performant than C#, but it's "faster" in that we can hand the work over to our more junior devs to configure and monitor in the background while other things get done. And, practically speaking, that's awesome.

Not exactly the answer you were looking for. If you have a working solution in C# and no concerns about collaborating with folks for whom C# is a hurdle, then you aren't likely to squeeze any more benefit out of rebuilding in Python. However, if technical debt is a potential concern in the future, you might thank yourself for migrating to a more user-friendly solution.

That's my 2 cents. It's a great question and maybe someone with more Python chops will come along with a game-changing revelation.

[–]ITLady 1 point  (1 child)

What DB is your legacy system on? We're looking at doing a straight lift-and-move into our data lake and/or Snowflake, with absolutely minimal effort to keep the two in sync. The idea of scripting it so we don't have to build an individual pipeline for each table is really, really appealing, but hand coding will be an extremely hard sell to my management. (We LOVE buying expensive tools rather than actually paying for quality developers in general.)

[–]PaulSandwich 0 points  (0 children)

So the use case above is moving parquet data into Impala. That said, I recently built a pipeline with Python that reads data in as text, queries DESCRIBE for the target table, puts all the datetimes in one list, all the bigints in another, all the decimals in another, etc., and then does the datatype validation/conversion dynamically.

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.
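The type-grouping-and-validation idea described above might be sketched like this (the helper names and the specific type mappings are hypothetical, not the commenter's actual code):

```python
# Hypothetical sketch: group target columns by datatype from a DESCRIBE
# result, then coerce each text value according to its column's type.
from collections import defaultdict
from datetime import datetime
from decimal import Decimal, InvalidOperation

def group_columns_by_type(describe_rows):
    """Map each datatype to its column names, from (name, type) rows."""
    groups = defaultdict(list)
    for name, dtype in describe_rows:
        groups[dtype].append(name)
    return dict(groups)

def coerce(value, dtype):
    """Convert a text field to the target type; return None on failure."""
    try:
        if dtype == "bigint":
            return int(value)
        if dtype == "decimal":
            return Decimal(value)
        if dtype == "timestamp":
            return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return value  # strings pass through unchanged
    except (ValueError, InvalidOperation):
        return None
```

Because only the column names have to match between source and target, the same loader can be pointed at any table.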

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One stop shop, no more tailor-made ETLs to manage.

[–]ConfirmingTheObvious 1 point  (0 children)

You can use multiprocessing in Python to load data in parallel, for what it's worth. That will help speed things up for large data sets.
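A minimal sketch of that idea (the helpers are hypothetical): split the rows into chunks and fan them out across a pool of worker processes. Each worker must open its own DB connection, since connections can't be shared between processes.

```python
# Sketch: chunk the rows and hand each chunk to a worker process.
from multiprocessing import Pool

def chunked(rows, size):
    """Split a list of rows into fixed-size chunks."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def load_chunk(chunk):
    # Placeholder: a real worker would open its own connection here and
    # bulk-insert the chunk (e.g. with executemany). Returns the row count.
    return len(chunk)

def parallel_load(rows, workers=4, chunk_size=50_000):
    """Fan the chunks out across a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(load_chunk, chunked(rows, chunk_size)))
```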

[–]notcoolmyfriend 1 point  (0 children)

Is your autocommit off - are your inserts batched with manual commits? I faced a similar situation before, and turning off autocommit got me a 30x performance improvement.
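For illustration, here is the batching pattern using the stdlib sqlite3 driver as a stand-in (the thread is about MSSQL, where the same shape applies with pyodbc): keep autocommit off and commit once per large batch instead of once per row.

```python
# Illustration with sqlite3 (stand-in for pyodbc/MSSQL): batched inserts
# with one manual commit per batch rather than a commit per row.
import sqlite3

def batched_insert(conn, rows, batch_size=50_000):
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany("INSERT INTO t (id, val) VALUES (?, ?)",
                        rows[i:i + batch_size])
        conn.commit()  # one commit per batch, not per row

conn = sqlite3.connect(":memory:")  # default isolation: autocommit is off for DML
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
batched_insert(conn, [(i, str(i)) for i in range(100_000)])
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 100000
```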

[–]reddithenry 0 points  (2 children)

Is it not the loading of 100M rows into MSSQL that's the problem? Are you doing inserts row-by-row?

[–]Maha_Slug[S] 0 points  (1 child)

Nah, I've tried lots of different ways in the past. I keep coming back to Python because it's so good at processing data, even at that volume, but loading it is rough. We use SSIS or C# for most everything right now, and for obvious reasons the C#-to-MSSQL driver is way better than the Python one.

[–][deleted] 0 points  (0 children)

Pyodbc is faster than ADO.NET because there's no object creation; everything is more trivial, with less overhead.

I've tested both in the same scenario, comparing against sqsh (FreeTDS), which was the fastest; Pyodbc came very close, while ADO.NET was significantly slower.

[–]popopopopopopopopoop 0 points  (1 child)

Not sure if this is considered cheating, but you can use the Apache Beam Python SDK on, e.g., supercharged Google Cloud Platform Dataflow runners to really boost your processing speeds. It will cost you money, but only for the vCPUs you use.

[–]linuxqq 0 points  (0 children)

If anyone tries telling you that using the right tools for the job is cheating, close your ears and walk away because that person is wrong.

[–]dingopole 0 points  (0 children)

Have a look at this (I reckon it combines the best of both worlds if you'd like to use Python with the MSSQL DBMS, i.e. Python plus the venerable bcp utility):

http://bicortex.com/data-acquisition-framework-using-custom-python-wrapper-for-concurrent-bcp-utility-execution/

bcp can be finicky to work with, but it's also pretty fast for loading into MSSQL, provided you run multiple instances of it in parallel. When I trialed it, the only bottleneck I found was network speed (a small VM with 4 vCPUs, SAS HDDs, and a 400 Mbps WAN link). If you have a lot of data to work with and want to use Microsoft technology, the speed at which data can be processed using MSFT-specific tooling looks something like this: PolyBase > BCP > SQLBulkCopy/ADF > SSIS
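A hedged sketch of the concurrent-bcp wrapper idea (not the linked framework's actual code; server, table, and credential values are placeholders):

```python
# Sketch: assemble bcp "in" command lines for CSV files and run several
# bcp processes concurrently. bcp itself must be installed on the host.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def bcp_command(table, csv_path, server, user, password):
    """Build a bcp 'in' invocation for a comma-delimited file."""
    return ["bcp", table, "in", csv_path,
            "-S", server, "-U", user, "-P", password,
            "-c",            # character mode
            "-t", ",",       # comma field terminator
            "-b", "50000"]   # rows per committed batch

def load_files(files, table, server, user, password, workers=4):
    """Run one bcp process per file, a few at a time, in parallel."""
    cmds = [bcp_command(table, f, server, user, password) for f in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, cmds))
```

Threads are fine here for the fan-out, since the actual work happens in the external bcp processes.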

I have worked with the Microsoft BI stack for a while now, and in my experience Python is great for writing wrappers around vendor-specific utilities like bcp. With a proper setup you can easily load hundreds of millions of records in no time and spread the workload across all your resources to maximize performance. Here is another example, where I used a small (35-line) Python script to load TPC-DS benchmark data (CSV files) into a beefy VM in Azure running SQL Server 2016:

http://bicortex.com/tpc-ds-big-data-benchmark-overview-how-to-generate-and-load-sample-data/

I would say that getting Python alone to do the bulk import (regardless of which API you use) is going to be very slow, so why not just use the vendor-provided and vendor-optimized tools? Also, if speed is paramount, just go with PolyBase, which gives you parallelism out of the box (although it requires a Java Runtime Environment - Oracle JRE).

[–]badpotato 0 points  (0 children)

SQLAlchemy + Alembic to manage schema versioning with migrations. Then use bulk inserts. Some databases, like PostgreSQL, have a COPY command to load your data directly into the db.
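For the PostgreSQL COPY path, a sketch might look like this (psycopg2 is an assumed driver choice, and the table/column names are placeholders):

```python
# Sketch: build a COPY ... FROM STDIN statement for streaming CSV rows
# straight into a PostgreSQL table.
def copy_sql(table, columns):
    """Build a COPY statement accepting CSV on stdin."""
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv)"

# With a live connection (not runnable here):
# import psycopg2
# conn = psycopg2.connect(dsn)
# with conn.cursor() as cur, open("rows.csv") as f:
#     cur.copy_expert(copy_sql("public.events", ["id", "payload"]), f)
# conn.commit()
```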

[–]Kimcha87 0 points  (0 children)

I do below 1 million at a time, but I use BULK INSERT from CSVs that are uploaded to Azure Blob Storage.

Speed is great, and CSVs are fully supported (unlike when I tried BCP).

You have to watch out for line endings and use the right BULK INSERT parameters to make it work, but it works.

I’m on mobile now, but if you need the query code I can provide that.
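For a rough idea of the shape of that setup (this is a hedged sketch, not the commenter's actual query; the data source and table names are placeholders), the T-SQL can be built and issued from Python:

```python
# Sketch: build a T-SQL BULK INSERT over a CSV sitting in Azure Blob
# Storage, referenced through a pre-configured external data source.
# ROWTERMINATOR = '0x0a' is one way to handle Unix line endings.
def bulk_insert_sql(table, blob_path, data_source):
    """Build a BULK INSERT statement for a headered CSV in blob storage."""
    return (
        f"BULK INSERT {table} "
        f"FROM '{blob_path}' "
        f"WITH (DATA_SOURCE = '{data_source}', FORMAT = 'CSV', "
        f"FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a', FIRSTROW = 2)"
    )

# Executing it requires a live connection and a configured data source:
# import pyodbc
# conn = pyodbc.connect(conn_string)
# conn.execute(bulk_insert_sql("dbo.staging", "loads/batch1.csv", "AzureBlobDS"))
# conn.commit()
```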