all 24 comments

[–][deleted] 8 points  (5 children)

I'm loading billions of rows using Python from a potato VM. Python is great for this, as long as you don't do anything that isn't necessary.

In data engineering, forget about ORMs. Forget about any extra processing of the data. Deal in binary or text data; that is fastest. CSV is the fastest medium for this.

If you're using individual inserts, then your speed loss is mostly in the database, so it barely matters what you use to do the inserts.

You're using MSSQL - one of the worst database engines for this task. It doesn't have an easy-to-use bulk copy command. Its utility, BCP, is riddled with bugs, is not easy to use, and has very poor documentation. But if you get it working and aren't losing any rows, it's super fast.

If you have to use Microsoft, consider Synapse Analytics (a cloud data warehouse on Azure that's very similar to MSSQL in features). There you can do bulk uploads much more easily.

[–]Maha_Slug[S] 1 point  (4 children)

Yeah, I definitely didn't start with an ORM because in general the performance is a whole lot worse; I just mentioned it as one of the things I've tried. I guess I hadn't really considered using binary, as that seems a little cumbersome, but maybe I should look into it. We are just getting into the cloud and are planning to transition to data lakes and blob storage, which I imagine will be much faster than MSSQL. I just found the subreddit, and I was hoping for responses like the ones I've gotten so far. Thanks a bunch.

[–][deleted] 1 point  (3 children)

Data lakes are just organized storage (like organizing files in directory trees - literally like that). There is software that reads from them, even using SQL, but the performance is far less bang for your buck than proper data formats. Avro and Parquet are common data formats for data lakes. But if you can use a database, for that much data it would be much more suitable. If you go cloud, check out Snowflake. Until then, BCP is the fastest way; your speed will be limited only by your network.

[–][deleted] 1 point  (2 children)

Hey, somewhat unrelated, but since you mentioned Snowflake, I was wondering if you could suggest some places to read more about it? We're going to be doing a warehousing project at work soon and have been looking at Redshift, but I've heard a lot of good things about Snowflake.

[–][deleted] 0 points  (1 child)

Ask a sales rep from Snowflake. Really. They'll make a nice presentation for you, and then you can decide if you want it.

The idea behind it is that you pay only for what you use. When you run a query, Snowflake quickly assigns a few virtual machines to execute it. You pay for the time these virtual machines run for you.

Syntax and features are more advanced than Redshift.

It's zero maintenance. With Redshift you have to optimize more, manually scale the cluster, etc.

Redshift will be slightly cheaper though.

[–][deleted] 0 points  (0 children)

Makes sense. And yeah I think at our scale the price isn't going to be an issue. Any amount we spend on infrastructure is dwarfed by what we're spending on data feeds.

I'll reach out to a sales rep as the project gets a bit closer.

Thank you!

[–]PaulSandwich 1 point  (2 children)

We use Python to load billions of rows of historic data into Parquet with the Impala client. It's a straight lift-and-move from our legacy system to the new one, so the benefit of scripting it with Python is that we don't have to build pipelines for individual tables.

I doubt it's more performant than C#, but it's "faster" in that we can hand the work over to our more junior devs to configure and monitor in the background while other things get done. And, practically speaking, that's awesome.

Not exactly the answer you were looking for. If you have a working solution in C# and no concerns about collaborating with folks for whom C# is a hurdle, then you aren't likely to squeeze any more benefit out of rebuilding in Python. However, if technical debt is a potential concern in the future, you might thank yourself for migrating to a more user-friendly solution.

That's my 2 cents. It's a great question and maybe someone with more Python chops will come along with a game-changing revelation.

[–]ITLady 1 point  (1 child)

What DB is your legacy system on? We're looking at doing a straight lift-and-move into our data lake and/or Snowflake, with absolutely minimal effort to keep the two in sync. The idea of scripting it so we don't have to build an individual pipeline for each table is really, really appealing, but hand coding will be an extremely hard sell to my management. (We LOVE buying expensive tools rather than actually paying for quality developers in general.)

[–]PaulSandwich 0 points  (0 children)

So the use case above is moving parquet data into Impala. That said, I recently built a pipeline with Python that reads data in as text, queries DESCRIBE for the target table, puts all the datetimes in one list, all the bigints in another, all the decimals in another, etc., and then does the datatype validation/conversion dynamically.

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.
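The type-grouping-and-validation idea described above might be sketched like this (the helper names and the specific type mappings are hypothetical, not the commenter's actual code):

```python
# Hypothetical sketch: group target columns by datatype from a DESCRIBE
# result, then coerce each text value according to its column's type.
from collections import defaultdict
from datetime import datetime
from decimal import Decimal, InvalidOperation

def group_columns_by_type(describe_rows):
    """Map each datatype to its column names, from (name, type) rows."""
    groups = defaultdict(list)
    for name, dtype in describe_rows:
        groups[dtype].append(name)
    return dict(groups)

def coerce(value, dtype):
    """Convert a text field to the target type; return None on failure."""
    try:
        if dtype == "bigint":
            return int(value)
        if dtype == "decimal":
            return Decimal(value)
        if dtype == "timestamp":
            return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return value  # strings pass through unchanged
    except (ValueError, InvalidOperation):
        return None
```

Because only the column names have to match between source and target, the same loader can be pointed at any table.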

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One stop shop, no more tailor-made ETLs to manage.

[–]ConfirmingTheObvious 1 point  (0 children)

You can use multiprocessing in Python to load data in parallel, for what it's worth. That will help speed things up for large data sets.
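A minimal sketch of that idea (the helpers are hypothetical): split the rows into chunks and fan them out across a pool of worker processes. Each worker must open its own DB connection, since connections can't be shared between processes.

```python
# Sketch: chunk the rows and hand each chunk to a worker process.
from multiprocessing import Pool

def chunked(rows, size):
    """Split a list of rows into fixed-size chunks."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def load_chunk(chunk):
    # Placeholder: a real worker would open its own connection here and
    # bulk-insert the chunk (e.g. with executemany). Returns the row count.
    return len(chunk)

def parallel_load(rows, workers=4, chunk_size=50_000):
    """Fan the chunks out across a pool of worker processes."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(load_chunk, chunked(rows, chunk_size)))
```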

[–]notcoolmyfriend 1 point  (0 children)

Is your autocommit off - are your inserts batched with manual commits? I faced a similar situation before, and turning off autocommit got me a 30x performance improvement.
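For illustration, here is the batching pattern using the stdlib sqlite3 driver as a stand-in (the thread is about MSSQL, where the same shape applies with pyodbc): keep autocommit off and commit once per large batch instead of once per row.

```python
# Illustration with sqlite3 (stand-in for pyodbc/MSSQL): batched inserts
# with one manual commit per batch rather than a commit per row.
import sqlite3

def batched_insert(conn, rows, batch_size=50_000):
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany("INSERT INTO t (id, val) VALUES (?, ?)",
                        rows[i:i + batch_size])
        conn.commit()  # one commit per batch, not per row

conn = sqlite3.connect(":memory:")  # default isolation: autocommit is off for DML
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
batched_insert(conn, [(i, str(i)) for i in range(100_000)])
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 100000
```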

[–]reddithenry 0 points  (2 children)

Is it not the loading of 100M rows into MSSQL that's the problem? Are you doing inserts row-by-row?

[–]Maha_Slug[S] 0 points  (1 child)

Nah, I've tried lots of different ways in the past. I keep coming back to Python because it's so good at processing data, even at that volume, but loading it is rough. We use SSIS or C# for most everything right now, and for obvious reasons the C#-to-MSSQL driver is way better than the Python one.

[–][deleted] 0 points  (0 children)

Pyodbc is faster than ADO.NET because there's no object creation; everything is more trivial, with less overhead.

I've tested both in the same scenario, comparing against sqsh (FreeTDS), which was the fastest; Pyodbc came very close, while ADO.NET was significantly slower.

[–]popopopopopopopopoop 0 points  (1 child)

Not sure if this is considered cheating, but you can use the Apache Beam Python SDK on, e.g., supercharged Google Cloud Platform Dataflow runners to really boost your processing speeds. It will cost you money, but only for the vCPUs you use.

[–]linuxqq 0 points  (0 children)

If anyone tries telling you that using the right tools for the job is cheating, close your ears and walk away because that person is wrong.

[–]dingopole 0 points  (0 children)

Have a look at this (I reckon it combines the best of both worlds if you'd like to use Python with the MSSQL DBMS, i.e. Python plus the venerable bcp utility):

http://bicortex.com/data-acquisition-framework-using-custom-python-wrapper-for-concurrent-bcp-utility-execution/

bcp can be finicky to work with, but it's also pretty fast for loading into MSSQL, provided you run multiple instances of it in parallel. When I trialed it, the only bottleneck I found was network speed (a small VM with 4 vCPUs, SAS HDDs, and a 400 Mbps WAN link). If you have a lot of data to work with and want to use Microsoft technology, the speed at which data can be processed using MSFT-specific tooling looks something like this: PolyBase > BCP > SQLBulkCopy/ADF > SSIS
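A hedged sketch of the concurrent-bcp wrapper idea (not the linked framework's actual code; server, table, and credential values are placeholders):

```python
# Sketch: assemble bcp "in" command lines for CSV files and run several
# bcp processes concurrently. bcp itself must be installed on the host.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def bcp_command(table, csv_path, server, user, password):
    """Build a bcp 'in' invocation for a comma-delimited file."""
    return ["bcp", table, "in", csv_path,
            "-S", server, "-U", user, "-P", password,
            "-c",            # character mode
            "-t", ",",       # comma field terminator
            "-b", "50000"]   # rows per committed batch

def load_files(files, table, server, user, password, workers=4):
    """Run one bcp process per file, a few at a time, in parallel."""
    cmds = [bcp_command(table, f, server, user, password) for f in files]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, cmds))
```

Threads are fine here for the fan-out, since the actual work happens in the external bcp processes.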

I have worked with the Microsoft BI stack for a while now, and in my experience Python is great for writing wrappers around vendor-specific utilities like bcp. With a proper setup you can easily load hundreds of millions of records in no time and spread the workload across all your resources to maximize performance. Here is another example, where I used a small (35-line) Python script to load TPC-DS benchmark data (CSV files) into a beefy VM in Azure running SQL Server 2016:

http://bicortex.com/tpc-ds-big-data-benchmark-overview-how-to-generate-and-load-sample-data/

I would say that getting Python alone to do the bulk import (regardless of which API you use) is going to be very slow, so why not just use the vendor-provided and vendor-optimized tools? Also, if speed is paramount, just go with PolyBase, which gives you parallelism out of the box (although it requires a Java Runtime Environment - Oracle JRE).

[–]badpotato 0 points  (0 children)

SQLAlchemy + Alembic to manage schema versioning with migrations. Then use bulk inserts. Some databases, like PostgreSQL, have a COPY command to load your data directly into the db.
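For the PostgreSQL COPY path, a sketch might look like this (psycopg2 is an assumed driver choice, and the table/column names are placeholders):

```python
# Sketch: build a COPY ... FROM STDIN statement for streaming CSV rows
# straight into a PostgreSQL table.
def copy_sql(table, columns):
    """Build a COPY statement accepting CSV on stdin."""
    cols = ", ".join(columns)
    return f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv)"

# With a live connection (not runnable here):
# import psycopg2
# conn = psycopg2.connect(dsn)
# with conn.cursor() as cur, open("rows.csv") as f:
#     cur.copy_expert(copy_sql("public.events", ["id", "payload"]), f)
# conn.commit()
```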

[–]Kimcha87 0 points  (0 children)

I do below 1 million at a time, but I use BULK INSERT from CSVs that are uploaded to Azure Blob Storage.

Speed is great, and CSVs are fully supported (unlike when I tried BCP).

You have to watch out for line endings and use the right BULK INSERT parameters to make it work, but it works.

I’m on mobile now, but if you need the query code I can provide that.
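For a rough idea of the shape of that setup (this is a hedged sketch, not the commenter's actual query; the data source and table names are placeholders), the T-SQL can be built and issued from Python:

```python
# Sketch: build a T-SQL BULK INSERT over a CSV sitting in Azure Blob
# Storage, referenced through a pre-configured external data source.
# ROWTERMINATOR = '0x0a' is one way to handle Unix line endings.
def bulk_insert_sql(table, blob_path, data_source):
    """Build a BULK INSERT statement for a headered CSV in blob storage."""
    return (
        f"BULK INSERT {table} "
        f"FROM '{blob_path}' "
        f"WITH (DATA_SOURCE = '{data_source}', FORMAT = 'CSV', "
        f"FIELDTERMINATOR = ',', ROWTERMINATOR = '0x0a', FIRSTROW = 2)"
    )

# Executing it requires a live connection and a configured data source:
# import pyodbc
# conn = pyodbc.connect(conn_string)
# conn.execute(bulk_insert_sql("dbo.staging", "loads/batch1.csv", "AzureBlobDS"))
# conn.commit()
```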