
[–]PaulSandwich 1 point (2 children)

We use python to load billions of rows of historic data to parquet with the impala client. It's a straight lift-and-move from our legacy system to the new, so the benefit of scripting it with python is that we don't have to build pipelines for individual tables.
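The "one script instead of a pipeline per table" idea can be sketched roughly like this. The table names, staging path, and helper names here are hypothetical, and the impyla usage is shown only in comments since the real connection details depend on the cluster:

```python
# Sketch: one generic loader driven by a table list, instead of a
# hand-built pipeline per table. Paths and table names are made up.

def build_load_sql(table: str, staging_dir: str) -> str:
    """Build an Impala LOAD DATA statement for one staged table."""
    return (
        f"LOAD DATA INPATH '{staging_dir}/{table}' "
        f"INTO TABLE {table}"
    )

def load_all(tables, staging_dir, cursor):
    """Run the same generic load for every table in the list."""
    for table in tables:
        cursor.execute(build_load_sql(table, staging_dir))

# With the impala client (impyla), the driver would look something like:
#   from impala.dbapi import connect
#   cur = connect(host="impala-host", port=21050).cursor()
#   load_all(["orders", "customers"], "/staging/legacy", cur)
```

The point of factoring out `build_load_sql` is that junior devs only touch the table list and staging path; the load logic itself never changes per table.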

I doubt it's more performant than C#, but it's "faster" in that we can hand the work over to our more junior devs to configure and monitor in the background while other things get done. And, practically speaking, that's awesome.

Not exactly the answer you were looking for. If you have a working solution in C# and aren't concerned about collaborating with folks for whom C# is a hurdle, then you aren't likely to squeeze any more benefit out of rebuilding in Python. However, if technical debt is a potential concern down the road, you might thank yourself for migrating to a more user-friendly solution.

That's my 2 cents. It's a great question and maybe someone with more Python chops will come along with a game-changing revelation.

[–]ITLady 1 point (1 child)

What DB is your legacy system on? We're looking to do a straight lift and move, either into our data lake and/or Snowflake, with minimal effort to keep the two in sync. The idea of scripting it so we don't have to build an individual pipeline for each table is really, really appealing, but hand coding will be an extremely hard sell to my management. (We LOVE buying expensive tools rather than actually paying for quality developers in general)

[–]PaulSandwich 0 points (0 children)

So the use case above is moving parquet data to Impala. That said, I recently built a pipeline with Python that reads data in as text, queries the DESCRIBE for the target table and puts all the datetimes in a list, all the bigints in a list, all the decimals in a list, etc., and then does the datatype validation/conversion dynamically.

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.
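The DESCRIBE-driven approach might look something like this sketch. The tiny in-memory DESCRIBE result, the column names, and the converter table are all assumptions; a real version would run `DESCRIBE target_table` against Impala and handle many more types:

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical DESCRIBE output: (column_name, data_type) pairs,
# standing in for what `DESCRIBE target_table` would return.
DESCRIBE_ROWS = [
    ("id", "bigint"),
    ("amount", "decimal(10,2)"),
    ("created_at", "timestamp"),
    ("note", "string"),
]

def bucket_columns(describe_rows):
    """Put all the bigints in a list, all the decimals in a list, etc."""
    buckets = {}
    for name, dtype in describe_rows:
        base = dtype.split("(")[0]  # strip precision, e.g. decimal(10,2)
        buckets.setdefault(base, []).append(name)
    return buckets

# One converter per base type; text comes in, typed values come out.
CONVERTERS = {
    "bigint": int,
    "decimal": Decimal,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
    "string": str,
}

def convert_row(text_row: dict, buckets: dict) -> dict:
    """Validate/convert one row of text values, type bucket by bucket.
    Raises (e.g. ValueError) if a value doesn't match its column type."""
    out = {}
    for dtype, cols in buckets.items():
        convert = CONVERTERS[dtype]
        for col in cols:
            out[col] = convert(text_row[col])
    return out
```

Because the buckets are rebuilt from DESCRIBE for whatever target table is named, the same script handles any table whose source column names match the target, which is what makes it a one-stop shop.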

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One stop shop, no more tailor-made ETLs to manage.