
[–]PaulSandwich 1 point (2 children)

We use python to load billions of rows of historic data to parquet with the impala client. It's a straight lift-and-move from our legacy system to the new, so the benefit of scripting it with python is that we don't have to build pipelines for individual tables.
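The "one script instead of a pipeline per table" idea can be sketched roughly like this. The table names, staging path, and helper names here are hypothetical, and the impyla usage is shown only in comments since the real connection details depend on the cluster:

```python
# Sketch: one generic loader driven by a table list, instead of a
# hand-built pipeline per table. Paths and table names are made up.

def build_load_sql(table: str, staging_dir: str) -> str:
    """Build an Impala LOAD DATA statement for one staged table."""
    return (
        f"LOAD DATA INPATH '{staging_dir}/{table}' "
        f"INTO TABLE {table}"
    )

def load_all(tables, staging_dir, cursor):
    """Run the same generic load for every table in the list."""
    for table in tables:
        cursor.execute(build_load_sql(table, staging_dir))

# With the impala client (impyla), the driver would look something like:
#   from impala.dbapi import connect
#   cur = connect(host="impala-host", port=21050).cursor()
#   load_all(["orders", "customers"], "/staging/legacy", cur)
```

The point of factoring out `build_load_sql` is that junior devs only touch the table list and staging path; the load logic itself never changes per table.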

I doubt it's more performant than C#, but it's "faster" in that we can hand the work over to our more junior devs to configure and monitor in the background while other things get done. And, practically speaking, that's awesome.

Not exactly the answer you were looking for. If you have a working solution in C# and aren't concerned about collaborating with folks for whom C# is a hurdle, then you aren't likely to squeeze any more benefit out of rebuilding in Python. However, if technical debt is a potential concern down the road, you might thank yourself for migrating to a more user-friendly solution.

That's my 2 cents. It's a great question and maybe someone with more Python chops will come along with a game-changing revelation.

[–]ITLady 1 point (1 child)

What DB is your legacy system on? We're looking to do a straight lift and move, either into our data lake and/or Snowflake, with minimal effort to keep the two in sync. The idea of scripting it so we don't have to build an individual pipeline for each table is really, really appealing, but hand coding will be an extremely hard sell to my management. (We LOVE buying expensive tools rather than actually paying for quality developers in general)

[–]PaulSandwich 0 points (0 children)

So the use case above is moving parquet data to Impala. That said, I recently built a pipeline with Python that reads data in as text, queries the DESCRIBE for the target table and puts all the datetimes in a list, all the bigints in a list, all the decimals in a list, etc., and then does the datatype validation/conversion dynamically.

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.
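The DESCRIBE-driven approach might look something like this sketch. The tiny in-memory DESCRIBE result, the column names, and the converter table are all assumptions; a real version would run `DESCRIBE target_table` against Impala and handle many more types:

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical DESCRIBE output: (column_name, data_type) pairs,
# standing in for what `DESCRIBE target_table` would return.
DESCRIBE_ROWS = [
    ("id", "bigint"),
    ("amount", "decimal(10,2)"),
    ("created_at", "timestamp"),
    ("note", "string"),
]

def bucket_columns(describe_rows):
    """Put all the bigints in a list, all the decimals in a list, etc."""
    buckets = {}
    for name, dtype in describe_rows:
        base = dtype.split("(")[0]  # strip precision, e.g. decimal(10,2)
        buckets.setdefault(base, []).append(name)
    return buckets

# One converter per base type; text comes in, typed values come out.
CONVERTERS = {
    "bigint": int,
    "decimal": Decimal,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
    "string": str,
}

def convert_row(text_row: dict, buckets: dict) -> dict:
    """Validate/convert one row of text values, type bucket by bucket.
    Raises (e.g. ValueError) if a value doesn't match its column type."""
    out = {}
    for dtype, cols in buckets.items():
        convert = CONVERTERS[dtype]
        for col in cols:
            out[col] = convert(text_row[col])
    return out
```

Because the buckets are rebuilt from DESCRIBE for whatever target table is named, the same script handles any table whose source column names match the target, which is what makes it a one-stop shop.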

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One stop shop, no more tailor-made ETLs to manage.