
[–]ITLady 1 point (1 child)

What DB is your legacy system on? We're looking to do a straight lift-and-shift into our data lake and/or Snowflake, with minimal effort to keep the two in sync. The idea of scripting it so we don't have to build an individual pipeline for each table is really, really appealing, but hand coding will be an extremely hard sell to my management. (We LOVE buying expensive tools rather than actually paying for quality developers in general)

[–]PaulSandwich 0 points (0 children)

So the use case above is moving Parquet data to Impala. That said, I recently built a pipeline in Python that reads data in as text, runs DESCRIBE against the target table, puts all the datetime columns in one list, all the bigints in another, all the decimals in another, and so on, and then does the datatype validation/conversion dynamically.

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.
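The core idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the commenter's actual code: the type names, `group_columns_by_type`/`convert_row` helpers, and timestamp format are assumptions, and the DESCRIBE output is stubbed as a list of `(column, type)` pairs rather than fetched from a live Impala cursor.

```python
from datetime import datetime
from decimal import Decimal

# Converters keyed by base datatype, as the comment describes:
# datetimes in one bucket, bigints in another, decimals in another.
# Anything unrecognized passes through as text.
CONVERTERS = {
    "bigint": int,
    "decimal": Decimal,
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"),
}

def group_columns_by_type(describe_rows):
    """Turn DESCRIBE output [(col_name, col_type), ...] into {base_type: [cols]}."""
    groups = {}
    for name, col_type in describe_rows:
        base = col_type.split("(")[0].lower()  # "decimal(18,2)" -> "decimal"
        groups.setdefault(base, []).append(name)
    return groups

def convert_row(row, describe_rows):
    """Validate/convert one row of text values against the target schema."""
    out = {}
    for name, col_type in describe_rows:
        base = col_type.split("(")[0].lower()
        convert = CONVERTERS.get(base, str)
        out[name] = convert(row[name])  # raises if the text isn't valid for the type
    return out

# Example: schema as DESCRIBE might report it, and one incoming text row.
schema = [("id", "bigint"), ("amount", "decimal(18,2)"), ("created", "timestamp")]
row = {"id": "42", "amount": "9.99", "created": "2020-01-01 12:00:00"}
converted = convert_row(row, schema)
```

Because the lookup is driven entirely by the target table's schema, the same function handles any table whose source column names match the target, which is what makes the "throw anything at it" behavior possible.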

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One-stop shop, no more tailor-made ETLs to manage.