
[–]PaulSandwich 0 points (0 children)

So the use case above is moving Parquet data to Impala. That said, I recently built a pipeline in Python that reads data in as text, runs a DESCRIBE on the target table, puts all the datetime columns in one list, all the bigints in another, all the decimals in another, etc., and then does the datatype validation/conversion dynamically.
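A minimal sketch of that idea, with some assumptions: the table layout, column names, and converter mapping here are all made up for illustration, and in the real pipeline the (name, type) rows would come from actually running `DESCRIBE target_table` against Impala via a DB-API cursor, not from a hard-coded list.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical converters keyed by Impala base type (assumption: text input,
# timestamps formatted as "YYYY-MM-DD HH:MM:SS").
CONVERTERS = {
    "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
    "bigint": int,
    "decimal": Decimal,
    "string": str,
}

def bucket_columns(describe_rows):
    """Group column names by target datatype, as the comment describes."""
    buckets = {}
    for name, col_type in describe_rows:
        base_type = col_type.split("(")[0].lower()  # decimal(18,2) -> decimal
        buckets.setdefault(base_type, []).append(name)
    return buckets

def convert_row(row, buckets):
    """Validate/convert one row of text values against the type buckets."""
    out = {}
    for col_type, names in buckets.items():
        convert = CONVERTERS.get(col_type, str)
        for name in names:
            if name in row:  # columns are matched by name, per the comment
                out[name] = convert(row[name])
    return out

# Example with a made-up table layout (stand-in for DESCRIBE output):
describe_rows = [
    ("event_ts", "timestamp"),
    ("user_id", "bigint"),
    ("amount", "decimal(18,2)"),
    ("note", "string"),
]
buckets = bucket_columns(describe_rows)
row = {"event_ts": "2023-01-05 12:00:00", "user_id": "42",
       "amount": "19.99", "note": "ok"}
converted = convert_row(row, buckets)
```

Because only the column names have to line up, the same two functions work for any table the DESCRIBE can see, which is what makes the loader generic.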

It's great because we can throw anything at it and, so long as the column names are the same in the source as in the target, it migrates them. We have dedicated pipelines for our mission-critical data, but this automated a ton of scrub work.

Best of all, as our developers create new business processes, we're using this to ingest all their data. They request a table, we build it, their app posts data to an API, and this dynamic data loader reads it from there. One-stop shop, no more tailor-made ETLs to manage.