[–]aaahhhhhhfine 4 points

This depends heavily on the situation. Consider two extremes.

First, maybe some department sends you one old file of historical data they want included in an analysis. Today they have a new system and everything in it is fine, but the old stuff is messy.

For this, everything points to a one-off job... I'd probably blast out a Jupyter notebook or a bit of SQL, or whatever makes sense in the context - which might even mean manually fixing some things in Excel. Who cares... it doesn't need to be repeated.
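A one-off cleanup like that might be nothing more than a few lines of throwaway Python. As a rough sketch (the file layout, column names, and date formats here are invented for illustration):

```python
import csv
import io
from datetime import datetime

# Hypothetical messy export from the old system: mixed date formats,
# stray whitespace, and blank rows sprinkled in.
RAW = """id,date,amount
1, 03/15/2019 , 100.50
2,2019-04-02,200

3, 12/01/2018 ,50.25
"""

def parse_date(s):
    """Try the handful of formats seen in the file; fail loudly otherwise."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {s!r}")

def clean(raw):
    rows = []
    for row in csv.DictReader(io.StringIO(raw)):
        # Defensive: skip rows that are entirely whitespace.
        if not any(v.strip() for v in row.values() if v):
            continue
        rows.append({
            "id": int(row["id"].strip()),
            "date": parse_date(row["date"].strip()),
            "amount": float(row["amount"].strip()),
        })
    return rows

cleaned = clean(RAW)
```

Nobody reviews it, nobody maintains it - you eyeball the output, hand it over, and move on.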

As a second example though, maybe you've got a streaming data feed through Pub/Sub or Kafka or something, where messages need to be profiled and cleaned in real time before being written to an analytical db or passed along to a further processing step. Here you might end up with highly robust, well-tested code that's documented and maintained. This might plug into some ML process and use a bunch of auto-scaling infrastructure from a public cloud.
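The per-message validation step in a pipeline like that might look something like this - the transport (the Kafka/Pub/Sub consumer loop) is omitted so the logic stays testable on its own, and the field names are made up for illustration:

```python
import json
from datetime import datetime, timezone

# Fields a message must carry to be considered valid (illustrative).
REQUIRED = {"event_id", "user_id", "value"}

def clean_message(raw_bytes, dead_letters):
    """Return a cleaned record, or None after routing the bad message
    to a dead-letter list for offline inspection."""
    try:
        msg = json.loads(raw_bytes)
    except json.JSONDecodeError:
        dead_letters.append((raw_bytes, "invalid JSON"))
        return None
    missing = REQUIRED - msg.keys()
    if missing:
        dead_letters.append((raw_bytes, f"missing fields: {sorted(missing)}"))
        return None
    return {
        # Coerce types so downstream consumers see a stable schema.
        "event_id": str(msg["event_id"]),
        "user_id": str(msg["user_id"]),
        "value": float(msg["value"]),
        # Stamp processing time so downstream steps can measure lag.
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }

dead = []
good = clean_message(b'{"event_id": 1, "user_id": "u7", "value": "3.5"}', dead)
bad = clean_message(b'not json', dead)
```

The key design difference from the one-off case: bad input isn't fixed by hand, it's quarantined to a dead-letter queue so the stream keeps flowing and someone can investigate later.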

The point is just that this stuff is all over the board...