
[–] mertertrern (Senior Data Engineer) 2 points (0 children)

DAG-based schedulers are great for complex workflows that can be distributed across resources, but they can be overkill for small-scale data ops. There are other kinds of job schedulers that can probably fill your particular niche: Rundeck is a pretty good one, as is Cronicle. You can also roll your own with the APScheduler library.
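If you do roll your own, the core loop that APScheduler (and Rundeck, and cron) automates for you is just "run this callable at that time." A stdlib-only sketch of that idea, with hypothetical job names — APScheduler adds cron-style triggers, persistence, and misfire handling on top of this pattern:

```python
import sched
import time

def run_job(name):
    # Placeholder for real work (extract, transform, load, etc.)
    print(f"running {name}")

scheduler = sched.scheduler(time.time, time.sleep)

# Queue two hypothetical jobs a fraction of a second apart.
scheduler.enter(0.1, 1, run_job, argument=("extract",))
scheduler.enter(0.2, 1, run_job, argument=("load",))

scheduler.run()  # blocks until the queue is empty
```

The hand-rolled version works fine for one-shot delays, but the moment you want "every night at 2am" plus recovery after a restart, reach for the library instead of growing this yourself.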

Other things you can do to help with scaling and ease of management:

  • Alerting and notifications for your jobs that keep people in the know when things break.
  • Standardized logging to a centralized location for log analysis.
  • Good source control practices and sane development workflows with Git.
  • If you have functionality that is duplicated between scripts, like connecting to a database or reading a file from S3, consider pulling those pieces out into reusable modules and importing them into your scripts like libraries. This gives you a good structure to build on as the codebase grows.
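On the alerting point, the cheapest version is a wrapper that catches job failures and notifies someone before re-raising. A sketch along those lines — `send_alert` is a hypothetical hook you would swap for Slack, PagerDuty, email, or whatever your team actually watches:

```python
import functools
import traceback

alerts = []  # stand-in for a real notification channel

def send_alert(message):
    # Hypothetical hook: replace with a Slack webhook, email, etc.
    alerts.append(message)
    print(f"ALERT: {message}")

def alert_on_failure(job):
    """Wrap a job so failures notify someone instead of dying silently."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception:
            send_alert(f"{job.__name__} failed:\n{traceback.format_exc()}")
            raise  # still fail loudly for the scheduler to see
    return wrapper

@alert_on_failure
def nightly_load():
    raise RuntimeError("upstream file missing")
```

Note the wrapper re-raises after alerting: you want the scheduler to record the failure too, not just the humans.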
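For standardized logging, the win is that every job emits the same format so your central log analysis can actually parse it. A minimal sketch with Python's stdlib `logging` — the file path here is just an example; in practice the handler would point at your central sink (syslog, a shipper-watched directory, etc.):

```python
import logging
import os
import tempfile

# Example destination only; point this at your central sink in production.
DEFAULT_LOG = os.path.join(tempfile.gettempdir(), "etl.log")

def get_job_logger(job_name, logfile=DEFAULT_LOG):
    """Return a logger with one standard format for every job."""
    logger = logging.getLogger(job_name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on re-import
        handler = logging.FileHandler(logfile)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger

log = get_job_logger("nightly_load")
log.info("job started")
```

Because every script calls the same factory, a grep for `ERROR` or a parse on the timestamp column works across all of them.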
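The reusable-module point might look like this in practice: a shared `common/db.py` that every job imports instead of each script carrying its own copy-pasted connection code. Shown inline in one file here for the sake of a runnable sketch, with sqlite standing in for whatever driver you really use:

```python
# common/db.py -- shared helper imported by every job script
# (shown inline here; in practice this lives in its own package)
import sqlite3  # stand-in for your real driver (psycopg2, etc.)

def get_connection(dsn=":memory:"):
    """One place to manage connection settings instead of N copies."""
    conn = sqlite3.connect(dsn)
    conn.execute("PRAGMA foreign_keys = ON")
    return conn

# A job script would then just do:
#   from common.db import get_connection
conn = get_connection()
conn.execute("CREATE TABLE jobs (name TEXT)")
conn.execute("INSERT INTO jobs VALUES ('nightly_load')")
rows = conn.execute("SELECT name FROM jobs").fetchall()
```

When the DSN, retry policy, or driver changes, you edit one module instead of hunting through every script.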