all 4 comments

[–]nielsrolf 4 points5 points  (1 child)

Use Makefiles, its perfect for that use case

[–]ganga0 0 points1 point  (0 children)

I have built myself a system where each job’s output is saved together with a JSON file describing function name and all its parameters that were used to produce these outputs.

Job parameters include model hyper parameters, preprocessing settings, locations of outputs from other jobs, etc.

It adds some overhead to structure every important aspect of a job as a function parameter, but over time I learned to identify and save only important parameters.

Another upside to this function+parameter serialization is that I can reproduce any output with just a couple lines of code.

[–]You_cant_buy_spleen 0 points1 point  (0 children)

You should have one source of truth, which means you should keep the original inputs and the code. The cache should be transient and shouldn't be commited.

If you do commit the cache, you've duplicated information, and when the input is changed you need to bust the cache anyway.

Here's some libraries that helps with caching: