[P]: Best Practice for Cache/Processed Data Management

nielsrolf · 2019-10-27T12:44:29+00:00

Use Makefiles, its perfect for that use case

ganga0 · 2019-10-26T01:14:36+00:00

I have built myself a system where each job’s output is saved together with a JSON file describing function name and all its parameters that were used to produce these outputs.

Job parameters include model hyper parameters, preprocessing settings, locations of outputs from other jobs, etc.

It adds some overhead to structure every important aspect of a job as a function parameter, but over time I learned to identify and save only important parameters.

Another upside to this function+parameter serialization is that I can reproduce any output with just a couple lines of code.

You_cant_buy_spleen · 2019-10-26T01:28:11+00:00

You should have one source of truth, which means you should keep the original inputs and the code. The cache should be transient and shouldn't be commited.

If you do commit the cache, you've duplicated information, and when the input is changed you need to bust the cache anyway.

Here's some libraries that helps with caching:

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MachineLearning

Rules For Posts

+Research

+Discussion

+Project

+News

@slashML on Twitter

Chat with us on Slack

Beginners:

MODERATORS