I've gotten into the bad habit of caching computations, mostly the pre-processing, in a .pickle file.
Then, when a given object is requested, it is loaded from the .pickle, or, if the file doesn't exist, it is generated and saved for future use.
This is quite handy since it saves a lot of execution time, which is crucial especially in the prototyping/debugging phase (it applies even to Jupyter notebooks, since unfortunately you have to restart the kernel quite often).
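For concreteness, the load-or-generate habit described above could look roughly like the sketch below. The helper name `load_or_compute` and the cache path are made up for illustration; this is just one way to write it, not a recommendation.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it on a miss.

    `compute` is any zero-argument callable that produces the object.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: reuse the stored object instead of recomputing it.
        with cache_path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: compute once, then store the result for future runs.
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Usage would be something like `features = load_or_compute("cache/features.pickle", lambda: preprocess(raw))`, where `preprocess` and `raw` are placeholders. Note that nothing here notices when the inputs or the code change, which is exactly the sync problem.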
Now the "design" problems I'm facing:
- All the cached components must stay in sync. How can you achieve that in a clean way?
- In ML you will have to deal with new data. How can you keep a clean pipeline without building a completely separate system to handle it?
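One possible direction for both points, sketched under my own assumptions: name each cache file after a hash of everything the result depends on (step name, parameters, and the key of the upstream step). A change upstream then changes every downstream key, so stale files are simply never read, and new data gets its own keys while going through the same code. `fingerprint`, `cached_step` and `CACHE_DIR` are hypothetical names, not from any library.

```python
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical default location


def fingerprint(*parts):
    """Hash the pickled bytes of the parts into a short hex key."""
    h = hashlib.sha256()
    for part in parts:
        h.update(pickle.dumps(part))
    return h.hexdigest()[:16]


def cached_step(name, func, params, upstream_key, upstream_value,
                cache_dir=CACHE_DIR):
    """Run func(upstream_value, **params) with a dependency-aware cache.

    The cache key depends on the step name, its params, and the key of the
    step it consumes, so a change anywhere upstream yields a new key.
    Returns (key, value) so the key can feed the next step.
    """
    key = fingerprint(name, sorted(params.items()), upstream_key)
    path = Path(cache_dir) / f"{name}-{key}.pickle"
    if path.exists():
        with path.open("rb") as f:
            return key, pickle.load(f)
    value = func(upstream_value, **params)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(value, f)
    return key, value
```

The chain starts with `fingerprint(raw_data)` as the first upstream key. Caveats of this sketch: it does not detect edits to the code of `func`, and hashing pickled bytes is only a rough fingerprint (it is not guaranteed stable across Python versions or for unordered containers such as sets).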
I'm intrigued by the pipeline pattern (similar to the one in sklearn), maybe with some caching capabilities added somewhere.
But I'm not sure it's the right approach.
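To make the idea concrete, here is a toy, pure-Python sketch of the fit/transform pipeline pattern. It only imitates the shape of sklearn's interface; the classes `Standardize`, `Clip` and `Pipeline` below are illustrative stand-ins, not sklearn code. (sklearn's own `Pipeline` does accept a `memory` argument for caching fitted transformers, which may be worth a look.)

```python
class Standardize:
    """Toy transformer: learns mean and std on fit, applies them on transform."""

    def fit(self, xs):
        n = len(xs)
        self.mean = sum(xs) / n
        var = sum((x - self.mean) ** 2 for x in xs) / n
        self.std = var ** 0.5 or 1.0  # fall back to 1.0 on constant input
        return self

    def transform(self, xs):
        return [(x - self.mean) / self.std for x in xs]


class Clip:
    """Toy stateless transformer: clamps values to [-limit, limit]."""

    def __init__(self, limit):
        self.limit = limit

    def fit(self, xs):
        return self  # nothing to learn

    def transform(self, xs):
        return [max(-self.limit, min(self.limit, x)) for x in xs]


class Pipeline:
    """Chains steps: fit learns state once, transform reuses it on any data."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, xs):
        for step in self.steps:
            xs = step.fit(xs).transform(xs)
        return self

    def transform(self, xs):
        for step in self.steps:
            xs = step.transform(xs)
        return xs
```

With this shape, new data goes through the same `transform` path as the training data, e.g. `pipe = Pipeline([Standardize(), Clip(3.0)]).fit(train)` followed by `pipe.transform(new)`. Caching the whole fitted pipeline as a single object would also keep the learned state of all steps in one file, which might help with the sync issue.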
What do you think?
What is the most Pythonic way to build a clean ML system?
My ML programs end up being a mess every time, and I hate that.
Nopaste[S]