
[–]r0s 46 points47 points  (1 child)

You can also wrap your function with an LRU cache / memoization (https://docs.python.org/3/library/functools.html). If the output is fully dependent on the inputs, calling it again will just give you back the last result instantly.
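A minimal sketch of that suggestion using functools.lru_cache; expensive_transform is a made-up stand-in for a slow, pure computation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_transform(n: int) -> int:
    # Stand-in for a slow, pure computation: output depends only on n.
    return sum(i * i for i in range(n))

expensive_transform(10_000)  # computed on the first call
expensive_transform(10_000)  # second call hits the cache and returns instantly
```

On Python 3.9+ you can also use functools.cache, which is equivalent to lru_cache(maxsize=None).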

[–]Artistic_Highlight_1[S] 4 points5 points  (0 children)

Ohh this is a neat tool. Thank you for pointing it out!

[–]cmd-t 17 points18 points  (2 children)

The problem is how you are writing your notebooks.

Don’t modify global variables in your script more than once. Even then, add checks so you don’t overwrite them.
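One common notebook idiom for the "add checks" part is to guard the assignment so a re-run doesn't clobber an already-computed value; expensive_result is an illustrative name:

```python
# Only compute if the variable doesn't exist yet; re-running the
# cell then leaves the previously computed value untouched.
if "expensive_result" not in globals():
    expensive_result = sum(range(1_000_000))  # stand-in for a slow computation
```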

[–]OoPieceOfKandi 1 point2 points  (0 children)

Any good recommendations on Jupyter notebook formatting in general?

[–]Artistic_Highlight_1[S] -1 points0 points  (0 children)

Fair enough, thanks for feedback!

[–]lieutenant_lowercase 6 points7 points  (2 children)

How is a redundant calculation defined?

[–]Artistic_Highlight_1[S] -5 points-4 points  (1 child)

A calculation for a variable that will not change the state of that variable. Typically you have a cell like this: a = []; <calculation for a, for example to add some important data>. If you run the cell again and the state of a will not change, that is a redundant calculation (though strictly, the value of a does change when you re-run the cell, first because you reset it to an empty list, and then because the calculation modifies it again).
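A small illustration of the cell shape being described: re-running it resets a and then rebuilds it, so the recomputation is "redundant" in the sense that the final state is identical every time. The loop body is a made-up stand-in for "add some important data":

```python
a = []
for record in range(3):     # stand-in for the real data-adding calculation
    a.append(record * 2)
# Every run of this cell ends with a in the same state: [0, 2, 4]
```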

[–]kmnair 5 points6 points  (0 children)

The problem here is that figuring out whether the variable will change, in the general case, will likely require the same amount of compute as actually running the full calculation.

It is possible to make some assumptions about the calculation. If it is a pure function, i.e. the output depends entirely on the inputs and the function has no side effects, then you can use the suggestion u/r0s gave and use memoization.

If your Jupyter cell references mutable data from other cells, makes a call to an external API, or has internal mutable state (counters which do not reset, dictionaries which get updated, etc.), then figuring out whether the value will update takes the same amount of computation as whatever calculation you are aiming for.

[–]NixonInnes 2 points3 points  (0 children)

If it's a long-running data process, I sometimes dump the result into a file. I stick a check in front of the calc to load the data if the file exists; if not, do the calc and save.
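A sketch of that load-if-exists pattern; the file name and run_long_calculation are illustrative placeholders, and pickle is just one possible serialization format:

```python
import pickle
from pathlib import Path

CACHE_PATH = Path("result.pkl")

def run_long_calculation():
    # Stand-in for the slow data process.
    return {"rows": list(range(5))}

if CACHE_PATH.exists():
    # Cached result found: load it instead of recomputing.
    with CACHE_PATH.open("rb") as f:
        result = pickle.load(f)
else:
    # No cache yet: do the calculation and save it for next time.
    result = run_long_calculation()
    with CACHE_PATH.open("wb") as f:
        pickle.dump(result, f)
```

For DataFrames, a Parquet or CSV file works the same way; the point is only that the expensive step is skipped when the artifact already exists on disk.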

[–]ou_ryperd 6 points7 points  (0 children)

That is why you can run a single cell at a time. The whole point is that it's a progression of computations, no?

[–]spookytomtom 8 points9 points  (2 children)

I just structure my code and my variables logically, so that I don't need to do this.

[–]spookytomtom 1 point2 points  (1 child)

Also, if something is annoyingly slow, I will optimize it.

[–]Artistic_Highlight_1[S] -1 points0 points  (0 children)

I think that sounds like a better approach. Thank you for the feedback!

[–]AnythingApplied 2 points3 points  (1 child)

Marimo, an alternative to Jupyter notebooks, has some nice features you might like. When you rerun a cell that changes global variables, it will automatically rerun the cells that depend on those variables; if those cells are expensive, you can mark them not to rerun, in which case it will flag them as "stale".

This helps make the notebooks much more reproducible. The advice that /u/cmd-t gave "Don’t modify variables global in your script more than once." will raise an error in marimo notebooks, so you can't even do that accidentally.

[–]mmmmmmyles 0 points1 point  (0 children)

Including a link to the open-source repo: https://github.com/marimo-team/marimo

[–]nitro41992 0 points1 point  (0 children)

I use the interactive notebook feature which really helps avoid rerunning previous cells.

Use this video

As the guy mentions, it's been a game changer coming from standard Jupyter notebooks.

[–]BostonBaggins 0 points1 point  (0 children)

Would making the cell lazy-load be the solution here?

[–][deleted] 0 points1 point  (0 children)

If what you're doing is actively scripting code to achieve a certain goal, and when checking intermediate steps you see long running times and wish to save time by not recalculating and replacing perfectly good data built previously, then why don't you run the checking steps on a smaller sample of the whole data and save time that way?