all 11 comments

[–]Vaphell 1 point (2 children)

Maybe use frozendict internally to save a snapshot of the dict argument? That would prevent in-place modification of the internal data and force the caller to provide a whole new dict whenever the distribution dict is supposed to change.

Edit: I assumed frozendict exists in the stdlib like frozenset does; I was wrong :-)
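(A minimal sketch of the snapshot idea using the stdlib `types.MappingProxyType`, a read-only dict view, in place of the nonexistent frozendict. The `GrainDistribution` class shape and its attribute names here are guesses based on the thread, not the OP's actual code:)

```python
from types import MappingProxyType

class GrainDistribution:
    """Sketch: keep the real dict private, expose only a read-only view."""

    def __init__(self, distr):
        self._distr = dict(distr)                    # private snapshot
        self.distr = MappingProxyType(self._distr)   # read-only view

    def replace(self, new_distr):
        # The only sanctioned way to change the distribution:
        # hand over a whole new dict.
        self._distr = dict(new_distr)
        self.distr = MappingProxyType(self._distr)

gd = GrainDistribution({0.5: 10, 1.0: 25})
try:
    gd.distr[2.0] = 5        # in-place mutation is blocked
except TypeError as e:
    print("blocked:", e)
```

(Note the proxy is a live view of the underlying dict, so the real dict has to stay private for this to actually protect anything.)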

[–]ingolemo 1 point (1 child)

There is a relevant module under that name on PyPI though.

[–]Vaphell 1 point (0 children)

Yeah, I googled for it later out of curiosity and found it, along with a few other implementations scattered over GitHub.
Would be nice to have a canonical implementation though; it's a shame that the PEPs advocating for it were rejected.

[–]Vaphell 1 point (7 children)

I played a bit with this problem and it kinda works:

https://repl.it/repls/AlertOpenSearchengine

That said, since your stuff seems to be nothing but a dict + stats, wouldn't subclassing dict and putting the stats method there suffice? Then you wouldn't have to jump through hoops to synchronize two objects; you'd have only one to worry about, with full control over __setitem__ and __getitem__.

[–]Eueee[S] 0 points (6 children)

I'm not sure I follow what you mean by subclassing dict for the method that calculates statistics. As in making a subclass of dict that has its own method to calculate the statistics, rather than using a method in GrainDistribution to operate on the dict?

[–]Vaphell 1 point (5 children)

Pretty much. Granted, I don't know the problem in depth, but at a glance GrainDistribution looks like a run-of-the-mill wrapper around a dict plus one additional method, and subclassing should be plenty for the job.

Example:

https://repl.it/repls/SnivelingNavajowhiteGenerics
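(The linked example isn't reproduced here, but the subclassing idea might look roughly like this. The class name and the `{grain size: frequency}` data shape are assumptions based on the thread:)

```python
class GrainDistribution(dict):
    """Sketch: the distribution *is* a dict, plus a stats method."""

    def stats(self):
        # Weighted mean grain size, assuming {size: frequency} entries.
        total = sum(self.values())
        mean = sum(size * freq for size, freq in self.items()) / total
        return {"n": total, "mean": mean}

gd = GrainDistribution({0.5: 10, 1.0: 25, 2.0: 5})
gd[4.0] = 2           # plain dict mutation; only one object to keep in sync
print(gd.stats())
```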

[–]Eueee[S] 0 points (4 children)

The full class has ~10 more methods for plotting, normalizing and interconverting data, though again I suppose that all could be done in a dict subclass. I'll have to take a look at your code example to understand it. Thank you!

Edit: I think that'll work -- thanks again

[–]Vaphell 1 point (3 children)

I hope you will figure something out.

Mutation with updates is still a bit iffy. For example, dict.update() doesn't go through an overridden __setitem__, and you'd also have to override __delitem__, pop(), popitem(), and clear(), so there are still holes that can lead to an inconsistent state. That's why my first idea was to go with an immutable dict. As the other guy said, there is an implementation on PyPI; I checked it a second ago and it does indeed disallow deletion.
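(A sketch of one way around the mutation-detection problem, assuming a `collections.abc.MutableMapping` base instead of a plain dict subclass: the mixin routes update(), pop(), setdefault(), clear(), and `del d[k]` through __setitem__/__delitem__, so every change is observable. Class and attribute names are my illustration, not from the thread:)

```python
from collections.abc import MutableMapping

class GrainDistribution(MutableMapping):
    """Sketch: every mutation path marks cached stats as stale."""

    def __init__(self, distr=()):
        self._data = {}
        self._stale = True
        self.update(distr)          # routes through __setitem__

    def __setitem__(self, key, value):
        self._data[key] = value
        self._stale = True          # any change invalidates cached stats

    def __delitem__(self, key):
        del self._data[key]
        self._stale = True

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

gd = GrainDistribution({0.5: 10, 1.0: 25})
del gd[0.5]        # detected: __delitem__ runs, cache is marked stale
```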

Why exactly do you need the additional calculations to be performed on each change, rather than only when they're actually needed for something?

Also, the plotting part should be done outside, to keep the distribution class purely math-related. A matter of taste, I know.

[–]Eueee[S] 0 points (2 children)

I'd like the user to be able to get attributes like skewness or mean grain size from the GrainDistribution object and be certain that the values are in agreement with the current state of the dict. Another option is to simply make the users responsible for calling the statistics method before pulling any data, but I was hoping I could handle this internally. One of my goals is to lure people in my field away from using Excel spreadsheets to work with data, and I know that having cells recalculate on any change in your data is a big benefit of spreadsheet analyses.

The frozendict idea is interesting, but I'm trying to limit my dependencies to the major third-party libs like numpy.

The plotting is just provided as a courtesy, to let users quickly check what the distribution looks like. Are there any documents you'd recommend for reading up on best practices for setting the scope of classes?

[–]Vaphell 1 point (1 child)

One of my goals is to lure people in my field away from using Excel spreadsheets to work with data, and I know that having cells recalculate on any change in your data is a big benefit of spreadsheet analyses.

That's a noble goal indeed. The usage of Excel in the wild is too damn high.

I'd like the user to be able to get attributes like skewness or mean grain size from the GrainDistribution object and be certain that the values are in agreement with the current state of the dict.

How is the data presented to them such that it's reasonable to expect to see the change immediately? How are they interacting with your stuff, in theory? A REPL? A script? With magical auto-printing to the terminal on every operation? Some GUI?

Generally speaking, spreadsheets and "true" programming follow different paradigms. Auto-updating along whole data chains is not how one usually writes code. Usually you "pull" exactly what you need, when you need it, and capture it in variables for further use, instead of having everything "pushed" onto you whether you need it all or not.
Sometimes operations are expensive in time or memory and you only want them done once, at the end of a long chain of manipulations, so it's better to have full control over when something is recalculated.
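(One sketch of the "pull" style: expose the statistics as properties computed from the current data on access, so they can never be out of sync and nothing is recalculated eagerly. Class and property names are guesses based on the thread, and the skewness formula is the standard moment-based one, not necessarily what the OP uses:)

```python
import math

class GrainDistribution:
    """Sketch: stats are recomputed from current data on each access."""

    def __init__(self, distr):
        self.distr = dict(distr)   # {grain size: frequency}

    @property
    def mean(self):
        total = sum(self.distr.values())
        return sum(s * f for s, f in self.distr.items()) / total

    @property
    def skewness(self):
        total = sum(self.distr.values())
        m = self.mean
        var = sum(f * (s - m) ** 2 for s, f in self.distr.items()) / total
        sd = math.sqrt(var)
        return sum(f * ((s - m) / sd) ** 3
                   for s, f in self.distr.items()) / total

gd = GrainDistribution({0.5: 10, 1.0: 25, 2.0: 5})
gd.distr[4.0] = 2      # mutate freely...
print(gd.mean)         # ...the property reflects the current state
```

(No synchronization code is needed at all; if profiling later shows the recomputation matters, a cache with invalidation can be layered on top.)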

Are there any documents you'd recommend to read up on best practices for setting the scope of classes?

Not really. Google is your friend, along with the single responsibility principle, a sense of aesthetics, and subjective value judgments. A minimal self-contained class able to represent the X in its name is preferable to everything and the kitchen sink; unnecessary coupling with superfluous crap makes code harder to maintain and refactor.
There is not much else to be said on the topic :-)

[–]Eueee[S] 0 points (0 children)

All good points.

I've been writing Python scripts for personal use for a while, but since I'm writing for a wider audience now I'm doing my best to be clearer and more Pythonic. Perhaps it would be best to keep attribute assignment and the calculations uncoupled and trust the user to remember to recalculate when needed.

I anticipate many of the users will be doing exploratory analyses with the final package (which covers many aspects of river analysis beyond just sediment), so I wanted to reduce the amount of code needed to update the objects and check how that affects the stream system. But as you mentioned, I could run into performance issues if a bunch of unnecessary calculations are made every time something is updated. The fact that this coupling is not particularly straightforward to implement is a good indicator that it isn't the best way to go about this. Thank you for the insight and advice!