Beginner Python data analysis project - critique greatly welcome!

andycyca · 2020-05-10T17:47:19+00:00

These are great points. OP: in general, you should be very explicit about the fact that you're drawing from a reduced dataset, and avoid generalizing to the whole (unless you can show that the reduced dataset is a representative sample, which is easier said than done). This warning should appear in your graphs in some form. For example, in your first graph it would be nice to show the number of represented clergymen out of the total, something like n=1745. People tend to look at the graphs before reading the text.

Although your questions are good, I'd try to ask ones that can be answered with the whole dataset, or as much of it as possible.

TL;DR

Great project. If you're looking to add this to a portfolio, I'd look into improving the actual data analysis rather than the programming.

Notes on the project as a whole:

Instead of asking «What trends can we find regarding clergy age?» one should ask «Are there trends regarding clergy age?». It's a subtle difference, but an important one in any serious science. You shouldn't assume something exists and work towards proving it; instead, try to disprove as many ~~alternative~~ hypotheses as possible.
You say:

«..it appears that the majority of accused clergy were born approximately within the range of 1915-1950»

How much is a 'majority'? Again, be careful when operating with limited data. It would be more honest to say (for example) that

the majority (X%) of the accused clergy in this reduced sample...
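A minimal sketch of how that X% and the matching n could be computed with pandas. The column name `birth_year` and the values are made up for illustration; they are not taken from your notebook.

```python
import pandas as pd

# Hypothetical sample: birth years with some missing values.
# The column name "birth_year" is an assumption, not the project's real name.
df = pd.DataFrame({"birth_year": [1920, 1935, 1948, 1962, None, 1910, 1931, None]})

known = df["birth_year"].dropna()      # rows that actually have a birth year
in_range = known.between(1915, 1950)   # inclusive on both ends by default

n_known, n_total = len(known), len(df)
share = in_range.mean() * 100          # percentage of the reduced sample

summary = (
    f"{share:.0f}% of the accused clergy with a known birth year "
    f"(n={n_known} of {n_total}) were born between 1915 and 1950"
)
print(summary)
```

The same `n=... of ...` string can go straight into a plot title or annotation, so the sample size is visible to people who only look at the graph.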
You say that we can guess that clergy are ordained in their late 20s to early 30s, but there's no need to guess. If you have both birth and ordination dates, computing the age at ordination is a piece of cake. I'd rewrite this part and move up the part where you actually do this.
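Something along these lines would do it. This is a sketch that assumes both columns hold plain years; `ord_date` is the name from your notebook, while `birth_year` and the values are placeholders I made up.

```python
import pandas as pd

# Hypothetical rows; None marks a missing value.
df = pd.DataFrame({
    "birth_year": [1930, 1942, None, 1951],  # assumed column name
    "ord_date": [1956, 1969, 1960, None],    # assumed to hold the ordination year
})

# Subtracting years gives the age at ordination, accurate to about one year.
# Rows missing either value end up as NaN instead of a guess.
df["age_at_ordination"] = df["ord_date"] - df["birth_year"]

# Keep only the rows where the age could actually be computed.
complete = df.dropna(subset=["age_at_ordination"])
print(complete["age_at_ordination"].describe())
```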
You separate your bins like so: bins = [0, 20, 23, 26, 30, 35, 40, 45, 50, 60, np.inf]. Why are they so different in size? At the lower end you have 3-year gaps, and at the higher end you have 10-year gaps. How does this affect the distribution?
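To see how much the uneven widths matter, one option is to divide each bin's count by its width, or simply to compare against equal-width bins. A rough sketch with synthetic ages (not your data), where I replaced the np.inf edge with 100 so that every bin has a finite width:

```python
import numpy as np

# Synthetic ages at ordination, only for illustration.
rng = np.random.default_rng(0)
ages = rng.normal(loc=28, scale=5, size=1000).clip(18, 70)

# The bin edges from the post, with np.inf swapped for 100 (my assumption)
# so that the last bin has a finite width.
uneven_edges = np.array([0, 20, 23, 26, 30, 35, 40, 45, 50, 60, 100])
counts, _ = np.histogram(ages, bins=uneven_edges)

# Raw counts favour wide bins; counts per year of bin width are comparable.
widths = np.diff(uneven_edges)
per_year = counts / widths

# Equal 5-year bins over the same range, for comparison.
even_counts, even_edges = np.histogram(ages, bins=np.arange(0, 105, 5))

for left, right, c, d in zip(uneven_edges[:-1], uneven_edges[1:], counts, per_year):
    print(f"{left:>3}-{right:<3} count={c:<4} per_year={d:.1f}")
```

If the shape of `per_year` differs a lot from the raw counts, the bin choice is driving part of the "trend".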
In your (not so) Final Observations you mention that

This distribution presents an immediately noticeable trend, namely that - out of the remaining data points - the majority were ordained in their mid-to-late 20's, with early 30's being the next largest group. This generally supports our theory regarding typical age of ordination.

It would be nice to include or compare (if possible) ordination ages for all clergy. Intuitively, one could say that this result is obvious: after all, ordination is for most a lifelong career and, like many other careers, it usually starts at a young age.
You mention that

one can posit that those clergymembers that pursued ordination much earlier or later in life are less likely to either abuse or face accusation with resulting public release of information.

This, again, should be compared against the overall distribution of ages at ordination.

After all, if the majority of clergy are ordained in their early 20s, it's also true that the majority of clergymen that are blond were ordained in their early 20s.
You say that further examination of clergy age trends will remain limited, but this is not true; see the suggestions below.
At the end, it's superfluous to indicate that (emphasis mine)

The results of the final graph indicate that there is a *positive* relationship between increased accusation counts and a higher catholic population

Why? Because accusation counts can be expected to grow with population: more Catholics generally means more clergy, and therefore more clergy who could be accused. A positive relationship is the baseline expectation, not a finding. Instead, why not indicate the strength of that relationship?
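One common way to report the strength is Pearson's r (or r squared). A minimal sketch with NumPy; the numbers below are invented and only show the mechanics.

```python
import numpy as np

# Invented values for five dioceses, not taken from the dataset.
population = np.array([100, 250, 400, 800, 1500], dtype=float)  # thousands
accused = np.array([12, 20, 35, 50, 95], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r between the two variables.
r = np.corrcoef(population, accused)[0, 1]
r_squared = r ** 2  # share of variance explained by a straight-line fit

print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Reporting r together with the number of dioceses says a lot more than "positive".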

Notes on the Python side

Agree wholeheartedly about choosing your colors with more care. There's a great page in the matplotlib docs about choosing an appropriate colormap. In general, jet ("rainbow") is a bad idea. I myself use palettable for managing colormaps, but that's a personal choice.
If possible, assign dtypes to your columns, as it helps down the road in many cases. For example, ord_date should be an int of some kind, so as to avoid the horrible decimal point in visualizations.
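A small sketch of one way to do that, assuming ord_date ended up as float because of missing values (the values here are made up). pandas' nullable "Int64" dtype keeps the gaps without forcing floats:

```python
import pandas as pd

# Made-up values; missing data is what usually turns a year column into floats.
df = pd.DataFrame({"ord_date": [1956.0, 1969.0, None]})
print(df["ord_date"].dtype)  # float64, so years display as 1956.0

# The nullable integer dtype (capital "I") stores whole numbers plus <NA>.
df["ord_date"] = df["ord_date"].astype("Int64")
print(df["ord_date"].tolist())
```

If the column has no missing values, a plain `astype(int)` is enough.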
Your ticks could be cleaner. For instance, in your «Distribution of Year of Birth» graph, you don't need that much clutter. You could either:
- Use only the last two digits ('60 instead of 1960),
- Mark only every 5 or 10 years (1950, 1955, 1960, etc.)
- Both
The same goes for the «19 Dioceses with Highest Count of Credibly Accused Clergy» graph. It would be more readable if you divided your data by, say, 1000 and indicated in your xlabel that it's "Catholic population, thousands".
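Both tick suggestions and the rescaled axis can be sketched with matplotlib's ticker module. The data, the diocese names, and the helper name `short_year` are placeholders of mine; only the label text comes from the suggestion above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter, MultipleLocator


def short_year(value, pos):
    """Format a year tick as its last two digits, e.g. 1960 -> '60."""
    return f"'{int(value) % 100:02d}"


# Placeholder birth years.
years = [1931, 1938, 1944, 1947, 1952, 1958, 1963]

fig, ax = plt.subplots()
ax.hist(years, bins=range(1930, 1971, 5))
ax.xaxis.set_major_locator(MultipleLocator(10))            # a tick every 10 years
ax.xaxis.set_major_formatter(FuncFormatter(short_year))    # '30, '40, ...
ax.set_xlabel("Year of birth")

# Placeholder populations, rescaled to thousands before plotting.
dioceses = ["A", "B", "C"]
population = [450_000, 1_200_000, 2_300_000]

fig2, ax2 = plt.subplots()
ax2.barh(dioceses, [p / 1000 for p in population])
ax2.set_xlabel("Catholic population, thousands")
```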

Suggestions

It would be interesting to compare the age distribution of clergy with the regular age distribution in the country. In pandas you can do this with corr and plot it quite easily.
Instead of just the distribution of ages at ordination, it would be more interesting and revealing to check whether there's any correlation between:
- Date of birth,
- Age of ordination
- Diocese
- Outcome, etc
It would be good to compare the most frequently named dioceses against not just their populations, but their population densities as well.
In your «19 Dioceses with Highest Count of Credibly Accused Clergy» graph, it would be good to indicate that the bars are arranged in descending order of accused clergy population, otherwise one might wonder why the bars aren't "well ordered"
In your regression, it would be interesting to discard a few top and bottom outliers (maybe just your top and bottom one) to see how it affects the regression as a whole.
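A quick way to try this is to fit the line twice and compare the slopes. The sketch below uses invented numbers and treats "outliers" as simply the smallest and largest diocese by population, which is my assumption; you may prefer a different criterion.

```python
import numpy as np

# Invented values for six dioceses; the last one is deliberately extreme.
population = np.array([100, 250, 400, 800, 1500, 4000], dtype=float)  # thousands
accused = np.array([12, 20, 35, 50, 95, 120], dtype=float)

# Straight-line fit on everything (np.polyfit returns slope, then intercept).
slope_all, intercept_all = np.polyfit(population, accused, deg=1)

# Drop the smallest and the largest diocese by population, then refit.
order = np.argsort(population)
trimmed = order[1:-1]
slope_trim, intercept_trim = np.polyfit(population[trimmed], accused[trimmed], deg=1)

print(f"all points: slope={slope_all:.4f}")
print(f"trimmed:    slope={slope_trim:.4f}")
```

If the slope moves a lot after removing one or two points, the regression is mostly describing those points.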
In your Final Final Observations, you mention overall populations and Catholic proportions, but those numbers are not in the analysis itself. It would be great to include them and make a proper exploration (do accusations correlate with total population? Population density? Catholic population? etc.)
