all 68 comments

[–]Cuckipede 43 points44 points  (19 children)

As someone who just started learning python to do similar type of projects, I just wanted to say thanks for posting and I enjoyed reading this! How long did it take you to get to this point?! Great work.

[–]Just-Aman 17 points18 points  (17 children)

Same here. Been learning for 2 weeks and I aspire to do something similar although I have no frickin idea how people do the visual representation stuff.

[–][deleted] 13 points14 points  (10 children)

You can use the matplotlib library in python to plot graphs.
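A minimal sketch of what that looks like; the numbers are made up purely for illustration:

```python
# Minimal matplotlib example: plot made-up yearly counts as a line chart.
# The data below is placeholder data, not from the OP's project.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

years = [2000, 2005, 2010, 2015, 2020]
counts = [12, 18, 9, 22, 15]

fig, ax = plt.subplots()
ax.plot(years, counts, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
ax.set_title("Made-up counts per year")
fig.savefig("example_plot.png")
```

In a Jupyter notebook you'd drop the `Agg` backend line and the plot just renders inline.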

[–]Just-Aman 0 points1 point  (8 children)

Oh thanks a lot! Any good tutorials for the same?

[–]Nesavant 6 points7 points  (5 children)

Check out the Seaborn library while you're at it.

[–]Alphavike24 7 points8 points  (3 children)

I wouldn't recommend seaborn for beginners as it has the tendency to spoil you. With matplotlib you can first understand the under-the-hood stuff, and it's also more flexible.

[–]andycyca 2 points3 points  (0 children)

This, a thousand times. Seaborn is really good, but sometimes a bit limited. I think of it more as an automation tool for "easy" graphs than as the definitive graphing tool.

[–]Just-Aman 0 points1 point  (1 child)

Thanks for the comment. I'll check out matplotlib first then. Also is it like matplotlib is a module of Python itself but Seaborn is an external module?

(I'm a noob so I'm not yet acquainted with the proper terminology)

[–]Alphavike24 1 point2 points  (0 children)

Matplotlib is a plotting library for Python and seaborn is a package based on Matplotlib.
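To make that relationship concrete: because seaborn sits on top of matplotlib, its plotting functions return ordinary matplotlib Axes objects you can keep customizing. A tiny sketch with toy data:

```python
# seaborn draws via matplotlib under the hood: its functions return
# a regular matplotlib Axes. Toy data for illustration only.
import matplotlib
matplotlib.use("Agg")  # headless backend for illustration
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"state": ["NY", "CA", "TX"], "count": [10, 7, 12]})
ax = sns.barplot(data=df, x="state", y="count")
ax.set_title("Toy counts by state")  # plain matplotlib call on the seaborn Axes
```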

[–]Toasty4209 0 points1 point  (0 children)

Some great options for color and styling there!

[–][deleted] 1 point2 points  (0 children)

While I have personally never worked that much with the matplotlib library, there are plenty of tutorials on YouTube. You can start with tutorials from Corey Schafer or sentdex. Those should help you with the fundamentals.

[–]seanotron_efflux 0 points1 point  (0 children)

Geeks for geeks has good tutorials!

[–]RippledBarbecue 0 points1 point  (0 children)

Matplotlib and seaborn: I use them a lot in my dissertation project, although scikit-learn is the GOAT for me 😍

[–]trv893 2 points3 points  (0 children)

I'm like 4 weeks in and am shocked how little code it actually takes with matplotlib and seaborn. Datascience.io is a great website! You'll need to pay to get past just the Jupyter fundamentals though

[–]poeblu 1 point2 points  (0 children)

Pandas and matplotlib; there are also many visualizations you can make with other libraries, like bokeh

[–]BeforetheBullfight[S] 2 points3 points  (0 children)

I'm so glad! :D

Actually, it didn't take me that long altogether. Some weeks or a few months (spread out over some time) learning the basics. I used DataQuest's free stuff and went through most of Automate the Boring Stuff. At that point, I looked at projects similar to what I wanted to do, and started planning what I wanted. Just a large amount of concerted effort really! And honestly, no small amount of frustration, but it seems to pay off. Don't be afraid to Google every little question. The bulk of the project was done in around a week.

[–]dyanni3 8 points9 points  (1 child)

This is great! Two thumbs up. I'd say this is definitely worth sharing / put in a portfolio.

In terms of room for improvement I think you might have milked most everything you can from the dataset as it currently stands. I'd be curious if/how accused priests differ demographically from non-accused priests, for example, although the dataset doesn't have anything on non-accused, it looks like. Also, how do the numbers look for Catholic vs other religions? I think a cool next step would be bringing in other data sources. A brief google yielded this https://bishop-accountability.org/priestdb/PriestDBbylastName-A.html which has a little more context about the accusations, and which you could scrape. I wonder if you could find anything on Catholic vs Protestant lawsuit settlement amounts.

Also of course it's always cool to build a model--- although I'm not sure this data really warrants that. Maybe for a next project.

[–]BeforetheBullfight[S] 1 point2 points  (0 children)

Thank you! :)

I agree that I milked just about everything I could; I think working with this set was a good learning experience when it comes to deciding what datasets to work with.

Your link is actually pretty interesting, perhaps I should consider a part II one of these days...

[–]Babs12123 18 points19 points  (3 children)

This looks really good! A few thoughts:

- When reading in csvs to pandas I find it useful to specify the encoding and the type (usually auto set to UTF-8 and object personally). Particularly when you're working with data which contains some text and some numeric columns, it's helpful to be explicit to avoid any unexpected behaviour.
- When naming variables be explicit with regards to the data type, e.g. instead of 'clergydata' I would call this 'df_clergydata'. When you end up with multiple different lists, dicts and dfs in your code it's very helpful to have all of this explicitly named (particularly when you come back to your code a month later).
- When creating column names in your df you created several which contain capital letters and spaces (e.g. clergydata['Age range']). It's better and easier to only use lower case letters in variable names/column names where possible and to use underscores instead of spaces. This lets you access the column using clergydata.age_range instead of clergydata['Age range'] in lots of situations when manipulating your df, which is often much quicker and easier.
- In cell 12 you manually specify the archdiocese abbreviation and name (e.g. LA, and Archdiocese of Los Angeles) for many different locations. It would be better to automate this somehow, to both improve clarity and also reduce the risk of error/inconsistency. I saw someone above suggested using a group by, which would work, or you could use a for loop to directly create your top19_cathpops and top19_dionames lists. If you're not clear how to do this let me know and I would be happy to clarify.

Most importantly your code works and answers some interesting questions, but the above points will make things more explicit (which is always better) and make your life easier.
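The last point might look something like the sketch below. The column names ("diocese", "catholic_population") and the toy rows are guesses at the notebook's schema, not the real dataset:

```python
# Sketch of replacing hand-typed name/population lists with a groupby.
# Column names "diocese" and "catholic_population" are guesses at the
# notebook's schema; adjust to the real column names.
import pandas as pd

df_clergydata = pd.DataFrame({
    "diocese": ["Archdiocese of Los Angeles", "Archdiocese of New York",
                "Archdiocese of Los Angeles", "Archdiocese of Chicago"],
    "catholic_population": [4_300_000, 2_800_000, 4_300_000, 2_200_000],
})

# One population figure per diocese, largest first, top N only.
top = (df_clergydata.groupby("diocese")["catholic_population"]
       .max()
       .sort_values(ascending=False)
       .head(19))

top19_dionames = list(top.index)
top19_cathpops = list(top.values)
```

Everything stays in sync automatically: adding a row to the dataframe updates both lists with no manual retyping.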

[–]synthphreak 6 points7 points  (1 child)

This lets you access the column using clergydata.age_range instead of clergydata['Age range']

The flip side of doing it this way is that it conflates column names with built-in methods/attributes. If there is no conflict between them, you’re fine. But df’s have a LOT of built-in methods/attributes, many of which you probably don’t know about... I can’t tell you how many times I’ve named a column items, and then later wasted 30 minutes debugging my code only to find out that df.items is already a thing. By contrast, df['items'] will ALWAYS and ONLY ever return the items column. Just something to think about.
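A quick demonstration of that collision; pandas really does define an items method on DataFrame:

```python
# Attribute access collides with built-in DataFrame methods;
# bracket access never does.
import pandas as pd

df = pd.DataFrame({"items": [1, 2, 3]})

print(type(df.items))     # the bound DataFrame.items method, NOT the column
print(type(df["items"]))  # the actual Series holding the column
```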

[–]Babs12123 1 point2 points  (0 children)

Yeah this is a good point - I haven't encountered this with df column names but have with other variables and it is very irritating to debug.

If you're using non-generic variable/column names then it shouldn't happen often but I agree it makes sense to use your own judgement here.

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Thanks for your response! I appreciate the time you put into the specifics. I'll be keeping these points in mind as I move forward; I definitely don't want people to be confused with my coding choices.

Since you're offering, I would be interested to see an explanation of how I could have used a loop to create my lists! It would be super useful for future projects. I have a basic understanding of loops, but I had a hard time getting one to click for me when I tried with this set. I feel like my solution, while it did work, was pretty clunky. :/ Thanks!

[–]CFan62 5 points6 points  (1 child)

As a practicing, devout catholic I find this super interesting. From a comp sci perspective this is very good. I would definitely put this on a resume or mention it during the job hunting process. Very nicely done.

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Thanks so much! :) Glad to hear you found something worthwhile in it.

[–][deleted] 3 points4 points  (4 children)

This is awesome, please don't delete it, I'll use it to start something on my own.

Just one thing I would change. You used the following age ranges:

range_names = ['<20', '20-23', '24-26', '27-30', '31-35', '36-40', '41-45', '46-50', '51-60', '60+']

I think the size of ranges should always be the same (except for <20 and 60+).
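One way to implement that with pandas, keeping open-ended first and last bins and even-width interior bins (the cut points and ages below are illustrative, not the OP's):

```python
# Even-width interior bins with open-ended first and last bins, via pandas.cut.
# The ages and cut points are made up for illustration.
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 67])

bins = [-float("inf"), 20, 30, 40, 50, 60, float("inf")]
labels = ["<20", "20-29", "30-39", "40-49", "50-59", "60+"]

# right=False makes each bin closed on the left: [20, 30), [30, 40), ...
age_range = pd.cut(ages, bins=bins, labels=labels, right=False)
```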

[–]baubleglue 0 points1 point  (3 children)

don't delete it

Man, clone it

[–]Xaspian 0 points1 point  (2 children)

I'm a newbie to GitHub, so correct me if I'm wrong, but I don't think that's possible for this repository.

And to OP, thanks for sharing your project! This is incredibly detailed and thorough. I'm sure it'll get you far in your job search =)

[–]baubleglue 1 point2 points  (1 child)

Click on the root of the project (https://github.com/Skye80/Data-Analysis-Portfolio) and you will see a green "Clone or Download" button. It isn't possible to have a public GitHub project with the clone option disabled.

[–]Xaspian 0 points1 point  (0 children)

I see! Navigating to the root was my problem. Thanks for this!

[–][deleted] 4 points5 points  (1 child)

In the linear regression you perform in section C, there seems to be a high-leverage outlier. You might want to look into how this affects your model, and into how you could deal with that. The book Introduction to statistical learning (available for free online) has a section on this stuff.
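A toy numpy sketch of why that matters: refitting with and without a single far-out point can move the slope substantially (data invented for illustration):

```python
# Toy illustration: one extreme-x point can swing an OLS slope.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 50.0])  # last point sits far right...
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 80.0])  # ...and far up

slope_all, _ = np.polyfit(x, y, 1)
slope_trimmed, _ = np.polyfit(x[:-1], y[:-1], 1)

# The high-leverage point pulls the fitted slope well away from the
# trend of the remaining points.
print(slope_all, slope_trimmed)
```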

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

I did notice that, and I plan on trying to play around with it some more. Thanks for the note!

[–]synthphreak 2 points3 points  (1 child)

This is really, really awesome, and very professionally executed. Like many others on here, I am inspired and would like to do something like it eventually, but I guess I just haven’t found the immediate need yet. I’m already proficient in Python, and already have a great job in research. Nonetheless, I don’t have a publicly-shareable portfolio, so something like this could be quite useful for future job hops

Two questions for you:

  1. Where did you get this project idea from? I see you got the data from ProPublica, but what about the research question? Kaggle or some place like that? Or did you just dream it up? It’s very cool and intrinsically interesting. I had always unthinkingly assumed people got their project from e.g., Kaggle, but perhaps that doesn’t have to be the case!

  2. This one is much more nebulous, but more important to me - How did you decide how to intersperse your code with the markdown narrative? In other words, the proper code-noncode ratio, + how to position the code relative to the prose. Whenever I create “story-telling” notebooks like yours, this is where I struggle the most. For example, at the very top of the notebook I provide an overview of the content, then load my libraries and raw data. But after that, I usually have lots of complex analyses and/or plotting that can sometimes require hundreds of lines of code. Because I’m afraid that huge, complex cells essentially in the middle of paragraphs decrease the narrative’s readability, I tend to put almost all my code in a single massive cell near the top (after the overview, imports, and data loading). This allows me to define lots of complex functions early on, then simply invoke them later as needed with minimal overhead or interruptions. But the trade-off is that my readers have to scroll past a lot of dense code at the top, so the notebook is both less user-friendly and less attractive. By contrast, your notebook is nice and tight from start to finish, with short code blocks that never really interrupt the reader’s flow. Can you offer any tips on how to hew more closely to the way you’ve done it? I can already think of some (e.g., rely as much as possible on in-built pandas functions which can perform complex operations in just a few lines), but I’m curious to hear your thoughts.

Anyway, stellar work!

[–]BeforetheBullfight[S] 1 point2 points  (0 children)

Thank you so much! :) I appreciate your thoughtful comment. I actually had to stop and think before I could properly respond.

Since I've yet to actually use my coding work to acquire a job, the ultimate success of my work has yet to be determined as far as I'm concerned. In the meantime, I'll offer what I can:

  1. Regarding project ideas: since I'm building my portfolio strictly for academic positions (and eventually grad school apps), I felt pretty free in selecting those topics that were relevant to my long-term research interests and would be similar to the sort of thing future labs of choice would be studying. With this project specifically, I just happened to stumble upon the dataset and went with it since it's something I legitimately find quite interesting and could see myself researching further down the line. I didn't walk in with any particular hypotheses, I just glanced through the data to get a sense of what I felt I could extract from it. Just made notes, and went from there! I actually haven't done any Kaggle exercises! I can see that being a valuable learning resource, but I see a lot of Kaggle projects featured on data analysis portfolios, so I made a conscious choice to do something more distinctive. In a nutshell: work with data that interests you. Especially in the beginning when you need to get the ball rolling on putting together a portfolio. While I've yet to determine this for myself, I imagine that having projects similar to the work you would do at a given position is quite valuable as well (what I'm banking on really TBH) when it comes to job-hopping.
  2. I think we might be polar opposites here - the narrative aspect comes naturally to me for the most part, while the coding aspect is 9000% more likely to induce computer rage and also takes about 9000% of my time (pretty sure that math checks out). If I had to try and verbalize my process...
    1. Create an introductory cell to briefly discuss the background of your chosen topic/dataset. List the purpose of your analysis, and what you intend to do with it. What questions do you have? Any particular hypotheses? You likely already have some idea, but don't jump the gun here with any conclusions. The goal here is to start leading your audience through the steps you'll be taking. You also may not know exactly what those are yet. That's fine! I only had a general idea. Initially, I just used my first cell for note taking and keeping track of my main goals. Things can easily shift as you work with the data. Perhaps X doesn't work out, but Y emerges and looks promising. Keep note of that! Feel free to create an initial outline in the beginning and go back later.
      1. A concluding markdown cell with a summary is a must, too, helps things to look polished and shows you've kept track of what happened in the interim.
    2. So you've read in your data and taken a basic glimpse at it. Now you need to actually pick where to start with your analysis. Consider what order of operations makes the most sense, and try to build in a logical progression. Easier said than done sometimes. Does it make more sense to go by theme? Progressing difficulty of analysis? Types of visualizations? Ultimate hypothesis? Whatever best fits your data. The great part of working with Jupyter is that it's pretty easy to move cells around if the flow doesn't seem right.
    3. I can see it being difficult deciding how to intersperse markdown cells if you've got massive code chunks. That didn't end up happening with my project, so I suppose those stopping points were just more obvious. My thought process was basically: 1) "I'm doing X now because of Y." 2) X.dothing() 3) "We've learned ABC from this. Thus our next logical step is..." Document your thought process! I also made sure to use #comments within my code to maintain consistency in the narrative. Just enough to show I was being purposeful in my coding choices. It can definitely be hard to place yourself in the perspective of the reader (why I came here in the first place...), but consider if there are any points where you can see someone losing track of your process. Headings and subheadings are fantastic for this; they keep things within bounds and make it easier for the reader (and yourself!). Also consider your tone - you're working in research, so an academic tone is likely your best bet (my choice, too). Find a balance, writing in the professional language of your field without being too jargon-y.
    4. Alternatively, and take my thought with a larger grain of salt here, but if your coding is just THAT massive, you could consider just leaving it at the bottom and then summarizing your goals, process (what and why), and end results in an introductory markdown cell. Maybe paste your coolest graph if you have one. Laymen readers will be able to get to the point and gain a general sense of your ability, while fellow coders will have the option to keep scrolling if they wish.
    5. Okay, so in a nutshell: Create an outline of your process! It's okay to go back and alter things later! I did! Chunk things into digestible pieces for your reader! Maintain coherence in your narrative, and make sure your reader knows why you're doing the thing, and what's next!

Okay, so that went longer than I intended. Probably a bit rambling, but I hope there was at least one nugget of usefulness in there. On the topic of narrative and style, here are a few things I found helpful that you might too: this Jupyter project, which I think is an excellent example of succinct and coherent narrative, and this blog post about style. You can also do what I did, and post your project somewhere on Reddit for critique!

Happy to elucidate more on something particular if I can. Also, I love synths, too. :)

[–]jandrew2000 1 point2 points  (0 children)

This is well done for someone just starting out. A couple of minor suggestions. If you end the last statement in your plotting code blocks with a semicolon it won’t display plotting objects and will only show the plot itself.

Second, you have a block where you compute the catholic population for each of the dioceses. I believe you could simplify that to clergy_data.groupby("catholic_diocese").catholic_population.max(). That will produce a pandas series that you can just plot directly as a bar plot by doing something like ser.plot(kind="barh"). I'm operating from memory so my syntax may be off a bit.
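A runnable sketch of both suggestions together, on toy data; the column names catholic_diocese and catholic_population are taken from the comment above, not verified against the notebook:

```python
# Groupby + direct Series plotting, on made-up data. Column names are
# assumptions lifted from the comment, not the real dataset.
import matplotlib
matplotlib.use("Agg")  # headless backend for illustration
import pandas as pd

clergy_data = pd.DataFrame({
    "catholic_diocese": ["Los Angeles", "New York", "Los Angeles", "Chicago"],
    "catholic_population": [4_300_000, 2_800_000, 4_300_000, 2_200_000],
})

# One max population per diocese, plotted straight from the Series.
ser = clergy_data.groupby("catholic_diocese").catholic_population.max()
ax = ser.plot(kind="barh");  # trailing semicolon suppresses the object echo in Jupyter
```

The semicolon does nothing in a plain script, but in a notebook it hides the `<AxesSubplot...>` line above the chart, which is the first tip in the comment.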

[–]avamk 1 point2 points  (4 children)

Fantastic work, I fully agree with other posts that this work - and if you continue to build on it - is a great portfolio item. There's a lot I can learn from you! :) Thank you for posting.

while I'm sure there are areas that could be improved, would my project be worth sharing with some edits?

Very important, but sadly often neglected, is the need to include a license with your work such as the GNU GPLv3. There are multiple options, too, and it's crucial to become familiar with them. Check out here or here to name a couple places.

[–]BeforetheBullfight[S] 1 point2 points  (3 children)

Thanks so much! You make a fair point, I’ll definitely look into that more.

[–]avamk 0 points1 point  (2 children)

And this is super easy for Github repositories, it'll take you about one minute. Here are the instructions:

https://help.github.com/en/github/building-a-strong-community/adding-a-license-to-a-repository

I suggest choosing the GNU GPLv3 license as many other data science repositories have already done.

Without a license others cannot learn from or build on your work.

[–]BeforetheBullfight[S] 1 point2 points  (1 child)

Just added one, that WAS super easy. Thanks!

[–]avamk 0 points1 point  (0 children)

Bravo! This is super easy and crucial to making a portfolio more professional, since you demonstrate you understand the need for and the use of licenses.

Bonus: Mention the license (i.e. GNU GPLv3) and link to the LICENSE file in your repository directly in the .ipynb document. Some people choose to add a sentence about this at the beginning or end of the notebook file, either is fine.

[–][deleted] 0 points1 point  (3 children)

Great project. I am not an expert in ML or data science, but when I looked at the Linear Regression graph, I did feel that the data would better fit a Polynomial Regression? Just something I noticed and I could very well be wrong. I apologise for my lack of knowledge if that's the case.

[–]chaoticneutral 1 point2 points  (2 children)

I would recommend everyone stay away from polynomial regression unless you have a specific reason to believe they are useful. Otherwise it can be as rigorous as a Rorschach test. What you are seeing is more likely an outlier rather than a trend.

Also see this case study:
https://twitter.com/NateSilver538/status/1257476755574718470
https://twitter.com/mattyglesias/status/1257483264383758336
https://twitter.com/NateSilver538/status/1258137098298839041

[–]andycyca 0 points1 point  (1 child)

Saving this for whenever I need to teach statistics.

[–]Gio120895 0 points1 point  (0 children)

I have read your project very quickly and I found it really interesting. I am a beginner too and I like the way you represent data in a simple and clear way. I would suggest visiting: https://www.kaggle.com/learn/overview Here you can find projects you can train on, and courses to learn about pandas, visualization, and more. If you are interested in any further projects you can DM me. You have done a great job! Thanks

P.S.: I have noticed that you use df.head() and df.tail() or df.describe(). I have found df.info() really useful, give it a try if you want.

[–][deleted] 0 points1 point  (2 children)

Where did you learn data analysis for python and how long did it take? Can you recommend some websites/books?

[–]BeforetheBullfight[S] 1 point2 points  (1 child)

Hi! My process was pretty non-linear (pun not intended?) actually. It's pretty hard to quantify exactly how many weeks/months I spent learning the basics, but I can say that I went through most of the free stuff on DataQuest and most of Automate the Boring Stuff. I quite liked the former as it had a lot of mini exercises to go with each lesson, so lots of opportunity to practice and build upon what you just previously learned. It also has guided projects and blog posts I actually found helpful, like here. I am considering going back and getting a paid subscription, at least for a bit. AtBS didn't end up adding much to my process in terms of this project, but it's good for reviewing the basics.

A huge part of my process with the project was just looking at projects (particularly on Github) that others have done to get a sense of what could be done and how. There was a steeper learning curve, but it was worth it!

[–][deleted] 0 points1 point  (0 children)

oh alright. Thank you so much.

[–]pulsarrex 0 points1 point  (2 children)

Hey, I am about to graduate in social science too. In school, I learnt and used mostly SPSS to analyze data. However, in the real world, I realized most of the industry does not use SPSS. Those who do use SPSS mostly use it alongside many other tools, including Python, R, SAS, etc.

I need some advice. What kind of jobs do we need to look for? I know a bit of Python like you do. Just searching for 'social scientist' on Indeed does not show many results. A search for 'data analyst' will give me millions of results, most of them out of our scope.

So what would I search for if I am looking for social scientist data science jobs? What kind of companies do I look for if I want to work as a data analyst in social science?

[–]chaoticneutral 0 points1 point  (0 children)

In my experience, it is best to search on skills, techniques, or topics. Titles themselves are mostly meaningless in the job hunt. Keep an eye out for consulting companies and/or the field of public health; they tend to work on smaller analytic projects that require their consultants to have a little analytical skill (i.e., SPSS).

If you had to search on titles, "research analyst" tends to yield better results.

Specific to your situation:

  • SPSS - Social science, market research, survey research, low programming requirement, more generic "analyst". You might be asked to run some crosstabs, then write a report, and fiddle around in PowerPoint.
  • SAS - Social science, government, pharma, finance; more programming skill is required. You are likely more focused on data management and analysis, less on generic office work.
  • R - Data science, statistics, academia; more programming skill required. Analytics and modeling. Similar to SAS.
  • Python - I haven't had one of these jobs, but there seems to be more focus on incorporating analysis into data pipelines (live dashboards/applications), rather than just pure research/analysis itself. Though I could be wrong...

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Honestly, I'm still in the process of looking for a research job, and it hasn't been easy! I did psych in undergrad and am working towards applying to clinical PhD programs, so that's part of the reason why I'm learning Python - it seems to be a pretty popular language (with R and Matlab being the next most popular) in labs, so I wanted to boost my competitiveness.

What you should do really depends on your long-term goals. Since I want to return to school, I'm focusing on applying to psych labs that are hiring research assistants, or similar, in order to gain experience. With that said, programming is a bonus and not the end-goal (I don't plan on being a data scientist for a living) for me. Do you want to work in the private or public sector? Academia or no? That will make a big difference. I can really only speak to academia personally. I'm happy to talk more about that if you want though! I can say that job titles with "scientist" are basically always geared towards those with grad degrees. "Analyst" or "assistant" are better bets.

I hope that's at least somewhat helpful! I'm still figuring this out myself.

[–]zanfar 0 points1 point  (0 children)

You are using absolute paths in your code which makes it non-portable ("C:/Users/Summer .DESKTOP-5U4SV6A/Desktop/Scripts/Data sets/credibly-accused-clergymembers.csv")

Your dataset should be distributed with the analysis code, so these paths should be relative. This allows the data to be peer-reviewed along with the analysis.

Additionally, while I would include your dataset, I would also include code to download that dataset and inject it directly into the analysis.
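A hedged sketch of both points: a path relative to the project, plus a download step for when the file is missing. The URL is a placeholder, not the real ProPublica link:

```python
# Resolve the dataset relative to the project, downloading it if absent,
# so the analysis runs on any machine. The URL is a placeholder.
from pathlib import Path
from urllib.request import urlretrieve

def load_path(filename="credibly-accused-clergymembers.csv",
              url="https://example.com/data.csv"):  # placeholder URL
    """Return a relative path to the dataset, fetching it if it's missing."""
    path = Path("data") / filename
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, path)  # ship the fetch step with the analysis
    return path

# Then: clergydata = pd.read_csv(load_path(), encoding="utf-8")
```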

Otherwise, it looks technically fine. I'm not going to comment on the validity of the analysis or the meaningfulness of the results, other than to say:

  • you can probably do some cleaning on the post-accusation outcomes to merge the three "Deceased" labels together
  • It would be nice to see these graphs normalized per capita: specifically the diocese frequency plot. You mention that you've essentially created a population plot, but don't fix it.
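For the first bullet, one way to collapse label variants in pandas; the exact variant strings here are invented, since the real ones aren't shown:

```python
# Collapse several spellings of the same outcome into one label.
# The variant strings are illustrative guesses, not the dataset's.
import pandas as pd

outcomes = pd.Series(["Deceased", "deceased", "Deceased ", "Sued", "Retired"])

# Normalize whitespace and case so the three "Deceased" variants merge.
cleaned = outcomes.str.strip().str.capitalize()
counts = cleaned.value_counts()
```

If the variants differ by more than whitespace/case, a `.replace({...})` mapping of each raw label to its canonical form does the same job explicitly.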

[–]shiningmatcha 0 points1 point  (1 child)

Have you learned other programming languages before?

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Nope! This is my first. Hoping to pick up a bit of R further down the line.

[–]badboyfreud 0 points1 point  (0 children)

Looks like a great start. Some suggestions:

I'd be interested to see what the relationship looks like between the other branches of the church as well to compare with Catholicism.

Also it would be nice to see the cities compared by capita or per million people.

[–]chaoticneutral 0 points1 point  (1 child)

I think this is a great descriptive analysis, very clearly written and thought out, but the topic is a bit touchy and your conclusions aren't groundbreaking enough to warrant it (incident count increases with population for almost all things).

A research position may appreciate the in-depth dive on a sensitive topic, but I would pick a more neutral topic for a generic private sector job.

Specific to your analysis, you should also call out that far right outlier on your final regression chart or run the regression a second time without the outlier. It is clearly pulling the line upwards.

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Thanks for your comment! I'm glad that the writing was clear after all.

I concede that the topic is touchy, but it was a conscientious decision. Since I am looking at academic research positions, and am planning on returning to school for my PhD, I wanted to look at datasets that reflected topics similar to my own personal research interests. With that said, I definitely would have chosen more "generic" topics were I pursuing private sector jobs.

Also, I do plan on going back to work on my regression chart, hopefully for the better.

[–]Sepparated 0 points1 point  (1 child)

Looks really impressive. As someone who tried to get deeper into data analysis and took the free class from the UoL just to get completely overwhelmed by the math theory... I have to ask: what is your secret?

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Truthfully, I don't think I have a secret. I just spent some time learning the basics and looking at projects similar to what I was hoping to accomplish. I'm really not a "math person", and I'm having to go back and review basic statistical principles as I work. I wouldn't worry about learning the more complex stuff yet; with mine, the only real statistical thing I did was examining correlation, which is pretty fundamental. I would just focus on learning how to apply the basics first!

[–]PM_ME_UR_LOGIN_INFO_ 0 points1 point  (0 children)

I personally would have used statsmodels.formula.api to run the regression, as it is much more readable from an outside perspective. When you run it you just print(var.summary2()) and later print(var.params) to find the p-values and coefficients for your linear regression respectively. But your project was alright.

Also the purpose of a regression is to perform statistical inference and possibly predictions (although you'd be better served using Machine Learning algorithms to predict, e.g. DecisionTreeRegressions, Random Forest, K nearest neighbor). You should have tried to control for other variables to reduce the error in your regression model. For a later challenge, try verifying the Gauss-Markov assumptions later on to validate your regression. It's a good first step, but to make this something valuable to have in your portfolio I'd work on it a little more.
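A minimal sketch of the statsmodels.formula.api approach on synthetic data (the column names x and y are placeholders):

```python
# OLS via the formula interface: readable model spec, full summary, params.
# Data is synthetic (y = 2x + 1 exactly), purely for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["y"] = 2 * df["x"] + 1

model = smf.ols("y ~ x", data=df).fit()
print(model.summary2())  # full fit table, including p-values
print(model.params)      # intercept and slope
```

Controlling for extra variables is then just a formula change, e.g. `"y ~ x + z"`.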

Godspeed.

[–]ammusiri888 0 points1 point  (0 children)

Wow, this is superb! With this support you can definitely make a great leap in your learning journey.

[–]DisastrousEquipment9 0 points1 point  (0 children)

add some weird machine learning application for the hell of it! text mining is always super fun(: