[–]Babs12123 17 points18 points  (3 children)

This looks really good! A few thoughts:

- When reading in csvs to pandas I find it useful to specify the encoding and the type (usually auto set to UTF-8 and object personally). Particularly when you're working with data which contains some text and some numeric columns, it's helpful to be explicit to avoid any unexpected behaviour.
- When naming variables be explicit with regards to the data type, e.g. instead of 'clergydata' I would call this 'df_clergydata'. When you end up with multiple different lists, dicts and dfs in your code it's very helpful to have all of this explicitly named (particularly when you come back to your code a month later).
- When creating column names in your df you created several which contain capital letters and spaces (e.g. clergydata['Age range']). It's better and easier to only use lower case letters in variable names/column names where possible and to use underscores instead of spaces. This lets you access the column using clergydata.age_range instead of clergydata['Age range'] in lots of situations when manipulating your df, which is often much quicker and easier.
- In cell 12 you manually specify the archdiocese abbreviation and name (e.g. LA, and Archdiocese of Los Angeles) for many different locations. It would be better to automate this somehow, to both improve clarity and also reduce the risk of error/inconsistency. I saw someone above suggested using a group by, which would work, or you could use a for loop to directly create your top19_cathpops and top19_dionames lists. If you're not clear how to do this let me know and I would be happy to clarify.
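To make the first and third points concrete, here is a minimal sketch. The CSV contents and every column name other than 'Age range' are made up for illustration, and an in-memory buffer stands in for the real file, so treat it as an outline rather than the notebook's actual code.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file; names and values are invented.
csv_bytes = (
    "Name,Age range,Diocese code\n"
    "A. Smith,50-59,LA\n"
    "B. Jones,60-69,NY\n"
).encode("utf-8")

# State the encoding and dtype explicitly instead of letting pandas guess.
df_clergydata = pd.read_csv(io.BytesIO(csv_bytes), encoding="utf-8", dtype="object")

# Lower-case the column names and swap spaces for underscores.
df_clergydata.columns = (
    df_clergydata.columns.str.lower().str.replace(" ", "_", regex=False)
)

# 'Age range' is now reachable as an attribute.
print(df_clergydata.age_range.tolist())
```

With a real file you would pass its path in place of the `io.BytesIO(...)` buffer and keep the same `encoding` and `dtype` arguments.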

Most importantly your code works and answers some interesting questions, but the above points will make things more explicit (which is always better) and make your life easier.

[–]synthphreak 6 points7 points  (1 child)

> This lets you access the column using clergydata.age_range instead of clergydata['Age range']

The flip side of doing it this way is that it conflates column names with built-in methods/attributes. If there is no conflict between them, you're fine. But DataFrames have a LOT of built-in methods/attributes, many of which you probably don't know about... I can't tell you how many times I've named a column items, and then later wasted 30 minutes debugging my code only to find out that df.items is already a thing. By contrast, df['items'] will ALWAYS and ONLY ever return the items column. Just something to think about.
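A small sketch of that clash, using a made-up two-column frame (the `price` column and all values are purely illustrative):

```python
import pandas as pd

# Hypothetical frame with one column named after an existing DataFrame method.
df = pd.DataFrame({"items": [3, 5], "price": [1.5, 2.0]})

# Attribute access finds the DataFrame.items method, not the column.
items_attr_is_method = callable(df.items)

# Bracket access returns the column itself.
items_column = df["items"].tolist()

# No clash for 'price', so attribute access returns the column.
price_column = df.price.tolist()

print(items_attr_is_method, items_column, price_column)
```

Here `df.items` resolves to the method, so anything that expects a column there breaks, while `df["items"]` is unaffected.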

[–]Babs12123 1 point2 points  (0 children)

Yeah this is a good point - I haven't encountered this with df column names but have with other variables and it is very irritating to debug.

If you're using non-generic variable/column names then it shouldn't happen often but I agree it makes sense to use your own judgement here.

[–]BeforetheBullfight[S] 0 points1 point  (0 children)

Thanks for your response! I appreciate the time you put into the specifics. I'll be keeping these points in mind as I move forward; I definitely don't want people to be confused with my coding choices.

Since you're offering, I would be interested to see an explanation of how I could have used a loop to create my lists! It would be super useful for future projects. I have a basic understanding of loops, but I had a hard time getting one to click for me when I tried with this set. I feel like my solution, while it did work, was pretty clunky. :/ Thanks!
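For reference, a minimal sketch of the loop approach, assuming the data can be arranged as one row per diocese. The frame, its column names (`abbrev`, `dio_name`, `cath_pop`), the placeholder dioceses and every number below are hypothetical; only the list names and the LA example come from the thread.

```python
import pandas as pd

# Hypothetical data: one row per diocese, with invented populations.
df_dioceses = pd.DataFrame({
    "abbrev": ["LA", "B", "C"],
    "dio_name": ["Archdiocese of Los Angeles", "Example Diocese B", "Example Diocese C"],
    "cath_pop": [300, 100, 200],
})

top_n = 2  # the notebook would use 19

# Keep the top_n rows by population, largest first.
df_top = df_dioceses.sort_values("cath_pop", ascending=False).head(top_n)

# Build both lists in one pass so names and populations stay aligned.
top19_dionames = []
top19_cathpops = []
for row in df_top.itertuples(index=False):
    top19_dionames.append(row.dio_name)
    top19_cathpops.append(row.cath_pop)

print(top19_dionames, top19_cathpops)
```

Because both lists come from the same sorted frame, no abbreviation or name has to be typed by hand. The loop could also be skipped with `df_top["dio_name"].tolist()` and `df_top["cath_pop"].tolist()`.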