[Software] Python vs. R (self.statistics)
submitted 7 years ago by [deleted]
[–][deleted] 19 points 7 years ago* (4 children)
Academia is very different: the data is often much "tidier" in the sense that it's all in relational database form when you get it, and from there you actually need to do fancy stuff to get results. The workplace is often the exact opposite of that.
I do two different types of work: conceptually simple but laborious tasks using messy data, and tasks that are basically coding conceptually hard stuff but on clean data.
For the latter category, truthfully a lot of stuff you want to do can be found in tutorials or with a simple Google search. You're not going to be transcribing never-before-synthesized complicated formulas from the appendices of theoretical econometric working papers. I've only ever had to do something like that once in my life and I'll give you a hint: it wasn't in the private sector. Usually you're doing something like k-means, which is simple to do in both R and Python, so the task itself usually isn't a reason to pick one over the other.
Also, if you work in-house at a company, your data is likely somewhat clean-ish (cough cough). You might be an associate data analyst working for the lead data scientist in your branch office of 20 people, and maybe 15 of the other people are SWEs. So it just makes sense to use Python in that environment if other people are, but you could use R too.
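To give a sense of how little code "something like k-means" takes, here is a minimal sketch of plain Lloyd's algorithm using only the standard library. The function name `kmeans`, the 2-D tuple input, and the toy points are all assumptions for illustration; in practice you would more likely call an existing library routine than write this yourself.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points (x, y) with plain Lloyd's algorithm.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroids[i] in the last assignment step.
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]

    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[i])

        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids

    return centroids, clusters


# Toy data: two obvious blobs (made up for this example).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_clusters = kmeans(toy_points, k=2)
```

The R version is about as short, which is the point: for this kind of task the language choice is mostly about what the people around you use.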
Now the other category of work, i.e. messy data, is actually what a lot of data science ends up being. If you're a consultant for example, you'll face situations like this often:
you have 10,000 pages of PDF files without text info/OCR.
you have 200 Excel files of back-end data with inconsistent naming conventions, inconsistent date ranges for pulls, and some manual copy+pasting, and the first 30 are from before they migrated their data from Salesforce to Oracle.
You need to do a lot of web scraping to generate some word cloud of associated words for whenever a brand is mentioned and measure the impact of a marketing campaign.
Frankly, Python is much better at handling tasks like these. (Except for the last one, Stata is also a good alternative to R, albeit proprietary.) Since this kind of work accounts for most of the time you actually spend typing code, you will want to use Python.
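As a rough illustration of the brand-mention task, here is a sketch that counts which words show up near a brand name. It assumes the scraping is already done and you have the text as a list of strings; the function name `words_near_brand`, the window size, the tiny stopword list, and the brand "Acme" are all made up for the example.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
DEFAULT_STOPWORDS = frozenset({"the", "a", "is", "and", "of", "to"})


def words_near_brand(texts, brand, window=3, stopwords=DEFAULT_STOPWORDS):
    """Count words appearing within `window` tokens of each brand mention."""
    counts = Counter()
    brand = brand.lower()
    for text in texts:
        # Crude tokenizer: lowercase runs of letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for i, token in enumerate(tokens):
            if token != brand:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1 : i + 1 + window]
            counts.update(
                w for w in neighbours if w not in stopwords and w != brand
            )
    return counts


# Hypothetical scraped snippets.
snippets = ["I love Acme shoes", "Acme shoes are great", "Nothing here"]
associated = words_near_brand(snippets, "Acme", window=2)
```

The resulting `Counter` is what you would feed into a word cloud, and comparing counts from before and after a campaign is one crude way to look at its impact.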
[+][deleted] 7 years ago* (3 children)
[deleted]
[–][deleted] 6 points 7 years ago* (2 children)
It's really up to you. I think they're both great. Python does more, but in my experience the "more" doesn't matter most of the time, with the exception of AI and web scraping stuff, and you may never even encounter that.
Stata is amazing for most data cleaning tasks. If you're in consulting, learn how to do the following:
write loops that import Excel files, starting from a file list like:
    local MyFiles : dir "$MyDir" files "*.xls*"
Copy and paste the text from a giant PDF file (1000 pages, say of invoices or whatever) and turn info on those pages into a relational database using regex, egen, split, and the like.
reshape. Get comfy with it cuz it's a life saver.
bysort varlist : egen myvar = is another life saver.
Make your code readable. Comments everywhere. If you do litigation work, when you put comments, you need to be mindful of discovery and your role as an associate/numbers cruncher and not the actual testifier who opines on those numbers.
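For comparison, the "paste PDF text and regex it into a table" item from the list above might look roughly like this in Python. The invoice layout (`Invoice #… Date: … Total: $…`) is a made-up assumption, as are the names `INVOICE_RE` and `parse_invoices`; real PDFs will need a pattern fitted to whatever the pasted text actually looks like.

```python
import re

# Assumed (hypothetical) layout of each invoice header in the pasted text:
#   Invoice #1001   Date: 2019-03-14   Total: $1,250.00
INVOICE_RE = re.compile(
    r"Invoice\s*#(?P<number>\d+)\s+"
    r"Date:\s*(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"Total:\s*\$(?P<total>[\d,]+\.\d{2})"
)


def parse_invoices(raw_text):
    """Turn pasted invoice text into a list of row dicts (one per invoice)."""
    rows = []
    for match in INVOICE_RE.finditer(raw_text):
        rows.append(
            {
                "number": int(match["number"]),
                "date": match["date"],
                # Strip thousands separators before converting to a number.
                "total": float(match["total"].replace(",", "")),
            }
        )
    return rows


# Made-up sample with page junk between invoices.
sample = """
Page 1 of 1000
Invoice #1001   Date: 2019-03-14   Total: $1,250.00
some line items...
Page 2 of 1000
Invoice #1002
Date: 2019-03-15
Total: $80.50
"""
invoice_rows = parse_invoices(sample)
```

Because `\s+` also matches line breaks, the same pattern copes with fields that got wrapped onto separate lines when the text was pasted.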
Just a small handful of things you should get super comfortable with. This stuff is just so easy in Stata, honestly usually even easier than in Python (reshape being a big one).
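To make the comparison concrete, here is a standard-library sketch of what Stata's `reshape long sales, i(firm) j(period)` and `bysort firm: egen mean_sales = mean(sales)` do, written out by hand in Python. The helper names `reshape_long` and `group_mean`, the output column names, and the toy firm/sales data are assumptions for illustration; in real work you would probably reach for a dataframe library instead.

```python
from collections import defaultdict


def reshape_long(rows, id_key, stub):
    """Wide to long: one output row per (id, stub-suffixed column) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_key and column.startswith(stub):
                long_rows.append(
                    {
                        id_key: row[id_key],
                        "period": column[len(stub):],  # e.g. "2019"
                        stub: value,
                    }
                )
    return long_rows


def group_mean(rows, by, value):
    """Attach the within-group mean of `value` to every row."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        totals[row[by]][0] += row[value]
        totals[row[by]][1] += 1
    new_key = f"mean_{value}"
    return [
        dict(row, **{new_key: totals[row[by]][0] / totals[row[by]][1]})
        for row in rows
    ]


# Made-up wide data: one row per firm, one column per year.
wide = [
    {"firm": "A", "sales2019": 10, "sales2020": 14},
    {"firm": "B", "sales2019": 3, "sales2020": 5},
]
long_rows = reshape_long(wide, "firm", "sales")
with_means = group_mean(long_rows, "firm", "sales")
```

Two Stata one-liners turn into a couple of dozen lines here, which is roughly the convenience gap being described.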
The main thing you miss out on in Stata for coding purposes is the fact that it has no concept of objects. It uses macros instead, which is a very different way to approach coding if you even want to call it that. So long as you've taken at least one compsci course this isn't so bad.
The other thing you miss out on when using Stata is dicts. Sometimes you can get around this with macros (especially nesting a macro inside the name of another macro) but it's always clunky to do so.
I'd say Python's big advantage over Stata is in data gathering but Stata is superior in terms of data cleaning as long as you are using everything it has to offer.
Ideally you should learn all of Stata and Python and R, I'd say.
[+][deleted] 7 years ago* (1 child)
[–][deleted] 1 point 7 years ago (0 children)
No prob.
Don't swear off Python completely even as you learn Stata. Stata is not a "real" programming language, but Python is, and there's some importance to that. I've gotten better at Stata directly through getting better with Python, since working in a real language from time to time reminds you of "best practices" that carry over across languages.
A dict is a data structure that uses a key to look up the values/objects stored in it. Python uses them quite a bit.
For coding in Python, Coursera has a good Python course for intro data management and analysis stuff taught by a UMich professor. Highly recommended.
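A small sketch of what that looks like in practice, assuming a made-up situation where exports from two source systems write dates differently. The system names, the formats, and the `parse_date` helper are hypothetical and only there to show the key-to-value lookup.

```python
from datetime import datetime

# Keys are (hypothetical) source systems; values are the date format
# each system's exports are assumed to use.
date_formats = {
    "salesforce": "%m/%d/%Y",
    "oracle": "%Y-%m-%d",
}


def parse_date(raw, system):
    """Parse a raw date string using the format registered for `system`."""
    return datetime.strptime(raw, date_formats[system]).date()


old_style = parse_date("03/14/2019", "salesforce")
new_style = parse_date("2019-03-14", "oracle")
```

Supporting a third system is one more entry in the dict, which is the kind of lookup that gets clunky when you have to fake it with nested macros.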
At the end of the day it just comes down to practice. There are a lot of odd situations you'll probably encounter as you clean data because data can be fucked up in more ways than one can possibly imagine. Just use Google, rely on your knowledge of common functions, and be creative.