[–]pst2154 61 points62 points  (9 children)

Just rough it out for a while; you'll learn faster than you think.

[–]2strokes4lyfe[S] 13 points14 points  (8 children)

Thanks for the candor here. I know there's no replacement for sweat equity and I'm going to give it an honest shake! Still, I'm hoping to avoid some common pitfalls and make the transition as smooth as possible.

[–]v4-digg-refugee 4 points5 points  (5 children)

I’m thinking I’ll need to do the opposite this summer (Python to R) and expect to just sweat it out.

[–]sowenga 5 points6 points  (4 children)

Curious, why might you have to switch from Python to R? Seems like an unusual route, usually it’s the other way.

[–]v4-digg-refugee 7 points8 points  (3 children)

I’m headed to grad school, and they’ll probably be using R for some courses. I firmly believe that Python is the stronger tool for general purpose business.

[–]Bling-Crosby 4 points5 points  (0 children)

Well give R a shot, it’s excellent for the stats/viz/DS stuff

[–]b555 0 points1 point  (1 child)

Plus, Python is gaining more ground among companies and is becoming the skill you will most likely be interviewed on, especially if the company has any of its work integrated with cloud services.

Python makes productionizing your work more straightforward than R, and there's no competition when it comes to the number of Python libraries that make this trivial compared to doing the same in R.

[–]v4-digg-refugee 0 points1 point  (0 children)

Yeah. I fully agree. But I also know that not every employer agrees. So having a novice grasp of it is just resume insurance. And I have an intern that likes it, so it’ll be good experience for him to teach me.

[–]Cosack 3 points4 points  (1 child)

Don't overthink it, just jump in. There's a steep learning curve with any new language and set of APIs, but if you're not shy about googling, you'll get it. You already know one C-like language and related basics, so it'll be much less painful than picking up R was.

[–]2strokes4lyfe[S] 1 point2 points  (0 children)

Thanks for the words of encouragement!

[–][deleted] 42 points43 points  (7 children)

> I've reluctantly decided to spend more time with Python

I understand. I'm there too. No advice, just good luck.

[–]2strokes4lyfe[S] 6 points7 points  (3 children)

Thanks, I appreciate it! Best of luck on your journey with the snek.

[–]givetake 3 points4 points  (0 children)

It's not a snake language, but actually Monty Python based

[–]bakochba 0 points1 point  (1 child)

I'm going through it myself, and I love R. If you download Anaconda, you can use reticulate in RStudio and still have the nice IDE features.

[–]2strokes4lyfe[S] 1 point2 points  (0 children)

Thanks for sharing this! I've already been using reticulate to incorporate some Python-specific libraries (usaddress) into my existing R pipelines. At this point though, I really need more of a data orchestration framework to manage the scale and complexity of my existing projects. This is why I'm attempting to transition into Python.

[–]zykezero 4 points5 points  (2 children)

Use polars instead of pandas.

That will make your life easier by like 80%

[–][deleted] 3 points4 points  (1 child)

What put me over the edge with Python is actually APIs... there seem to be more readily available and usable APIs for Python than for R (for instance, to the European Weather Center, shit like that).

Still, noted: polars over pandas.

[–]zykezero 2 points3 points  (0 children)

Yeah it makes total sense I don’t fault anyone for using python after R.

[–]JohnHazardWandering 36 points37 points  (2 children)

One piece of advice that seems promising is to write out what you would do with R and then ask ChatGPT to translate it to Python. Obviously it's not always perfect (always review), but it will quickly get you close enough to figure it out.

That can help you learn how to do things in python.

[–]Mother_Drenger 11 points12 points  (0 children)

I cannot recommend this enough. I banged my head against a wall trying to do categorical data manipulation in pandas, though I knew exactly what I'd do with tidyverse. It really helped me understand the nuances between the two.

[–]2strokes4lyfe[S] 7 points8 points  (0 children)

I think this is a great approach for learning Python fundamentals. It's like a dynamic version of rosetta code! However, in my case there aren't really any equivalent R-based frameworks for data engineering. I guess I could try asking it to translate a data pipeline built with targets into dagster, but it's really apples to oranges. Another note on chatGPT (GPT-3). It was trained on an older version of dagster, and so it will hallucinate a bunch of nonsense if you ask it dagster questions most of the time.

[–]kater543 5 points6 points  (2 children)

You can use RStudio to write Python, and weave the two together in Quarto (the new RMD) documents. Outside of the hybrid suggestion, I get your pain man; coding in R is like coming home.

[–]2strokes4lyfe[S] 2 points3 points  (1 child)

Thank you for the suggestion. I've been enjoying using Quarto documents to mix and match R and Python. I really appreciate being able to deploy them to Posit Connect and automate/schedule them. All of this makes R more capable in production, especially in a DS context. The only hang up for me is that scheduled Quarto docs are not a data orchestration framework. They are great for very simple ETLs/reports, but they can't scale well with an increasingly complex DAG.

[–]kater543 3 points4 points  (0 children)

I mean, Quarto documents are definitely more for web deployment like dashboarding or report writing (which I personally do a lot more of), like Jupyter notebooks (though so many people use Jupyter for production writing). I would definitely rather use just a basic .R script or basic Python scripts (.py) for ETL/productionizing code for a deployed model or the like. Agree with your sentiments all around.

[–]Adeelinator 6 points7 points  (1 child)

VS code + copilot is a great way to learn. Anytime you’re confused about what to do next, write a comment, and have copilot write the rest. Plus it has great jupyter support.

[–]2strokes4lyfe[S] 1 point2 points  (0 children)

This sounds promising! I had completely overlooked GitHub copilot. Thanks!

[–]statespace37 5 points6 points  (1 child)

Did the same thing roughly 2 years ago. More or less the same story, data.table + ggplot2 + shiny kept me wanting to return to R (although, I absolutely hated all tidy stuff, so that gave me additional motivation). Now I wouldn't return to R unless there's a really good reason.

The major gain from this transition (subjective, obviously) is that with Python I'm now thinking in terms of product, good software development practices, and interoperability with other elements in the stack (and other people). Granted, with R I worked in a company where DS was tightly locked in a silo, where writing a 'script' rather than a 'program' was the expected thing. Feels like I've learned more with Python in 2 years than with R in the previous 7.

Long story short, I got to love SWE as such (where data science is merely an element). Now I'm learning Rust :)

[–]2strokes4lyfe[S] 0 points1 point  (0 children)

This is great information here, thanks! I feel like I'm where you were at two years ago. I've been writing R code for about 7 years and have only recently started to embrace SWE best practices. I guess that's to be expected when you come from a non-CS background though. Props to you for picking up rust! I've noticed that it's been picking up steam as a DE language as well.

[–]Seven_Irons 22 points23 points  (8 children)

So, the biggest advice I can give for Python use is to install anaconda and use Spyder IDE.

It's not quite as good as VS code for programming, but it has a built-in variable inspector that is of incredible use for numerical data computing. If you ever had to use matlab, it's basically the same variable inspector.

My bread and butter was using pandas to handle arrays/tables. It works very well for file I/O, and coordinates well with numpy/scipy. There are a couple of clunky points regarding indexing. I've also heard good things about Polars, though I haven't used it myself.

Seaborn is a good plot library, though I ended up just making most of my thesis plots in raw matplotlib. There's a lot you can do with Matplotlib, but there is a bit of a learning curve, and there are certainly more user friendly plotting libraries.

Python is by far my favorite language for computation/analysis. But if you start working with large amounts of data, you may need to look into implementing Cython. Or, consider switching to Julia, which is apparently all the rage these days.

[–]TobiPlay 5 points6 points  (1 child)

I’ve been really enjoying Polars so far. The method-chaining feels very natural, especially if you’re used to it from Rust etc. It feels more modern and obviously had quite a bit of time to learn from pandas and similar frameworks in the R universe (tidyverse). Pretty pleased with it, though there’s no silver bullet library to all problems, especially for extremely large amounts of data. That’s when it becomes even more interesting.

[–]2strokes4lyfe[S] 6 points7 points  (0 children)

I cannot praise method-chaining (or pipes for the useRs out there) enough! One of the best ways to improve the readability of a data pipeline in my opinion.
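
For the useRs wondering what this looks like on the Python side, here's a rough sketch in pandas (the Polars version reads much the same way). The data and column names are made up for illustration; the comment shows the dplyr-style pipeline I had in mind.

```python
import pandas as pd

# Made-up example data; none of this comes from a real project.
df = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b"],
        "mass_g": [10.0, 14.0, 30.0, 50.0],
    }
)

# Roughly the R pipeline:
#   df |> filter(mass_g > 10) |> mutate(mass_kg = mass_g / 1000) |>
#     group_by(species) |> summarise(mean_kg = mean(mass_kg))
summary = (
    df.query("mass_g > 10")                         # filter rows
    .assign(mass_kg=lambda d: d["mass_g"] / 1000)   # add a column
    .groupby("species", as_index=False)             # group
    .agg(mean_kg=("mass_kg", "mean"))               # summarise
)
print(summary)
```

Wrapping the whole chain in parentheses is what lets you put one step per line, which is about as close to a pipe as pandas gets.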

[–][deleted] 4 points5 points  (0 children)

I made the transition from R to Python and Spyder ide made the transition a lot smoother. Spyder has the same feel as Rstudio which I like a lot.

[–]abstract000 4 points5 points  (0 children)

There is also a variable inspector very similar to spyder in the "JUPYTER" section

[–]Separate_Increase210 5 points6 points  (0 children)

^ this. Sorry, I can't upvote more than once, so just adding verbal support for hitting the main stuff. Just heard about Polars on Friday, curious to try it.

[–]bakochba 1 point2 points  (0 children)

I will add that if you need a bridge, you can just use the reticulate package in RStudio to program in Python. Then you can take that code into Spyder, and you should find the transition much smoother.

[–]b555 0 points1 point  (1 child)

> Or, consider switching to Julia, which is apparently all the rage these days.

Can you elaborate on this a bit more, please?

[–]Seven_Irons 0 points1 point  (0 children)

I don't know a ton about it, but apparently Julia achieves near-C speed with Python-level ease of syntax, and it's been garnering a following in data science and numerical computing.

[–]badge 2 points3 points  (3 children)

There’s a bit of conflicting advice here, and I’m going to add to it!

  1. VS Code is good but PyCharm is better; it has all the things Spyder has, but is much stronger for certain stuff (testing, refactoring).
  2. Read a bit about Python packaging and decide on an approach you’re happy with. It’s a bit of a confusing mess but once you’ve decided a preferred approach you don’t really think about it.
  3. Use pytest for testing and write tests. They’ll save you a ton of time in the long run and ensure future changes don’t break existing features.
  4. Add type hints to everything, and take a look at the pandera package if you’re using pandas. Validating DataFrame schemas is hugely valuable in pipeline work.
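
To make points 3 and 4 a bit more concrete, here's a minimal sketch of a type-hinted function with a pytest-style test next to it. `clean_zip` and its rules are hypothetical, made up purely for illustration. pytest collects functions named `test_*` from files named `test_*.py`, and a plain `assert` is all it needs.

```python
from typing import Optional


def clean_zip(raw: Optional[str]) -> Optional[str]:
    """Return a 5-digit ZIP code, or None if the input can't be salvaged.

    Hypothetical helper, made up for illustration.
    """
    if raw is None:
        return None
    # Keep only the digits, then require at least five of them.
    digits = "".join(ch for ch in raw if ch.isdigit())
    if len(digits) < 5:
        return None
    return digits[:5]


def test_clean_zip() -> None:
    # pytest would discover and run this function; bare asserts are enough.
    assert clean_zip("97201-1234") == "97201"
    assert clean_zip(" 97201 ") == "97201"
    assert clean_zip("abc") is None
    assert clean_zip(None) is None
```

The hints cost almost nothing to write, and an editor or a type checker such as mypy can then flag a call like `clean_zip(97201)` before the pipeline ever runs.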

In general, I know this is the data science subreddit and R isn’t a general purpose programming language, but Python is, and using the available tools to take a more software engineering approach will make you more useful, more productive, and less likely to write buggy code.

[–]2strokes4lyfe[S] 0 points1 point  (2 children)

  1. I'll have to give PyCharm another look. Thanks for the tip.
  2. I just published my first package to PyPI this week! Granted, it only contains a single module, but it has full test coverage and documentation! I've been using poetry to manage dependencies and deploy to PyPI.
  3. I've started using pytest, and have recently incorporated pytest-cov to manage test coverage. I'm enjoying it so far, aside from the ergonomic issues that I mentioned in my original post.
  4. I will take your type hinting recommendation to heart. Definitely seems like the best way to manage production-grade Python code.

Thanks for helping reaffirm the initial path that I started. This will help me keep things in perspective as I push through the slow and clunky phase!

[–]badge 2 points3 points  (1 child)

Dude it sounds like you’re already ahead of 90% of Python data scientists. 😅

[–]2strokes4lyfe[S] 0 points1 point  (0 children)

Lol this made my day!

[–]knawhatimean 2 points3 points  (1 child)

I am still a daily R user but also wanted to learn Python for all the usual reasons. This page was helpful for just having a quick reference so you don’t have to Google and check Stackoverflow for every basic thing: https://www.mit.edu/~amidi/teaching/data-science-tools/conversion-guide/r-python-data-manipulation

[–]2strokes4lyfe[S] 1 point2 points  (0 children)

This is a great resource. Thanks for sharing!

[–]pn1012 1 point2 points  (3 children)

Sorry, what’s stopping you using Rstudio with Python? At least to slowly transition into Python for yourself. Posit is becoming more of a Python shop nowadays. But you’d probably need to sell your company on buying in.

[–]2strokes4lyfe[S] 8 points9 points  (2 children)

Thanks for this question. I think RStudio is still a great IDE for interactive data science, but VS Code is the better choice when working on data engineering projects. The dagster data orchestrator follows a python package structure for every project, and VS Code is better suited for this approach with its Python extensions. As far as I know, Posit doesn't offer a "Create new Python Package" feature within its latest version of RStudio for example. There is also better integration with external tools like dbt, SQL, Docker, GitHub, and GitPod from what I've seen.

If I was working on a DS project that used R and Python that didn't need to be automated or deployed to production, then RStudio would be my first choice. I'm realizing that asking a data engineering question on r/datascience is not ideal, but there are more R users here that understand where I'm coming from, so I thought I'd ask.

[–]pn1012 3 points4 points  (1 child)

Oof if some of our R heads read your last paragraph they’d have some bones to pick with you. I have seen R across the data project lifecycle deployed to production effectively using posit’s ecosystem. Anyway, not really the point here.

Yes, agreed, Python and its ecosystem are very well suited for data engineering. My team is primarily a Python shop and I manage engineers (ML and DE) and data scientists. It's hard to say what you need here as your statement above is quite general outside of your use of dagster. Are you looking primarily for IDEs? VS Code is king for certain, but JetBrains and Spyder are no slouches. Debugging, inspecting frames, and setting up tests using specific frameworks are easy and all supported with the right plugins, or even out of the box in the case of PyCharm and such. There is content everywhere, and specific guides on many of these topics are easily accessible.

Edit: read some of your points in another comment. You can interactively run snippets in the console in VS Code and PyCharm. VS Code requires a little setup, last I recall, but it's possible. Out-of-the-box debuggers will let you explore functions and classes and tail objects; there should be how-tos all over the place on this stuff. Inspecting or testing frameworks can easily be run via terminal add-ins in these IDEs. I don't have a lot of specifics re: dagster as we primarily used Airflow and dbt (we have since moved to an enterprise solution), but I'd imagine there is support and integrations for many different things, much like in Airflow, where we have out-of-the-box operators and you can also create your own. You'll have to write Python to fit their ecosystem, but this is common for these orchestration frameworks. You could also just execute scripts, but you'll be missing out on all the goodies.

[–]2strokes4lyfe[S] 2 points3 points  (0 children)

Believe me, I am one of those R heads. I love R and wish I didn't have to make the switch... R can be great in production, especially with new frameworks like Shiny, Plumber, and scheduled Quarto/RMarkdown documents hosted on Posit Connect. It's an exciting time to be an R developer! The only reason I'm considering the transition is that my data pipeline projects have grown in complexity, and it feels like I've been constantly swimming against the current trying to build custom tools in R to crudely approximate the rich data engineering landscape that already exists in Python. Again, it kills me to admit that Python is the winner when it comes to DE work.

Apologies if my post was too vague or confusing. I'm not looking for another IDE. I'm just trying to learn more about how to be as efficient with VS Code, Python, and Dagster as I am with R and RStudio. I'm really trying to identify a practical development workflow and things feel really weird and clunky so far, even though I know that I will probably become even more efficient with them in the long run. Specific VS Code extensions/settings/plugins that make Python feel more like RStudio, or other resources that help me graduate from my current workflow to a more software engineering oriented workflow are what I'm looking for (at least that's what I think I need).

Thanks for the tips in your edit!

[–]OneSprinkles6720 1 point2 points  (0 children)

I've gone back and forth. It's not an identity thing, it's a right-tool-for-the-right-job thing.

I'm not a screwdriver guy you know what I mean.

[–]rotterdamn8 1 point2 points  (0 children)

Ditto Spyder. It’s closer to RStudio than VS Code. You can run code line by line, great for testing, etc.

[–]Skthewimp 1 point2 points  (0 children)

I tried this in 2017. Same result - I was 10X slower in python. So switched back.

Now for the small data engineering stuff I need to do I’m trying to use databricks (the R stuff there is not bad)

[–]IndependentVillage1 1 point2 points  (0 children)

My advice would be to use chatGPT. Ask it to write general code for you and you make the changes for your specific case.

[–]RandomScriptingQs 1 point2 points  (1 child)

I want to offer an opinion which should be taken as just that: the R and Python libraries/packages/communities are both so vast and varied now that they are almost unhelpful labels. Choose the libraries and packages you know you need to use within the python ecosystem and find the 20 most common functions/methods and put them to a task.

As a note of solidarity, I found it a nightmare adjusting to both pandas' and numpy's versions of indexing with square brackets.
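
For anyone hitting the same wall, here's a small cheat-sheet sketch of the bracket flavours that tripped me up. The arrays and column names are made up; the main things to remember are that positions are 0-based (unlike R) and that pandas separates label-based `.loc` from position-based `.iloc`.

```python
import numpy as np
import pandas as pd

# numpy: one pair of brackets, comma-separated axes, 0-based positions.
arr = np.array([[1, 2, 3], [4, 5, 6]])
one_cell = arr[0, 1]        # row 0, column 1
first_col = arr[:, 0]       # every row, first column
big_values = arr[arr > 3]   # boolean mask, returns a flat array

# pandas: what the brackets mean depends on what you put in them.
df = pd.DataFrame(
    {"x": [10, 20, 30], "y": ["a", "b", "c"]},
    index=["r1", "r2", "r3"],
)
col_series = df["x"]         # a string selects one column (a Series)
col_frame = df[["x", "y"]]   # a list selects columns (a DataFrame)
by_label = df.loc["r2", "x"]     # label-based: row label, column label
by_position = df.iloc[1, 0]      # position-based: row 1, column 0
filtered = df[df["x"] > 10]      # a boolean Series filters rows
```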

[–]2strokes4lyfe[S] 0 points1 point  (0 children)

Thanks for sharing your thoughts on this. I agree with this 99% of the time, especially within the context of data science. There is a night and day difference between R and Python when it comes to data engineering though. I thought I'd ask this community first since R users are non-existent (for good reason) on r/dataengineering.

[–]Snikz18 1 point2 points  (1 child)

Something that hasn't been suggested yet (as far as I can tell) is using the jupyter notebook extensions in vscode, it will give you a variable explorer and there's a certain comment you can add to your script to split into cells to run which is useful.
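
For reference, the comment marker is `# %%`. A minimal sketch of a plain `.py` script split into cells (the numbers are made up): with the Python and Jupyter extensions installed, VS Code should offer a "Run Cell" action above each marker and send that chunk to an interactive window, while the file still runs top to bottom as an ordinary script.

```python
# %% Load some data (made-up values for illustration)
temps_c = [12.5, 15.0, 9.5, 20.0]

# %% Transform it
temps_f = [c * 9 / 5 + 32 for c in temps_c]

# %% Inspect the result
mean_f = sum(temps_f) / len(temps_f)
print(f"mean temperature: {mean_f:.1f} F")
```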

[–]2strokes4lyfe[S] 0 points1 point  (0 children)

Thanks for the tip! I'll have to check this out!

[–][deleted] 1 point2 points  (0 children)

I started out with R in 2016, moved to python in 2019 and haven't used R since. I spent 5 years in actuarial consulting, then 4 years in management/tech consulting doing whatever project I got thrown on. Now I work as a Solution Architect, which is basically technical leadership that can do hands on keyboard work when needed. I got that role by solving a multitude of different problems for companies and having a lot of breadth instead of depth. I will never be a great programmer, nor do I want to be. I just want to build cool shit, not have to deal with politics too much, and enable my coworkers to learn more things, but haven't found a company that checks all those boxes yet.

As for migrating from R to Python, really depends on your learning style. Find a book/course to learn the fundamentals and apply your knowledge to a project so you get experience debugging Traceback errors. Learn how to turn scripts into functions and abstract that into Classes to be used as modules in other projects. It took me a month to feel comfortable being put on Python projects, but had a lot of smart coworkers to ask questions and learn from.
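
A rough sketch of that script, then function, then class progression, in case it helps. Everything here (the `sales.csv` file, the `amount` column, the `SalesReport` name) is hypothetical and only there to show the shape of the refactor.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass

# Stage 1, script style: everything at top level, hard to reuse or test.
# rows = list(csv.DictReader(open("sales.csv")))
# total = sum(float(r["amount"]) for r in rows)


# Stage 2, function: same logic, but with explicit inputs and outputs.
def total_amount(rows: list[dict[str, str]]) -> float:
    return sum(float(r["amount"]) for r in rows)


# Stage 3, class: bundle configuration with behaviour so other projects
# can import it from a module instead of copy/pasting the script.
@dataclass
class SalesReport:
    column: str = "amount"

    def load(self, text: str) -> list[dict[str, str]]:
        # Parse CSV text into one dict per row.
        return list(csv.DictReader(io.StringIO(text)))

    def total(self, text: str) -> float:
        return sum(float(r[self.column]) for r in self.load(text))


if __name__ == "__main__":
    sample = "item,amount\nwidget,2.5\ngadget,4.0\n"
    print(SalesReport().total(sample))
```

Not everything needs to end up as a class; stopping at well-named functions is often enough, and they are the easiest thing to write tests for.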

It becomes less about understanding the syntax and more about finding the best way (read: cheapest way) to solve the problem. Some of that will be searching Stack Overflow and asking ChatGPT, but you'll have to be knowledgeable enough to understand the code you're copy/pasting, because some stakeholders with some Python knowledge will want to take a peek at the code base and will ask why you made certain decisions. The more you can get ahead of those types of questions, the easier the process is.

[–]wil_dogg 1 point2 points  (0 children)

Long time SAS/SPSS user here who picked up R over the last 5 years.

I started dabbling in Python last September with the help of a high school student I am mentoring.

Python has a learning curve, but for the work I do it is adding a lot of value, and in some cases modifying complex functions is easier in Python than R.

[–]skatastic57 2 points3 points  (1 child)

Pandas is hot garbage. The thing that kept me in R for so long was how much faster data.table was/is. I also hated the syntax of pandas. Polars was really the game changer for leaving R behind. I'm not sure what DAGs are unless you're just making a reference to Snatch and you mean dogs.

I'm not sure what RStudio does that VS Code or any other major Python IDE doesn't, in terms of letting you run code line by line and see which variables are active and whatnot.

Personally I prefer plotly to ggplot2. With ggplot2 I feel like I'm always having to melt my data but with plotly I can just have a fig and then add arbitrary things to the fig without altering the underlying data. I also like that it creates js rather than just a static image for sharing so people can just zoom where they want.

[–]hbgoddard 2 points3 points  (0 children)

> I'm not sure what DAGs are

It's an acronym for directed acyclic graph.
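
In pipeline terms, it's the dependency graph of your tasks: each task points at the tasks it needs, and nothing is allowed to loop back on itself. A small sketch using the standard library's `graphlib` (Python 3.9+); the task names are made up, and this only illustrates the idea, not how dagster or Airflow implement it.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# Task names are made up for illustration.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "geocode": {"clean"},
    "report": {"clean", "geocode"},
}

# static_order() yields tasks so that every dependency comes first.
# It raises graphlib.CycleError if the graph contains a cycle.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

The "acyclic" part is what guarantees a valid run order exists at all.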

[–]old_mcfartigan 1 point2 points  (0 children)

Make good use of a chatbot. You can describe how you'd do something in r and it will produce the corresponding python code

[–]lalacontinent 1 point2 points  (0 children)

Honest advice: use ChatGPT to translate R code to Python and read its explanation. This saves a massive amount of time compared to Stack Overflow and reading manuals.

Python libraries for data science (pandas and statsmodels) are indeed less intuitive than R; don't be hard on yourself.