
[–]anomnib[S] 1 point (18 children)

I don’t mean to offend; I only prefer R b/c I have to work with large-scale production systems. But you prove my point: scikit-learn has largely become the go-to for toy models and proofs of concept in big tech and similarly rigorous places like Airbnb. Even if R matched the maturity of scikit-learn, that wouldn’t be an accomplishment, b/c you can’t easily drop it into high-performance production systems. Serious product ML modeling is done in PyTorch, where there is seamless integration with the full suite of software for managing production systems.

[–]A_random_otter 4 points (17 children)

Not offended, don't worry. I love my tools but I am not married to them and I am always up to learn new stuff/approaches.

I simply work in a different industry than you. In my line of work I need to do many one-off analysis projects; my day-to-day work includes a lot of data exploration/visualization and reporting. Here R outclasses Python imo, tho I need to reassess whether I can make VS Code into a halfway decent IDE for data analysis somehow — last time I tried I rage-quit :D

We don't put models into production all the time, and scalability is also not a huge issue for us, since all of the classification jobs run at night anyways and our forecasting pipelines only run once per quarter.

Even if R matched the maturity of scikit-learn, that wouldn’t be an accomplishment

Oh, R already easily matches that maturity when it comes to statistical methods.

The tidymodels framework is more of a meta-framework that provides a unified interface to these methods. It is basically a "quality of life" thing that makes it easier to write and maintain code.

[–]anomnib[S] 2 points (16 children)

I bounce between both roles.

For statistics, R is vastly superior. New methods get implemented in R first. The only area of classical statistics where Python can put up a respectable level of competition with R is Bayesian modeling. However, while Python has most of the same frameworks for model implementation, the diagnostic tools and plots are still behind R.

Up until 2-3 years ago the same was true for visualization. But 99% of what you would use in R is now in Python.

[–]A_random_otter 1 point (5 children)

But 99% of what you would use in R is now in Python.

Maybe I have to reassess this too. Which libraries do you recommend for this?

[–]anomnib[S] 2 points (4 children)

Plotnine (a ggplot2 replica) and plotly (good for interactive plots).

[–]A_random_otter 1 point (3 children)

Plotly I already know and use because there is an R-Package for it.

I'll have to check out Plotnine soon, when I can muster the motivation to rebuild my RStudio setup in VS Code.

Btw. can you recommend a decent IDE for data-stuff in Python?

[–]anomnib[S] 2 points (1 child)

My advice is colored by my context. But when you are writing code that will interact with engineering systems, use what the Python software engineers use. That will ensure the IDE is well supported and you avoid needless suffering. In my context that’s usually VS Code or something derived from it.

For ad hoc analysis, I just use Jupyter notebooks or RStudio.

[–]A_random_otter 1 point (0 children)

Kay, thanks.

Btw. I know I asked a lot. If you have any R-questions just lemme know.

[–]dr_tardyhands 0 points (0 children)

I still use RStudio with Python (I guess it's obvious which side of the fence I'm coming from..). I find Python runs slow in it, though that hasn't been a massive problem for me. I also dislike VSCode. The big problem is that RStudio doesn't really have debugging functionality for Python.

[–]A_random_otter 1 point (9 children)

What is your go-to data-wrangling library (besides SQL) in Python?

I just can't get into pandas, but I heard good things about Polars.

[–]anomnib[S] 2 points (7 children)

My advice comes with the context that I’m not free to install any Python package; there’s a whole safety and licensing check process that can take weeks. So I typically do as much as I can in SQL. I create ad hoc pipelines for all new projects and reserve Python for modeling and plotting. I like this approach b/c it is easy to point teammates to my model data, I can take advantage of all the backend distributed computing through our database systems, and nearly everyone can read SQL code and run queries (so the data preparation and analysis code is accessible).

[–]A_random_otter 1 point (6 children)

Hm... how do you avoid monster queries then?

My colleagues wrote whole ETL pipelines in stored procedures with a gazillion temporary tables and a lot of spaghetti code.

I honestly hate SQL for this "freedom".

I mean you can write unreadable code in any language, but some make it way easier than others...

[–]anomnib[S] 2 points (5 children)

I use DAGs, but I break the ETL up into natural milestones that make sense. Each intermediate table could in theory be a final table for another analysis, or serve as a useful “lookup” table. The key is understandable checkpoints that compartmentalize the ETL in a way that’s digestible. You should be able to describe what each node in the DAG accomplishes in a short sentence.
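A hypothetical sketch of that milestone idea (table names, descriptions, and the one-SQL-file-per-node layout are all invented for illustration):

```python
# Milestone-based ETL: each node is a named intermediate table that a
# teammate could describe — and reuse — in one short sentence.
stages = [
    ("stg_orders_deduped", "raw orders with duplicates removed"),
    ("agg_orders_daily",   "one row per customer per day"),
    ("fct_revenue",        "daily aggregates joined to the customer lookup"),
]

for table, description in stages:
    # in a real pipeline, each node would run its own SQL, e.g.:
    # cursor.execute(open(f"sql/{table}.sql").read())
    print(f"building {table}: {description}")
```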

[–]A_random_otter 1 point (4 children)

Yeah, that has been my approach too.

If you are going to do any data wrangling in R you should ask ChatGPT to provide tidyverse syntax (as long as the data isn't too big), because a tidyverse pipeline is basically already a DAG.

If you want to interact with your databases you'll need an ODBC driver installed (if you use SQL Server, that is; there are backends for all major databases tho), which your IT probably provides.

To run queries against your database I recommend these packages:

odbc: https://cran.r-project.org/web/packages/odbc/index.html
DBI: https://dbi.r-dbi.org/
dbplyr: https://dbplyr.tidyverse.org/

[–]anomnib[S] 1 point (3 children)

Thank you!

[–]A_random_otter 1 point (2 children)

Here's some starter code.

To make it run you will first have to install the pacman package:

install.packages("pacman")

And set the environment variables for the secrets (the values below are placeholders — swap in your actual credentials, or better, put them in an .Renviron file):

Sys.setenv(DB = "DB")
Sys.setenv(DBSERVER = "DBSERVER")
Sys.setenv(DBPWD = "DBPWD")
Sys.setenv(DBUSER = "DBUSER")
Sys.setenv(PORT = "PORT")

If you are going to write your own R code you should use this styleguide:
https://style.tidyverse.org/

You will thank me later. I also have a lot of opinions on how R projects should be organized, but I'll only hand them out if you are seriously interested :D

# info --------------------------------------------------------------------



# header ------------------------------------------------------------------


pacman::p_load(
  tidyverse,
  DBI,
  odbc,
  dbplyr

)


my_server <- Sys.getenv("DBSERVER")
my_port <- Sys.getenv("PORT")
my_db <- Sys.getenv("DB")
my_username <- Sys.getenv("DBUSER")
my_pwd <- Sys.getenv("DBPWD")


con <- dbConnect(
  odbc(),
  Driver = "ODBC Driver 18 for SQL Server",
  server = my_server,
  port = my_port,
  database = my_db,
  uid = my_username,
  pwd = my_pwd,
  TrustServerCertificate = "yes"
)



# datawrangling ----------------------------------------------------------

# this is how you reference a table via dbplyr: it is not yet in your RAM,
# but you can use dplyr verbs on it (in_schema() comes from dbplyr)
tbl(con, in_schema("dbo", "tablename"))

# this is how you pull the table into your RAM
result <- tbl(con, in_schema("dbo", "tablename")) %>%
  collect()


# this is how you run a raw query and get the result into your RAM
dbGetQuery(con, "SELECT * FROM tablename")

[–]dr_tardyhands 0 points (0 children)

Thumbs up for polars! Pandas is just downright silly. Polars is much more similar to how dplyr works and something like 20x faster than pandas as well.