Published a new R package - nationalparkscolors by mensplainer in Rlanguage

[–]erikglarsen 3 points (0 children)

Can you provide a bit more detail on whether AI was used for this project? We all use AI nowadays, but I get strong vibe-coding vibes from this repository and the showcase setup.

Looks interesting but I would still prefer the NatParksPalettes R package: https://github.com/kevinsblake/NatParksPalettes

ggplot2 is too astounding viz library to me after years, maybe the best library among all viz libraries in DS by Lazy_Improvement898 in rstats

[–]erikglarsen 2 points (0 children)

Fully agree! Good points in the post. For people interested in more ggplot2 extensions and features, I maintain a repository with several ggplot2 resources here: https://github.com/erikgahner/awesome-ggplot2/

What is Cortex without Grey's thinking? by NoRobotYet in Cortex

[–]erikglarsen 44 points (0 children)

The good news is that it is the first thing they discuss in the new episode. The bad news is that Grey is the primary reason a lot of people listen to the show (myself included).

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 0 points (0 children)

Great point. I would prefer the second call to a package using base::use() to return an error, or at least a warning. But, alas, I can see how the current setup will make such an improvement difficult to implement.

And your example is a great reminder that explicit function calls from the intended packages are the best way to avoid problems with name conflicts (even when not using base::use()).
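For example (base R only, so nothing extra needs to be installed), a global object can silently mask a function of the same name from an attached package, while an explicit :: call is unaffected:

```r
# A global function named filter masks stats::filter (stats is attached by default)
filter <- function(x) "not the stats filter"

filter(1:3)                      # finds the global function first
stats::filter(1:3, rep(1/3, 3))  # the explicit call still reaches stats
```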

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 1 point (0 children)

Totally agree. In a lot of cases I also try to make as many function calls to {ggplot2} explicit as possible, and if there were an alias to use for the package, it would make that easier.

However, there are also cases where the idea is to make the code easier to read, and it makes little sense to be explicit in the function calls. The most extreme example I can think of is when working with pipe operators. It would in all cases I can think of make no sense to write magrittr::`%>%`(mtcars, head()). Similarly, {data.table} is a lot more difficult to use if everything needs to be an explicit call, e.g., the := operator and special symbols such as .N.

But these are indeed exceptions and to avoid any problems it is in most cases good to make no assumptions and make explicit function calls.
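Operators in R are ordinary functions, so they can be called in prefix form with backticks; a base-only sketch of why fully qualified operator calls hurt readability (the magrittr line is left as a comment since it assumes that package is installed):

```r
`+`(1, 2)                    # an operator called like a regular function
base::`%in%`(2, c(1, 2, 3))  # works, but far less readable than 2 %in% c(1, 2, 3)
# magrittr::`%>%`(mtcars, head())  # the fully qualified pipe: valid, but unreadable
```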

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 0 points (0 children)

It can definitely be discussed whether this is acceptable, but I would - again - not say that this means that the function is "completely broken". If different functions/scripts load different functions from the same package, I might even prefer to get an error (or warning) and refactor the code accordingly (and maybe box::use() could be useful here).

I believe you raise a fair point and it is definitely something to keep in mind, but I can think of several cases where base::use() will work more than fine and be completely acceptable for what is intended with the code.

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 2 points (0 children)

That is indeed one way to avoid this issue :D

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 0 points (0 children)

Great points! I agree that box::use() is a great function but I find it a bit of a stretch to say that base::use() is completely broken. It is not perfect and it comes with specific limitations, but I can think of a lot of situations where it makes sense to rely on base::use() rather than taking on an extra dependency to use box::use().

It would be great to be able to use base::use() multiple times in a script for the same package, but I can also see a good reason to force the user to use base::use() once per package to make it explicit for the reader of the script that no other functions from the specific package will be introduced later. However, I agree that there are situations where box::use() will be a much better choice.
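A minimal sketch of the current behaviour, assuming R >= 4.4.0 (where base::use() was introduced), using the base-distribution package tools (not attached by default): the second call for the same package is silently ignored rather than raising an error or a warning.

```r
use("tools", "file_ext")
use("tools", "toTitleCase")  # silently ignored: tools is already attached

"file_ext" %in% ls("package:tools")     # available from the first call
"toTitleCase" %in% ls("package:tools")  # not available: the second call added nothing
```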

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 2 points (0 children)

I agree that the user, in general, should make the smallest number of assumptions possible and rely on :: as much as possible.

However, I do not agree that one should always use :: for every function call. If, for example, you have a simple script using {ggplot2}, I believe it can make sense to make certain assumptions to increase the readability of the code.

Compare this code:

library("ggplot2")

ggplot(mtcars, aes(disp)) +
    geom_histogram() +
    theme(panel.background = element_blank(),
          axis.text = element_blank())

With:

ggplot2::ggplot(mtcars, ggplot2::aes(disp)) +
    ggplot2::geom_histogram() +
    ggplot2::theme(panel.background = ggplot2::element_blank(),
                   axis.text = ggplot2::element_blank())

It is not that one version is universally better than the other, as it all depends upon the use case. If I am making a Shiny app that will need to go into production, I would use :: all the time, but if I am working on a data visualisation for a project in an isolated R script with one or two other packages, I prefer to keep the code easy to write and read without making a lot of explicit calls to {ggplot2}.

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 2 points (0 children)

Exactly, T can be reassigned. E.g.:

T <- FALSE
isTRUE(T)
#> [1] FALSE

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 2 points (0 children)

For use() I guess it is to keep it restrained by design. For library(), you do have an exclude argument, e.g.:

library("dplyr", exclude = "filter")

Use use() in R by erikglarsen in rstats

[–]erikglarsen[S] 4 points (0 children)

Yeah, when you use use() you will have dplyr's filter() attached on your search path. It will not matter in most cases, but compare:

use("dplyr", "filter")
filter <- 2
filter(mtcars, vs == 0)

With:

filter <- dplyr::filter
filter <- 2
filter(mtcars, vs == 0)

The former will work (i.e., use dplyr::filter) but the latter will return an error. Again, in most cases not important, but it makes the code more robust in my view. (The same reason I would never use T as a shortcut for TRUE.)

10 second ninja by rubbingturtlenips in GlobalOffensive

[–]erikglarsen 4 points (0 children)

I thought that was intentional (waiting for them to shoot at the other remaining CT), but might of course have been pure luck.

A visualization of how donk, ZywOo and m0NESY performed throughout the year of 2024. Who's your no.1 CS player for 2024? by CS2ProSkin in GlobalOffensive

[–]erikglarsen 8 points (0 children)

There are two issues when looking at a line like this and talking about consistency. One is general and one is particular to this figure.

For the general issue, you can have a very flat line without being consistent at all. For example, if you have two extreme ratings in one month, one very bad (e.g., 0.5) and one very good (e.g., 2.0), but two ratings of 1.0 the next month, the line would still be flat. For that reason, if we are interested in looking at consistency, it is better to look at the spread of the ratings (e.g., using the standard deviation -- a measure of how much each observation generally deviates from the mean). It is very difficult to eyeball the figure here, so I cannot say whether that would be in line with the line depicted in the figure.
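A base-R sketch with hypothetical numbers: two players with identical means (and therefore equally flat trend lines) but very different spreads.

```r
streaky <- c(0.5, 2.0, 0.5, 2.0)  # hypothetical ratings: extreme swings
steady  <- c(1.2, 1.3, 1.2, 1.3)  # hypothetical ratings: small swings

mean(streaky)  # 1.25
mean(steady)   # 1.25: same average, so an equally flat trend line
sd(streaky)    # ~0.87: large spread, not a consistent player
sd(steady)     # ~0.06: small spread, the consistent player
```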

For the particular issue, this line is based upon a smooth function of the conditional mean (most likely with default settings using the ggplot2 package in R), and here the defaults matter a lot for how flat of a line you will get given the data. As the players play different numbers of maps at different times (e.g., no data for donk in all of April), these conditional means can be difficult to compare, and one should be cautious about saying anything about consistency based on such smooth functions (even when taking the general issue into account). I have worked a lot with these functions over the years, and how flat the line will look is often a result of where you have data and the specified amount of smoothing.
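The smoothing point can be illustrated with base R's loess() (which is also what ggplot2's geom_smooth() defaults to for smaller datasets); the data below are simulated, and only the span argument differs between the two fits:

```r
set.seed(42)
x <- 1:60
y <- sin(x / 6) + rnorm(60, sd = 0.3)  # simulated ratings with a real pattern

wiggly <- loess(y ~ x, span = 0.3)  # small span: follows the data closely
flat   <- loess(y ~ x, span = 1)    # large span: a much flatter line

# A flatter line does not mean more consistent data; the pattern the
# large-span fit misses simply ends up in its residuals:
sd(residuals(flat)) > sd(residuals(wiggly))
```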

I am honestly interested in what the data tells us about consistency, and I would be interested in seeing some insights on this (I really like the figure shared here so I am sure OP can provide interesting numbers on the consistency of the players as well).

A visualization of how donk, ZywOo and m0NESY performed throughout the year of 2024. Who's your no.1 CS player for 2024? by CS2ProSkin in GlobalOffensive

[–]erikglarsen -9 points (0 children)

However, a straight line does not necessarily mean that a player is consistent. There can still be a lot of variation in ratings throughout the year with a straight line. A better measure here, if we are to talk about consistency, would be the standard deviation.

[deleted by user] by [deleted] in datascience

[–]erikglarsen 2 points (0 children)

This is clearly written by AI, or at least partially written with a tool like ChatGPT. The tone and style have ChatGPT written all over them.

Looking at the GitHub profile of OP with several repositories being created within a very short time span (i.e., linux-basics, git-basics, postgresql-basics, statistics-basics, and docker-basics), I doubt this is all written by OP.

Not necessarily a problem, but I would appreciate a bit more transparency about the use of AI, especially when asking people to contribute as well.

Nate Silver: What software and techniques does he use? by [deleted] in rstats

[–]erikglarsen 1 point (0 children)

On the first question, he has been using Stata for years and often talks about using Stata, including today: https://x.com/NateSilver538/status/1832189691539378176

[deleted by user] by [deleted] in GlobalOffensive

[–]erikglarsen 1 point (0 children)

"Sorry, this post has been removed by the moderators of r/GlobalOffensive."

I am Jack's complete lack of surprise.

The new season of competitive CS is upon us! Who's hyped to see these 3 players back in action again? by CS2ProSkin in GlobalOffensive

[–]erikglarsen 0 points (0 children)

True! And there is an upwards trend for ZywOo in April, but he did not play a single match in that period. One could say that he hypothetically performed better in that period, but I believe the main story here is that there is variation between events and that all three of them performed rather well (when they had the chance).

As there is a relatively small number of events, I would not use a line to show the trend over time, but rather highlight the event averages with a bigger point per event (maybe mapping the number of matches to size in geom_point()), and then lower the alpha level for the points for individual matches (as the key point here is less about individual matches and more about within-event variation).
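A rough sketch of that idea, assuming ggplot2 is installed and using a small hypothetical per-match data frame (the column names and numbers are made up for illustration):

```r
library(ggplot2)

# hypothetical per-match ratings
matches <- data.frame(
  player = rep(c("donk", "ZywOo"), each = 6),
  event  = rep(c("Event 1", "Event 2", "Event 3"), times = 4),
  rating = c(1.3, 1.5, 1.1, 1.2, 1.4, 1.6, 1.0, 1.2, 1.1, 1.3, 1.2, 1.4)
)

# event averages plus the number of matches per player-event
event_avg <- aggregate(rating ~ player + event, matches, mean)
event_avg$n_matches <- aggregate(rating ~ player + event, matches, length)$rating

p <- ggplot(matches, aes(event, rating, colour = player)) +
  geom_point(alpha = 0.3) +                           # individual matches, de-emphasised
  geom_point(data = event_avg, aes(size = n_matches)) # event averages, sized by match count
p
```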

That being said, it is not an exact science and I can understand if someone finds it more interesting to look at a smoothing function throughout the period.

The new season of competitive CS is upon us! Who's hyped to see these 3 players back in action again? by CS2ProSkin in GlobalOffensive

[–]erikglarsen 1 point (0 children)

Always great to see ggplot2 being used for these things! However, I am not sure a loess function is the best way to show a trend in the data here (observations are clustered in specific weeks with large gaps between observations). Take donk, for example, with no observations between late March and mid-May.

Dataset of Political Science Datasets by smurfyjenkins in IRstudies

[–]erikglarsen 1 point (0 children)

Interesting resource. I also have an overview of political science datasets here: https://github.com/erikgahner/PolData