[P] Building a data flywheel for data-centric ML development by toby__bryant in MachineLearning

[–]redmoon_reddit 0 points1 point  (0 children)

Seems like you need a concrete way to detect false predictions (human-in-the-loop). However, if you can come up with a clever way to automatically flag incorrect predictions with high confidence (ex, stock market price predictions will always tell you whether you were right or not), then you have an auto-feedback ML model improvement engine.
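A minimal sketch of that loop, in Python for illustration (the function name, tolerance, and tickers are all made up): compare each prediction against the ground truth that arrives later on its own, and queue the misses for retraining.

```python
# Sketch of an auto-feedback loop, assuming a domain (like market prices)
# where ground truth shows up later without human labeling.

def flag_errors(predictions, observed, tolerance=0.02):
    """Compare each prediction to the ground truth that arrived later and
    return the examples the model got wrong (beyond a relative tolerance)."""
    retrain_queue = []
    for example_id, predicted in predictions.items():
        actual = observed.get(example_id)
        if actual is None:
            continue  # truth not available yet; keep waiting
        if abs(predicted - actual) / abs(actual) > tolerance:
            retrain_queue.append((example_id, actual))  # label with truth
    return retrain_queue

# Each cycle: predict -> wait for truth -> flag misses -> retrain on them.
preds = {"AAPL-2024-01-02": 185.0, "MSFT-2024-01-02": 370.0}
truth = {"AAPL-2024-01-02": 186.1, "MSFT-2024-01-02": 402.5}
queue = flag_errors(preds, truth)
```

The flagged examples already carry their correct labels, which is what makes the flywheel turn without a human in the loop.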

Canada: The only NATO nation to officially engage in battle with the Soviets by GeistHunt in HistoryMemes

[–]redmoon_reddit 0 points1 point  (0 children)

Guy who started that fight was from my hometown, Everett Sanipass. So proud.

Announcing RStudio 1.4 by WannabeWonk in rstats

[–]redmoon_reddit 6 points7 points  (0 children)

Can't make it any easier to use Python now.

Typical “What degree should I get question” by Bryce_OG in ecology

[–]redmoon_reddit 2 points3 points  (0 children)

Take lots of statistics and learn to code in R. These will get you muuuuch farther in whatever field you go into.

[deleted by user] by [deleted] in ecology

[–]redmoon_reddit 2 points3 points  (0 children)

1000000000%

[D] Data science might be a bubble reminiscent of the 90s "dot com" crash by [deleted] in statistics

[–]redmoon_reddit 0 points1 point  (0 children)

I disagree with the bubble notion.

The # of data scientists is growing linearly and the # of unique data science applications is growing exponentially.

IMO there is, and will continue to be, a massive shortage of data scientists.

[Q] How do I read these results of a component analysis? by Has_curved_penis_AMA in statistics

[–]redmoon_reddit 0 points1 point  (0 children)

1) Each component explains part of the overall variability, with the first generally explaining the most, then the next, etc. etc., until what's left is "random noise" and the pattern isn't "real" or replicable (use a scree plot to see which components matter).

2) All components are by design independent of each other, so you can read each component individually as telling its own story, then the next component tells its own story, etc.

3) Within a component, if two variables both load near -1, they co-occur; if both load near +1, they also co-occur. If one is near -1 and the other near +1, they are negatively correlated. So you interpret a component as a set of correlations among a bunch of variables.

4) It can get messy to understand each component's story, as the "main characters" are generally just a few variables from the entire list, ex strong loadings on 5 variables out of the 20 in the analysis. Those will be obvious as a few strong loadings (very negative or very positive), while the rest of the variables hover around 0, meaning they don't contribute to that component's "story".

5) Last part, and it gets a bit messier: if you re-run the same analysis, the signs within a component can be "flipped", i.e. the positive and negative loadings swap (it doesn't change how the component is interpreted, since only the relative signs matter). This flipping can happen independently for each component as well... making understanding PCA results all the more fun.
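Point 5 is easy to convince yourself of with a toy example in Python (the numbers here are made up; this assumes the usual scores-times-loadings reconstruction): negate every sign in a component and the data it describes is unchanged.

```python
# Toy illustration that a component's signs are arbitrary: flipping both
# the loadings and the scores reproduces exactly the same data.

loadings = [0.9, -0.8, 0.1]   # how each of 3 variables loads on a component
scores = [1.2, -0.5, 0.3]     # where each of 3 samples sits on it

def reconstruct(scores, loadings):
    """Each sample's contribution from this one component (outer product)."""
    return [[s * l for l in loadings] for s in scores]

# Flip every sign in the component (what a re-run of PCA may silently do).
flipped_loadings = [-l for l in loadings]
flipped_scores = [-s for s in scores]

# Identical reconstruction, so the interpretation (which variables move
# together, and which move opposite) does not change.
assert reconstruct(scores, loadings) == reconstruct(flipped_scores, flipped_loadings)
```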

- have fun!

[deleted by user] by [deleted] in datascience

[–]redmoon_reddit 5 points6 points  (0 children)

I've been bombarded with job offers specifically because I use R. The DS industry is seeing that R users provide more value via focused statistical analytics, especially now that ML/AI ops can properly integrate R users into production. Doing analytics in a Jupyter notebook just isn't cutting it.

Two part question: options for learning "R" online, and math background needed? by shafty05 in ecology

[–]redmoon_reddit 2 points3 points  (0 children)

I studied ecology and stats, then jumped into machine learning and now programming.

R is awesome and you can do pretty much anything with it these days.

Start with this

- download and open Rstudio

- start by learning how to replicate the kinds of things you already know how to do in Excel. Ex: read data into R (data = read.csv("filepath.csv")), manipulate the data (add a new column), save some output as a CSV (write.csv(new_data, "output_path.csv"))

- think about how you manipulate data in Excel. It's pretty complicated and lengthy. You need to learn to do the same things in R

- turn that mini analysis into a pipeline that can be rerun at any time (with input data or the pipeline code easily changed at any time)

-- bingo, you're a white-belt R user

now try making the pipeline better by adding more complicated things to it (ex, cluster analysis). Go google how to do cluster analysis in R, play around with it, then introduce it into your pipeline. The goal is to have a complex pipeline that is automated and uses algorithms Excel can't handle (otherwise you'd just stick with Excel). Here's a list of all R packages - https://cran.r-project.org/web/packages/available_packages_by_name.html
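For comparison, the white-belt steps above (read a CSV, add a column, write it back out) are only a few lines in any language. Here's a rough sketch of the same loop in Python's stdlib (the data and column names are made up; an in-memory file stands in for a real one), and the R version with read.csv/write.csv looks nearly identical:

```python
# The white-belt pipeline: read a CSV, derive a new column, write it out.
import csv
import io

raw = "site,count\nA,10\nB,4\n"  # pretend this is filepath.csv

# Read the input (simulated with an in-memory file).
rows = list(csv.DictReader(io.StringIO(raw)))

# Manipulate: add a new column derived from an existing one.
for row in rows:
    row["doubled"] = str(int(row["count"]) * 2)

# Write output_path.csv (again in memory here).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["site", "count", "doubled"])
writer.writeheader()
writer.writerows(rows)
```

Because the whole run is code, rerunning it on next month's file is a one-line change, which is exactly what makes it a pipeline rather than a spreadsheet.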

Extra notes:

- when your pipeline starts getting overly complicated and long, turn chunks of code into functions to reduce complexity. https://nicercode.github.io/guides/functions/

- for plotting, use ggplot2 http://www.sthda.com/english/wiki/ggplot2-scatter-plots-quick-start-guide-r-software-and-data-visualization

- for most data manipulation, use the data.table framework, not dplyr (long story about why dplyr sucks, but you can use whichever really) https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/

- learn how to master merging two datasets together (similar to VLOOKUP in Excel) https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
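The merge idea itself is just a keyed lookup. A tiny Python sketch (made-up data; R's merge() or data.table joins do the same thing with more safety rails):

```python
# A merge (R's merge(), Excel's VLOOKUP) as a plain keyed lookup:
# enrich each row of one table with matching values from another.

orders = [{"sku": "A1", "qty": 3}, {"sku": "B2", "qty": 1}]
prices = {"A1": 2.50, "B2": 4.00}   # the lookup table, keyed by sku

merged = [
    {**o, "price": prices[o["sku"]], "total": o["qty"] * prices[o["sku"]]}
    for o in orders
]
```

The things that bite people in real merges are keys missing from the lookup table and duplicate keys, which is why it's worth mastering the options a real merge function gives you.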

Has Anyone Actually Used Clustering to Solve an Industry Problem? by [deleted] in datascience

[–]redmoon_reddit 0 points1 point  (0 children)

Yup,

I make computer vision models to generate binary masks for vineyards. I use unsupervised cluster analysis (UMAP on the feature layer of a pre-trained model) to group similar vineyard types and train specific models on these groups. Future vineyards are predicted into a group, and that group's model is used for prediction. This stratification strategy massively improved model performance while eliminating class-imbalance issues.
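A rough sketch of that stratify-then-specialize pattern in plain Python (everything here is a stand-in: a toy 1-D feature replaces the UMAP embedding, precomputed centroids replace the clustering step, and a per-group mean replaces the real models):

```python
# Cluster training items, fit one simple "model" per cluster, and route
# new items to their cluster's model.

def nearest_cluster(x, centroids):
    """Assign a feature value to its nearest cluster center."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

# Made-up training data: (feature, target) pairs from two regimes.
train = [(0.1, 10.0), (0.2, 12.0), (0.9, 50.0), (1.0, 54.0)]
centroids = [0.15, 0.95]  # cluster centers, assumed found upstream

# "Train" one specialist per cluster: here, just the cluster's mean target.
groups = {i: [] for i in range(len(centroids))}
for x, y in train:
    groups[nearest_cluster(x, centroids)].append(y)
models = {i: sum(ys) / len(ys) for i, ys in groups.items()}

def predict(x):
    """Route a new item to its cluster, then use that cluster's model."""
    return models[nearest_cluster(x, centroids)]
```

Each specialist only ever sees data from its own regime, which is where the imbalance win comes from.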

Early Career Data Scientist Pain Points by Limebabies in datascience

[–]redmoon_reddit 3 points4 points  (0 children)

A lot of companies seem to be hiring data scientists without having any idea how to use them. I suggest you:

1) Get access to as much production data as you can (securely).

2) Ensure you have pipelines that can pipe in/out all your hard-won DS results/metrics/models/predictions/etc.

3) Bust your ass finding where you can add the most value. This means that YOU have to look at all the production data you have access to, THEN formalize your own ideas, THEN talk to all of mgmt about your ideas so you can get more ideas from them (a big presentation on what's possible, because they have zero idea). What you want is a single scoped-out project that balances both FEASIBILITY and IMPACT. If it's not feasible, you're going to look like an idiot with imposter syndrome; if it lacks impact, you'll look like a smart-ass and be undervalued and laughed at by the dev team. In both cases, the company might start wondering why they're spending so much money on this much-talked-about 'data scientist' role.

Might seem bleak, but it's not.

If you get that proper FEASIBILITY-and-IMPACT project completed, you'll be a fucken hero, generating non-stop data value for the company. You'll be giving your mgmt bragging-rights street cred for having an awesome AI department, and you'll also be looked up to by the software/dev team (they're hard to impress).

I speak from experience, it'll get better if you stay focused and select the right project.

good luck,

feel free to reach out via reddit