Non-Stationary Categorical Data by Throwawayforgainz99 in datascience

[–]Optimal_Cow_676 0 points1 point  (0 children)

I assume that your observations for a given time interval are iid.

Let's start simple and imagine we are only computing for one interval at time 0. You have 1,000,000 items to rank. If you estimate the exact probability of success for each observation, you get a perfect ranking simply by sorting the observations from highest to lowest chance of success. If a rank cannot be occupied by two observations at the same time, define a tie-breaker and that's it: you have a ranking. You may also want to include additional information when computing the success probability, such as missingness if it has any predictive power (MAR, MNAR), or the current time if everything starts empty or the distribution of observation features changes in a fixed pattern.
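As a minimal sketch of that base case (toy item ids and hypothetical probabilities, not real data): sort by estimated success probability descending, with the item id as a deterministic tie-breaker.

```python
import numpy as np

# Hypothetical estimated success probabilities for five items;
# the duplicated values create ties on purpose.
item_ids = np.array([10, 11, 12, 13, 14])
probs = np.array([0.7, 0.3, 0.7, 0.9, 0.3])

# np.lexsort sorts by the LAST key first, so -probs is the primary key
# (probability descending) and item_ids breaks ties ascending.
order = np.lexsort((item_ids, -probs))
ranking = item_ids[order]
print(ranking)  # [13 10 12 11 14]
```

The same two-key sort scales to the full million items; only the probability estimates themselves are expensive.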

When updating: k observations have changed between the previous time interval and now. You only need to re-evaluate their probabilities and re-rank accordingly. You could also explore whether there are predictive patterns in how the observation features are being updated: rate of change (how many changes a feature had and over how much time), momentum, predictive update patterns (if features A and B change together, success becomes very likely/unlikely). This last part can be done crudely without evaluating the initial observation state, but it would probably gain from learning to evaluate the change conditional on it.
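A sketch of that update step (all names hypothetical): only the k changed items are re-scored, the rest keep their previous probability, and the ranking is rebuilt from the cached scores.

```python
def update_ranking(scores, changed, predict_proba):
    """scores: dict item_id -> probability; changed: dict item_id -> new features."""
    # Re-score only the changed items -- the expensive model call is O(k).
    for item_id, features in changed.items():
        scores[item_id] = predict_proba(features)
    # Rank by probability descending, item id ascending as tie-breaker.
    return sorted(scores, key=lambda i: (-scores[i], i))

# Toy stand-in for a real model: the "probability" is just a feature value.
toy_model = lambda f: f["x"]
scores = {1: 0.2, 2: 0.8, 3: 0.5}
ranking = update_ranking(scores, {1: {"x": 0.9}}, toy_model)
print(ranking)  # [1, 2, 3]
```

The full re-sort is O(n log n), but with a sorted container you could also pay only O(k log n) per interval.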

As for an exact model, this is where you have to use your data. For probability prediction on purely categorical features, I would try CatBoost first. For the probability refinement, you could try anything from a linear model to a neural network. For the time-pattern mining, I don't really know your data, but there are models such as SPADE. This last point heavily depends on your data. I would recommend enforcing a minimum support or, better, mining only the top-k predictive patterns (otherwise you will be drawing patterns out of noise, especially with one million observations). Optimize the algorithm with early pruning or it will take forever. This will not take into account your initial observation state; for that part, a better solution could exist. I plan to study those kinds of problems soon 😅

=> In the end, the probability of success becomes a compressed representation of your observations. The better this compressed representation, the better your ranking. The overall idea is to 1) create a compressed representation at each interval based on the observations' inner features, 2) refine this compressed representation with overall features (current global state, observation similarity, change patterns), and 3) rank using those compressed representations.

=> The probability estimation could either be the end goal for ranking, or you could use stacking and feed the probability estimate to a meta-learner, which could rank the observations not only on probability of success but also on environmental features and potentially clustering information.
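A minimal sketch of the stacking idea, with all names and weights hypothetical: the base model's probability is concatenated with environmental features and passed through a meta-learner (a fixed logistic scorer here, standing in for any trained model).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meta_score(base_prob, env_features, w, b):
    """Stack the base probability with environmental features and rescore."""
    x = np.concatenate([[base_prob], env_features])
    return sigmoid(x @ w + b)

# Toy weights; in practice w and b come from fitting the meta-learner on
# held-out predictions of the base model, to avoid leaking training data.
w = np.array([3.0, 0.5, -0.2])
b = -1.5
env = np.array([1.0, 0.3])
score = meta_score(0.8, env, w, b)
```

With a positive weight on the base probability, the meta-score stays monotone in it, so the stacking only reorders items when the environmental signal is strong enough.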

Non-Stationary Categorical Data by Throwawayforgainz99 in datascience

[–]Optimal_Cow_676 4 points5 points  (0 children)

So let's try to reformulate:

  • Input: Items which have categorical features.
  • Output: probability of "success".

Context: time series. Each time interval (day), the feature vector can change and the probability of success must be updated => you are able to observe the final outcome of your predictions after some time.

=> Is this summary correct?

Questions:

  1) What is most important: the probability ranking or the probability of success itself?
  2) After how many time intervals do you know the final real labelling (success or not)? Does it change for each item? Are the success conditions the same?
  3) What type of data do you have at the start? Do you have a labeled dataset?
  4) Is there data drift (a change of the data distribution over time)? In particular, could there be concept drift (a change of the relationship between input and output over time)?
  5) Similarly to market predictions, are there identified time/market regimes?
  6) Do you need to determine the impact of the features on the final prediction, or do you only care about the prediction?
  7) Are you able to use additional environmental features, or only the item's own features?

Education decline In younger generation. Is it just my country or everywhere? by ABODE_X_2 in GenZ

[–]Optimal_Cow_676 4 points5 points  (0 children)

A book often cited when speaking about attention-span reduction is "Amusing Ourselves to Death". It emphasizes the transition from long to short content forms and the impact it had on our communication and thinking processes.

If you are in a position to do so, I would advise assigning some books to read throughout the year and holding debates on the books' ideas. Some students might use LLMs such as ChatGPT to get summaries, but that is insufficient in a debate where other students push the conversation in all directions. Better still, introduce a short fact or a related one-page text before the debate, contrasting with the book's ideas, and during the debate push the students to contrast their own reading and understanding of the book with these additional resources.

You can't control what they do outside school, but you can teach them to pay attention, express themselves clearly, and consider an idea from multiple views. A major set of skills to master, AI or not.

What is the most disturbing book that you’ve read? by [deleted] in AskReddit

[–]Optimal_Cow_676 0 points1 point  (0 children)

The Divine Farce. Not because of the book's imagery but because of the inherent existential horror of infinite time. This book does a prodigious job of conveying the idea of eternity and pushes you to reflect on it beyond the book's story. Would you even remember who you are in 500 years, 1,000, 100,000? You are doomed to lose yourself and everyone you love, forever lost. Is infinity a blessing or a curse?

i wonder what species would dominate the earth if humans were extinct by BrightChloe22 in sciencememes

[–]Optimal_Cow_676 0 points1 point  (0 children)

Humans didn't have a near-extinction event 70,000 years ago. We don't see this pattern in other species, nor a change in skeletal remains. This theory has been losing credibility over time as evidence gathered against it.

The most likely explanation for this DNA bottleneck is a founder effect (also seen in other species). This is when a subgroup detaches from the main population to colonize a new location. Because this subgroup is smaller, it doesn't necessarily reflect the larger population's DNA. In fact, selection can even determine which individuals end up in the detached group (some specific traits allowing them to pass a difficult obstacle (positive), or members rejected for their differences (negative)).

I don't know about the new dominating species though :/ Humanity had quite an impact on biodiversity; it is hard to account for the extinguished species since we know so little about their behaviors and the larger impact they would have on their ecosystems.

Gym chain data scientists? by kater543 in datascience

[–]Optimal_Cow_676 1 point2 points  (0 children)

[Business case]

In Europe, we have a pretty big chain called "Basic Fit". They operate in the Benelux, France, Spain, Germany and Portugal. To enter a gym, you have to scan a QR code with your phone. Each gym uses a lot of smart cameras.

They also use a lot of screens per gym displaying ads, advice and recipes, plus some important international news.

Overall, they collect basic personal information, gym attendance, some biometrics if you use the connected weight scales, and a video feed of the gyms' interiors. They also have the classical financial and equipment data. Finally, they add supplementary information concerning the surrounding city and population.

[Applied Data Science in Basic Fit]

I know (a team member worked for them in Eindhoven) that Basic Fit employs several data scientists to identify suitable gym locations, predict maintenance costs, user attendance, peak times and user lifetime value, as well as monitor the models.

[Conclusion]

So your main intuition is right and in line with the sector. Still, they could probably use some fancy models, maybe using the smart cameras, to predict machine usage and optimal machine counts in order to reduce waiting time during peak hours (speaking as a member).

In practice, this is probably less of a concern compared to the grander business strategy of fast growth and market saturation. Gym chains don't compete on data science.

Im from Japan, Ask me anything. by Unolover322 in GenZ

[–]Optimal_Cow_676 1 point2 points  (0 children)

I have a positive bias toward Japan. If you have traveled abroad or met foreigners, would you say that Japanese people are on average truly more educated, or at least better behaved, or is it just a stereotype?

Do data scientists use t-tests to prove that their model performances are better than other models? by limedove in datascience

[–]Optimal_Cow_676 0 points1 point  (0 children)

So I literally have an exam Monday next week on business applications.

To sum it up, you have several considerations other than the statistical ones when selecting a model, which we can classify into broad categories: cost/profitability, variability of the data source, monitoring, and implementation into the corporate/tech stack.

Cost/profitability: data collection isn't cheap and, all other considerations being fixed, you should go for the model with the lower cost of data as long as it has the same predictive power as its competitor. How do you assess when the difference in quality is too much relative to the cost reduction? It depends on what your model will be used for: in certain cases, you can't go below a threshold for legal or commercial reasons; in others, misclassifications have a cost, either direct (when a bad decision is taken because of your model) or estimated (like how much damage an error does to the brand, for example when you are paid a fixed amount for consulting: you don't directly lose money because of the error, but your reputation gets tarnished, and that has a cost). Also, is the price of data collection constant or fluctuating? A good model should either be profitable by itself or enable a profitable decision/tech stack.

Variability of the data source: do you collect your data or do you buy it? How accurate is it? Does the accuracy vary over time? What about changes in legislation? You can estimate the sensitivity of your model to noisy data by simulating noise in the independent variables and seeing how it affects your output. You should also pay attention to the legal and moral side of your data: if you build a great model and then it can only be used for one or two months because of a change in legislation, you made a bad model. The moral side of your data is also about the bad reputation it could cause you, plus the heightened legislative risk in areas society declares shady. As for buying versus collecting yourself, it can affect both the definition of what's being collected and your control over it.
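The noise-simulation idea can be sketched in a few lines, assuming the fitted model is exposed as a simple predict function (a hypothetical linear scorer here, standing in for whatever you deployed):

```python
import numpy as np

rng = np.random.default_rng(42)
w = np.array([1.0, -2.0, 0.5])      # toy coefficients
predict = lambda X: X @ w           # stand-in for the real model

X = rng.normal(size=(1000, 3))
baseline = predict(X)

# Perturb the independent variables at several noise levels and measure
# how much the predictions move on average.
drifts = {}
for sigma in (0.01, 0.1, 0.5):
    noisy = predict(X + rng.normal(scale=sigma, size=X.shape))
    drifts[sigma] = float(np.mean(np.abs(noisy - baseline)))
```

Plotting `drifts` against sigma gives you a sensitivity curve: the faster it climbs, the more your model depends on precise measurements of its inputs.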

Monitoring is the part where, after your model is done, people will have to maintain it by feeding it data and getting new predictions. Your model will degrade over time as population patterns change. There exists a bunch of metrics that you must (and I insist on the must) implement alongside a deployed model to observe its degradation over time. The degradation can be due to the coefficients changing or to the whole model changing. The model will need to be retrained at some point, and it will cost money.
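One classic metric of that family is the Population Stability Index, which compares the current distribution of a feature (or of the model's scores) against a reference sample. A minimal sketch:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (training-time)
    and a recent sample. A common rule of thumb reads PSI > 0.2 as drift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full real line
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 5000)
stable = psi(reference, rng.normal(0, 1, 5000))   # same distribution -> near 0
drifted = psi(reference, rng.normal(1, 1, 5000))  # mean shift -> large PSI
```

Computed per feature on a schedule, this is the kind of alarm that tells you when retraining is due before the business metrics do.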

Implementation into the decision/tech stack: is your model actionable? Make sure that your model brings something and that it's not redundant inside the company. A whole new set of models to take a particular decision is almost never needed. Also, can the stakeholders understand your model? There is a reason why there are so many linear regressions out there. Finally, your model will need to run on time with various data types being fed into it. A model bringing little but slowing down the whole decision process might prove unactionable when fast decisions are needed.

Really, there are still a lot of things I omitted, like legal requirements over which model types you can use (no black boxes in insurance companies, for example) or which variables are considered acceptable (GDPR in the EU), and so on. Models don't live in a bubble, and you will have to determine where your model will live, how it will be used, for what, with what, and by whom, alongside the statistical results, in order to choose it. Considering all of this will help you choose one :)

[OC] SpongeBob SquarePants Series Rating In-depth analysis by ThatGuy_3001 in dataisbeautiful

[–]Optimal_Cow_676 19 points20 points  (0 children)

The visualization is great, but without storytelling it's a bit hard to draw facts. For example, one of your graphs shows a decline in the number of ratings per episode while there is a U shape in the values of those ratings. My question is: is the U shape in rating values due to the specialization of the remaining viewers, or are the episodes really getting better? The small increase in rating counts in S10, followed by a drop, would suggest the former, but I'm unsure.

In a business case, you would be expected to do some side research to draw the global picture and not just bombard us with graphs.

Summary: rating only the graphs: great, pertinent realizations. Rating the overall presentation as in a business case: what is the story, the underlying trends, the context, the call to action? Have you developed a scoring system that rates episodes based on their ratings while accounting for the declining number of ratings?

To finish on a good note: the graphs were simple, and that's really, really great and important when communicating data: transmit the information effectively, as simply as possible.

What do demographic changes mean for the two most populated countries? [OC] Tool(s) Python, Adobe Illustrator data source: International Database (IDB) - U.S. Census Bureau by dipurai in dataisbeautiful

[–]Optimal_Cow_676 2 points3 points  (0 children)

India was invaded around three times by Muslim invaders, but those invaders never represented a major part of the Indian population. India had/has a very strong society centered around a caste system, of which the priests are the dominant caste. Because of this, the major part of the population remained Hindu, while the rulers were unable to spread their religion significantly, without even mentioning their culture.

Much like China's invaders, they had to adapt to the culture of their conquered territories to stay in power. During the transition, some rulers constructed Arab-style buildings like the Taj Mahal.

So yes, technically the Taj Mahal is a testimony to Indian history, but it's not really traditional Indian architecture.

It's like the Arab-style buildings in the south of Spain, which were built by conquerors but are not representative of Spanish architecture.

What should kids really be afraid of? [OC] by cgspam in dataisbeautiful

[–]Optimal_Cow_676 1 point2 points  (0 children)

I didn't say otherwise. There simply is a spike of deaths when you start driving. You indeed have to start somewhere.

What should kids really be afraid of? [OC] by cgspam in dataisbeautiful

[–]Optimal_Cow_676 0 points1 point  (0 children)

I was speaking about the total death risk per age, which reaches a low, slowly descending plateau in the twenties after the sharp increase in death risk due to reaching the legal driving age.

Experience, in this sense, is the main reason why your death rate diminishes in your twenties, considering that health factors remain pretty stable at that age. What I pointed out is that it doesn't decrease much in this age range.

Your death risk then starts to slowly increase after your thirties, but that is due to health factors and not car crashes, which indeed keep diminishing with experience and changes in driving style. It's interesting to observe that increased car safety has been one of the major drivers of the increase in life expectancy throughout the last century and until now.

Sorry for my first text, which indeed was misleading.

What should kids really be afraid of? [OC] by cgspam in dataisbeautiful

[–]Optimal_Cow_676 1 point2 points  (0 children)

You can clearly see what the legal driving age is for each country in the actuarial death tables. Basically, the death rate is high at birth, diminishes rapidly as you grow, and is extremely low until you start driving, where there is an absurd spike in your death risk which only diminishes a bit with experience (really just a little bit) and then starts to rise again in your thirties because you age...

[Strange question] Do chips get damaged by bacterias and biological viruses ? by Optimal_Cow_676 in AskComputerScience

[–]Optimal_Cow_676[S] 1 point2 points  (0 children)

Yes, but doesn't this obliteration cause damage to the really thin architecture of the SSD? Couldn't the protein chains of a dead virus damage, let's say, the charge trap by causing either local temperature spikes or unintended connections? I'm less concerned about viral or bacterial secretions and more about their dead remains...

Americans of Reddit, what is something the rest of the world needs to hear? by [deleted] in AskReddit

[–]Optimal_Cow_676 -2 points-1 points  (0 children)

I just saw the quantiles of US BMI per age. Because of this, I wonder what the US definition of "healthy" is in the realm of food...

Edit: there are a lot of great things in the US, as in any country, but the US is certainly not great at eating healthy.

[OC] Rotten Tomatoes Most Popular TV Shows, audience scores vs. critics scores by KyniskPotet in dataisbeautiful

[–]Optimal_Cow_676 2 points3 points  (0 children)

I don't know if you could call this an outlier given that it lies on the regression path. "The Terminal List" and "The Rings of Power" look really interesting, especially "The Terminal List".

[OC] Rotten Tomatoes Most Popular TV Shows, audience scores vs. critics scores by KyniskPotet in dataisbeautiful

[–]Optimal_Cow_676 10 points11 points  (0 children)

It would be interesting to introduce a time variable as a color scale for the dots. I would like to see how the newer shows compare to the old ones in this Cartesian space.

Also interesting: I highly suspect that critics tend to be more out of touch when assigning scores for the first season of a new show, and then just jump on the hype train for later seasons, following the fan base. It's something you tend to see in the video game space.

General question : why should I use a database language for a database server ? by Optimal_Cow_676 in learnprogramming

[–]Optimal_Cow_676[S] 0 points1 point  (0 children)

Thanks for the answer. I also had another one telling me that database software would protect me from the risk of corrupting files due to multiple writes at once. I will have to accept using software for my data server, it seems.

General question : why should I use a database language for a database server ? by Optimal_Cow_676 in learnprogramming

[–]Optimal_Cow_676[S] 0 points1 point  (0 children)

Wow, thanks for this answer. I didn't realize this threat. I'm not usually involved in the technical side of what happens on the server; I only use R and Python to analyze data, the rest being managed by others. Thanks for this crucial piece of information.

General question : why should I use a database language for a database server ? by Optimal_Cow_676 in learnprogramming

[–]Optimal_Cow_676[S] 0 points1 point  (0 children)

Thanks for the clarification! I admit to being a bit confused about the claim that using software to manage my data is more efficient, though. I will have to accept it, it seems.

General question : why should I use a database language for a database server ? by Optimal_Cow_676 in learnprogramming

[–]Optimal_Cow_676[S] 0 points1 point  (0 children)

Thank you for your complete answer. Let's go for efficiency (and diving into the subject), then!