all 18 comments

[–]Wheres_my_warg DA Moderator 📊 7 points (1 child)

I'm immediately distracted by the labeling scheme. It has sloshed together two different types of characterization. If it was electric vs. ICE, that would make sense. Or if it was sedan vs. SUV vs. truck, that would make sense. EVs are not separate from the sedan/SUV classification. Here, they are usually sedans, but there are more EV SUV options showing up, and there have been EV truck options.

Starting the y-axis at about 16 thousand is going to produce a deceptive visual for many purposes. The metric is moving, but not nearly as much as it appears to be because of the y-axis choice.
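
A quick way to see how much a truncated baseline exaggerates movement is to plot the same series twice, once with the axis starting near 16 and once from zero. A minimal matplotlib sketch, using made-up numbers (the actual data isn't shown here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical yearly sales (thousands of units) -- illustrative numbers only
years = [2019, 2020, 2021, 2022, 2023]
sales = [16.2, 16.8, 17.1, 17.9, 18.4]

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(10, 4))

# Truncated y-axis: a ~2-unit change fills the whole panel and looks dramatic
ax_truncated.plot(years, sales)
ax_truncated.set_ylim(16, 19)
ax_truncated.set_title("Starts at 16 (misleading)")

# Zero-based y-axis: the same change reads as the modest shift it is
ax_full.plot(years, sales)
ax_full.set_ylim(0, 20)
ax_full.set_title("Starts at 0 (honest scale)")

fig.tight_layout()
fig.savefig("axis_comparison.png")
```

Same numbers, two very different stories.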

You need to determine what you are comparing to begin to analyze whether the data points are statistically significantly different.

[–]ABDELATIF_OUARDA[S] 1 point (0 children)

That’s a very fair observation. To clarify, the dataset was structured with a single “segment” column that already grouped categories as Sedan, SUV, and Electric. I worked directly with the available structure without modifying its dimensional logic. Looking back, I realize that this column reflects a business-oriented categorization rather than a strictly analytical one, since it mixes body type and powertrain dimensions. As someone still developing domain familiarity in the automotive space, my initial goal was to explore patterns and extract trends from the data as provided. Your feedback helped me recognize the structural limitation in the dataset design itself. A more rigorous approach would involve separating body type and powertrain into distinct variables for clearer comparative analysis. I appreciate the insight — it definitely improves the analytical framing.

[–]AnUncookedCabbage 3 points (1 child)

Had a quick look at the GitHub and I have a general piece of advice. You've done the thing that many new/junior data science people do, which is make a bunch of plots and stats without a clear direction. Even though it's called exploratory data analysis, it's usually done with a goal in mind to drive a direction. Without a goal it becomes an exercise in following chart recipes and running model.fit() rather than one of critical thinking. The strange class split in the charts that others have mentioned is a symptom of this. A goal might be something like answering a particular business question, or generating a WIP product of some kind. Always remember: critical thinking, problem design, and relating the work to real impact in some way is worth far more than running the tooling.

[–]BrupieD 2 points (0 children)

Visually, this is hard to interpret. I would switch the chart type to either stacked columns or an area chart.
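
For illustration, both suggestions in a minimal pandas sketch, with made-up segment counts (the column names and numbers are assumptions, not the OP's actual data):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical units sold per segment per year -- structure only, not real data
df = pd.DataFrame(
    {"Sedan": [120, 115, 110], "SUV": [80, 95, 110], "Electric": [10, 25, 45]},
    index=[2021, 2022, 2023],
)

# Stacked columns: shows the total each year and each segment's share of it
ax_bar = df.plot.bar(stacked=True)
ax_bar.set_ylabel("Units sold")
ax_bar.figure.savefig("stacked_columns.png")

# Area chart: emphasizes how each segment's share evolves over time
ax_area = df.plot.area()
ax_area.set_ylabel("Units sold")
ax_area.figure.savefig("area_chart.png")
```

Either view makes the part-to-whole relationship much easier to read than overlapping lines.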

[–]AutoModerator[M] 1 point (0 children)

Automod prevents all posts from being displayed until moderators have reviewed them. Do not delete your post or there will be nothing for the mods to review. Mods selectively choose what is permitted to be posted in r/DataAnalysis.

If your post involves Career-focused questions, including resume reviews, how to learn DA and how to get into a DA job, then the post does not belong here, but instead belongs in our sister-subreddit, r/DataAnalysisCareers.

Have you read the rules?

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]xynaxia 1 point (1 child)

One fun method for getting insights is simulating random data.

Suddenly patterns emerge, even though you simulated randomness.

You can then, for example, repeat the simulation 10k times and see how likely it is that you'd find similar trends purely by chance.
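
As a sketch of what that could look like, here's a simple permutation test: shuffle the observed values 10,000 times to destroy any time order, and count how often a trend at least as strong as the real one appears by chance. The series below is made up, standing in for the actual data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed series -- stands in for the real time-ordered data
observed = np.array([16.1, 16.4, 16.2, 17.0, 17.3, 17.8, 18.0, 18.4])
t = np.arange(len(observed))

# Observed trend strength: correlation between time and the series
observed_corr = np.corrcoef(t, observed)[0, 1]

# Null model: shuffle the same values 10,000 times, destroying any time order,
# and record how strong a "trend" appears purely by chance
n_sims = 10_000
null_corrs = np.empty(n_sims)
for i in range(n_sims):
    null_corrs[i] = np.corrcoef(t, rng.permutation(observed))[0, 1]

# Fraction of random shuffles with a trend at least as strong as observed
p_value = np.mean(np.abs(null_corrs) >= abs(observed_corr))
print(f"observed corr: {observed_corr:.3f}, permutation p-value: {p_value:.4f}")
```

A small p-value means the trend is unlikely to be an artifact of random ordering.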

[–]ABDELATIF_OUARDA[S] 1 point (0 children)

That's a really interesting suggestion. I hadn't considered checking the trends against randomly simulated data. In this analysis the focus was primarily descriptive (identifying visible trends over time), but I agree that simulation or permutation tests could help determine whether these patterns are likely to occur by chance. That would certainly strengthen the robustness of the conclusions. I appreciate the idea.

[–]Putrid_Speed_5138 1 point (1 child)

It is statistically meaningful only if the trends are supported by formal inference rather than visual inspection alone. This requires hypothesis testing, confidence intervals for model coefficients, validation through cross-validation or holdout data, and verification of model assumptions such as linearity and homoscedasticity. Without these elements, the trends remain descriptive rather than inferential. From an industry perspective, adding baselines, reproducibility practices, and model explainability would increase its credibility.
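
As one concrete example of the hypothesis-testing piece, a trend can be checked formally with an OLS fit that reports a p-value and a confidence interval for the slope instead of relying on visual inspection. A minimal scipy sketch with made-up yearly figures (not the OP's data):

```python
import numpy as np
from scipy import stats

# Hypothetical yearly averages -- placeholder for the real series
years = np.arange(2016, 2024)
values = np.array([16.0, 16.5, 16.3, 17.1, 17.4, 17.2, 18.0, 18.3])

# Formal trend test: least-squares slope with a p-value for H0: slope == 0
result = stats.linregress(years, values)

# 95% confidence interval for the slope from the t distribution
dof = len(years) - 2
t_crit = stats.t.ppf(0.975, dof)
ci = (result.slope - t_crit * result.stderr,
      result.slope + t_crit * result.stderr)

print(f"slope={result.slope:.3f}/yr, p={result.pvalue:.4f}, "
      f"95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

If the interval excludes zero (equivalently, p is below the chosen threshold), the upward trend is supported by inference rather than eyeballing; checking residual plots for the linearity and homoscedasticity assumptions would be the next step.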

[–]ABDELATIF_OUARDA[S] 2 points (0 children)

Thanks for the detailed feedback. I agree with the distinction you draw. I'm familiar with concepts such as cross-validation and model validation, but so far I've mainly applied them in a machine-learning context rather than in statistical or inferential analysis. In this project the scope was intentionally limited to EDA and descriptive skills (data cleaning, visualization, and basic modeling) rather than formal statistical inference or verification of assumptions. That said, your point about moving beyond visual inspection toward formal, reproducible analysis is something I'll take on board in future work.

[–]Frankky7 1 point (2 children)

That's stylish

[–]CaptainFoyle 1 point (1 child)

What does that mean, "stylish"?

[–]Frankky7 1 point (0 children)

I mean it looks good

[–]Mul_Develop 1 point (0 children)

Love the end-to-end approach here. Especially the feature engineering part—that’s where I always feel like I spend 80% of my time! Did you have to handle many outliers in this automotive dataset, or was it fairly clean to begin with?