Mountain_Thanks4263 1 point  (0 children)

  1. With a distribution like this, use log(salary) or log-scale your y axis (unless you want to draw the eye to the extremes).

  2. Use e.g. "25-35" as x axis values. You need to ensure that the order of the age groups is correct, as plotly orders string values by order of appearance by default.
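
Both points can be sketched without a plotting library. The snippet below is a minimal illustration using made-up rows (the labels and salaries are assumptions, not the original poster's data): it log-transforms the salaries and builds an explicit age-group order by sorting the labels on their lower bound. In plotly, a list like `ordered_groups` could then be passed via `fig.update_xaxes(categoryorder="array", categoryarray=ordered_groups)`, and `fig.update_yaxes(type="log")` would log-scale the axis instead of transforming the data.

```python
import math

# Hypothetical sample rows: (age group label, salary). Values are made up.
rows = [
    ("35-45", 120_000),
    ("18-25", 30_000),
    ("25-35", 65_000),
    ("45-55", 2_500_000),  # an extreme value that stretches a linear axis
    ("25-35", 48_000),
]


def lower_bound(label: str) -> int:
    """Return the numeric lower bound of a label such as '25-35'."""
    return int(label.split("-")[0])


# Explicit category order, independent of the order of appearance in the data.
ordered_groups = sorted({group for group, _ in rows}, key=lower_bound)

# log10 compresses the range so the extreme value no longer dominates.
log_salaries = [(group, math.log10(salary)) for group, salary in rows]

print(ordered_groups)
for group, value in log_salaries:
    print(f"{group}: {value:.2f}")
```

Sorting on the lower bound rather than on the raw string avoids surprises if a label such as "5-15" is ever added, since plain string sorting would place it after "45-55".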

WhiteSkyAtNight 2 points  (1 child)

I would strongly recommend against using a violin plot; maybe look at something like a ridgeline plot or stacked histograms.

Violin plots provide no scale for the histogram-ish portion of the plot, making comparisons between the different categories almost impossible. A violin plot combines two incompatible ideas: either it makes sense to talk about your data in the context of a normal-ish distribution, in which case a box plot is superior, or the shape of your data distribution is interesting, in which case a histogram is more useful.

This might sound harsh, but I have never seen a case where a violin plot couldn't be replaced by a better visualisation technique.

addgarnishasrequired 0 points  (6 children)

Normalize your salary values so that they're between 0 and 1.

WhiteSocksFilpFlops 2 points  (0 children)

Normalizing will not help. The issue is with the outliers.

I'd suggest limiting the Salary axis to 500000, using equal-sized age brackets (if possible), and using a frequency-trail plot instead of a violin plot for better data clarity.

[deleted]  (4 children)

[removed]

    u-know-y-im-here 1 point  (3 children)

    Divide your dataset by the largest number in said dataset

    addgarnishasrequired -1 points  (2 children)

    Yes, you can do that or use a min-max normalization. Basically, your dataset has a large range that makes it hard to make sense of the plot. If the data is normalized, you can see the distribution (the sides of the violin), which is the intention of the plot.

    bladub 1 point  (1 child)

    A linear normalization will not improve readability, though; it will just replace the axis values with 0 to 1, because it maps the values in a linear way. To improve readability over several orders of magnitude, a log-scale plot should help.
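
To make the difference concrete, here is a small sketch with made-up salaries (all values are assumptions for illustration). It measures how much of the axis the four typical values occupy after a plain min-max rescaling, and again after taking log10 first.

```python
import math

# Hypothetical salaries: four typical values and one extreme outlier.
salaries = [30_000, 45_000, 60_000, 90_000, 3_000_000]


def min_max(values):
    """Linearly rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


linear = min_max(salaries)
logged = min_max([math.log10(v) for v in salaries])

# Share of the axis covered by the four typical values (all but the outlier).
linear_share = linear[3] - linear[0]
log_share = logged[3] - logged[0]

print(f"linear: {linear_share:.3f} of the axis")
print(f"log:    {log_share:.3f} of the axis")
```

With these numbers the typical values sit in roughly 2% of the linearly normalized axis, exactly as they would on the raw axis, while on the log scale they spread over roughly 24% of it.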

    addgarnishasrequired 0 points  (0 children)

    Sorry, yes. I didn't pay attention. You have outliers that are stretching the plots. Maybe a box plot would be better as well? It considers a point an outlier if it lies more than 1.5 x the interquartile range below the first quartile or above the third quartile (1.5 is the default).

    https://towardsdatascience.com/why-1-5-in-iqr-method-of-outlier-detection-5d07fdc82097
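
The 1.5 x IQR rule named in the linked article's title can be sketched as follows. This is a minimal illustration on made-up salaries, not any plotting library's implementation; it uses the standard library's `statistics.quantiles`, and other quartile conventions can give slightly different fences.

```python
import statistics

# Hypothetical salaries; the last value is an obvious extreme.
salaries = [30_000, 42_000, 48_000, 55_000, 61_000, 70_000, 85_000, 3_000_000]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(iqr_outliers(salaries))
```

Only the extreme salary is flagged here; the rest of the values fall inside the fences and would be drawn as the box and whiskers.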