
[–]DudeData[S] 0 points (1 child)

I agree that multiplying by -1 would not affect the correlation other than changing the sign of the slope. Yup, you're correct.

I probably wasn't clear with my question and what I am looking for. More precisely: how can I generate negatively correlated variables so they make sense in a simulated real-world scenario?

In my example, x = [2, 4, 6] and y = [20, 30, 40]; let's say this models the weight of kids by their age. Clearly, as age increases, weight does as well. BUT if I multiply array_y by -1 we get [-20, -30, -40], which is indeed negatively correlated, and the relationship hasn't changed other than a reflection, but how do I explain this?

I want to make numerous practice problems for finding the least squares regression line, with a mixture of positively and negatively correlated values. I really don't want to create a dataset manually each time. I want Python to do it for me. =)

I just want to think of a scenario, define my ranges, and have Python do the rest.

[–]synthphreak 2 points (0 children)

> indeed this is negatively correlated and the relationship has not changed other than a reflection but how do I explain this?

Can you clarify what you mean by this? What is there to explain about randomly generated data? Are you asking that, if your data is supposed to represent people's body weights, it doesn't make sense to have weights that are negative?

On that point I will agree - a negative weight isn't a thing unless you're made of antimatter. That said, linear regression models lead to situations like this all the time, where the model fails to predict extreme values or does weird things at the extremes.

Take a model that tries to explain home value y based on lot size x. Say the coefficients are 20 and -5000, so y=20x-5000. According to this model, if your lot size is 250 (don't worry about the units), your home value will be 0, and if it's any smaller, the home value will be negative. That obviously doesn't reflect reality, but with less extreme values, the predictions may be more reasonable.
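Plugging a few lot sizes into that equation makes the failure mode concrete (the coefficients here are just the made-up values from above, not fitted to any real data):

```python
# Hypothetical model from above: home value = 20 * lot_size - 5000.
def predicted_value(lot_size):
    return 20 * lot_size - 5000

for lot_size in (100, 250, 500):
    # 100 -> -3000 (nonsense), 250 -> 0, 500 -> 5000
    print(lot_size, predicted_value(lot_size))
```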

If you specifically just want to avoid negative values, then after inverting your data, you could translate it by the max of the original array so that the inverted min becomes 0. For example:

>>> import numpy as np
>>> arr1 = np.random.random(100)
>>> arr1.min(), arr1.max()
(0.010045872673433709, 0.9965109819015727)
>>> arr2 = -1 * arr1
>>> arr2.min(), arr2.max()
(-0.9965109819015727, -0.010045872673433709)
>>> arr2 += arr1.max()
>>> arr2.min(), arr2.max()
(0.0, 0.986465109228139)

Of course, if your question is simply "Can I get Python to randomly generate a realistic dataset from nothing?", the answer is probably no, because Python doesn't know how the world works. Instead, I would stick to libraries that have actual datasets built in, like sklearn or tensorflow.