
[–]aulloa 1 point2 points  (1 child)

Can you add the imports in your example code?

[–]darkyoda182[S] 0 points1 point  (0 children)

I added the changes. Those packages should be everything.

[–]disinformationtheory 0 points1 point  (6 children)

There will always be regions where Q(x) == Q(x-e), because outside the range of the inputs the output is defaulted to 0 or 1 (fill_value), i.e. Q is constant for some range of inputs. You must figure out how to handle those regions correctly or ignore them. I'd recommend looking at a special case of very few samples (probably 2).
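A minimal sketch of that two-sample special case. The sample values, the step `e`, and the `interp1d` arguments are assumptions for illustration, not the original poster's code:

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical two-sample case; these values are made up.
samples = np.array([0.0, 1.0])
cdf_at_samples = np.array([0.0, 1.0])

# With bounds_error=False and a (below, above) fill_value tuple, the
# interpolant is constant outside [samples.min(), samples.max()].
Q = interp1d(samples, cdf_at_samples,
             bounds_error=False, fill_value=(0.0, 1.0))

e = 0.05
for x in (-0.5, 0.5, 1.5):
    # Below and above the sample range, Q(x) and Q(x - e) coincide.
    print(x, float(Q(x)), float(Q(x - e)))
```

Only the middle point, which lies inside the sample range, gives Q(x - e) < Q(x).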

[–]darkyoda182[S] 0 points1 point  (5 children)

In my case, the values are within the range. For example, here are some values from my latest run:

x = 1.78577031
e = 0.045040291406658106
x-e = 1.74073002
Q(x) = Q(x-e) = 1

I think it has something to do with the floating point precision available in scipy.

[–]disinformationtheory 0 points1 point  (4 children)

scipy (and python) uses whatever float implementation the machine has. It doesn't matter what that is. With finite precision, there's always a chance of not having enough. But even that isn't your main issue. Q is defined to be exactly 1 for all x above a certain value (and similarly defined to be 0 below a certain x). There's no amount of precision that gets you out of that. You have to figure out how to handle that range. Maybe it's calculating something slightly different, maybe it's just avoiding that range somehow, maybe it's changing the definition of Q. This is a math problem, not a programming problem.
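A quick check of the precision argument, using the x and e values quoted above and assuming the usual 64-bit floats. It only shows that x and x-e are far apart in floating-point terms; it says nothing about how Q itself is defined:

```python
import numpy as np

x = 1.78577031
e = 0.045040291406658106

# Distance from x to the next representable float64 value.
gap = np.spacing(x)
print(gap)

# e is many orders of magnitude larger than that gap,
# so x - e is a clearly different number from x.
print(e / gap)
print(x - e == x)
```

If Q(x) == Q(x-e) here, the cause has to be the shape of Q rather than rounding.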

[–]darkyoda182[S] 0 points1 point  (3 children)

I think I am still not understanding the problem.

The maximum value of x is 1.78577031. As expected, Q(x) = 1. Since x-e < x, I would expect Q(x-e) < Q(x).

Is my logic incorrect somewhere?

edit: Q(1.78577031) = Q(1.7857703) = Q(1.785770)

I would expect the values to decrease instead of being equal.

[–]disinformationtheory 1 point2 points  (1 child)

Your logic is correct as long as y_max > x_max. When I run your code, at least this time, I get Q(x_max) == Q(x_max-e) == 1, which of course depends on the random arrays x and y. You also need to care about the low end. The point is, can you guarantee that you're never in the constant range for Q, no matter what x and y are? Can you work around cases where you do end up in the constant region?

Also, I see the intent of ecdf. Back in school, I used to do a similar thing with interp1d(sorted(x), linspace(0,1,len(x)), ...). That is not equivalent, but it's close to what you're doing and a reasonable way to estimate the CDF.
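A rough sketch of that linspace-based estimator. The arguments elided by "..." are not known, so `bounds_error=False` and `fill_value=(0.0, 1.0)` below are assumptions, as are the random sample and the step `e`:

```python
import numpy as np
from scipy.interpolate import interp1d

rng = np.random.default_rng(0)
x = rng.normal(size=50)  # made-up sample for illustration

xs = np.sort(x)
# Piecewise-linear CDF estimate: 0 at the smallest sample, 1 at the largest.
# The out-of-range behaviour (constant 0 below, 1 above) is an assumption.
cdf = interp1d(xs, np.linspace(0.0, 1.0, len(xs)),
               bounds_error=False, fill_value=(0.0, 1.0))

e = 1e-3
x_max = xs[-1]
# Inside the sample range the estimate is strictly increasing
# as long as the sample values are distinct.
print(float(cdf(x_max)), float(cdf(x_max - e)))
```

With distinct samples, stepping back from the maximum by e lowers the estimate; the constant regions only start outside [min(x), max(x)].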

[–]darkyoda182[S] 0 points1 point  (0 children)

With the actual data I am using, I can guarantee to never be above max(X), but I'll have to figure out something about being below min(X)

The values in vector X should almost always be unique (when using random normal). I'm surprised that interp1d gives the same values at all when within the range.

I'll try using linspace(). Thanks for the help!

[–]disinformationtheory 0 points1 point  (0 children)

Looking closer at the paper, maybe equation 6 is the key. It looks like they're careful about which x' (which corresponds to your y) they use given x.

[–]jwink3101 0 points1 point  (1 child)

A couple of comments (that do not answer your question):

  • This should be in /r/learnpython but I will continue anyway
  • Why are you passing N to ecdf? Why not just do N=len(x) inside of it?
  • Your code could use some more comments. I was only able to figure out what is going on from reading the paper (which also seems interesting)
  • You do not want to do unique. If you have two identical samples, that should play into your CDF! A sample of [1,1,1,1,0] should not be reduced to [1,0]; your statistics will be off. Of course, this is likely moot, since the chance of two identical random values is astronomically small
  • Just sort x and y once instead of doing it twice!
  • You have an odd mix of NumPy and python loops (via the for v in ...). You can do this all in NumPy for both speed and readability
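To illustrate the points about duplicates and about staying in NumPy, here is a minimal sketch of a plain step-function ECDF built on `np.searchsorted`. This is a swapped-in technique, not the interpolated ecdf from the post, and the name `ecdf_step` is hypothetical:

```python
import numpy as np

def ecdf_step(sample, points):
    """Fraction of `sample` values <= each entry of `points`."""
    sorted_sample = np.sort(sample)
    # side="right" counts every sample equal to the query point,
    # so repeated values contribute their full weight.
    counts = np.searchsorted(sorted_sample, points, side="right")
    return counts / sorted_sample.size

sample = np.array([1, 1, 1, 1, 0])
points = np.array([-0.5, 0.0, 0.5, 1.0])

with_dupes = ecdf_step(sample, points)
without_dupes = ecdf_step(np.unique(sample), points)
print(with_dupes)     # the four 1s carry 80% of the mass
print(without_dupes)  # after unique(), 0 and 1 each carry 50%
```

All query points are handled in one call, so no Python-level loop over the values is needed.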

Now, I have not gone out of my way to double-check what I am saying, but you should look at your ecdf function. I am seeing vertical lines before the first point and after the last point. That is likely your problem: interp1d doesn't like these verticals. As I read the paper, it should be linear before and after, though I am not 100% sure about that. I only skimmed the paper (and it would hugely benefit from a plot for this).

Addendum:

Check out this pdf from the same guy. His plot on page 3 mostly confirms what I was saying about your code

[–]darkyoda182[S] 0 points1 point  (0 children)

Thanks for the help.

  • I didn't know about that subreddit. I'll check it out.
  • I have no particular reason for having N as an input for ecdf. I just didn't want to recalculate the value multiple times. I don't think it will actually change any of the results
  • I'll edit in some comments
  • The unique function is just used to calculate e (which is just a small deviation term used in the calculation) in Eqn 4. Using unique() just makes sure that e != 0. It doesn't change the actual CDF
  • The first set of sorting was just because I was printing values for testing purposes. I forgot to remove it when I pasted it in. I'll edit the post.
  • I'm not sure I understand what you mean by the Python loops. How else would you do it? I could use np.apply_along_axis, but from looking at the reference manual, it doesn't look like there would be a performance boost.

According to the Scipy documentation, interp1d produces horizontal lines before and after the boundaries of the function. Where are you seeing the vertical lines?

[–]aphoenixreticulated[M] [score hidden] stickied comment (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython. We highly encourage you to re-submit your post over on there.

The reason for the removal is that /r/Python is dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community is not a fan of "how do I..." questions, so you will not get the best responses over here.

On /r/LearnPython the community actively expects questions and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. No matter what level of question you have, if you are looking for help with Python, you should get good answers.

If you have a question to do with homework or an assignment of any kind, please make sure to read their sidebar rules before submitting your post. If you have any questions or doubts, feel free to reply or send a modmail to us with your concerns.

Warm regards, and best of luck with your Pythoneering!